AWS Data Wrangler

29 - S3 Select

AWS Data Wrangler supports Amazon S3 Select, enabling applications to use SQL statements in order to query and filter the contents of a single S3 object. It works on objects stored in CSV, JSON or Apache Parquet, including compressed and large files of several TBs.

With S3 Select, the query workload is delegated to Amazon S3, leading to lower latency and cost, and to higher performance (up to 400% improvement). This is in comparison with other Wrangler operations such as read_parquet where the S3 object is downloaded and filtered on the client-side.

This feature has a number of limitations however, and should be used for specific scenarios only: * It operates on a single S3 object * The maximum length of a record in the input or result is 1 MB * The maximum uncompressed row group size is 256 MB (Parquet only) * It can only emit nested data in JSON format * Certain SQL operations are not supported (e.g. ORDER BY)

Read full CSV file

[1]:
import awswrangler as wr

df = wr.s3.select_query(
    sql="SELECT * FROM s3object",
    path="s3://nyc-tlc/trip data/fhv_tripdata_2019-09.csv",  # 58 MB
    input_serialization="CSV",
    input_serialization_params={
        "FileHeaderInfo": "Use",
        "RecordDelimiter": "\r\n",
    },
    use_threads=True,
)
df.head()
[1]:
dispatching_base_num pickup_datetime dropoff_datetime PULocationID DOLocationID SR_Flag
0 B00009 2019-09-01 00:35:00 2019-09-01 00:59:00 264 264
1 B00009 2019-09-01 00:48:00 2019-09-01 01:09:00 264 264
2 B00014 2019-09-01 00:16:18 2019-09-02 00:35:37 264 264
3 B00014 2019-09-01 00:55:03 2019-09-01 01:09:35 264 264
4 B00014 2019-09-01 00:13:08 2019-09-02 01:12:31 264 264

Filter JSON file

[2]:
wr.s3.select_query(
    sql="SELECT * FROM s3object[*] s where s.\"family_name\" = \'Biden\'",
    path="s3://awsglue-datasets/examples/us-legislators/all/persons.json",
    input_serialization="JSON",
    input_serialization_params={
        "Type": "Document",
    },
)
[2]:
family_name contact_details name links gender image identifiers other_names sort_name images given_name birth_date id
0 Biden [{'type': 'twitter', 'value': 'joebiden'}] Joseph Biden, Jr. [{'note': 'Wikipedia (ace)', 'url': 'https://a... male https://theunitedstates.io/images/congress/ori... [{'scheme': 'bioguide', 'identifier': 'B000444... [{'note': 'alternate', 'name': 'Joe Biden'}, {... Biden, Joseph [{'url': 'https://theunitedstates.io/images/co... Joseph 1942-11-20 64239edf-8e06-4d2d-acc0-33d96bc79774

Read Snappy compressed Parquet

[3]:
df = wr.s3.select_query(
        sql="SELECT * FROM s3object s where s.\"star_rating\" >= 5",
        path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
        input_serialization="Parquet",
        input_serialization_params={},
        use_threads=True,
)
df.loc[:, df.columns != "product_title"].head()
[3]:
marketplace customer_id review_id product_id product_parent star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date year
0 US 52670295 RGPOFKORD8RTU B0002CZPPG 867256265 5 105 107 N N Excellent Gift Idea I wonder if the other reviewer actually read t... 2005-02-08 2005
1 US 29964102 R2U8X8V5KPB4J3 B00H5BMF00 373287760 5 0 0 N Y Five Stars convenience is the name of the game. 2015-05-03 2015
2 US 25173351 R15XV3LXUMLTXL B00PG40CO4 137115061 5 0 0 N Y Birthday Gift This gift card was handled with accuracy in de... 2015-05-03 2015
3 US 12516181 R3G6G7H8TX4H0T B0002CZPPG 867256265 5 6 6 N N Love 'em. Gotta love these iTunes Prepaid Card thingys. ... 2005-10-15 2005
4 US 38355314 R2NJ7WNBU16YTQ B00B2TFSO6 89375983 5 0 0 N Y Five Stars perfect 2015-05-03 2015