29 - S3 Select¶
AWS SDK for pandas supports Amazon S3 Select, enabling applications to use SQL statements in order to query and filter the contents of a single S3 object. It works on objects stored in CSV, JSON or Apache Parquet, including compressed and large files of several TBs.
With S3 Select, the query workload is delegated to Amazon S3, leading to lower latency and cost, and to higher performance (up to 400% improvement). This is in comparison with other awswrangler operations such as read_parquet
where the S3 object is downloaded and filtered on the client-side.
This feature has a number of limitations however, and should be used for specific scenarios only:
It operates on a single S3 object
The maximum length of a record in the input or result is 1 MB
The maximum uncompressed row group size is 256 MB (Parquet only)
It can only emit nested data in JSON format
Certain SQL operations are not supported (e.g. ORDER BY)
Read full CSV file¶
[2]:
import awswrangler as wr
df = wr.s3.select_query(
sql="SELECT * FROM s3object",
path="s3://nyc-tlc/csv_backup/fhv_tripdata_2019-09.csv", # 58 MB
input_serialization="CSV",
input_serialization_params={
"FileHeaderInfo": "Use",
"RecordDelimiter": "\r\n",
},
use_threads=True,
)
df.head()
[2]:
dispatching_base_num | pickup_datetime | dropoff_datetime | PULocationID | DOLocationID | SR_Flag | |
---|---|---|---|---|---|---|
0 | B00009 | 2019-09-01 00:35:00 | 2019-09-01 00:59:00 | 264 | 264 | |
1 | B00009 | 2019-09-01 00:48:00 | 2019-09-01 01:09:00 | 264 | 264 | |
2 | B00014 | 2019-09-01 00:16:18 | 2019-09-02 00:35:37 | 264 | 264 | |
3 | B00014 | 2019-09-01 00:55:03 | 2019-09-01 01:09:35 | 264 | 264 | |
4 | B00014 | 2019-09-01 00:13:08 | 2019-09-02 01:12:31 | 264 | 264 |
Filter JSON file¶
[2]:
wr.s3.select_query(
sql="SELECT * FROM s3object[*] s where s.\"family_name\" = \'Biden\'",
path="s3://awsglue-datasets/examples/us-legislators/all/persons.json",
input_serialization="JSON",
input_serialization_params={
"Type": "Document",
},
)
[2]:
family_name | contact_details | name | links | gender | image | identifiers | other_names | sort_name | images | given_name | birth_date | id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Biden | [{'type': 'twitter', 'value': 'joebiden'}] | Joseph Biden, Jr. | [{'note': 'Wikipedia (ace)', 'url': 'https://a... | male | https://theunitedstates.io/images/congress/ori... | [{'scheme': 'bioguide', 'identifier': 'B000444... | [{'note': 'alternate', 'name': 'Joe Biden'}, {... | Biden, Joseph | [{'url': 'https://theunitedstates.io/images/co... | Joseph | 1942-11-20 | 64239edf-8e06-4d2d-acc0-33d96bc79774 |
Read Snappy compressed Parquet¶
[3]:
df = wr.s3.select_query(
sql="SELECT * FROM s3object s where s.\"star_rating\" >= 5",
path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
input_serialization="Parquet",
input_serialization_params={},
use_threads=True,
)
df.loc[:, df.columns != "product_title"].head()
[3]:
marketplace | customer_id | review_id | product_id | product_parent | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | US | 52670295 | RGPOFKORD8RTU | B0002CZPPG | 867256265 | 5 | 105 | 107 | N | N | Excellent Gift Idea | I wonder if the other reviewer actually read t... | 2005-02-08 | 2005 |
1 | US | 29964102 | R2U8X8V5KPB4J3 | B00H5BMF00 | 373287760 | 5 | 0 | 0 | N | Y | Five Stars | convenience is the name of the game. | 2015-05-03 | 2015 |
2 | US | 25173351 | R15XV3LXUMLTXL | B00PG40CO4 | 137115061 | 5 | 0 | 0 | N | Y | Birthday Gift | This gift card was handled with accuracy in de... | 2015-05-03 | 2015 |
3 | US | 12516181 | R3G6G7H8TX4H0T | B0002CZPPG | 867256265 | 5 | 6 | 6 | N | N | Love 'em. | Gotta love these iTunes Prepaid Card thingys. ... | 2005-10-15 | 2005 |
4 | US | 38355314 | R2NJ7WNBU16YTQ | B00B2TFSO6 | 89375983 | 5 | 0 | 0 | N | Y | Five Stars | perfect | 2015-05-03 | 2015 |