AWS SDK for pandas

29 - S3 Select

AWS SDK for pandas supports Amazon S3 Select, enabling applications to use SQL statements in order to query and filter the contents of a single S3 object. It works on objects stored in CSV, JSON or Apache Parquet, including compressed and large files of several TBs.

With S3 Select, the query workload is delegated to Amazon S3, leading to lower latency and cost, and to higher performance (up to 400% improvement). This is in comparison with other awswrangler operations such as read_parquet where the S3 object is downloaded and filtered on the client-side.

This feature has a number of limitations however:

  • The maximum length of a record in the input or result is 1 MB

  • The maximum uncompressed row group size is 256 MB (Parquet only)

  • It can only emit nested data in JSON format

  • Certain SQL operations are not supported (e.g. ORDER BY)

Read multiple Parquet files from an S3 prefix

[1]:
import awswrangler as wr

df = wr.s3.select_query(
    sql='SELECT * FROM s3object s where s."trip_distance" > 30',
    path="s3://ursa-labs-taxi-data/2019/01/",
    input_serialization="Parquet",
    input_serialization_params={},
)

df.head()
[1]:
vendor_id pickup_at dropoff_at passenger_count trip_distance rate_code_id store_and_fwd_flag pickup_location_id dropoff_location_id payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
0 2 2019-01-01T00:48:10.000Z 2019-01-01T01:36:58.000Z 1 31.570000 1 N 138 138 2 82.5 0.5 0.5 0.00 0.00 0.3 83.800003 NaN
1 2 2019-01-01T00:38:36.000Z 2019-01-01T01:21:33.000Z 2 33.189999 5 N 107 265 1 121.0 0.0 0.0 0.08 10.50 0.3 131.880005 NaN
2 2 2019-01-01T00:10:43.000Z 2019-01-01T01:23:59.000Z 1 33.060001 1 N 243 42 2 92.0 0.5 0.5 0.00 5.76 0.3 99.059998 NaN
3 1 2019-01-01T00:13:17.000Z 2019-01-01T01:06:13.000Z 1 44.099998 5 N 132 265 2 150.0 0.0 0.0 0.00 0.00 0.3 150.300003 NaN
4 2 2019-01-01T00:29:11.000Z 2019-01-01T01:29:05.000Z 2 31.100000 1 N 169 201 1 85.5 0.5 0.5 0.00 7.92 0.3 94.720001 NaN

Read full CSV file

[5]:
df = wr.s3.select_query(
    sql="SELECT * FROM s3object",
    path="s3://humor-detection-pds/Humorous.csv",
    input_serialization="CSV",
    input_serialization_params={
        "FileHeaderInfo": "Use",
        "RecordDelimiter": "\r\n",
    },
    scan_range_chunk_size=1024 * 1024 * 32,  # override range of bytes to query, by default 1Mb
    use_threads=True,
)
df.head()
[5]:
question product_description image_url label
0 Will the volca sample get me a girlfriend? Korg Amplifier Part VOLCASAMPLE http://ecx.images-amazon.com/images/I/81I1XZea... 1
1 Can u communicate with spirits even on Saturday? Winning Moves Games Classic Ouija http://ecx.images-amazon.com/images/I/81kcYEG5... 1
2 I won't get hunted right? Winning Moves Games Classic Ouija http://ecx.images-amazon.com/images/I/81kcYEG5... 1
3 I have a few questions.. Can you get possessed... Winning Moves Games Classic Ouija http://ecx.images-amazon.com/images/I/81kcYEG5... 1
4 Has anyone asked where the treasure is? What w... Winning Moves Games Classic Ouija http://ecx.images-amazon.com/images/I/81kcYEG5... 1

Filter JSON file

[3]:
wr.s3.select_query(
    sql="SELECT * FROM s3object[*] s where s.\"family_name\" = 'Biden'",
    path="s3://awsglue-datasets/examples/us-legislators/all/persons.json",
    input_serialization="JSON",
    input_serialization_params={
        "Type": "Document",
    },
)
[3]:
family_name contact_details name links gender image identifiers other_names sort_name images given_name birth_date id
0 Biden [{'type': 'twitter', 'value': 'joebiden'}] Joseph Biden, Jr. [{'note': 'Wikipedia (ace)', 'url': 'https://a... male https://theunitedstates.io/images/congress/ori... [{'identifier': 'B000444', 'scheme': 'bioguide... [{'lang': None, 'name': 'Joe Biden', 'note': '... Biden, Joseph [{'url': 'https://theunitedstates.io/images/co... Joseph 1942-11-20 64239edf-8e06-4d2d-acc0-33d96bc79774