awswrangler.s3.select_query

awswrangler.s3.select_query(sql: str, path: str, input_serialization: str, input_serialization_params: Dict[str, Union[bool, str]], compression: Optional[str] = None, use_threads: Union[bool, int] = False, boto3_session: Optional[Session] = None, s3_additional_kwargs: Optional[Dict[str, Any]] = None) → DataFrame

Filter the contents of an Amazon S3 object based on an SQL statement.

Note: Scan ranges are only supported for uncompressed CSV (without quoted delimiters) and JSON (in LINES mode only) objects. When these conditions are not met, the scan cannot be split across threads, leading to lower performance.

Parameters
  • sql (str) – SQL statement used to query the object.

  • path (str) – S3 path to the object (e.g. s3://bucket/key).

  • input_serialization (str) – Format of the S3 object queried. Valid values: “CSV”, “JSON”, or “Parquet”. Case sensitive.

  • input_serialization_params (Dict[str, Union[bool, str]]) – Dictionary describing the serialization of the S3 object.

  • compression (Optional[str]) – Compression type of the S3 object. Valid values: None, “gzip”, or “bzip2”. gzip and bzip2 are only valid for CSV and JSON objects.

  • use_threads (Union[bool, int]) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number of threads is used.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if none is provided.

  • s3_additional_kwargs (Optional[Dict[str, Any]]) – Forwarded to botocore requests. Valid values: “SSECustomerAlgorithm”, “SSECustomerKey”, “ExpectedBucketOwner”. e.g. s3_additional_kwargs={'SSECustomerAlgorithm': 'AES256'}

Returns

Pandas DataFrame with the query results.

Return type

pandas.DataFrame

Examples

Reading a gzip-compressed JSON document

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object[*][*]',
...     path='s3://bucket/key.json.gzip',
...     input_serialization='JSON',
...     input_serialization_params={
...         'Type': 'Document',
...     },
...     compression="gzip",
... )
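
Reading a newline-delimited JSON (JSON Lines) object (a minimal sketch; 'Lines' is the S3 Select JSON input Type for newline-delimited records, and the path below is a placeholder)

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object s',
...     path='s3://bucket/key.jsonl',
...     input_serialization='JSON',
...     input_serialization_params={
...         'Type': 'Lines',
...     },
... )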

Reading an entire CSV object using threads

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object',
...     path='s3://bucket/key.csv',
...     input_serialization='CSV',
...     input_serialization_params={
...         'FileHeaderInfo': 'Use',
...         'RecordDelimiter': '\r\n'
...     },
...     use_threads=True,
... )
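
Capping concurrency at a fixed number of threads (a sketch; per the use_threads parameter above, passing an integer uses that many threads instead of os.cpu_count())

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object',
...     path='s3://bucket/key.csv',
...     input_serialization='CSV',
...     input_serialization_params={'FileHeaderInfo': 'Use'},
...     use_threads=4,
... )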

Reading a single column from a Parquet object with a pushdown filter

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT s."id" FROM s3object s WHERE s."id" = 1.0',
...     path='s3://bucket/key.snappy.parquet',
...     input_serialization='Parquet',
... )
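
Reading an SSE-C encrypted object by forwarding the customer key via s3_additional_kwargs (a sketch; the 32-character key below is a placeholder and must match the key the object was uploaded with)

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object',
...     path='s3://bucket/key.csv',
...     input_serialization='CSV',
...     input_serialization_params={'FileHeaderInfo': 'Use'},
...     s3_additional_kwargs={
...         'SSECustomerAlgorithm': 'AES256',
...         'SSECustomerKey': '32-bytes-long-placeholder-key-00',
...     },
... )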