awswrangler.s3.select_query

awswrangler.s3.select_query(sql: str, path: str | list[str], input_serialization: str, input_serialization_params: dict[str, bool | str], compression: str | None = None, scan_range_chunk_size: int | None = None, path_suffix: str | list[str] | None = None, path_ignore_suffix: str | list[str] | None = None, ignore_empty: bool = True, use_threads: bool | int = True, last_modified_begin: datetime | None = None, last_modified_end: datetime | None = None, dtype_backend: Literal['numpy_nullable', 'pyarrow'] = 'numpy_nullable', boto3_session: Session | None = None, s3_additional_kwargs: dict[str, Any] | None = None, pyarrow_additional_kwargs: dict[str, Any] | None = None) → DataFrame

Filter contents of Amazon S3 objects based on SQL statement.

Note: Scan ranges are only supported for uncompressed objects: CSV without quoted delimiters and JSON in LINES mode only. If these conditions are not met, the scan cannot be split across threads, which lowers performance.

Note

The following arguments are not supported in distributed mode with engine EngineEnum.RAY:

  • boto3_session

Parameters:
  • sql (str) – SQL statement used to query the object.

  • path (Union[str, List[str]]) – S3 prefix (accepts Unix shell-style wildcards) (e.g. s3://bucket/prefix) or list of S3 objects paths (e.g. [s3://bucket/key0, s3://bucket/key1]).

  • input_serialization (str) – Format of the S3 object queried. Valid values: “CSV”, “JSON”, or “Parquet”. Case sensitive.

  • input_serialization_params (Dict[str, Union[bool, str]]) – Dictionary describing the serialization of the S3 object.

  • compression (str, optional) – Compression type of the S3 object. Valid values: None, “gzip”, or “bzip2”. gzip and bzip2 are only valid for CSV and JSON objects.

  • scan_range_chunk_size (int, optional) – Chunk size, in bytes, used to split the S3 object into scan ranges. Defaults to 1,048,576 (1 MiB).

  • path_suffix (Union[str, List[str], None]) – Suffix or list of suffixes to be read (e.g. [“.csv”]). If None (default), read all files.

  • path_ignore_suffix (Union[str, List[str], None]) – Suffix or list of suffixes of S3 keys to be ignored (e.g. [“_SUCCESS”]). If None (default), read all files.

  • ignore_empty (bool, default True) – Ignore files with 0 bytes.

  • use_threads (Union[bool, int]) – True (default) to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number of threads is used.

  • last_modified_begin (datetime, optional) – Filter S3 objects by last-modified date. The filter is applied only after listing all objects.

  • last_modified_end (datetime, optional) – Filter S3 objects by last-modified date. The filter is applied only after listing all objects.

  • dtype_backend (str, optional) –

    Which dtype backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all columns.

    The dtype backends are still experimental. The “pyarrow” backend is only supported with Pandas 2.0 or above.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if none is provided.

  • s3_additional_kwargs (Dict[str, Any], optional) – Forwarded to botocore requests. Valid values: “SSECustomerAlgorithm”, “SSECustomerKey”, “ExpectedBucketOwner”. e.g. s3_additional_kwargs={'SSECustomerAlgorithm': 'md5'}.

  • pyarrow_additional_kwargs (Dict[str, Any], optional) – Forwarded to the to_pandas method when converting from PyArrow tables to a Pandas DataFrame. Valid values include “split_blocks”, “self_destruct”, “ignore_metadata”. e.g. pyarrow_additional_kwargs={'split_blocks': True}.

Returns:

Pandas DataFrame with the results of the query.

Return type:

pandas.DataFrame

Examples

Reading a gzip-compressed JSON document

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object[*][*]',
...     path='s3://bucket/key.json.gzip',
...     input_serialization='JSON',
...     input_serialization_params={
...         'Type': 'Document',
...     },
...     compression="gzip",
... )
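
Tuning scan ranges and concurrency for an uncompressed JSON Lines object (an illustrative sketch: the path, the 32 MiB chunk size, and the thread count are placeholder values, and scan ranges only take effect under the conditions described in the note above)

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object[*] s',
...     path='s3://bucket/key.jsonl',
...     input_serialization='JSON',
...     input_serialization_params={
...         'Type': 'LINES',
...     },
...     scan_range_chunk_size=33554432,  # 32 MiB scan ranges
...     use_threads=4,  # cap concurrent requests at 4 threads
... )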

Reading multiple CSV objects from a prefix

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object',
...     path='s3://bucket/prefix/',
...     input_serialization='CSV',
...     input_serialization_params={
...         'FileHeaderInfo': 'Use',
...         'RecordDelimiter': '\r\n'
...     },
... )
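
Restricting which objects under the prefix are read (an illustrative sketch: the suffixes and dates are placeholder values)

>>> from datetime import datetime, timezone
>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object',
...     path='s3://bucket/prefix/',
...     input_serialization='CSV',
...     input_serialization_params={
...         'FileHeaderInfo': 'Use',
...     },
...     path_suffix='.csv',  # only read keys ending in .csv
...     path_ignore_suffix='_SUCCESS',  # skip marker files
...     last_modified_begin=datetime(2023, 1, 1, tzinfo=timezone.utc),
...     last_modified_end=datetime(2024, 1, 1, tzinfo=timezone.utc),
... )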

Reading a single column from a Parquet object with a pushdown filter

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT s."id" FROM s3object s WHERE s."id" = 1.0',
...     path='s3://bucket/key.snappy.parquet',
...     input_serialization='Parquet',
...     input_serialization_params={},
... )
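
Reading a Parquet object into a pyarrow-backed DataFrame (an illustrative sketch: requires Pandas 2.0 or above for the “pyarrow” backend)

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object s',
...     path='s3://bucket/key.snappy.parquet',
...     input_serialization='Parquet',
...     input_serialization_params={},
...     dtype_backend='pyarrow',
... )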
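
Using an explicit boto3 session and additional S3 request arguments (an illustrative sketch: the profile name and account ID are placeholder values)

>>> import boto3
>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object',
...     path='s3://bucket/key.csv',
...     input_serialization='CSV',
...     input_serialization_params={
...         'FileHeaderInfo': 'Use',
...     },
...     boto3_session=boto3.Session(profile_name='my-profile'),
...     s3_additional_kwargs={'ExpectedBucketOwner': '111122223333'},
... )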