awswrangler.s3.read_orc_table

awswrangler.s3.read_orc_table(table: str, database: str, filename_suffix: str | list[str] | None = None, filename_ignore_suffix: str | list[str] | None = None, catalog_id: str | None = None, partition_filter: Callable[[dict[str, str]], bool] | None = None, columns: list[str] | None = None, validate_schema: bool = True, dtype_backend: Literal['numpy_nullable', 'pyarrow'] = 'numpy_nullable', use_threads: bool | int = True, ray_args: RaySettings | None = None, boto3_session: Session | None = None, s3_additional_kwargs: dict[str, Any] | None = None, pyarrow_additional_kwargs: dict[str, Any] | None = None) → DataFrame

Read Apache ORC table registered in the AWS Glue Catalog.

Note

If use_threads=True, the number of threads is obtained from os.cpu_count().

Note

This function has arguments which can be configured globally through wr.config or environment variables:

  • catalog_id

  • database

  • dtype_backend

Check out the Global Configurations Tutorial for details.

Note

The following arguments are not supported in distributed mode with engine EngineEnum.RAY:

  • boto3_session

  • s3_additional_kwargs

  • dtype_backend

Parameters:
  • table (str) – AWS Glue Catalog table name.

  • database (str) – AWS Glue Catalog database name.

  • filename_suffix (Union[str, List[str], None]) – Suffix or list of suffixes to be read (e.g. [".gz.orc", ".snappy.orc"]). If None (default), read all files.

  • filename_ignore_suffix (Union[str, List[str], None]) – Suffix or list of suffixes of S3 keys to be ignored (e.g. [".csv", "_SUCCESS"]). If None (default), read all files.

  • catalog_id (str, optional) – The ID of the Data Catalog from which to retrieve Databases. If none is provided, the AWS account ID is used by default.

  • partition_filter (Optional[Callable[[Dict[str, str]], bool]]) – Callback function to filter PARTITION columns (PUSH-DOWN filter). The function receives a single argument (Dict[str, str]) whose keys are partition names and whose values are partition values (always strings), and must return a bool: True to read the partition, False to ignore it. E.g. lambda x: x["year"] == "2020" and x["month"] == "1". See https://aws-sdk-pandas.readthedocs.io/en/3.7.3/tutorials/023%20-%20Flexible%20Partitions%20Filter.html

  • columns (List[str], optional) – List of columns to read from the file(s).

  • validate_schema (bool, default True) – Check that the schema is consistent across individual files.

  • dtype_backend (str, optional) –

    Which dtype_backend to use: "numpy_nullable" uses nullable dtypes for all dtypes that have a nullable implementation, while "pyarrow" uses pyarrow-backed dtypes for all dtypes.

    The dtype_backends are still experimental. The "pyarrow" backend is only supported with pandas 2.0 or above.

  • use_threads (Union[bool, int], default True) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number of threads is used.

  • ray_args (RaySettings, optional) – Parameters for the Ray and Modin settings. Only used when distributed computing is enabled (Ray and Modin installed).

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if None is received.

  • s3_additional_kwargs (dict[str, Any], optional) – Forward to S3 botocore requests.

  • pyarrow_additional_kwargs (Dict[str, Any], optional) – Forwarded to the to_pandas method when converting from PyArrow tables to a pandas DataFrame. Valid values include "split_blocks", "self_destruct", "ignore_metadata". E.g. pyarrow_additional_kwargs={'split_blocks': True}.

Returns:

Pandas DataFrame.

Return type:

pandas.DataFrame

Examples

Reading ORC Table

>>> import awswrangler as wr
>>> df = wr.s3.read_orc_table(database='...', table='...')
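
Reading ORC Table with selected columns and the pyarrow dtype backend

A minimal sketch combining the columns and dtype_backend parameters described above; the column names are hypothetical placeholders, and dtype_backend='pyarrow' requires pandas 2.0 or above.

>>> import awswrangler as wr
>>> df = wr.s3.read_orc_table(
...     database='...',
...     table='...',
...     columns=['col_a', 'col_b'],  # hypothetical column names
...     dtype_backend='pyarrow',
... )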

Reading ORC Table with PUSH-DOWN filter over partitions

>>> import awswrangler as wr
>>> my_filter = lambda x: x["city"].startswith("new")
>>> df = wr.s3.read_orc_table(database='...', table='...', partition_filter=my_filter)
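
Reading ORC Table with pyarrow_additional_kwargs

A minimal sketch forwarding options to the underlying to_pandas conversion; split_blocks and self_destruct are among the valid values listed above and can reduce peak memory usage during the conversion.

>>> import awswrangler as wr
>>> df = wr.s3.read_orc_table(
...     database='...',
...     table='...',
...     pyarrow_additional_kwargs={'split_blocks': True, 'self_destruct': True},
... )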