awswrangler.s3.read_orc_metadata

awswrangler.s3.read_orc_metadata(path: str | list[str], dataset: bool = False, version_id: str | dict[str, str] | None = None, path_suffix: str | None = None, path_ignore_suffix: str | list[str] | None = None, ignore_empty: bool = True, ignore_null: bool = False, dtype: dict[str, str] | None = None, sampling: float = 1.0, use_threads: bool | int = True, boto3_session: Session | None = None, s3_additional_kwargs: dict[str, Any] | None = None) → _ReadTableMetadataReturnValue

Read Apache ORC file(s) metadata from an S3 prefix or list of S3 object paths.

The concept of dataset enables more complex features like partitioning and catalog integration (AWS Glue Catalog).

This function accepts Unix shell-style wildcards in the path argument. * (matches everything), ? (matches any single character), [seq] (matches any character in seq), [!seq] (matches any character not in seq). If you want to use a path which includes Unix shell-style wildcard characters (*, ?, []), you can use glob.escape(path) before passing the argument to this function.
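For instance, glob.escape from the standard library neutralizes literal wildcard characters in a key so they are matched verbatim (a sketch; the bucket and key names are made up):

```python
import glob

# Hypothetical key that literally contains wildcard characters.
path = "s3://bucket/report[2024]/data*.orc"

# glob.escape wraps each special character ('*', '?', '[') in brackets,
# so the path matches only the literal object key.
escaped = glob.escape(path)
print(escaped)  # s3://bucket/report[[]2024]/data[*].orc
```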

Note

If use_threads=True, the number of threads is obtained from os.cpu_count().

Note

This function has arguments which can be configured globally through wr.config or environment variables:

Check out the Global Configurations Tutorial for details.

Note

The following arguments are not supported in distributed mode with engine EngineEnum.RAY:

  • boto3_session

Parameters:
  • path (Union[str, List[str]]) – S3 prefix (accepts Unix shell-style wildcards) (e.g. s3://bucket/prefix) or list of S3 object paths (e.g. [s3://bucket/key0, s3://bucket/key1]).

  • dataset (bool, default False) – If True, read an ORC dataset instead of individual file(s), loading all related partitions as columns.

  • version_id (Union[str, Dict[str, str]], optional) – Version id of the object or mapping of object path to version id (e.g. {'s3://bucket/key0': '121212', 's3://bucket/key1': '343434'}).

  • path_suffix (Union[str, List[str], None]) – Suffix or list of suffixes to be read (e.g. [".gz.orc", ".snappy.orc"]). If None (default), reads all files.

  • path_ignore_suffix (Union[str, List[str], None]) – Suffix or list of suffixes to be ignored (e.g. [".csv", "_SUCCESS"]). If None (default), reads all files.

  • ignore_empty (bool, default True) – Ignore files with 0 bytes.

  • ignore_null (bool, default False) – Ignore columns with null type.

  • dtype (Dict[str, str], optional) – Dictionary of column names and Athena/Glue types to cast. Use when you have partition columns with undetermined data types (e.g. {'col name': 'bigint', 'col2 name': 'int'}).

  • sampling (float) – Ratio of files whose metadata is inspected. Must be 0.0 < sampling <= 1.0. The higher the value, the more accurate the result; the lower, the faster the call.

  • use_threads (bool, int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number is used instead.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if boto3_session is None.

  • s3_additional_kwargs (dict[str, Any], optional) – Forwarded to S3 botocore requests.

Returns:

columns_types: Dictionary with keys as column names and values as data types (e.g. {'col0': 'bigint', 'col1': 'double'}).

partitions_types: Dictionary with keys as partition names and values as data types (e.g. {'col2': 'date'}).

Return type:

Tuple[Dict[str, str], Optional[Dict[str, str]]]
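As a sketch of how the return value might be consumed (the dictionaries below are hypothetical, mirroring the examples on this page rather than a real S3 call), the two mappings can be merged into a single schema string, with partition columns listed last:

```python
# Hypothetical metadata, as if returned by read_orc_metadata.
columns_types = {"col0": "bigint", "col1": "double"}
partitions_types = {"col2": "date"}

# Merge regular and partition columns into one schema; partitions_types
# may be None for non-dataset reads, hence the `or {}` guard.
full_schema = {**columns_types, **(partitions_types or {})}
ddl = ", ".join(f"{name} {dtype}" for name, dtype in full_schema.items())
print(ddl)  # col0 bigint, col1 double, col2 date
```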

Examples

Reading all ORC files (with partitions) metadata under a prefix

>>> import awswrangler as wr
>>> columns_types, partitions_types = wr.s3.read_orc_metadata(path='s3://bucket/prefix/', dataset=True)

Reading all ORC files metadata from a list

>>> import awswrangler as wr
>>> columns_types, partitions_types = wr.s3.read_orc_metadata(path=[
...     's3://bucket/filename0.orc',
...     's3://bucket/filename1.orc',
... ])