awswrangler.s3.read_parquet_table

awswrangler.s3.read_parquet_table(table: str, database: str, filename_suffix: Optional[Union[str, List[str]]] = None, filename_ignore_suffix: Optional[Union[str, List[str]]] = None, catalog_id: Optional[str] = None, partition_filter: Optional[Callable[[Dict[str, str]], bool]] = None, columns: Optional[List[str]] = None, validate_schema: bool = True, categories: Optional[List[str]] = None, safe: bool = True, map_types: bool = True, chunked: Union[bool, int] = False, use_threads: Union[bool, int] = True, boto3_session: Optional[Session] = None, s3_additional_kwargs: Optional[Dict[str, Any]] = None) Any

Read an Apache Parquet table registered in the AWS Glue Catalog.

Note

Batching (chunked argument) (Memory Friendly):

Will enable the function to return an Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies in Wrangler:

  • If chunked=True, a new DataFrame will be returned for each file in your path/dataset.

  • If chunked=INTEGER, Wrangler will paginate through the files, slicing and concatenating them to return DataFrames with the number of rows equal to the received INTEGER.

P.S. chunked=True is faster and uses less memory, while chunked=INTEGER is more precise in the number of rows for each DataFrame.

Note

If use_threads=True, the number of threads to be spawned is obtained from os.cpu_count().
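
For example, to cap concurrency at a fixed number of threads instead of using every available core (the database and table names below are placeholders):

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(database='...', table='...', use_threads=4)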

Note

This function has arguments which can be configured globally through wr.config or environment variables:

  • catalog_id

  • database

Check out the Global Configurations Tutorial for details.
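
For illustration, a sketch of setting database globally, based on the Global Configurations Tutorial (the database name is a placeholder, and the call below assumes the configured value is injected automatically as described in that tutorial):

>>> import awswrangler as wr
>>> wr.config.database = "my_database"  # or set the WR_DATABASE environment variable
>>> df = wr.s3.read_parquet_table(table='...')  # database taken from the global config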

Parameters
  • table (str) – AWS Glue Catalog table name.

  • database (str) – AWS Glue Catalog database name.

  • filename_suffix (Union[str, List[str], None]) – Suffix or list of suffixes to be read (e.g. [“.gz.parquet”, “.snappy.parquet”]). If None (default), will try to read all files.

  • filename_ignore_suffix (Union[str, List[str], None]) – Suffix or list of suffixes of S3 keys to be ignored (e.g. [“.csv”, “_SUCCESS”]). If None (default), will try to read all files.

  • catalog_id (str, optional) – The ID of the Data Catalog from which to retrieve Databases. If none is provided, the AWS account ID is used by default.

  • partition_filter (Optional[Callable[[Dict[str, str]], bool]]) – Callback function filter to apply on PARTITION columns (PUSH-DOWN filter). This function MUST receive a single argument (Dict[str, str]) where keys are partition names and values are partition values. Partition values will always be strings extracted from S3. This function MUST return a bool, True to read the partition or False to ignore it. E.g. lambda x: True if x["year"] == "2020" and x["month"] == "1" else False. See https://aws-data-wrangler.readthedocs.io/en/2.15.0/tutorials/023%20-%20Flexible%20Partitions%20Filter.html

  • columns (List[str], optional) – Names of columns to read from the file(s).

  • validate_schema (bool) – Check that individual file schemas are all the same / compatible. Schemas within a folder prefix should all be the same. Disable this check if the schemas differ and you still want to read the files.

  • categories (Optional[List[str]], optional) – List of column names that should be returned as pandas.Categorical. Recommended for memory restricted environments.

  • safe (bool, default True) – For certain data types, a cast is needed in order to store the data in a pandas DataFrame or Series (e.g. timestamps are always stored as nanoseconds in pandas). This option controls whether it is a safe cast or not.

  • map_types (bool, default True) – True to convert pyarrow DataTypes to pandas ExtensionDtypes. It is used to override the default pandas type for conversion of built-in pyarrow types or in absence of pandas_metadata in the Table schema.

  • chunked (Union[bool, int]) – If True, will break the data into smaller DataFrames, one per file (non-deterministic number of rows). If an INTEGER is passed, will return DataFrames with the number of rows equal to the received INTEGER (the last chunk may be smaller). Otherwise (default False), returns a single DataFrame with the whole data.

  • use_threads (Union[bool, int]) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() will be used as the maximum number of threads. If an integer is provided, the specified number is used.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receives None.

  • s3_additional_kwargs (Optional[Dict[str, Any]]) – Forwarded to botocore requests; only “SSECustomerAlgorithm” and “SSECustomerKey” arguments will be considered.

Returns

Pandas DataFrame or a Generator in case of chunked=True or chunked=INTEGER.

Return type

Union[pandas.DataFrame, Generator[pandas.DataFrame, None, None]]

Examples

Reading Parquet Table

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(database='...', table='...')

Reading Parquet Table encrypted

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(
...     database='...',
...     table='...',
...     s3_additional_kwargs={
...         'ServerSideEncryption': 'aws:kms',
...         'SSEKMSKeyId': 'YOUR_KMS_KEY_ARN'
...     }
... )

Reading Parquet Table in chunks (Chunk by file)

>>> import awswrangler as wr
>>> dfs = wr.s3.read_parquet_table(database='...', table='...', chunked=True)
>>> for df in dfs:
...     print(df)  # Smaller Pandas DataFrame
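
Reading Parquet Table in chunks (Chunk by 1,000,000 rows)

A minimal sketch of the row-count strategy described in the batching note above; the chunk size, database and table names are placeholders:

>>> import awswrangler as wr
>>> dfs = wr.s3.read_parquet_table(database='...', table='...', chunked=1_000_000)
>>> for df in dfs:
...     print(df)  # DataFrame with up to 1,000,000 rows (the last chunk may be smaller)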

Reading Parquet Table with PUSH-DOWN filter over partitions

>>> import awswrangler as wr
>>> my_filter = lambda x: True if x["city"].startswith("new") else False
>>> df = wr.s3.read_parquet_table(database='...', table='...', partition_filter=my_filter)
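
Reading Parquet Table with a subset of columns as Categorical

A hypothetical sketch combining the columns and categories arguments for memory-restricted environments; the column names below are placeholders:

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(
...     database='...',
...     table='...',
...     columns=['city', 'value'],
...     categories=['city']
... )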