awswrangler.s3.read_parquet_table

awswrangler.s3.read_parquet_table(table: str, database: str, filters: Union[List[Tuple], List[List[Tuple]], None] = None, columns: Optional[List[str]] = None, categories: Optional[List[str]] = None, chunked: Union[bool, int] = False, use_threads: bool = True, boto3_session: Optional[boto3.session.Session] = None, s3_additional_kwargs: Optional[Dict[str, str]] = None) → Union[pandas.core.frame.DataFrame, Iterator[pandas.core.frame.DataFrame]]

Read Apache Parquet table registered in the AWS Glue Catalog.

Note

Batching (chunked argument) (Memory Friendly):

Will enable the function to return an Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies in Wrangler:

  • If chunked=True, a new DataFrame will be returned for each file in your path/dataset.

  • If chunked=INTEGER, Wrangler will paginate through the files, slicing and concatenating so that each returned DataFrame has a number of rows equal to the received INTEGER.

P.S. chunked=True is faster and uses less memory, while chunked=INTEGER is more precise in the number of rows for each DataFrame.

Note

If use_threads=True, the number of threads that will be spawned will be obtained from os.cpu_count().

Parameters
  • table (str) – AWS Glue Catalog table name.

  • database (str) – AWS Glue Catalog database name.

  • filters (Union[List[Tuple], List[List[Tuple]]], optional) – List of filters to apply, like [[('x', '=', 0), ...], ...].

  • columns (List[str], optional) – Names of columns to read from the file(s).

  • categories (List[str], optional) – List of columns names that should be returned as pandas.Categorical. Recommended for memory restricted environments.

  • chunked (Union[bool, int]) – If True, will break the data into smaller DataFrames, one per file (non-deterministic number of rows). If an INTEGER is passed, each returned DataFrame will have that number of rows. Otherwise, a single DataFrame with the whole data is returned.

  • use_threads (bool) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() will be used as the max number of threads.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receives None.

  • s3_additional_kwargs – Forwarded to s3fs, useful for server-side encryption: https://s3fs.readthedocs.io/en/latest/#serverside-encryption

Returns

Pandas DataFrame or a Generator of Pandas DataFrames if chunked is passed.

Return type

Union[pandas.DataFrame, Generator[pandas.DataFrame, None, None]]

Examples

Reading Parquet Table

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(database='...', table='...')
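
Reading Parquet Table selecting columns and filtering partitions. A minimal sketch; the column names ('col_a', 'col_b') and the 'year' partition are hypothetical:

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(
...     database='...',
...     table='...',
...     columns=['col_a', 'col_b'],  # hypothetical column names
...     filters=[('year', '=', 2020)]  # hypothetical partition column
... )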

Reading Parquet Table encrypted

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(
...     database='...',
...     table='...',
...     s3_additional_kwargs={
...         'ServerSideEncryption': 'aws:kms',
...         'SSEKMSKeyId': 'YOUR_KMS_KEY_ARN'
...     }
... )

Reading Parquet Table in chunks (Chunk by file)

>>> import awswrangler as wr
>>> dfs = wr.s3.read_parquet_table(database='...', table='...', chunked=True)
>>> for df in dfs:
...     print(df)  # Smaller Pandas DataFrame

Reading Parquet Table in chunks (Chunk by 1MM rows)

>>> import awswrangler as wr
>>> dfs = wr.s3.read_parquet_table(database='...', table='...', chunked=1_000_000)
>>> for df in dfs:
...     print(df)  # 1MM Pandas DataFrame
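
Reading Parquet Table with Categoricals. A minimal sketch; 'my_col' is a hypothetical column name:

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(
...     database='...',
...     table='...',
...     categories=['my_col'],  # returned as pandas.Categorical to reduce memory usage
...     use_threads=False  # disable concurrent requests
... )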