awswrangler.s3.read_parquet_table(table: str, database: str, filters: Optional[Union[List[Tuple], List[List[Tuple]]]] = None, columns: Optional[List[str]] = None, validate_schema: bool = True, categories: List[str] = None, safe: bool = True, chunked: Union[bool, int] = False, use_threads: bool = True, boto3_session: Optional[boto3.session.Session] = None, s3_additional_kwargs: Optional[Dict[str, str]] = None) → Union[pandas.core.frame.DataFrame, Iterator[pandas.core.frame.DataFrame]]

Read an Apache Parquet table registered in the AWS Glue Catalog.


Batching (chunked argument) (Memory Friendly):

Enables the function to return an Iterable of DataFrames instead of a single DataFrame.

There are two batching strategies on Wrangler:

  • If chunked=True, a new DataFrame will be returned for each file in your path/dataset.

  • If chunked=INTEGER, Wrangler will paginate through the files, slicing and concatenating them to return DataFrames with a number of rows equal to the received INTEGER.

P.S. chunked=True is faster and uses less memory, while chunked=INTEGER is more precise about the number of rows in each DataFrame.
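The difference between the two strategies can be illustrated without touching S3. The following is a minimal sketch, not Wrangler's implementation: plain lists of dicts stand in for DataFrames, and the helper names `chunk_by_file` and `chunk_by_rows` are hypothetical. It shows why per-file chunks have unpredictable sizes while integer chunks are re-sliced across file boundaries.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]  # stand-in for one DataFrame row


def chunk_by_file(files: Iterable[List[Row]]) -> Iterator[List[Row]]:
    """Mimic chunked=True: one chunk per input file, whatever its size."""
    for rows in files:
        yield rows


def chunk_by_rows(files: Iterable[List[Row]], size: int) -> Iterator[List[Row]]:
    """Mimic chunked=INTEGER: re-slice rows across file boundaries."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    buffer: List[Row] = []
    for rows in files:
        buffer.extend(rows)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= size:
            yield buffer[:size]
            buffer = buffer[size:]
    if buffer:
        # In this sketch the final chunk may hold fewer rows than requested.
        yield buffer


# Three hypothetical "files" holding 3, 1 and 4 rows.
files = [
    [{"id": i} for i in range(0, 3)],
    [{"id": i} for i in range(3, 4)],
    [{"id": i} for i in range(4, 8)],
]

by_file_sizes = [len(chunk) for chunk in chunk_by_file(files)]
by_rows_sizes = [len(chunk) for chunk in chunk_by_rows(files, size=3)]
```

Per-file chunking does no extra work because each file is passed through untouched, which is consistent with the note that chunked=True is the cheaper option. Row-based chunking has to buffer and concatenate, but the caller gets predictable chunk sizes.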


If use_threads=True, the number of threads to spawn is taken from os.cpu_count().
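As a rough illustration of sizing a thread pool from os.cpu_count(), here is a sketch using the standard library. The helper `resolve_max_workers` and the placeholder `fetch` function are hypothetical and are not part of Wrangler; how the library schedules its requests internally is not shown here.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List


def resolve_max_workers(use_threads: bool) -> int:
    """Hypothetical helper: one worker when disabled, else the CPU count."""
    if not use_threads:
        return 1
    # os.cpu_count() can return None when the count cannot be determined.
    return os.cpu_count() or 1


def fetch(key: str) -> str:
    """Placeholder for an I/O-bound request, such as downloading one object."""
    return key.upper()


keys: List[str] = ["part-0", "part-1", "part-2"]
with ThreadPoolExecutor(max_workers=resolve_max_workers(use_threads=True)) as pool:
    # map() returns results in input order even though calls run concurrently.
    results = list(pool.map(fetch, keys))
```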

Parameters

  • table (str) – AWS Glue Catalog table name.

  • database (str) – AWS Glue Catalog database name.

  • filters (Union[List[Tuple], List[List[Tuple]]], optional) – List of filters to apply, like [[('x', '=', 0), ...], ...].

  • columns (List[str], optional) – Names of columns to read from the file(s).

  • validate_schema (bool, default True) – Check that individual file schemas are all the same / compatible. Schemas within a folder prefix should all be the same. Disable this check if your files have different schemas.

  • categories (List[str], optional) – List of column names that should be returned as pandas.Categorical. Recommended for memory-restricted environments.

  • safe (bool, default True) – For certain data types, a cast is needed in order to store the data in a pandas DataFrame or Series (e.g. timestamps are always stored as nanoseconds in pandas). This option controls whether it is a safe cast or not.

  • chunked (Union[bool, int]) – If True, break the data into smaller DataFrames, one per file (non-deterministic number of rows). If an integer, return DataFrames with that number of rows each. Otherwise, return a single DataFrame with the whole data.

  • use_threads (bool) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() will be used as the max number of threads.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session is None.

  • s3_additional_kwargs (Dict[str, str], optional) – Forwarded to s3fs; useful for server-side encryption.
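The nested-list shape of the filters argument can be hard to read at first. The sketch below assumes the disjunctive normal form convention used by pyarrow, where tuples inside one inner list are combined with AND and the inner lists are combined with OR; a flat list of tuples is treated as a single AND group. This page does not state the semantics, so treat that as an assumption. The `matches` function and the sample rows are hypothetical and only evaluate filters against plain dicts; they do not read any Parquet data.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

Predicate = Tuple[str, str, Any]

# Operators supported by this sketch; a real reader may accept a different set.
_OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def matches(row: Dict[str, Any], filters: List[Any]) -> bool:
    """Return True if `row` satisfies `filters` under the assumed DNF rules."""
    if not filters:
        return True
    # A flat list of tuples is treated as a single AND group.
    groups = [filters] if isinstance(filters[0], tuple) else filters
    return any(
        all(_OPS[op](row[column], value) for column, op, value in group)
        for group in groups
    )


rows = [
    {"year": 2019, "region": "eu"},
    {"year": 2020, "region": "us"},
    {"year": 2020, "region": "eu"},
]

# year == 2020 AND region == 'eu'
and_filter: List[Predicate] = [("year", "=", 2020), ("region", "=", "eu")]
# year == 2019 OR region == 'us'
or_filter: List[List[Predicate]] = [[("year", "=", 2019)], [("region", "=", "us")]]

and_hits = [r for r in rows if matches(r, and_filter)]
or_hits = [r for r in rows if matches(r, or_filter)]
```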


Returns

Pandas DataFrame, or a Generator of DataFrames if chunked is True or an integer.

Return type

Union[pandas.DataFrame, Generator[pandas.DataFrame, None, None]]


Examples

Reading Parquet Table

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(database='...', table='...')

Reading Parquet Table encrypted

>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(
...     database='...',
...     table='...',
...     s3_additional_kwargs={
...         'ServerSideEncryption': 'aws:kms',
...         'SSEKMSKeyId': 'YOUR_KMS_KEY_ARN'
...     }
... )

Reading Parquet Table in chunks (Chunk by file)

>>> import awswrangler as wr
>>> dfs = wr.s3.read_parquet_table(database='...', table='...', chunked=True)
>>> for df in dfs:
...     print(df)  # Smaller Pandas DataFrame

Reading Parquet Table in chunks (Chunk by 1MM rows)

>>> import awswrangler as wr
>>> dfs = wr.s3.read_parquet_table(database='...', table='...', chunked=1_000_000)
>>> for df in dfs:
...     print(df)  # Pandas DataFrame with 1MM rows