awswrangler.athena.read_sql_table

awswrangler.athena.read_sql_table(table: str, database: str, ctas_approach: bool = True, categories: Optional[List[str]] = None, chunksize: Optional[Union[int, bool]] = None, s3_output: Optional[str] = None, workgroup: Optional[str] = None, encryption: Optional[str] = None, kms_key: Optional[str] = None, keep_files: bool = True, ctas_database_name: Optional[str] = None, ctas_temp_table_name: Optional[str] = None, ctas_bucketing_info: Optional[Tuple[List[str], int]] = None, use_threads: Union[bool, int] = True, boto3_session: Optional[Session] = None, max_cache_seconds: int = 0, max_cache_query_inspections: int = 50, max_remote_cache_entries: int = 50, max_local_cache_entries: int = 100, data_source: Optional[str] = None, s3_additional_kwargs: Optional[Dict[str, Any]] = None, pyarrow_additional_kwargs: Optional[Dict[str, Any]] = None) → Any

Extract a full table from AWS Athena and return the results as a Pandas DataFrame.

Related tutorial:

There are two approaches, selected through the ctas_approach parameter:

1 - ctas_approach=True (Default):

Wraps the query in a CTAS and then reads the table data as Parquet directly from S3.

PROS:

Faster for mid and big result sizes.
Can handle some level of nested types.

CONS:

Requires create/delete table permissions on Glue.
Does not support timestamp with time zone.
Does not support columns with repeated names.
Does not support columns with undefined data types.
A temporary table will be created and then deleted immediately.

2 - ctas_approach=False:

Runs a regular query on Athena and parses the regular CSV result on S3.

PROS:

Faster for small result sizes (less latency).
Does not require create/delete table permissions on Glue.
Supports timestamp with time zone.

CONS:

Slower for big results (but still faster than other libraries that use the regular Athena API).
Does not handle nested types at all.
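The trade-offs above can be turned into a small decision helper. The following is only a sketch: pick_ctas_approach and read_table are hypothetical names that are not part of awswrangler, and the helper models only three of the listed constraints (Glue permissions, timestamp with time zone, nested types). It ignores result size, repeated column names and undefined data types.

```python
def pick_ctas_approach(
    needs_timestamp_tz: bool,
    has_nested_types: bool,
    can_manage_glue_tables: bool,
) -> bool:
    """Return a value for ctas_approach based on the PROS/CONS listed above.

    Hypothetical helper, not part of awswrangler.
    """
    # CTAS needs create/delete table permissions on Glue and does not
    # support timestamp with time zone.
    ctas_ok = can_manage_glue_tables and not needs_timestamp_tz
    # The regular CSV path does not handle nested types at all.
    csv_ok = not has_nested_types
    if ctas_ok:
        return True  # the default
    if csv_ok:
        return False
    raise ValueError("Neither approach satisfies the given constraints.")


def read_table(table: str, database: str, **constraints: bool):
    """Read a table with the approach chosen by pick_ctas_approach."""
    # Imported lazily so the helper above can be used without AWS access.
    import awswrangler as wr

    return wr.athena.read_sql_table(
        table=table,
        database=database,
        ctas_approach=pick_ctas_approach(**constraints),
    )
```

Because CTAS is the default, the helper falls back to ctas_approach=False only when CTAS is ruled out. For small results you may still prefer ctas_approach=False for its lower latency.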

Note

The resulting DataFrame (or every DataFrame in the returned Iterator for chunked queries) has a query_metadata attribute, which holds the query result metadata returned by Boto3/Athena.

For a practical example check out the related tutorial!

Note

Valid encryption modes: [None, ‘SSE_S3’, ‘SSE_KMS’].

P.S. ‘CSE_KMS’ is not supported.
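To catch an unsupported mode before a query is submitted, the arguments can be validated up front. This is a minimal sketch: encryption_kwargs and VALID_ENCRYPTION_MODES are hypothetical names, not part of awswrangler, and the library may perform its own checks.

```python
from typing import Any, Dict, Optional

# Modes listed in the note above; 'CSE_KMS' is deliberately absent.
VALID_ENCRYPTION_MODES = (None, "SSE_S3", "SSE_KMS")


def encryption_kwargs(
    encryption: Optional[str], kms_key: Optional[str] = None
) -> Dict[str, Any]:
    """Build the encryption-related keyword arguments (hypothetical helper)."""
    if encryption not in VALID_ENCRYPTION_MODES:
        raise ValueError(f"Unsupported encryption mode: {encryption!r}")
    kwargs: Dict[str, Any] = {"encryption": encryption}
    # kms_key is only meaningful for SSE-KMS (KMS key ARN or ID).
    if encryption == "SSE_KMS" and kms_key is not None:
        kwargs["kms_key"] = kms_key
    return kwargs
```

The returned dictionary can then be unpacked into the call, for example wr.athena.read_sql_table(table="...", database="...", **encryption_kwargs("SSE_S3")).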

Note

Creates the default Athena bucket if it doesn’t exist and s3_output is None.

(E.g. s3://aws-athena-query-results-ACCOUNT-REGION/)
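As an illustration of the naming pattern above, the default location can be derived from an account ID and a region. default_athena_bucket is a hypothetical helper, not an awswrangler function, and the account ID and region in the usage note are placeholders.

```python
def default_athena_bucket(account_id: str, region: str) -> str:
    """Format the default Athena results location from the pattern above."""
    return f"s3://aws-athena-query-results-{account_id}-{region}/"
```

For example, default_athena_bucket("123456789012", "us-east-1") gives s3://aws-athena-query-results-123456789012-us-east-1/.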

Note

chunksize argument (memory friendly, i.e. batching):

Return an Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies:

If chunksize=True, a new DataFrame will be returned for each file in the query result.
If chunksize=INTEGER, Wrangler will iterate on the data in batches of INTEGER rows.

P.S. chunksize=True is faster and uses less memory, while chunksize=INTEGER is more precise in the number of rows for each DataFrame.

P.P.S. If ctas_approach=False and chunksize=True, you will always receive an iterator with a single DataFrame, because regular Athena queries only produce a single output file.
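A chunked read can be consumed one DataFrame at a time, so that only a single chunk is held in memory. The sketch below counts rows this way; total_rows and count_table_rows are hypothetical helper names, and the default of 100,000 rows per chunk is an arbitrary assumption.

```python
from typing import Iterable, Sized


def total_rows(chunks: Iterable[Sized]) -> int:
    """Sum the lengths of all chunks; len() of a DataFrame is its row count."""
    return sum(len(chunk) for chunk in chunks)


def count_table_rows(table: str, database: str, rows_per_chunk: int = 100_000) -> int:
    """Count the rows of a table without loading it all at once."""
    # Imported lazily so total_rows can be used without AWS access.
    import awswrangler as wr

    chunks = wr.athena.read_sql_table(
        table=table,
        database=database,
        chunksize=rows_per_chunk,  # pass True instead to get one chunk per file
    )
    return total_rows(chunks)
```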

Note

If use_threads=True, the number of threads to spawn is taken from os.cpu_count().
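The way a use_threads value maps to a thread count can be sketched as follows. resolve_thread_count is a hypothetical function that mirrors the documented behaviour; it is not the library's internal implementation, and the fallback to 1 when os.cpu_count() returns None is an assumption.

```python
import os
from typing import Union


def resolve_thread_count(use_threads: Union[bool, int]) -> int:
    """Map a use_threads value to a number of threads (illustrative only)."""
    # bool is a subclass of int, so it has to be checked first.
    if isinstance(use_threads, bool):
        # os.cpu_count() may return None; fall back to a single thread.
        return (os.cpu_count() or 1) if use_threads else 1
    # An explicit integer is used as given, with a floor of one thread.
    return max(1, use_threads)
```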

Note

This function has arguments which can be configured globally through wr.config or environment variables:

ctas_approach
database
max_cache_query_inspections
max_cache_seconds
max_remote_cache_entries
max_local_cache_entries
workgroup
chunksize

Check out the Global Configurations Tutorial for details.
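As a brief sketch of the wr.config route, some of the arguments listed above can be given process-wide defaults. set_global_defaults is a hypothetical wrapper, and the values shown (database name, cache age) are placeholders.

```python
def set_global_defaults() -> None:
    """Set process-wide defaults for a few of the arguments listed above."""
    # Imported lazily so that defining this function needs no AWS setup.
    import awswrangler as wr

    wr.config.database = "my_database"  # placeholder database name
    wr.config.ctas_approach = False
    wr.config.max_cache_seconds = 900  # placeholder value
```

After calling it, read_sql_table picks up these defaults for any of those arguments that the caller does not pass explicitly.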

Parameters

table (str) – Table name.
database (str) – AWS Glue/Athena database name.
ctas_approach (bool) – If True, wraps the query in a CTAS and reads the resulting Parquet data on S3. If False, reads the regular CSV on S3.
categories (List[str], optional) – List of column names that should be returned as pandas.Categorical. Recommended for memory-restricted environments.
chunksize (Union[int, bool], optional) – If passed, splits the data into an Iterable of DataFrames (memory friendly). If True, Wrangler will iterate on the data by files in the most efficient way, without any guarantee of chunk size. If an INTEGER is passed, Wrangler will iterate on the data in batches of INTEGER rows.
s3_output (str, optional) – AWS S3 path.
workgroup (str, optional) – Athena workgroup.
encryption (str, optional) – Valid values: [None, ‘SSE_S3’, ‘SSE_KMS’]. Notice: ‘CSE_KMS’ is not supported.
kms_key (str, optional) – For SSE-KMS, this is the KMS key ARN or ID.
keep_files (bool) – Should Wrangler delete or keep the staging files produced by Athena?
ctas_database_name (str, optional) – The name of the alternative database where the CTAS temporary table is stored. If None, the default database is used.
ctas_temp_table_name (str, optional) – The name of the temporary table and also the directory name on S3 where the CTAS result is stored. If None, it will use the following random pattern: f"temp_table_{uuid.uuid4().hex}". On S3 this directory will follow the pattern: f"{s3_output}/{ctas_temp_table_name}/".
ctas_bucketing_info (Tuple[List[str], int], optional) – Tuple consisting of the column names used for bucketing as the first element and the number of buckets as the second element. Only str, int and bool are supported as column data types for bucketing.
use_threads (bool, int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() will be used as the maximum number of threads. If an integer is provided, that number is used instead.
boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session is None.
max_cache_seconds (int) – Wrangler can look up Athena’s history to check whether this table has been read before. If so, and the query completed less than max_cache_seconds ago, Wrangler skips query execution and returns the same results as last time. If cached results are valid, Wrangler ignores the ctas_approach, s3_output, encryption, kms_key, keep_files and ctas_temp_table_name params. If reading cached data fails for any reason, execution falls back to the usual query run path.
max_cache_query_inspections (int) – Maximum number of queries that will be inspected from the history to try to find a result to reuse. The more queries are inspected, the higher the latency for queries that are not cached. Only takes effect if max_cache_seconds > 0.
max_remote_cache_entries (int) – Maximum number of queries that will be retrieved from AWS for cache inspection. The more queries are retrieved, the higher the latency for queries that are not cached. Only takes effect if max_cache_seconds > 0. The default value is 50.
max_local_cache_entries (int) – Maximum number of queries for which metadata will be cached locally. This reduces latency and also allows keeping more than max_remote_cache_entries available for the cache. This value should not be smaller than max_remote_cache_entries. Only takes effect if max_cache_seconds > 0. The default value is 100.
data_source (str, optional) – Data Source / Catalog name. If None, ‘AwsDataCatalog’ will be used by default.
s3_additional_kwargs (Optional[Dict[str, Any]]) – Forwarded to botocore requests. e.g. s3_additional_kwargs={‘RequestPayer’: ‘requester’}
pyarrow_additional_kwargs (Optional[Dict[str, Any]]) – Forwarded to the ParquetFile class or used when converting an Arrow table to Pandas. Currently only the "coerce_int96_timestamp_unit" and "timestamp_as_object" arguments are considered. If you are reading Parquet files where a timestamp cannot be converted to pandas Timestamp[ns], consider setting timestamp_as_object=True to allow for timestamp units larger than ns. If you are reading Parquet data that still uses INT96 (like Athena outputs), you can use coerce_int96_timestamp_unit to specify which timestamp unit INT96 values are converted to. The default is "ns"; if you know the Parquet output came from a system that encodes timestamps in a particular unit, set this to that same unit, e.g. coerce_int96_timestamp_unit="ms".
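To make the ctas_temp_table_name patterns concrete, the sketch below reproduces them in plain Python. ctas_temp_table_location is a hypothetical helper that only mirrors the documented patterns. Stripping a trailing slash from s3_output is a choice made here to avoid a double slash, not documented library behaviour.

```python
import uuid
from typing import Optional, Tuple


def ctas_temp_table_location(
    s3_output: str, ctas_temp_table_name: Optional[str] = None
) -> Tuple[str, str]:
    """Return (table name, S3 directory) following the documented patterns."""
    # Documented default when no name is given: a random, UUID-based name.
    name = ctas_temp_table_name or f"temp_table_{uuid.uuid4().hex}"
    # Documented layout: {s3_output}/{ctas_temp_table_name}/
    return name, f"{s3_output.rstrip('/')}/{name}/"
```

Passing an explicit ctas_temp_table_name makes the S3 location predictable, which can help when the staging files are kept (keep_files=True) and inspected later.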

Returns

Pandas DataFrame or Generator of Pandas DataFrames if chunksize is passed.

Return type

Union[pd.DataFrame, Iterator[pd.DataFrame]]

Examples

>>> import awswrangler as wr
>>> df = wr.athena.read_sql_table(table="...", database="...")
>>> scanned_bytes = df.query_metadata["Statistics"]["DataScannedInBytes"]
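A further example, combining the cache parameters with query_metadata, might look like the sketch below. read_with_cache is a hypothetical wrapper and the 900-second default is a placeholder. It assumes a non-chunked read, so the result is a single DataFrame.

```python
def read_with_cache(table: str, database: str, max_age_seconds: int = 900):
    """Read a table, reusing a recent Athena result when one is available."""
    # Imported lazily so that defining this function needs no AWS setup.
    import awswrangler as wr

    df = wr.athena.read_sql_table(
        table=table,
        database=database,
        max_cache_seconds=max_age_seconds,  # 0 (the default) disables the cache
    )
    # Same metadata lookup as in the example above.
    scanned_bytes = df.query_metadata["Statistics"]["DataScannedInBytes"]
    return df, scanned_bytes
```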