awswrangler.athena.read_sql_query

awswrangler.athena.read_sql_query(sql: str, database: str, ctas_approach: bool = True, categories: Optional[List[str]] = None, chunksize: Optional[Union[int, bool]] = None, s3_output: Optional[str] = None, workgroup: Optional[str] = None, encryption: Optional[str] = None, kms_key: Optional[str] = None, keep_files: bool = True, ctas_temp_table_name: Optional[str] = None, use_threads: bool = True, boto3_session: Optional[boto3.session.Session] = None, max_cache_seconds: int = 0, max_cache_query_inspections: int = 50) → Union[pandas.core.frame.DataFrame, Iterator[pandas.core.frame.DataFrame]]

Execute any SQL query on AWS Athena and return the results as a Pandas DataFrame.

There are two approaches, selected through the ctas_approach parameter:

1 - ctas_approach=True (Default):

Wraps the query in a CTAS and then reads the table data as Parquet directly from S3.

PROS:

  • Faster for mid and big result sizes.

  • Can handle some level of nested types.

CONS:

  • Requires create/delete table permissions on Glue.

  • Does not support timestamp with time zone.

  • Does not support columns with repeated names.

  • Does not support columns with undefined data types.

  • A temporary table will be created and then deleted immediately.

2 - ctas_approach=False:

Runs a regular query on Athena and parses the regular CSV result on S3.

PROS:

  • Faster for small result sizes (less latency).

  • Does not require create/delete table permissions on Glue.

  • Supports timestamp with time zone.

CONS:

  • Slower for big results (but still faster than other libraries that use the regular Athena API).

  • Does not handle nested types at all.
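As a rough illustration of the trade-off above, the sketch below picks ctas_approach from an expected result size. The helper names athena_read_kwargs and read_athena and the expect_small_result flag are hypothetical and not part of awswrangler; only the sql, database and ctas_approach arguments come from the signature documented here.

```python
from typing import Any, Dict


def athena_read_kwargs(sql: str, database: str, expect_small_result: bool) -> Dict[str, Any]:
    """Build keyword arguments for wr.athena.read_sql_query (hypothetical helper)."""
    return {
        "sql": sql,
        "database": database,
        # Small results: skip the CTAS round trip and parse the CSV output.
        # Mid and big results: let Athena write Parquet through a CTAS.
        "ctas_approach": not expect_small_result,
    }


def read_athena(sql: str, database: str, expect_small_result: bool = False):
    """Run the query; needs awswrangler installed and AWS credentials configured."""
    import awswrangler as wr  # imported lazily so the helper above works without it

    return wr.athena.read_sql_query(**athena_read_kwargs(sql, database, expect_small_result))
```

Queries that return timestamp with time zone columns or repeated column names would need ctas_approach=False regardless of size, so a real helper would probably take that into account as well.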

Note

The resulting DataFrame (or every DataFrame in the returned Iterator for chunked queries) has a query_metadata attribute, which carries the query result metadata returned by Boto3/Athena. The expected query_metadata format is the same as returned by: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/athena.html#Athena.Client.get_query_execution

Note

Valid encryption modes: [None, ‘SSE_S3’, ‘SSE_KMS’].

P.S. ‘CSE_KMS’ is not supported.
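A caller may want to reject an unsupported mode before submitting a query. The following is a minimal sketch of such a check; check_encryption and VALID_ENCRYPTION_MODES are hypothetical names used for illustration and do not reflect how Wrangler validates the argument internally.

```python
from typing import Optional

# Modes listed in the note above; 'CSE_KMS' is deliberately absent.
VALID_ENCRYPTION_MODES = (None, "SSE_S3", "SSE_KMS")


def check_encryption(encryption: Optional[str]) -> None:
    """Raise ValueError if the mode is not one of the documented values."""
    if encryption not in VALID_ENCRYPTION_MODES:
        raise ValueError(
            f"Unsupported encryption mode {encryption!r}; "
            f"expected one of {VALID_ENCRYPTION_MODES}."
        )
```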

Note

Creates the default Athena bucket if it doesn’t exist and s3_output is None.

(E.g. s3://aws-athena-query-results-ACCOUNT-REGION/)

Note

Batching (chunksize argument) (Memory Friendly):

Enables the function to return an Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies on Wrangler:

  • If chunksize=True, a new DataFrame will be returned for each file in the query result.

  • If chunksize=INTEGER, Wrangler will iterate on the data by number of rows equal to the received INTEGER.

P.S. chunksize=True is faster and uses less memory, while chunksize=INTEGER is more precise in the number of rows for each DataFrame.
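The difference between the two strategies can be illustrated without Athena by treating each result file as a list of rows. This is a simplified model rather than Wrangler's implementation: chunks_by_file and chunks_by_rows are hypothetical names, and carrying leftover rows across file boundaries is an assumption made for the sketch.

```python
from typing import Iterator, List, Sequence


def chunks_by_file(files: Sequence[List[int]]) -> Iterator[List[int]]:
    """Analogue of chunksize=True: one chunk per result file, sizes vary."""
    for rows in files:
        yield rows


def chunks_by_rows(files: Sequence[List[int]], chunksize: int) -> Iterator[List[int]]:
    """Analogue of chunksize=INTEGER: fixed-size chunks, last one may be shorter."""
    buffer: List[int] = []
    for rows in files:
        buffer.extend(rows)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunksize:
            yield buffer[:chunksize]
            buffer = buffer[chunksize:]
    if buffer:
        yield buffer


files = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
file_chunk_sizes = [len(chunk) for chunk in chunks_by_file(files)]
row_chunk_sizes = [len(chunk) for chunk in chunks_by_rows(files, 4)]
```

In the by-file model no rows are buffered or re-sliced, which is consistent with that strategy being the cheaper one; the by-rows model has to hold and split rows to reach the exact size.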

Note

If use_threads=True, the number of threads spawned is taken from os.cpu_count().

Note

This function has arguments whose default values can be configured globally through wr.config or environment variables:

  • ctas_approach

  • database

  • max_cache_query_inspections

  • max_cache_seconds

Check out the Global Configurations Tutorial for details.
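As a hedged sketch of the environment-variable route, the snippet below sets defaults for the four arguments listed above. The WR_<ARGUMENT> variable names are an assumption based on the convention used in the Global Configurations Tutorial, the values are purely illustrative, and apply_wrangler_defaults is a hypothetical helper. The variables presumably need to be set before awswrangler is imported.

```python
import os
from typing import MutableMapping

# Assumed naming convention: WR_ followed by the argument name in upper case.
WRANGLER_DEFAULTS = {
    "WR_DATABASE": "default",
    "WR_CTAS_APPROACH": "False",
    "WR_MAX_CACHE_SECONDS": "900",
    "WR_MAX_CACHE_QUERY_INSPECTIONS": "500",
}


def apply_wrangler_defaults(environ: MutableMapping[str, str] = os.environ) -> None:
    """Copy the defaults into the environment without overriding existing values."""
    for name, value in WRANGLER_DEFAULTS.items():
        environ.setdefault(name, value)


# The in-code route is expected to look like attribute assignment on wr.config,
# for example: wr.config.database = "default" (check the tutorial before relying on it).
```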

Parameters
  • sql (str) – SQL query.

  • database (str) – AWS Glue/Athena database name.

  • ctas_approach (bool) – Wraps the query using a CTAS and reads the resulting Parquet data on S3. If False, reads the regular CSV on S3.

  • categories (List[str], optional) – List of column names that should be returned as pandas.Categorical. Recommended for memory-restricted environments.

  • chunksize (Union[int, bool], optional) – If passed, splits the data into an Iterable of DataFrames (memory friendly). If True, Wrangler will iterate on the data by files in the most efficient way, without any guarantee on chunk size. If an INTEGER is passed, Wrangler will iterate on the data by number of rows equal to the received INTEGER.

  • s3_output (str, optional) – Amazon S3 path.

  • workgroup (str, optional) – Athena workgroup.

  • encryption (str, optional) – Valid values: [None, ‘SSE_S3’, ‘SSE_KMS’]. Notice: ‘CSE_KMS’ is not supported.

  • kms_key (str, optional) – For SSE-KMS, this is the KMS key ARN or ID.

  • keep_files (bool) – Should Wrangler delete or keep the staging files produced by Athena?

  • ctas_temp_table_name (str, optional) – The name of the temporary table and also the directory name on S3 where the CTAS result is stored. If None, it will use the following random pattern: f"temp_table_{uuid.uuid4().hex}". On S3 this directory will be under the pattern: f"{s3_output}/{ctas_temp_table_name}/".

  • use_threads (bool) – True to enable concurrent requests, False to disable multiple threads. If enabled os.cpu_count() will be used as the max number of threads.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receives None.

  • max_cache_seconds (int) – Wrangler can check Athena’s history to see whether this query has been run before. If so, and it completed less than max_cache_seconds ago, Wrangler skips query execution and just returns the same results as last time. If cached results are valid, Wrangler ignores the ctas_approach, s3_output, encryption, kms_key, keep_files and ctas_temp_table_name params. If reading cached data fails for any reason, execution falls back to the usual query run path.

  • max_cache_query_inspections (int) – Max number of queries that will be inspected from the history to try to find some result to reuse. The larger the number of inspections, the higher the latency for queries that are not cached. Only takes effect if max_cache_seconds > 0.
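How max_cache_seconds and max_cache_query_inspections interact can be sketched with a toy lookup over a query history. This is an illustrative model only, not Wrangler's implementation: PastQuery and find_cached_query are hypothetical names, the history is assumed to be ordered most recent first, and queries are matched by exact string equality.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass
class PastQuery:
    query_execution_id: str
    sql: str
    completed_at: datetime


def find_cached_query(
    sql: str,
    history: Iterable[PastQuery],
    max_cache_seconds: int,
    max_cache_query_inspections: int,
    now: Optional[datetime] = None,
) -> Optional[PastQuery]:
    """Return a reusable past query, or None if the query must be executed."""
    if max_cache_seconds <= 0:
        return None  # caching disabled, the history is never inspected
    now = now or datetime.now(timezone.utc)
    oldest_allowed = now - timedelta(seconds=max_cache_seconds)
    for inspected, past in enumerate(history):
        if inspected >= max_cache_query_inspections:
            break  # inspection budget exhausted
        if past.sql == sql and past.completed_at >= oldest_allowed:
            return past
    return None
```

With the default max_cache_seconds=0 the function returns immediately, which mirrors the statement that max_cache_query_inspections only takes effect when max_cache_seconds > 0.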

Returns

Pandas DataFrame, or a Generator of Pandas DataFrames if chunksize is passed.

Return type

Union[pd.DataFrame, Iterator[pd.DataFrame]]

Examples

>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(sql="...", database="...")
>>> scanned_bytes = df.query_metadata["Statistics"]["DataScannedInBytes"]