awswrangler.athena.read_sql_query

awswrangler.athena.read_sql_query(sql: str, database: str, ctas_approach: bool = True, categories: Optional[List[str]] = None, chunksize: Optional[Union[int, bool]] = None, s3_output: Optional[str] = None, workgroup: Optional[str] = None, encryption: Optional[str] = None, kms_key: Optional[str] = None, keep_files: bool = True, ctas_database_name: Optional[str] = None, ctas_temp_table_name: Optional[str] = None, use_threads: bool = True, boto3_session: Optional[boto3.session.Session] = None, max_cache_seconds: int = 0, max_cache_query_inspections: int = 50, max_remote_cache_entries: int = 50, max_local_cache_entries: int = 100, data_source: Optional[str] = None, params: Optional[Dict[str, Any]] = None, s3_additional_kwargs: Optional[Dict[str, Any]] = None)Any

Execute any SQL query on AWS Athena and return the results as a Pandas DataFrame.

Related tutorial:

There are two approaches to be defined through ctas_approach parameter:

1 - ctas_approach=True (Default):

Wrap the query with a CTAS and then reads the table data as parquet directly from s3.

PROS:

  • Faster for mid and big result sizes.

  • Can handle some level of nested types.

CONS:

  • Requires create/delete table permissions on Glue.

  • Does not support timestamp with time zone

  • Does not support columns with repeated names.

  • Does not support columns with undefined data types.

  • A temporary table will be created and then deleted immediately.

  • Does not support custom data_source/catalog_id.

2 - ctas_approach=False:

Does a regular query on Athena and parse the regular CSV result on s3.

PROS:

  • Faster for small result sizes (less latency).

  • Does not require create/delete table permissions on Glue

  • Supports timestamp with time zone.

  • Support custom data_source/catalog_id.

CONS:

  • Slower for big results (But stills faster than other libraries that uses the regular Athena’s API)

  • Does not handle nested types at all.

Note

The resulting DataFrame (or every DataFrame in the returned Iterator for chunked queries) have a query_metadata attribute, which brings the query result metadata returned by Boto3/Athena .

For a practical example check out the related tutorial!

Note

Valid encryption modes: [None, ‘SSE_S3’, ‘SSE_KMS’].

P.S. ‘CSE_KMS’ is not supported.

Note

Create the default Athena bucket if it doesn’t exist and s3_output is None.

(E.g. s3://aws-athena-query-results-ACCOUNT-REGION/)

Note

chunksize argument (Memory Friendly) (i.e batching):

Return an Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies:

  • If chunksize=True, a new DataFrame will be returned for each file in the query result.

  • If chunksize=INTEGER, Wrangler will iterate on the data by number of rows igual the received INTEGER.

P.S. chunksize=True is faster and uses less memory while chunksize=INTEGER is more precise in number of rows for each Dataframe.

P.P.S. If ctas_approach=False and chunksize=True, you will always receive an interador with a single DataFrame because regular Athena queries only produces a single output file.

Note

In case of use_threads=True the number of threads that will be spawned will be gotten from os.cpu_count().

Note

This function has arguments which can be configured globally through wr.config or environment variables:

  • ctas_approach

  • database

  • max_cache_query_inspections

  • max_cache_seconds

  • max_remote_cache_entries

  • max_local_cache_entries

  • workgroup

  • chunksize

Check out the Global Configurations Tutorial for details.

Parameters
  • sql (str) – SQL query.

  • database (str) – AWS Glue/Athena database name - It is only the origin database from where the query will be launched. You can still using and mixing several databases writing the full table name within the sql (e.g. database.table).

  • ctas_approach (bool) – Wraps the query using a CTAS, and read the resulted parquet data on S3. If false, read the regular CSV on S3.

  • categories (List[str], optional) – List of columns names that should be returned as pandas.Categorical. Recommended for memory restricted environments.

  • chunksize (Union[int, bool], optional) – If passed will split the data in a Iterable of DataFrames (Memory friendly). If True wrangler will iterate on the data by files in the most efficient way without guarantee of chunksize. If an INTEGER is passed Wrangler will iterate on the data by number of rows igual the received INTEGER.

  • s3_output (str, optional) – Amazon S3 path.

  • workgroup (str, optional) – Athena workgroup.

  • encryption (str, optional) – Valid values: [None, ‘SSE_S3’, ‘SSE_KMS’]. Notice: ‘CSE_KMS’ is not supported.

  • kms_key (str, optional) – For SSE-KMS, this is the KMS key ARN or ID.

  • keep_files (bool) – Should Wrangler delete or keep the staging files produced by Athena?

  • ctas_database_name (str, optional) – The name of the alternative database where the CTAS temporary table is stored. If None, the default database is used.

  • ctas_temp_table_name (str, optional) – The name of the temporary table and also the directory name on S3 where the CTAS result is stored. If None, it will use the follow random pattern: f”temp_table_{uuid.uuid4().hex()}”. On S3 this directory will be under under the pattern: f”{s3_output}/{ctas_temp_table_name}/”.

  • use_threads (bool) – True to enable concurrent requests, False to disable multiple threads. If enabled os.cpu_count() will be used as the max number of threads.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receive None.

  • max_cache_seconds (int) – Wrangler can look up in Athena’s history if this query has been run before. If so, and its completion time is less than max_cache_seconds before now, wrangler skips query execution and just returns the same results as last time. If cached results are valid, wrangler ignores the ctas_approach, s3_output, encryption, kms_key, keep_files and ctas_temp_table_name params. If reading cached data fails for any reason, execution falls back to the usual query run path.

  • max_cache_query_inspections (int) – Max number of queries that will be inspected from the history to try to find some result to reuse. The bigger the number of inspection, the bigger will be the latency for not cached queries. Only takes effect if max_cache_seconds > 0.

  • max_remote_cache_entries (int) – Max number of queries that will be retrieved from AWS for cache inspection. The bigger the number of inspection, the bigger will be the latency for not cached queries. Only takes effect if max_cache_seconds > 0 and default value is 50.

  • max_local_cache_entries (int) – Max number of queries for which metadata will be cached locally. This will reduce the latency and also enables keeping more than max_remote_cache_entries available for the cache. This value should not be smaller than max_remote_cache_entries. Only takes effect if max_cache_seconds > 0 and default value is 100.

  • data_source (str, optional) – Data Source / Catalog name. If None, ‘AwsDataCatalog’ will be used by default.

  • params (Dict[str, any], optional) – Dict of parameters that will be used for constructing the SQL query. Only named parameters are supported. The dict needs to contain the information in the form {‘name’: ‘value’} and the SQL query needs to contain :name;. Note that for varchar columns and similar, you must surround the value in single quotes.

  • s3_additional_kwargs (Optional[Dict[str, Any]]) – Forward to botocore requests. Valid parameters: “RequestPayer”, “ExpectedBucketOwner”. e.g. s3_additional_kwargs={‘RequestPayer’: ‘requester’}

Returns

Pandas DataFrame or Generator of Pandas DataFrames if chunksize is passed.

Return type

Union[pd.DataFrame, Iterator[pd.DataFrame]]

Examples

>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(sql="...", database="...")
>>> scanned_bytes = df.query_metadata["Statistics"]["DataScannedInBytes"]
>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(
...     sql="SELECT * FROM my_table WHERE name=:name; AND city=:city;",
...     params={"name": "'filtered_name'", "city": "'filtered_city'"}
... )