awswrangler.athena.read_sql_query(sql: str, database: str, ctas_approach: bool = True, categories: List[str] = None, chunksize: Union[int, bool, None] = None, s3_output: Optional[str] = None, workgroup: Optional[str] = None, encryption: Optional[str] = None, kms_key: Optional[str] = None, keep_files: bool = True, ctas_temp_table_name: Optional[str] = None, use_threads: bool = True, boto3_session: Optional[boto3.session.Session] = None) → Union[pandas.core.frame.DataFrame, Iterator[pandas.core.frame.DataFrame]]

Execute any SQL query on AWS Athena and return the results as a Pandas DataFrame.

There are two approaches to be defined through ctas_approach parameter:

1 - ctas_approach=True (Default): Wrap the query with a CTAS and then reads the table data as parquet directly from s3. PROS: Faster and can handle some level of nested types. CONS: Requires create/delete table permissions on Glue and Does not support timestamp with time zone (A temporary table will be created and then deleted immediately).

2 - ctas_approach False: Does a regular query on Athena and parse the regular CSV result on s3. PROS: Does not require create/delete table permissions on Glue and supports timestamp with time zone. CONS: Slower (But stills faster than other libraries that uses the regular Athena API) and does not handle nested types at all.


Valid encryption modes: [None, ‘SSE_S3’, ‘SSE_KMS’].

P.S. ‘CSE_KMS’ is not supported.


Create the default Athena bucket if it doesn’t exist and s3_output is None.

(E.g. s3://aws-athena-query-results-ACCOUNT-REGION/)


Batching (chunksize argument) (Memory Friendly):

Will anable the function to return a Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies on Wrangler:

  • If chunksize=True, a new DataFrame will be returned for each file in the query result.

  • If chunked=INTEGER, Wrangler will iterate on the data by number of rows igual the received INTEGER.

P.S. chunksize=True if faster and uses less memory while chunksize=INTEGER is more precise in number of rows for each Dataframe.


In case of use_threads=True the number of threads that will be spawned will be get from os.cpu_count().

  • sql (str) – SQL query.

  • database (str) – AWS Glue/Athena database name.

  • ctas_approach (bool) – Wraps the query using a CTAS, and read the resulted parquet data on S3. If false, read the regular CSV on S3.

  • categories (List[str], optional) – List of columns names that should be returned as pandas.Categorical. Recommended for memory restricted environments.

  • chunksize (Union[int, bool], optional) – If passed will split the data in a Iterable of DataFrames (Memory friendly). If True wrangler will iterate on the data by files in the most efficient way without guarantee of chunksize. If an INTEGER is passed Wrangler will iterate on the data by number of rows igual the received INTEGER.

  • s3_output (str, optional) – AWS S3 path.

  • workgroup (str, optional) – Athena workgroup.

  • encryption (str, optional) – Valid values: [None, ‘SSE_S3’, ‘SSE_KMS’]. Notice: ‘CSE_KMS’ is not supported.

  • kms_key (str, optional) – For SSE-KMS, this is the KMS key ARN or ID.

  • keep_files (bool) – Should Wrangler delete or keep the staging files produced by Athena?

  • ctas_temp_table_name (str, optional) – The name of the temporary table and also the directory name on S3 where the CTAS result is stored. If None, it will use the follow random pattern: f”temp_table_{pyarrow.compat.guid()}”. On S3 this directory will be under under the pattern: f”{s3_output}/{ctas_temp_table_name}/”.

  • use_threads (bool) – True to enable concurrent requests, False to disable multiple threads. If enabled os.cpu_count() will be used as the max number of threads.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receive None.


Pandas DataFrame or Generator of Pandas DataFrames if chunksize is passed.

Return type

Union[pd.DataFrame, Iterator[pd.DataFrame]]


>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(sql='...', database='...')