awswrangler.db.unload_redshift

awswrangler.db.unload_redshift(sql: str, path: str, con: sqlalchemy.engine.base.Engine, iam_role: str, region: Optional[str] = None, max_file_size: Optional[float] = None, kms_key_id: Optional[str] = None, categories: Optional[List[str]] = None, chunked: Union[bool, int] = False, keep_files: bool = False, use_threads: bool = True, boto3_session: Optional[boto3.session.Session] = None, s3_additional_kwargs: Optional[Dict[str, str]] = None) → Union[pandas.core.frame.DataFrame, Iterator[pandas.core.frame.DataFrame]]

Load a Pandas DataFrame from an Amazon Redshift query result, using Parquet files on S3 as a stage.

This is a HIGH latency and HIGH throughput alternative to wr.db.read_sql_query()/wr.db.read_sql_table() to extract large Amazon Redshift datasets into a Pandas DataFrame through the UNLOAD command.

This strategy has more overhead and requires more IAM privileges than the regular wr.db.read_sql_query()/wr.db.read_sql_table() functions, so it is only recommended for fetching 1+ million rows at once.

https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html

Note

Batching (chunked argument) (Memory Friendly):

Enables the function to return an Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies in Wrangler:

  • If chunked=True, a new DataFrame will be returned for each file in your path/dataset.

  • If chunked=INTEGER, Wrangler will iterate on the data by number of rows equal to the received INTEGER.

P.S. chunked=True is faster and uses less memory, while chunked=INTEGER is more precise in the number of rows for each DataFrame (see the sketch below).
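
A minimal sketch of the batching pattern (the table, bucket, connection, and role names below are placeholders):

>>> import awswrangler as wr
>>> dfs = wr.db.unload_redshift(
...     sql="SELECT * FROM public.mytable",
...     path="s3://bucket/stage/",
...     con=wr.catalog.get_engine(connection="my_glue_connection"),
...     iam_role="arn:aws:iam::XXX:role/XXX",
...     chunked=True
... )
>>> for df in dfs:  # one DataFrame per stage file
...     print(len(df.index))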

Note

In case of use_threads=True, the number of threads that will be spawned is obtained from os.cpu_count().

Parameters
  • sql (str) – SQL query.

  • path (str) – S3 path to write stage files (e.g. s3://bucket_name/any_name/).

  • con (sqlalchemy.engine.Engine) – SQLAlchemy Engine. Please use wr.db.get_engine(), wr.db.get_redshift_temp_engine() or wr.catalog.get_engine().

  • iam_role (str) – AWS IAM role with the related permissions.

  • region (str, optional) – Specifies the AWS Region where the target Amazon S3 bucket is located. REGION is required for UNLOAD to an Amazon S3 bucket that isn’t in the same AWS Region as the Amazon Redshift cluster. By default, UNLOAD assumes that the target Amazon S3 bucket is located in the same AWS Region as the Amazon Redshift cluster.

  • max_file_size (float, optional) – Specifies the maximum size (MB) of files that UNLOAD creates in Amazon S3. Specify a decimal value between 5.0 MB and 6200.0 MB. If None, the default maximum file size is 6200.0 MB.

  • kms_key_id (str, optional) – Specifies the key ID for an AWS Key Management Service (AWS KMS) key to be used to encrypt data files on Amazon S3.

  • categories (List[str], optional) – List of column names that should be returned as pandas.Categorical. Recommended for memory restricted environments.

  • chunked (Union[int, bool]) – If passed, will split the data into an Iterable of DataFrames (memory friendly). If True, Wrangler will iterate on the data by files in the most efficient way, without guaranteeing the chunksize. If an INTEGER is passed, Wrangler will iterate on the data by number of rows equal to the received INTEGER.

  • keep_files (bool) – Whether to keep the stage files on Amazon S3.

  • use_threads (bool) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() will be used as the max number of threads.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receives None.

  • s3_additional_kwargs – Forwarded to s3fs; useful for server-side encryption. https://s3fs.readthedocs.io/en/latest/#serverside-encryption

Returns

Result as Pandas DataFrame(s).

Return type

Union[pandas.DataFrame, Iterator[pandas.DataFrame]]

Examples

>>> import awswrangler as wr
>>> df = wr.db.unload_redshift(
...     sql="SELECT * FROM public.mytable",
...     path="s3://bucket/extracted_parquet_files/",
...     con=wr.catalog.get_engine(connection="my_glue_connection"),
...     iam_role="arn:aws:iam::XXX:role/XXX"
... )
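
A sketch combining the optional tuning parameters; the region, file size, KMS key, and chunk size values here are illustrative placeholders, not defaults to copy:

>>> import awswrangler as wr
>>> dfs = wr.db.unload_redshift(
...     sql="SELECT * FROM public.mytable",
...     path="s3://bucket/extracted_parquet_files/",
...     con=wr.catalog.get_engine(connection="my_glue_connection"),
...     iam_role="arn:aws:iam::XXX:role/XXX",
...     region="us-east-2",           # bucket in a different Region than the cluster
...     max_file_size=512.0,          # cap each stage file at ~512 MB
...     kms_key_id="XXX",             # encrypt stage files with this KMS key
...     chunked=100000,               # iterate in DataFrames of ~100k rows
...     keep_files=False              # delete stage files after reading
... )
>>> for df in dfs:
...     print(df.shape)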