- awswrangler.redshift.unload(sql: str, path: str, con: redshift_connector.core.Connection, iam_role: Optional[str] = None, aws_access_key_id: Optional[str] = None, aws_secret_access_key: Optional[str] = None, aws_session_token: Optional[str] = None, region: Optional[str] = None, max_file_size: Optional[float] = None, kms_key_id: Optional[str] = None, categories: Optional[List[str]] = None, chunked: Union[bool, int] = False, keep_files: bool = False, use_threads: bool = True, boto3_session: Optional[boto3.session.Session] = None, s3_additional_kwargs: Optional[Dict[str, str]] = None) → Union[pandas.core.frame.DataFrame, Iterator[pandas.core.frame.DataFrame]]
Load a Pandas DataFrame from an Amazon Redshift query result, using Parquet files on S3 as a stage.
This is a HIGH latency and HIGH throughput alternative to wr.redshift.read_sql_query()/wr.redshift.read_sql_table() for extracting large amounts of Amazon Redshift data into Pandas DataFrames through the UNLOAD command.
This strategy has more overhead and requires more IAM privileges than the regular wr.redshift.read_sql_query()/wr.redshift.read_sql_table() functions, so it is only recommended when fetching 1k+ rows at once.
Batching (chunked argument) (Memory Friendly):
Will enable the function to return an Iterable of DataFrames instead of a regular DataFrame.
There are two batching strategies on Wrangler:
If chunked=True, a new DataFrame will be returned for each file in your path/dataset.
If chunked=INTEGER, Wrangler will iterate on the data by number of rows (equal to the received INTEGER).
P.S. chunked=True is faster and uses less memory, while chunked=INTEGER is more precise in the number of rows for each DataFrame.
If use_threads=True, the number of threads spawned is taken from os.cpu_count().
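The difference between the two batching strategies can be illustrated with a minimal pure-Python sketch. It is not Wrangler's implementation: the names `chunks_by_file` and `chunks_by_rows` are hypothetical, plain lists of tuples stand in for the staged Parquet files, and it assumes that row-based chunking carries rows across file boundaries.

```python
from itertools import islice
from typing import Iterator, List, Tuple

Row = Tuple[int, ...]


def chunks_by_file(files: List[List[Row]]) -> Iterator[List[Row]]:
    """Analogue of chunked=True: one chunk per stage file, sizes vary."""
    for rows in files:
        yield list(rows)


def chunks_by_rows(files: List[List[Row]], n: int) -> Iterator[List[Row]]:
    """Analogue of chunked=INTEGER: re-slice all rows into chunks of n rows.

    Assumption: rows are carried across file boundaries, so every chunk
    except possibly the last has exactly n rows.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    stream = (row for rows in files for row in rows)
    while True:
        chunk = list(islice(stream, n))
        if not chunk:
            return
        yield chunk


# Three hypothetical stage files holding 3, 1 and 2 rows.
files = [[(1,), (2,), (3,)], [(4,)], [(5,), (6,)]]

by_file_sizes = [len(c) for c in chunks_by_file(files)]
by_row_sizes = [len(c) for c in chunks_by_rows(files, 4)]
print(by_file_sizes)  # chunk sizes follow the file layout
print(by_row_sizes)   # chunk sizes follow the requested row count
```

With Wrangler itself, the same pattern would be a loop such as `for df in wr.redshift.unload(..., chunked=True):`, where each `df` is one DataFrame of the returned iterator.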
sql (str) – SQL query.
path (str) – S3 path to write stage files (e.g. s3://bucket_name/any_name/).
con (redshift_connector.Connection) – Use redshift_connector.connect() to use credentials directly or wr.redshift.connect() to fetch them from the Glue Catalog.
iam_role (str, optional) – AWS IAM role with the related permissions.
aws_access_key_id (str, optional) – The access key for your AWS account.
aws_secret_access_key (str, optional) – The secret key for your AWS account.
aws_session_token (str, optional) – The session key for your AWS account. This is only needed when you are using temporary credentials.
region (str, optional) – Specifies the AWS Region where the target Amazon S3 bucket is located. REGION is required for UNLOAD to an Amazon S3 bucket that isn’t in the same AWS Region as the Amazon Redshift cluster. By default, UNLOAD assumes that the target Amazon S3 bucket is located in the same AWS Region as the Amazon Redshift cluster.
max_file_size (float, optional) – Specifies the maximum size (MB) of files that UNLOAD creates in Amazon S3. Specify a decimal value between 5.0 MB and 6200.0 MB. If None, the default maximum file size is 6200.0 MB.
kms_key_id (str, optional) – Specifies the key ID for an AWS Key Management Service (AWS KMS) key to be used to encrypt data files on Amazon S3.
categories (List[str], optional) – List of column names that should be returned as pandas.Categorical. Recommended for memory-restricted environments.
keep_files (bool) – True to keep the stage files on S3 after they have been read, False to delete them.
chunked (Union[int, bool]) – If passed, the data is split into an Iterable of DataFrames (Memory friendly). If True, Wrangler will iterate on the data by files in the most efficient way, without any guarantee on chunk size. If an INTEGER is passed, Wrangler will iterate on the data by number of rows equal to the received INTEGER.
use_threads (bool) – True to enable concurrent requests, False to disable multiple threads. If enabled os.cpu_count() will be used as the max number of threads.
boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receives None.
s3_additional_kwargs (Dict[str, str], optional) – Forwarded to botocore requests; only the “SSECustomerAlgorithm” and “SSECustomerKey” arguments will be considered.
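Several of the parameters above (region, max_file_size, kms_key_id) correspond to options of the Redshift UNLOAD command. The following is a rough sketch of how such a statement could be assembled as text. `build_unload_sql` is a hypothetical helper written for illustration, not part of awswrangler, and the statement Wrangler actually issues may differ. It assumes authorization through an IAM role only.

```python
from typing import Optional


def build_unload_sql(
    sql: str,
    path: str,
    iam_role: str,
    region: Optional[str] = None,
    max_file_size: Optional[float] = None,
    kms_key_id: Optional[str] = None,
) -> str:
    """Hypothetical helper: assemble a Redshift UNLOAD statement as text."""
    # Range taken from the max_file_size description above (values in MB).
    if max_file_size is not None and not 5.0 <= max_file_size <= 6200.0:
        raise ValueError("max_file_size must be between 5.0 and 6200.0 MB")

    # UNLOAD takes the query as a quoted string literal, so single quotes
    # inside the query are doubled.
    escaped_sql = sql.replace("'", "''")

    clauses = [
        f"UNLOAD ('{escaped_sql}')",
        f"TO '{path}'",
        f"IAM_ROLE '{iam_role}'",
        "FORMAT PARQUET",
    ]
    if region is not None:
        clauses.append(f"REGION '{region}'")
    if max_file_size is not None:
        clauses.append(f"MAXFILESIZE {max_file_size} MB")
    if kms_key_id is not None:
        clauses.append(f"KMS_KEY_ID '{kms_key_id}' ENCRYPTED")
    return "\n".join(clauses) + ";"


statement = build_unload_sql(
    sql="SELECT * FROM public.mytable WHERE state = 'NV'",
    path="s3://bucket/extracted_parquet_files/",
    iam_role="arn:aws:iam::XXX:role/XXX",
    max_file_size=100.0,
)
print(statement)
```

Options left as None are simply omitted, which matches the documented behavior that UNLOAD then falls back to its own defaults (same region as the cluster, 6200.0 MB maximum file size).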
Result as Pandas DataFrame(s).
- Return type
Union[pandas.DataFrame, Iterator[pandas.DataFrame]]
>>> import awswrangler as wr
>>> con = wr.redshift.connect("MY_GLUE_CONNECTION")
>>> df = wr.redshift.unload(
...     sql="SELECT * FROM public.mytable",
...     path="s3://bucket/extracted_parquet_files/",
...     con=con,
...     iam_role="arn:aws:iam::XXX:role/XXX"
... )
>>> con.close()