awswrangler.db.copy_files_to_redshift

awswrangler.db.copy_files_to_redshift(path: Union[str, List[str]], manifest_directory: str, con: sqlalchemy.engine.base.Engine, table: str, schema: str, iam_role: str, parquet_infer_sampling: float = 1.0, mode: str = 'append', diststyle: str = 'AUTO', distkey: Optional[str] = None, sortstyle: str = 'COMPOUND', sortkey: Optional[List[str]] = None, primary_keys: Optional[List[str]] = None, varchar_lengths_default: int = 256, varchar_lengths: Optional[Dict[str, int]] = None, use_threads: bool = True, boto3_session: Optional[boto3.session.Session] = None, s3_additional_kwargs: Optional[Dict[str, str]] = None) → None

Load Parquet files from S3 into a table on Amazon Redshift (through the COPY command).

https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

This function accepts Unix shell-style wildcards in the path argument: * (matches everything), ? (matches any single character), [seq] (matches any character in seq), and [!seq] (matches any character not in seq).
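For example, a wildcard prefix can restrict the load to matching objects only. The bucket, prefix and connection names below are illustrative:

>>> import awswrangler as wr
>>> wr.db.copy_files_to_redshift(
...     path="s3://bucket/prefix/part-*.parquet",
...     manifest_directory="s3://bucket/manifests/",
...     con=wr.catalog.get_engine(connection="my_glue_conn_name"),
...     table="my_table",
...     schema="public",
...     iam_role="arn:aws:iam::XXX:role/XXX"
... )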

Note

If the table does not exist yet, it will be automatically created for you, using the Parquet metadata to infer the column data types.

Note

If use_threads=True, the number of threads spawned will be determined by os.cpu_count().

Parameters
  • path (Union[str, List[str]]) – S3 prefix (accepts Unix shell-style wildcards) (e.g. s3://bucket/prefix) or list of S3 objects paths (e.g. [s3://bucket/key0, s3://bucket/key1]).

  • manifest_directory (str) – S3 prefix (e.g. s3://bucket/prefix)

  • con (sqlalchemy.engine.Engine) – SQLAlchemy Engine. Please use wr.db.get_engine(), wr.db.get_redshift_temp_engine() or wr.catalog.get_engine().

  • table (str) – Table name

  • schema (str) – Schema name

  • iam_role (str) – AWS IAM role with the related permissions.

  • parquet_infer_sampling (float) – Random sample ratio of files that will have their metadata inspected. Must satisfy 0.0 < sampling <= 1.0. Higher values are more accurate; lower values are faster.

  • mode (str) – Append, overwrite or upsert.

  • diststyle (str) – Redshift distribution styles. Must be in [“AUTO”, “EVEN”, “ALL”, “KEY”]. https://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html

  • distkey (str, optional) – Specifies a column name or positional number for the distribution key.

  • sortstyle (str) – Sorting can be “COMPOUND” or “INTERLEAVED”. https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html

  • sortkey (List[str], optional) – List of columns to be sorted.

  • primary_keys (List[str], optional) – Primary keys.

  • varchar_lengths_default (int) – The size that will be set for all VARCHAR columns not specified with varchar_lengths.

  • varchar_lengths (Dict[str, int], optional) – Dict of VARCHAR length by column. (e.g. {"col1": 10, "col5": 200}).

  • use_threads (bool) – True to enable concurrent requests, False to disable multiple threads. If enabled os.cpu_count() will be used as the max number of threads.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receives None.

  • s3_additional_kwargs (Dict[str, str], optional) – Forwarded to boto3.client('s3').put_object when writing the manifest; useful for server-side encryption.

Returns

None.

Return type

None

Examples

>>> import awswrangler as wr
>>> wr.db.copy_files_to_redshift(
...     path="s3://bucket/my_parquet_files/",
...     manifest_directory="s3://bucket/manifests/",
...     con=wr.catalog.get_engine(connection="my_glue_conn_name"),
...     table="my_table",
...     schema="public",
...     iam_role="arn:aws:iam::XXX:role/XXX"
... )