Quick Start¶
>>> pip install awswrangler
import awswrangler as wr
import pandas as pd
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)
# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)
# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")
# Getting a Redshift connection (SQLAlchemy) from the Glue Catalog and retrieving data from Redshift Spectrum
engine = wr.catalog.get_engine("my-redshift-connection")
df = wr.db.read_sql_query("SELECT * FROM external_schema.my_table", con=engine)
# Getting a MySQL connection (SQLAlchemy) from the Glue Catalog and loading the data into MySQL
engine = wr.catalog.get_engine("my-mysql-connection")
wr.db.to_sql(df, engine, schema="test", name="my_table")
# Getting a PostgreSQL connection (SQLAlchemy) from the Glue Catalog and loading the data into PostgreSQL
engine = wr.catalog.get_engine("my-postgresql-connection")
wr.db.to_sql(df, engine, schema="test", name="my_table")
Read The Docs¶
What is AWS Data Wrangler?¶
An open-source Python package that extends the power of the Pandas library to AWS, connecting DataFrames and AWS data-related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, Amazon QuickSight, etc.).
Built on top of other open-source projects like Pandas, Apache Arrow, Boto3, s3fs, SQLAlchemy, Psycopg2 and PyMySQL, it offers abstracted functions to execute common ETL tasks like loading/unloading data from Data Lakes, Data Warehouses and Databases.
Check our tutorials or the list of functionalities.
Install¶
AWS Data Wrangler runs with Python 3.6, 3.7 and 3.8, and on several platforms (AWS Lambda, AWS Glue Python Shell, EMR, EC2, on-premises, Amazon SageMaker, local, etc).
Some good practices for most of the methods below are:
- Use new and individual virtual environments for each project (venv).
- On notebooks, always restart your kernel after installations.
PyPI (pip)¶
>>> pip install awswrangler
Conda¶
>>> conda install -c conda-forge awswrangler
AWS Lambda Layer¶
1 - Go to GitHub’s release section and download the layer zip related to the desired version.
2 - Go to the AWS Lambda Panel, open the layer section (left side) and click create layer.
3 - Set the name and the Python version, upload your freshly downloaded zip file and press create to create the layer.
4 - Go to your Lambda and select your new layer!
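With the layer attached, the package can be imported directly in your function code. Below is a minimal sketch of a handler; the database, table and returned payload are hypothetical placeholders.
import awswrangler as wr  # provided by the attached Lambda layer


def handler(event, context):
    # Hypothetical example: query an existing Athena table and return the row count.
    df = wr.athena.read_sql_query(
        "SELECT * FROM my_table LIMIT 10",
        database="my_db",
    )
    return {"rows": len(df)}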
AWS Glue Wheel¶
Note
AWS Data Wrangler has compiled dependencies (C/C++), so there is only support for Glue Python Shell, not for Glue PySpark.
1 - Go to GitHub’s release page and download the wheel file (.whl) related to the desired version.
2 - Upload the wheel file to any Amazon S3 location.
3 - Go to your Glue Python Shell job and point to the new file on S3.
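With the wheel referenced by the job, the import works like any other dependency. A minimal sketch of a Glue Python Shell script, using hypothetical database, table and bucket names:
import awswrangler as wr  # resolved from the wheel file referenced by the job

# Hypothetical example: consolidate a table registered in the Glue Catalog into a single report file.
df = wr.s3.read_parquet_table(database="my_db", table="my_table")
wr.s3.to_csv(df, path="s3://my-bucket/reports/my_table.csv")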
Amazon SageMaker Notebook¶
Run this command in any Python 3 notebook cell and then make sure to restart the kernel before importing the awswrangler package.
>>> !pip install awswrangler
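After restarting the kernel, a quick sanity check can confirm the installation; this sketch assumes the package exposes a version string attribute.
import awswrangler as wr

print(wr.__version__)  # confirms the package is importable after the kernel restart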
Amazon SageMaker Notebook Lifecycle¶
Open the SageMaker console, go to the lifecycle configuration section and use the following snippet to configure AWS Data Wrangler for all compatible SageMaker kernels (Reference).
#!/bin/bash
set -e
# OVERVIEW
# This script installs a single pip package in all SageMaker conda environments, apart from the JupyterSystemEnv which
# is a system environment reserved for Jupyter.
# Note this may timeout if the package installations in all environments take longer than 5 mins, consider using
# "nohup" to run this as a background process in that case.
sudo -u ec2-user -i <<'EOF'

# PARAMETERS
PACKAGE=awswrangler

# Note that "base" is a special environment name, include it there as well.
for env in base /home/ec2-user/anaconda3/envs/*; do
    env_name=$(basename "$env")
    # Skip the system environment reserved for Jupyter.
    if [ "$env_name" = 'JupyterSystemEnv' ]; then
        continue
    fi
    source /home/ec2-user/anaconda3/bin/activate "$env_name"
    nohup pip install --upgrade "$PACKAGE" &
    source /home/ec2-user/anaconda3/bin/deactivate
done

EOF
EMR Cluster¶
Even though it is not a distributed library, AWS Data Wrangler can be a good helper to complement Big Data pipelines.
Configure Python 3 as the default interpreter for PySpark under your cluster configuration:
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
Keep the following bootstrap script on S3 and reference it on your cluster.
#!/usr/bin/env bash
set -ex

sudo pip-3.6 install awswrangler
Note
Make sure to pin the Wrangler version in the bootstrap script for production environments (e.g. awswrangler==1.0.0).
From Source¶
>>> git clone https://github.com/awslabs/aws-data-wrangler.git
>>> cd aws-data-wrangler
>>> pip install .
API Reference¶
Amazon S3¶
- Copy a list of S3 objects to another S3 directory.
- Delete Amazon S3 objects from a received S3 prefix or list of S3 object paths.
- Describe Amazon S3 objects from a received S3 prefix or list of S3 object paths.
- Check if an object exists on S3.
- Get the bucket region name.
- List Amazon S3 objects from a prefix.
- List Amazon S3 objects from a prefix.
- Merge a source dataset into a target dataset.
- Read CSV file(s) from a received S3 prefix or list of S3 object paths.
- Read fixed-width formatted file(s) from a received S3 prefix or list of S3 object paths.
- Read JSON file(s) from a received S3 prefix or list of S3 object paths.
- Read Apache Parquet file(s) from a received S3 prefix or list of S3 object paths.
- Read Apache Parquet file(s) metadata from a received S3 prefix or list of S3 object paths.
- Read an Apache Parquet table registered on the AWS Glue Catalog.
- Get the size (ContentLength) in bytes of Amazon S3 objects from a received S3 prefix or list of S3 object paths.
- Infer and store Parquet metadata on the AWS Glue Catalog.
- Write a CSV file or dataset on Amazon S3.
- Write a JSON file on Amazon S3.
- Write a Parquet file or dataset on Amazon S3.
- Wait until Amazon S3 objects exist.
- Wait until Amazon S3 objects no longer exist.
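A minimal sketch exercising a few of these calls; the bucket and key below are hypothetical placeholders.
import awswrangler as wr

# Hypothetical paths used for illustration only.
path = "s3://my-bucket/dataset/"

if wr.s3.does_object_exist("s3://my-bucket/dataset/file0.parquet"):
    print(wr.s3.size_objects(path))  # sizes in bytes, keyed by object path
    print(wr.s3.list_objects(path))  # every object under the prefix

# Remove everything under the prefix once it is no longer needed.
wr.s3.delete_objects(path)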
AWS Glue Catalog¶
- Add partitions (metadata) to a CSV table in the AWS Glue Catalog.
- Add partitions (metadata) to a Parquet table in the AWS Glue Catalog.
- Create a CSV table (metadata only) in the AWS Glue Catalog.
- Create a database in the AWS Glue Catalog.
- Create a Parquet table (metadata only) in the AWS Glue Catalog.
- Get a Pandas DataFrame with all listed databases.
- Create a database in the AWS Glue Catalog.
- Delete a Glue table if it exists.
- Check if the table exists.
- Drop all repeated columns (duplicated names).
- Extract columns and partitions types (Amazon Athena) from a Pandas DataFrame.
- Get all column comments.
- Get all partitions from a table in the AWS Glue Catalog.
- Get an iterator of databases.
- Return a SQLAlchemy Engine from a Glue Catalog connection.
- Get all partitions from a table in the AWS Glue Catalog.
- Get the table description.
- Get the table's location on the Glue Catalog.
- Get all parameters.
- Get all columns and types from a table.
- Get an iterator of tables.
- Overwrite all existing parameters.
- Convert a column name to be compatible with Amazon Athena.
- Normalize all column names to be compatible with Amazon Athena.
- Convert a table name to be compatible with Amazon Athena.
- Get a Pandas DataFrame of tables filtered by a search string.
- Get table details as a Pandas DataFrame.
- Get a DataFrame with tables filtered by a search term, prefix or suffix.
- Insert or update the received parameters.
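A minimal sketch of browsing the catalog; the database and table names are hypothetical placeholders.
import awswrangler as wr

# List every database registered in the Glue Catalog as a DataFrame.
print(wr.catalog.databases())

# List the tables of a (hypothetical) database and check whether one of them exists.
print(wr.catalog.tables(database="my_db"))
print(wr.catalog.does_table_exist(database="my_db", table="my_table"))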
Amazon Athena¶
- Create the default Athena bucket if it doesn't exist.
- Get the data type of all columns queried.
- Return information about the workgroup with the specified name.
- Execute any SQL query on AWS Athena and return the results as a Pandas DataFrame.
- Extract the full table from AWS Athena and return the results as a Pandas DataFrame.
- Run Hive's metastore consistency check: 'MSCK REPAIR TABLE table;'.
- Start a SQL query against AWS Athena.
- Stop a query execution.
- Wait for the query to end.
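Beyond the read_sql_query call shown in the Quick Start, here is a minimal sketch of repairing partitions and running an asynchronous query; the database and table names are hypothetical placeholders.
import awswrangler as wr

# Register any new partitions of a (hypothetical) table with the metastore.
wr.athena.repair_table(table="my_table", database="my_db")

# Fire an asynchronous query and wait for it to finish.
query_id = wr.athena.start_query_execution("SELECT COUNT(*) FROM my_table", database="my_db")
wr.athena.wait_query(query_execution_id=query_id)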
Databases (Amazon Redshift, PostgreSQL, MySQL)¶
- Load Parquet files from S3 to a table on Amazon Redshift (through the COPY command).
- Load a Pandas DataFrame as a table on Amazon Redshift using Parquet files on S3 as a stage.
- Return a SQLAlchemy Engine from the given arguments.
- Get Glue connection details.
- Return a DataFrame corresponding to the result set of the query string.
- Return a DataFrame corresponding to the result set of the query string.
- Write records stored in a DataFrame to a SQL database.
- Load a Pandas DataFrame from an Amazon Redshift query result using Parquet files on S3 as a stage.
- Unload an Amazon Redshift query result to Parquet files on S3 (through the UNLOAD command).
- Write a Redshift copy manifest and return its structure.
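A minimal sketch building the engine directly instead of through the Glue Catalog; all connection details below are hypothetical placeholders, and the keyword arguments shown for the engine builder are assumptions.
import awswrangler as wr
import pandas as pd

# Hypothetical PostgreSQL connection details, built directly instead of via the Glue Catalog.
engine = wr.db.get_engine(
    db_type="postgresql",
    host="my-host.example.com",
    port=5432,
    database="test",
    user="my_user",
    password="my_password",
)

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
wr.db.to_sql(df, engine, schema="public", name="my_table", if_exists="replace", index=False)
print(wr.db.read_sql_query("SELECT * FROM public.my_table", con=engine))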
Amazon EMR¶
- Build the Step structure (dictionary).
- Build the Step structure (dictionary).
- Create an EMR cluster with instance fleets configuration.
- Get the EMR cluster state.
- Get the EMR step state.
- Update internal ECR credentials.
- Submit a Spark step.
- Submit a new job in the EMR cluster.
- Submit a list of steps.
- Terminate an EMR cluster.
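A minimal sketch of submitting and monitoring a step; the cluster ID and command are hypothetical, and the submit_step arguments shown are assumptions about its signature.
import awswrangler as wr

cluster_id = "j-XXXXXXXXXXXXX"  # hypothetical EMR cluster ID

# Submit a command as a step and poll its state (arguments assumed).
step_id = wr.emr.submit_step(cluster_id=cluster_id, command="spark-submit s3://my-bucket/jobs/job.py")
print(wr.emr.get_step_state(cluster_id=cluster_id, step_id=step_id))

# Terminate the cluster when the pipeline is done.
wr.emr.terminate_cluster(cluster_id=cluster_id)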
Amazon CloudWatch Logs¶
- Run a query against AWS CloudWatch Logs Insights and convert the results to a Pandas DataFrame.
- Run a query against AWS CloudWatch Logs Insights and wait for the results.
- Run a query against AWS CloudWatch Logs Insights.
- Wait until the query ends.
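A minimal sketch of an Insights query returned as a DataFrame; the log group name is a hypothetical placeholder and the parameter names are assumptions.
import awswrangler as wr

# Hypothetical log group, queried with CloudWatch Logs Insights syntax.
df = wr.cloudwatch.read_logs(
    query="fields @timestamp, @message | sort @timestamp desc | limit 20",
    log_group_names=["/aws/lambda/my-function"],
)
print(df)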
Amazon QuickSight¶
- Cancel an ongoing ingestion of data into SPICE.
- Create a QuickSight data source pointing to an Athena workgroup.
- Create a QuickSight dataset.
- Create and start a new SPICE ingestion on a dataset.
- Delete all dashboards.
- Delete all data sources.
- Delete all datasets.
- Delete all templates.
- Delete a dashboard.
- Delete a data source.
- Delete a dataset.
- Delete a template.
- Describe a QuickSight dashboard by name or ID.
- Describe a QuickSight data source by name or ID.
- Describe a QuickSight data source's permissions by name or ID.
- Describe a QuickSight dataset by name or ID.
- Describe a QuickSight ingestion by ID.
- Get a QuickSight dashboard ID given a name, failing if more than one ID is associated with the name.
- Get QuickSight dashboard IDs given a name.
- Get a QuickSight data source ARN given a name, failing if more than one ARN is associated with the name.
- Get QuickSight data source ARNs given a name.
- Get a QuickSight data source ID given a name, failing if more than one ID is associated with the name.
- Get QuickSight data source IDs given a name.
- Get a QuickSight dataset ID given a name, failing if more than one ID is associated with the name.
- Get QuickSight dataset IDs given a name.
- Get a QuickSight template ID given a name, failing if more than one ID is associated with the name.
- Get QuickSight template IDs given a name.
- List dashboards in an AWS account.
- List all QuickSight data source summaries.
- List all QuickSight dataset summaries.
- List all QuickSight groups.
- List all QuickSight group memberships.
- List IAM policy assignments in the current Amazon QuickSight account.
- List all the IAM policy assignments.
- List the history of SPICE ingestions for a dataset.
- List all QuickSight templates.
- Return a list of all of the Amazon QuickSight users belonging to this account.
- List the Amazon QuickSight groups that an Amazon QuickSight user is a member of.
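A minimal sketch that enumerates existing QuickSight resources; the helper names used below are assumptions based on the descriptions above.
import awswrangler as wr

# Enumerate existing QuickSight resources in the account (helper names assumed).
print(wr.quicksight.list_dashboards())
print(wr.quicksight.list_datasets())
print(wr.quicksight.list_data_sources())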
AWS STS¶
- Get Account ID.
- Get current user/role ARN.
- Get current user/role name.
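A minimal sketch of the identity helpers; the function names are assumptions matching the descriptions above.
import awswrangler as wr

# Identify which AWS account and principal the current credentials resolve to.
print(wr.sts.get_account_id())
print(wr.sts.get_current_identity_arn())
print(wr.sts.get_current_identity_name())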