API Reference

Amazon S3

copy_objects(paths, source_path, target_path)

Copy a list of S3 objects to another S3 directory.

delete_objects(path[, use_threads, …])

Delete Amazon S3 objects from a received S3 prefix or list of S3 object paths.

describe_objects(path[, wait_time, …])

Describe Amazon S3 objects from a received S3 prefix or list of S3 object paths.

does_object_exist(path[, boto3_session])

Check if an object exists on Amazon S3.

get_bucket_region(bucket[, boto3_session])

Get bucket region name.

list_directories(path[, boto3_session])

List Amazon S3 directories from a prefix.

list_objects(path[, suffix, boto3_session])

List Amazon S3 objects from a prefix.

merge_datasets(source_path, target_path[, …])

Merge a source dataset into a target dataset.

read_csv(path[, use_threads, boto3_session, …])

Read CSV file(s) from a received S3 prefix or list of S3 object paths.

read_fwf(path[, use_threads, boto3_session, …])

Read fixed-width formatted file(s) from a received S3 prefix or list of S3 object paths.

read_json(path[, use_threads, …])

Read JSON file(s) from a received S3 prefix or list of S3 object paths.

read_parquet(path[, filters, columns, …])

Read Apache Parquet file(s) from a received S3 prefix or list of S3 object paths.

read_parquet_metadata(path[, dtype, …])

Read Apache Parquet file(s) metadata from a received S3 prefix or list of S3 object paths.

read_parquet_table(table, database[, …])

Read an Apache Parquet table registered in the AWS Glue Catalog.

size_objects(path[, wait_time, use_threads, …])

Get the size (ContentLength) in bytes of Amazon S3 objects from a received S3 prefix or list of S3 object paths.

store_parquet_metadata(path, database, table)

Infer and store Parquet metadata in the AWS Glue Catalog.

to_csv(df, path[, sep, index, columns, …])

Write a CSV file or dataset to Amazon S3.

to_json(df, path[, boto3_session, …])

Write a JSON file to Amazon S3.

to_parquet(df, path[, index, compression, …])

Write a Parquet file or dataset to Amazon S3.

wait_objects_exist(paths[, delay, …])

Wait for Amazon S3 objects to exist.

wait_objects_not_exist(paths[, delay, …])

Wait for Amazon S3 objects to no longer exist.
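
A minimal usage sketch for the Amazon S3 functions above, assuming the library is imported as wr (the awswrangler convention); the bucket name and key prefix are placeholders.

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "value": ["foo", "bar"]})

    # Write the DataFrame as a Parquet dataset, then read it back.
    wr.s3.to_parquet(df=df, path="s3://my-bucket/dataset/", dataset=True)
    df2 = wr.s3.read_parquet(path="s3://my-bucket/dataset/", dataset=True)

    # List the written objects and check their sizes.
    paths = wr.s3.list_objects("s3://my-bucket/dataset/")
    sizes = wr.s3.size_objects(path=paths)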

AWS Glue Catalog

add_csv_partitions(database, table, …[, …])

Add partitions (metadata) to a CSV Table in the AWS Glue Catalog.

add_parquet_partitions(database, table, …)

Add partitions (metadata) to a Parquet Table in the AWS Glue Catalog.

create_csv_table(database, table, path, …)

Create a CSV Table (Metadata Only) in the AWS Glue Catalog.

create_parquet_table(database, table, path, …)

Create a Parquet Table (Metadata Only) in the AWS Glue Catalog.

databases([limit, catalog_id, boto3_session])

Get a Pandas DataFrame with all listed databases.

delete_table_if_exists(database, table[, …])

Delete a Glue table if it exists.

does_table_exist(database, table[, …])

Check if the table exists.

drop_duplicated_columns(df)

Drop all repeated columns (duplicated names).

extract_athena_types(df[, index, …])

Extract column and partition types (Amazon Athena) from a Pandas DataFrame.

get_columns_comments(database, table[, …])

Get all column comments.

get_csv_partitions(database, table[, …])

Get all partitions from a Table in the AWS Glue Catalog.

get_databases([catalog_id, boto3_session])

Get an iterator of databases.

get_engine(connection[, catalog_id, …])

Return a SQLAlchemy Engine from a Glue Catalog Connection.

get_parquet_partitions(database, table[, …])

Get all partitions from a Table in the AWS Glue Catalog.

get_table_description(database, table[, …])

Get table description.

get_table_location(database, table[, …])

Get a table’s location in the Glue Catalog.

get_table_parameters(database, table[, …])

Get all table parameters.

get_table_types(database, table[, boto3_session])

Get all columns and types from a table.

get_tables([catalog_id, database, …])

Get an iterator of tables.

overwrite_table_parameters(parameters, …)

Overwrite all existing table parameters.

sanitize_column_name(column)

Convert the column name to be compatible with Amazon Athena.

sanitize_dataframe_columns_names(df)

Normalize all column names to be compatible with Amazon Athena.

sanitize_table_name(table)

Convert the table name to be compatible with Amazon Athena.

search_tables(text[, catalog_id, boto3_session])

Get Pandas DataFrame of tables filtered by a search string.

table(database, table[, catalog_id, …])

Get table details as Pandas DataFrame.

tables([limit, catalog_id, database, …])

Get a DataFrame with tables filtered by a search term, prefix, or suffix.

upsert_table_parameters(parameters, …[, …])

Insert or update the received parameters.
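
A short sketch of browsing metadata with the AWS Glue Catalog functions above; the database and table names ("my_db", "my_table") are placeholders.

    import awswrangler as wr

    # Catalog overviews as Pandas DataFrames.
    databases = wr.catalog.databases()
    tables = wr.catalog.tables(database="my_db")

    # Inspect a single table if it exists.
    if wr.catalog.does_table_exist(database="my_db", table="my_table"):
        dtypes = wr.catalog.get_table_types(database="my_db", table="my_table")
        location = wr.catalog.get_table_location(database="my_db", table="my_table")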

Amazon Athena

create_athena_bucket([boto3_session])

Create the default Athena bucket if it doesn’t exist.

get_query_columns_types(query_execution_id)

Get the data type of all columns queried.

get_work_group(workgroup[, boto3_session])

Return information about the workgroup with the specified name.

read_sql_query(sql, database[, …])

Execute any SQL query on AWS Athena and return the results as a Pandas DataFrame.

read_sql_table(table, database[, …])

Extract a full table from AWS Athena and return the results as a Pandas DataFrame.

repair_table(table[, database, s3_output, …])

Run Hive’s metastore consistency check: ‘MSCK REPAIR TABLE table;’.

start_query_execution(sql[, database, …])

Start a SQL Query against AWS Athena.

stop_query_execution(query_execution_id[, …])

Stop a query execution.

wait_query(query_execution_id[, boto3_session])

Wait for the query to end.
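
A minimal sketch of the Amazon Athena functions above, assuming a database "my_db" containing a table "my_table" (both placeholders).

    import awswrangler as wr

    # High-level: run a query and get the result set as a DataFrame.
    df = wr.athena.read_sql_query("SELECT * FROM my_table LIMIT 10", database="my_db")

    # Lower-level: start a query, then block until it finishes.
    query_id = wr.athena.start_query_execution("MSCK REPAIR TABLE my_table", database="my_db")
    wr.athena.wait_query(query_execution_id=query_id)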

Databases (Redshift, PostgreSQL, MySQL)

copy_files_to_redshift(path, …[, mode, …])

Load Parquet files from S3 into a table on Amazon Redshift (through the COPY command).

copy_to_redshift(df, path, con, table, …)

Load a Pandas DataFrame as a table on Amazon Redshift, using Parquet files on S3 as a staging area.

get_engine(db_type, host, port, database, …)

Return a SQLAlchemy Engine from the given arguments.

get_redshift_temp_engine(cluster_identifier, …)

Get a SQLAlchemy Engine for Amazon Redshift using temporary credentials.

read_sql_query(sql, con[, index_col, …])

Return a DataFrame corresponding to the result set of the query string.

read_sql_table(table, con[, schema, …])

Return a DataFrame corresponding to the contents of the given table.

to_sql(df, con, **pandas_kwargs)

Write records stored in a DataFrame to a SQL database.

unload_redshift(sql, path, con, iam_role[, …])

Load a Pandas DataFrame from an Amazon Redshift query result, using Parquet files on S3 as a staging area.

unload_redshift_to_files(sql, path, con, …)

Unload an Amazon Redshift query result to Parquet files on S3 (through the UNLOAD command).

write_redshift_copy_manifest(manifest_path, …)

Write Redshift copy manifest and return its structure.
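
A minimal sketch of the database functions above; the host, database, table, and credential values are placeholders (in practice they would come from your own configuration or a Glue Catalog connection).

    import awswrangler as wr

    # Build a SQLAlchemy Engine from explicit connection arguments.
    engine = wr.db.get_engine(db_type="postgresql", host="my-host", port=5432,
                              database="my_db", user="my_user", password="my_password")

    # Query into a DataFrame, then write it back to another table.
    df = wr.db.read_sql_query("SELECT * FROM public.my_table", con=engine)
    wr.db.to_sql(df=df, con=engine, name="my_table_copy", schema="public",
                 if_exists="replace", index=False)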

EMR

build_spark_step(path[, deploy_mode, …])

Build the Step structure (dictionary).

build_step(command[, name, …])

Build the Step structure (dictionary).

create_cluster(subnet_id[, cluster_name, …])

Create an EMR cluster with an instance fleets configuration.

get_cluster_state(cluster_id[, boto3_session])

Get the EMR cluster state.

get_step_state(cluster_id, step_id[, …])

Get EMR step state.

submit_ecr_credentials_refresh(cluster_id, path)

Update internal ECR credentials.

submit_spark_step(cluster_id, path[, …])

Submit Spark Step.

submit_step(cluster_id, command[, name, …])

Submit a new job step to the EMR cluster.

submit_steps(cluster_id, steps[, boto3_session])

Submit a list of steps.

terminate_cluster(cluster_id[, boto3_session])

Terminate an EMR cluster.
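
A short sketch of the EMR functions above; the subnet ID, the S3 script path, and the step name are placeholders.

    import awswrangler as wr

    # Create a cluster, submit a step, check its state, and tear the cluster down.
    cluster_id = wr.emr.create_cluster(subnet_id="subnet-0123456789abcdef0")
    step_id = wr.emr.submit_step(cluster_id=cluster_id,
                                 command="spark-submit s3://my-bucket/jobs/etl.py",
                                 name="my-etl-step")
    state = wr.emr.get_step_state(cluster_id=cluster_id, step_id=step_id)
    wr.emr.terminate_cluster(cluster_id=cluster_id)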

CloudWatch Logs

read_logs(query, log_group_names[, …])

Run a query against AWS CloudWatch Logs Insights and convert the results to a Pandas DataFrame.

run_query(query, log_group_names[, …])

Run a query against AWS CloudWatch Logs Insights and wait for the results.

start_query(query, log_group_names[, …])

Start a query against AWS CloudWatch Logs Insights.

wait_query(query_id[, boto3_session])

Wait for the query to end.
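
A minimal sketch of the CloudWatch Logs functions above; the log group name is a placeholder.

    import awswrangler as wr

    # Run a Logs Insights query and get the results as a DataFrame.
    df = wr.cloudwatch.read_logs(
        query="fields @timestamp, @message | sort @timestamp desc | limit 20",
        log_group_names=["/aws/lambda/my-function"],
    )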