awswrangler.emr.create_cluster

awswrangler.emr.create_cluster(cluster_name: str, logging_s3_path: str, emr_release: str, subnet_id: str, emr_ec2_role: str, emr_role: str, instance_type_master: str, instance_type_core: str, instance_type_task: str, instance_ebs_size_master: int, instance_ebs_size_core: int, instance_ebs_size_task: int, instance_num_on_demand_master: int, instance_num_on_demand_core: int, instance_num_on_demand_task: int, instance_num_spot_master: int, instance_num_spot_core: int, instance_num_spot_task: int, spot_bid_percentage_of_on_demand_master: int, spot_bid_percentage_of_on_demand_core: int, spot_bid_percentage_of_on_demand_task: int, spot_provisioning_timeout_master: int, spot_provisioning_timeout_core: int, spot_provisioning_timeout_task: int, spot_timeout_to_on_demand_master: bool = True, spot_timeout_to_on_demand_core: bool = True, spot_timeout_to_on_demand_task: bool = True, python3: bool = True, spark_glue_catalog: bool = True, hive_glue_catalog: bool = True, presto_glue_catalog: bool = True, consistent_view: bool = False, consistent_view_retry_seconds: int = 10, consistent_view_retry_count: int = 5, consistent_view_table_name: str = 'EmrFSMetadata', bootstraps_paths: Optional[List[str]] = None, debugging: bool = True, applications: Optional[List[str]] = None, visible_to_all_users: bool = True, key_pair_name: Optional[str] = None, security_group_master: Optional[str] = None, security_groups_master_additional: Optional[List[str]] = None, security_group_slave: Optional[str] = None, security_groups_slave_additional: Optional[List[str]] = None, security_group_service_access: Optional[str] = None, spark_log_level: str = 'WARN', spark_jars_path: Optional[List[str]] = None, spark_defaults: Optional[Dict[str, str]] = None, spark_pyarrow: bool = False, maximize_resource_allocation: bool = False, steps: Optional[List[Dict[str, Any]]] = None, keep_cluster_alive_when_no_steps: bool = True, termination_protected: bool = False, tags: Optional[Dict[str, str]] = None, boto3_session: Optional[boto3.session.Session] = None) → str

Create an EMR cluster with an instance fleets configuration.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html

Parameters
  • cluster_name (str) – Cluster name.

  • logging_s3_path (str) – S3 path for the cluster logs (e.g. s3://BUCKET_NAME/DIRECTORY_NAME/).

  • emr_release (str) – EMR release (e.g. emr-5.28.0).

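  • emr_ec2_role (str) – Name of the IAM role for the cluster's EC2 instances (e.g. EMR_EC2_DefaultRole).

  • emr_role (str) – Name of the IAM service role for EMR (e.g. EMR_DefaultRole).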

  • subnet_id (str) – VPC subnet ID.

  • instance_type_master (str) – EC2 instance type for the master fleet.

  • instance_type_core (str) – EC2 instance type for the core fleet.

  • instance_type_task (str) – EC2 instance type for the task fleet.

  • instance_ebs_size_master (int) – EBS volume size in GB for the master fleet.

  • instance_ebs_size_core (int) – EBS volume size in GB for the core fleet.

  • instance_ebs_size_task (int) – EBS volume size in GB for the task fleet.

  • instance_num_on_demand_master (int) – Number of On-Demand instances in the master fleet.

  • instance_num_on_demand_core (int) – Number of On-Demand instances in the core fleet.

  • instance_num_on_demand_task (int) – Number of On-Demand instances in the task fleet.

  • instance_num_spot_master (int) – Number of Spot instances in the master fleet.

  • instance_num_spot_core (int) – Number of Spot instances in the core fleet.

  • instance_num_spot_task (int) – Number of Spot instances in the task fleet.

  • spot_bid_percentage_of_on_demand_master (int) – The bid price, as a percentage of On-Demand price.

  • spot_bid_percentage_of_on_demand_core (int) – The bid price, as a percentage of On-Demand price.

  • spot_bid_percentage_of_on_demand_task (int) – The bid price, as a percentage of On-Demand price.

  • spot_provisioning_timeout_master (int) – The spot provisioning timeout period in minutes. If Spot instances are not provisioned within this time period, the TimeOutAction is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.

  • spot_provisioning_timeout_core (int) – The spot provisioning timeout period in minutes. If Spot instances are not provisioned within this time period, the TimeOutAction is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.

  • spot_provisioning_timeout_task (int) – The spot provisioning timeout period in minutes. If Spot instances are not provisioned within this time period, the TimeOutAction is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.

  • spot_timeout_to_on_demand_master (bool) – Whether the cluster switches to On-Demand instances (True) or shuts down (False) after a provisioning timeout.

  • spot_timeout_to_on_demand_core (bool) – Whether the cluster switches to On-Demand instances (True) or shuts down (False) after a provisioning timeout.

  • spot_timeout_to_on_demand_task (bool) – Whether the cluster switches to On-Demand instances (True) or shuts down (False) after a provisioning timeout.

  • python3 (bool) – Whether to enable Python 3.

  • spark_glue_catalog (bool) – Whether to integrate Spark with the Glue Catalog.

  • hive_glue_catalog (bool) – Whether to integrate Hive with the Glue Catalog.

  • presto_glue_catalog (bool) – Whether to integrate Presto with the Glue Catalog.

  • consistent_view (bool) – Consistent view allows EMR clusters to check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html

  • consistent_view_retry_seconds (int) – Delay between tries, in seconds.

  • consistent_view_retry_count (int) – Number of tries.

  • consistent_view_table_name (str) – Name of the DynamoDB table to store the consistent view data.

  • bootstraps_paths (List[str], optional) – Paths of the bootstrap scripts (e.g. ["s3://BUCKET_NAME/script.sh"]).

  • debugging (bool) – Whether to enable debugging.

  • applications (List[str], optional) – List of applications (e.g. ["Hadoop", "Spark", "Ganglia", "Hive"]).

  • visible_to_all_users (bool) – Whether the cluster is visible to all IAM users of the AWS account.

  • key_pair_name (str, optional) – Key pair name.

  • security_group_master (str, optional) – The identifier of the Amazon EC2 security group for the master node.

  • security_groups_master_additional (List[str], optional) – A list of additional Amazon EC2 security group IDs for the master node.

  • security_group_slave (str, optional) – The identifier of the Amazon EC2 security group for the core and task nodes.

  • security_groups_slave_additional (List[str], optional) – A list of additional Amazon EC2 security group IDs for the core and task nodes.

  • security_group_service_access (str, optional) – The identifier of the Amazon EC2 security group for the Amazon EMR service to access clusters in VPC private subnets.

  • spark_log_level (str) – log4j.rootCategory log level (ALL, DEBUG, INFO, WARN, ERROR, FATAL, OFF, TRACE).

  • spark_jars_path (List[str], optional) – Value for spark.jars (e.g. [s3://…/foo.jar, s3://…/boo.jar]). https://spark.apache.org/docs/latest/configuration.html

  • spark_defaults (Dict[str, str], optional) – Properties for the spark-defaults configuration. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults

  • spark_pyarrow (bool) – Enable PySpark to use PyArrow behind the scenes. Note: you must install pyarrow yourself via a bootstrap script.

  • maximize_resource_allocation (bool) – Configure your executors to utilize the maximum resources possible. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation

  • steps (List[Dict[str, Any]], optional) – Step definitions. Note: use awswrangler.emr.build_step() to build each one.

  • keep_cluster_alive_when_no_steps (bool) – Specifies whether the cluster should remain available after completing all steps.

  • termination_protected (bool) – Specifies whether the Amazon EC2 instances in the cluster are protected from termination by API calls, user intervention, or in the event of a job-flow error.

  • tags (Dict[str, str], optional) – Key/value collection to attach to the cluster (e.g. {"foo": "boo", "bar": "xoo"}).

  • boto3_session (boto3.Session, optional) – Boto3 session. The default boto3 session is used if boto3_session is None.
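
The documented bounds on the spot provisioning timeouts (minimum 5, maximum 1440 minutes) can be checked locally before the cluster is requested. The following is only a minimal sketch: the helper check_spot_timeouts and its constants are hypothetical names introduced for illustration and are not part of awswrangler.

```python
from typing import Any, Dict

# Bounds taken from the spot_provisioning_timeout_* parameter descriptions.
MIN_TIMEOUT_MINUTES = 5
MAX_TIMEOUT_MINUTES = 1440


def check_spot_timeouts(kwargs: Dict[str, Any]) -> None:
    """Hypothetical helper: validate the three spot provisioning timeouts.

    Raises ValueError if a value is not an int within the documented range.
    A missing key raises KeyError, since all three arguments are required.
    """
    for fleet in ("master", "core", "task"):
        name = f"spot_provisioning_timeout_{fleet}"
        value = kwargs[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an int, got {type(value).__name__}")
        if not MIN_TIMEOUT_MINUTES <= value <= MAX_TIMEOUT_MINUTES:
            raise ValueError(
                f"{name}={value} is outside "
                f"[{MIN_TIMEOUT_MINUTES}, {MAX_TIMEOUT_MINUTES}] minutes"
            )
```

A caller could run check_spot_timeouts(kwargs) on the keyword-argument dictionary and then pass the same dictionary on as wr.emr.create_cluster(**kwargs).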

Returns

Cluster ID.

Return type

str

Examples

>>> import awswrangler as wr
>>> cluster_id = wr.emr.create_cluster(
...     cluster_name="wrangler_cluster",
...     logging_s3_path="s3://BUCKET_NAME/emr-logs/",
...     emr_release="emr-5.28.0",
...     subnet_id="SUBNET_ID",
...     emr_ec2_role="EMR_EC2_DefaultRole",
...     emr_role="EMR_DefaultRole",
...     instance_type_master="m5.xlarge",
...     instance_type_core="m5.xlarge",
...     instance_type_task="m5.xlarge",
...     instance_ebs_size_master=50,
...     instance_ebs_size_core=50,
...     instance_ebs_size_task=50,
...     instance_num_on_demand_master=1,
...     instance_num_on_demand_core=1,
...     instance_num_on_demand_task=1,
...     instance_num_spot_master=0,
...     instance_num_spot_core=1,
...     instance_num_spot_task=1,
...     spot_bid_percentage_of_on_demand_master=100,
...     spot_bid_percentage_of_on_demand_core=100,
...     spot_bid_percentage_of_on_demand_task=100,
...     spot_provisioning_timeout_master=5,
...     spot_provisioning_timeout_core=5,
...     spot_provisioning_timeout_task=5,
...     spot_timeout_to_on_demand_master=True,
...     spot_timeout_to_on_demand_core=True,
...     spot_timeout_to_on_demand_task=True,
...     python3=True,
...     spark_glue_catalog=True,
...     hive_glue_catalog=True,
...     presto_glue_catalog=True,
...     bootstraps_paths=None,
...     debugging=True,
...     applications=["Hadoop", "Spark", "Ganglia", "Hive"],
...     visible_to_all_users=True,
...     key_pair_name=None,
...     spark_jars_path=["s3://...jar"],
...     maximize_resource_allocation=True,
...     keep_cluster_alive_when_no_steps=True,
...     termination_protected=False,
...     spark_pyarrow=True,
...     tags={
...         "foo": "boo"
...     })
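
Most required arguments repeat once per fleet (master, core, task), so when all fleets share the same settings the per-fleet keyword arguments can be generated rather than typed out. The sketch below assumes that situation; fleet_kwargs is a hypothetical helper written for this example, not an awswrangler function, and its defaults (bid percentage 100, timeout 5 minutes) simply mirror the values used in the example above.

```python
from typing import Any, Dict

FLEETS = ("master", "core", "task")


def fleet_kwargs(
    instance_type: str,
    ebs_size: int,
    on_demand: Dict[str, int],
    spot: Dict[str, int],
    bid_percentage: int = 100,
    timeout_minutes: int = 5,
) -> Dict[str, Any]:
    """Hypothetical helper: expand shared settings into per-fleet arguments.

    on_demand and spot map a fleet name to an instance count; fleets that
    are missing from a mapping get 0 instances of that kind.
    """
    kwargs: Dict[str, Any] = {}
    for fleet in FLEETS:
        kwargs[f"instance_type_{fleet}"] = instance_type
        kwargs[f"instance_ebs_size_{fleet}"] = ebs_size
        kwargs[f"instance_num_on_demand_{fleet}"] = on_demand.get(fleet, 0)
        kwargs[f"instance_num_spot_{fleet}"] = spot.get(fleet, 0)
        kwargs[f"spot_bid_percentage_of_on_demand_{fleet}"] = bid_percentage
        kwargs[f"spot_provisioning_timeout_{fleet}"] = timeout_minutes
    return kwargs


# Same fleet layout as the example above: one On-Demand instance per fleet,
# plus one Spot instance each for the core and task fleets.
shared = fleet_kwargs(
    instance_type="m5.xlarge",
    ebs_size=50,
    on_demand={"master": 1, "core": 1, "task": 1},
    spot={"core": 1, "task": 1},
)
```

The resulting dictionary can then be unpacked next to the remaining arguments, for instance wr.emr.create_cluster(cluster_name="wrangler_cluster", ..., **shared), with the other required arguments filled in where the "..." stands.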