AWS SDK for pandas

18 - QuickSight

For this tutorial we will use the public AWS COVID-19 data lake.

References:

Please, install the CloudFormation template above to have access to the public data lake.

P.S. To be able to access the public data lake, you must allow explicitly QuickSight to access the related external bucket.

[1]:
from time import sleep

import awswrangler as wr

List users of QuickSight account

[2]:
[{"username": user["UserName"], "role": user["Role"]} for user in wr.quicksight.list_users("default")]
[2]:
[{'username': 'dev', 'role': 'ADMIN'}]
[3]:
wr.catalog.databases()
[3]:
Database Description
0 aws_sdk_pandas AWS SDK for pandas Test Arena - Glue Database
1 awswrangler_test
2 covid-19
3 default Default Hive database
[4]:
wr.catalog.tables(database="covid-19")
[4]:
Database Table Description Columns Partitions
0 covid-19 alleninstitute_comprehend_medical Comprehend Medical results run against Allen I... paper_id, date, dx_name, test_name, procedure_...
1 covid-19 alleninstitute_metadata Metadata on papers pulled from the Allen Insti... cord_uid, sha, source_x, title, doi, pmcid, pu...
2 covid-19 country_codes Lookup table for country codes country, alpha-2 code, alpha-3 code, numeric c...
3 covid-19 county_populations Lookup table for population for each county ba... id, id2, county, state, population estimate 2018
4 covid-19 covid_knowledge_graph_edges AWS Knowledge Graph for COVID-19 data id, label, from, to, score
5 covid-19 covid_knowledge_graph_nodes_author AWS Knowledge Graph for COVID-19 data id, label, first, last, full_name
6 covid-19 covid_knowledge_graph_nodes_concept AWS Knowledge Graph for COVID-19 data id, label, entity, concept
7 covid-19 covid_knowledge_graph_nodes_institution AWS Knowledge Graph for COVID-19 data id, label, institution, country, settlement
8 covid-19 covid_knowledge_graph_nodes_paper AWS Knowledge Graph for COVID-19 data id, label, doi, sha_code, publish_time, source...
9 covid-19 covid_knowledge_graph_nodes_topic AWS Knowledge Graph for COVID-19 data id, label, topic, topic_num
10 covid-19 covid_testing_states_daily USA total test daily trend by state. Sourced ... date, state, positive, negative, pending, hosp...
11 covid-19 covid_testing_us_daily USA total test daily trend. Sourced from covi... date, states, positive, negative, posneg, pend...
12 covid-19 covid_testing_us_total USA total tests. Sourced from covidtracking.c... positive, negative, posneg, hospitalized, deat...
13 covid-19 covidcast_data CMU Delphi's COVID-19 Surveillance Data data_source, signal, geo_type, time_value, geo...
14 covid-19 covidcast_metadata CMU Delphi's COVID-19 Surveillance Metadata data_source, signal, time_type, geo_type, min_...
15 covid-19 enigma_jhu Johns Hopkins University Consolidated data on ... fips, admin2, province_state, country_region, ...
16 covid-19 enigma_jhu_timeseries Johns Hopkins University data on COVID-19 case... uid, fips, iso2, iso3, code3, admin2, latitude...
17 covid-19 hospital_beds Data on hospital beds and their utilization in... objectid, hospital_name, hospital_type, hq_add...
18 covid-19 nytimes_counties Data on COVID-19 cases from NY Times at US cou... date, county, state, fips, cases, deaths
19 covid-19 nytimes_states Data on COVID-19 cases from NY Times at US sta... date, state, fips, cases, deaths
20 covid-19 prediction_models_county_predictions County-level Predictions Data. Sourced from Yu... countyfips, countyname, statename, severity_co...
21 covid-19 prediction_models_severity_index Severity Index models. Sourced from Yu Group a... severity_1-day, severity_2-day, severity_3-day...
22 covid-19 tableau_covid_datahub COVID-19 data that has been gathered and unifi... country_short_name, country_alpha_3_code, coun...
23 covid-19 tableau_jhu Johns Hopkins University data on COVID-19 case... case_type, cases, difference, date, country_re...
24 covid-19 us_state_abbreviations Lookup table for US state abbreviations state, abbreviation
25 covid-19 world_cases_deaths_testing Data on confirmed cases, deaths, and testing. ... iso_code, location, date, total_cases, new_cas...

Create data source of QuickSight Note: data source stores the connection information.

[5]:
wr.quicksight.create_athena_data_source(
    name="covid-19",
    workgroup="primary",
    allowed_to_manage={"users": ["dev"]},
)
[6]:
wr.catalog.tables(database="covid-19", name_contains="nyt")
[6]:
Database Table Description Columns Partitions
0 covid-19 nytimes_counties Data on COVID-19 cases from NY Times at US cou... date, county, state, fips, cases, deaths
1 covid-19 nytimes_states Data on COVID-19 cases from NY Times at US sta... date, state, fips, cases, deaths
[7]:
wr.athena.read_sql_query("SELECT * FROM nytimes_counties limit 10", database="covid-19", ctas_approach=False)
[7]:
date county state fips cases deaths
0 2020-01-21 Snohomish Washington 53061 1 0
1 2020-01-22 Snohomish Washington 53061 1 0
2 2020-01-23 Snohomish Washington 53061 1 0
3 2020-01-24 Cook Illinois 17031 1 0
4 2020-01-24 Snohomish Washington 53061 1 0
5 2020-01-25 Orange California 06059 1 0
6 2020-01-25 Cook Illinois 17031 1 0
7 2020-01-25 Snohomish Washington 53061 1 0
8 2020-01-26 Maricopa Arizona 04013 1 0
9 2020-01-26 Los Angeles California 06037 1 0
[8]:
sql = """
SELECT
  j.*,
  co.Population,
  co.county AS county2,
  hb.*
FROM
  (
    SELECT
      date,
      county,
      state,
      fips,
      cases as confirmed,
      deaths
    FROM "covid-19".nytimes_counties
  ) j
  LEFT OUTER JOIN (
    SELECT
      DISTINCT county,
      state,
      "population estimate 2018" AS Population
    FROM
      "covid-19".county_populations
    WHERE
      state IN (
        SELECT
          DISTINCT state
        FROM
          "covid-19".nytimes_counties
      )
      AND county IN (
        SELECT
          DISTINCT county as county
        FROM "covid-19".nytimes_counties
      )
  ) co ON co.county = j.county
  AND co.state = j.state
  LEFT OUTER JOIN (
    SELECT
      count(objectid) as Hospital,
      fips as hospital_fips,
      sum(num_licensed_beds) as licensed_beds,
      sum(num_staffed_beds) as staffed_beds,
      sum(num_icu_beds) as icu_beds,
      avg(bed_utilization) as bed_utilization,
      sum(
        potential_increase_in_bed_capac
      ) as potential_increase_bed_capacity
    FROM "covid-19".hospital_beds
    WHERE
      fips in (
        SELECT
          DISTINCT fips
        FROM
          "covid-19".nytimes_counties
      )
    GROUP BY
      2
  ) hb ON hb.hospital_fips = j.fips
"""

wr.athena.read_sql_query(sql, database="covid-19", ctas_approach=False)
[8]:
date county state fips confirmed deaths population county2 Hospital hospital_fips licensed_beds staffed_beds icu_beds bed_utilization potential_increase_bed_capacity
0 2020-04-12 Park Montana 30067 7 0 16736 Park 0 30067 25 25 4 0.432548 0
1 2020-04-12 Ravalli Montana 30081 3 0 43172 Ravalli 0 30081 25 25 5 0.567781 0
2 2020-04-12 Silver Bow Montana 30093 11 0 34993 Silver Bow 0 30093 98 71 11 0.551457 27
3 2020-04-12 Clay Nebraska 31035 2 0 6214 Clay <NA> <NA> <NA> <NA> <NA> NaN <NA>
4 2020-04-12 Cuming Nebraska 31039 2 0 8940 Cuming 0 31039 25 25 4 0.204493 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
227684 2020-06-11 Hockley Texas 48219 28 1 22980 Hockley 0 48219 48 48 8 0.120605 0
227685 2020-06-11 Hudspeth Texas 48229 11 0 4795 Hudspeth <NA> <NA> <NA> <NA> <NA> NaN <NA>
227686 2020-06-11 Jones Texas 48253 633 0 19817 Jones 0 48253 45 7 1 0.718591 38
227687 2020-06-11 La Salle Texas 48283 4 0 7531 La Salle <NA> <NA> <NA> <NA> <NA> NaN <NA>
227688 2020-06-11 Limestone Texas 48293 36 1 23519 Limestone 0 48293 78 69 9 0.163940 9

227689 rows × 15 columns

Create Dataset with custom SQL option

[9]:
wr.quicksight.create_athena_dataset(
    name="covid19-nytimes-usa",
    sql=sql,
    sql_name="CustomSQL",
    data_source_name="covid-19",
    import_mode="SPICE",
    allowed_to_manage={"users": ["dev"]},
)
[10]:
ingestion_id = wr.quicksight.create_ingestion("covid19-nytimes-usa")

Wait ingestion

[11]:
while wr.quicksight.describe_ingestion(ingestion_id=ingestion_id, dataset_name="covid19-nytimes-usa")[
    "IngestionStatus"
] not in ["COMPLETED", "FAILED"]:
    sleep(1)

Describe last ingestion

[12]:
wr.quicksight.describe_ingestion(ingestion_id=ingestion_id, dataset_name="covid19-nytimes-usa")["RowInfo"]
[12]:
{'RowsIngested': 227689, 'RowsDropped': 0}

List all ingestions

[13]:
[
    {"time": user["CreatedTime"], "source": user["RequestSource"]}
    for user in wr.quicksight.list_ingestions("covid19-nytimes-usa")
]
[13]:
[{'time': datetime.datetime(2020, 6, 12, 15, 13, 46, 996000, tzinfo=tzlocal()),
  'source': 'MANUAL'},
 {'time': datetime.datetime(2020, 6, 12, 15, 13, 42, 344000, tzinfo=tzlocal()),
  'source': 'MANUAL'}]

Create new dataset from a table directly

[14]:
wr.quicksight.create_athena_dataset(
    name="covid-19-tableau_jhu",
    table="tableau_jhu",
    data_source_name="covid-19",
    database="covid-19",
    import_mode="DIRECT_QUERY",
    rename_columns={"cases": "Count_of_Cases", "combined_key": "County"},
    cast_columns_types={"Count_of_Cases": "INTEGER"},
    tag_columns={"combined_key": [{"ColumnGeographicRole": "COUNTY"}]},
    allowed_to_manage={"users": ["dev"]},
)

Cleaning up

[15]:
wr.quicksight.delete_data_source("covid-19")
wr.quicksight.delete_dataset("covid19-nytimes-usa")
wr.quicksight.delete_dataset("covid-19-tableau_jhu")