31 - OpenSearch¶
Table of Contents¶
1. Initialize
Connect to your Amazon OpenSearch domain
Enter your bucket name
Initialize sample data
2. Indexing (load)
Index documents (no Pandas)
Index json file
3. Search
3.1 Search by DSL
3.2 Search by SQL
4. Delete Indices
5. Bonus - Prepare data and index from DataFrame
Prepare the data for indexing
Create index with mapping
Index dataframe
Execute geo query
1. Initialize¶
[ ]:
# Install the optional modules first
!pip install 'awswrangler[opensearch]'
[1]:
import awswrangler as wr
Connect to your Amazon OpenSearch domain¶
[2]:
client = wr.opensearch.connect(
host='OPENSEARCH-ENDPOINT',
# username='FGAC-USERNAME(OPTIONAL)',
# password='FGAC-PASSWORD(OPTIONAL)'
)
client.info()
Enter your bucket name¶
[3]:
bucket = 'BUCKET'
Initialize sample data¶
[4]:
sf_restaurants_inspections = [
{
"inspection_id": "24936_20160609",
"business_address": "315 California St",
"business_city": "San Francisco",
"business_id": "24936",
"business_location": {"lon": -122.400152, "lat": 37.793199},
"business_name": "San Francisco Soup Company",
"business_postal_code": "94104",
"business_state": "CA",
"inspection_date": "2016-06-09T00:00:00.000",
"inspection_score": 77,
"inspection_type": "Routine - Unscheduled",
"risk_category": "Low Risk",
"violation_description": "Improper food labeling or menu misrepresentation",
"violation_id": "24936_20160609_103141",
},
{
"inspection_id": "60354_20161123",
"business_address": "10 Mason St",
"business_city": "San Francisco",
"business_id": "60354",
"business_location": {"lon": -122.409061, "lat": 37.783527},
"business_name": "Soup Unlimited",
"business_postal_code": "94102",
"business_state": "CA",
"inspection_date": "2016-11-23T00:00:00.000",
"inspection_type": "Routine",
"inspection_score": 95,
},
{
"inspection_id": "1797_20160705",
"business_address": "2872 24th St",
"business_city": "San Francisco",
"business_id": "1797",
"business_location": {"lon": -122.409752, "lat": 37.752807},
"business_name": "TIO CHILOS GRILL",
"business_postal_code": "94110",
"business_state": "CA",
"inspection_date": "2016-07-05T00:00:00.000",
"inspection_score": 90,
"inspection_type": "Routine - Unscheduled",
"risk_category": "Low Risk",
"violation_description": "Unclean nonfood contact surfaces",
"violation_id": "1797_20160705_103142",
},
{
"inspection_id": "66198_20160527",
"business_address": "1661 Tennessee St Suite 3B",
"business_city": "San Francisco Whard Restaurant",
"business_id": "66198",
"business_location": {"lon": -122.388478, "lat": 37.75072},
"business_name": "San Francisco Restaurant",
"business_postal_code": "94107",
"business_state": "CA",
"inspection_date": "2016-05-27T00:00:00.000",
"inspection_type": "Routine",
"inspection_score": 56,
},
{
"inspection_id": "5794_20160907",
"business_address": "2162 24th Ave",
"business_city": "San Francisco",
"business_id": "5794",
"business_location": {"lon": -122.481299, "lat": 37.747228},
"business_name": "Soup House",
"business_phone_number": "+14155752700",
"business_postal_code": "94116",
"business_state": "CA",
"inspection_date": "2016-09-07T00:00:00.000",
"inspection_score": 96,
"inspection_type": "Routine - Unscheduled",
"risk_category": "Low Risk",
"violation_description": "Unapproved or unmaintained equipment or utensils",
"violation_id": "5794_20160907_103144",
},
# duplicate record
{
"inspection_id": "5794_20160907",
"business_address": "2162 24th Ave",
"business_city": "San Francisco",
"business_id": "5794",
"business_location": {"lon": -122.481299, "lat": 37.747228},
"business_name": "Soup-or-Salad",
"business_phone_number": "+14155752700",
"business_postal_code": "94116",
"business_state": "CA",
"inspection_date": "2016-09-07T00:00:00.000",
"inspection_score": 96,
"inspection_type": "Routine - Unscheduled",
"risk_category": "Low Risk",
"violation_description": "Unapproved or unmaintained equipment or utensils",
"violation_id": "5794_20160907_103144",
},
]
2. Indexing (load)¶
Index documents (no Pandas)¶
[5]:
# index documents w/o providing keys (_id is auto-generated)
wr.opensearch.index_documents(
client,
documents=sf_restaurants_inspections,
index="sf_restaurants_inspections"
)
Indexing: 100% (6/6)|####################################|Elapsed Time: 0:00:01
[5]:
{'success': 6, 'errors': []}
[6]:
# read all documents. There are total 6 documents
wr.opensearch.search(
client,
index="sf_restaurants_inspections",
_source=["inspection_id", "business_name", "business_location"]
)
[6]:
_id | business_name | inspection_id | business_location.lon | business_location.lat | |
---|---|---|---|---|---|
0 | 663dd72d-0da4-495b-b0ae-ed000105ae73 | TIO CHILOS GRILL | 1797_20160705 | -122.409752 | 37.752807 |
1 | ff2f50f6-5415-4706-9bcb-af7c5eb0afa3 | Soup House | 5794_20160907 | -122.481299 | 37.747228 |
2 | b9e8f6a2-8fd1-4660-b041-2997a1a80984 | San Francisco Soup Company | 24936_20160609 | -122.400152 | 37.793199 |
3 | 56b352e6-102b-4eff-8296-7e1fb2459bab | Soup Unlimited | 60354_20161123 | -122.409061 | 37.783527 |
4 | 6fec5411-f79a-48e4-be7b-e0e44d5ebbab | San Francisco Restaurant | 66198_20160527 | -122.388478 | 37.750720 |
5 | 7ba4fb17-f9a9-49da-b90e-8b3553d6d97c | Soup-or-Salad | 5794_20160907 | -122.481299 | 37.747228 |
Index json file¶
[ ]:
import pandas as pd
df = pd.DataFrame(sf_restaurants_inspections)
path = f"s3://{bucket}/json/sf_restaurants_inspections.json"
wr.s3.to_json(df, path,orient='records',lines=True)
[8]:
# index json w/ providing keys
wr.opensearch.index_json(
client,
path=path, # path can be s3 or local
index="sf_restaurants_inspections_dedup",
id_keys=["inspection_id"] # can be multiple fields. arg applicable to all index_* functions
)
Indexing: 100% (6/6)|####################################|Elapsed Time: 0:00:00
[8]:
{'success': 6, 'errors': []}
[9]:
# now there are no duplicates. There are total 5 documents
wr.opensearch.search(
client,
index="sf_restaurants_inspections_dedup",
_source=["inspection_id", "business_name", "business_location"]
)
[9]:
_id | business_name | inspection_id | business_location.lon | business_location.lat | |
---|---|---|---|---|---|
0 | 24936_20160609 | San Francisco Soup Company | 24936_20160609 | -122.400152 | 37.793199 |
1 | 66198_20160527 | San Francisco Restaurant | 66198_20160527 | -122.388478 | 37.750720 |
2 | 5794_20160907 | Soup-or-Salad | 5794_20160907 | -122.481299 | 37.747228 |
3 | 60354_20161123 | Soup Unlimited | 60354_20161123 | -122.409061 | 37.783527 |
4 | 1797_20160705 | TIO CHILOS GRILL | 1797_20160705 | -122.409752 | 37.752807 |
Index CSV¶
[11]:
wr.opensearch.index_csv(
client,
index="nyc_restaurants_inspections_sample",
path='https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD', # index_csv supports local, s3 and url path
id_keys=["CAMIS"],
pandas_kwargs={'na_filter': True, 'nrows': 1000}, # pandas.read_csv() args - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
bulk_size=500 # modify based on your cluster size
)
Indexing: 100% (1000/1000)|##############################|Elapsed Time: 0:00:00
[11]:
{'success': 1000, 'errors': []}
[12]:
wr.opensearch.search(
client,
index="nyc_restaurants_inspections_sample",
size=5
)
[12]:
_id | CAMIS | DBA | BORO | BUILDING | STREET | ZIPCODE | PHONE | CUISINE DESCRIPTION | INSPECTION DATE | ... | RECORD DATE | INSPECTION TYPE | Latitude | Longitude | Community Board | Council District | Census Tract | BIN | BBL | NTA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41610426 | 41610426 | GLOW THAI RESTAURANT | Brooklyn | 7107 | 3 AVENUE | 11209.0 | 7187481920 | Thai | 02/26/2020 | ... | 10/04/2021 | Cycle Inspection / Re-inspection | 40.633865 | -74.026798 | 310.0 | 43.0 | 6800.0 | 3146519.0 | 3.058910e+09 | BK31 |
1 | 40811162 | 40811162 | CARMINE'S | Manhattan | 2450 | BROADWAY | 10024.0 | 2123622200 | Italian | 05/28/2019 | ... | 10/04/2021 | Cycle Inspection / Initial Inspection | 40.791168 | -73.974308 | 107.0 | 6.0 | 17900.0 | 1033560.0 | 1.012380e+09 | MN12 |
2 | 50012113 | 50012113 | TANG | Queens | 196-50 | NORTHERN BOULEVARD | 11358.0 | 7182797080 | Korean | 08/16/2018 | ... | 10/04/2021 | Cycle Inspection / Initial Inspection | 40.757850 | -73.784593 | 411.0 | 19.0 | 145101.0 | 4124565.0 | 4.055200e+09 | QN48 |
3 | 50014618 | 50014618 | TOTTO RAMEN | Manhattan | 248 | EAST 52 STREET | 10022.0 | 2124210052 | Japanese | 08/20/2018 | ... | 10/04/2021 | Cycle Inspection / Re-inspection | 40.756596 | -73.968749 | 106.0 | 4.0 | 9800.0 | 1038490.0 | 1.013250e+09 | MN19 |
4 | 50045782 | 50045782 | OLLIE'S CHINESE RESTAURANT | Manhattan | 2705 | BROADWAY | 10025.0 | 2129323300 | Chinese | 10/21/2019 | ... | 10/04/2021 | Cycle Inspection / Re-inspection | 40.799318 | -73.968440 | 107.0 | 6.0 | 19100.0 | 1056562.0 | 1.018750e+09 | MN12 |
5 rows × 27 columns
3. Search¶
Search results are returned as Pandas DataFrame
3.1 Search by DSL¶
[13]:
# add a search query. search all soup businesses
wr.opensearch.search(
client,
index="sf_restaurants_inspections",
_source=["inspection_id", "business_name", "business_location"],
filter_path=["hits.hits._id","hits.hits._source"],
search_body={
"query": {
"match": {
"business_name": "soup"
}
}
}
)
[13]:
_id | business_name | inspection_id | business_location.lon | business_location.lat | |
---|---|---|---|---|---|
0 | ff2f50f6-5415-4706-9bcb-af7c5eb0afa3 | Soup House | 5794_20160907 | -122.481299 | 37.747228 |
1 | 7ba4fb17-f9a9-49da-b90e-8b3553d6d97c | Soup-or-Salad | 5794_20160907 | -122.481299 | 37.747228 |
2 | b9e8f6a2-8fd1-4660-b041-2997a1a80984 | San Francisco Soup Company | 24936_20160609 | -122.400152 | 37.793199 |
3 | 56b352e6-102b-4eff-8296-7e1fb2459bab | Soup Unlimited | 60354_20161123 | -122.409061 | 37.783527 |
3.1 Search by SQL¶
[14]:
wr.opensearch.search_by_sql(
client,
sql_query="""SELECT business_name, inspection_score
FROM sf_restaurants_inspections_dedup
WHERE business_name LIKE '%soup%'
ORDER BY inspection_score DESC LIMIT 5"""
)
[14]:
_index | _type | _id | _score | business_name | inspection_score | |
---|---|---|---|---|---|---|
0 | sf_restaurants_inspections_dedup | _doc | 5794_20160907 | None | Soup-or-Salad | 96 |
1 | sf_restaurants_inspections_dedup | _doc | 60354_20161123 | None | Soup Unlimited | 95 |
2 | sf_restaurants_inspections_dedup | _doc | 24936_20160609 | None | San Francisco Soup Company | 77 |
4. Delete Indices¶
[15]:
wr.opensearch.delete_index(
client=client,
index="sf_restaurants_inspections"
)
[15]:
{'acknowledged': True}
5. Bonus - Prepare data and index from DataFrame¶
For this exercise we’ll use DOHMH New York City Restaurant Inspection Results dataset
[16]:
import pandas as pd
[17]:
df = pd.read_csv('https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD')
Prepare the data for indexing¶
[18]:
# fields names underscore casing
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
# convert lon/lat to OpenSearch geo_point
df['business_location'] = "POINT (" + df.longitude.fillna('0').astype(str) + " " + df.latitude.fillna('0').astype(str) + ")"
Create index with mapping¶
[19]:
# delete index if exists
wr.opensearch.delete_index(
client=client,
index="nyc_restaurants"
)
# use dynamic_template to map date fields
# define business_location as geo_point
wr.opensearch.create_index(
client=client,
index="nyc_restaurants_inspections",
mappings={
"dynamic_templates" : [
{
"dates" : {
"match" : "*date",
"mapping" : {
"type" : "date",
"format" : 'MM/dd/yyyy'
}
}
}
],
"properties": {
"business_location": {
"type": "geo_point"
}
}
}
)
[19]:
{'acknowledged': True,
'shards_acknowledged': True,
'index': 'nyc_restaurants_inspections'}
Index dataframe¶
[20]:
wr.opensearch.index_df(
client,
df=df,
index="nyc_restaurants_inspections",
id_keys=["camis"],
bulk_size=1000
)
Indexing: 100% (382655/382655)|##########################|Elapsed Time: 0:04:15
[20]:
{'success': 382655, 'errors': []}
Execute geo query¶
Sort restaurants by distance from Times-Square¶
[21]:
wr.opensearch.search(
client,
index="nyc_restaurants_inspections",
filter_path=["hits.hits._source"],
size=100,
search_body={
"query": {
"match_all": {}
},
"sort": [
{
"_geo_distance": {
"business_location": { # Times-Square - https://geojson.io/#map=16/40.7563/-73.9862
"lat": 40.75613228383523,
"lon": -73.9865791797638
},
"order": "asc"
}
}
]
}
)
[21]:
camis | dba | boro | building | street | zipcode | phone | cuisine_description | inspection_date | action | ... | inspection_type | latitude | longitude | community_board | council_district | census_tract | bin | bbl | nta | business_location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41551304 | THE COUNTER | Manhattan | 7 | TIMES SQUARE | 10036.0 | 2129976801 | American | 12/22/2016 | Violations were cited in the following area(s). | ... | Cycle Inspection / Initial Inspection | 40.755908 | -73.986681 | 105.0 | 3.0 | 11300.0 | 1086069.0 | 1.009940e+09 | MN17 | POINT (-73.986680953809 40.755907817312) |
1 | 50055665 | ANN INC CAFE | Manhattan | 7 | TIMES SQUARE | 10036.0 | 2125413287 | American | 12/11/2019 | Violations were cited in the following area(s). | ... | Cycle Inspection / Initial Inspection | 40.755908 | -73.986681 | 105.0 | 3.0 | 11300.0 | 1086069.0 | 1.009940e+09 | MN17 | POINT (-73.986680953809 40.755907817312) |
2 | 50049552 | ERNST AND YOUNG | Manhattan | 5 | TIMES SQ | 10036.0 | 2127739994 | Coffee/Tea | 11/30/2018 | Violations were cited in the following area(s). | ... | Cycle Inspection / Initial Inspection | 40.755702 | -73.987208 | 105.0 | 3.0 | 11300.0 | 1024656.0 | 1.010130e+09 | MN17 | POINT (-73.987207980138 40.755702020307) |
3 | 50014078 | RED LOBSTER | Manhattan | 5 | TIMES SQ | 10036.0 | 2127306706 | Seafood | 10/03/2017 | Violations were cited in the following area(s). | ... | Cycle Inspection / Initial Inspection | 40.755702 | -73.987208 | 105.0 | 3.0 | 11300.0 | 1024656.0 | 1.010130e+09 | MN17 | POINT (-73.987207980138 40.755702020307) |
4 | 50015171 | NEW AMSTERDAM THEATER | Manhattan | 214 | WEST 42 STREET | 10036.0 | 2125825472 | American | 06/26/2018 | Violations were cited in the following area(s). | ... | Cycle Inspection / Re-inspection | 40.756317 | -73.987652 | 105.0 | 3.0 | 11300.0 | 1024660.0 | 1.010130e+09 | MN17 | POINT (-73.987651832547 40.756316895053) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 41552060 | PROSKAUER ROSE | Manhattan | 11 | TIMES SQUARE | 10036.0 | 2129695493 | American | 08/11/2017 | Violations were cited in the following area(s). | ... | Administrative Miscellaneous / Initial Inspection | 40.756891 | -73.990023 | 105.0 | 3.0 | 11300.0 | 1087978.0 | 1.010138e+09 | MN17 | POINT (-73.990023200823 40.756890780426) |
96 | 41242148 | GABBY O'HARA'S | Manhattan | 123 | WEST 39 STREET | 10018.0 | 2122788984 | Irish | 07/30/2019 | Violations were cited in the following area(s). | ... | Cycle Inspection / Re-inspection | 40.753405 | -73.986602 | 105.0 | 4.0 | 11300.0 | 1080611.0 | 1.008150e+09 | MN17 | POINT (-73.986602050292 40.753404587174) |
97 | 50095860 | THE TIMES EATERY | Manhattan | 680 | 8 AVENUE | 10036.0 | 6463867787 | American | 02/28/2020 | Violations were cited in the following area(s). | ... | Pre-permit (Operational) / Initial Inspection | 40.757991 | -73.989218 | 105.0 | 3.0 | 11900.0 | 1024703.0 | 1.010150e+09 | MN17 | POINT (-73.989218092096 40.757991356019) |
98 | 50072861 | ITSU | Manhattan | 530 | 7 AVENUE | 10018.0 | 9176393645 | Asian/Asian Fusion | 09/10/2018 | Violations were cited in the following area(s). | ... | Pre-permit (Operational) / Initial Inspection | 40.753844 | -73.988551 | 105.0 | 3.0 | 11300.0 | 1014485.0 | 1.007880e+09 | MN17 | POINT (-73.988551029682 40.753843959794) |
99 | 50068109 | LUKE'S LOBSTER | Manhattan | 1407 | BROADWAY | 10018.0 | 9174759192 | Seafood | 09/06/2017 | Violations were cited in the following area(s). | ... | Pre-permit (Operational) / Initial Inspection | 40.753432 | -73.987151 | 105.0 | 3.0 | 11300.0 | 1015265.0 | 1.008140e+09 | MN17 | POINT (-73.98715066791 40.753432097521) |
100 rows × 27 columns