23 - Flexible Partitions Filter (PUSH-DOWN)
The `partition_filter` argument:

- Callback function to filter on PARTITION columns (PUSH-DOWN filter).
- The function MUST receive a single argument (`Dict[str, str]`) where keys are partition names and values are partition values.
- The function MUST return a `bool`: `True` to read the partition, `False` to ignore it.
- Ignored if `dataset=False`.

P.S. Check the function API docs to see which arguments can be configured through Global Configurations.
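For illustration only, a minimal callback matching this contract could look like the sketch below; the `keep_partition` name and the `"value"` key are placeholders for whatever partition columns your dataset actually has:

from typing import Dict

def keep_partition(partitions: Dict[str, str]) -> bool:
    # One dict per partition, e.g. {"value": "foo"}: keys are partition
    # column names, values are always strings.
    return partitions["value"] == "foo"  # True -> read this partition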
[1]:
import awswrangler as wr
import pandas as pd
Enter your bucket name:
[2]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/dataset/"
Creating the Dataset (Parquet)
[3]:
df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["foo", "boo", "bar"],
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    partition_cols=["value"],
)

wr.s3.read_parquet(path, dataset=True)
[3]:
|   | id | value |
|---|----|-------|
| 0 | 3  | bar   |
| 1 | 2  | boo   |
| 2 | 1  | foo   |
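With `dataset=True`, each partition value lands under its own Hive-style `value=<...>/` prefix. If you want to inspect that layout, listing the objects shows one prefix per value (bucket, file names, and exact keys will vary):

wr.s3.list_objects(path)
# Roughly:
# ['s3://<bucket>/dataset/value=bar/<file>.snappy.parquet',
#  's3://<bucket>/dataset/value=boo/<file>.snappy.parquet',
#  's3://<bucket>/dataset/value=foo/<file>.snappy.parquet']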
Parquet Example 1
[4]:
my_filter = lambda x: x["value"].endswith("oo")
wr.s3.read_parquet(path, dataset=True, partition_filter=my_filter)
[4]:
|   | id | value |
|---|----|-------|
| 0 | 2  | boo   |
| 1 | 1  | foo   |
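Note that partition values always arrive as strings (per the `Dict[str, str]` contract), so numeric partition columns need an explicit cast inside the callback. A hypothetical sketch for a dataset partitioned by a `year` column, which is not part of this tutorial's dataset:

# Hypothetical "year" partition column: cast before comparing numerically.
my_filter = lambda x: int(x["year"]) >= 2020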
Parquet Example 2
[5]:
from Levenshtein import distance  # third-party package: pip install Levenshtein

def my_filter(partitions):
    # Keep partitions whose "value" is within edit distance 1 of "boo".
    return distance("boo", partitions["value"]) <= 1

wr.s3.read_parquet(path, dataset=True, partition_filter=my_filter)
[5]:
|   | id | value |
|---|----|-------|
| 0 | 2  | boo   |
| 1 | 1  | foo   |
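Any Python logic works inside the callback, as the Levenshtein example shows, and because the dict carries one entry per partition column you can combine conditions across several columns. A hypothetical sketch assuming an additional `region` partition column:

# Hypothetical dataset partitioned by both "value" and "region".
def my_filter(partitions):
    return partitions["value"].endswith("oo") and partitions["region"] == "us-east-1"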
Creating the Dataset (CSV)
[6]:
df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["foo", "boo", "bar"],
})

wr.s3.to_csv(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    partition_cols=["value"],
    compression="gzip",
    index=False,
)

wr.s3.read_csv(path, dataset=True)
[6]:
|   | id | value |
|---|----|-------|
| 0 | 3  | bar   |
| 1 | 2  | boo   |
| 2 | 1  | foo   |
CSV Example 1
[7]:
my_filter = lambda x: x["value"].endswith("oo")
wr.s3.read_csv(path, dataset=True, partition_filter=my_filter)
[7]:
|   | id | value |
|---|----|-------|
| 0 | 2  | boo   |
| 1 | 1  | foo   |
CSV Example 2
[8]:
from Levenshtein import distance

def my_filter(partitions):
    return distance("boo", partitions["value"]) <= 1

wr.s3.read_csv(path, dataset=True, partition_filter=my_filter)
[8]:
|   | id | value |
|---|----|-------|
| 0 | 2  | boo   |
| 1 | 1  | foo   |
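The same callback mechanism is exposed by awswrangler's other dataset readers, so a filter written once can be reused across formats; for example, `wr.s3.read_json` accepts the same `partition_filter` argument (check the API docs for your version). A hedged sketch, assuming a JSON dataset already written under a hypothetical `json_path`:

# Hypothetical: reuse the same filter on a JSON dataset.
wr.s3.read_json(json_path, dataset=True, partition_filter=my_filter)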