
Regarding pushdown filter predicate #174

Open
ankit11519 opened this issue Mar 14, 2023 · 3 comments

Comments


ankit11519 commented Mar 14, 2023

We are using Apache Spark to connect to DynamoDB via the emr-dynamodb-connector.
Below are my code statements in PySpark:

dynamoDf = spark.read.option("region", "REGION") \
    .option("tableName", "TABLE_NAME") \
    .format("dynamodb") \
    .load()
dynamoDfFilter = dynamoDf.filter((F.col("colFilter").startswith('ABC')) | (F.col("colFilter").startswith('XYZ')))
print(dynamoDfFilter.count())

So, I wanted to know: will these filter conditions also be pushed down to DynamoDB for server-side filtering, or will they be applied only after a full scan has loaded the data into dynamoDf?
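To make the two behaviors being asked about concrete, here is a toy, stdlib-only sketch (illustration only, not connector code): with server-side filtering, DynamoDB applies a FilterExpression and only matching items cross the wire; with client-side filtering, every item is transferred and filtered afterwards. The sample table and `matches` predicate are made up to mirror the `startswith` filter above.

```python
# Toy model of the two filtering strategies in question.
# (Note: even with a server-side FilterExpression, DynamoDB still reads the
# whole table; the filter reduces data transferred, not read capacity used.)

TABLE = [
    {"colFilter": "ABC-1"},
    {"colFilter": "XYZ-2"},
    {"colFilter": "QRS-3"},
]

def matches(item):
    # Mirrors: startswith('ABC') | startswith('XYZ')
    v = item["colFilter"]
    return v.startswith("ABC") or v.startswith("XYZ")

def scan_server_side_filtered(table):
    # DynamoDB evaluates the predicate; only matching items are returned.
    transferred = [item for item in table if matches(item)]
    return transferred, len(transferred)

def scan_then_filter_client_side(table):
    # The connector transfers every item; Spark filters locally afterwards.
    transferred = list(table)
    kept = [item for item in transferred if matches(item)]
    return kept, len(transferred)

server_kept, server_transferred = scan_server_side_filtered(TABLE)
client_kept, client_transferred = scan_then_filter_client_side(TABLE)
# Same result rows either way; only the number of items moved differs.
```

Either path yields identical query results; the cost difference is the volume of data shipped from DynamoDB to Spark.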

@kevnzhao
Contributor

Which connector are you using?
Predicate pushdown is only available for Hive tables in the EMR DDB Hive Connector package. You can find the supported data types and operators here.

@mimaomao, feel free to comment if I missed anything.

@ankit11519
Author

We are using this connector:
[Accessing data in Amazon DynamoDB with Apache Spark]

So, we are accessing Amazon DynamoDB with Apache Spark for a scan operation. My question is: if the query is like this:

dynamoDfFilter = dynamoDf.filter((F.col("colFilter").startswith('ABC')) | (F.col("colFilter").startswith('XYZ')))

print(dynamoDfFilter.count())

When the Spark filter is applied to the scan query, does the request sent to DynamoDB include a FilterExpression for server-side filtering, or does the scan first load the entire table into a DataFrame and then apply the filter to that DataFrame?

@kevnzhao
Contributor

Then you are using the Hadoop connector. Predicate push-down is not available there yet, so all data is loaded from DynamoDB and the filter is applied on the Spark side.
