Replace test_data_quality_at_scale.ipynb #208

Closed
wants to merge 3 commits into from

komashk
Contributor

@komashk komashk commented Jun 17, 2024

Updated the dataset (Amazon product reviews replaced with synthetic data) and added a couple of new examples.

Issues: #207, #209

Description of changes:

Two updates:

  1. For test_data_quality_at_scale.ipynb: updated the tutorial accompanying the blog post "Testing data quality at scale with PyDeequ". The blog post was recently updated and published.
  2. For the other ipynb tutorials (analyzers, profiles, repository, suggestions, verifications): updated the S3 links and added a declaration of the Spark version before the library is loaded.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Updated the dataset (amazon products reviews replaced with a synthetic data), added a couple of new examples

Replaced Amazon Reviews with a synthetically generated reviews dataset; added declaration of the SPARK version
@komashk komashk marked this pull request as draft June 17, 2024 21:15
updated tutorials to use a new dataset
@komashk komashk marked this pull request as ready for review June 17, 2024 21:23
@@ -6,14 +6,115 @@
"source": [
Contributor

@chenliu0831 chenliu0831 Aug 15, 2024

I don't see a reason the same code won't work with pydeequ 1.4.0/Spark 3.5. I think it might be fine to insert something like "Tested on pydeequ 1.2.0/Spark 3.3. Code should run on all supported pydeequ versions".

@@ -6,14 +6,115 @@
"source": [
Contributor

@chenliu0831 chenliu0831 Aug 15, 2024

Line #2.    os.environ["SPARK_VERSION"] = '3.3'

Maybe add a comment that mentions setting this to 3.5 if one uses pydeequ 1.4.0.
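
A minimal sketch of what the commented line might look like, assuming (as the PR description states) that the Spark version has to be declared before the library is loaded. The comment wording is only a suggestion, and the `import pydeequ` line is left commented out because it requires pyspark and pydeequ to be installed:

```python
import os

# Tested on pydeequ 1.2.0 / Spark 3.3.
# Set this to "3.5" if you use pydeequ 1.4.0 with Spark 3.5.
# Declare the version before the library is imported.
os.environ["SPARK_VERSION"] = "3.3"

# import pydeequ  # uncomment once pyspark and pydeequ are installed
```

Keeping the version note next to the assignment means a reader who upgrades pydeequ sees the value to change in the same place.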


@chenliu0831
Contributor

Minor comment - I can approve when addressed.

@komashk
Contributor Author

komashk commented Aug 16, 2024

Created a new pull request #230 to address the comments above.

@komashk komashk closed this Aug 16, 2024
komashk added a commit to komashk/python-deequ that referenced this pull request Aug 21, 2024
chenliu0831 pushed a commit that referenced this pull request Aug 21, 2024
chenliu0831 pushed a commit that referenced this pull request Sep 6, 2024
…rials new data (#233)

* updated the notebooks to use a new synthetic data for demonstration; addressed PR comments #208 and #230

* added notebooks and python module that outline generation of the synthetic data used for 2 AWS blogs on PyDeequ