pdf-processing-1 example updated #998

Open
sujee wants to merge 5 commits into base: dev

Conversation

sujee
Contributor

@sujee sujee commented Jan 29, 2025

Why are these changes needed?

Update the examples/notebook/intro example:

  • use the newer, simpler APIs
  • use v1.0.0
  • update the workflow (pdf --> parquet --> docid --> exact dedupe --> fuzzy dedupe --> doc quality)
  • rename the example to pdf-processing-1
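
To make the stage ordering in the workflow concrete, here is a minimal sketch of two of the stages (docid, then exact dedupe) chained in sequence. It is an illustration only: the functions `assign_doc_ids`, `exact_dedupe`, and `run_pipeline` and the record layout are hypothetical stand-ins, not the Data Prep Kit transforms or their API, and the pdf-to-parquet, fuzzy dedupe, and doc quality stages are left out.

```python
import hashlib


def assign_doc_ids(docs):
    """Hypothetical docid stage: attach a SHA-256 content hash to each record."""
    return [
        {**d, "doc_id": hashlib.sha256(d["text"].encode("utf-8")).hexdigest()}
        for d in docs
    ]


def exact_dedupe(docs):
    """Hypothetical exact-dedupe stage: keep the first record per doc_id."""
    seen = set()
    unique = []
    for d in docs:
        if d["doc_id"] not in seen:
            seen.add(d["doc_id"])
            unique.append(d)
    return unique


def run_pipeline(docs, stages):
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        docs = stage(docs)
    return docs


# Toy records standing in for text already extracted from PDFs.
docs = [
    {"name": "a.pdf", "text": "hello world"},
    {"name": "b.pdf", "text": "hello world"},
    {"name": "c.pdf", "text": "something else"},
]

result = run_pipeline(docs, [assign_doc_ids, exact_dedupe])
print([d["name"] for d in result])  # ['a.pdf', 'c.pdf']
```

The order matters in this sketch because the dedupe stage relies on the `doc_id` field produced by the stage before it.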

Related issue number (if any).

#997

sujee added 5 commits January 21, 2025 00:22
- upgraded to simpler API
- redid the processing flow
- expanded the examples to include doc-quality plugin

Signed-off-by: Sujee Maniyam <[email protected]>
- updated the diagram
- simplified notebook

Signed-off-by: Sujee Maniyam <[email protected]>
Signed-off-by: Sujee Maniyam <[email protected]>
- Updated RAY version to newer/simpler API
- Added a troubleshooting section to README
- Misc cleanups

Signed-off-by: Sujee Maniyam <[email protected]>
@sujee
Contributor Author

sujee commented Jan 29, 2025

BTW, this will be a two-stage merge.
Once this is merged, I will update the URLs to point to the main repo so that the example works on Google Colab, etc.

@shahrokhDaijavad shahrokhDaijavad self-requested a review January 30, 2025 01:35
@shahrokhDaijavad
Member

@sujee An update: I have tested both the Python and Ray notebooks successfully on my local machine. I have also tested the Python notebook successfully on Google Colab, but the Ray version of pdf2parquet fails on Google Colab with the usual message:

ERROR - Exception during execution out of 2 created actors only 0 alive: None

I don't know whether you got the Ray version to work on Colab.

One thing that we did this week, and that I would like you to do for this example, is to move all the files in your input folder to the new location where we want to keep the input data files for all examples. The new folder is here:
https://github.com/IBM/data-prep-kit/tree/dev/examples/data-files/
Please create a folder named pdf-processing-1 in this directory, move the input data files you need for this example into it, and delete the old input folder. Does that make sense?
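
A minimal sketch of that move, assuming it is done with a small Python helper: the function name `move_input_files` is hypothetical, and the source path in the usage comment is a placeholder because the exact location of the old input folder is not stated here.

```python
import shutil
from pathlib import Path


def move_input_files(src_dir, dst_dir):
    """Move every entry from src_dir into dst_dir, then delete the empty src_dir.

    Returns the sorted names of the entries that were moved.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create pdf-processing-1 if missing

    moved = []
    for item in sorted(src.iterdir()):
        shutil.move(str(item), str(dst / item.name))
        moved.append(item.name)

    src.rmdir()  # the old folder is empty at this point
    return moved


# Example call (placeholder source path, run from the repository root):
# move_input_files("path/to/old/input", "examples/data-files/pdf-processing-1")
```

Inside a Git working tree, `git mv` achieves the same result while staging the rename in one step.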
