pdf-processing-1 example updated #998

sujee · 2025-01-29T19:41:45Z

Why are these changes needed?

Update the examples/notebook/intro example:

- use newer, simpler APIs
- use v1.0.0
- update the workflow (pdf --> pq --> docid --> ededupe --> fuzzy dedupe --> doc quality)
- rename the example to pdf-processing-1
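
To illustrate how the stages in the workflow chain together, here is a minimal plain-Python sketch. It does not use the data-prep-kit APIs: the function names (`add_doc_id`, `exact_dedupe`, `run_pipeline`) and the record fields (`filename`, `contents`, `doc_id`) are hypothetical. It covers only the docid and ededupe steps, on the assumption that a document ID can be modeled as a hash of the contents.

```python
import hashlib


def add_doc_id(docs):
    # docid stage (sketch): attach a SHA-256 hash of the contents as the ID.
    return [
        {**d, "doc_id": hashlib.sha256(d["contents"].encode("utf-8")).hexdigest()}
        for d in docs
    ]


def exact_dedupe(docs):
    # ededupe stage (sketch): keep the first document seen for each ID.
    seen = set()
    kept = []
    for d in docs:
        if d["doc_id"] not in seen:
            seen.add(d["doc_id"])
            kept.append(d)
    return kept


def run_pipeline(docs, stages):
    # Feed the output of each stage into the next one, in order.
    for stage in stages:
        docs = stage(docs)
    return docs


# Hypothetical records standing in for the output of the pdf --> pq step.
docs = [
    {"filename": "a.pdf", "contents": "hello world"},
    {"filename": "b.pdf", "contents": "hello world"},
    {"filename": "c.pdf", "contents": "something else"},
]
result = run_pipeline(docs, [add_doc_id, exact_dedupe])
print([d["filename"] for d in result])  # ['a.pdf', 'c.pdf']
```

The order matters in this sketch because exact dedupe relies on the ID assigned by the previous stage.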

Related issue number (if any).

sujee · 2025-01-29T19:49:49Z

BTW, this will be a two-stage merge.
Once this is merged, I will update the URLs to point to the main repo so that it works on Google Colab, etc.

shahrokhDaijavad · 2025-01-31T22:58:17Z

@sujee An update: I have tested both the Python and Ray notebooks successfully on my local machine. I have also tested the Python notebook successfully on Google Colab, but the Ray version of pdf2parquet fails on Google Colab with the usual message:

ERROR - Exception during execution out of 2 created actors only 0 alive: None

I don't know whether you got the Ray version to work on Colab.

One thing that we did this week, and that I would like you to do for this example, is to move all the files in your input folder to the new location where we want to keep the input data files for all examples. The new folder is here:
https://github.com/IBM/data-prep-kit/tree/dev/examples/data-files/
Please create a folder named pdf-processing-1 in this directory, move the input data files you need for this example into it, and delete the old input folder. Does that make sense?
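
A rough sketch of that move, in case it helps. The helper name `move_input_files` is hypothetical, and the source path in the usage comment assumes the example keeps its data in an `input` folder next to the notebook. Inside a git checkout, `git mv` would do the same job while keeping file history.

```python
import shutil
from pathlib import Path


def move_input_files(src_dir, dst_dir):
    """Move every file in src_dir into dst_dir; remove src_dir if it ends up empty."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in sorted(src.iterdir()):
        if f.is_file():
            shutil.move(str(f), str(dst / f.name))
            moved.append(f.name)
    # Delete the old folder only when nothing (e.g. a subfolder) is left in it.
    if not any(src.iterdir()):
        src.rmdir()
    return moved


# Usage (paths are assumptions based on the folder names mentioned above):
# move_input_files(
#     "examples/notebooks/pdf-processing-1/input",
#     "examples/data-files/pdf-processing-1",
# )
```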

sujee added 5 commits January 21, 2025 00:22

examples/notebooks/pdf-processing-1 (afffaa2)
- upgraded to simpler API
- redid the processing flow
- expanded the examples to include doc-quality plugin
Signed-off-by: Sujee Maniyam <[email protected]>

updated example (1af9b09)
- updated the diagram
- simplified notebook
Signed-off-by: Sujee Maniyam <[email protected]>

updated to run on Google colab (b1f6701)
Signed-off-by: Sujee Maniyam <[email protected]>

Updated pdf-processing-example (9a7c830)
- Updated RAY version to newer/simpler API
- Added a troubleshooting section to README
- Misc cleanups
Signed-off-by: Sujee Maniyam <[email protected]>

Merge branch 'dev_upstream' into process-pdf-1 (e2ff148)

shahrokhDaijavad self-requested a review January 30, 2025 01:35
