
Crawlers should support parsing PDF files #16

Open
ekryski opened this issue Mar 18, 2014 · 3 comments
ekryski commented Mar 18, 2014

We could just use the ruby parser that has already been written and try to integrate that into roach as a job.

@ekryski ekryski added this to the Phase 2 milestone Mar 18, 2014
@ADunfield
When you get to this you should probably have a deep look at the Ruby PDF parser we have. I feel like that guy has worked out some serious magic.


bredele commented Apr 6, 2014

@ADunfield I will. @ekryski good point; we could have a Ruby version of a job in order to reuse what's already been done.

@bredele bredele modified the milestones: Phase 1, Phase 2 Apr 7, 2014
@ekryski ekryski modified the milestones: Phase 2, Phase 1 Apr 11, 2014

ekryski commented Apr 11, 2014

We have two Ruby scripts for PDFs:

  1. One that crawls the IAR site and grabs the PDFs, which get sent to our FTP server
  2. One that parses the PDFs and turns them into JSON

What I think we can do is have two jobs:

  • one that triggers/schedules the first script to fetch the PDFs
  • another that watches the directory on the FTP server and runs the Ruby parsing script when the directory changes. It would then take the JSON output and push it through the normal data processing pipeline that roach typically uses (i.e. crawler -> redis -> rabbitMQ).
