
Crawlers should support parsing PDF files #16

Open
ekryski opened this issue Mar 18, 2014 · 3 comments
ekryski commented Mar 18, 2014

We could just use the ruby parser that has already been written and try to integrate that into roach as a job.

@ekryski ekryski added this to the Phase 2 milestone Mar 18, 2014
@ADunfield
When you get to this you should probably have a deep look at the Ruby PDF parser we have. I feel like that guy has worked out some serious magic.


bredele commented Apr 6, 2014

@ADunfield I will. @ekryski good point; we could have a Ruby version of a job in order to reuse what's already been done.

@bredele bredele modified the milestones: Phase 1, Phase 2 Apr 7, 2014
@ekryski ekryski modified the milestones: Phase 2, Phase 1 Apr 11, 2014

ekryski commented Apr 11, 2014

We have two Ruby scripts for PDFs:

  1. One that crawls the IAR site and grabs the PDFs, which get sent to our FTP server
  2. One that parses the PDFs and turns them into JSON

What I think we can do is have two jobs:

  • one that triggers/schedules the first script to fetch the PDFs
  • another that watches the directory on the FTP server and runs the Ruby parsing script when the directory changes. It would then take the JSON output and push it through the normal data processing pipeline that roach typically uses (i.e. crawler -> redis -> rabbitMQ).
