This repository has been archived by the owner on Sep 11, 2019. It is now read-only.

Storage layer #6

Open
wants to merge 9 commits into
base: master

Conversation

Collaborator

stefanw commented Mar 2, 2015

This PR contains a few things:

  • some convenience fixes/features (also submitted separately as PR #5, "Minor features"; I will rebase this branch once that is merged)
  • a sqlite database for tasks and results
  • a way of storing scrape tasks in the db and running them
  • a way of marking tasks as done and storing results of tasks
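To make the storage idea concrete, here is a minimal sketch of a task table in SQLite using only the standard library. The schema, the function names `open_db`, `pending_tasks`, and `mark_done`, and the JSON encoding of arguments are assumptions for illustration; the PR's actual tables and methods may look different.

```python
import json
import sqlite3

# Hypothetical schema: table and column names are illustrative, not the PR's.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    args TEXT NOT NULL,      -- JSON-encoded positional arguments
    kwargs TEXT NOT NULL,    -- JSON-encoded keyword arguments
    done INTEGER NOT NULL DEFAULT 0
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tasks table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_tasks(conn, name, tasks):
    """Persist (args, kwargs) pairs as pending tasks for the named task function."""
    rows = (
        (name, json.dumps(list(args)), json.dumps(kwargs))
        for args, kwargs in tasks
    )
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO tasks (name, args, kwargs) VALUES (?, ?, ?)", rows
        )


def pending_tasks(conn):
    """Return (id, name, args, kwargs) for every task not yet marked done."""
    cur = conn.execute(
        "SELECT id, name, args, kwargs FROM tasks WHERE done = 0 ORDER BY id"
    )
    return [(tid, name, json.loads(a), json.loads(k)) for tid, name, a, k in cur]


def mark_done(conn, task_id):
    """Flag a single task as done so later runs skip it."""
    with conn:
        conn.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
```

Because the tasks live in a file rather than in memory, an interrupted run can be resumed by asking for `pending_tasks` again. The sketch assumes that task arguments are JSON-serialisable.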

This is how I run scrapers:

from scrapekit import Scraper

scraper = Scraper('scraper_name', {'threads': 8})
scraper.log.info('Store Tasks')
# get_tasks is a generator for args, kwargs
scraper.store_tasks('scrape_detail', get_tasks())
scraper.log.info('Process tasks')
# Runs all stored tasks that are not done
scraper.process_tasks()

Scrapers can store results like this:

# Do some scraping
self.write_result(data_dict, ['unique_field'])
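The second argument reads like a list of fields that identify a record, so that re-running a scraper updates a row instead of duplicating it. That is an assumption; the PR does not spell it out. Under that assumption, a standalone sketch with plain `sqlite3` could look like this (the `results` table, its columns, and `open_results` are hypothetical):

```python
import json
import sqlite3


def open_results(path=":memory:"):
    """Open the database and create a hypothetical results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " unique_key TEXT PRIMARY KEY,"
        " data TEXT NOT NULL)"
    )
    return conn


def write_result(conn, data, unique_fields):
    """Insert `data`, replacing any earlier row with the same unique-field values."""
    # The key is derived only from the fields named in unique_fields.
    key = json.dumps([data[field] for field in unique_fields])
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO results (unique_key, data) VALUES (?, ?)",
            (key, json.dumps(data, sort_keys=True)),
        )
```

Storing the record as a JSON blob keeps the sketch schema-free. A real implementation might prefer one column per field so results can be queried directly.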

What this still needs:

  • A way to chunk/stream tasks from the db to the threads without locking the db (currently all tasks must be loaded into memory before they are processed)
  • More expressive marking of task state in the database (e.g. failed)
  • Clearing tasks from the database
  • Documentation
  • Make pipelines work with this flow
  • Clean up, better API
  • Better namespacing on scraper object?
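One possible shape for the first two items, offered as a sketch rather than a proposal for the final API: read pending tasks in small batches with keyset pagination, hand each batch to a thread pool, and record `done` or `failed` in a short write transaction. The `tasks` table with a `state` column, `iter_chunks`, and `run_one` are all hypothetical names. Only the main thread touches the SQLite connection, which sidesteps the `sqlite3` module's default restriction on sharing a connection across threads.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical schema with an explicit state column instead of a done flag.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    state TEXT NOT NULL DEFAULT 'pending'  -- 'pending', 'done' or 'failed'
);
"""


def iter_chunks(conn, chunk_size):
    """Yield lists of (id, payload) for pending tasks, at most chunk_size each."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM tasks "
            "WHERE state = 'pending' AND id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()  # fetchall() completes the read, so no cursor stays open
        if not rows:
            return
        last_id = rows[-1][0]
        yield rows


def run_one(func, task_id, payload):
    """Run a task in a worker thread; report the outcome instead of raising."""
    try:
        func(payload)
        return task_id, "done"
    except Exception:
        return task_id, "failed"


def process_tasks(conn, func, threads=8, chunk_size=100):
    """Process pending tasks chunk by chunk, holding only one chunk in memory."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for chunk in iter_chunks(conn, chunk_size):
            outcomes = list(pool.map(lambda row: run_one(func, *row), chunk))
            with conn:  # short write transaction per chunk
                conn.executemany(
                    "UPDATE tasks SET state = ? WHERE id = ?",
                    [(state, task_id) for task_id, state in outcomes],
                )
```

The trade-off is that the pool idles briefly at the end of each chunk while the slowest task finishes. A queue-based feeder would keep threads busier at the cost of more bookkeeping.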

Nice to have:

  • Some kind of auto-CLI for the scraper that accepts configuration and commands to run scrapers and clear tasks
  • A way to configure the database?
  • Reporting from the database to nice HTML?
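For the auto-CLI idea, a rough sketch with `argparse` from the standard library might look like the following. The command names `run` and `clear`, the `--threads` option, and the `actions` mapping are assumptions chosen for illustration; nothing like this exists in the PR yet.

```python
import argparse


def build_parser():
    """Build a parser with one global option and two hypothetical subcommands."""
    parser = argparse.ArgumentParser(prog="scraper")
    parser.add_argument(
        "--threads", type=int, default=8, help="number of worker threads"
    )
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("run", help="process all stored tasks that are not done")
    sub.add_parser("clear", help="delete stored tasks")
    return parser


def main(argv=None, actions=None):
    """Parse argv and dispatch to the handler registered for the command."""
    args = build_parser().parse_args(argv)
    handler = (actions or {}).get(args.command)
    if handler is None:
        raise SystemExit("no handler registered for %r" % args.command)
    return handler(args)
```

A scraper module could then call `main(actions={"run": ..., "clear": ...})` with callables that wrap its own `process_tasks` and task-clearing logic.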

webmaven commented Jul 3, 2016

@pudo, should this PR be rebased in order to merge it?

Collaborator

pudo commented Jul 3, 2016

@webmaven hi! This repo is unfortunately no longer actively maintained. I just don't have the time to do it, and the value of scrapekit wasn't as great as I'd hoped. If you want to become the new maintainer, please feel free to fork it and let me know when you're ready to do a release; then we can transfer the PyPI registration.

Hope you understand :)

webmaven commented Jul 3, 2016

I understand. As for your offer to become the new maintainer, let me think about it. If I don't get back to you, feel free to ping me.

3 participants