Configuration of harvesting #36

Open
ghost opened this issue Jun 25, 2020 · 3 comments
ghost commented Jun 25, 2020

Optimisation and Configuration
The process of harvesting itself needs to be parameterised so that users can split up the harvesting work as they see fit and assign additional workers to it as they see fit - in addition to any other system settings that might help speed the process of harvesting for particular use-cases. Scheduling should also be configurable.

Can we explore what use cases you're looking to meet here, so we can plan this?

@ghost ghost assigned nickevansuk and thill-odi Jun 25, 2020
@robredpath
Collaborator

Hi @thill-odi @nickevansuk!

I'm looking at the work for our second sprint on the OA Harvester Extension, and this is an outstanding question.

We've got a bit of work to do around making the harvesting part of the system customisable. @odscjames quoted from one of the early spec documents, but that has left quite a lot undefined.

Can you tell us a bit about:

  • Who do you expect to be using the harvester code directly?
  • What do we know about how they might want to slice up the harvesting? Would it be per-publisher? Per-feed? Based on some sort of query/filter that they might apply at the filtering stage and/or from the status service?
  • Are there any constraints around speed that you know about?

@robredpath
Collaborator

I spoke with @thill-odi a couple of weeks ago.

The work for this round of development is around parallelisation: threads, workers, etc.

The anticipated use case is a developer setting up their own application that consumes OA data, and wanting to configure the process to match the resources available to them. We anticipate that they'll already broadly know what they want to do, having (hopefully) used the API and other developer resources to try out their idea.

The expectation is that they'll be competent enough to do any source selection themselves, so there will be no filtering or per-source configuration.
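
As a rough illustration of the kind of parallelisation setting described above, here is a minimal Python sketch. It is not the harvester's actual code or configuration format. The names `HarvestConfig`, `harvest_all`, and the `HARVEST_MAX_WORKERS` environment variable are all hypothetical, and `harvest_feed` stands in for whatever function actually fetches a single feed. The list of feeds is supplied by the developer, in line with the expectation that they do their own source selection.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Mapping

# Assumed default; the real harvester may choose something different.
DEFAULT_MAX_WORKERS = 4


@dataclass(frozen=True)
class HarvestConfig:
    """Hypothetical settings object holding the worker count."""

    max_workers: int = DEFAULT_MAX_WORKERS

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "HarvestConfig":
        # HARVEST_MAX_WORKERS is an assumed variable name, used for illustration only.
        workers = int(env.get("HARVEST_MAX_WORKERS", DEFAULT_MAX_WORKERS))
        if workers < 1:
            raise ValueError("HARVEST_MAX_WORKERS must be at least 1")
        return cls(max_workers=workers)


def harvest_all(
    feed_urls: Iterable[str],
    harvest_feed: Callable[[str], object],
    config: HarvestConfig,
) -> Dict[str, object]:
    """Run harvest_feed over every feed using at most config.max_workers threads."""
    urls = list(feed_urls)
    with ThreadPoolExecutor(max_workers=config.max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        results = list(pool.map(harvest_feed, urls))
    return dict(zip(urls, results))
```

A developer with limited resources might set `HARVEST_MAX_WORKERS=2`, while one with more capacity could raise it. Threads are used here on the assumption that harvesting is mostly network-bound. A process pool or separate worker services would be alternative designs.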
