
Pub speed #21

Open

wants to merge 16 commits into develop
Conversation

cameronneylon
Copy link
Collaborator

I've made a few changes here to enable me to use the codebase for a different project. Alongside this I've introduced a couple of fixes for problems I found.

  1. There are a bunch of new files in the examples folder. These aren't strictly necessary; it's just where I've been working, but they might be useful as examples.
  2. models.py - New fields added to the Article model to enable tracking of submit and accept dates based on PubMed information.
  3. pubspeed-excel report added for the external project.
  4. An OpenAIRE scraper and relevant tests (this was earlier work but should probably be incorporated).
  5. scrapers/pubmed.py - Various additions (see below).

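The submit and accept dates come from the `<History>` block of PubMed's efetch XML, where `PubMedPubDate` elements carry a `PubStatus` of `received` or `accepted`. A minimal stdlib sketch of that extraction (the `history_dates` helper is hypothetical, not the actual scraper code):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<PubmedArticle><PubmedData><History>
  <PubMedPubDate PubStatus="received">
    <Year>2013</Year><Month>4</Month><Day>2</Day>
  </PubMedPubDate>
  <PubMedPubDate PubStatus="accepted">
    <Year>2013</Year><Month>9</Month><Day>18</Day>
  </PubMedPubDate>
</History></PubmedData></PubmedArticle>"""

def history_dates(xml_text):
    # Pull the received/accepted dates out of the PubMed <History> block;
    # these would feed the new submit/accept fields on the Article model.
    root = ET.fromstring(xml_text)
    out = {}
    for d in root.iter("PubMedPubDate"):
        status = d.get("PubStatus")
        if status in ("received", "accepted"):
            out[status] = "%s-%s-%s" % (
                d.findtext("Year"), d.findtext("Month"), d.findtext("Day"))
    return out

print(history_dates(SAMPLE))  # {'received': '2013-4-2', 'accepted': '2013-9-18'}
```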
In scrapers/pubmed.py I made a series of changes.

The first is that PubMed contains invalid dates. I wrapped date parsing in a try/except clause and set the date to unknown if parsing fails. I'm not sure whether this is the best approach, but it worked for what I needed. Line ~184.

I added end_date as a parameter for fetch_batch, included end_date wherever fetch_batch is called, and added code to incorporate it in the search parameters. I don't know whether this matters or not, but I was trying to work around a problem where date searching in PubMed seems to be broken. Lines 160-200.

Modified the Article.create() calls to use Article.create_or_update_by_doi(), because the period batching was duplicating article creation (the date-search issue returns the same article in multiple periods). If there is no DOI it falls back to the old approach.

Need to add final tests to ensure we actually get the right answer; then it would probably be a good idea to refactor the test cases for a cleaner approach and more readable code.

Could still do with cleaning up the test cases and iterating over them in a cleaner fashion, but this works and is giving the right answers.

Modified models.Article to include submit and accept dates, and scrapers.pubmed to obtain those dates from the PubMed XML. Created a modified version of the excel report to dump out results, plus sh and yaml files in examples/pubspeed for testing purposes. Running the loading and report generation from a Python script for convenience. Some modifications are still needed to the processing step, and it's somewhat unclear how the PubMed searching is currently working.