# scrap-science

A collection of Jupyter notebooks to scrape some of the most popular web platforms for scientific papers.

## Setup

- python environment
- chrome driver
- pdf2txt
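
A minimal sanity check for this setup, assuming the notebooks drive Chrome through selenium (the actual packages used may differ):

```python
# Sanity check: can selenium start Chrome via the installed chromedriver?
# (assumes the selenium package; the notebooks may drive Chrome differently)
from selenium import webdriver

driver = webdriver.Chrome()            # chromedriver must be on the PATH
driver.get("https://arxiv.org")
print("loaded:", driver.title)
driver.quit()
```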

## arxiv

1. get all search result urls and their corresponding number of search results
2. dump these into scrap_arvix.ipynb and get a clean, duplicate-free list of urls of individual papers (see the sketch below)
3. get all pdfs
4. scrape the data
5. manual quality check
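
A rough sketch of how steps 1–2 could collect and deduplicate paper urls, assuming requests and BeautifulSoup; scrap_arvix.ipynb may use different selectors or a selenium driver instead:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical search/listing urls collected in step 1.
search_urls = ["https://arxiv.org/list/cs.CV/1706"]

paper_urls = set()                                # a set keeps the list duplicate free
for url in search_urls:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for link in soup.select("a[href*='/abs/']"):  # abstract links identify individual papers
        paper_urls.add("https://arxiv.org" + link["href"])

print(len(paper_urls), "unique paper urls")
```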

## bioRxiv

1. get all search result urls and their corresponding number of search results
2. dump these into scrap_biorxiv.ipynb and get a clean, duplicate-free list of urls of individual papers
3. now scrape; this step includes downloading the pdf too (see the sketch below)
4. manual quality check
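
A minimal sketch of the pdf part of step 3, assuming requests and bioRxiv's usual `<paper_url>.full.pdf` layout; the notebook may fetch the file differently:

```python
import requests

def download_pdf(paper_url, out_path):
    """Download the pdf of one bioRxiv paper (assumes the <paper_url>.full.pdf pattern)."""
    resp = requests.get(paper_url + ".full.pdf", timeout=60)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

# Hypothetical example call.
download_pdf("https://www.biorxiv.org/content/10.1101/000000v1", "000000v1.pdf")
```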

## Pubmed

1. search PubMed and download the results as .csv into the raw_result folder
2. use scrap_pubmed.ipynb to combine all csv's, remove duplicates, and finally scrape it (no pdfs); a combine-and-dedupe sketch follows below
3. save as .csv and do a manual search quality check
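
A sketch of the combine-and-deduplicate part of step 2, assuming pandas; the column used to detect duplicates in scrap_pubmed.ipynb is not specified here, so this drops fully identical rows only:

```python
import glob
import pandas as pd

# Combine every PubMed export in raw_result/ and drop duplicate rows.
frames = [pd.read_csv(path) for path in glob.glob("raw_result/*.csv")]
pubmed = pd.concat(frames, ignore_index=True).drop_duplicates()
pubmed.to_csv("pubmed.csv", index=False)          # hypothetical output filename
print(len(pubmed), "unique records")
```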

## MICCAI

1. get the contents pdfs from Springer
   - 2014 and 2015: get the urls manually
   - 2016 and 2017: the pdfs contain the urls
2. run getMiccaiUrls.py to extract the urls from the pdfs and dump them as a list in a .npy file (see the sketch below)
3. read these in scrap_miccai.ipynb and add the urls hardcoded for 2014 and 2015
4. run scrap_miccai.ipynb (no pdfs)
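
A rough sketch of what step 2 describes, assuming pdfminer.six and numpy; the actual parsing in getMiccaiUrls.py may differ:

```python
import re
import numpy as np
from pdfminer.high_level import extract_text

# Pull every http(s) link out of a Springer contents pdf and save them as a .npy list.
text = extract_text("miccai_2016_contents.pdf")        # hypothetical filename
urls = sorted(set(re.findall(r"https?://\S+", text)))  # deduplicate the found links
np.save("miccai_urls.npy", urls)
```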

## IEEE

1. go to http://ieeexplore.ieee.org/Xplore/home.jsp
2. enter keywords and download the .csv; the link to the search will be in the first row
3. combine and clean the multiple downloaded .csv's using combine_Ieee.ipynb; this produces a single ieee.csv without duplicates
4. run scrap_ieee.ipynb: it first downloads as many pdfs as possible, then loops through the csv and converts each pdf to text to extract emails (a sketch of the email extraction follows below)
5. manual cleanup
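
A minimal sketch of the pdf-to-text and email-extraction part of step 4, assuming pdfminer.six's pdf2txt.py command-line tool (listed under Setup); scrap_ieee.ipynb may call it differently:

```python
import re
import subprocess

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def emails_from_pdf(pdf_path):
    """Convert one downloaded IEEE pdf to text with pdf2txt.py and pull out e-mail addresses."""
    text = subprocess.run(["pdf2txt.py", pdf_path],
                          capture_output=True, text=True, check=True).stdout
    return sorted(set(EMAIL_RE.findall(text)))

print(emails_from_pdf("example_paper.pdf"))       # hypothetical filename
```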