Python Miner: Big Data Publications

This repository contains the scripts that implement part of the methods described in the publications: "". The scripts handle data fetching, preparation, and visualisation. Classification is implemented in R and can be found in the R-contrast-pub repository. The scripts cover the following research steps:

Get initial corpus

PubMed and PubMed Central are searched for articles matching a specific query (esearch). Article data is then fetched (efetch) and stored in a SQLite database. After the fetch, unwanted articles are removed and the remaining articles are cleaned.
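
The search, fetch, and store steps could be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the table layout, and the choice of JSON/XML return modes are assumptions. Only the NCBI E-utilities endpoints (esearch.fcgi, efetch.fcgi) and their `db`, `term`, `id`, `retmax`, and `retmode` parameters are taken from the public E-utilities interface. The sketch builds the request URLs without sending them, so it runs offline.

```python
import sqlite3
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=100):
    """Build an esearch URL that asks for the IDs of records matching `term`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids):
    """Build an efetch URL that asks for the full records of the given IDs."""
    params = {"db": db, "id": ",".join(ids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def store_articles(conn, articles):
    """Store parsed articles (dicts) in a hypothetical `articles` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT, doi TEXT)"
    )
    # INSERT OR IGNORE skips rows whose pmid is already present.
    conn.executemany(
        "INSERT OR IGNORE INTO articles VALUES (:pmid, :title, :abstract, :doi)",
        articles,
    )
    conn.commit()
```

In practice the URLs would be requested with an HTTP client and the efetch XML parsed into dicts before calling `store_articles`; passing `db="pubmed"` or `db="pmc"` selects PubMed or PubMed Central.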

Get matching corpus

PubMed and PubMed Central are searched for matching articles based on journal and publication date range (esearch). Article data is then fetched (efetch) and stored in a SQLite database. After the fetch, unwanted articles are removed and the remaining articles are cleaned.
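
A journal and date-range search can be expressed through esearch parameters, roughly as below. The function name and the example values are hypothetical, and the use of the `[Journal]` field tag together with `datetype`, `mindate`, and `maxdate` is one way to phrase such a query, not necessarily the way the repository does it.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def matching_search_url(db, journal, mindate, maxdate, retmax=100):
    """Build an esearch URL restricted to one journal and a publication date range."""
    params = {
        "db": db,
        "term": f'"{journal}"[Journal]',  # restrict the search to the journal field
        "datetype": "pdat",               # apply the date limits to the publication date
        "mindate": mindate,               # e.g. "2010/01/01"
        "maxdate": maxdate,               # e.g. "2015/12/31"
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"
```

Calling `matching_search_url("pubmed", "Nature", "2010/01/01", "2015/12/31")` yields a URL whose result IDs can then be passed to efetch in the same way as for the initial corpus.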

Remove articles

Unwanted articles are removed from the database by the following criteria:

They have an empty abstract;
Their doctype is listed in the EXCLUDED_DOCTYPES variable in the config;
Their journal ISSN is listed in the EXCLUDED_JOURNALS variable in the config;
They are a duplicate, based on their title (with all symbols removed, regex: [^a-z]);
They are a duplicate, based on their DOI.
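
The removal criteria above can be sketched as a single filter. The values of `EXCLUDED_DOCTYPES` and `EXCLUDED_JOURNALS` below are placeholders, the dictionary keys are hypothetical, and lowercasing the title before applying the `[^a-z]` regex is an assumption. The sketch works on a list of dicts instead of the SQLite database to stay self-contained.

```python
import re

EXCLUDED_DOCTYPES = {"Editorial"}   # placeholder value, the real list lives in the config
EXCLUDED_JOURNALS = {"0000-0000"}   # placeholder ISSN, the real list lives in the config


def normalise_title(title):
    """Lowercase the title and drop every character that is not a-z."""
    return re.sub(r"[^a-z]", "", title.lower())


def remove_unwanted(articles):
    """Return the articles that pass all removal criteria, keeping first occurrences."""
    kept, seen_titles, seen_dois = [], set(), set()
    for art in articles:
        if not (art.get("abstract") or "").strip():
            continue  # empty abstract
        if art.get("doctype") in EXCLUDED_DOCTYPES:
            continue  # excluded document type
        if art.get("issn") in EXCLUDED_JOURNALS:
            continue  # excluded journal
        title_key = normalise_title(art.get("title") or "")
        doi = art.get("doi")
        if title_key in seen_titles or (doi and doi in seen_dois):
            continue  # duplicate by title or by DOI
        seen_titles.add(title_key)
        if doi:
            seen_dois.add(doi)
        kept.append(art)
    return kept
```

Normalising titles this way means that "Big Data: A Review" and "Big data, a review!" count as the same article.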

Cleaning articles

Articles in the database are cleaned by performing the following steps:

Special characters are removed from the article titles and abstracts;
The titles and abstracts are tokenized;
Stopwords are removed from the tokenized titles and abstracts;
(Optional) The tokenized titles and abstracts are stemmed;
Very short and very long tokens, which are unlikely to be real words, are removed.
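
The cleaning steps above could look roughly like this. Everything here is illustrative: the stopword list is a tiny stand-in, `naive_stem` is a crude suffix stripper used in place of a real stemmer, whitespace splitting stands in for the tokenizer, and the length limits `min_len` and `max_len` are assumed values.

```python
import re

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}  # tiny illustrative list


def naive_stem(token):
    """Strip a few common English suffixes (stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def clean_text(text, stem=False, min_len=3, max_len=20):
    """Apply the cleaning steps in order and return the remaining tokens."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())        # remove special characters
    tokens = text.split()                                 # tokenize on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]    # remove stopwords
    if stem:
        tokens = [naive_stem(t) for t in tokens]          # optional stemming
    return [t for t in tokens if min_len <= len(t) <= max_len]  # length filter
```

For example, `clean_text("The analysis of Big-Data in 2015!")` returns `["analysis", "big", "data"]`.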

Preparing datasets

The initial and matching corpora are retrieved. A predetermined number of datasets is created, each combining the complete initial corpus with a random sample from the matching corpus. Each dataset is then vectorized and turned into a feature matrix. Lastly, the matrix and the original dataset are stored as a pickle object.
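
A minimal sketch of this step, using only the standard library, is shown below. The sample size `n_match`, the class labels, the fixed seed, and the plain term-count vectorization are all assumptions made for illustration. The text does not specify how the repository vectorizes documents or how many matching articles it samples.

```python
import pickle
import random
from collections import Counter


def build_dataset(initial, matching, n_match, rng):
    """Combine the full initial corpus with a random sample of the matching corpus."""
    sample = rng.sample(matching, n_match)
    docs = initial + sample
    labels = [1] * len(initial) + [0] * len(sample)  # assumed labels: 1 = initial, 0 = matching
    return docs, labels


def vectorize(docs):
    """Turn tokenized documents into a term-count matrix (rows = docs, columns = vocabulary)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocab, matrix


def make_datasets(initial, matching, n_datasets, n_match, seed=0):
    """Create n_datasets pickled datasets, each with its own random matching sample."""
    rng = random.Random(seed)
    blobs = []
    for _ in range(n_datasets):
        docs, labels = build_dataset(initial, matching, n_match, rng)
        vocab, matrix = vectorize(docs)
        payload = {"docs": docs, "labels": labels, "vocab": vocab, "matrix": matrix}
        blobs.append(pickle.dumps(payload))
    return blobs
```

Each returned item is a pickled byte string that could be written to a file, and `pickle.loads` restores the dataset together with its feature matrix.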

Other

The following scripts were used for various supporting tasks during the research, such as analysing datasets, gathering metadata, and creating figures.

baseline_data.py fetches some baseline metadata about the initial and matching corpora such as word counts and document counts.
word_distribution.py fetches word distribution metadata over the documents in the initial and matching corpora.
doc_word_freqs.py and docs_per_year.py create figures using Matplotlib for the word-to-document frequency and the number of documents per (publication) year, respectively.
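
The kind of numbers these scripts gather can be computed as in the sketch below. It is not taken from the scripts themselves: the function names and the document layout (dicts with hypothetical `tokens` and `year` keys) are assumptions, and plotting is left out so the example needs only the standard library.

```python
from collections import Counter


def baseline_counts(docs):
    """Return the document count and total word count of a corpus."""
    return {"documents": len(docs), "words": sum(len(d["tokens"]) for d in docs)}


def document_frequency(docs):
    """Count, for each word, the number of documents it appears in."""
    df = Counter()
    for d in docs:
        df.update(set(d["tokens"]))  # set() so a word counts once per document
    return df


def docs_per_year(docs):
    """Count the number of documents per publication year."""
    return Counter(d["year"] for d in docs)
```

The resulting counters map directly onto bar or line charts, for example by passing their sorted keys and values to Matplotlib.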
