AMCeScience/python-miner-pub

Python Miner: Big Data Publications

This repository contains the scripts that implement part of the methods described in the publications: "". The scripts handle data fetching, preparation, and visualisation. Classification is implemented in R and lives in the R-contrast-pub repository. The scripts cover the following research steps:

Searching PubMed and PubMed Central for articles with a specific query (esearch). Article data is then fetched (efetch) and stored in a SQLite database. After the fetch, unwanted articles are removed and the remaining articles are cleaned.

Searching PubMed and PubMed Central for matching articles based on journal and publication date range (esearch). Article data is then fetched (efetch) and stored in a SQLite database. After the fetch, unwanted articles are removed and the remaining articles are cleaned.
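
The search, fetch, and store steps can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names (`esearch_url`, `efetch_url`, `store_articles`), the table layout, and the example query are assumptions. Only the NCBI E-utilities endpoints and their `db`, `term`, `id`, `retmax`, `retmode`, `datetype`, `mindate`, and `maxdate` parameters come from the public E-utilities interface. The sketch only builds the request URLs; the HTTP calls and XML parsing are left out.

```python
import sqlite3
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retmax=100, mindate=None, maxdate=None):
    """Build an esearch URL; the date range is optional (publication date)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    if mindate and maxdate:
        params.update({"datetype": "pdat", "mindate": mindate, "maxdate": maxdate})
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids, db="pubmed"):
    """Build an efetch URL for a list of article IDs returned by esearch."""
    params = {"db": db, "id": ",".join(ids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def store_articles(conn, articles):
    """Store parsed articles (dicts) in a hypothetical `articles` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO articles (pmid, title, abstract) "
        "VALUES (:pmid, :title, :abstract)",
        articles,
    )
    conn.commit()


# Example: a journal plus date-range search, as used for the matching corpus.
url = esearch_url("Nature[Journal]", mindate="2010", maxdate="2015")
```

The same `esearch_url` helper covers both the initial query search and the journal and date-range search; only the `term` and the optional date arguments differ.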

Unwanted articles are removed from the database by the following criteria:

  1. They have an empty abstract;
  2. Their doctype is defined in the EXCLUDED_DOCTYPES variable in the config;
  3. Their journal ISSN is defined in the EXCLUDED_JOURNALS variable in the config;
  4. They are a duplicate, based on their title (with all symbols removed, regex: [^a-z]);
  5. They are a duplicate, based on their DOI.
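
The five removal criteria above could be applied in a single pass, as in the sketch below. The dictionary keys, the example values of `EXCLUDED_DOCTYPES` and `EXCLUDED_JOURNALS`, lowercasing before applying the `[^a-z]` regex, and keeping the first occurrence of a duplicate are all assumptions made for illustration; the repository's own scripts may differ.

```python
import re

# Hypothetical config values; the real ones live in the project's config.
EXCLUDED_DOCTYPES = {"Editorial"}
EXCLUDED_JOURNALS = {"0000-0000"}


def normalise_title(title):
    """Lowercase the title and drop everything that is not a-z."""
    return re.sub(r"[^a-z]", "", title.lower())


def filter_articles(articles):
    """Return the articles that survive all five removal criteria."""
    seen_titles, seen_dois = set(), set()
    kept = []
    for article in articles:
        if not (article.get("abstract") or "").strip():
            continue  # 1. empty abstract
        if article.get("doctype") in EXCLUDED_DOCTYPES:
            continue  # 2. excluded doctype
        if article.get("journal_issn") in EXCLUDED_JOURNALS:
            continue  # 3. excluded journal
        title_key = normalise_title(article.get("title") or "")
        if title_key in seen_titles:
            continue  # 4. duplicate title
        doi = (article.get("doi") or "").lower()
        if doi and doi in seen_dois:
            continue  # 5. duplicate DOI
        seen_titles.add(title_key)
        if doi:
            seen_dois.add(doi)
        kept.append(article)
    return kept
```

Articles without a DOI are never treated as DOI duplicates of each other here, which avoids collapsing every DOI-less record into one.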

Articles in the database are cleaned by performing the following steps:

  1. Removing special characters from article titles and abstracts (script)
  2. Tokenizing the titles and abstracts
  3. Removing stopwords from the tokenized titles and abstracts (script)
  4. (Optional) Stemming the tokenized titles and abstracts
  5. Removing very short and very long tokens, which are unlikely to be real words (script)
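
The cleaning steps above can be chained into one function, sketched below. The stopword list, the length bounds `MIN_LEN` and `MAX_LEN`, whitespace tokenization, and the `stem` parameter are illustrative assumptions; the actual scripts may use different lists, thresholds, and tokenizers.

```python
import re

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "of", "and", "a", "in", "is", "for", "to"}
MIN_LEN, MAX_LEN = 3, 20  # hypothetical token length bounds


def clean_text(text, stem=None):
    """Clean one title or abstract and return its tokens."""
    # 1. Remove special characters (anything that is not a letter or space).
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # 2. Tokenize on whitespace.
    tokens = text.split()
    # 3. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Optionally stem each token with the supplied callable.
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    # 5. Drop very short and very long tokens.
    return [t for t in tokens if MIN_LEN <= len(t) <= MAX_LEN]


tokens = clean_text("The analysis of Big-Data (2015) is fun!")
# tokens == ["analysis", "big", "data", "fun"]
```

Passing the stemmer in as a callable keeps step 4 optional; for instance, NLTK's `PorterStemmer().stem` could be supplied if that library is available.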

The initial and matching corpora are retrieved. A predetermined number of datasets is created, each combining the complete initial corpus with a random sample from the matching corpus. Each dataset is then vectorized and turned into a feature matrix. Lastly, the matrix and the original dataset are stored as a pickle object.
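
A minimal sketch of this dataset-creation step is shown below, using only the standard library. The function names, the sample size, the 1/0 labels, and the use of raw term counts as features are assumptions for illustration; the source does not state which vectorizer or weighting the scripts use.

```python
import pickle
import random
from collections import Counter


def vectorize(docs):
    """Turn tokenized documents into a term-count matrix (list of rows)."""
    vocab = sorted({token for doc in docs for token in doc})
    index = {token: i for i, token in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for token, count in Counter(doc).items():
            row[index[token]] = count
        matrix.append(row)
    return vocab, matrix


def make_datasets(initial, matching, n_datasets, sample_size, seed=0):
    """Create n_datasets pickled datasets: full initial corpus + random sample."""
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    blobs = []
    for _ in range(n_datasets):
        sample = rng.sample(matching, sample_size)
        docs = initial + sample
        # Assumed labelling: 1 = initial corpus, 0 = matching corpus.
        labels = [1] * len(initial) + [0] * len(sample)
        vocab, matrix = vectorize(docs)
        blobs.append(pickle.dumps(
            {"docs": docs, "labels": labels, "vocab": vocab, "matrix": matrix}
        ))
    return blobs
```

Each returned item is a bytes object that could be written to a file and later restored with `pickle.loads`; for large corpora a sparse matrix representation would be more practical than nested lists.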

Other

The following scripts were used for various supporting research tasks, for example analysing datasets, gathering metadata, and creating figures.

  1. baseline_data.py fetches some baseline metadata about the initial and matching corpora such as word counts and document counts.
  2. word_distribution.py fetches word distribution metadata over the documents in the initial and matching corpora.
  3. doc_word_freqs.py and docs_per_year.py use Matplotlib to create figures of the word-to-document frequency and the number of documents per (publication) year, respectively.
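
The kind of metadata these scripts gather can be computed with a few counters, as in the sketch below. The `corpus_stats` function and the `tokens` and `year` keys are hypothetical and do not reflect the actual scripts; plotting is left out, but the resulting counters could be passed to Matplotlib for the figures.

```python
from collections import Counter


def corpus_stats(docs):
    """Compute baseline metadata for a corpus.

    Each document is assumed to be a dict with a 'tokens' list and a 'year'.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc["tokens"]))
    return {
        "document_count": len(docs),
        "word_count": sum(len(doc["tokens"]) for doc in docs),
        "doc_freq": doc_freq,
        "docs_per_year": Counter(doc["year"] for doc in docs),
    }
```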
