Welcome to the repository dedicated to the exploration of current trends in science using machine learning and unsupervised learning techniques. In this project, we aim to provide a comprehensive quantitative overview of current topics in science based on a large archive of scientific papers. The goal is to identify homogeneous clusters and present reduced-dimensional views that capture the main traits of the dataset.
As data scientists, our objective is to investigate current topics in science and determine areas where advanced academic cooperation could be beneficial. We have access to a large archive of scientific papers, and the task involves preprocessing the data, exploring descriptive statistics, and applying unsupervised learning techniques to reveal trends.
- Explore the available data with descriptive statistics and exploratory visualizations.
- Design a plan for data preprocessing and feature engineering.
- Apply unsupervised learning techniques to identify homogeneous clusters.
- Provide reduced-dimensional views capturing the main traits of the dataset.
- Critically assess decisions made during each step and iterate as necessary.
- Keep the concepts and techniques learned during the course in mind.
For this use case, we have been given access to an archive of recently published scientific works. The dataset is available from the following webpage: arXiv Dataset
Exploring the topics in the title and abstract.
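A minimal sketch of how the title and abstract text can be vectorized with TF-IDF for topic exploration; the sample size and the exact preprocessing are assumptions for illustration, not the precise code from the notebooks.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Read a sample of the line-delimited arXiv metadata snapshot
# (sample size is an assumption for illustration).
df = pd.read_json("arXiv/arxiv-metadata-oai-snapshot.json", lines=True, nrows=10_000)

# Combine title and abstract into one text field per article.
texts = df["title"].fillna("") + " " + df["abstract"].fillna("")

# TF-IDF vectorization with English stop words removed and a capped vocabulary.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20_000)
tfidf = vectorizer.fit_transform(texts)

# Top terms over the sample, ranked by summed TF-IDF weight.
weights = tfidf.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
top_terms = sorted(zip(terms, weights), key=lambda t: t[1], reverse=True)[:20]
print(top_terms)
```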
Cluster scores are computed using different metrics.
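As a hedged illustration, the sketch below scores k-means solutions over a range of cluster counts with three common metrics (silhouette, Calinski-Harabasz, Davies-Bouldin); the random placeholder matrix `X` stands in for the real document features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

# Placeholder feature matrix; in practice this would be the
# (dimensionality-reduced) TF-IDF features of the articles.
rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 50))

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }

for k, s in scores.items():
    print(k, s)
```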
Scatter plot of the top two PCA components, using the identified optimal number of clusters.
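A possible way to produce such a plot, assuming a feature matrix `X` and an already chosen cluster count `k_best` (both placeholders here):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Placeholder inputs: document features and an assumed optimal cluster count.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 50))
k_best = 5

labels = KMeans(n_clusters=k_best, n_init=10, random_state=0).fit_predict(X)

# Project onto the top two principal components for a 2-D view.
coords = PCA(n_components=2).fit_transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
plt.xlabel("PCA component 1")
plt.ylabel("PCA component 2")
plt.title("Clusters in the top-2 PCA projection")
plt.show()
```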
Articles are grouped by month, and the dominant cluster label for each month is identified to track how the themes change over time.
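A small pandas sketch of this monthly grouping, using a toy DataFrame in place of the real article table:

```python
import pandas as pd

# Toy input: one row per article with its publication date and cluster label.
articles = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-11", "2021-02-15"]),
    "cluster": [3, 3, 1, 1],
})

# Group by calendar month and take the most frequent cluster label per month.
monthly_theme = (
    articles
    .assign(month=articles["date"].dt.to_period("M"))
    .groupby("month")["cluster"]
    .agg(lambda s: s.mode().iloc[0])
)
print(monthly_theme)
```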
A word cloud is created for every cluster from its most frequent category terms, which are mapped to their respective category names.
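One way this could look with the `wordcloud` package, using toy cluster labels, category lists, and a partial code-to-name mapping as stand-ins for the real data:

```python
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Toy inputs: per-article cluster labels and arXiv category codes,
# plus a partial mapping from category codes to readable names.
clusters = [0, 0, 1, 1, 1]
categories = [["cs.LG", "stat.ML"], ["cs.LG"], ["astro-ph.GA"], ["astro-ph.GA"], ["hep-th"]]
category_names = {
    "cs.LG": "Machine Learning",
    "stat.ML": "Statistics ML",
    "astro-ph.GA": "Galaxy Astrophysics",
    "hep-th": "High Energy Theory",
}

for cluster_id in sorted(set(clusters)):
    # Count category occurrences within this cluster and map codes to names.
    counts = Counter(
        category_names.get(cat, cat)
        for cl, cats in zip(clusters, categories) if cl == cluster_id
        for cat in cats
    )
    wc = WordCloud(width=600, height=400).generate_from_frequencies(counts)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Cluster {cluster_id}")
plt.show()
```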
- Clone the repo.
- Download the arxiv-metadata-oai-snapshot.json (3.97 GB) from arXiv Dataset and place it in the arXiv folder.
- Create a virtual environment using `python -m venv venv` and activate it using `source venv/bin/activate`.
- Install the Python modules using `pip install -r requirements.txt`.
- Run all cells in each notebook (`1_eda`, `2_preprocessing`, `3_tfidf_scores`, `4_classification`).
The notebooks that use Dask Distributed to process the complete dataset require at least 50 GB of RAM. This project was run on an AMD Ryzen 7 5800H (16 threads) with Radeon graphics, 32 GB RAM, and 64 GB of swap memory.
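For orientation, a minimal Dask Distributed sketch for streaming the snapshot; the worker counts, memory limits, and selected fields are assumptions to be adapted to your machine, not the notebooks' exact configuration.

```python
import json

import dask.bag as db
from dask.distributed import Client, LocalCluster

# Hypothetical local cluster sized for a machine like the one above;
# adjust n_workers / memory_limit to your hardware.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="12GB")
client = Client(cluster)

# Read the line-delimited JSON snapshot in parallel blocks and keep only
# the fields needed for the analysis.
records = (
    db.read_text("arXiv/arxiv-metadata-oai-snapshot.json", blocksize="64MB")
      .map(json.loads)
      .map(lambda r: {
          "id": r["id"],
          "title": r["title"],
          "abstract": r["abstract"],
          "categories": r["categories"],
      })
)
print(records.take(1))

client.close()
cluster.close()
```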