Development repo of an end-to-end Data Science project for extracting insights from newspaper articles via a Text Mining toolkit:
- LinearSVC Text Classifier with 82% accuracy for predicting 6 labels (from Politics to Sports), available via a containerized API
- XGBoost Classifier with 92% accuracy for predicting 6 subject labels generated in an unsupervised manner
The main goal of this project is to build a production-ready Text Classifier wrapped inside an API, in order to showcase how I personally approach Data Science problems, specifically when it comes to Text Mining.
To do so, I follow the steps below:
- **Data Extraction**: extract newspaper articles using available APIs (notebooks)
- **Data Processing**: clean the data before diving in (notebooks)
- **Exploratory Data Analysis (EDA)**: draw insights via Sentiment Analysis (notebooks)
- **Modeling**: build a supervised Text Classifier and cluster articles using GloVe Word2Vec (notebooks); a minimal sketch follows this list
- **Productization**: deploy the Text Classifier via an API endpoint (api)
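
The snippet below is a minimal sketch of the Modeling step: TF-IDF features feeding a LinearSVC classifier, evaluated on a held-out split. It is illustrative only; the dataset path and column names (`data/processed/articles.csv`, `text`, `label`) are assumptions, not the actual project code.

```python
# Minimal sketch of the supervised Modeling step (illustrative only).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical processed dataset with article text and one of 6 labels
df = pd.read_csv("data/processed/articles.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# TF-IDF features feeding a LinearSVC classifier
clf = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True, min_df=5,
                              ngram_range=(1, 2), stop_words="english")),
    ("svc", LinearSVC()),
])
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```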
All technologies involved in this project are listed below, roughly in the order they appear in the notebooks.
- **Feature Extraction**: TF-IDF Vectorizer and RegEx
- **Supervised Learning**: LinearSVC, Logistic Regression, Multinomial Naive Bayes, Random Forest and XGBoost Classifiers
- **Clustering**: DBSCAN and K-Means
- **Transfer Learning**: GloVe Word2Vec
- **Dimensionality Reduction**: t-SNE
- **Statistical Testing**: Chi-Squared test for correlations between words and labels
- **Model Selection**: Train/Test Split and K-Fold Cross-Validation
- **API**: Data Extraction from NYT, Sentiment Analysis with TextBlob, Productization / Deployment with FastAPI (a minimal sketch follows this list)
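
To give an idea of the Productization / Deployment piece, here is a minimal sketch of a FastAPI service wrapping the trained classifier together with TextBlob sentiment analysis. The model filename, endpoint path and payload schema are assumptions for illustration; see the `api` folder for the actual implementation.

```python
# Minimal sketch of the containerized API (illustrative; filenames and
# endpoint schema are assumptions, not the project's actual API).
import joblib
from fastapi import FastAPI
from pydantic import BaseModel
from textblob import TextBlob

app = FastAPI(title="Newspaper Text Classifier")
model = joblib.load("models/linear_svc_pipeline.joblib")  # hypothetical filename

class Article(BaseModel):
    text: str

@app.post("/predict")
def predict(article: Article):
    # Predicted subject label plus TextBlob polarity/subjectivity scores
    label = str(model.predict([article.text])[0])
    sentiment = TextBlob(article.text).sentiment
    return {
        "label": label,
        "polarity": sentiment.polarity,
        "subjectivity": sentiment.subjectivity,
    }
```

Such a service could be started locally with `uvicorn main:app` (module name assumed) or via the docker-compose setup, and queried by POSTing an article text to `/predict`.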
Main libraries and frameworks: scikit-learn, xgboost, textblob, fastapi, pandas, numpy, seaborn, matplotlib, docker, docker-compose
- python >= 3.6
- conda: Miniconda Python 3
- Create the conda environment based on the yml file:

  `conda env create -f environment.yml`

- Activate the conda environment:

  `conda activate ds-env`

- Run the notebooks as you wish
├── README.md <- The top-level README for scientists and engineers.
│
├── api <- API folder with containerized Text Classifier
│
├── data <- Data folder (versioned in the cloud, not with git)
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── models <- Trained and serialized models (versioned in the cloud, not with git)
│
├── notebooks <- Notebooks folder
│
├── credentials <- Required credentials stored locally (excluded from the repo for security reasons)
│
├── reports <- Code-free, stakeholder-ready reports such as markdown files
│ ├── figures <- Graphics and figures to be used in reporting
│ └── data <- Output data generated by models or analyses
│
├── environment.yml <- The conda env file for reproducing the environment
│
├── setup.py <- Makes 'src' installable (pip install -e .) so it can be imported
│
└── src <- Reusable Python code
├── __init__.py <- Makes src a Python module
└── ... <- Further modules as work in progress
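
Because `setup.py` makes `src` installable, reusable code can be imported from the notebooks or the API after an editable install (`pip install -e .`). The module and function names below are hypothetical placeholders, shown only to illustrate the pattern.

```python
# Hypothetical usage from a notebook after `pip install -e .`;
# module and function names are placeholders, not actual project code.
from src import preprocessing  # hypothetical text-cleaning module

cleaned = preprocessing.clean_text("Raw article text with markup and noise")
```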