Executive Summary

This data science project aimed to build the best-performing model for classifying individual Reddit posts as belonging to either Showerthoughts or CasualConversation. Using machine learning algorithms on historical post data from both subreddits, the project produced four models that the social media startup can use and modify to build its personality app, which classifies people as either happy or sad based on text data. The best performer of the four, the Logistic Regression model, achieved a 79 percent accuracy rate in distinguishing between Showerthoughts and CasualConversation, saving the company time and resources in creating the backbone of its personality app.

Data Collection and Description

PRAW method for collecting subreddit data
Data Dictionary

Variable	Data Type	Value Count	Description
title	object	2365	title of individual post in subreddit
selftext	object	1370	text within individual post in subreddit
subreddit	int64	2365	1 - Showerthoughts and 0 - CasualConversation
created_utc	float64	2365	unix timestamp
name	object	2365	username of individual in Reddit
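Since created_utc is stored as a Unix timestamp (seconds since 1970-01-01 UTC), it usually needs converting before it can be read or plotted. The following is a minimal sketch using only the standard library; the helper name `utc_from_timestamp` and the sample value are illustrative and do not come from the project notebooks.

```python
from datetime import datetime, timezone


def utc_from_timestamp(created_utc: float) -> datetime:
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


# Illustrative value only, not a timestamp from the collected dataset.
example = utc_from_timestamp(1700000000.0)
print(example.isoformat())
```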

Data Visualizations

Model Preprocessing Procedure

Supervised Machine Learning
Natural Language Processing

***cvec stands for CountVectorizer ***tvec stands for TfidfVectorizer

Target: subreddit
Feature: title
CountVectorizer and TfidfVectorizer Transformers
Models: Logistic Regression, Bernoulli Naive Bayes, Multinomial Naive Bayes, Random Forest
train_test_split: test_size=0.25, random_state=42, stratify=y
'cvec__max_features' : [1500, 3000, 5000],
'cvec__min_df' : [0.03, 0.05],
'cvec__max_df': [0.85, 0.9],
'cvec__ngram_range' : [(1, 1), (1, 2)]
'tvec__max_features' : [1500, 3000, 5000],
'tvec__min_df' : [0.03, 0.05],
'tvec__max_df': [0.85, 0.9],
'tvec__ngram_range' : [(1, 1), (1, 2)]
Set GridSearchCV
Set cross validation to 5
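The settings listed above can be sketched as a scikit-learn pipeline wrapped in a grid search. This is a sketch under stated assumptions rather than the notebooks' actual code: the helper names `build_search` and `split_titles`, the step name `logreg`, and the `max_iter=1000` setting are assumptions, and only the Logistic Regression model is shown.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Maps the prefixes used in the parameter names to their transformer classes.
VECTORIZERS = {"cvec": CountVectorizer, "tvec": TfidfVectorizer}


def split_titles(titles, labels):
    """Stratified 75/25 split with the settings listed above."""
    return train_test_split(
        titles, labels, test_size=0.25, random_state=42, stratify=labels
    )


def build_search(prefix="cvec"):
    """Return an unfitted 5-fold grid search for the chosen vectorizer prefix."""
    param_grid = {
        f"{prefix}__max_features": [1500, 3000, 5000],
        f"{prefix}__min_df": [0.03, 0.05],
        f"{prefix}__max_df": [0.85, 0.9],
        f"{prefix}__ngram_range": [(1, 1), (1, 2)],
    }
    pipe = Pipeline([
        (prefix, VECTORIZERS[prefix]()),
        # Assumption: default Logistic Regression settings apart from max_iter.
        ("logreg", LogisticRegression(max_iter=1000)),
    ])
    return GridSearchCV(pipe, param_grid=param_grid, cv=5)
```

Calling `build_search("tvec").fit(X_train, y_train)` on the output of `split_titles` would then try all 24 parameter combinations with 5-fold cross-validation and refit the best pipeline on the full training set.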

***second iteration only

custom_stop_words = ["people", "like", "life", "one", "get", "never", "anyone", "day", "make", "else"]
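As a rough illustration of what the custom list does, the sketch below passes it to CountVectorizer through the `stop_words` parameter and checks which tokens survive. It assumes the list is used on its own; the notebooks may instead merge it with a built-in English stop word list, and the two sample titles are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

custom_stop_words = ["people", "like", "life", "one", "get", "never",
                     "anyone", "day", "make", "else"]

# Assumption: the custom list is the only stop word list applied.
cvec = CountVectorizer(stop_words=custom_stop_words)

# Hypothetical titles, not taken from the collected posts.
cvec.fit(["people never get a day off", "showers make thoughts better"])

# Tokens that remain after stop word removal, in alphabetical order.
vocabulary = sorted(cvec.vocabulary_)
print(vocabulary)
```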

Models' ROC-AUC and Test Accuracy

Summary of Test and ROC Scores for Each Model for second iteration:

***CV stands for CountVectorizer ***TV stands for TfidfVectorizer

Model	CV Test Score	CV ROC Score	TV Test Score	TV ROC Score
Logistic Regression	0.785472972972973	0.785453216374269	0.7820945945945946	0.7811586257309941
Bernoulli	0.7432432432432432	0.744517543859649	0.7432432432432432	0.744517543859649
Multinomial	0.6942567567567568	0.6911549707602339	0.6976351351351351	0.6937134502923976
Random Forest	0.7820945945945946	0.7813413742690059	0.7736486486486487	0.7723866959064327

Summary of Test and ROC Scores for each Model for first iteration:

Model	CV Test Score	CV ROC Score	TV Test Score	TV ROC Score
Logistic Regression	0.7905405405405406	0.7902046783625731	0.7820945945945946	0.7808845029239766
Bernoulli	0.7533783783783784	0.7546600877192983	0.7533783783783784	0.7546600877192983
Multinomial	0.7010135135135135	0.6977339181286549	0.6959459459459459	0.6921600877192982
Random Forest	0.785472972972973	0.7840826023391813	0.7804054054054054	0.7792397660818713
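In both tables the ROC score sits very close to the test score. One possible explanation, which the source does not confirm, is that ROC-AUC was computed from hard 0/1 predictions rather than predicted probabilities; in that case ROC-AUC reduces to balanced accuracy, which is nearly equal to plain accuracy when the classes are close to balanced. The sketch below demonstrates this in plain Python. The class sizes and correct-prediction counts are illustrative values chosen because they are consistent with the first-iteration Logistic Regression CountVectorizer row; they are not read from the notebooks.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the chance a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative counts (assumed, not taken from the notebooks).
n_pos, n_neg = 304, 288   # test posts per class
tp, tn = 244, 224         # correctly classified posts per class

y_true = [1] * n_pos + [0] * n_neg
y_pred = [1] * tp + [0] * (n_pos - tp) + [0] * tn + [1] * (n_neg - tn)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced_accuracy = (tp / n_pos + tn / n_neg) / 2
auc_from_labels = pairwise_auc(y_true, y_pred)  # AUC fed hard labels, not probabilities

print(accuracy, balanced_accuracy, auc_from_labels)
```

If this is what happened, scoring the output of `predict_proba` instead of `predict` would give a ROC-AUC that reflects how well the model ranks posts, and the two columns would no longer track each other so closely.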

Recommended Model: Benefit and Drawback

Based on the highest ROC-AUC and test scores across the two iterations, we recommend the Logistic Regression model with the CountVectorizer transformer from the first iteration for the social media startup's team to replicate and modify for their personality app. The model achieved a ROC-AUC score of 0.79 in distinguishing between Showerthoughts and CasualConversation, and it correctly predicted the subreddit of unseen test posts 79 percent of the time.

However, one drawback of the recommended model is that it shows signs of overfitting, since its training score was higher than its test score. Overfitting in this context may mean that the model is overly confident in its predictions for the majority class, Showerthoughts, making predictions that are overly specific to the training data and that may not generalize well to new posts.

Next Steps

The following next steps would continue to improve the recommended model and investigate other findings from the four models:

experiment with Logistic Regression's hyperparameters, such as adjusting C, to reduce the model's overfitting
experiment with the other models' hyperparameters, such as alpha and the minimum samples per leaf, to reduce overfitting
investigate why the test and ROC scores are generally so similar to one another
investigate why the Bernoulli model is the only one that was better at identifying CasualConversation than Showerthoughts
investigate why Logistic Regression's test score with TfidfVectorizer stayed the same across both iterations
collect more unique subreddit posts to increase the diversity of words that our models can train on
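For the first of these steps, one way to see how C affects overfitting is to compare training and test accuracy across several values. In scikit-learn, a smaller C means stronger regularization. The sketch below is a hypothetical helper, not code from the notebooks: the name `train_test_gap_by_C`, the candidate C values, and the default CountVectorizer settings are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def train_test_gap_by_C(X_train, y_train, X_test, y_test,
                        C_values=(0.01, 0.1, 1.0, 10.0)):
    """Fit one pipeline per C and return (C, train accuracy, test accuracy) tuples."""
    results = []
    for C in C_values:
        pipe = Pipeline([
            ("cvec", CountVectorizer()),
            # Smaller C applies a stronger L2 penalty to the coefficients.
            ("logreg", LogisticRegression(C=C, max_iter=1000)),
        ])
        pipe.fit(X_train, y_train)
        results.append(
            (C, pipe.score(X_train, y_train), pipe.score(X_test, y_test))
        )
    return results
```

A value of C where the training score drops toward the test score without the test score falling would be a reasonable candidate to add to the grid search.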

