layout | title | permalink | newlink |
---|---|---|---|
page | Data Science | /datasci/ | /datasci/ |
This page showcases a portfolio of Data Science projects I've done.
We use satellite imagery, soil data, and climate data to map the worldwide extent of cropland irrigation and how it changed from 2001 to 2015.
I worked on this even after graduation, and a revised version is under review for publication.
In collaboration with Dept of Environmental Studies, UC Berkeley.
Version 2 | Version 1 | Code (all public)
Which schools in New York need the most help to improve diversity in the SHSAT exam?
The SHSAT is a competitive entrance exam for specialized high schools in NYC. Its demographics are currently overwhelmingly White and Asian. We identify schools with Black and Latino populations that show potential but need help.
KNN, Random Forests, Logistic Regression, and Perceptron Neural Network.
Report (public)
Can we detect fake news using machine learning?
We try both classical techniques and recurrent neural networks. Within the classical realm, we use "AutoML", which searches for the best-performing machine learning algorithm. Our best accuracy was 81%.
Report (public)
We train convolutional neural networks to recognize landmarks in Paris.
We train the networks on an NVIDIA GPU, using both transfer learning (bottlenecks) and retraining of pre-trained models.
MobileNet, Inception v3, AlexNet, VGG16.
The best performing model (MobileNet) had 95% accuracy.
Report Presentation (public)
A statistical field experiment to estimate a causal effect, using a difference-in-differences, within-subject design. We recruited about 150 subjects on Amazon MTurk. We measured focus via response time and correctness under both "relaxing" and "busy-work" distractions. One of our claims was statistically significant at the 95% confidence level.
Report (public, dataset available upon request)
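To make the difference-in-differences idea concrete, here is a minimal sketch with made-up numbers. It is not the experiment's data or analysis: the function name `did_estimate`, the two-group pre/post layout, and the response times are all hypothetical, and a real within-subject analysis would also need standard errors.

```python
from statistics import mean

def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Change in the treated condition minus change in the control condition."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical response times in seconds (illustrative only).
treat_pre = [10.0, 12.0, 11.0]
treat_post = [14.0, 15.0, 16.0]
control_pre = [10.0, 11.0, 12.0]
control_post = [11.0, 12.0, 13.0]

effect = did_estimate(treat_pre, treat_post, control_pre, control_post)
print(f"Estimated effect: {effect:.1f} s")
```

Subtracting the control change removes any shift that affects both conditions equally, such as subjects getting faster with practice.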
The following projects are private in accordance with Berkeley's academic policy. However, I can make them available to non-students on an individual basis upon request.
Will this user click on this ad?
Online advertising is big business. Ad companies make money only when a user clicks on an ad, so they try hard to show ads that a user is likely to click.
We run logistic regression on a large, anonymized dataset and get 76% accuracy with an AUC of 0.58.
Report (private, available upon request)
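Accuracy and AUC measure different things, so a fairly high accuracy can sit next to an AUC close to chance (0.5), especially when one class is rare. The sketch below shows how both metrics can be computed by hand. The labels and scores are toy values invented for illustration, not the project's data, and the helper names `accuracy` and `auc` are hypothetical.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 6 non-clicks, 2 clicks, and a model that never scores above 0.5.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.25, 0.42]

acc = accuracy(labels, scores)
area = auc(labels, scores)
print(f"accuracy={acc:.2f}, AUC={area:.2f}")
```

Here the model predicts "no click" for everyone, which is right for every non-click, yet it ranks clicks above non-clicks only slightly better than a coin flip.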
Does government regulation improve Internet connectivity?
We explore whether regulation can lower costs, improve speed, or provide more people with the Internet.
Exploratory data analysis with the R statistical language.
Report (private, available upon request)
We analyze whether factors such as police density, the chance of conviction, and the proportion of young men or minorities affect crime rates.
Linear regression with R.
Report (private, available upon request)
We use word counts to build a text classifier. We try word sequences (n-grams) and word frequency scores (TF-IDF) as features for logistic regression. We classified text into the correct newsgroup 77% of the time.
Natural language processing: CountVectorizer, TfidfVectorizer. Classifier: logistic regression.
Report (private, available upon request)
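As a rough illustration of the n-gram and TF-IDF features mentioned above, here is a small standard-library sketch. It is not the project's code, which used scikit-learn vectorizers. The function names `ngrams` and `tfidf`, the whitespace tokenizer, the particular smoothed idf formula, and the three example documents are assumptions made for this example. The classifier step is omitted.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n=1):
    """Return one {term: weight} dict per document: raw count times a smoothed idf."""
    tokenized = [ngrams(d.lower().split(), n) for d in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: cnt * idf[t] for t, cnt in Counter(toks).items()} for toks in tokenized]

# Hypothetical mini-corpus (illustrative only).
docs = ["the rocket launch", "the hockey game", "the game tonight"]
weights = tfidf(docs)
bigram_weights = tfidf(docs, n=2)
print(weights[0])
```

Terms that appear in every document, such as "the", get the lowest weight, while terms unique to one document get the highest.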
How accurately can we recognize digits written by hand? We use classical machine learning algorithms. We also check if blurring the pictures can improve accuracy.
k-nearest neighbor, naive Bayes, and Gaussian NB.
Report (private, available upon request)
Based on a variety of properties, we check if a mushroom is poisonous. We reduce the complexity of data and group mushrooms into poisonous and non-poisonous categories.
Principal component analysis (PCA), k-means clustering, Gaussian mixture models
Report (private, available upon request)
We run naive Bayes at scale with old-school Hadoop MapReduce on the Enron dataset.
Notebook (private, available upon request)
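The counting step of naive Bayes fits the map and reduce pattern naturally. The sketch below simulates that pattern in plain Python on a single machine. It is not Hadoop code and not the project's notebook: the `mapper` and `reducer` names, the "spam"/"ham" labels, and the example messages are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def mapper(label, text):
    """Emit ((label, word), 1) for every word in one document."""
    for word in text.lower().split():
        yield (label, word), 1

def reducer(pairs):
    """Sum the counts for each key, as the shuffle-and-reduce phase would."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Hypothetical labeled messages (illustrative only).
docs = [("spam", "win money now"), ("ham", "meeting at noon"), ("spam", "win a prize now")]
counts = reducer(chain.from_iterable(mapper(label, text) for label, text in docs))
print(counts[("spam", "win")])
```

The per-class word counts are the sufficient statistics for naive Bayes, so the class-conditional word probabilities can be derived from `counts` afterwards.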
We use Spark and document-similarity metrics to detect synonyms on Google's n-gram corpus.
Notebook (private, available upon request)
Is it acidity? SO2? Density? Alcohol? Something else?
We use OLS, ridge and lasso regressions to determine features that predict wine quality.
Notebook (private, available upon request)
PageRank was the original Google search algorithm.
We implement the PageRank graph algorithm on a large Wikipedia dataset with Spark.
Notebook (private, available upon request)
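As a rough illustration of the PageRank iteration, here is a single-machine sketch in plain Python. It is not the Spark implementation from the notebook. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, the handling of dangling pages, and the four-page graph are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by dangling pages (no out-links) is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Every page splits its damped rank equally among the pages it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical link graph (illustrative only).
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page "C" ends up on top because three pages link to it, and "D" ends up last because nothing links to it.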
We implement a "bag-of-words" model, running on a neural network, to analyze sentiment on the Stanford Sentiment TreeBank. It cannot beat a simple naive-Bayes model.
Notebook (private, available upon request)
We implement n-gram language models. We first build smoothed models with scikit-learn, then scale up to a recurrent neural network in TensorFlow.
Notebooks (private, available upon request)
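For readers unfamiliar with smoothed n-gram models, the following is a minimal bigram sketch with add-k smoothing in plain Python. It is not the notebooks' code: the function names `train_bigram` and `prob`, the choice of add-k smoothing, and the three training sentences are assumptions made for this example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams, padding each sentence with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    # Words the model can predict: every training word plus the end marker.
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return contexts, bigrams, vocab

def prob(word, context, model, k=1.0):
    """Add-k smoothed estimate of P(word | context)."""
    contexts, bigrams, vocab = model
    return (bigrams[(context, word)] + k) / (contexts[context] + k * len(vocab))

# Hypothetical training sentences (illustrative only).
model = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print(prob("cat", "the", model))  # seen twice after "the"
print(prob("ran", "the", model))  # never seen after "the", still non-zero
```

Smoothing matters because an unsmoothed model assigns probability zero to any sentence containing a bigram it has not seen.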
We use the Viterbi algorithm with a "hidden Markov model" (HMM) to tag each word in an English sentence with its part of speech.
Notebook (private, available upon request)
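The Viterbi decoding step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the notebook's code: the function name `viterbi`, the three-tag set, and every probability in the toy HMM are invented. A practical tagger would work in log space to avoid underflow on long sentences.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under an HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] * trans_p[p][t])
            score = best[-1][prev][0] * trans_p[prev][t] * emit_p[t].get(word, 0.0)
            column[t] = (score, prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical toy HMM (all probabilities are illustrative only).
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "walk": 0.2, "barks": 0.3},
    "VERB": {"barks": 0.6, "walk": 0.4},
}

tagged = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
print(tagged)
```

The word "barks" can be a noun or a verb in this toy model, and the transition probabilities are what tip the decision toward a verb after a noun.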
Statistical models say yes: given the temperature at the time of launch, it was likely to fail.
Logistic regression, binomial and binary. Likelihood-ratio tests (LRTs). Profile and Wald intervals. Bootstrap.
R Notebook (private, available upon request)
Sugar and sodium were statistically significant, and sweeter cereal boxes were likely sitting on the bottom shelves.
Multinomial logistic regression. Odds ratios.
R Notebook (private, available upon request)
We develop a statistical time-series model.
Seasonal autoregressive integrated moving average (SARIMA). Our best model was ARIMA(0,1,0)(1,1,0) with a seasonal period of 4.
R Notebook (private, available upon request)
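Written out, and assuming the standard ARIMA(p,d,q)(P,D,Q) notation with seasonal period 4, no constant term, and $B$ as the backshift operator ($B y_t = y_{t-1}$), the model above corresponds to the following equation. The symbols $\Phi_1$ and $\varepsilon_t$ are the usual textbook names for the seasonal autoregressive coefficient and the white-noise error, not values taken from the notebook.

$$
(1 - \Phi_1 B^{4})\,(1 - B)\,(1 - B^{4})\, y_t = \varepsilon_t
$$

In words: the series is differenced once at lag 1 and once at lag 4, and the result follows a first-order autoregression at the seasonal lag.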
Speed limits, seat belts, blood alcohol limits: do these reduce fatalities? (Yes.)
Panel data models with pooled data, fixed effects and random effects.
R Notebook (private, available upon request)
Page under construction. More projects on the way!