DataScienceExamples

Examples of data science and data analysis topics, drawn from a variety of data sources.

Most of this repository will likely take the form of Jupyter notebooks. I hope to include links to the data for every notebook so that others can run them.

Some of the tools that I'm using so far in this repository:

Pandas
NumPy
SciPy
scikit-learn
Matplotlib
XGBoost
NLTK
Gensim
PySpark

I also hope to add examples using other tools such as:

Keras
TensorFlow/Theano
Seaborn

Current notebooks:

Basics:

Analyzing the UCI ML breast cancer data. Some topics covered: Logistic regression, PCA, RFE, L1 regularization, learning curves, validation curves, imbalanced data, train/test splitting, pipelines, ROC curves, Precision-Recall curves.
Analyzing the UCI ML adult/census income data. Some topics covered: Label encoding, grid search and randomized search CV, decision trees, random forests, AdaBoost, XGBoost.
Analyzing some out-of-copyright text in order to predict the author. Some topics covered: Basic text preparation, stop words, stemming/lemmatization, LSA, LDA, random forests, Naive Bayes classification, NLTK, Gensim.
Basic introduction to PySpark. Some basic text manipulations using the text of Moby Dick. Topics covered: map(), flatMap(), filter(), reduce(), reduceByKey(), sortBy(), sortByKey(), SparkContext, reading text into an RDD.
Movie recommendations with Spark. Topics covered: Alternating least squares, DataFrames, cross validation, model evaluation, parameter grid search.
Analyzing racial demographics and neighborhood income in New York City. Topics covered: Spark SQL, reading from a database, Linear regression and K-Means in Spark.
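
For readers who have not used Spark, the RDD operations named in the PySpark introduction notebook (flatMap(), filter(), map(), reduceByKey(), sortBy()) can be previewed without a cluster. The sketch below is not taken from the notebook and does not use PySpark at all: it imitates each step with the Python standard library on a few made-up lines of text. The sample lines, the stop word set, and the `reduce_by_key` helper are all assumptions for illustration.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

# Hypothetical sample text; the notebook itself reads Moby Dick into an RDD.
lines = [
    "Call me Ishmael",
    "call me when the whale surfaces",
    "the whale",
]

# flatMap: split every line into words and flatten into one sequence.
words = list(chain.from_iterable(line.lower().split() for line in lines))

# filter: drop a tiny, made-up stop word list.
stop_words = {"the", "when"}
kept = [w for w in words if w not in stop_words]

# map: pair each word with a count of 1.
pairs = [(w, 1) for w in kept]


def reduce_by_key(func, key_value_pairs):
    """Plain-Python stand-in for reduceByKey: combine values that share a key."""
    grouped = defaultdict(list)
    for key, value in key_value_pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# reduceByKey: sum the counts for each distinct word.
counts = reduce_by_key(lambda a, b: a + b, pairs)

# sortBy: order by descending count, then alphabetically to break ties.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

In Spark the same chain would run lazily and in parallel across partitions; here every step is an ordinary in-memory list or dict, which makes the data flow easy to inspect.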

NYC Tree Census Data:

Looking at how species of trees are spread throughout the city. Some topics covered: Linear regression, visualization, joining different data sets, feature engineering.
Looking at the distributions of trees in different neighborhoods. Some topics covered: Linear regression, nonlinear regression, k-fold cross validation, error estimation, minimization/optimization.
Two analyses: (1) Predicting borough based on tree data. (2) Predicting income based on neighborhood trees. Some topics covered: Naive Bayes classifiers, count vectorization, decision tree regression, SVD.
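
As a rough illustration of two of the topics listed above, linear regression and k-fold cross validation, here is a minimal sketch that uses only the Python standard library. It is not code from the notebooks, which may well rely on scikit-learn for this, and it runs on synthetic data rather than the tree census. The `fit_line` and `k_fold_mse` helpers, the fold assignment scheme, and the noise level are all assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k=5, seed=0):
    """Return the mean squared error measured on each of k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores


# Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

scores = k_fold_mse(xs, ys, k=5)
print("per-fold MSE:", scores)
print("mean MSE:", sum(scores) / len(scores))
```

Averaging the per-fold errors gives an estimate of how the fitted line performs on data it has not seen, and the spread across folds gives a sense of how stable that estimate is.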

lopez86/DataScienceExamples

Folders and files

Latest commit

History

Repository files navigation

DataScienceExamples

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages