Jan 3, 2016


What are some common approaches for dealing with missing data?

Many different approaches exist for dealing with missing values; I'd roughly categorize our options into a) deletion and b) imputation techniques.

a) Deletion

- If we have a lot of training samples and can afford to delete some of them, we can simply remove the samples with missing feature values from the dataset entirely.
- If we have a large number of feature columns, some of them redundant, and relatively many samples are missing a value in a certain column, it may be a good idea to remove those feature columns entirely.
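
As a rough illustration of the two deletion strategies, here is a minimal sketch using pandas. The toy DataFrame, its column names, and the 50% cutoff for "relatively many" missing values are assumptions made up for this example, not recommendations.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); NaN marks a missing value.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [5.0, np.nan, np.nan, 8.0],
    "c": [9.0, 10.0, 11.0, 12.0],
})

# Option 1: drop every row (sample) that has at least one missing value.
rows_dropped = df.dropna(axis=0)

# Option 2: drop every column (feature) that has at least one missing value.
cols_dropped = df.dropna(axis=1)

# Option 3: drop only the columns in which at least half of the values
# are missing (the 0.5 cutoff is an arbitrary choice for this sketch).
mostly_missing = df.columns[df.isna().mean() >= 0.5]
sparse_cols_dropped = df.drop(columns=mostly_missing)
```

On this toy data, row-wise deletion keeps two of the four samples, while the threshold-based column deletion removes only column `b`. Which trade-off is acceptable depends on how much data we can afford to lose.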

b) Imputation

If we can't afford deleting data points, we could use imputation techniques to "guess" placeholder values from the remaining data points.

- The simplest imputation technique may be to replace a missing feature value with the mean (or median or mode) of its feature column.
- Instead of using the mean of the whole column, we can compute the mean (or median or mode) from only the k nearest neighbors of this data point. We identify the neighbors based on the remaining feature columns that don't have missing values.
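
Both imputation ideas can be sketched in a few lines of NumPy. The helper names `impute_column_mean` and `impute_knn_mean`, the Euclidean distance metric, and the toy array are assumptions for illustration only; this is a simplified sketch rather than a reference implementation.

```python
import numpy as np

def impute_column_mean(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)      # per-column mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def impute_knn_mean(X, k=2):
    """Replace each NaN with the mean of that column over the k nearest rows.

    Neighbors are found via Euclidean distance (an assumption of this sketch)
    computed on the columns that have no missing values in any row. Only rows
    with an observed value in the target column are candidates.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete_cols = ~np.isnan(X).any(axis=0)
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = np.where(~np.isnan(X[:, j]))[0]
        diffs = X[candidates][:, complete_cols] - X[i, complete_cols]
        dists = np.linalg.norm(diffs, axis=1)
        nearest = candidates[np.argsort(dists)[:k]]
        out[i, j] = X[nearest, j].mean()
    return out

# Toy data (made up): the third row is missing its last feature,
# and the fourth row is far away from the other three.
X = np.array([
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 12.0],
    [0.9, 1.2, np.nan],
    [5.0, 5.0, 50.0],
])

X_mean = impute_column_mean(X)
X_knn = impute_knn_mean(X, k=2)
```

Here the column mean fills the gap with 24.0 because the distant fourth row pulls the average up, whereas the 2-nearest-neighbor mean uses only the first two rows and fills in 11.0. Swapping `.mean()` for a median or mode would give the other variants mentioned above.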