Overfitting happens when the learning is affected by noise: the performance on the test set is way worse than that on the training set. A hypothesis $h$ overfits the training data if there exists another hypothesis $h'$ such that
$$ \begin{aligned} \operatorname{error}_{\text{train}}(h) &< \operatorname{error}_{\text{train}}(h') \\ \operatorname{error}_{\mathcal{E}}(h) &> \operatorname{error}_{\mathcal{E}}(h') \end{aligned} $$
Overfitting can happen because of noise or lack of representative instances.
We can distinguish 4 types of data in normal datasets:
- Categorical
  - Nominal: non-orderable $\rightarrow$ allows any one-to-one correspondence
  - Ordinal: orderable $\rightarrow$ allows any order-preserving transformation
- Numerical
  - Interval: having meaningful differences, but only + and - $\rightarrow$ allows linear function transformations
  - Ratio: having a definition of 0 and allowing mathematical operations $\rightarrow$ allows any mathematical function, like standardization
We can cite 4 main problems in datasets: noise/outliers, missing values, duplicates, inconsistencies.
To compute errors meaningfully, we should split the dataset into a training set (used to generate the model) and a test set, used to compute the test set error. We can add a validation set, which allows us to evaluate the model during tuning (for example after pruning or hyperparameter tuning), before the final evaluation on the test set.
We define the training/test and training/validation/test splits as Holdout. An alternative is represented by Cross Validation (k-fold), which randomly partitions the dataset into $k$ folds, trains on $k-1$ of them and tests on the remaining one, rotating the test fold and averaging the resulting errors.
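A minimal sketch of both splitting strategies with SciKit Learn (the iris dataset and the decision tree are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: a single train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# k-fold cross validation: k rotating train/test partitions, errors averaged
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print(scores.mean())
```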
Python was not designed for numeric computation. NumPy was created to provide a high-performance multidimensional array library. A NumPy array is a collection of same-type elements stored in a contiguous memory block, while a list is just an array of pointers to objects. NumPy splits tasks into chunks and processes them in parallel, and it is implemented in C. For example, when used to compute the dot product of two vectors, it is about 10 times faster.
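A small sketch of that dot-product comparison (actual speedups depend on the machine):

```python
import numpy as np

n = 1_000_000
a = list(range(n))
b = list(range(n))

# Pure-Python dot product: loops over pointers to int objects
dot_py = sum(x * y for x, y in zip(a, b))

# NumPy dot product: contiguous typed arrays, vectorized C code
va = np.arange(n, dtype=np.float64)
vb = np.arange(n, dtype=np.float64)
dot_np = va @ vb
```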
Pandas was born to deal with huge datasets, performing data manipulation and statistical analysis. We can import a CSV dataset using `pd.read_csv()` and then use the `.head()` method to display the first 5 rows. Each DataFrame has columns. Note that when you're importing a dataset having no column names, you'll have to use `pd.read_csv(..., header=None)`. You can read a column using `df['column_name']` or `df.column_name`.
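A minimal sketch of these calls (the file name and column names are hypothetical):

```python
import pandas as pd

# CSV without a header row: supply the column names ourselves
df = pd.read_csv("iris.csv", header=None,
                 names=["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"])

print(df.head())        # first 5 rows
print(df.columns)       # column labels of the DataFrame
print(df["class"])      # read a column by name
print(df.sepal_len)     # attribute-style access to the same column
```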
TensorFlow is a deep learning library created by Google, providing a high-level API, primitives and tools for building and training neural networks. Its main API is in Python, while the core is implemented in C++. Tensors are multilinear maps from vector spaces to the real numbers. TF is actually similar to NumPy, but offers automatic differentiation and tensor functions. A nice summary of TF is available here. A practical cheatsheet is found here.
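A minimal sketch of the automatic-differentiation feature mentioned above (TensorFlow 2.x API):

```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x        # y = x^2 + 2x
dy_dx = tape.gradient(y, x)     # derivative computed automatically: 2x + 2 = 8.0
print(dy_dx.numpy())
```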
Classification is usually intended as a supervised task. It can be crisp (an individual gets assigned a single label) or probabilistic (an individual gets assigned a vector of probabilities). Using contingency tables, we can compare attributes in 2 or 3 dimensions to check the correlation between them.
When evaluating on a test set, we should account for the statistical variability of the relationship between the sample and the true error: the measured error rate is only an estimate, so we should report it together with a confidence interval at a chosen confidence level.
To measure performance, several measures are available (a code sketch follows the list):
- Precision: $\frac{TP}{TP+FP}$, i.e. the percentage of true positives over all the predicted positives
- Recall: $\frac{TP}{TP+FN}$, i.e. the percentage of detected positives over all the positives
- Specificity: $\frac{TN}{TN+FP}$, i.e. the percentage of detected negatives over all the negatives (the recall of the negative class)
- Accuracy: the weighted sum of sensitivity and specificity, or just the percentage of correct predictions
- F1: harmonic mean of precision and recall, $F1 = 2\frac{\text{prec} \cdot \text{rec}}{\text{prec} + \text{rec}}$
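As referenced above, these measures are all available in SciKit Learn; a tiny example on made-up labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
```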
We can use these measures for multiclass too, in the micro, macro and weighted flavors.
In case the class distribution is skewed, it's interesting to compare the performance against a random classifier: this is done with the $\kappa$ (kappa) statistic, measuring how much better the classifier is than chance agreement.
With probabilistic classifiers, we can generate the Lift chart and the ROC curve, telling us how the classifier behaves when we change the threshold probability.
When building a lift chart, we assume we're trying to get the positives out of a dataset. We compare the number of positives we obtain using our classifier to rank the test set against the number we would get by randomly sampling.
The ROC curve shows how decreasing the threshold makes FP and TP increase, meaning we have good performance when TP increases faster than FP. Decreasing the threshold increases the recall and decreases the precision.
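A small sketch of how the ROC curve is obtained from probabilistic scores (made-up labels and scores):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_scores))              # area under the ROC curve
```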
To perform multiclass classification using binary classifiers, we have two strategies: we could use a one-vs-one strategy, building a classifier for every pair of classes, testing them all and aggregating the results, or a one-vs-rest strategy, in which we have one classifier per class, each estimating the probability that an instance belongs to that class.
Ensemble methods use multiple classifiers and make them vote. This should reduce the error rate, provided the classifiers are independent: if the errors are uncorrelated, the ensemble is wrong only when the majority of the classifiers is. We have several methods: bagging (uniform sampling with replacement), boosting (focusing on examples which are hard to classify), and AdaBoost (where the importance of each base classifier depends on its error rate).
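A quick sketch of bagging and AdaBoost with SciKit Learn (iris is just a placeholder dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: many trees trained on bootstrap samples of the data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)
# AdaBoost: weak learners trained sequentially, reweighting hard examples
boosting = AdaBoostClassifier(n_estimators=50)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```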
Gradient boosting is based on the idea of additive modeling: we add a bunch of simple terms together to create a more complicated expression. Boosting is a strategy that combines multiple simple models into a single composite model: by combining these, the overall model becomes stronger. At every step, our overall model becomes $F_{m}(x)=F_{m-1}(x)+\eta\,\Delta_{m}(x)$, in which we tweak the previous model with a new weak learner $\Delta_{m}$ fitted to the residual errors, scaled by a learning rate $\eta$.
AdaBoost, one of the first boosting methods, uses Decision Stumps (a decision stump is a simple classifier that splits the data into two groups by testing a single feature against a threshold). Two weight vectors are used: one for the classifiers, one for the examples. During training, difficult examples get higher weight. After training, each classifier is weighted according to how many examples it correctly classified: a classifier with 50% accuracy is weighted zero, and a classifier below that is weighted negatively.
- In a Decision Tree, the inner nodes contain a test that directs to either subtree, while the leaves predict a class.
- There are several pros: they are explainable, they do not make any assumptions on the distribution of the data, and they handle collinearity efficiently
- There are several cons too: they have high variance and can create complex models, they are sensitive to outliers, and they are prone to errors in many-class classification
- When generating the tree, if all the elements belong to a class $c$ we generate a leaf, otherwise we choose a test based on a single attribute and generate an inner node
- We could use specific entropy to decide the attribute to test (see the sketch after this list)
  - Entropy tells us how similar the class probabilities are: high entropy means the probabilities are mostly similar: $H(X)=-\sum_{j} p_{j} \log_{2}(p_{j})$
  - Specific entropy $H(Y \mid X=v)$ is just the entropy computed considering only the rows in which $X=v$
  - We can then introduce Information Gain, measuring the amount of insight that $X$ provides to forecast $Y$: $IG(Y \mid X)=H(Y)-H(Y \mid X)$
  - The amount of Information Gain for one random variable $(Y)$ is the decrease of uncertainty given observations from other existing variables $(X)$
  - This algorithm is known as ID3
- Other available tests exploit the Gini Index and the Misclassification Error
- We can prune DTs after the generation, using a validation set, statistical pruning (applying statistical tests on nodes), or the Minimum Description Length principle, measuring the complexity of the encoding compared with the misclassifications.
- The Minimum Description Length principle is based on the intuition that by transmitting the exceptions together with a theory, we are transmitting the whole data. Formally, naming $L(T)$ the length of the encoding of $T$ (in bits) and $L(\mathcal{E} \mid T)$ the length of the encoding of the exceptions (in bits), we want to minimize the total encoding $L(T)+L(\mathcal{E} \mid T)$.
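As referenced in the list above, a small sketch of entropy and Information Gain on a toy table (column names and values are made up):

```python
import numpy as np
import pandas as pd

def entropy(series):
    """H(X) = -sum_j p_j log2(p_j) over the value frequencies of the column."""
    p = series.value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

def information_gain(df, target, attribute):
    """IG(Y|X) = H(Y) - H(Y|X), with H(Y|X) weighted by the frequency of each value of X."""
    h_y = entropy(df[target])
    h_y_given_x = sum((len(g) / len(df)) * entropy(g[target])
                      for _, g in df.groupby(attribute))
    return h_y - h_y_given_x

toy = pd.DataFrame({"outlook": ["sunny", "sunny", "rain", "rain"],
                    "play":    ["no",    "no",    "yes",  "yes"]})
print(information_gain(toy, "play", "outlook"))  # 1.0: outlook fully determines play
```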
Naive Bayes is a classifier based on statistics: it considers the contribution of all attributes, assuming they're independent given the class. Basically, given a vector of evidence values $E=(e_{1}, \ldots, e_{n})$, it predicts the class $c$ that maximizes $P(c) \prod_{i} P(e_{i} \mid c)$.
Note that if some evidence value never occurs together with a class, i.e. its estimated conditional probability is zero, the whole product becomes zero; to avoid this, we smooth the frequencies.
Smoothed frequency (m-estimate): $P(e \mid c)=\frac{n_{c}+m p}{n+m}$, where $n$ is the number of examples of class $c$, $n_{c}$ the number of those having evidence value $e$, $p$ a prior estimate of the probability and $m$ the equivalent sample size.
To use numeric values, we can use the Gaussian distribution, computing mean and variance of each attribute per class, then using the density $P(e \mid c)=\frac{1}{\sqrt{2 \pi \sigma_{c}^{2}}} e^{-\frac{(e-\mu_{c})^{2}}{2 \sigma_{c}^{2}}}$ as the conditional probability.
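A minimal sketch of the Gaussian variant with SciKit Learn (iris as a placeholder dataset):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)        # estimates per-class mean and variance of each feature
print(nb.predict_proba(X[:1]))     # posterior probability of each class for the first example
```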
The perceptron is also known as artificial neuron, being just a linear combination of weighted inputs followed by a threshold. We start with a zero vector of weights, then at every iteration, if an example is misclassified, we update the weights by adding the example scaled by its label and a learning rate: $\mathbf{w} \leftarrow \mathbf{w}+\eta\, y\, \mathbf{x}$.
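A minimal NumPy sketch of that update rule (the toy data and the helper name are made up):

```python
import numpy as np

def train_perceptron(X, y, epochs=10, lr=1.0):
    """Labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi        # move the separating hyperplane towards xi
                b += lr * yi
    return w, b

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
```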
If the data are not linearly separable, the boundary between classes is some more complex hyper-surface. We could use kernel functions to approximate non-linear functions. We want to find the maximum margin hyperplane: while the linear perceptron accepts any hyperplane that separates the classes, the maximum margin hyperplane gives the largest separation between the classes. To overcome non-linearity of boundaries, we can map the data into a new space called the feature space. Using kernel functions we're able to avoid computing complex dot products explicitly.
K-Nearest Neighbours is one of the simplest models: it keeps all the training data, and future predictions just compute the similarity between the input and the training data. We pick the $k$ most similar training examples and predict the majority class among them (or the average value, for regression).
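A quick sketch with SciKit Learn (iris as a placeholder dataset):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # "training" just stores the data
print(knn.predict(X[:1]))                             # majority class among the 5 nearest points
```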
See the Statistics chapter for more.
Unlike linear regression, logistic regression is used to solve classification problems. As classification problems should output a class probability, we introduce the sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$, where $z$ is the linear combination of the inputs and the weights, $z=\mathbf{w}^{T} \mathbf{x}+b$.
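A tiny sketch of that formula (the weights, bias and input are arbitrary example values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class for an input x, given weights w and bias b
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([1.5, 2.0])
p = sigmoid(w @ x + b)   # value in (0, 1), thresholded at 0.5 for a crisp prediction
```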
This is a non-parametric, Bayesian approach to regression. It's a probabilistic model, where the predictions are computed by sampling from the posterior distribution.
Let's first introduce a Bayesian approach. It basically assumes a prior distribution over the model parameters, $p(w)$, which is then updated using the observed data through Bayes' rule, obtaining the posterior $p(w \mid y, X)$.
The updated distribution encodes information both from the dataset and the prior distribution. To then obtain predictions, the predictive distribution is computed by weighting all the possible predictions by their calculated posterior:
$$ p\left(f^{} \mid x^{}, y, X\right)=\int_{w} p\left(f^{} \mid x^{}, w\right) p(w \mid y, X) d w $$
Now, GPR is non-parametric: rather than computing the probability distribution of the parameters, it calculates the probability distribution over all admissible functions that fit the data.
Given a set of observed points, a Gaussian Process is specified by a mean function and a covariance (kernel) function, and the function values at any finite set of inputs follow a multivariate Gaussian distribution; predictions therefore come with both a mean and an uncertainty estimate.
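A minimal sketch with SciKit Learn's GPR implementation (the 1D sine data is just a toy example):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 20).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gpr.fit(X, y)
mean, std = gpr.predict([[5.5]], return_std=True)   # predictive mean and uncertainty
```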
As clustering is unsupervised, we need indices to measure the properties of the clusters.
- We can talk about cohesion, being the proximity of objects in the same cluster.
  - This is just the sum of the proximities to the geometric center (either the centroid, or the medoid, the latter being an element of the dataset): $\operatorname{Coh}\left(k_{i}\right)=\sum_{x \in k_{i}} \operatorname{Prox}\left(x, \mathbf{c}_{i}\right)$
- And separation between two clusters, measuring either the distance between the two closest datapoints, the furthest two, or the centroids
- The global separation of a clustering scheme is addressed with the Sum of Squares Between clusters SSB: $\mathrm{SSB}=\sum_{i=1}^{K} N_{i} \operatorname{Dist}\left(\mathbf{c}_{i}, \mathbf{c}\right)^{2}$, where $\mathbf{c}$ is the global centroid of the dataset
- Note that the Total Sum of Squares TSS is just the sum of the SSE and the SSB
- The silhouette index represents the ratio $s_{i}=\frac{b_{i}-a_{i}}{\max \left(a_{i}, b_{i}\right)} \in[-1,1]$, where $a_i$ is the average distance of object $i$ from the other objects of its own cluster, and $b_i$ is the minimum, over the other clusters, of the average distance from their objects
We then have some supervised measures for when we know the actual clusters (aka the Gold Standard):
- The Rand Index
- The Jaccard Coefficient
In K-means, we start by deciding a number $K$ of clusters and choosing $K$ initial centroids; we then repeatedly assign each point to its nearest centroid and recompute the centroids, until the assignments no longer change.
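A quick sketch combining K-means with the validation indices above (iris as a placeholder dataset with a known gold standard):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(silhouette_score(X, labels))       # unsupervised index: cohesion vs separation
print(adjusted_rand_score(y, labels))    # supervised index: agreement with the gold standard
```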
This generates a nested structure of clusters, being either agglomerative (start with data points as clusters and unify) or divisive (start with a single cluster and divide).
For agglomerative clustering, we repeatedly find the least separated pair of clusters and merge it. This has quadratic space complexity (the proximity matrix must be stored) and at least quadratic time complexity.
We can then cut the dendrogram at the desired depth. The horizontal axis of the dendrogram represents the total dissimilarity inside the clusters. The diameter of a cluster is just the distance among the most separated objects.
We might want clusters to be high-density regions separated by low-density regions. To compute density, we could either use grid-based methods or object-centered methods. Now, a cluster is a maximal set of points connected by density, i.e. points linked by other points in their neighbourhoods.
An alternative to DBSCAN is Kernel Density Estimation, which describes the distribution of the data by a function and computes the density at each point as the sum of the kernel functions associated with the data points.
Here, we estimate the parameters of a statistical model to maximize the ability of the model to explain the data. The main technique uses mixture models, i.e. viewing the data as a set of observations from a mixture of different probability distributions. Using Expectation Maximization, we iteratively compute the probability that each object belongs to each distribution, then find the new estimates of the parameters that maximize the expected likelihood.
To reduce the complexity of models and computation, we should try to reduce the input features. The significance of attributes can vary, as we may have irrelevant attributes or redundancies.
We have to distinguish between unsupervised and supervised methods (considering the relationship between attribute and class, like filter methods).
Filter methods are based on general characteristics of the data, selecting a subset of attributes independently from the mining model that will be used. Among these, we cite Pearson's correlation, LDA, ANOVA, Chi-Square.
SciKit Learn offers many implementations for the different scoring methods, usable in combination with `SelectKBest` and `SelectPercentile`:
```python
from sklearn.feature_selection import SelectKBest, r_regression

# define feature selection (X, y are the feature matrix and target of your dataset)
fs = SelectKBest(score_func=r_regression, k=10)
# apply feature selection
X_selected = fs.fit_transform(X, y)
```
Note that we're using `r_regression`, a scoring method based on Pearson correlation, while for classification problems (numerical input, categorical output) we could use `f_classif`, based on the ANOVA F measure. Another linear scoring method is represented by `f_regression`, based on the F-statistic and p-value.
The chi-squared test checks whether the observed frequencies of a categorical variable match its expected frequencies. Using contingency tables, which show the number of co-occurrences between attribute values and classes, we can compute the chi-squared statistic for each attribute.
This is implemented in SciKit Learn as `SelectKBest(chi2)`.
Wrapper methods try training models using subsets of features, iteratively adding/removing features. The difference between filter and wrapper methods is that the former work on the data, measuring the relevance of attributes by their correlation with the dependent variable, while the latter are algorithm-based. Filter methods are way faster, but less informative.
Principal Component Analysis PCA maps the whole dataset into a new space with fewer attributes. It finds an ordered set of dimensions, such that the first one captures most of the variability. Notice that this is not always what you want: for classification, it can backfire easily and should be avoided.
Recursive Feature Elimination ranks feature using an external estimator, considering smaller and smaller sets of features.
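A minimal sketch of RFE with SciKit Learn (the estimator and dataset are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_reduced = rfe.fit_transform(X, y)
print(rfe.ranking_)   # rank 1 marks the selected features
```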
SVD is one of the most popular methods for dimensionality reduction in sparse datasets. It is a linear algebra method, used to find the most relevant features in a dataset. It can be thought of as a projection method where data is projected onto a lower-dimensional space, while preserving the essence of the original data. This is implemented in SciKit Learn:
```python
from sklearn.decomposition import TruncatedSVD

# define transform
svd = TruncatedSVD()
# prepare transform on dataset
svd.fit(data)
# apply transform to dataset
transformed = svd.transform(data)
```
Transfer learning consists in storing knowledge gained while solving a problem and applying it to a different but related problem. We pick a source task with a huge dataset, develop a model for it, then use that model to initialize the training of the new one. It gives us a higher start, a higher slope, and a higher asymptote.
Few-shot learning is somewhat similar to transfer learning. Its main focus is meta-learning, or learning how to learn. It tries to build a similarity model on a bigger dataset, allowing the model to predict whether two datapoints are similar or not, then uses this to build a classifier on a really small dataset.
We have seen multiple approaches to hyperparameter optimization (a code sketch of the first two follows the list):
- Grid search: simplest method, we simply try all the possible combinations. It is computationally expensive.
- Random search: we randomly sample the hyperparameters space, and we try to find the best combination. It is computationally cheap, and stoppable whenever we want (so we can tune the accuracy we need by setting the iterations)
- Bayesian optimization: we use a Bayesian method to optimize the hyperparameters. Applied to hyperparameter optimization, Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating it, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum.
- Gradient based: usable for certain optimizers only, we compute the gradient with respect to the hyperparameters and try to optimize it
- Evolutionary algorithms: we use evolutionary computing to perform reproduction and mutation based on the fitness of the individuals (parameter sets).
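As referenced above, a minimal sketch of grid and random search with SciKit Learn (the estimator and parameter grid are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
params = {"max_depth": [2, 3, 5, None], "min_samples_split": [2, 5, 10]}

# Grid search: tries every combination
grid = GridSearchCV(DecisionTreeClassifier(), params, cv=5).fit(X, y)

# Random search: samples a fixed number of combinations
rand = RandomizedSearchCV(DecisionTreeClassifier(), params, n_iter=6, cv=5).fit(X, y)

print(grid.best_params_, rand.best_params_)
```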
- Feature leakage: some features may inadvertently contain label information, resulting in a model that scores much higher in tests than in production.
- Hit Ratio: metric used for recommender systems, if we have 1 test element and a Hit Ratio @10, it will be positive if the test element is in the first 10 ranked elements
- Multicollinearity: features are correlated with each other, so the model cannot disentangle their individual contributions to the target.
Probabilistic losses
- BinaryCrossentropy
- CategoricalCrossentropy
- SparseCategoricalCrossentropy
- Poisson
- KLDivergence
Regression losses
- MeanSquaredError
- MeanAbsoluteError
- MeanAbsolutePercentageError
- MeanSquaredLogarithmicError
- CosineSimilarity
- Huber
- LogCosh
Hinge losses for "maximum-margin" classification
- Hinge
- SquaredHinge
- CategoricalHinge
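These are the loss classes exposed by Keras under `tf.keras.losses`; a minimal sketch of plugging one of them into a model (the tiny model itself is just an illustration):

```python
import tensorflow as tf

# Binary classifier compiled with one of the probabilistic losses listed above
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=["accuracy"])
```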