Performance
This page gives a brief introduction to the evaluation techniques and performance measures used by Athena in the classification step.
In particular, it covers the accuracy and the methods for splitting the dataset into a training set and a test set, the confusion matrix and its related measures, the ROC curve and the AUC value, and the rejection option.
If you are already familiar with these topics, you can return to the wiki page on classification.
The accuracy measures how well an algorithm predicts outcome values for previously unseen data, while the generalization error is the complementary measure: how often it fails on such data.
These values are used to evaluate whether a classifier can discriminate samples that were not used during the training phase, that is, to verify its predictive ability on new data.
To obtain these values, the initial dataset is split into two subsets: the training set and the test set.
The training set consists of the samples used to train (fit) the classifier, while the test set consists of samples that are not in the training set; these are classified by the classifier, and the resulting labels are compared with the true class labels.
The accuracy is then simply the fraction of correctly classified test samples:
accuracy = (number of correctly classified test samples) / (total number of test samples)
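As a minimal sketch of this definition, the accuracy and its complement (the error measured on the test set) can be computed from two lists of labels. The function name `accuracy` and the example labels are hypothetical and are not part of Athena; Python is used here only for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical test-set labels, used only for illustration.
true_labels = ["A", "A", "B", "B", "B"]
predicted_labels = ["A", "B", "B", "B", "A"]

acc = accuracy(true_labels, predicted_labels)
error = 1.0 - acc  # fraction of misclassified test samples
print(f"accuracy = {acc:.2f}, error = {error:.2f}")
```

Three of the five hypothetical samples are classified correctly, so the accuracy is 0.6 and the error is 0.4.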
Many methods can be used to split the dataset into two subsets, and they lead to more or less reliable evaluations.
Currently, Athena offers two evaluation methods, which subdivide the initial dataset and compute the performance values in different ways: the training-test split and the leave-one-out cross-validation.
The training-test split method divides the dataset into a training set and a test set by randomly extracting a previously chosen fraction of the samples to use as the training set, while the remaining samples are used as the test set to evaluate the classifier's performance.
For example, with a fraction of 0.8, 80% of the samples are extracted and used to fit the classifier.
The remaining samples (20% of the initial dataset) act as instances the classifier has never seen before: their class labels are predicted by the classifier and compared with their true labels to obtain evaluation measures such as the generalization error.
Because the samples are chosen at random, the fitting-classification cycle is often repeated several times to obtain a more reliable evaluation; the evaluated measures are then aggregated across the repetitions (this toolbox, for example, returns the average of their values).
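The repeated training-test split can be sketched as follows. This is an illustration under stated assumptions, not Athena's implementation: the nearest-centroid classifier, the one-dimensional data, and all function names are hypothetical stand-ins chosen to keep the example self-contained.

```python
import random


def nearest_centroid_fit(samples, labels):
    """Toy classifier (an assumption of this sketch): one mean value per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def train_test_accuracy(samples, labels, train_fraction, rng):
    """One random split: fit on train_fraction of the data, test on the rest."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * len(indices)))
    if n_train == 0 or n_train == len(indices):
        raise ValueError("both the training set and the test set must be non-empty")
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    centroids = nearest_centroid_fit([samples[i] for i in train_idx],
                                     [labels[i] for i in train_idx])
    correct = sum(nearest_centroid_predict(centroids, samples[i]) == labels[i]
                  for i in test_idx)
    return correct / len(test_idx)


def repeated_split_accuracy(samples, labels, train_fraction=0.8,
                            repetitions=20, seed=0):
    """Average the test accuracy over several independent random splits."""
    rng = random.Random(seed)
    scores = [train_test_accuracy(samples, labels, train_fraction, rng)
              for _ in range(repetitions)]
    return sum(scores) / len(scores)


# Hypothetical, well-separated one-dimensional data: 10 samples per class.
samples = [i / 10 for i in range(10)] + [5 + i / 10 for i in range(10)]
labels = ["low"] * 10 + ["high"] * 10

mean_accuracy = repeated_split_accuracy(samples, labels,
                                        train_fraction=0.8, repetitions=20, seed=0)
print(f"mean accuracy over 20 splits: {mean_accuracy:.2f}")
```

With a fraction of 0.8, each repetition fits the toy classifier on 16 samples and tests it on the remaining 4. Averaging over repetitions reduces the dependence of the result on any single random split.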
If the performance depends strongly on which samples are used to train the classifier, or the dataset contains only a small number of observations, the leave-one-out cross-validation method often gives a more reliable evaluation.
This method trains the model on all the samples except one, which is used as the test set, and repeats the procedure until every sample has been used as the test set exactly once.
The accuracy of the classifier, for example, is then taken as the average accuracy over the repetitions.
Despite its generally higher reliability, this method is often very slow, because the classifier must be fitted once for every sample. A single fitting-classification cycle per excluded sample is sufficient, unless some randomization is introduced in the fitting phase (for example, the fraction of training samples resampled to train each tree of a Random Forest classifier).
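A minimal sketch of leave-one-out cross-validation is shown below. The 1-nearest-neighbour classifier, the data, and the function names are hypothetical and chosen only because they are deterministic, so one fitting-classification cycle per excluded sample is enough.

```python
def one_nn_predict(train_samples, train_labels, x):
    """Toy 1-nearest-neighbour classifier (an assumption of this sketch)."""
    nearest = min(range(len(train_samples)),
                  key=lambda i: abs(train_samples[i] - x))
    return train_labels[nearest]


def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and average the results."""
    hits = []
    for held_out in range(len(samples)):
        train_samples = samples[:held_out] + samples[held_out + 1:]
        train_labels = labels[:held_out] + labels[held_out + 1:]
        predicted = one_nn_predict(train_samples, train_labels, samples[held_out])
        hits.append(1.0 if predicted == labels[held_out] else 0.0)
    return sum(hits) / len(hits)


# Hypothetical data: the last sample lies closer to the "low" group
# even though its true label is "high".
samples = [1.0, 1.2, 1.4, 2.9, 3.0, 3.3, 1.9]
labels = ["low", "low", "low", "high", "high", "high", "high"]

loo_accuracy = leave_one_out_accuracy(samples, labels)
print(f"leave-one-out accuracy: {loo_accuracy:.3f}")
```

Only the last sample is misclassified when it is held out, so the estimate is 6/7. With a randomized classifier, each hold-out step would itself need to be repeated and averaged.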
In addition to the accuracy and the generalization error, it may be useful to evaluate other measures.
For example, it is sometimes necessary to know the accuracy on each individual class; in this case, the confusion matrix is used.
This matrix reports the accuracy and the error on each class, expressed as the number of correctly classified and misclassified samples of that class, respectively, or as their fraction of the total number of samples belonging to that class.
If these values are expressed as fractions, in a two-class problem they also correspond to other commonly evaluated measures:
- TPR (True Positive Rate): the fraction of positive class samples that are correctly classified (also called sensitivity)
- TNR (True Negative Rate): the fraction of negative class samples that are correctly classified (also called specificity)
- FPR (False Positive Rate): the fraction of negative class samples that are classified as positive (false positives)
- FNR (False Negative Rate): the fraction of positive class samples that are classified as negative (false negatives)
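The four rates above can be derived from the counts of a two-class confusion matrix, as in the following sketch. The function name `binary_rates` and the example labels are hypothetical and do not reflect Athena's own code.

```python
def binary_rates(true_labels, predicted_labels, positive):
    """Compute TPR, FNR, TNR and FPR for a two-class problem."""
    tp = fn = tn = fp = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == positive:
            if p == positive:
                tp += 1  # positive sample classified as positive
            else:
                fn += 1  # positive sample classified as negative
        else:
            if p == positive:
                fp += 1  # negative sample classified as positive
            else:
                tn += 1  # negative sample classified as negative
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in true_labels")
    return {
        "TPR": tp / positives,  # sensitivity
        "FNR": fn / positives,
        "TNR": tn / negatives,  # specificity
        "FPR": fp / negatives,
    }


# Hypothetical labels: 1 is the positive class, 0 the negative class.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

rates = binary_rates(true_labels, predicted_labels, positive=1)
for name, value in rates.items():
    print(f"{name} = {value:.3f}")
```

Because each rate is normalized by the size of its own class, TPR + FNR = 1 and TNR + FPR = 1.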
The ROC (Receiver Operating Characteristic) curve is a plot of the TPR against the FPR at varying classification thresholds.
The TPR (on the y axis) expresses how many correct positive results occur among all the positive samples available during the test, while the FPR (on the x axis) expresses how many incorrect positive results occur among all the negative samples available during the test; the curve therefore represents the trade-off between benefits (true positives) and costs (false positives).
To judge how good a ROC curve is, note that the 45-degree line from (0, 0) to (1, 1) is the ROC curve of a purely random classification, while the best possible prediction method would yield a point in the upper left corner of the ROC space, at (0, 1) (no false negatives and no false positives), which represents a perfect classification.
The quality of the curve can therefore be summarized by the area underneath it, which is called the AUC (Area Under the Curve) value: an AUC of 0.5 corresponds to the random diagonal, and an AUC of 1 to a perfect classification.
This curve is particularly important, for example, in medical diagnosis and in biometric applications.
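As an illustration, the ROC points can be obtained by sweeping a threshold over the classifier scores, and the AUC can be approximated with the trapezoidal rule. This is a sketch under assumptions: the function names, the labels, and the scores are hypothetical, and a sample is treated as positive when its score is greater than or equal to the threshold.

```python
def roc_points(true_labels, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over the scores."""
    positives = sum(1 for t in true_labels if t == 1)
    negatives = len(true_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in true_labels")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(true_labels, scores)
                 if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(true_labels, scores)
                 if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores: probability of belonging to the positive class (1).
true_labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(true_labels, scores)
auc_value = auc(curve)
print(f"AUC = {auc_value:.3f}")
```

For these hypothetical scores the curve passes well above the diagonal and the AUC is 8/9, which matches the fraction of positive-negative pairs in which the positive sample receives the higher score.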
The rejection option makes it possible to reject the samples that the classifier is not able to assign to any class with a probability higher than a threshold value.
In particular, since a classifier can assign to any sample a score that expresses the probability that it belongs to a specific class, the rejection option is implemented as a simple check on the scores the classifier gives the sample for each class label: if the highest score is lower than the chosen rejection threshold, the sample is rejected and its classification is not considered in the performance evaluation.
However, if this option is used, Athena also shows statistics about the rejections.
Note that in a binary classification a rejection threshold lower than or equal to 0.5 loses its meaning: the highest probability score of a sample is always at least 0.5 (the sample has at least a 50% probability of belonging to the most likely class), so no sample would ever be rejected.
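The rejection check described above can be sketched as follows. The function names, the per-class scores, and the threshold of 0.6 are hypothetical and only illustrate the rule "reject when the highest score is below the threshold"; they are not taken from Athena's code.

```python
def classify_with_rejection(class_scores, threshold):
    """Return the most likely label, or None if its score is below the threshold."""
    best_label = max(class_scores, key=class_scores.get)
    if class_scores[best_label] < threshold:
        return None  # sample rejected
    return best_label


def evaluate_with_rejection(true_labels, score_list, threshold):
    """Accuracy over accepted samples only, plus the number of rejections."""
    accepted = correct = 0
    for true_label, class_scores in zip(true_labels, score_list):
        predicted = classify_with_rejection(class_scores, threshold)
        if predicted is None:
            continue  # rejected samples do not enter the accuracy
        accepted += 1
        if predicted == true_label:
            correct += 1
    rejected = len(true_labels) - accepted
    accuracy = correct / accepted if accepted else None
    return accuracy, rejected


# Hypothetical per-class probability scores for four samples.
true_labels = ["A", "B", "A", "B"]
score_list = [
    {"A": 0.90, "B": 0.10},
    {"A": 0.45, "B": 0.55},  # highest score below 0.6: rejected
    {"A": 0.30, "B": 0.70},
    {"A": 0.20, "B": 0.80},
]

accuracy, rejected = evaluate_with_rejection(true_labels, score_list, threshold=0.6)
print(f"accuracy on accepted samples = {accuracy:.3f}, rejected = {rejected}")
```

With a threshold of 0.6, the second sample is rejected and the accuracy is computed on the remaining three. With a threshold of 0.5 or lower, none of these two-class samples could be rejected, since the highest of two probabilities is never below 0.5.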