
Evaluation strategy & performance metrics


Train/validation/test split

The ideal train/validation/test split looks like this:

  • each subset contains different patients
  • training is performed only on normal images
  • ideally, the same number of positive/negative samples in the validation and test subsets

But if we want a balanced number of labels in the validation and test subsets, the train set becomes quite small (only about 50% of the data). Remember the distribution of labels. The resulting subsets are listed below (a split sketch follows the list):

  1. Train subset
    • Size: 3012 images (51.45% of the original data)
    • Negative samples: 100%, so we train only on normal images
    • Patients: 1017
  2. Validation subset
    • Size: 1419 images (24.24% of the original data)
    • Negative samples: 48.56%
    • Patients: 473
  3. Test subset
    • Size: 1423 images (24.31% of the original data)
    • Negative samples: 41.95%
    • Patients: 474
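
A minimal sketch of such a patient-level split, assuming a pandas DataFrame with hypothetical `patient_id` and `label` columns (label 1 = abnormal); `GroupShuffleSplit` keeps all images of one patient inside a single subset:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def patient_level_split(df: pd.DataFrame, val_frac=0.25, test_frac=0.25, seed=42):
    """Split so that no patient appears in more than one subset and
    the train subset contains only normal (label == 0) images."""
    # Carve off the test patients first.
    gss = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    rest_idx, test_idx = next(gss.split(df, groups=df["patient_id"]))
    rest, test = df.iloc[rest_idx], df.iloc[test_idx]

    # Then split validation off the remainder (fraction re-scaled accordingly).
    gss = GroupShuffleSplit(n_splits=1, test_size=val_frac / (1 - test_frac),
                            random_state=seed)
    train_idx, val_idx = next(gss.split(rest, groups=rest["patient_id"]))
    train, val = rest.iloc[train_idx], rest.iloc[val_idx]

    # Train only on normal images: abnormal train images are dropped here,
    # though they could also be redistributed to validation/test.
    train = train[train["label"] == 0]
    return train, val, test
```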

Evaluation metric

  • In the validation/test subsets, sort all images by outlier score (different models produce different outlier scores).
  • Define the threshold by ROC analysis and calculate the AUC.
  • Calculate the F1-score (see the sketch after the formula):

F_1 = \left( \frac{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}}{2} \right)^{-1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
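
A sketch of this evaluation, assuming a binary array `y_true` (1 = abnormal) and an array `scores` where higher values mean "more anomalous"; picking the threshold that maximises Youden's J statistic on the ROC curve is one common form of ROC analysis, used here for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, roc_curve


def evaluate(y_true, scores):
    """Report AUC, a ROC-derived threshold, and the F1-score."""
    auc = roc_auc_score(y_true, scores)

    # ROC analysis: pick the threshold maximising Youden's J = TPR - FPR.
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    threshold = thresholds[np.argmax(tpr - fpr)]

    # Images whose outlier score exceeds the threshold are flagged abnormal.
    y_pred = (scores >= threshold).astype(int)
    f1 = f1_score(y_true, y_pred)  # 2 * precision * recall / (precision + recall)
    return auc, threshold, f1
```

In practice one would pick the threshold on the validation subset and reuse it on the test subset, rather than re-deriving it on test data.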