This document explains:
- How we evaluate the performance of different changepoint methods on real RTT time series;
- The scoring method used in evaluation.
In order to tell which changepoint method, with which configuration, works best for RTT time series, we need an evaluation framework composed of two parts:
- a dataset of ground truth;
- a scoring method.
The ground truth dataset is obtained through manual labelling. To ensure the quality of the labelled dataset, we performed a capability test on labellers using a synthetic dataset. For synthetic RTT time series, the moments when the generative model changes are known and serve as ground truth. Usage:
$ python labeller_evaluation.py -h
usage: labeller_evaluation.py [-h] [-f FACT] [-l LABELLER] [-o OUTPUT]
optional arguments:
-h, --help show this help message and exit
-f FACT, --fact FACT directory storing ground fact.
-l LABELLER, --labeller LABELLER
directory storing the detections of human labeller.
-o OUTPUT, --output OUTPUT
filename storing the output result.
The synthetic dataset with known moments of change is stored in dataset/artificial_trace_ground_fact. There are 20 traces in the repository, generated with https://github.com/WenqinSHAO/rtt_gen.git. On average, each trace contains 8646 datapoints. In total, there are 935 moments of change to be detected. Each trace is stored in a .csv file with three columns: epoch for the timestamp, rtt for the RTT value, and cp marking the changes.
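For reference, a trace can be loaded and its ground truth extracted as follows. This is a minimal sketch: the filename is illustrative, pandas is assumed to be available, and cp is assumed to be non-zero exactly at the moments of change (check this encoding against the actual data):

    import pandas as pd

    # Load one synthetic trace; columns: epoch (timestamp), rtt (RTT value), cp (change marker).
    trace = pd.read_csv('dataset/artificial_trace_ground_fact/some_trace.csv')  # illustrative filename

    # Assumption: cp is non-zero exactly at the moments when the generative model changes.
    change_moments = trace.loc[trace['cp'] != 0, 'epoch'].tolist()
    print('%d datapoints, %d moments of change' % (len(trace), len(change_moments)))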
The changes labelled by human labellers are stored in dataset/artificial_trace_labelled. The .csv files follow the same format as the ground truth. The labelling is done with the help of this tool: https://github.com/WenqinSHAO/rtt_visual.git.
The output directory is data/.
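A typical run thus compares a labeller's detections against the synthetic ground truth; for example (the output filename here is illustrative):

$ python labeller_evaluation.py -f dataset/artificial_trace_ground_fact -l dataset/artificial_trace_labelled -o data/labeller_eval.csv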
The capability test confirmed that human labellers are highly capable of detecting changes in synthetic RTT traces. We then proceeded with real traces. 50 real RTT traces of various forms were collected from RIPE Atlas. Each trace has 8162 datapoints on average. In all, they represent more than 34008 hours, i.e. 1417 days, of RTT measurements. 1047 moments of change are manually labelled and stored in dataset/real_trace_labelled. These labelled moments are used as ground truth in the evaluation of changepoint methods.
$ python cpt_evaluation.py -h
usage: cpt_evaluation.py [-h] [-d DIRECTORY] [-f FILENAME]
optional arguments:
-h, --help show this help message and exit
-d DIRECTORY, --directory DIRECTORY
benchmark changepoint methods using the traces from
the specified directory.
-f FILENAME, --filename FILENAME
file name for output.
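For example, to benchmark the changepoint methods against the labelled real traces (again, the output filename is illustrative):

$ python cpt_evaluation.py -d dataset/real_trace_labelled -f data/cpt_eval.csv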
Again, the output directory is data/. labeller_evaluation.py and cpt_evaluation.py each produce a .csv file containing the following columns:
file       len    changes  tp  fp    fn  precision        recall          score           dis             method             penalty
11119.csv  10001  33       33  1092  0   0.0293333333333  1.0             1.0             0.151515151515  cpt_normal         AIC
11119.csv  10001  33       32  104   1   0.235294117647   0.969696969697  1.0             0.03125         cpt_normal         BIC
11119.csv  10001  33       32  97    1   0.248062015504   0.969696969697  1.0             0.03125         cpt_normal         MBIC
11119.csv  10001  33       33  183   0   0.152777777778   1.0             1.0             0.121212121212  cpt_normal         Hannan-Quinn
11119.csv  10001  33       30  17    3   0.63829787234    0.909090909091  0.986828872629  0.166666666667  cpt_poisson        AIC
11119.csv  10001  33       18  14    15  0.5625           0.545454545455  0.684655151504  0.222222222222  cpt_poisson        BIC
11119.csv  10001  33       18  13    15  0.58064516129    0.545454545455  0.684655151504  0.222222222222  cpt_poisson        MBIC
11119.csv  10001  33       26  14    7   0.65             0.787878787879  0.751922379465  0.192307692308  cpt_poisson        Hannan-Quinn
11119.csv  10001  33       12  11    21  0.521739130435   0.363636363636  0.580483609138  0.333333333333  cpt_poisson_naive  AIC
11119.csv  10001  33       7   11    26  0.388888888889   0.212121212121  0.479300106511  0.571428571429  cpt_poisson_naive  BIC
11119.csv  10001  33       7   11    26  0.388888888889   0.212121212121  0.501398809509  0.714285714286  cpt_poisson_naive  MBIC
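As a sanity check on these columns, precision and recall follow directly from tp, fp and fn. Taking the first data row above (tp=33, fp=1092, fn=0):

    precision = tp / (tp + fp) = 33 / (33 + 1092) ≈ 0.0293
    recall    = tp / (tp + fn) = 33 / (33 + 0)    = 1.0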
Four scoring methods are provided in localutils/benchmark.py. evaluation_window_weighted() is used both in the human labeller capability test and in the evaluation of changepoint methods. It takes a window option that allows approximate matching between ground truth and detections. Each RTT change in the ground truth is weighted according to its practical importance to produce a weighted version of recall, reported as score in the output file.
More precisely, a bipartite graph is first constructed using the moments of ground truth events and detection events as vertices. The edges are the pairs of ground truth and detection events whose distance is no greater than the window size, and the cost of each edge is that distance. A minimum-cost maximum-cardinality matching is then computed for this graph with the Hungarian algorithm. The cardinality of the matching gives the number of true positives. All unmatched ground truth events count as false negatives, and all unmatched detection events as false positives. Precision and recall follow directly. score is a weighted version of recall, where each RTT change is weighted by the weighting() function. The dis column in the output is the average cost of the matching, i.e. the average distance between matched ground truth and detection events.
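The sketch below illustrates this matching procedure. It is a simplified stand-in for evaluation_window_weighted(), not the actual implementation: the function name and signature are illustrative, scipy's linear_sum_assignment() is used for the Hungarian algorithm, and the weighting() step behind the score column is omitted (every event is implicitly weighted 1, so the sketch stops at plain precision and recall):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_events(fact, detection, window):
        """Match ground truth moments to detections within a tolerance window.

        fact, detection: lists of timestamps; window: maximum allowed distance.
        Returns (tp, fp, fn, precision, recall, dis).
        """
        if not fact or not detection:
            return 0, len(detection), len(fact), 0.0, 0.0, float('nan')
        # Pairwise distances; pairs farther apart than the window are not edges
        # of the bipartite graph, so give them a prohibitively large cost.
        dist = np.abs(np.subtract.outer(fact, detection)).astype(float)
        big = (window + 1.0) * (len(fact) + len(detection))
        cost = np.where(dist <= window, dist, big)
        # The Hungarian algorithm on this cost matrix yields a minimum-cost
        # maximum-cardinality matching over the within-window edges.
        rows, cols = linear_sum_assignment(cost)
        matched = dist[rows, cols] <= window
        tp = int(matched.sum())
        fp = len(detection) - tp  # unmatched detections
        fn = len(fact) - tp       # unmatched ground truth events
        precision = float(tp) / (tp + fp) if tp + fp else 0.0
        recall = float(tp) / (tp + fn) if tp + fn else 0.0
        dis = float(dist[rows, cols][matched].mean()) if tp else float('nan')
        return tp, fp, fn, precision, recall, dis

With window = 0 this degenerates to exact matching; larger windows tolerate small offsets between a labelled change and a detected one, which is why the window option makes the comparison approximate.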