Evaluation
olehmberg edited this page Apr 20, 2017 · 3 revisions
Gold standards for matching are simple CSV files that list two record IDs and a Boolean flag specifying whether the match is correct or not:
Dataset1_record1,Dataset2_record2,true
Dataset1_record2,Dataset5_record5,true
Dataset1_record1,Dataset2_record1,false
Dataset1_record2,Dataset2_record7,false
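To make the file format concrete, the following is a minimal, self-contained sketch of how such lines could be parsed. It is an illustration only and not the framework's own loader (that is `loadFromCSVFile`); the class `GoldStandardCsvSketch` and its `Entry` type are hypothetical names, and the sketch assumes that record IDs contain no commas or quoted fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the framework.
class GoldStandardCsvSketch {

    /** One gold standard line: two record IDs and the match flag. */
    static final class Entry {
        final String firstId;
        final String secondId;
        final boolean isMatch;

        Entry(String firstId, String secondId, boolean isMatch) {
            this.firstId = firstId;
            this.secondId = secondId;
            this.isMatch = isMatch;
        }
    }

    /** Parses lines of the form "id1,id2,true|false"; blank lines are skipped. */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            // Assumption: plain comma-separated values without quoting.
            String[] parts = line.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected 3 columns: " + line);
            }
            entries.add(new Entry(
                    parts[0].trim(),
                    parts[1].trim(),
                    Boolean.parseBoolean(parts[2].trim())));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> entries = parse(Arrays.asList(
                "Dataset1_record1,Dataset2_record2,true",
                "Dataset1_record1,Dataset2_record1,false"));
        for (Entry e : entries) {
            System.out.println(e.firstId + " / " + e.secondId + " -> " + e.isMatch);
        }
    }
}
```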
The gold standards can either be complete or partial:
- Complete Gold Standard: contains all correct matches between the datasets, so every flag has the value “true”.
MatchingGoldStandard gs = new MatchingGoldStandard();
gs.loadFromCSVFile(new File("complete.csv"));
gs.setComplete(true);
- Partial Gold Standard: contains positive and negative examples for matches, indicated by the flag “true” or “false”. Only those correspondences (produced by a matcher) that are included in the partial gold standard are evaluated.
MatchingGoldStandard gs = new MatchingGoldStandard();
gs.loadFromCSVFile(new File("partial.csv"));
The evaluation of a matching result is performed by the matching evaluator:
MatchingEvaluator<Record, Attribute> evaluator = new MatchingEvaluator<Record, Attribute>(true);
Performance perf = evaluator.evaluateMatching(correspondences.get(), gs);
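The difference between complete and partial gold standards can be sketched in plain Java. This is not the `MatchingEvaluator` implementation; it is a simplified illustration under stated assumptions: correspondences are represented as plain string keys, a complete gold standard counts every predicted pair that is not a listed match as wrong, and a partial gold standard ignores predicted pairs that appear in neither the positive nor the negative examples. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the framework's evaluator.
class MatchingEvaluationSketch {

    /** Returns {precision, recall, f1} for the predicted pairs. */
    static double[] evaluate(List<String> predicted, Set<String> positives,
                             Set<String> negatives, boolean complete) {
        int correct = 0;
        int evaluated = 0;
        // De-duplicate predictions so each pair is counted once.
        for (String pair : new HashSet<>(predicted)) {
            if (positives.contains(pair)) {
                correct++;
                evaluated++;
            } else if (complete || negatives.contains(pair)) {
                // Complete: any unlisted pair is wrong.
                // Partial: only listed negative examples count as wrong.
                evaluated++;
            }
            // Otherwise (partial, pair not listed): ignored.
        }
        double precision = evaluated == 0 ? 0.0 : (double) correct / evaluated;
        double recall = positives.isEmpty() ? 0.0 : (double) correct / positives.size();
        double f1 = (precision + recall) == 0.0
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> positives = new HashSet<>(Arrays.asList("a1|b2", "a2|b5"));
        Set<String> negatives = new HashSet<>(Arrays.asList("a1|b1", "a2|b7"));
        List<String> predicted = Arrays.asList("a1|b2", "a1|b1", "a3|b9");

        double[] partial = evaluate(predicted, positives, negatives, false);
        double[] complete = evaluate(predicted, positives, negatives, true);
        System.out.println("partial:  " + Arrays.toString(partial));
        System.out.println("complete: " + Arrays.toString(complete));
    }
}
```

In this example the pair `a3|b9` is not listed in the gold standard. Under the partial interpretation it is skipped, so precision is 1 of 2; under the complete interpretation it counts as an error, so precision drops to 1 of 3. Recall is 1 of 2 in both cases.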
For data fusion, a gold standard is just another dataset. If the fused values are the same as the values in this dataset, they are evaluated as correct. The connection is made via the record IDs in the datasets.
// load the gold standard
DataSet<Movie, Attribute> gs = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(new File("fused.xml"), "/movies/movie", gs);
// evaluate
DataFusionEvaluator<Movie, Attribute> evaluator = new DataFusionEvaluator<>(
    strategy,
    new RecordGroupFactory<Movie, Attribute>());
double accuracy = evaluator.evaluate(fusedDataSet, gs, null);
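The idea behind this accuracy value can be illustrated with a small standalone sketch. It is not the `DataFusionEvaluator` implementation: records are modelled here as plain maps from attribute name to value, values are compared by exact equality, and gold standard records without a fused counterpart are skipped. All of these are assumptions made for the example, and the class name `FusionAccuracySketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration; not the framework's evaluator.
class FusionAccuracySketch {

    /**
     * Fraction of gold standard values that the fused dataset reproduces.
     * Both maps are keyed by record ID, then by attribute name.
     */
    static double accuracy(Map<String, Map<String, String>> fused,
                           Map<String, Map<String, String>> gold) {
        int correct = 0;
        int total = 0;
        for (Map.Entry<String, Map<String, String>> goldRecord : gold.entrySet()) {
            // The connection between the datasets is the record ID.
            Map<String, String> fusedRecord = fused.get(goldRecord.getKey());
            if (fusedRecord == null) {
                continue; // assumption: unmatched gold records are skipped
            }
            for (Map.Entry<String, String> goldValue : goldRecord.getValue().entrySet()) {
                total++;
                if (Objects.equals(goldValue.getValue(), fusedRecord.get(goldValue.getKey()))) {
                    correct++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    static Map<String, String> movie(String title, String year) {
        Map<String, String> record = new HashMap<>();
        record.put("title", title);
        record.put("year", year);
        return record;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> gold = new HashMap<>();
        gold.put("movie1", movie("Alien", "1979"));
        gold.put("movie2", movie("Heat", "1995"));

        Map<String, Map<String, String>> fused = new HashMap<>();
        fused.put("movie1", movie("Alien", "1980")); // year differs from gold
        fused.put("movie2", movie("Heat", "1995"));

        // 3 of the 4 gold values are reproduced by the fused records.
        System.out.println(accuracy(fused, gold));
    }
}
```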