DataFusion

olehmberg edited this page Apr 26, 2017 · 4 revisions

Data Fusion

Load the datasets:

// load each input file into its own FusableDataSet
FusableDataSet<Movie, Attribute> ds1 = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(
  new File("usecase/movie/input/academy_awards.xml"),
  "/movies/movie",
  ds1);
ds1.printDataSetDensityReport();

FusableDataSet<Movie, Attribute> ds2 = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(
  new File("usecase/movie/input/actors.xml"),
  "/movies/movie",
  ds2);
ds2.printDataSetDensityReport();

FusableDataSet<Movie, Attribute> ds3 = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(
  new File("usecase/movie/input/golden_globes.xml"),
  "/movies/movie",
  ds3);
ds3.printDataSetDensityReport();
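
To give a feel for what a density report summarizes, the sketch below computes, for each attribute, the share of records that have a value for it. It is a standalone illustration and not the library's implementation: the class `DensitySketch`, the method `attributeDensities`, and the map-based record representation are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the framework: records are plain maps
// from attribute name to value, and a missing key means "no value".
final class DensitySketch {

    // Density of an attribute = records with a value / all records.
    static Map<String, Double> attributeDensities(
            List<Map<String, String>> records, List<String> attributes) {
        Map<String, Double> densities = new LinkedHashMap<>();
        for (String attribute : attributes) {
            long filled = records.stream()
                    .filter(record -> record.get(attribute) != null)
                    .count();
            double density = records.isEmpty() ? 0.0 : (double) filled / records.size();
            densities.put(attribute, density);
        }
        return densities;
    }

    public static void main(String[] args) {
        // Made-up records: the second one has no director.
        List<Map<String, String>> records = List.of(
                Map.of("title", "Movie A", "director", "Director X"),
                Map.of("title", "Movie B"));
        System.out.println(attributeDensities(records, List.of("title", "director")));
    }
}
```

Attributes with low density in one dataset are the ones that benefit most from being fused with values from the other datasets.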

Load the record correspondences between the datasets (obtained from identity resolution):

// load correspondences
CorrespondenceSet<Movie, Attribute> correspondences = new CorrespondenceSet<>();
correspondences.loadCorrespondences(
  new File("usecase/movie/correspondences/academy_awards_2_actors_correspondences.csv"),
  ds1,
  ds2);
correspondences.loadCorrespondences(
  new File("usecase/movie/correspondences/actors_2_golden_globes_correspondences.csv"),
  ds2,
  ds3);
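
Because the two files link ds1 to ds2 and ds2 to ds3, a record from each dataset can end up describing the same movie through a chain of correspondences. One common way to turn pairwise correspondences into groups of records is a union-find structure, sketched below. This is an illustration under assumptions, not the framework's own grouping code: `RecordGroupSketch` and the record identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical union-find over record identifiers.
final class RecordGroupSketch {

    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative of the group that contains id.
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every visited node directly at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // A correspondence states that both records describe the same entity.
    void addCorrespondence(String id1, String id2) {
        String root1 = find(id1);
        String root2 = find(id2);
        parent.put(root1, root2);
    }

    // All groups, each with its member identifiers sorted.
    Collection<List<String>> groups() {
        Map<String, List<String>> byRoot = new TreeMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(id), key -> new ArrayList<>()).add(id);
        }
        byRoot.values().forEach(Collections::sort);
        return byRoot.values();
    }

    public static void main(String[] args) {
        RecordGroupSketch sketch = new RecordGroupSketch();
        // Made-up identifiers for records in ds1, ds2 and ds3.
        sketch.addCorrespondence("academy_awards_1", "actors_7");
        sketch.addCorrespondence("actors_7", "golden_globes_3");
        sketch.addCorrespondence("academy_awards_2", "actors_9");
        System.out.println(sketch.groups());
    }
}
```

Each resulting group is the unit on which conflicting attribute values are later resolved.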

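Define the data fusion strategy. For each attribute, register a fuser, which resolves conflicting values, together with an evaluation rule, which is used when the result is compared to the gold standard: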

// define the fusion strategy
DataFusionStrategy<Movie, Attribute> strategy = new DataFusionStrategy<>(new MovieXMLReader());
// add attribute fusers
strategy.addAttributeFuser(
  Movie.TITLE,
  new TitleFuserShortestString(),
  new TitleEvaluationRule());
strategy.addAttributeFuser(
  Movie.DIRECTOR,
  new DirectorFuserLongestString(),
  new DirectorEvaluationRule());
strategy.addAttributeFuser(
  Movie.DATE,
  new DateFuserVoting(),
  new DateEvaluationRule());
strategy.addAttributeFuser(
  Movie.ACTORS,
  new ActorsFuserUnion(),
  new ActorsEvaluationRule());
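
The fuser names above suggest four conflict resolution functions: shortest string, longest string, voting, and union. The sketch below shows what such functions might look like on plain Java collections. It is written independently of the framework, so `ConflictResolutionSketch` and its methods are hypothetical and the real fuser classes may behave differently, for example when breaking ties.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical conflict resolution functions over the values of one attribute.
final class ConflictResolutionSketch {

    // Picks the value with the fewest characters.
    static Optional<String> shortestString(List<String> values) {
        return values.stream()
                .filter(Objects::nonNull)
                .min(Comparator.comparingInt(String::length));
    }

    // Picks the value with the most characters.
    static Optional<String> longestString(List<String> values) {
        return values.stream()
                .filter(Objects::nonNull)
                .max(Comparator.comparingInt(String::length));
    }

    // Picks the value that occurs most often.
    static <T> Optional<T> voting(List<T> values) {
        Map<T, Long> counts = values.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.groupingBy(
                        Function.identity(), LinkedHashMap::new, Collectors.counting()));
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    // Merges all list-valued inputs and drops duplicates.
    static <T> Set<T> union(List<? extends Collection<T>> valueLists) {
        Set<T> merged = new LinkedHashSet<>();
        for (Collection<T> values : valueLists) {
            merged.addAll(values);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up conflicting values from three sources.
        System.out.println(shortestString(List.of("Movie A", "Movie A (Extended)")));
        System.out.println(longestString(List.of("J. Doe", "Jane Doe")));
        System.out.println(voting(List.of("2001-01-01", "2001-01-01", "2002-01-01")));
        System.out.println(union(List.of(List.of("Actor 1", "Actor 2"), List.of("Actor 2", "Actor 3"))));
    }
}
```

Choosing a function per attribute reflects what kind of error is expected: voting suits single-valued attributes such as dates, while union suits list-valued attributes such as actors.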

Initialise the data fusion engine with the strategy and run it:

// create the fusion engine
DataFusionEngine<Movie, Attribute> engine = new DataFusionEngine<>(strategy);
// run the fusion
FusableDataSet<Movie, Attribute> fusedDataSet = engine.run(correspondences, null);

Load the gold standard and evaluate the fusion result:

// load the gold standard
DataSet<Movie, Attribute> gs = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(
  new File("usecase/movie/goldstandard/fused.xml"),
  "/movies/movie",
  gs);
// evaluate
DataFusionEvaluator<Movie, Attribute> evaluator = new DataFusionEvaluator<>(
  strategy, new RecordGroupFactory<Movie, Attribute>());
evaluator.setVerbose(true);

double accuracy = evaluator.evaluate(fusedDataSet, gs, null);

System.out.println(String.format("Accuracy: %.2f", accuracy));
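
As a rough mental model of the reported number, accuracy can be read as the fraction of fused attribute values that agree with the gold standard. The sketch below computes that fraction on plain maps. It is an assumption-laden illustration, not the evaluator's code: `AccuracySketch` is hypothetical, and it compares values by exact equality, whereas the evaluation rules registered in the strategy may apply more tolerant comparisons.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical accuracy computation: record id -> (attribute -> value).
final class AccuracySketch {

    static double accuracy(
            Map<String, Map<String, String>> fused,
            Map<String, Map<String, String>> goldStandard,
            List<String> attributes) {
        int correct = 0;
        int total = 0;
        for (Map.Entry<String, Map<String, String>> gold : goldStandard.entrySet()) {
            Map<String, String> fusedRecord = fused.get(gold.getKey());
            if (fusedRecord == null) {
                continue; // no fused record to compare against this gold entry
            }
            for (String attribute : attributes) {
                total++;
                // Assumption: exact equality stands in for an evaluation rule.
                if (Objects.equals(fusedRecord.get(attribute), gold.getValue().get(attribute))) {
                    correct++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up data: the title matches, the date does not.
        Map<String, Map<String, String>> fused =
                Map.of("movie_1", Map.of("title", "Movie A", "date", "2001-01-01"));
        Map<String, Map<String, String>> gold =
                Map.of("movie_1", Map.of("title", "Movie A", "date", "2002-01-01"));
        System.out.printf("Accuracy: %.2f%n", accuracy(fused, gold, List.of("title", "date")));
    }
}
```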