
Join Operator

Zuozhi Wang edited this page Sep 14, 2016 · 18 revisions

Author: Sripad Kowshik Subramanyam

Synopsis

Implement an operator that takes two other operators as input and joins their results based on the constraints specified by a predicate.

Status

As of 9/14/2016: IN PROGRESS

Modules

edu.uci.ics.textdb.dataflow.common
edu.uci.ics.textdb.dataflow.join

Related Issues

https://github.com/TextDB/textdb/issues/111

Description

The Join Operator joins the results of two input operators on a specified field, subject to constraints supplied by the user. The field to join on and the constraints to be satisfied are specified with a JoinPredicate. The getNextTuple() method returns the next result of the operator.

Currently supported predicates are:

  • JoinDistancePredicate: Takes the attribute that specifies the ID, the attribute of the field to join on, and a distance threshold. Two spans of that field are joined if the distance between them is within the threshold.

Example

The setting below is used in the JoinDistancePredicate examples that follow (treat the two tuples as coming from two different operators).

       | id | author      | review | spanList
tuple1 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "book":<6,11>
tuple2 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "gives":<12, 18>, "us":<19, 22>

Where < spanStartIndex, spanEndIndex > represents a span.

To join on the review attribute with the condition that spans lie within a distance of 10 characters, we can write:

JoinDistancePredicate joinPredicate = new JoinDistancePredicate(idAttr, reviewAttr, 10);

Since both tuples have the same ID, we can perform the join on the two span lists.

The span distance is computed as:

|(span 1 spanStartIndex) - (span 2 spanStartIndex)| OR |(span 1 spanEndIndex) - (span 2 spanEndIndex)|
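
The distance check above can be sketched in Java as follows. This is a minimal illustration, not the TextDB implementation: the class and method names are hypothetical, "OR" is read here as a logical or between the two distances, and "within threshold" is assumed to be inclusive (<=).

```java
// Hypothetical helper for illustration only; not part of the TextDB code base.
final class SpanDistanceSketch {

    // Returns true if the start offsets or the end offsets of the two spans
    // are no more than `threshold` characters apart (inclusive, by assumption).
    static boolean withinThreshold(int start1, int end1,
                                   int start2, int end2,
                                   int threshold) {
        int startDistance = Math.abs(start1 - start2);
        int endDistance = Math.abs(end1 - end2);
        return startDistance <= threshold || endDistance <= threshold;
    }

    public static void main(String[] args) {
        // "book":<6,11> against "gives":<12, 18>: distances are 6 and 7.
        System.out.println(withinThreshold(6, 11, 12, 18, 10)); // true
        // "book":<6,11> against "us":<19, 22>: distances are 13 and 11.
        System.out.println(withinThreshold(6, 11, 19, 22, 10)); // false
    }
}
```

If the intended condition requires both distances to be within the threshold, replacing `||` with `&&` gives that stricter reading. Both readings agree on the example spans used on this page.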

The span "book":<6,11> from tuple1 and the span "gives":<12, 18> from tuple2 satisfy this condition (|6 - 12| = 6 and |11 - 18| = 7, both within 10), so the join combines the two spans into a new span.

The combining process is computed as:

< min(span1 spanStartIndex, span2 spanStartIndex), max(span1 spanEndIndex, span2 spanEndIndex) >

The new span after joining is "book_gives":<6, 18>.

The span "book":<6,11> from tuple1 and the span "us":<19, 22> from tuple2 do not satisfy the condition (|6 - 19| = 13 and |11 - 22| = 11, both greater than 10), so they are not joined.
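
Putting the distance condition and the combining rule together, the example can be reproduced with a simple nested loop over the two span lists. This is only a sketch under stated assumptions: the Span class, the method names, and joining the keys with an underscore (inferred from "book_gives") are hypothetical and do not reflect the actual TextDB classes. The threshold is again assumed to be inclusive, with "OR" read as a logical or.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch for illustration only; not part of the TextDB code base.
final class SpanJoinSketch {

    // Minimal span holder: a key plus start and end character offsets.
    static final class Span {
        final String key;
        final int start;
        final int end;

        Span(String key, int start, int end) {
            this.key = key;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "\"" + key + "\":<" + start + ", " + end + ">";
        }
    }

    // Distance condition: start offsets or end offsets within the threshold.
    static boolean withinThreshold(Span a, Span b, int threshold) {
        return Math.abs(a.start - b.start) <= threshold
                || Math.abs(a.end - b.end) <= threshold;
    }

    // Combining rule: <min of the starts, max of the ends>.
    static Span combine(Span a, Span b) {
        return new Span(a.key + "_" + b.key,
                Math.min(a.start, b.start),
                Math.max(a.end, b.end));
    }

    // Compares every span of the first list with every span of the second.
    static List<Span> join(List<Span> outer, List<Span> inner, int threshold) {
        List<Span> result = new ArrayList<>();
        for (Span a : outer) {
            for (Span b : inner) {
                if (withinThreshold(a, b, threshold)) {
                    result.add(combine(a, b));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Span> tuple1Spans = Arrays.asList(new Span("book", 6, 11));
        List<Span> tuple2Spans = Arrays.asList(
                new Span("gives", 12, 18), new Span("us", 19, 22));
        System.out.println(join(tuple1Spans, tuple2Spans, 10));
        // prints ["book_gives":<6, 18>]
    }
}
```

The nested loop compares every pair of spans, so its cost grows with the product of the two list sizes.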

TODOs

  • Implement sorting of the spans of the results in order to improve the performance of the operator.
  • Implement other kinds of predicates to increase the robustness and utility of the operator.