Join Operator
Author: Sripad Kowshik Subramanyam
Implement an operator that takes two other operators as input and joins their results based on constraints specified by a predicate.
As of 9/14/2016: IN PROGRESS
edu.uci.ics.textdb.dataflow.common
edu.uci.ics.textdb.dataflow.join
https://github.com/TextDB/textdb/issues/111
The Join Operator joins the results of two other operators on a given field, subject to constraints supplied by the user. The field to join on and the constraints to be satisfied are specified using JoinPredicate. The getNextTuple() method returns the next result of the operator.
Currently supported predicates are:
- JoinDistancePredicate: Takes the attribute that specifies the ID, the attribute of the field to join on, and a distance threshold. If the distance between two spans of that field in the results to be joined is within the threshold, the join is performed.
The setting below is used in the following examples of JoinDistancePredicate (consider the two tuples to come from two different operators).
| | id | author | review | spanList |
|---|---|---|---|---|
| tuple1 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "book":<6,11> |
| tuple2 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "gives":<12, 18>, "us":<19, 22> |
Here, < spanStartIndex, spanEndIndex > represents a span.
If we want to join over the review attribute with the condition that spans lie within a distance of 10 characters, we can write:
JoinDistancePredicate joinPredicate = new JoinDistancePredicate(idAttr, reviewAttr, 10);
Since both tuples have the same ID, we can perform the join on the two span lists.
The span distance is computed as:
|(span 1 spanStartIndex) - (span 2 spanStartIndex)| OR |(span 1 spanEndIndex) - (span 2 spanEndIndex)|
The span "book":<6,11> from tuple1 and the span "gives":<12, 18> from tuple2 satisfy this condition (the start indices differ by 6 and the end indices by 7), so the join combines the two spans into a new span.
The combining process is computed as:
< min(span1 spanStartIndex, span2 spanStartIndex), max(span1 spanEndIndex, span2 spanEndIndex) >
The new span after joining is "book_gives":<6, 18>.
The span "book":<6,11> from tuple1 and the span "us":<19, 22> from tuple2 do not satisfy the condition (the start indices differ by 13 and the end indices by 11), so they are not joined.
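The worked example above can be reproduced with a small standalone sketch. It is not the TextDB implementation: the class SpanJoinSketch, the nested Span type, and the methods withinThreshold, combine, and join are hypothetical names introduced only for illustration. The sketch also assumes that "within threshold" means less than or equal to the threshold, that the "OR" in the distance formula is read literally (either difference being within the threshold is enough), and that the joined key is the two keys connected by an underscore, as in "book_gives".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpanJoinSketch {

    /** Hypothetical stand-in for a span: a key plus start and end character indices. */
    static final class Span {
        final String key;
        final int start;
        final int end;

        Span(String key, int start, int end) {
            this.key = key;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "\"" + key + "\":<" + start + ", " + end + ">";
        }
    }

    // Assumption: the "OR" in the distance formula is read literally, and
    // "within threshold" is taken to mean <= threshold.
    static boolean withinThreshold(Span a, Span b, int threshold) {
        return Math.abs(a.start - b.start) <= threshold
                || Math.abs(a.end - b.end) <= threshold;
    }

    // Combined span: < min(start1, start2), max(end1, end2) >.
    // Assumption: the new key joins the two keys with an underscore.
    static Span combine(Span a, Span b) {
        return new Span(a.key + "_" + b.key,
                Math.min(a.start, b.start),
                Math.max(a.end, b.end));
    }

    // Nested-loop join over the span lists of two tuples that share an ID.
    static List<Span> join(List<Span> outer, List<Span> inner, int threshold) {
        List<Span> joined = new ArrayList<>();
        for (Span a : outer) {
            for (Span b : inner) {
                if (withinThreshold(a, b, threshold)) {
                    joined.add(combine(a, b));
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Span> tuple1Spans = Arrays.asList(new Span("book", 6, 11));
        List<Span> tuple2Spans = Arrays.asList(
                new Span("gives", 12, 18), new Span("us", 19, 22));
        // Only "book" and "gives" are close enough to be joined.
        System.out.println(join(tuple1Spans, tuple2Spans, 10));
    }
}
```

With the threshold of 10 used above, only the pair "book" and "gives" is combined. If the intended condition requires both differences to be within the threshold, replacing `||` with `&&` in withinThreshold gives the same result for this example.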
- Implement sorting of the spans of the results in order to improve the performance of the operator.
- Implement other kinds of predicates to increase the robustness and utility of the operator.