
Join Operator

sripadks edited this page Sep 14, 2016 · 18 revisions

Author: Sripad Kowshik Subramanyam

Synopsis

Implement an operator that takes two other operators and joins their results based on constraints specified by a predicate.

Status

As of 9/14/2016: IN PROGRESS

Modules

edu.uci.ics.textdb.dataflow.common
edu.uci.ics.textdb.dataflow.join

Related Issues

https://github.com/TextDB/textdb/issues/111

Description

The Join Operator joins the results of two other operators on a given field, subject to constraints supplied by the user. The field to join on and the constraints to be satisfied are specified using a JoinPredicate. The getNextTuple() method returns the next result of the operator.
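The pull-style use of getNextTuple() can be illustrated with a minimal sketch. The Operator interface, the fromList and drain helpers, and the convention that null signals an exhausted operator are all assumptions made for illustration; they are not taken from the TextDB code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class PullLoopSketch {

    /** Hypothetical minimal operator: yields the next result, or null when exhausted (assumption). */
    interface Operator<T> {
        T getNextTuple();
    }

    /** Hypothetical helper: wraps a list as an operator so the sketch is self-contained. */
    static <T> Operator<T> fromList(List<T> items) {
        Iterator<T> it = items.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Pulls results one at a time until the operator reports that it has no more. */
    static <T> List<T> drain(Operator<T> op) {
        List<T> out = new ArrayList<>();
        T tuple;
        while ((tuple = op.getNextTuple()) != null) {
            out.add(tuple);
        }
        return out;
    }

    public static void main(String[] args) {
        Operator<String> op = fromList(Arrays.asList("tuple1", "tuple2"));
        System.out.println(drain(op));
    }
}
```

A join operator built this way would itself pull from its two input operators inside its own getNextTuple().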

Currently supported predicates are:

  • JoinDistancePredicate: Takes the attribute that specifies the ID, the attribute of the field to join on, and a distance threshold. The join is performed if the distance between the two spans of that field in the results being joined is within the threshold.

    The setting below, and the examples that use it, show how JoinDistancePredicate works (consider the two tuples to come from two different operators).

    Setting: Consider the attributes in the schema to be idAttr (type integer), authorAttr (type string), reviewAttr (type text), spanAttr (type span) for a book review.

    Let bookTuple1 be { idAttr:58, authorAttr:"Bruce Wayne", reviewAttr:"This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman.", { "book":<6, 11> } }

    Let bookTuple2 be { idAttr:58, authorAttr:"Bruce Wayne", reviewAttr:"This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman.", { "gives":<12, 18>, "us":<19, 22>} }

    Where < spanStartIndex, spanEndIndex > represents a span.

    JoinPredicate joinPre = new JoinDistancePredicate(idAttr, reviewAttr, 10);

    Example 1

Suppose the outer tuple is bookTuple1 and the inner tuple is bookTuple2 (from two different operators), and we want to join the words "book" and "gives" over reviewAttr. Since both tuples have the same ID, the distance between the words is computed from their spans as |(span 1 spanStartIndex) - (span 2 spanStartIndex)| and |(span 1 spanEndIndex) - (span 2 spanEndIndex)|. Here these are |6 - 12| = 6 and |11 - 18| = 7, both within the threshold of 10 characters, so the join takes place and returns a tuple whose span list holds the combined span, computed as < min(span1 spanStartIndex, span2 spanStartIndex), max(span1 spanEndIndex, span2 spanEndIndex) >, which gives <6, 18>. The returned tuple is { idAttr:58, authorAttr:"Bruce Wayne", reviewAttr:"This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman.", { "book_gives":<6, 18> } }

Example 2

Consider the previous example, but with the words "book" and "us" to be joined. The tuple IDs are the same, but the words are more than 10 characters apart (|6 - 19| = 13 and |11 - 22| = 11), so the join does not produce a result and simply returns null.
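The distance check and span combination used in Examples 1 and 2 can be sketched as follows. This is a minimal illustration, not the operator's actual implementation: the Span class and the withinThreshold, combine and join helpers are hypothetical names, the ID comparison is omitted, and the sketch assumes that both the start and end distances must be at most the threshold.

```java
final class SpanJoinSketch {

    /** Hypothetical stand-in for a span: <spanStartIndex, spanEndIndex>. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "<" + start + ", " + end + ">";
        }
    }

    /** Assumption: both the start distance and the end distance must be within the threshold. */
    static boolean withinThreshold(Span a, Span b, int threshold) {
        return Math.abs(a.start - b.start) <= threshold
                && Math.abs(a.end - b.end) <= threshold;
    }

    /** Combined span: < min of the starts, max of the ends >. */
    static Span combine(Span a, Span b) {
        return new Span(Math.min(a.start, b.start), Math.max(a.end, b.end));
    }

    /** Returns the combined span, or null when the spans are too far apart. */
    static Span join(Span outer, Span inner, int threshold) {
        return withinThreshold(outer, inner, threshold) ? combine(outer, inner) : null;
    }

    public static void main(String[] args) {
        Span book = new Span(6, 11);
        Span gives = new Span(12, 18);
        Span us = new Span(19, 22);

        System.out.println(join(book, gives, 10)); // prints <6, 18> (Example 1)
        System.out.println(join(book, us, 10));    // prints null (Example 2)
    }
}
```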

TODOs

  • Implement sorting of the spans of the results in order to improve the performance of the operator.
  • Implement other kinds of predicates to increase the robustness and utility of the operator.