StreamingSymmetricHashJoinExec Binary Physical Operator — Stream-Stream Joins

StreamingSymmetricHashJoinExec is a binary physical operator that represents a Join logical operator of two streaming queries at execution time.

Note

A binary physical operator (BinaryExecNode) is a physical operator with left and right child physical operators.

Read up on BinaryExecNode (and physical operators in general) in The Internals of Spark SQL book.

Note

The Join logical operator represents the Dataset.join operator in a logical query plan.

StreamingSymmetricHashJoinExec requires that the join type be Inner, LeftOuter, or RightOuter with the same data types of the left and the right keys.
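The two requirements above can be illustrated with a minimal sketch. It uses stand-in types only: JoinRequirementsSketch, Key and validate are hypothetical names made up for this example and are not part of Spark, and Cross is used merely as an example of a join type outside the supported set.

```scala
// Stand-in types for illustration only. Spark's real JoinType and DataType
// live in org.apache.spark.sql.catalyst.plans and org.apache.spark.sql.types.
object JoinRequirementsSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object Cross extends JoinType // an example of an unsupported join type

  // Hypothetical, simplified join key: a name plus the name of its data type.
  final case class Key(name: String, dataType: String)

  private val supported: Set[JoinType] = Set(Inner, LeftOuter, RightOuter)

  /** Returns Right(()) when both requirements hold, Left(reason) otherwise. */
  def validate(
      joinType: JoinType,
      leftKeys: Seq[Key],
      rightKeys: Seq[Key]): Either[String, Unit] =
    if (!supported.contains(joinType))
      Left(s"$joinType stream-stream join is not supported")
    else if (leftKeys.map(_.dataType) != rightKeys.map(_.dataType))
      Left("Join keys of the left and right sides must have the same data types")
    else
      Right(())
}
```

With this sketch, an Inner join on keys of the same type passes, while a Cross join or a long-to-string key pair is rejected with a reason.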

StreamingSymmetricHashJoinExec is created exclusively when the StreamingJoinStrategy execution planning strategy is requested to plan a logical query plan with a Join logical operator of two streaming queries with equality predicates (EqualTo and EqualNullSafe).
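As a rough illustration of a query that should be planned this way, the following sketch joins two streaming Datasets on an equality predicate. It assumes Spark SQL is on the classpath and uses the built-in rate source; the object name, column aliases and application name are made up for the example. The exact text of the physical plan depends on the Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StreamStreamJoinDemo {
  /** Builds an inner equi-join of two streaming DataFrames (nothing is started). */
  def joined(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // The rate source emits rows with `timestamp` and `value` columns.
    val left = spark.readStream.format("rate").load()
      .select($"value".as("leftKey"), $"timestamp".as("leftTime"))
    val right = spark.readStream.format("rate").load()
      .select($"value".as("rightKey"), $"timestamp".as("rightTime"))
    // === creates an EqualTo predicate, i.e. an equi-join of two streams.
    left.join(right, $"leftKey" === $"rightKey")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("stream-stream-join-demo")
      .getOrCreate()
    // Prints the physical plan, where the stream-stream join operator is expected to appear.
    joined(spark).explain()
    spark.stop()
  }
}
```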

StreamingSymmetricHashJoinExec is a stateful physical operator that writes to a state store.
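To show why such a join has to keep state, here is a minimal in-memory sketch of the general symmetric hash join idea: every incoming row is buffered on its own side and probed against the rows buffered so far on the other side. This is not Spark's implementation. Row, Side and JoinState are hypothetical names, and plain mutable maps stand in for the state store.

```scala
import scala.collection.mutable

object SymmetricHashJoinSketch {
  // Hypothetical row: a join key plus a payload.
  final case class Row(key: Int, value: String)

  sealed trait Side
  case object LeftSide extends Side
  case object RightSide extends Side

  /** In-memory stand-in for the per-side join state (not Spark's StateStore). */
  final class JoinState {
    private val leftState = mutable.Map.empty[Int, Vector[Row]]
    private val rightState = mutable.Map.empty[Int, Vector[Row]]

    /**
     * Buffers the row in the state of its own side and returns the matches
     * found in the state of the other side as (left, right) pairs.
     */
    def process(side: Side, row: Row): Seq[(Row, Row)] = side match {
      case LeftSide =>
        leftState.update(row.key, leftState.getOrElse(row.key, Vector.empty) :+ row)
        rightState.getOrElse(row.key, Vector.empty).map(r => (row, r))
      case RightSide =>
        rightState.update(row.key, rightState.getOrElse(row.key, Vector.empty) :+ row)
        leftState.getOrElse(row.key, Vector.empty).map(l => (l, row))
    }
  }
}
```

Because rows from either side may arrive first, both sides have to be remembered, which is what makes the operator stateful.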

The output schema of StreamingSymmetricHashJoinExec is…​FIXME

The output partitioning of StreamingSymmetricHashJoinExec is…​FIXME

Creating StreamingSymmetricHashJoinExec Instance

StreamingSymmetricHashJoinExec takes the following to be created:

  • Catalyst expressions of the keys on the left side

  • Catalyst expressions of the keys on the right side

  • JoinType

  • Join condition (JoinConditionSplitPredicates)

  • StatefulOperatorStateInfo

  • Event-time watermark

  • State watermark (JoinStateWatermarkPredicates)

  • Physical operator on the left side (SparkPlan)

  • Physical operator on the right side (SparkPlan)

StreamingSymmetricHashJoinExec initializes the internal properties.

Performance Metrics

StreamingSymmetricHashJoinExec uses the performance metrics of StateStoreWriter.

shouldRunAnotherBatch Method

shouldRunAnotherBatch(newMetadata: OffsetSeqMetadata): Boolean

Note

shouldRunAnotherBatch is part of the StateStoreWriter Contract to check whether MicroBatchExecution should run another batch (based on the updated metadata).

shouldRunAnotherBatch…​FIXME

Executing Physical Operator (Generating Recipe for Distributed Computation as RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]

Note

doExecute is part of the SparkPlan Contract to generate the runtime representation of a physical operator as a recipe for distributed computation over internal binary rows on Apache Spark (RDD[InternalRow]).

doExecute first requests the StreamingQueryManager for the StateStoreCoordinatorRef to the StateStoreCoordinator RPC endpoint (for the driver).

doExecute then requests the SymmetricHashJoinStateManager for the names of the state stores for the left and right side of the streaming join.

In the end, doExecute requests the left and right physical operators to execute (each generating an RDD) and then uses stateStoreAwareZipPartitions to zip the partitions of the two RDDs and apply processPartitions to every pair of partitions (passing along the StateStoreCoordinatorRef and the names of the state stores).
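The zip-and-process step can be sketched with plain Scala collections. The sketch below is only a model of the data flow: Seq values stand in for RDD partitions and InternalRow, zipPartitions is a simplified stand-in for stateStoreAwareZipPartitions (without any state store awareness), and the body of processPartitions is a naive placeholder, since the real one is not described here.

```scala
object ZipPartitionsSketch {
  type Row = Seq[Any]       // stand-in for InternalRow
  type Partition = Seq[Row] // stand-in for one RDD partition

  /** Pairs up co-indexed partitions and applies `f` to their iterators. */
  def zipPartitions(left: Seq[Partition], right: Seq[Partition])(
      f: (Iterator[Row], Iterator[Row]) => Iterator[Row]): Seq[Partition] = {
    require(left.size == right.size, "both sides must have the same number of partitions")
    left.zip(right).map { case (l, r) => f(l.iterator, r.iterator).toList }
  }

  /** Placeholder processPartitions: a naive per-partition inner join on the first column. */
  def processPartitions(
      leftInputIter: Iterator[Row],
      rightInputIter: Iterator[Row]): Iterator[Row] = {
    val rightRows = rightInputIter.toList
    leftInputIter.flatMap { l =>
      rightRows.collect { case r if r.head == l.head => l ++ r }
    }
  }
}
```

The point of the shape is that processPartitions only ever sees one partition from each side at a time, so rows with the same key have to be co-located in partitions with the same index.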

Processing Partitions — processPartitions Internal Method

processPartitions(
  leftInputIter: Iterator[InternalRow],
  rightInputIter: Iterator[InternalRow]): Iterator[InternalRow]

processPartitions…​FIXME

Note
processPartitions is used exclusively when StreamingSymmetricHashJoinExec physical operator is requested to execute.

Internal Properties

  • hadoopConfBcast: Hadoop Configuration broadcast (to the Spark cluster)

  • joinStateManager

  • nullLeft: GenericInternalRow of the size of the output schema of the left physical operator

  • nullRight: GenericInternalRow of the size of the output schema of the right physical operator

  • storeConf
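The all-null rows nullLeft and nullRight are presumably what an outer join pads an unmatched row with. The following sketch shows that idea with a Vector standing in for GenericInternalRow; NullPaddingSketch and its methods are hypothetical names for illustration only.

```scala
object NullPaddingSketch {
  type Row = Vector[Any] // stand-in for InternalRow

  /** A row of `size` nulls, mirroring the role of nullLeft and nullRight. */
  def nullRow(size: Int): Row = Vector.fill[Any](size)(null)

  /** Left outer output for an unmatched left row: left values followed by nulls. */
  def leftOuterUnmatched(left: Row, rightSchemaSize: Int): Row =
    left ++ nullRow(rightSchemaSize)

  /** Right outer output for an unmatched right row: nulls followed by right values. */
  def rightOuterUnmatched(right: Row, leftSchemaSize: Int): Row =
    nullRow(leftSchemaSize) ++ right
}
```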