StreamingSymmetricHashJoinExec Binary Physical Operator — Stream-Stream Joins

StreamingSymmetricHashJoinExec is a binary physical operator that represents a Join logical operator of two streaming queries at execution time.

Note	A binary physical operator (`BinaryExecNode`) is a physical operator with left and right child physical operators. Read up on BinaryExecNode (and physical operators in general) in The Internals of Spark SQL book.

Note	`Join` logical operator represents `Dataset.join` operator in a logical query plan. Read up on Join Logical Operator in The Internals of Spark SQL book.

StreamingSymmetricHashJoinExec supports Inner, LeftOuter, and RightOuter join types only, and requires that the left and right join keys have the same data types.
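
The two checks can be pictured with a small, Spark-free sketch. The `JoinType` and `DataType` hierarchies below are hypothetical stand-ins for the Catalyst classes of the same names, and `StreamStreamJoinRules.validate` is an illustrative helper, not a Spark API. Only the two rules themselves come from the description above.

```scala
// Hypothetical stand-ins for Catalyst's JoinType and DataType (illustration only).
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType

sealed trait DataType
case object LongType extends DataType
case object StringType extends DataType

object StreamStreamJoinRules {
  // Join types accepted by the operator, as listed in this page.
  private val supported: Set[JoinType] = Set(Inner, LeftOuter, RightOuter)

  // Returns Right(()) when both rules hold, or Left(reason) for the first rule that fails.
  def validate(
      joinType: JoinType,
      leftKeyTypes: Seq[DataType],
      rightKeyTypes: Seq[DataType]): Either[String, Unit] = {
    if (!supported.contains(joinType)) {
      Left(s"$joinType is not supported")
    } else if (leftKeyTypes != rightKeyTypes) {
      Left("left and right join keys must have the same data types")
    } else {
      Right(())
    }
  }
}
```

With these definitions, an Inner join on two `LongType` keys passes, while a FullOuter join or a `LongType`-to-`StringType` key pair is rejected with a reason.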

StreamingSymmetricHashJoinExec is created exclusively when the StreamingJoinStrategy execution planning strategy is requested to plan a logical query plan with a Join logical operator of two streaming queries with equality predicates (EqualTo and EqualNullSafe).

StreamingSymmetricHashJoinExec is a stateful physical operator that writes to a state store.
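
To picture why a symmetric hash join needs state, here is a minimal in-memory sketch of the general technique: every incoming row is stored on its own side and then probed against the rows buffered so far on the other side. `Event`, `SideState`, and `SymmetricHashJoin` are hypothetical names introduced for this illustration. The sketch uses a plain mutable map instead of a state store, covers inner joins only, and never removes old state, so it should not be read as Spark's implementation.

```scala
import scala.collection.mutable

// Hypothetical row type for illustration: a join key plus a payload.
final case class Event(key: Int, payload: String)

// One side of the join: buffers every row seen so far, grouped by key.
// This is an in-memory stand-in for a state store.
final class SideState {
  private val rowsByKey = mutable.Map.empty[Int, Vector[Event]]

  def store(row: Event): Unit =
    rowsByKey(row.key) = rowsByKey.getOrElse(row.key, Vector.empty) :+ row

  def lookup(key: Int): Vector[Event] =
    rowsByKey.getOrElse(key, Vector.empty)
}

// Inner join only: both sides are treated the same way, hence "symmetric".
final class SymmetricHashJoin {
  private val leftState = new SideState
  private val rightState = new SideState

  // A left row is stored in the left state, then matched against the right state.
  def onLeft(row: Event): Vector[(Event, Event)] = {
    leftState.store(row)
    rightState.lookup(row.key).map(r => (row, r))
  }

  // A right row is stored in the right state, then matched against the left state.
  def onRight(row: Event): Vector[(Event, Event)] = {
    rightState.store(row)
    leftState.lookup(row.key).map(l => (l, row))
  }
}
```

Because each row is stored before the other side is probed, a matching pair is emitted exactly once, by whichever row arrives second. The buffers in this sketch grow without bound, which is the problem that watermark-based state removal addresses in a real streaming join.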

The output schema of StreamingSymmetricHashJoinExec is…FIXME

The output partitioning of StreamingSymmetricHashJoinExec is…FIXME

Creating StreamingSymmetricHashJoinExec Instance

StreamingSymmetricHashJoinExec takes the following to be created:

Catalyst expressions of the keys on the left side
Catalyst expressions of the keys on the right side
JoinType
Join condition (JoinConditionSplitPredicates)
StatefulOperatorStateInfo
Event-time watermark
State watermark (JoinStateWatermarkPredicates)
Physical operator on the left side (SparkPlan)
Physical operator on the right side (SparkPlan)

StreamingSymmetricHashJoinExec initializes the internal properties.

Performance Metrics

StreamingSymmetricHashJoinExec uses the performance metrics of StateStoreWriter.

`shouldRunAnotherBatch` Method

shouldRunAnotherBatch(newMetadata: OffsetSeqMetadata): Boolean

Note	`shouldRunAnotherBatch` is part of the StateStoreWriter Contract to check whether MicroBatchExecution should run another batch (based on the updated metadata).

shouldRunAnotherBatch…FIXME

Executing Physical Operator (Generating Recipe for Distributed Computation as RDD[InternalRow]) — `doExecute` Method

doExecute(): RDD[InternalRow]

Note	`doExecute` is part of `SparkPlan` Contract to generate the runtime representation of a physical operator as a recipe for distributed computation over internal binary rows on Apache Spark (`RDD[InternalRow]`).

doExecute first requests the StreamingQueryManager for the StateStoreCoordinatorRef to the StateStoreCoordinator RPC endpoint (for the driver).

doExecute then requests the SymmetricHashJoinStateManager for the names of the state stores for the left and right side of the streaming join.

In the end, doExecute requests the left and right physical operators to execute (each generating an RDD) and zips their partitions using stateStoreAwareZipPartitions, with processPartitions as the function applied to every pair of partitions (along with the StateStoreCoordinatorRef and the state store names).
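
The partition-wise zipping in the last step can be sketched without Spark. In the sketch below an "RDD" is modeled as a plain sequence of partitions, and `ZipPartitionsSketch`, `zipPartitions`, and `joinPartition` are hypothetical names for illustration. It leaves out everything related to state stores and the StateStoreCoordinatorRef, and `joinPartition` is only a placeholder for the role that processPartitions plays.

```scala
object ZipPartitionsSketch {
  // Pairs up the i-th partition of each side and applies `f` to the pair,
  // similar in spirit to zipping the partitions of two RDDs.
  def zipPartitions[A, B, C](left: Seq[Seq[A]], right: Seq[Seq[B]])(
      f: (Iterator[A], Iterator[B]) => Iterator[C]): Seq[Seq[C]] = {
    require(left.length == right.length, "both sides must have the same number of partitions")
    left.zip(right).map { case (l, r) => f(l.iterator, r.iterator).toVector }
  }

  // Placeholder partition function: a simple inner hash join on the first tuple element.
  def joinPartition(
      leftRows: Iterator[(Int, String)],
      rightRows: Iterator[(Int, String)]): Iterator[(Int, String, String)] = {
    // Build a lookup table from the right side, then probe it with every left row.
    val rightByKey = rightRows.toVector.groupBy(_._1)
    leftRows.flatMap { case (key, leftValue) =>
      rightByKey.getOrElse(key, Vector.empty).map { case (_, rightValue) =>
        (key, leftValue, rightValue)
      }
    }
  }
}
```

The sketch only works if rows with the same key sit in partitions with the same index on both sides, which is why both inputs of a partition-wise join have to be partitioned the same way on the join keys.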

Processing Partitions — `processPartitions` Internal Method

processPartitions(
  leftInputIter: Iterator[InternalRow],
  rightInputIter: Iterator[InternalRow]): Iterator[InternalRow]

processPartitions…FIXME

Note	`processPartitions` is used exclusively when `StreamingSymmetricHashJoinExec` physical operator is requested to execute.

Internal Properties

Name	Description
`hadoopConfBcast`	Hadoop Configuration broadcast (to the Spark cluster). Used exclusively to create a SymmetricHashJoinStateManager.
`joinStateManager`	SymmetricHashJoinStateManager. Used when `OneSideHashJoiner` is requested to storeAndJoinWithOtherSide, removeOldState, commitStateAndGetMetrics, and to get the values for a given key.
`nullLeft`	`GenericInternalRow` of the size of the output schema of the left physical operator
`nullRight`	`GenericInternalRow` of the size of the output schema of the right physical operator
`storeConf`	StateStoreConf. Used exclusively to create a SymmetricHashJoinStateManager.
