Skip to content

Latest commit

 

History

History
46 lines (31 loc) · 2.1 KB

spark-sql-streaming-ContinuousDataSourceRDD.adoc

File metadata and controls

46 lines (31 loc) · 2.1 KB

ContinuousDataSourceRDD — Input RDD of DataSourceV2ScanExec Physical Operator with ContinuousReader

ContinuousDataSourceRDD is a specialized RDD (RDD[InternalRow]) that is used exclusively for the only input RDD (with the input rows) of DataSourceV2ScanExec leaf physical operator with a ContinuousReader.

ContinuousDataSourceRDD is created exclusively when DataSourceV2ScanExec leaf physical operator is requested for the input RDDs (which there is only one actually).

ContinuousDataSourceRDD uses spark.sql.streaming.continuous.executorQueueSize configuration property for the size of the data queue.

ContinuousDataSourceRDD uses spark.sql.streaming.continuous.executorPollIntervalMs configuration property for the epochPollIntervalMs.

ContinuousDataSourceRDD takes the following to be created:

  • SparkContext

  • Size of the data queue

  • epochPollIntervalMs

  • InputPartition[InternalRow]s

ContinuousDataSourceRDD uses InputPartition (of a ContinuousDataSourceRDDPartition) for preferred host locations (where the input partition reader can run faster).

Computing Partition — compute Method

compute(
  split: Partition,
  context: TaskContext): Iterator[InternalRow]
Note
compute is part of the RDD Contract to compute a given partition.

compute…​FIXME

getPartitions Method

getPartitions: Array[Partition]
Note
getPartitions is part of the RDD Contract to specify the partitions to compute.

getPartitions…​FIXME