Streaming Source

A Streaming Source represents a continuous stream of data for a streaming query. It generates batches of data, each as a DataFrame, for a given pair of start and end offsets. For fault tolerance, a source must be able to replay data given a start offset.

A streaming source should be able to replay an arbitrary sequence of past data in the stream using a range of offsets. Only sources that have the concept of a per-record offset, such as Kafka and Kinesis, fit this model. Structured Streaming relies on this assumption to achieve end-to-end exactly-once guarantees.
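
The replay requirement can be illustrated with a small model. The sketch below is not Spark code: ReplayableLog and its read method are hypothetical names, and offsets are assumed to be plain Long positions in an append-only log, with a range read as (start, end]. It shows that reading the same offset range twice returns the same records, which is what would let a restarted query recompute a batch.

```scala
// A toy model of a replayable stream (hypothetical, not part of Spark).
// Assumption: an offset is the position of a record in an append-only log.
object ReplaySketch {
  final class ReplayableLog[A] {
    private val records = scala.collection.mutable.ArrayBuffer.empty[A]

    // Appends a record and returns the offset assigned to it.
    def append(record: A): Long = {
      records += record
      records.size - 1L
    }

    // Returns the records in the range (start, end]:
    // start is exclusive, end is inclusive, and start = -1 means "from the beginning".
    def read(start: Long, end: Long): Seq[A] =
      records.slice((start + 1).toInt, (end + 1).toInt).toList
  }

  def main(args: Array[String]): Unit = {
    val log = new ReplayableLog[String]
    Seq("a", "b", "c", "d").foreach(r => log.append(r))

    val first = log.read(0L, 2L)  // records at offsets 1 and 2
    val replay = log.read(0L, 2L) // the same range, read again after a simulated failure
    println(first)
    println(first == replay)
  }
}
```

A source without stable per-record offsets (for example, a plain socket) could not implement read this way, because the data for an earlier range would no longer be available.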

The Source trait has the following features:

the schema of the data (of type StructType, using the schema method)
the maximum available offset (of type Offset, using the getOffset method)
a batch for given start and end offsets (of type DataFrame, using the getBatch method)
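
How these three features fit together can be sketched with simplified stand-ins. The code below is an illustration under stated assumptions, not the Spark trait itself: ToySource, FixedSource, and nextBatch are hypothetical names, and a list of column names, a Long, and a Seq[Int] stand in for StructType, Offset, and DataFrame. It shows one step of a polling loop that asks for the maximum offset and then requests the batch after the last committed offset.

```scala
// Simplified stand-ins for Spark's types (hypothetical, for illustration only):
// a schema is a list of column names, an offset is a Long, a batch is a Seq[Int].
object SourceSketch {
  trait ToySource {
    def schema: Seq[String]
    def getOffset: Option[Long] // None while no data is available
    def getBatch(start: Option[Long], end: Long): Seq[Int] // rows in (start, end]
  }

  // A source backed by a fixed in-memory sequence; offset i refers to data(i).
  final class FixedSource(data: Vector[Int]) extends ToySource {
    val schema: Seq[String] = Seq("value")

    def getOffset: Option[Long] =
      if (data.isEmpty) None else Some(data.size - 1L)

    def getBatch(start: Option[Long], end: Long): Seq[Int] = {
      val from = start.map(_ + 1).getOrElse(0L).toInt
      data.slice(from, (end + 1).toInt)
    }
  }

  // One step of a polling loop: ask for the maximum offset, then fetch
  // everything after the last committed offset up to that maximum.
  // Returns None when there is nothing new to process.
  def nextBatch(source: ToySource, committed: Option[Long]): Option[(Long, Seq[Int])] =
    source.getOffset
      .filter(end => committed.forall(_ < end))
      .map(end => (end, source.getBatch(committed, end)))

  def main(args: Array[String]): Unit = {
    val source = new FixedSource(Vector(10, 20, 30))
    println(nextBatch(source, None))     // everything from the beginning
    println(nextBatch(source, Some(0L))) // only rows after offset 0
    println(nextBatch(source, Some(2L))) // nothing new
  }
}
```

Because the start offset is passed in by the caller rather than tracked inside the source, the same batch can be requested again after a failure.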

It lives in the org.apache.spark.sql.execution.streaming package.

import org.apache.spark.sql.execution.streaming.Source

There are two available Source implementations:

FileStreamSource
MemoryStream
