The processing pipeline is our streaming service between ingestion and the analytics database. It reads from Cloud PubSub and writes to Cloud SQL.
* Currently we are not using Cloud Dataflow. Instead, we use the Apache Beam Direct Runner and run the application on our Kubernetes cluster.
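As a rough illustration of how that setup looks in code, the sketch below pins a Beam pipeline to the Direct Runner via standard `PipelineOptions`. The class name and structure are assumptions for illustration, not copied from `ProcessingApplication.java`.

```java
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerConfigSketch {
    public static void main(String[] args) {
        // Parse command-line options and pin the pipeline to the Direct Runner,
        // so it executes inside the JVM of the Kubernetes pod instead of on Dataflow.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        options.setRunner(DirectRunner.class);

        Pipeline pipeline = Pipeline.create(options);
        // ... transforms are attached here (see the pipeline sketch further below) ...
        pipeline.run().waitUntilFinish();
    }
}
```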
The pipeline has three major steps to read, convert, and write data from PubSub into Cloud SQL. All of these steps are defined in `ProcessingApplication.java`; a simplified sketch of the wiring follows the list.

- `PubSubIO`: The pipeline continuously listens to the PubSub topic `ingestion-prod`. Every new data set is passed to the next step of the pipeline.
- `ConstructDatabaseOutputOperations`: In this step we prepare all operations that have to be executed on our SQL database to apply the new object received from PubSub. For this we use two data transfer objects: `InputOperationDto` and `OutputOperationDto`. We parse the input from PubSub into an `InputOperationDto` and then convert it into multiple encapsulated `OutputOperationDto`s, which contain all SQL write/delete operations.
- `ApplyOutputOperationsToDatabase`: Finally, we apply all prepared `OutputOperationDto`s to the database.
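To make the three steps concrete, here is a minimal sketch of how the wiring could look with the Beam Java SDK. Only the step names, the topic `ingestion-prod`, and the two DTO names come from the description above; the DTO fields, the DoFn bodies, and the `my-project` GCP project are placeholders, not the real implementation in `ProcessingApplication.java`.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PipelineSketch {

    // Hypothetical shape of the parsed PubSub message (the real InputOperationDto may differ).
    static class InputOperationDto implements java.io.Serializable {
        String payloadJson;
    }

    // Hypothetical shape of one prepared SQL operation (the real OutputOperationDto may differ).
    static class OutputOperationDto implements java.io.Serializable {
        String sqlStatement; // e.g. an INSERT, UPDATE, or DELETE statement
    }

    // Step 2: turn one incoming message into the SQL operations needed to apply it.
    static class ConstructDatabaseOutputOperations extends DoFn<String, OutputOperationDto> {
        @ProcessElement
        public void processElement(@Element String message, OutputReceiver<OutputOperationDto> out) {
            // Parse the raw message into an InputOperationDto (real parsing omitted here) ...
            InputOperationDto input = new InputOperationDto();
            input.payloadJson = message;
            // ... then emit one OutputOperationDto per required write/delete operation.
            OutputOperationDto op = new OutputOperationDto();
            op.sqlStatement = "/* derived from: " + input.payloadJson + " */";
            out.output(op);
        }
    }

    // Step 3: execute each prepared operation against Cloud SQL.
    static class ApplyOutputOperationsToDatabase extends DoFn<OutputOperationDto, Void> {
        @ProcessElement
        public void processElement(@Element OutputOperationDto op) {
            // In the real pipeline this would run op.sqlStatement via JDBC against Cloud SQL.
        }
    }

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // Step 1: continuously read from the ingestion-prod topic
            // ("my-project" is a placeholder for the real GCP project).
            .apply("PubSubIO",
                PubsubIO.readStrings().fromTopic("projects/my-project/topics/ingestion-prod"))
            // Step 2: prepare the SQL operations for each message.
            .apply("ConstructDatabaseOutputOperations", ParDo.of(new ConstructDatabaseOutputOperations()))
            // Step 3: apply them to the database.
            .apply("ApplyOutputOperationsToDatabase", ParDo.of(new ApplyOutputOperationsToDatabase()));

        pipeline.run().waitUntilFinish();
    }
}
```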
I created a detailed communication flow diagram with the most important classes and their methods and properties: