Dataflow

Summary

Dataflow is a Kubernetes-native platform for executing large parallel data-processing pipelines.

Each pipeline is specified as a Kubernetes custom resource consisting of one or more steps. Steps read messages from sources and write them to sinks, such as Kafka, NATS Streaming, or HTTP services.
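To make the step model concrete, here is a toy, in-memory sketch of two chained steps. It is not how Dataflow is implemented: `run_step` and the list-based "topics" are hypothetical stand-ins for real sources and sinks such as Kafka topics.

```python
from typing import Callable, Iterable, List


def run_step(source: Iterable[bytes],
             process: Callable[[bytes], bytes],
             sink: List[bytes]) -> None:
    """Hypothetical step: read each message, process it, write the result."""
    for msg in source:
        sink.append(process(msg))


# In-memory lists stand in for real sources and sinks (assumption).
topic_a: List[bytes] = [b"hello", b"world"]
topic_b: List[bytes] = []
topic_c: List[bytes] = []

# Step 1 reads topic_a and writes upper-cased messages to topic_b.
run_step(topic_a, bytes.upper, topic_b)
# Step 2 reads topic_b and writes messages with a suffix to topic_c.
run_step(topic_b, lambda m: m + b"!", topic_c)
```

The sink of one step acts as the source of the next, which is how a multi-step pipeline is chained together.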

Each step runs zero or more pods and can scale horizontally, either using HPA or based on queue length using built-in scaling rules. Steps can be scaled to zero, in which case they periodically and briefly scale to one to measure queue length so they can scale back up.
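The scale-to-zero behaviour can be sketched as a small decision function. This is only an illustration under stated assumptions, not Dataflow's scaling rules: the names `desired_replicas` and `next_replicas`, the threshold of 1000 pending messages per replica, and the cap of 4 replicas are all hypothetical.

```python
import math
from typing import Optional


def desired_replicas(pending: int, per_replica: int = 1000,
                     max_replicas: int = 4) -> int:
    """Hypothetical rule: one replica per `per_replica` pending messages, capped."""
    if pending < 0:
        raise ValueError("pending must be non-negative")
    if pending == 0:
        return 0  # nothing queued, so the step can scale to zero
    return min(max_replicas, math.ceil(pending / per_replica))


def next_replicas(current: int, pending: Optional[int], peek_due: bool) -> int:
    """Choose the next replica count for a step.

    With zero replicas no pod is running to measure the queue, so `pending`
    is unknown (None). The step then briefly scales to one when a periodic
    peek is due, purely to take a measurement.
    """
    if current == 0:
        return 1 if peek_due else 0
    if pending is None:
        raise ValueError("a running step must report its queue length")
    return desired_replicas(pending)
```

After a peek the step is running with one replica, so the following call sees the measured queue length and either scales up or returns to zero.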

Learn more about features.

Use Cases

Real-time "click" analytics
Anomaly detection
Fraud detection
Operational (including IoT) analytics

Screenshot

Example

pip install git+https://github.com/argoproj-labs/argo-dataflow#subdirectory=dsls/python

from argo_dataflow import cron, pipeline

if __name__ == '__main__':
    (pipeline('hello')
     .namespace('argo-dataflow-system')
     .step(
        (cron('*/3 * * * * *')
         .cat()
         .log())
    )
     .run())

Documentation

Read in order:

Beginner:

Quick start
Concepts
Sources
Processors
Sinks
Examples

Intermediate:

Handlers
Git usage
Expression syntax
Garbage collection
Scaling
Command line
Kubectl
Events interop
Workflow interop
Meta-data

Advanced:

Configuration
Features
Limitations
Reliability
Metrics
Image contract
Jaeger tracing
Reading material
Security
Dataflow vs X
Contributing

Architecture Diagram