Stateful Dataflow (Beta)

Build composable event-driven data pipelines in minutes.


Stateful Dataflow Concepts

Fluvio is an implementation of Event-Driven Architecture (EDA), and Stateful Dataflows are an extension of the same paradigm. The following concepts are essential for understanding the architecture:

 

Events

An event registers an activity that occurred in the past: an immutable fact. Events can be represented in formats such as JSON, Avro, or Protobuf. Events can be produced by sources such as databases, microservices, sensors, and other IoT or physical devices, or derived from other event streams. Stateful Dataflows use events as triggers that start the chain of operations and generate one or more results.

Events have the following properties:

  • Key - the unique identity of an event.
  • Value - the actual data of the event.
  • Time - timestamp when the event was produced.
  • Schema - defines the structure of the event.

Time is a core system primitive used by the window processing operators to order and group events. Schema is also a core primitive: it ensures events conform to a specific data format so that subsequent operations can extract data accurately. Schemas are designed to support multiple formats: JSON, Avro, Protobuf, etc.
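To make the four properties concrete, here is a hypothetical temperature-sensor event sketched in YAML. The field names are illustrative only, not a wire format:

# A hypothetical temperature-sensor event (illustrative, not a wire format)
key: sensor-42                   # Key: unique identity of the event
value:                           # Value: the actual data of the event
  temperature_c: 21.4
time: "2024-03-01T12:00:00Z"     # Time: when the event was produced
schema: temperature-reading-v1   # Schema: the format the value must conform to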

 

Event Streams & Partitions

An event stream is an asynchronous, unbounded collection of events. Unbounded means never-ending: a temperature sensor, for example, generates events continuously. Asynchronous means events arrive at any time rather than being fetched periodically from a data store. Event streams are partitioned to process high volumes of data in parallel.

Stateful Dataflows can apply shared business logic in parallel across multiple partitions.
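For example, you can create a topic with multiple partitions using the fluvio CLI, so that downstream dataflows can process its events in parallel (the topic name here is illustrative):

$ fluvio topic create sensor-readings --partitions 3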

 

State

State is an “opinion derived from a fact.” In technical terms, states are aggregate objects computed over streams of events: for example, counting cars passing an intersection, calculating profits on financial transactions, or detecting fraud. States are persistent and survive restarts, ensuring the results remain accurate and deterministic. States enable users to make decisions based on accumulated data.

At present, Stateful Dataflows support the following state types:

  • Key-value
  • Windowing

The key-value state performs aggregate operations on unbounded streams, where a specific key captures the value of a computation; Fluvio offset management, for example, uses a key-value state to store the last value for each client key. The windowing state performs the same operations on bounded streams, as defined by the window configuration. Check out Windowing for additional information.
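As a sketch, a key-value state might be declared in the dataflow file along these lines. The keywords below are assumptions that may differ between SDF releases; the Dataflow File section has the authoritative schema:

# Hypothetical key-value state: a running count per key (field names assumed)
states:
  count-per-sensor:
    type: keyed-state
    properties:
      key:
        type: string      # e.g. the sensor id
      value:
        type: u32         # the accumulated count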

 

Operators

Operators are system operations implemented by Fluvio and available for use in your applications. They range from basic operations, such as filter, map, and flat-map, to complex windowing operations, such as group-by and aggregates.
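As an illustration, an operator is attached to a service as a transform in the dataflow file. The sketch below assumes the inline-Rust style described in the Functions section that follows; names and code are illustrative:

# Hypothetical filter operator with an inline Rust function (illustrative)
transforms:
  - operator: filter
    run: |
      fn keep_non_empty(input: String) -> Result<bool> {
        Ok(!input.is_empty())
      }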

Check out the Operators section for additional information.

 

Functions

Functions are custom-defined business logic that applies domain knowledge to operators. Your functions can be programmed in any language that compiles to WebAssembly (Rust, Python, JavaScript, Go, C++, C#, etc.), though the product currently supports Rust, with Python under development.

Functions can be chained in Services to perform complex operations.
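For example, a service might chain a filter function with a map function, each applied by its own operator. This is a hedged sketch rather than a verbatim SDF definition:

# Hypothetical chain: drop empty records, then normalize case
transforms:
  - operator: filter
    run: |
      fn keep_non_empty(input: String) -> Result<bool> {
        Ok(!input.is_empty())
      }
  - operator: map
    run: |
      fn to_lowercase(input: String) -> Result<String> {
        Ok(input.to_lowercase())
      }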

 

Window Processing

Window processing divides data streams into bounded sets of records, which are then processed in the window context.

Windowing builds table aggregates for many use cases - counting, trend analysis, anomaly detection, data collection for dashboards and tables, materialized views, etc.
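As a sketch, a windowed computation might be configured along these lines. The window keywords are assumptions; see the window operators section for the authoritative syntax:

# Hypothetical tumbling-window configuration (keywords assumed)
window:
  tumbling:
    duration: 10s     # bound the stream into 10-second sets of records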

For additional information, check out the window operators section.

 

Service Chaining (DAG)

Stateful Dataflows use a DAG (Directed Acyclic Graph) to chain multiple services. The DAG, in essence, represents the logical view of the data flow between operators.

The DAG definition is expressed in YAML format. The Stateful Dataflows CLI (sdf) converts this YAML file into a series of Component Model APIs.
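Here is a minimal sketch of a two-service chain, where the first service writes to an intermediate topic that the second service reads, forming a simple DAG. All topic and service names are illustrative:

# Hypothetical DAG: raw-events -> clean -> cleaned-events -> count
services:
  clean:
    sources:
      - type: topic
        id: raw-events
    sinks:
      - type: topic
        id: cleaned-events
  count:
    sources:
      - type: topic
        id: cleaned-events
    sinks:
      - type: topic
        id: event-counts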

We are also considering a programmatic interface to express the DAG; don’t hesitate to contact us if you are interested.

 

WebAssembly Component Model

Stateful Dataflows use the WebAssembly Component Model to combine multiple WebAssembly libraries. The component model requires WIT, a language-agnostic IDL, to represent the data model and the functional interfaces.

In the next section, we will describe the components of the service file and the WASM code generated by SDF.

 

Getting Started with Stateful Dataflows

Stateful Dataflows are an extension to Fluvio, a data streaming runtime that provides a robust communication layer for event streaming services.

In the context of Fluvio, Stateful Dataflows use Fluvio topics as their data sources and target outputs. Fluvio topics serve as the interface to connectors and other services, making it easier to exchange data and facilitate communication between cluster components.

Provisioning and operating a Stateful Dataflow requires the following system components:

  1. Fluvio Cluster to connect the dataflows with data streaming.

  2. Dataflow File to define the schema, composition, services, and operations.

  3. SDF (Stateful Dataflows) CLI to build, test, and deploy the dataflows.

During preview releases, Stateful Dataflows can be built, tested, and run locally. As we approach general availability, they can also be deployed in your InfinyOn Cloud cluster. In addition, dataflows may be published to the Hub and shared with others for one-click installation.

 

Dataflow File

The dataflow file defines how the custom-defined services should interact with Fluvio topics, what data transformations should be applied, and any other relevant configuration settings. The dataflow file serves as a blueprint for your service composition.

As a general pattern, the dataflow is a composition of services, where each service reads from one or more data sources (topics), computes one or more functions, and passes the result to a state or writes to one or more sinks (topics).
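Putting the pattern together, a minimal dataflow file might look like the sketch below: one service that reads from a source topic, applies a map function, and writes to a sink topic. The field names and apiVersion are assumptions based on the pattern above; consult the Dataflow File section for the exact schema.

# Hypothetical dataflow.yaml (illustrative; exact schema may differ)
apiVersion: 0.5.0
meta:
  name: word-normalizer
  version: 0.1.0
  namespace: examples
topics:
  raw-words:
    schema:
      value:
        type: string
  clean-words:
    schema:
      value:
        type: string
services:
  normalize:
    sources:
      - type: topic
        id: raw-words
    transforms:
      - operator: map
        run: |
          fn normalize(input: String) -> Result<String> {
            Ok(input.trim().to_lowercase())
          }
    sinks:
      - type: topic
        id: clean-words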

For additional information, check out the Dataflow File section.

 

Fluvio Cluster

The Fluvio Cluster is responsible for managing and packaging all the resources required to run dataflows. This includes creating and managing the topics that serve as sources and sinks for your dataflow. The cluster also ensures the availability, scalability, and reliability of your data streams.

Create an account on InfinyOn Cloud, and provision a Fluvio cluster:

  1. Download FVM which also installs fluvio CLI:
$ curl -fsS https://hub.infinyon.cloud/install/install.sh | bash

This command downloads the Fluvio Version Manager (fvm), the Fluvio CLI (fluvio), and config files into $HOME/.fluvio, with the executables in $HOME/.fluvio/bin. To complete the installation, you will need to add the executables to your shell $PATH.
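For example, on bash-compatible shells (assuming the default install location):

$ export PATH="$HOME/.fluvio/bin:$PATH"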

  2. Sign up for a free InfinyOn Cloud account.

  3. Log in from the CLI, and provision a Fluvio cluster:

$ fluvio cloud login
$ fluvio cloud cluster create

Check out the Getting Started section for additional information.

 

SDF (Stateful Dataflow) CLI

SDF is a binary designed to assist developers in building, testing, and deploying Stateful Dataflows. SDF performs all the validation logic required to ensure each dataflow can be efficiently integrated and run in a Fluvio cluster. The SDF binary is part of the fluvio client installation package.

The SDF binary has three objectives:

  1. Build and Test Packages
  2. Run Dataflows
  3. Publish Packages and Dataflows to the Hub
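These objectives map onto sdf subcommands along the following lines. The subcommand names here are assumptions based on the objectives above and may differ by release; consult sdf --help for the authoritative list:

$ sdf build      # build a package (assumed subcommand name)
$ sdf run        # run a dataflow against a Fluvio cluster (assumed)
$ sdf publish    # publish a package or dataflow to the Hub (assumed)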

For additional information, check out the SDF section.

 

Next Steps