
Streaming SQL in Stateful DataFlows

Deb RoyChowdhury

VP Product, InfinyOn Inc.

SQL: The Universal Language of Data

Remember when SQL was the only way to talk to your data? It wasn’t just a query language - it was the query language. But its story goes deeper than syntax.

From medieval ports to modern databases

Just as merchants in medieval Mediterranean ports needed a shared language to trade (that’s where “lingua franca” came from), the tech world needed SQL to make data accessible across different systems and teams.

If you’re in a room with a DBA, a data analyst, and a business analyst, what’s the one language they all speak? Likely SQL.

SELECT product_name, COUNT(*) as orders
FROM sales
WHERE region = 'EMEA'
GROUP BY product_name

Look familiar? Whether you’re running Oracle, Postgres, or MySQL, this just works. Well, sort of!

Why SQL prevails

Three key factors made SQL a utility that has stood the test of time.

  1. It’s Human-Friendly. Instead of telling machines HOW to get data, you just say WHAT you want. SELECT * FROM users WHERE status = 'active' reads almost like English.

  2. It’s Everywhere. From startups to Fortune 500s, SQL skills travel. Write once, run anywhere - from healthcare to fintech.

  3. It Just Works. Need to analyze sales data? Track user behavior? SQL’s got you covered, backed by decades of tooling and optimization.

Take it with a pinch of salt and pragmatism

If we’re honest, SQL has always come with some tradeoffs and issues:

  • Microsoft’s T-SQL speaks differently than Oracle’s PL/SQL
  • Complex nested data structures can get messy
  • SQL was designed primarily for transactional workloads

Yet we stuck with SQL because it solved the fundamental problem: helping humans and machines speak the same language about data.

SQL is still useful

In today’s world of microservices, real-time streams, and distributed systems, SQL still has utility.

Architects don’t just choose a query language - they choose an ecosystem for the entire organization to build with. Even as we build modern, streaming-first continuous data stacks, SQL offers ease of use and convenience.

Think about it: When was the last time you built a data-driven system that didn’t involve SQL in some way? Its journey from IBM’s labs to global standard wasn’t just about technology - it was about creating a common ground where technology and business could meet.

SQL is here to stay

Developers in the 1970s were trapped in proprietary systems, each with its own quirky, custom language, and dreamt of better ways to access data. Retrieving data was a fragmented nightmare until SQL arrived. Born at IBM as part of the System R project in the early 1970s, SQL was a breakthrough in making data accessible.

By 1979, a little company called Relational Software (we know them now as Oracle) launched the first commercial SQL database, proving its real-world potential. Then, in 1986, SQL became an ANSI standard, cementing its place as a portable, interoperable language across systems.

Suddenly, data wasn’t locked away anymore—SQL democratized it, sparking innovation and powering the rise of relational databases. It’s a classic tale of standardization unlocking progress, and it’s why SQL’s story still resonates today.

I wrote my first SQL query in 1998 as an 11-year-old kid. Just in time to learn about one of the major limitations of data in the form of date formats and timestamps, also known as the ‘Y2K problem’ or the ‘millennium bug.’

Even in 2025, SQL is not a perfect utility and has its drawbacks.

Limitations and Painful Problems We Overlook

Despite its brilliance, SQL isn’t perfect. Here are some of its limitations and why we’ve tolerated them for decades:

  1. Scalability Struggles
    Traditional SQL databases weren’t built for today’s massive datasets or high-concurrency workloads. As data exploded, NoSQL systems like MongoDB stepped in to handle scale—but SQL’s still king for structured data, so we adapt with workarounds like sharding or cloud-native solutions.

  2. Not Built for Real-Time
    SQL excels at batch processing, not real-time data streams (e.g., IoT or live analytics). Extensions like streaming SQL exist, but they’re bolt-ons to a system designed for a different era. Yet, we forgive this because batch processing gets the job done for a lot of use cases.

  3. Complex Queries Get Messy
    Ever tried writing a nested subquery with five levels? SQL can turn into a readability nightmare for advanced analytics (see the sketch after this list). Procedural extensions (e.g., PL/SQL) help, but they dilute its simplicity. We put up with it because the basics remain manageable.

  4. Vendor Lock-In
    While SQL is standardized, vendors like Oracle or Microsoft add proprietary twists (e.g., T-SQL’s quirks). Write queries on BigQuery, DuckDB, or Snowflake while processing JSON from REST APIs, and you will surely feel some irritation. Switching databases is not just lift and shift. We shrug it off because the core standard still offers decent portability.

  5. Performance Trade-Offs
    SQL’s abstraction can lead to inefficient query plans: great for ease, less so for speed. Optimization requires expertise, but we accept it because the productivity gains outweigh the tuning headaches, and we now have profilers, query optimization patterns, and large language models for text-to-SQL.
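
To make the readability complaint in point 3 concrete, here is a contrived sketch; the tables and columns are hypothetical, not from any real schema:

-- Contrived example with hypothetical tables: find big spenders
-- among shipped orders from customers in active regions.
SELECT customer_id
FROM (
  SELECT customer_id, SUM(total) AS spend
  FROM (
    SELECT o.customer_id, o.total
    FROM orders o
    WHERE o.status = 'shipped'
      AND o.customer_id IN (
        SELECT customer_id
        FROM customers
        WHERE region IN (
          SELECT region FROM regions WHERE active = true
        )
      )
  ) shipped_orders
  GROUP BY customer_id
) spend_per_customer
WHERE spend > 1000

Each level is simple on its own, but the intent of the whole statement gets buried; common table expressions help, yet deeply nested analytics remain hard to review.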

So why do we overlook these issues? SQL’s benefits outweigh the pain. Its ease of use lowers the barrier to entry, letting millions work with data without mastering low-level programming. The mature ecosystem of libraries, tooling, and ORMs keeps it relevant. Plus, SQL evolves: modern adaptations (e.g., cloud databases like Snowflake or streaming extensions) address many shortcomings. It’s not perfect, but it’s the foundation we’ve built on, and it’s not going anywhere.

SQL in event streaming

In a world of artificial intelligence, Web3, and global markets, event streaming is no longer a luxury; it’s a basic need. Whether you’re optimizing operations, detecting anomalies, or personalizing customer experiences, the ability to process and analyze data as it flows is profitable.

Now, if you’re thinking that most people don’t really need real-time, that batch processing works just fine, and that streaming is too complex to be worth the effort, think again.

Ask yourself:

  • Is your application combining, enriching, and aggregating data from multiple sources in your customer’s context?
  • Is your application serving user-facing analytics or intelligent, data-intensive functionality?
  • Are your customers happy to follow set usage and access patterns based on your schedule?
  • Are they happy with stale insights?

Or do they need fresh data on demand processed asynchronously?

If we’re real, there is more demand for event streaming and event-driven architecture than ever before. From time-sensitive analytics to agentic AI, there are more and more scenarios where streaming is the ideal choice.

SQL Streaming in Stateful DataFlows

We’re stoked to finally deliver the much-awaited SQL stream processing functionality in Stateful DataFlow. This powerful addition brings the familiarity of SQL to the dynamic realm of real-time data, empowering developers to query and process streams with simplicity and efficiency.

SQL Stream Processing

SQL stream processing lets you apply SQL queries to continuously flowing data, enabling real-time analytics and decision-making. It bridges the gap between traditional SQL—designed for static datasets—and the fluid, dynamic nature of streaming data. With SQL stream processing, you can filter, project, and aggregate data as it arrives, delivering immediate insights that drive action.
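
As a minimal sketch, assume a hypothetical page_views stream with url and country columns; the continuous query reads like the batch SQL you already write, except the results update as events arrive:

-- Hypothetical stream and columns, for illustration only:
-- count page views per URL for a few countries, refreshed continuously
SELECT url, COUNT(*) AS views
FROM page_views
WHERE country IN ('DE', 'FR', 'NL')
GROUP BY url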

Think of it as SQL optimized for columnar operations and turbocharged for real-time scenarios. It’s intuitive, accessible, and leverages the SQL skills your team already has, making it easier than ever to harness the power of streaming data.

Query and Process Event Streams

The Stateful DataFlow SQL interface is dual-purpose, offering flexibility for both querying and processing streams:

  • Querying Stateful Data: Retrieve and analyze aggregated or transformed results from streams. For example:

    SELECT * FROM count_per_word ORDER BY count DESC LIMIT 3
    

    This query might fetch the top three most frequent words from a stream of text data, returned as a dataframe for further analysis.

  • Stream Processing Operations: Embed SQL directly into your dataflow pipelines to transform data as it flows. Use operations like filtering (WHERE), projecting (SELECT), and aggregating (GROUP BY) to process streams in real time, as sketched below.
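
As a rough sketch of the processing side (the words stream and word column are hypothetical, not taken from a shipped example), a continuous aggregation of this shape is what could maintain a state like the count_per_word queried above:

-- Hypothetical words stream: maintain a running count per word,
-- which a query like the one above can then read as a dataframe
SELECT word, COUNT(*) AS count
FROM words
GROUP BY word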

Real-World Example: NY Transit Data Processing

To demonstrate this feature in action, check out the ny-transit dataflow. This dataflow uses the familiar New York Taxi dataset. We use a Fluvio utility that reads the static Parquet files and publishes the records with the Fluvio client as events over time, emulating a streaming scenario.

With SQL stream processing, you can:

  • Query streaming data that is aggregated in streaming topics, and join static or slowly changing data with live events.
  • Process the data using SQL to filter records and compute aggregates such as the number of trips per route or the average tip.

Here’s an example SQL query that you can run within the pipeline:

SELECT
  `pu-location-id` AS pu_zone,
  l.zone AS pu_zone_name,
  AVG(tips) AS average_tip,
  v.name AS vendor
FROM trips_tbl t
JOIN location_tbl l
  ON t.`pu-location-id` = l.location
JOIN vendor_tbl v
  ON t.`hvfhs-license-num` = v._key
WHERE tips > 0.0
GROUP BY l.zone, `pu-location-id`, v.name
ORDER BY average_tip DESC

This query processes the stream continuously, updating results as new data arrives. It’s a practical illustration of how SQL can query and transform real-time data, providing actionable insights on the fly.

Benefits of Fluvio SQL Stream Processing

This feature delivers several key advantages for data-intensive applications:

  1. Familiarity: Leverage your team’s existing SQL knowledge to handle streams, reducing the learning curve for real-time analytics. No need to learn a new language or framework—just use the SQL you already know.
  2. Real-Time Insights: Make decisions based on the latest data, not yesterday’s reports. This is crucial for time-sensitive applications like fraud detection, live monitoring, or dynamic pricing.
  3. Flexibility: Use SQL for both querying stateful data and defining transformations within streaming pipelines, adapting to a wide range of use cases.
  4. Seamless Integration: SQL operations fit naturally into Fluvio’s Stateful DataFlow stream processing paradigm, ensuring smooth and efficient workflows.

How It Compares to Other Systems

While Fluvio’s SQL shares similarities with other stream processing systems, there are some distinctions to note:

  • Syntax: Fluvio’s SQL is columnar, optimized for in-memory dataframes like Polars and Apache Arrow, and adapted for streaming within dataflow pipelines.
  • Scope: It supports core SQL operations like filtering, grouping, joining, and aggregating. For advanced streaming features like event-time watermarks or complex windowing, Stateful DataFlow relies on its operators, primitives, and functions, which can embed SQL within them.

Fluvio’s SQL provides an accessible and powerful way to define stream processing logic using standard SQL syntax, making it a valuable tool for many real-time applications.

Conclusion

With the introduction of SQL stream processing, Fluvio is making real-time data more accessible and actionable than ever. Whether you’re querying stateful data or defining complex stream operations, SQL provides a familiar yet powerful tool to unlock the full potential of your data streams.

Ready to get started? Explore the ny-transit dataflow and see firsthand how SQL queries and operators work with real-time stateful dataflows. With Fluvio, the future of streaming analytics is at your fingertips—familiar, flexible, and ready for action.

Stay in Touch:

Thanks for checking out this article. If you’d like to see how InfinyOn Cloud could level up your data operations - Just Ask.