Stateful Data Flow Beta Build composable event-driven data pipelines in minutes.

Get Started Now

Streaming analytics on NYC Taxi Trips

Deb RoyChowdhury

Deb RoyChowdhury

VP Product, InfinyOn Inc.

SHARE ON
GitHub stars
Here is a video of me running the dataflow step by step.

Index

Introduction

If I am real, the most difficult aspects of working in data platforms for me is the sheer volume of noise that is out there in the ecosystem. Hypes, fads, misuse of terms, misrepresentation of capability. It’s all out there.

In the midst of all that, the hardest questions to precisely answer for a data platform are - What does the product do? Who is it for? Why does it matter?

These questions may seem trivial. And at times frustrating to go round in circles while trying to clearly answer them. But it is absolutely critical to answer these questions with a high degree of precision and integrity.

Answering the questions is not enough either. Talk is cheap. It’s essential to show and prove the claims.

My intent in this blog is to show and tell. First I will share an end end streaming analytics data flow. And then I will provide the answer to the 3 questions.

We will go from a streaming variant of the TLC Trip Record Data from the NYC Taxi & Limousine Commission to processed streams which can be queried using SQL, and visualized in a real-time dashboard.

NY Transit dataflow

The New York transit dataflow demonstrates how to create tables, use the SQL query interface to analyze the data, and launch a visualization tool to evaluate the results."

The source of the data and the schemas can be found at the following links:

In the initial flow of the project, I am using:

  • Fluvio for event streaming
  • Stateful DataFlow for stateful stream processing
  • Rust for writing the data processing logic
  • SQL within Stateful DataFlow to query streaming tables
  • Python for for real-time visualization

The diagram represents the flow of the project.

NYC Taxi DataFlow

The Dataflow

The dataflow.yaml defines three services that handle the save-vendors, save-locations, and build-trips-table data. The vendors and locations topics store static data, while the trips topic stores event data.

The build-trips-table service performs two key functions:

  • It creates a trips_tbl, which is refreshed every 30 seconds.
  • It computes the average tips per zone and writes the result to pu-tips topic (pu = Pick-up).

The pu-tips topic is used by the visualization tool to display the average tips per zone. The darker the color, the higher the average tip.

NYC Taxi DataFlow

Step-by-step

Make sure to Install SDF and start a Fluvio cluster.

1. Run the Dataflow

Use sdf command line tool to run the dataflow:

sdf run --ui
  • Use --ui to generate the graphical representation and run the Studio.
  • Using sdf run as opposed to sdf deploy will run the dataflow with an ephemeral worker, which will be cleaned up on exit.

Note: It is important to run the dataflow first, as it creates the topics for you.

2. Read the locations & the zones:

Read locations from the data generator and add them to the locations topic:

curl -s https://demo-data.infinyon.com/api/ny-transit/locations | fluvio produce locations

Read vendors from the data generator and add them to the vendors topic:

curl -s https://demo-data.infinyon.com/api/ny-transit/vendors | fluvio produce vendors

3. Start the http connector:

In a new terminal change directory to ./connectors, download the connector binary, and start connector:

cd ./connectors
cdk hub download infinyon/[email protected]
cdk deploy start --ipkg infinyon-http-source-0.4.3.ipkg -c trips-connector.yaml

The NY transit connector receives 10 events per second. Use fluvio to see the events streaming in real-time:

fluvio consume trips

Use to exit

4. Use SQL interface to explore the trips_tbl table

The dataflow creates the trips_tbl table that gets refreshed every 30 seconds.

SDF offers a sql interface to explore the table. Just type sql in the terminal to get started:

>> sql

To show tables, type show tables; and hit enter:

SHOW tables

Let’s compute total fares, tips, tolls, and driver_pay for all rides:

SELECT 
  count(*) AS rides, 
  sum(`base-passenger-fare`), 
  sum(tips), 
  sum(tolls), 
  sum(`driver-pay`)
FROM 
  trips_tbl

Or identify the number of riders who paid congestion surcharge:

SELECT 
  count(*) AS rides, 
  sum(`base-passenger-fare`), 
  sum(`congestion-surcharge`)
FROM trips_tbl t 
WHERE t.`congestion-surcharge` > 0

If we are a driver, we want to know whe pick-up zone that has the highest average tip:

SELECT 
  `pu-location-id` AS pu_zone, 
  avg(tips) as average_tip 
FROM trips_tbl 
WHERE tips > 0.0
GROUP BY `pu-location-id`
ORDER BY average_tip DESC

Unfortunately, the pick-up zone is not human-readable, but we can join the trips_tbl table with the location_tbl table to get the zone name:

SELECT
  `pu-location-id` AS pu_zone,
  l.zone as pu_zone_name,
  avg(tips) as average_tip 
FROM trips_tbl t
JOIN location_tbl l
ON t.`pu-location-id` = l.location
WHERE tips > 0.0
GROUP BY l.zone, `pu-location-id`
ORDER BY average_tip DESC

We are also interested in determining which vendor receives the highest average tip—Uber or Lyft. To achieve this, we need to perform a 3-way join between the trips_tbl, location_tbl, and vendor_tbl tables:

SELECT
  `pu-location-id` AS pu_zone,
  l.zone as pu_zone_name,
  avg(tips) as average_tip,
  v.name as vendor
FROM trips_tbl t
JOIN location_tbl l
  ON t.`pu-location-id` = l.location
JOIN vendor_tbl v
  ON t.`hvfhs-license-num` = v._key
WHERE tips > 0.0
GROUP BY l.zone, `pu-location-id`, v.name
ORDER BY average_tip DESC

Feel free to explore the data further.

To exit the sql interface, type .exit.

5. Visualize the data

While tables are useful, they can be challenging to interpret. That’s why we developed a visualization tool to make understanding the data easier and more intuitive.

Create the virtual environment
python3 -m venv venv
Activate the virtual environment
source venv/bin/activate
Install the required packages
pip install geopandas folium matplotlib fluvio
Run the script
python3 plot.py

Open browser on “index.html” to see the map.

On mac:

open index.html

The map will continously update with the latest data from the Fluvio topic every 30 seconds.

Clean-up

Exit sdf terminal and clean-up. The --force flag removes the topics:

sdf clean --force

Stop the connector:

cdk deploy shutdown --name trips-connector

Why InfinyOn

Why: InfinyOn exists to empower engineers to build efficient, secure, reliable, streaming analytics applications fast.

The current state of data processing infrastructure is fragemented, complicated, inefficient, and expensive. Somehow with the epochs of innovation in technology and better levels of abstraction, we have lost touch of the basic first principles of computer science like I/O, Compute, Network.

The existence of InfinyOn is from the ashes of frustrations and battle scars of building large scale distributed data intensive applications in high frequency trading, network monitoring, cyber security, surveillance tech, autonomous drones and driving, ecommerce insights, and more.

We believe that Rust is a fundamentally better programming language to build memory safe systems without the baggage of garbage collection.

We believe that WebAssembly is a significant upgrade on the JVM for secure, sandboxed, edge compute.

We believe in mechanical sympathy and well architected systems as foundations of data intensive applications.

We also believe in functional interfaces, intuitive development workflows, and human usbaility.

That is to say that we don’t expect data engineers and anlysts to drop everything and learn Rust instead of familiar interfaces like SQL, Python etc. which they are used to. We don’t expect them to spend too much effort on infrastructure management, dependency management, CI/CD, version control which takes their focus away from data modelling, business logic development, data enrichment, schema managemnt, and creating insights from the data.

What is InfinyOn? Who is it for?

What: InfinyOn is an end to end streaming analytics platform for software engineers who deeply understand data and data engineers who deeply understand software.

Who: InfinyOn is for builders of data intensive application who are willing to embrace rigorous and disciplined pogramming paradigms in Rust, while working with familiar patterns of SQL, Python, and JavaScript.

Stay in Touch:

If you want a refined efficient data processing system to power end to end streaming analytics, let us know by filling out this form.