Contributor, InfinyOn
Masking PII with Dataflow Composition
Introduction
Data privacy is a growing concern: more and more people worry about how businesses use their data. Most businesses, in turn, take care to protect customer data and to ensure that it is used only for the purpose for which it was collected.
In this blog, we will demonstrate how to use Stateful Dataflows and composition to mask personally identifiable information (PII) in a data stream.
Data governance use case
Businesses collect personally identifiable information (PII) like name, email, address, and social security numbers from customers.
Businesses protect this data by tagging private information, encrypting it, and enforcing data retention and deletion policies. Masking PII is a basic part of a reliable data governance strategy.
The goal is to prevent exposing users’ private information externally. PII masking is a critical and ubiquitous need due to the billions of consumers using digital services daily.
Stateful data flow to mask PII
PII is typically part of user interactions and events. For example, users identify themselves with PII on websites and in mobile apps. When they interact with government, tax, insurance, or healthcare services, they transmit PII as part of online applications.
Typically, the raw data is stored on restricted servers or in restricted databases. Workflows with the required access process the raw data in batches and create standardized datasets. Processing the events in a streaming data flow is a neat way to make the workflow easier to manage.
The data processing to mask PII commonly involves a combination of regex, string manipulation, and encryption. In a streaming data flow, we can use Stateful Dataflows Composition to implement this functionality. This way the raw stream of data with PII can be processed and masked in a single pass.
Here is the data flow in action. The raw stream of data with PII is received from the user-info topic. The data is processed in the mask-ssn-service service and the output is sent to the masked topic.
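The regex step described above can be sketched in a few lines. This is a minimal illustration in Python, not the code of the mask-ssn-pkg package, which is not shown in this post. The pattern, the mask string, and the function name are assumptions chosen to match the sample records used later; it only covers SSNs written in the NNN-NN-NNNN form.

```python
import re

# Assumption: SSNs appear in the text form NNN-NN-NNNN.
# The word boundaries keep longer digit runs (e.g. 1234-56-7890) from matching.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Assumption: a fixed mask that preserves the shape of the original value.
MASK = "***-**-****"


def mask_ssn(record: str) -> str:
    """Replace every SSN-shaped substring in a raw record with the mask."""
    return SSN_PATTERN.sub(MASK, record)


if __name__ == "__main__":
    for raw in (
        '{"name":"alice","ssn":"123-45-6789"}',
        '{"name":"bob","ssn":"987-65-4321"}',
    ):
        print(mask_ssn(raw))
```

Because the function works on the raw string, it masks an SSN wherever it appears in the record, which matches the single-pass idea: each record is read once, rewritten, and passed on.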
1. Run the Dataflow
You’ll need a dataflow.yaml file in the local directory to run the dataflow.
Create the dataflow.yaml file and paste the following content:
apiVersion: 0.4.0
meta:
  name: mask-user-info
  version: 0.1.0
  namespace: example

imports:
  - pkg: example/[email protected]
    path: ../_packages/mask-ssn-pkg
    functions:
      - name: mask-ssn

config:
  converter: raw

topics:
  user-info:
    schema:
      value:
        type: string
  masked:
    schema:
      value:
        type: string

services:
  mask-ssn-service:
    sources:
      - type: topic
        id: user-info
    transforms:
      - operator: map
        uses: mask-ssn
    sinks:
      - type: topic
        id: masked

# Development only, it does not get published to hub
dev:
  imports:
    - pkg: example/[email protected]
      path: ../_packages/mask-ssn-pkg
In the dataflow.yaml we simply describe the dataflow by listing the topics, services, and the functions.
The mask-ssn-service is a stateful service that uses the mask-ssn function from the mask-ssn-pkg package that has been built and tested independently.
The mask-ssn function uses regex to mask the SSN. We can run the service with a single command that builds and runs the dataflow:
sdf run
2. Test data flow
We test the data flow with a few records and see how the data is processed.
Produce a few records to the user-info topic:
fluvio produce user-info
> {"name":"alice","ssn":"123-45-6789"}
Ok!
> {"name":"bob","ssn":"987-65-4321"}
Ok!
Consume from the masked topic:
fluvio consume masked -Bd
{"name":"alice","ssn":"***-**-****"}
{"name":"bob","ssn":"***-**-****"}
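Checking the output by eye works for two records, but the same check can be automated. The following is a small sketch, assuming SSNs use the NNN-NN-NNNN form; the function name find_unmasked is hypothetical and not part of Fluvio or SDF. It returns any consumed records that still contain an SSN-shaped value.

```python
import re
from typing import Iterable, List

# Assumption: an unmasked SSN looks like NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_unmasked(records: Iterable[str]) -> List[str]:
    """Return the records that still contain an SSN-shaped substring."""
    return [record for record in records if SSN_PATTERN.search(record)]


# Records copied from the consumer output above.
consumed = [
    '{"name":"alice","ssn":"***-**-****"}',
    '{"name":"bob","ssn":"***-**-****"}',
]

leaked = find_unmasked(consumed)
print(f"{len(leaked)} unmasked record(s) found")
```

An empty result means every record in the sample was masked. The same function could be applied to lines captured from the consumer instead of the hard-coded list.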
3. Check service statistics
We use the show state command in the service runtime to display statistics:
>> show state mask-ssn-service/mask-ssn/metrics --table
 Key    Window  succeeded  failed
 stats  *       2          0
Conclusion
Just like that, in a few simple steps, we were able to mask PII in a dataflow. Dataflow composition is a powerful tool that can be used to build complex dataflows.
The workflow to build the function package is simple and straightforward. Developers can publish end-to-end data flows, functions, schemas, and types on InfinyOn Hub and reuse them in other similar data flows.
If you’d like to explore how you can implement Stateful Dataflows, set up a 1:1 call with me.
Connect with us:
You can contact us through GitHub Discussions or our Discord if you have any questions or comments. We welcome your insights on stateful dataflows.