Disclaimer

Some architectural, operational and data details are intentionally omitted, generalized or slightly altered for security and confidentiality reasons. The numbers, orders of magnitude and lessons learned are accurate.

Intro

We run an on-premises national-scale cybersecurity investigations product.
One of this product’s core capabilities is an open platform that lets customers build their own data ingestion pipelines and push data into the data lake and application environment where analytics, detection, and other magic happens.
The open platform is built on NiFi: customers build pipelines that read whatever sources they wish and push data to Kafka, from which our Spark ETL picks it up, enriches it, and writes Delta tables to S3-compatible storage.
For national-scale deployments, the hardware costs alone are in the millions.

The problem: due to business needs and decisions, we also had to offer this product to organization-level customers with much smaller budgets.

Key Architectural-Product Compromise

We decided to trade some openness for better performance and optimization.
NiFi is convenient for building pipelines, but overall performance depends on the processors’ implementations (e.g., how many flow files are generated and moved), the throughput of the external sources, and so on - in general, NiFi is not primarily a performance-focused tool.
So we decided to replace the open data-pipeline model with in-house, highly optimized ingestion connectors for specific sources.
From there, the data lake and application environment would scale linearly with the data throughput of each deployment.

Going Back to Basics

The original architecture followed best practice: robust, resilient and extensible.
It was also expensive in terms of I/O, CPU and memory.
Data moved multiple times inside NiFi (depending on processor implementation), then from NiFi to Kafka, from Kafka to ETL (Spark), and from ETL to S3.
We needed to simplify things, even if the architecture diagram would’ve looked less elegant.
We went for an early-2000s back-to-basics approach: one optimized Spark job that reads directly from sources, performs enrichment, and writes to storage. No Kafka, no customer-customizable NiFi flows - just a single, optimized ETL.

Baseline Performance

With the most optimized NiFi pipelines, the full implementation (NiFi, Kafka, and the ETL) required roughly 10 cores to sustain 5 Gbps end-to-end:

  • 5 in NiFi
  • 2 in Kafka
  • 3 in ETL

Scaling was linear in NiFi and roughly linear (in quantized steps) in Kafka and ETL.
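The scaling behavior above can be sketched as a back-of-the-envelope capacity model. This is a hypothetical illustration built only from the baseline numbers in this post, not our actual capacity planner; the quantization of Kafka and ETL into whole steps is an assumption about how those tiers grow.

```python
import math

# Baseline from the measurements above: 10 cores sustain 5 Gbps end-to-end.
BASELINE_GBPS = 5.0

def cores_needed(target_gbps: float) -> dict:
    """Toy estimate of cores per tier for a target sustained throughput."""
    scale = target_gbps / BASELINE_GBPS
    return {
        "nifi": 5 * scale,              # scales linearly with throughput
        "kafka": 2 * math.ceil(scale),  # quantized: grows in whole-broker steps
        "etl": 3 * math.ceil(scale),    # quantized: grows in whole-executor steps
    }

# Example: sizing a deployment for 12 Gbps sustained.
estimate = cores_needed(12.0)
total = estimate["nifi"] + estimate["kafka"] + estimate["etl"]
```

At the 5 Gbps baseline the model reproduces the 10-core figure; above it, the quantized tiers overshoot slightly, which is the point of the "roughly linear (in quantized steps)" caveat.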

The 90% POC Approach

The production flow had multiple enrichment sources.
Given time and resource constraints, and since this was a POC, we didn’t want to implement every integration just to prove the idea. We wrote almost production-ready code:

  • 100% of the data model matched production
  • 90% of the data enrichments that stemmed from internal sources were implemented
  • 10% of the data enrichments that stemmed from external integrations were mocked with the same data size

We planned to run this POC and extrapolate the results to estimate the performance of a full-scale deployment.

Initial Results

Let’s set memory utilization aside: memory pressure is relatively negligible in our workloads, and memory is cheaper than cores.

The POC numbers looked great:

  • 15 Gbps sustained on only 4 cores for the new Spark job, from source to storage
  • 3x the throughput with 40% of the cores
  • Files were about 20x smaller after compaction

All the data validation tests passed. Fields were present, schemas matched, row counts were correct, even bytes per row were aligned.
It seemed like an amazing optimization, but we couldn’t explain the 20x reduction in storage.

The Findings

It turned out the mocked 10% (about 10 fields) made a big difference.
Even though data types and field sizes matched, the mock fields were populated from dictionaries with low cardinality.
That improved compression dramatically, producing much smaller files.
This impacted both storage size and write time to storage, which is a significant part of the ETL.
While it’s expected that compression affects file size and write time, we didn’t anticipate such a large effect from such a small deviation from the production workload.

A Concrete Example: community_id

As mentioned earlier, we store Delta tables with Parquet as the underlying file format - so Parquet-specific encodings directly impact our storage footprint.

While most of the fields that caused the issue are confidential and can’t be shared, one field stands out as a strong example: community_id.

In the mock data, community_id was populated from a dictionary of about 50 unique values.
In production, this field has billions of unique values - each representing a distinct community identifier in the system.

This single field alone accounted for almost 35% of the 20x reduction in file size after compaction.
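The per-value arithmetic behind that reduction is easy to sketch. The 16-byte serialized size assumed below is illustrative (the real identifier width is confidential); the 50-value dictionary comes from the mock described above.

```python
import math

# With ~50 unique values, Parquet's dictionary encoding can store each row
# as a small integer index instead of the full string.
unique_values = 50
index_bits = math.ceil(math.log2(unique_values))  # bits per row for the index

# Assumption for illustration: a community identifier serializes to ~16 bytes.
plain_bits = 16 * 8  # bits per row under plain encoding

# Per-value footprint ratio, before RLE and zstd pile on further gains.
per_value_ratio = plain_bits / index_bits
```

Six bits per row instead of 128 is already a ~21x reduction for this one column, and that is before run-length encoding and general-purpose compression compound on top of it.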

Why Low Cardinality Makes Such a Big Difference in Parquet

It’s not just about the compression algorithm (zstd in our case) - it’s about how Parquet stores data.

Parquet uses columnar storage, meaning all values of a single column are stored together. This layout enables a bunch of write-time optimizations and encoding techniques that kick in before general-purpose compression. The two most impactful in our use case were:

  1. Dictionary Encoding: Parquet replaces repeated values with integer indices pointing to a dictionary of unique values. If community_id has only 50 unique values, each value in the column is stored as a small integer (fitting in a few bits) instead of the full string. With billions of unique values, the dictionary itself becomes massive, and the indices require more bits - often causing Parquet to fall back to plain encoding entirely, losing the benefit.

  2. Run-Length Encoding (RLE): When consecutive values in a column are identical, RLE stores them as a single value plus a count. For example, if 1,000 consecutive rows have community_id = "ABC", RLE stores this as ("ABC", 1000) instead of repeating “ABC” 1,000 times. Low-cardinality fields naturally produce longer runs of identical values, especially after sorting, making RLE extremely effective.

These encodings compound: dictionary encoding reduces each value to a small integer, and RLE then compresses runs of those integers. Only after these encodings does Zstandard (or another compression codec) apply its general-purpose compression on already highly optimized data.
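The compounding effect can be demonstrated with a toy stdlib-only sketch. This is not Parquet's actual encoder - real dictionary pages and RLE/bit-packing are more sophisticated, and zlib here stands in for zstd - but the low- vs high-cardinality gap it shows has the same shape we observed.

```python
import random
import zlib

random.seed(42)
N = 100_000

# Low cardinality (like our mock): ~50 unique community IDs.
low_card = [f"community-{random.randrange(50)}" for _ in range(N)]
# High cardinality (like production): drawn from a billion-value space.
high_card = [f"community-{random.randrange(10**9)}" for _ in range(N)]

def encoded_size(values):
    # 1. Dictionary encoding: replace each value with an integer index.
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    indices = [dictionary[v] for v in values]
    # 2. Run-length encoding over sorted indices (sorting maximizes runs,
    #    mirroring the "especially after sorting" note above).
    indices.sort()
    runs = []
    for idx in indices:
        if runs and runs[-1][0] == idx:
            runs[-1][1] += 1
        else:
            runs.append([idx, 1])
    # 3. General-purpose compression on the already-encoded payload
    #    (zlib as a stand-in for zstd).
    payload = b"".join(v.encode() for v in dictionary) + \
              b"".join(f"{i}:{c};".encode() for i, c in runs)
    return len(zlib.compress(payload))

ratio = encoded_size(high_card) / encoded_size(low_card)
```

With 50 unique values, the dictionary is tiny and the whole column collapses to about 50 run pairs; with effectively unique values, the dictionary itself is the data and the runs buy nothing, so the size ratio between the two is orders of magnitude.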

With 50 unique mock values in the data we generated for the POC, dictionary encoding was tiny and RLE found long runs everywhere, producing unrealistic improvements in write performance and storage. But with billions of production values, both encodings were far less effective - creating the significant gap between POC file sizes and what we’d see in production.

Reality Check

We rebuilt the mock dictionaries to better reflect production cardinality patterns (without touching sensitive data).
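A minimal sketch of that fix, with invented field names and cardinalities (the real per-field numbers are confidential): instead of one fixed 50-value dictionary for everything, each mocked field draws from a key space sized to its observed production cardinality.

```python
import random

random.seed(7)

# Hypothetical per-field cardinalities, approximating production patterns.
field_cardinality = {
    "community_id": 2_000_000_000,  # billions of uniques in production
    "protocol": 12,                 # a genuinely low-cardinality field
}

def mock_value(field: str) -> str:
    # Drawing from a production-sized key space makes Parquet's dictionary
    # and RLE behavior in the POC match what production will actually see.
    return f"{field}-{random.randrange(field_cardinality[field])}"

rows = [{f: mock_value(f) for f in field_cardinality} for _ in range(1000)]
```

The point is not the exact distributions (production values are skewed, not uniform) but that the mock's cardinality per field lands in the right order of magnitude.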

After regenerating the mocks and rerunning the tests:

  • Throughput: 8 Gbps on 5 cores
  • Storage: similar to production ratios

Still a solid result - roughly 60% higher sustained throughput on about 50% fewer cores - but not the physics-defying gains we first observed.

Takeaways

  1. A relatively small change in the data can have a dramatic effect on storage and ETL performance.

  2. When running POCs for high-throughput data pipelines, aiming for 90% similarity and extrapolating results can be risky. If resources and time allow, make the POC data fully match production patterns so you’re comparing apples to apples.

  3. A robust, resilient, and extensible system design can be the right technical decision but the wrong business decision.