Disclaimer

Some architectural, operational and data details are intentionally omitted, generalized or slightly altered for security and confidentiality reasons. The numbers, orders of magnitude and lessons learned are accurate.

Intro

We run an on-premises national-scale cybersecurity investigations product.
One of this product’s core capabilities is an open platform that lets customers build their own data ingestion pipelines and push data into the data lake and application environment where analytics, detection, and other magic happens.
The open platform is built on NiFi: customers build pipelines that read whatever sources they wish and push data to Kafka, from which our Spark ETL picks it up, enriches it, and writes Delta tables to S3-compatible storage.
For national-scale deployments, the hardware costs alone are in the millions.

The problem: due to business needs and decisions, we also had to offer this product to organization-level customers with much smaller budgets.

Key Architectural-Product Compromise

We decided to trade some openness for better performance and optimization.
NiFi is convenient for building pipelines, but overall performance depends on the processors’ implementations (e.g., how many flow files are generated and moved), the throughput of the external sources, and so on - in general, NiFi is not primarily a performance-focused tool.
So we decided to replace the open data-pipeline model with in-house, highly optimized ingestion connectors for specific sources.
From there, the data lake and application environment would scale linearly with the data throughput of each deployment.

Going Back to Basics

The original architecture followed best practice: robust, resilient and extensible.
It was also expensive in terms of I/O, CPU and memory.
Data moved multiple times inside NiFi (depending on processor implementation), then from NiFi to Kafka, from Kafka to ETL (Spark), and from ETL to S3.
We needed to simplify things, even if the architecture diagram would’ve looked less elegant.
We went for an early-2000s back-to-basics approach: one optimized Spark job that reads directly from sources, performs enrichment, and writes to storage. No Kafka, no customer-customizable NiFi flows - just a single, optimized ETL.

Baseline Performance

With the most optimized NiFi pipelines, the full implementation (NiFi, Kafka, and the ETL) required roughly 10 cores to sustain 5 Gbps end-to-end:

  • 5 in NiFi
  • 2 in Kafka
  • 3 in ETL

Scaling was linear in NiFi and roughly linear (in quantized steps) in Kafka and ETL.
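The scaling behavior above can be sketched as a back-of-the-envelope capacity model. This is a hypothetical illustration built only from the baseline numbers in this post, not our actual capacity planner; the quantization of Kafka and ETL into whole steps is an assumption about how those tiers grow.

```python
import math

# Baseline from the measurements above: 10 cores sustain 5 Gbps end-to-end.
BASELINE_GBPS = 5.0

def cores_needed(target_gbps: float) -> dict:
    """Toy estimate of cores per tier for a target sustained throughput."""
    scale = target_gbps / BASELINE_GBPS
    return {
        "nifi": 5 * scale,              # scales linearly with throughput
        "kafka": 2 * math.ceil(scale),  # quantized: grows in whole-broker steps
        "etl": 3 * math.ceil(scale),    # quantized: grows in whole-executor steps
    }

# Example: sizing a deployment for 12 Gbps sustained.
estimate = cores_needed(12.0)
total = estimate["nifi"] + estimate["kafka"] + estimate["etl"]
```

At the 5 Gbps baseline the model reproduces the 10-core figure; above it, the quantized tiers overshoot slightly, which is the point of the "roughly linear (in quantized steps)" caveat.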

The 90% POC Approach

The production flow had multiple enrichment sources.
Given time and resource constraints, and since this was a POC, we didn’t want to implement every integration just to prove the idea. We wrote almost production-ready code:

  • 100% of the data model matched production
  • 90% of the data enrichments that stemmed from internal sources were implemented
  • 10% of the data enrichments that stemmed from external integrations were mocked with the same data size

We planned to run this POC and extrapolate the results to estimate the performance of a full-scale deployment.

Initial Results

Let’s set memory utilization aside: memory pressure is relatively negligible in our workloads, and memory is cheaper than cores.

The POC numbers looked great:

  • 15 Gbps sustained on only 4 cores for the new Spark job, from source to storage
  • 3x the throughput with 40% of the cores
  • Files were about 20x smaller after compaction

All the data validation tests passed. Fields were present, schemas matched, row counts were correct, even bytes per row were aligned.
It seemed like an amazing optimization, but we couldn’t explain the 20x reduction in storage.

The Findings

It turned out the mocked 10% (about 10 fields) made a big difference.
Even though data types and field sizes matched, the mock fields were populated from dictionaries with low cardinality.
That improved compression dramatically, producing much smaller files.
This impacted both storage size and write time to storage, which is a significant part of the ETL.
While it’s expected that compression affects file size and write time, we didn’t anticipate such a large effect from such a small deviation from the production workload.

A Concrete Example: community_id

As mentioned earlier, we store Delta tables with Parquet as the underlying file format - so Parquet-specific encodings directly impact our storage footprint.

While most of the fields that caused the issue are confidential and can’t be shared, one field stands out as a strong example: community_id.

In the mock data, community_id was populated from a dictionary of about 50 unique values.
In production, this field has billions of unique values - each representing a distinct community identifier in the system.

This single field alone accounted for almost 35% of the 20x reduction in file size after compaction.
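The per-value arithmetic behind that reduction is easy to sketch. The 16-byte serialized size assumed below is illustrative (the real identifier width is confidential); the 50-value dictionary comes from the mock described above.

```python
import math

# With ~50 unique values, Parquet's dictionary encoding can store each row
# as a small integer index instead of the full string.
unique_values = 50
index_bits = math.ceil(math.log2(unique_values))  # bits per row for the index

# Assumption for illustration: a community identifier serializes to ~16 bytes.
plain_bits = 16 * 8  # bits per row under plain encoding

# Per-value footprint ratio, before RLE and zstd pile on further gains.
per_value_ratio = plain_bits / index_bits
```

Six bits per row instead of 128 is already a ~21x reduction for this one column, and that is before run-length encoding and general-purpose compression compound on top of it.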

Why Low Cardinality Makes Such a Big Difference in Parquet

It’s not just about the compression algorithm (zstd in our case) - it’s about how Parquet stores data.

Parquet uses columnar storage, meaning all values of a single column are stored together. This layout enables a bunch of write-time optimizations and encoding techniques that kick in before general-purpose compression. The two most impactful in our use case were:

  1. Dictionary Encoding: Parquet replaces repeated values with integer indices pointing to a dictionary of unique values. If community_id has only 50 unique values, each value in the column is stored as a small integer (fitting in a few bits) instead of the full string. With billions of unique values, the dictionary itself becomes massive, and the indices require more bits - often causing Parquet to fall back to plain encoding entirely, losing the benefit.

  2. Run-Length Encoding (RLE): When consecutive values in a column are identical, RLE stores them as a single value plus a count. For example, if 1,000 consecutive rows have community_id = "ABC", RLE stores this as ("ABC", 1000) instead of repeating “ABC” 1,000 times. Low-cardinality fields naturally produce longer runs of identical values, especially after sorting, making RLE extremely effective.

These encodings compound: dictionary encoding reduces each value to a small integer, and RLE then compresses runs of those integers. Only after these encodings does Zstandard (or another compression codec) apply its general-purpose compression on already highly optimized data.
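The compounding effect can be demonstrated with a toy stdlib-only sketch. This is not Parquet's actual encoder - real dictionary pages and RLE/bit-packing are more sophisticated, and zlib here stands in for zstd - but the low- vs high-cardinality gap it shows has the same shape we observed.

```python
import random
import zlib

random.seed(42)
N = 100_000

# Low cardinality (like our mock): ~50 unique community IDs.
low_card = [f"community-{random.randrange(50)}" for _ in range(N)]
# High cardinality (like production): drawn from a billion-value space.
high_card = [f"community-{random.randrange(10**9)}" for _ in range(N)]

def encoded_size(values):
    # 1. Dictionary encoding: replace each value with an integer index.
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    indices = [dictionary[v] for v in values]
    # 2. Run-length encoding over sorted indices (sorting maximizes runs,
    #    mirroring the "especially after sorting" note above).
    indices.sort()
    runs = []
    for idx in indices:
        if runs and runs[-1][0] == idx:
            runs[-1][1] += 1
        else:
            runs.append([idx, 1])
    # 3. General-purpose compression on the already-encoded payload
    #    (zlib as a stand-in for zstd).
    payload = b"".join(v.encode() for v in dictionary) + \
              b"".join(f"{i}:{c};".encode() for i, c in runs)
    return len(zlib.compress(payload))

ratio = encoded_size(high_card) / encoded_size(low_card)
```

With 50 unique values, the dictionary is tiny and the whole column collapses to about 50 run pairs; with effectively unique values, the dictionary itself is the data and the runs buy nothing, so the size ratio between the two is orders of magnitude.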

With 50 unique mock values in the data we generated for the POC, dictionary encoding was tiny and RLE found long runs everywhere, producing unrealistic improvements in write performance and storage. But with billions of production values, both encodings were far less effective - creating the significant gap between POC file sizes and what we’d see in production.

Reality Check

We rebuilt the mock dictionaries to better reflect production cardinality patterns (without touching sensitive data).
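A minimal sketch of that fix, with invented field names and cardinalities (the real per-field numbers are confidential): instead of one fixed 50-value dictionary for everything, each mocked field draws from a key space sized to its observed production cardinality.

```python
import random

random.seed(7)

# Hypothetical per-field cardinalities, approximating production patterns.
field_cardinality = {
    "community_id": 2_000_000_000,  # billions of uniques in production
    "protocol": 12,                 # a genuinely low-cardinality field
}

def mock_value(field: str) -> str:
    # Drawing from a production-sized key space makes Parquet's dictionary
    # and RLE behavior in the POC match what production will actually see.
    return f"{field}-{random.randrange(field_cardinality[field])}"

rows = [{f: mock_value(f) for f in field_cardinality} for _ in range(1000)]
```

The point is not the exact distributions (production values are skewed, not uniform) but that the mock's cardinality per field lands in the right order of magnitude.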

After regenerating the mocks and rerunning the tests:

  • Throughput: 8 Gbps on 5 cores
  • Storage: similar to production ratios

Still a solid result - roughly 60% higher sustained throughput on about 50% fewer cores - but not the physics-defying gains we first observed.

Takeaways

  1. A relatively small change in the data can have a dramatic effect on storage and ETL performance.

  2. When running POCs for high-throughput data pipelines, aiming for 90% similarity and extrapolating results can be risky. If resources and time allow, make the POC data fully match production patterns so you’re comparing apples to apples.

  3. A robust, resilient, and extensible system design can be the right technical decision but the wrong business decision.