Content-Defined Chunking (CDC)
Overview
Content-Defined Chunking (CDC) is an advanced deduplication technique that allows ncps to significantly reduce storage usage by identifying and sharing common data across different NAR files.
In a traditional Nix cache, each store path is stored as a single NAR file. Even if two packages share many identical files (e.g., they share common libraries or base layers), they are stored as two separate, complete NAR files. This results in significant storage redundancy.
CDC solves this by splitting NAR files into smaller, variable-sized chunks based on their content rather than fixed offsets.
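To see why content-defined boundaries matter, the sketch below compares fixed-offset splitting with a toy content-defined splitter after one byte is inserted at the front of a buffer. It is not FastCDC and not the ncps implementation: the gear-hash splitter, the names fixed_chunks, cdc_chunks, and shared, and the tiny size parameters are all assumptions chosen for illustration, and SHA-256 is used only to compare chunks.

```python
import hashlib
import random

# Hypothetical toy parameters, far smaller than the ncps defaults, so the demo runs fast.
MIN_SIZE, MAX_SIZE, MASK_BITS = 64, 2048, 8

_rng = random.Random(1234)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # one random 64-bit value per byte value
MASK = ((1 << MASK_BITS) - 1) << (64 - MASK_BITS)  # test the top bits of the rolling hash


def fixed_chunks(data, size=256):
    """Split at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data):
    """Split where a gear rolling hash over the recent bytes matches a fixed pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def shared(a, b):
    """Count distinct chunks that appear in both lists, compared by digest."""
    digest = lambda chunk: hashlib.sha256(chunk).digest()
    return len({digest(c) for c in a} & {digest(c) for c in b})


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = b"X" + original  # one byte inserted at the front

print("fixed offsets, shared chunks:  ", shared(fixed_chunks(original), fixed_chunks(edited)))
print("content-defined, shared chunks:", shared(cdc_chunks(original), cdc_chunks(edited)))
```

With fixed offsets, the inserted byte shifts every later boundary, so the two versions have no chunk in common. With content-defined boundaries, the splitter typically resynchronizes within the first chunk or two, and the remaining chunks are identical.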
How It Works
ncps uses the FastCDC algorithm to process NAR files:
- Preprocessing: Before chunking, ncps decompresses the NAR file if it is compressed (e.g., xz, zstd). CDC always operates on the raw, uncompressed data to maximize cross-NAR deduplication.
- Chunking: The uncompressed data is passed through the FastCDC chunker. The chunker identifies "natural" content-defined boundaries in the data stream to split the file into variable-sized chunks.
- Hashing: Each chunk is hashed (using BLAKE3) to create a unique identifier based on its content.
- Deduplication: If a chunk with the same hash already exists in the store (from another NAR file), ncps simply references the existing chunk instead of storing it again.
- Compression: New (non-duplicate) chunks are compressed with zstd before being written to the storage backend.
- Assembly: When a client requests a store path, ncps assembles it on-the-fly from its constituent chunks, decompressing each chunk and recompressing the stream for the client using the encoding the client prefers (zstd, brotli, gzip, or raw).
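The steps above can be sketched as a small in-memory chunk store. This is a minimal illustration under stated assumptions, not the ncps code: ChunkStore, put_nar, get_nar, and split are hypothetical names, the toy gear-hash splitter stands in for FastCDC, hashlib.blake2b stands in for BLAKE3, and zlib stands in for zstd so that the example needs only the Python standard library. Recompressing the assembled stream for the client is omitted.

```python
import hashlib
import lzma
import random
import zlib

_rng = random.Random(42)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def split(data, min_size=64, max_size=2048, mask=0xFF << 56):
    """Toy content-defined splitter (gear rolling hash); a stand-in for FastCDC."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


class ChunkStore:
    def __init__(self):
        self.blobs = {}      # chunk hash -> compressed chunk bytes
        self.manifests = {}  # NAR name -> ordered list of chunk hashes

    def put_nar(self, name, payload, xz_compressed=False):
        # Preprocessing: always chunk the raw, uncompressed bytes.
        raw = lzma.decompress(payload) if xz_compressed else payload
        hashes = []
        for chunk in split(raw):                                      # Chunking
            key = hashlib.blake2b(chunk, digest_size=32).hexdigest()  # Hashing (BLAKE3 stand-in)
            if key not in self.blobs:                                 # Deduplication
                self.blobs[key] = zlib.compress(chunk)                # Compression (zstd stand-in)
            hashes.append(key)
        self.manifests[name] = hashes

    def get_nar(self, name):
        # Assembly: decompress each chunk and concatenate in manifest order.
        return b"".join(zlib.decompress(self.blobs[k]) for k in self.manifests[name])


rng = random.Random(7)
shared_lib = bytes(rng.getrandbits(8) for _ in range(32 * 1024))
nar_a = b"package-a header" + shared_lib
nar_b = b"package-b has a different, longer header" + shared_lib

store = ChunkStore()
store.put_nar("a", lzma.compress(nar_a), xz_compressed=True)  # arrives xz-compressed
store.put_nar("b", nar_b)                                     # arrives uncompressed

total_refs = len(store.manifests["a"]) + len(store.manifests["b"])
print(f"{total_refs} chunk references, {len(store.blobs)} chunks actually stored")
```

Because both payloads are chunked after decompression, the shared bytes produce the same chunk hashes even though one NAR arrived as xz and the other as raw data.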
Benefits
- Storage Efficiency: Dramatic reduction in storage usage when hosting multiple versions of the same package or packages with shared dependencies.
- Cross-NAR Deduplication: Deduplication works across all packages in the cache, not just within a single package.
- Backend Reuse: Chunks are stored in the same backend as NAR files, so they benefit from the same scalability and reliability.
Configuration
CDC is disabled by default. You can enable it by setting cache.cdc.enabled to true.
Basic Configuration
cache:
  cdc:
    enabled: true
    # Optional: Tune chunk sizes (default values shown)
    min: 16384 # 16 KB
    avg: 65536 # 64 KB
    max: 262144 # 256 KB
Parameters
| Flag | Description | Environment Variable | Default |
|---|---|---|---|
| --cache-cdc-enabled | Enable CDC for deduplication | CACHE_CDC_ENABLED | false |
| --cache-cdc-min | Minimum chunk size in bytes | CACHE_CDC_MIN | 16384 |
| --cache-cdc-avg | Average (target) chunk size in bytes | CACHE_CDC_AVG | 65536 |
| --cache-cdc-max | Maximum chunk size in bytes | CACHE_CDC_MAX | 262144 |
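As a rough guide to what these sizes imply, the helper below estimates how many chunks a NAR of a given size would produce. It is back-of-the-envelope arithmetic, not part of ncps: estimate_chunks is a hypothetical name, the min <= avg <= max check is a sanity assumption of this sketch rather than documented ncps behavior, and the "typical" figure assumes chunks land near the target average, which real data only approximates.

```python
def estimate_chunks(nar_bytes, min_size=16384, avg_size=65536, max_size=262144):
    """Bound the number of chunks for an uncompressed NAR of nar_bytes bytes."""
    if nar_bytes <= 0:
        raise ValueError("nar_bytes must be positive")
    if not (0 < min_size <= avg_size <= max_size):
        raise ValueError("expected 0 < min <= avg <= max")
    return {
        "fewest": -(-nar_bytes // max_size),              # every chunk hits the maximum
        "typical": max(1, round(nar_bytes / avg_size)),   # chunks near the target average
        "most": -(-nar_bytes // min_size),                # every chunk stops at the minimum
    }


print(estimate_chunks(100 * 1024 * 1024))
# {'fewest': 400, 'typical': 1600, 'most': 6400}
```

Smaller chunk sizes tend to find more duplicate data but produce more chunks to hash, store, and track per NAR; larger sizes do the opposite.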
Storage Considerations
When CDC is enabled:
- Chunks are stored in the configured storage backend (local or S3) under a chunk/ prefix or directory.
- ncps maintains a mapping between NAR files and their chunks in the database.
- The max-size and LRU cleanup mechanisms still apply to the total size of the cache, including chunks.
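One consequence of sharing chunks is that evicting a NAR can only free the chunks that no other NAR still uses. The sketch below illustrates that idea with simple reference counting. It is purely hypothetical: ChunkIndex, add_nar, and evict_nar are invented names, the in-memory dictionaries stand in for the database mapping, and nothing here describes how ncps actually implements cleanup.

```python
from collections import Counter


class ChunkIndex:
    """Hypothetical in-memory stand-in for a NAR-to-chunk mapping."""

    def __init__(self):
        self.nar_chunks = {}       # NAR name -> ordered chunk hashes
        self.refcount = Counter()  # chunk hash -> number of NARs that use it

    def add_nar(self, nar, chunks):
        if nar in self.nar_chunks:
            raise ValueError(f"{nar} is already indexed")
        self.nar_chunks[nar] = list(chunks)
        self.refcount.update(set(chunks))  # count each chunk once per NAR

    def evict_nar(self, nar):
        """Drop a NAR and return the chunks that no remaining NAR references."""
        orphaned = []
        for chunk in set(self.nar_chunks.pop(nar)):
            self.refcount[chunk] -= 1
            if self.refcount[chunk] == 0:
                del self.refcount[chunk]
                orphaned.append(chunk)
        return sorted(orphaned)


index = ChunkIndex()
index.add_nar("nar-a", ["c1", "c2", "c3"])
index.add_nar("nar-b", ["c2", "c3", "c4"])

print(index.evict_nar("nar-a"))  # ['c1']: c2 and c3 are still used by nar-b
print(index.evict_nar("nar-b"))  # ['c2', 'c3', 'c4']
```

In this model, the space reclaimed by evicting a NAR can be much smaller than the NAR's logical size when most of its chunks are shared.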
Performance Impact
Processing NAR files through the CDC chunker adds some CPU overhead during the initial download/cache miss. However, the storage savings and potentially reduced I/O (when chunks are already cached) often outweigh this cost in large-scale deployments.