Content-Defined Chunking (CDC)#
Overview#
Content-Defined Chunking (CDC) is an advanced deduplication technique that allows ncps to significantly reduce storage usage by identifying and sharing common data across different NAR files.
In a traditional Nix cache, each store path is stored as a single NAR file. Even if two packages share many identical files (e.g., they share common libraries or base layers), they are stored as two separate, complete NAR files. This results in significant storage redundancy.
CDC solves this by splitting NAR files into smaller, variable-sized chunks based on their content rather than fixed offsets.
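To illustrate why content-defined boundaries matter, the sketch below compares fixed-offset splitting with a very simplified rolling-hash chunker. It is not FastCDC and not the ncps implementation: the gear table, the function names, and the tiny chunk sizes are assumptions chosen to keep the example small.

```python
import hashlib
import random

# Hypothetical 256-entry "gear" table of random 32-bit values (fixed seed for repeatability).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data, size=64):
    """Split at fixed offsets: every chunk is `size` bytes, except possibly the last."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data, min_size=16, mask=0x3F, max_size=256):
    """Split where a rolling hash of the most recent bytes matches a bit pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(chunks_a, chunks_b):
    """Count chunks of `chunks_b` whose content hash also occurs in `chunks_a`."""
    seen = {hashlib.sha256(c).digest() for c in chunks_a}
    return sum(hashlib.sha256(c).digest() in seen for c in chunks_b)


data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(4096))
modified = b"X" + original  # one byte inserted at the front

print("fixed:", shared(fixed_chunks(original), fixed_chunks(modified)))
print("cdc:  ", shared(cdc_chunks(original), cdc_chunks(modified)))
```

With fixed offsets, the single inserted byte shifts every later chunk, so the two inputs share almost nothing. With content-defined boundaries, the cut points depend only on nearby bytes, so the chunk sequences tend to realign shortly after the edit.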
How It Works#
ncps uses the FastCDC algorithm to process NAR files:
- Preprocessing: Before chunking, ncps decompresses the NAR file if it is compressed (e.g., xz, zstd). CDC always operates on the raw, uncompressed data to maximize cross-NAR deduplication.
- Chunking: The uncompressed data is passed through the FastCDC chunker. The chunker identifies "natural" content-defined boundaries in the data stream to split the file into variable-sized chunks.
- Hashing: Each chunk is hashed (using BLAKE3) to create a unique identifier based on its content.
- Deduplication: If a chunk with the same hash already exists in the store (from another NAR file), ncps simply references the existing chunk instead of storing it again.
- Compression: New (non-duplicate) chunks are compressed with zstd before being written to the storage backend.
- Assembly: When a client requests a store path, ncps assembles it on the fly from its constituent chunks, decompressing each chunk and recompressing the stream for the client using the encoding the client prefers (zstd, brotli, gzip, or raw).
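The steps above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the ncps code: `split` is a simplified gear-hash chunker standing in for FastCDC, `hashlib.blake2b` stands in for BLAKE3, `zlib` stands in for zstd, and `ChunkStore` with its in-memory dictionaries is a hypothetical replacement for the storage backend and database. The input is assumed to be already decompressed.

```python
import hashlib
import random
import zlib

# Hypothetical gear table for the simplified chunker (fixed seed for repeatability).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split(data, min_size=64, mask=0xFF, max_size=1024):
    """Simplified content-defined chunker (a stand-in for FastCDC)."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ChunkStore:
    """Hypothetical in-memory chunk store keyed by content hash."""

    def __init__(self):
        self.chunks = {}     # digest -> compressed chunk bytes
        self.manifests = {}  # NAR name -> ordered list of digests

    def put_nar(self, name, raw):
        """Chunk, hash, deduplicate, and compress; return the number of new chunks."""
        new, digests = 0, []
        for chunk in split(raw):
            digest = hashlib.blake2b(chunk, digest_size=32).hexdigest()
            if digest not in self.chunks:      # deduplication
                self.chunks[digest] = zlib.compress(chunk)
                new += 1
            digests.append(digest)
        self.manifests[name] = digests
        return new

    def get_nar(self, name):
        """Reassemble a NAR by decompressing its chunks in order."""
        return b"".join(zlib.decompress(self.chunks[d]) for d in self.manifests[name])


data_rng = random.Random(1)
common = bytes(data_rng.getrandbits(8) for _ in range(8192))
nar_a = b"header-a" + common
nar_b = b"different-header-b" + common + b"tail"

store = ChunkStore()
print("new chunks for a:", store.put_nar("a", nar_a), "of", len(store.manifests["a"]))
print("new chunks for b:", store.put_nar("b", nar_b), "of", len(store.manifests["b"]))
```

The second NAR should add only a handful of new chunks, because most of its content hashes to chunks that were already stored for the first one.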
Benefits#
- Storage Efficiency: Dramatic reduction in storage usage when hosting multiple versions of the same package or packages with shared dependencies.
- Cross-NAR Deduplication: Deduplication works across all packages in the cache, not just within a single package.
- Transfer Efficiency: Chunks are stored in the same backend as NAR files, benefitting from the same scalability and reliability.
Configuration#
CDC is disabled by default. You can enable it by setting cache.cdc.enabled to true.
Basic Configuration#
cache:
  cdc:
    enabled: true
    # Optional: Tune chunk sizes (default values shown)
    min: 16384 # 16 KB
    avg: 65536 # 64 KB
    max: 262144 # 256 KB
Parameters#
| Flag | Description | Environment Variable | Default |
|---|---|---|---|
| --cache-cdc-enabled | Enable CDC for deduplication | CACHE_CDC_ENABLED | false |
| --cache-cdc-min | Minimum chunk size in bytes | CACHE_CDC_MIN | 16384 |
| --cache-cdc-avg | Average (target) chunk size in bytes | CACHE_CDC_AVG | 65536 |
| --cache-cdc-max | Maximum chunk size in bytes | CACHE_CDC_MAX | 262144 |
| --cache-cdc-lazy-chunking-enabled | Enable lazy chunking: store compressed NAR first, chunk in background | CACHE_CDC_LAZY_CHUNKING_ENABLED | true |
| --cache-cdc-background-workers | Number of background workers for lazy chunking | CACHE_CDC_BACKGROUND_WORKERS | (number of CPUs) |
| --cache-cdc-delete-delay | Delay before deleting compressed NAR files after chunking completes | CACHE_CDC_DELETE_DELAY | 24h |
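When tuning the chunk sizes, it can help to estimate how many chunks a NAR of a given size will produce. The helper below is a rough back-of-the-envelope sketch: the function names are hypothetical, the ordering check is an assumed sanity rule rather than documented ncps validation, and the "typical" figure simply divides by the target size, which real chunkers only approximate.

```python
def check_sizes(min_size, avg_size, max_size):
    """Assumed sanity rule: sizes are positive and ordered min <= avg <= max."""
    if not 0 < min_size <= avg_size <= max_size:
        raise ValueError("expected 0 < min <= avg <= max")


def chunk_count_range(nar_bytes, min_size=16384, avg_size=65536, max_size=262144):
    """Return (fewest, typical, most) chunk counts for an uncompressed NAR."""
    check_sizes(min_size, avg_size, max_size)

    def ceil_div(a, b):
        return -(-a // b)

    return (
        ceil_div(nar_bytes, max_size),  # every chunk at the maximum size
        ceil_div(nar_bytes, avg_size),  # every chunk at the target size
        ceil_div(nar_bytes, min_size),  # every chunk at the minimum size
    )


fewest, typical, most = chunk_count_range(100 * 1024 * 1024)
print(fewest, typical, most)  # 400 1600 6400
```

With the default sizes, a 100 MiB NAR therefore maps to somewhere between 400 and 6400 chunks, and about 1600 if chunks land near the target size. Smaller chunks generally find more duplicates but mean more objects and more mapping rows to track.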
Lazy Chunking#
When lazy chunking is enabled (default), NAR files are stored in their original compressed format first, then chunked in the background. This improves Time To First Byte (TTFB) by avoiding synchronous chunking during download.
Behavior:
When a new NAR is added to the cache with lazy chunking enabled:
- The NAR is downloaded from upstream and stored in its original compressed form (e.g., xz, zstd) in a temporary location.
- The associated narinfo is immediately normalized to compression: none and stored in the database. This signals that the NAR will eventually be served from chunks.
- The first client request for the NAR is served by decompressing the temporary file on the fly, ensuring a fast Time To First Byte (TTFB).
- In the background, the NAR is asynchronously chunked from the temporary file.
- Once chunking is complete, subsequent requests for the NAR are served from the newly created chunks.
- The temporary compressed file is deleted.
The delete-delay setting (--cache-cdc-delete-delay) applies when migrating an existing whole-file NAR to chunks, not to new NARs. It keeps the original whole file available for the configured period after migration so that clients have time to update their caches.
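The lazy-chunking flow for a new NAR can be sketched with a job queue and a few background workers. This is a minimal illustration, not the ncps implementation: `LazyCache` and its fields are hypothetical, `zlib` stands in for the upstream xz or zstd compression, fixed-size slicing stands in for the CDC chunker, and the narinfo update and the delete delay for migrated NARs are left out.

```python
import queue
import threading
import zlib


class LazyCache:
    """Hypothetical cache that stores compressed NARs first and chunks them later."""

    def __init__(self, workers=2):
        self.compressed = {}  # name -> original compressed bytes (temporary)
        self.chunked = {}     # name -> list of raw chunks (stand-in for the chunk store)
        self.lock = threading.Lock()
        self.jobs = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def add(self, name, compressed):
        """Store the NAR as received and schedule background chunking."""
        with self.lock:
            self.compressed[name] = compressed
        self.jobs.put(name)

    def get(self, name):
        """Serve from chunks if available, else decompress the temporary file."""
        with self.lock:
            chunks = self.chunked.get(name)
            blob = self.compressed.get(name)
        if chunks is not None:
            return b"".join(chunks)
        if blob is None:
            raise KeyError(name)
        return zlib.decompress(blob)

    def _worker(self):
        while True:
            name = self.jobs.get()
            with self.lock:
                blob = self.compressed[name]
            raw = zlib.decompress(blob)
            # Fixed-size slices for brevity; a real worker would run the CDC chunker.
            chunks = [raw[i:i + 4096] for i in range(0, len(raw), 4096)]
            with self.lock:
                self.chunked[name] = chunks  # later requests are served from chunks
                del self.compressed[name]    # the temporary compressed file is removed
            self.jobs.task_done()


raw_nar = b"example nar contents " * 1000
cache = LazyCache()
cache.add("example.nar", zlib.compress(raw_nar))
first = cache.get("example.nar")   # may be served from the compressed file
cache.jobs.join()                  # wait for background chunking to finish
second = cache.get("example.nar")  # served from chunks
```

Swapping the chunk list in and deleting the compressed copy under the same lock means a reader always finds one of the two representations.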
Configuration Example#
cache:
  cdc:
    enabled: true
    min: 16384 # 16 KB
    avg: 65536 # 64 KB
    max: 262144 # 256 KB
    lazy-chunking-enabled: true
    background-workers: 4
    delete-delay: 24h
Storage Considerations#
When CDC is enabled:
- Chunks are stored in the configured storage backend (local or S3) under a chunk/ prefix or directory.
- ncps maintains a mapping between NAR files and their chunks in the database.
- The max-size and LRU cleanup mechanisms still apply to the total size of the cache, including chunks.
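One way to picture the NAR-to-chunk mapping and its effect on cache size is a pair of tables, sketched below with SQLite. The schema, table names, and eviction logic are assumptions made for illustration and do not describe the actual ncps database: here a chunk is counted once no matter how many NARs reference it, and it is removed only when no NAR references it anymore.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE chunks (hash TEXT PRIMARY KEY, size INTEGER NOT NULL);
CREATE TABLE nar_chunks (
    nar  TEXT    NOT NULL,
    seq  INTEGER NOT NULL,  -- position of the chunk within the NAR
    hash TEXT    NOT NULL REFERENCES chunks(hash),
    PRIMARY KEY (nar, seq)
);
""")


def add_nar(nar, chunk_list):
    """Record a NAR as an ordered list of (hash, size) chunks."""
    for seq, (chunk_hash, size) in enumerate(chunk_list):
        db.execute("INSERT OR IGNORE INTO chunks VALUES (?, ?)", (chunk_hash, size))
        db.execute("INSERT INTO nar_chunks VALUES (?, ?, ?)", (nar, seq, chunk_hash))


def evict_nar(nar):
    """Drop a NAR's mapping, then remove chunks nothing references anymore."""
    db.execute("DELETE FROM nar_chunks WHERE nar = ?", (nar,))
    db.execute("DELETE FROM chunks WHERE hash NOT IN (SELECT hash FROM nar_chunks)")


def stored_bytes():
    """Total size of unique chunks, i.e. what counts toward the cache size."""
    return db.execute("SELECT COALESCE(SUM(size), 0) FROM chunks").fetchone()[0]


add_nar("a.nar", [("c1", 100), ("c2", 200)])
add_nar("b.nar", [("c2", 200), ("c3", 300)])
print(stored_bytes())
```

The two NARs reference 800 bytes of chunks in total, but only 600 bytes are stored because c2 is shared. Evicting a.nar in this model frees only c1, since c2 is still needed by b.nar.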
Performance Impact#
Processing NAR files through the CDC chunker adds some CPU overhead during the initial download/cache miss. However, the storage savings and potentially reduced I/O (when chunks are already cached) often outweigh this cost in large-scale deployments.