High Availability Deployment#
Deploy multiple ncps instances for zero-downtime operation, load distribution, and redundancy.
Why High Availability?#
Running multiple ncps instances provides:
- ✅ Zero Downtime - Instance failures don't interrupt service
- ✅ Load Distribution - Requests spread across multiple servers
- ✅ Horizontal Scaling - Add instances to handle more traffic
- ✅ Geographic Distribution - Deploy instances closer to clients
- ✅ Rolling Updates - Update instances one at a time without downtime
Architecture#
┌──────────────────────────────────┐
│            Nix Clients           │
└─────────────────┬────────────────┘
                  │
                  ▼
┌──────────────────────────────────┐
│           Load Balancer          │
│    (nginx, HAProxy, cloud LB)    │
└────────┬───────────────┬─────────┘
         │               │
    ┌────▼────┐     ┌────▼────┐         ┌─────────┐
    │ ncps #1 │     │ ncps #2 │   ...   │ ncps #N │
    └────┬────┘     └────┬────┘         └────┬────┘
         │               │                   │
         └───────────────┼───────────────────┘
                         │
          ┌──────────────┼───────────────┐
          │              │               │
          ▼              ▼               ▼
    ┌───────────┐  ┌───────────┐   ┌───────────┐
    │  Redis /  │  │    S3     │   │PostgreSQL │
    │  Database │  │  Storage  │   │  / MySQL  │
    │  (Locks)  │  │           │   │   (Data)  │
    └───────────┘  └───────────┘   └───────────┘

Requirements#
Required Components#
- Multiple ncps instances (2+, recommended 3+)
- Distributed locking backend (one of):
  - Redis server (version 5.0+)
  - PostgreSQL advisory locks (version 9.1+)
  - MySQL advisory locks (version 8.0+)
- S3-compatible storage (shared across all instances)
  - AWS S3, MinIO, DigitalOcean Spaces, etc.
- PostgreSQL or MySQL database (shared across all instances)
  - PostgreSQL 12+ or MySQL 8.0+
  - SQLite is NOT supported for HA
- Load balancer to distribute requests
  - nginx, HAProxy, cloud load balancer, etc.
Network Requirements#
- All instances must reach Redis (or the database used for locking)
- All instances must reach the S3 storage
- All instances must reach the shared database
- The load balancer must reach all instances
- Clients must reach the load balancer
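A quick reachability check from each instance host can catch network problems early; the hostnames below are placeholders for your environment:

# Lock backend reachable (expect PONG)
redis-cli -h redis -p 6379 ping

# Database accepting connections
pg_isready -h postgres -p 5432 -U ncps

# S3 endpoint reachable (any HTTP status code proves connectivity)
curl -sS -o /dev/null -w '%{http_code}\n' https://s3.amazonaws.com

# Load balancer -> instance path
curl -fsS http://ncps-1:8501/nix-cache-info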
Quick Start#
Option 1: Docker Compose with MinIO#
See Docker Compose HA example.
Option 2: Kubernetes with Helm#
helm install ncps ./charts/ncps -n ncps -f values-ha.yaml

values-ha.yaml:
replicaCount: 3

hostName: cache.example.com

storage:
  s3:
    enabled: true
    bucket: ncps-cache
    endpoint: https://s3.amazonaws.com
    region: us-east-1
    forcePathStyle: false # Set to true for MinIO

database:
  url: postgresql://ncps:password@postgres:5432/ncps

redis:
  enabled: true
  addrs:
    - redis:6379

podDisruptionBudget:
  enabled: true
  minAvailable: 2
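After the install, you can confirm all replicas came up (namespace from the command above; labels may vary by chart version):

kubectl get pods -n ncps
kubectl get pdb -n ncps   # PodDisruptionBudget should show minAvailable: 2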
See Helm Chart for details.

Detailed Configuration#
Step 1: Set Up Redis#
Single Redis Instance:
docker run -d \
  --name redis \
  -p 6379:6379 \
  redis:7-alpine

Redis Cluster (for production):

# Use Redis Cluster or Sentinel for Redis HA
# See the Redis documentation for cluster setup
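However you provide Redis HA, note that the ncps redis.addrs setting shown later in this guide is a list, so multiple node addresses can be supplied; a sketch with example hostnames:

redis:
  addrs:
    - redis-1:6379
    - redis-2:6379
    - redis-3:6379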
Step 2: Set Up S3 Storage#

AWS S3:
# Create bucket
aws s3 mb s3://ncps-cache --region us-east-1

# Enable versioning (recommended)
aws s3api put-bucket-versioning \
  --bucket ncps-cache \
  --versioning-configuration Status=Enabled

MinIO:
# Start MinIO
docker run -d \
  --name minio \
  -p 9000:9000 \
  -p 9001:9001 \
  -v minio-data:/data \
  minio/minio server /data --console-address ":9001"

# Create bucket
mc alias set myminio http://localhost:9000 minioadmin minioadmin
mc mb myminio/ncps-cache

Step 3: Set Up Database#
PostgreSQL:
# Create database and user
sudo -u postgres psql

CREATE DATABASE ncps;
CREATE USER ncps WITH PASSWORD 'secure-password';
GRANT ALL PRIVILEGES ON DATABASE ncps TO ncps;

MySQL:
CREATE DATABASE ncps;
CREATE USER 'ncps'@'%' IDENTIFIED BY 'secure-password';
GRANT ALL PRIVILEGES ON ncps.* TO 'ncps'@'%';
FLUSH PRIVILEGES;
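Before moving on, it's worth confirming the ncps user can actually connect from an instance host; a quick check with the credentials above (hostnames are examples):

# PostgreSQL
psql "postgresql://ncps:secure-password@postgres:5432/ncps" -c 'SELECT 1;'

# MySQL
mysql -h mysql -u ncps -p'secure-password' ncps -e 'SELECT 1;'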
Step 4: Configure ncps Instances#

All instances use identical configuration:
cache:
  hostname: cache.example.com # Same for all instances

storage:
  s3:
    bucket: ncps-cache
    endpoint: https://s3.amazonaws.com
    region: us-east-1
    access-key-id: ${S3_ACCESS_KEY}
    secret-access-key: ${S3_SECRET_KEY}
    force-path-style: false # Set to true for MinIO

database-url: postgresql://ncps:password@postgres:5432/ncps?sslmode=require

redis:
  addrs:
    - redis:6379
  password: ${REDIS_PASSWORD} # If using auth

lock:
  backend: redis # Options: local, redis, postgres, mysql
  download-lock-ttl: 5m
  lru-lock-ttl: 30m
  retry:
    max-attempts: 3
    initial-delay: 100ms
    max-delay: 2s
    jitter: true

upstream:
  urls:
    - https://cache.nixos.org
  public-keys:
    - cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY=

prometheus:
  enabled: true

Step 5: Deploy Instances#
Docker:
# Start instance 1
docker run -d --name ncps-1 -p 8501:8501 \
  -v $(pwd)/config.yaml:/config.yaml \
  kalbasit/ncps /bin/ncps serve --config=/config.yaml

# Start instance 2
docker run -d --name ncps-2 -p 8502:8501 \
  -v $(pwd)/config.yaml:/config.yaml \
  kalbasit/ncps /bin/ncps serve --config=/config.yaml

# Start instance 3
docker run -d --name ncps-3 -p 8503:8501 \
  -v $(pwd)/config.yaml:/config.yaml \
  kalbasit/ncps /bin/ncps serve --config=/config.yaml

Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ncps
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ncps
  template:
    metadata:
      labels:
        app: ncps
    spec:
      containers:
        - name: ncps
          image: kalbasit/ncps:latest
          # ... configuration ...

Step 6: Set Up Load Balancer#
nginx:
upstream ncps_backend {
    server ncps-1:8501;
    server ncps-2:8501;
    server ncps-3:8501;
}

server {
    listen 80;
    server_name cache.example.com;

    location / {
        proxy_pass http://ncps_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

HAProxy:
frontend ncps_frontend
    bind *:80
    default_backend ncps_backend

backend ncps_backend
    balance roundrobin
    option httpchk GET /nix-cache-info
    server ncps1 ncps-1:8501 check
    server ncps2 ncps-2:8501 check
    server ncps3 ncps-3:8501 check
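With the balancer in place, a quick end-to-end check confirms requests reach an instance:

curl http://cache.example.com/nix-cache-info
# A healthy cache responds with something like:
#   StoreDir: /nix/store
#   WantMassQuery: 1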
How Distributed Locking Works#

ncps coordinates multiple instances through a distributed lock backend (Redis in this guide; PostgreSQL/MySQL advisory locks are also supported):
Download Deduplication#
When multiple instances request the same package:
1. Instance A acquires the download lock for hash abc123
2. Instance B tries to download the same package
3. Instance B cannot acquire the lock (Instance A holds it)
4. Instance B retries with exponential backoff
5. Instance A completes the download and releases the lock
6. Instance B acquires the lock, finds the package in S3, and serves it

Result: Only one download from upstream.
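With the Redis backend you can watch lock keys appear while a download is in flight. The key pattern below is hypothetical; inspect your deployment for the actual naming:

# List candidate lock keys (pattern is a guess; adjust as needed)
redis-cli -h redis SCAN 0 MATCH '*lock*' COUNT 100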
LRU Coordination#
Only one instance runs cache cleanup at a time:
1. Instances try to acquire the global LRU lock
2. The first instance to acquire the lock runs LRU cleanup
3. Other instances skip LRU (lock held)
4. After cleanup, the lock is released
5. On the next scheduled LRU cycle, another instance may acquire the lock
Benefits:
- Prevents concurrent deletions
- Avoids cache corruption
- Distributes LRU load
See Distributed Locking for technical details and database advisory lock configuration (PostgreSQL/MySQL).
Health Checks#
Configure load balancer health checks:
Endpoint: GET /nix-cache-info
nginx example:
upstream ncps_backend {
    server ncps-1:8501 max_fails=3 fail_timeout=30s;
    server ncps-2:8501 max_fails=3 fail_timeout=30s;
    server ncps-3:8501 max_fails=3 fail_timeout=30s;
}

Kubernetes:
livenessProbe:
  httpGet:
    path: /nix-cache-info
    port: 8501
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /nix-cache-info
    port: 8501
  initialDelaySeconds: 5
  periodSeconds: 5

Rolling Updates#
Update instances one at a time for zero downtime:
# Update instance 1
docker stop ncps-1
docker rm ncps-1
docker pull kalbasit/ncps:latest
docker run -d --name ncps-1 ... # Same command as before

# Wait and verify instance 1 is healthy

# Update instance 2
docker stop ncps-2
# ... same process

# Update instance 3
# ... same process

Kubernetes:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
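With this strategy, an image update rolls through the pods one at a time; for example (deployment name from the manifest above, namespace as used in the Helm quick start):

kubectl set image deployment/ncps ncps=kalbasit/ncps:latest -n ncps
kubectl rollout status deployment/ncps -n ncps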
Monitoring HA Deployments#

Key Metrics#
- Instance health: Up/down status
- Lock acquisition rate: Download and LRU locks
- Lock contention: Retry attempts
- Redis connectivity: Connection status
- Cache hit rate: Per-instance and aggregate
Example Prometheus Queries#
# Lock acquisition success rate
rate(ncps_lock_acquisitions_total{result="success"}[5m])
  / rate(ncps_lock_acquisitions_total[5m])

# Lock retry attempts
rate(ncps_lock_retry_attempts_total[5m])

# Cache hit rate
rate(ncps_nar_served_total[5m])
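These queries can also drive alerting; a sketch of a Prometheus rule built on the metrics above (the threshold and durations are illustrative, not recommendations):

groups:
  - name: ncps-ha
    rules:
      - alert: NcpsLockAcquisitionFailing
        expr: |
          rate(ncps_lock_acquisitions_total{result="success"}[5m])
            / rate(ncps_lock_acquisitions_total[5m]) < 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ncps lock acquisition success rate below 90%"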
See Monitoring for dashboards.

Troubleshooting#
Download Locks Not Working#
Symptom: Multiple instances download the same package
Check:
# Verify Redis configuration
grep "redis-addrs" config.yaml

# Test Redis connectivity
redis-cli -h redis-host ping

# Check logs for lock messages
grep "acquired download lock" /var/log/ncps.log

High Lock Contention#
Symptom: Many retry attempts, slow downloads
Solutions:
- Increase retry settings (a sketch follows below)
- Increase lock TTLs for long-running operations
- Scale down the instance count if contention remains high
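For example, loosening the lock settings from the configuration in Step 4 (values here are illustrative):

lock:
  backend: redis
  download-lock-ttl: 15m  # allow for slow upstream downloads
  lru-lock-ttl: 1h
  retry:
    max-attempts: 5
    initial-delay: 200ms
    max-delay: 5s
    jitter: true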
See Distributed Locking for detailed troubleshooting.
Migration from Single-Instance#
Prerequisites#
- ✅ Set up PostgreSQL or MySQL database
- ✅ Migrate from SQLite (if applicable)
- ✅ Set up S3-compatible storage
- ✅ Deploy Redis server
Migration Steps#
1. Migrate to S3 Storage:
# Sync local storage to S3
aws s3 sync /var/lib/ncps s3://ncps-cache/

2. Migrate Database:
# Export SQLite data
sqlite3 ncps.db .dump > backup.sql

# Import to PostgreSQL (after conversion)
pgloader sqlite:///var/lib/ncps/db/db.sqlite \
  postgresql://ncps:password@localhost:5432/ncps

3. Configure First Instance:
# Update config.yaml to use S3 and PostgreSQL
# Add Redis configuration

4. Verify Functionality (see the sketch after these steps):
- Test package downloads
- Check Redis for lock keys
- Verify cache hits
5. Add Additional Instances:
- Use identical configuration
- Point to same Redis, S3, and database
- Add to load balancer
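A minimal verification pass for step 4 might look like this, reusing the hostnames from earlier steps (the Redis key pattern is a guess; adjust to your deployment):

# Package downloads work through the cache
curl -fsS http://cache.example.com/nix-cache-info

# Lock keys appear in Redis while a download is in flight
redis-cli -h redis SCAN 0 MATCH '*lock*' COUNT 100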
Best Practices#
- Start Redis First - Ensure Redis is healthy before starting ncps instances
- Use Health Checks - Configure load balancer health checks
- Monitor Lock Metrics - Watch for contention and failures
- Plan Capacity - 3+ instances recommended for true HA
- Test Failover - Regularly test instance failures
- Centralize Logs - Use log aggregation for troubleshooting
- Set Up Alerts - Alert on high lock failures, Redis unavailability
Next Steps#
- Client Setup - Set up Nix clients
- Distributed Locking - Understand locking in depth
- Monitoring - Configure observability
- Operations - Learn about backups, upgrades
Related Documentation#
- Distributed Locking - Deep dive into Redis locking
- Helm Chart - Simplified HA deployment
- Reference - All HA options
- Monitoring - HA-specific monitoring