Monitoring Guide#
Set up monitoring, metrics, and alerting for ncps.
Enable Prometheus#
Enable the /metrics endpoint (supported by serve and migrate-narinfo commands):
prometheus:
  enabled: true
Access metrics at http://your-ncps:8501/metrics (for serve) or via stdout/OTel (for migrate-narinfo).
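Once the endpoint is enabled, it can help to confirm that the expected metric names actually appear in the output. The following is a minimal Python sketch, not part of ncps: the helper names (`fetch_metrics`, `metric_names`) and the sample values and label values are made up for illustration, and the parser handles only the simple line shapes of the Prometheus text format.

```python
import urllib.request


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw Prometheus text exposition from the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition: str) -> set:
    """Return the set of metric names found in a text exposition."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (start of labels) or space (value).
        end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                end = min(end, idx)
        names.add(line[:end])
    return names


# Hypothetical sample output; values and label values are placeholders.
SAMPLE = """\
# HELP ncps_nar_served_total NAR files served
# TYPE ncps_nar_served_total counter
ncps_nar_served_total 42
ncps_lock_failures_total{type="example",reason="example",mode="example"} 1
"""

print(sorted(metric_names(SAMPLE)))
```

Against a running instance, `metric_names(fetch_metrics("http://your-ncps:8501/metrics"))` would return the names exposed by that instance, assuming the host and port from this guide.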
Available Metrics#
HTTP Metrics:
- http_server_requests_total - Total HTTP requests
- http_server_request_duration_seconds - Request duration
- http_server_active_requests - Active requests
Cache Metrics:
- ncps_nar_served_total - NAR files served
- ncps_narinfo_served_total - NarInfo files served
Lock Metrics (HA):
- ncps_lock_acquisitions_total{type,result,mode} - Lock acquisitions
- ncps_lock_hold_duration_seconds{type,mode} - Lock hold time
- ncps_lock_failures_total{type,reason,mode} - Lock failures
Migration Metrics:
- ncps_migration_objects_total{migration_type,operation,result} - Objects migrated
  - Labels: migration_type (narinfo-to-db/nar-to-chunks), operation (migrate/delete), result (success/failure/skipped)
- ncps_migration_duration_seconds{migration_type,operation} - Migration operation duration histogram
  - Labels: migration_type (narinfo-to-db/nar-to-chunks), operation (migrate/delete)
- ncps_migration_batch_size{migration_type} - Migration batch sizes
  - Label: migration_type (narinfo-to-db/nar-to-chunks)
Background Migration Metrics:
- ncps_background_migration_objects_total{migration_type,operation,result} - Total number of objects processed during background migration
  - Labels: migration_type (narinfo-to-db/nar-to-chunks), operation (migrate/delete), result (success/failure)
- ncps_background_migration_duration_seconds{migration_type,operation} - Background migration operation duration histogram
  - Labels: migration_type (narinfo-to-db/nar-to-chunks), operation (migrate/delete)
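To illustrate how the labels on the migration counters can be combined, here is a minimal Python sketch that groups counter samples by one label and derives a success ratio. It is not part of ncps: the helper names (`sum_by`, `success_ratio`) are hypothetical and the sample values are invented.

```python
from collections import defaultdict

# Hypothetical samples of ncps_migration_objects_total as (labels, value).
# The numbers are made up for illustration.
SAMPLES = [
    ({"migration_type": "narinfo-to-db", "operation": "migrate", "result": "success"}, 90.0),
    ({"migration_type": "narinfo-to-db", "operation": "migrate", "result": "failure"}, 6.0),
    ({"migration_type": "narinfo-to-db", "operation": "migrate", "result": "skipped"}, 4.0),
    ({"migration_type": "nar-to-chunks", "operation": "migrate", "result": "success"}, 10.0),
]


def sum_by(samples, label):
    """Sum sample values grouped by the value of one label."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


def success_ratio(samples, migration_type):
    """Fraction of objects with result="success" for one migration type.

    Returns None when there are no samples for that migration type.
    """
    selected = [(l, v) for l, v in samples if l.get("migration_type") == migration_type]
    by_result = sum_by(selected, "result")
    total = sum(by_result.values())
    if total == 0:
        return None
    return by_result.get("success", 0.0) / total


print(sum_by(SAMPLES, "migration_type"))
print(success_ratio(SAMPLES, "narinfo-to-db"))
```

Note that skipped objects count toward the denominator in this sketch; whether that is the right definition of success rate depends on how you want to treat skips.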
Prometheus Configuration#
Add to prometheus.yml:
scrape_configs:
  - job_name: 'ncps'
    static_configs:
      - targets: ['ncps:8501']
    scrape_interval: 30s
Grafana Dashboards#
Key Panels#
Cache Performance:
- Cache hit rate
- NAR served rate
- Request duration (p50, p95, p99)
HA Lock Performance:
- Lock acquisition success rate
- Lock retry attempts
- Lock hold duration
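The percentile panels listed above (p50, p95, p99) are typically estimated from histogram buckets rather than measured exactly. As a rough illustration, assuming the duration metrics are exposed as classic Prometheus histograms with cumulative buckets, the following Python sketch mirrors the linear interpolation that PromQL's `histogram_quantile()` performs. It is a simplified model (non-negative bounds only, no native histograms), and the bucket values in the example are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified model of PromQL's histogram_quantile(): locate the bucket
    containing the target rank, then interpolate linearly inside it.
    Assumes non-negative bounds and a final +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include a +Inf upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # The +Inf bucket holds every observation, so this loop always breaks.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if upper == math.inf:
        # Rank falls beyond the last finite bucket: report its upper bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    return lower + (upper - lower) * (rank - below) / in_bucket


# Invented example: 100 requests, bucket bounds in seconds.
EXAMPLE = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, EXAMPLE))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.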
Example PromQL Queries#
NAR served rate:
rate(ncps_nar_served_total[5m])
Lock success rate:
sum(rate(ncps_lock_acquisitions_total{result="success"}[5m]))
/ sum(rate(ncps_lock_acquisitions_total[5m]))
Migration throughput:
rate(ncps_migration_objects_total{migration_type="narinfo-to-db"}[5m])
Migration success rate:
sum(rate(ncps_migration_objects_total{migration_type="narinfo-to-db",result="success"}[5m]))
/ sum(rate(ncps_migration_objects_total{migration_type="narinfo-to-db"}[5m]))
Migration duration (p50, p99):
# Median
histogram_quantile(0.5, sum by (le) (rate(ncps_migration_duration_seconds_bucket{migration_type="narinfo-to-db"}[5m])))
# 99th percentile
histogram_quantile(0.99, sum by (le) (rate(ncps_migration_duration_seconds_bucket{migration_type="narinfo-to-db"}[5m])))
Alerting#
Recommended Alerts#
High Lock Failure Rate:
- alert: HighLockFailureRate
  expr: rate(ncps_lock_failures_total[5m]) > 0.1
  annotations:
    summary: High lock failure rate
ncps Down:
- alert: NcpsDown
  expr: up{job="ncps"} == 0
  for: 1m
  annotations:
    summary: ncps instance down
Health Checks#
Endpoint: GET /nix-cache-info
Example check:
curl -f http://localhost:8501/nix-cache-info || exit 1
Related Documentation#
- Observability - Configure metrics
- Troubleshooting - Debug issues