Production Runbook
Operational procedures for maintaining, troubleshooting, and responding to incidents on AxiomDB infrastructure
This runbook covers day-to-day operations, health checks, connectivity tests, and incident response procedures for the AxiomDB control plane.
Service Management
Starting, stopping, and checking status of all AxiomDB services
Health Checks
Automated and manual health verification procedures
Connectivity Tests
Verifying end-to-end connectivity through every layer
Incident Response
Playbooks for common production incidents
Maintenance
Planned maintenance procedures and checklists
Runbook Commands
Copy-paste commands for common operations
Service Architecture
┌──────────────────────────────────────────────────────────────────┐
│                              Internet                            │
│                                 │                                │
│                                 ▼                                │
│                          ┌─────────────┐                         │
│                          │    Nginx    │  TLS termination        │
│                          │    :443     │  Reverse proxy          │
│                          └──────┬──────┘                         │
│                                 │                                │
│                   ┌─────────────┼─────────────┐                  │
│                   ▼                           ▼                  │
│          ┌─────────────────┐         ┌─────────────────┐         │
│          │     AxiomDB     │         │   Ops Console   │         │
│          │     Gateway     │         │    (Next.js)    │         │
│          │      :4060      │         │      :3000      │         │
│          └────────┬────────┘         └─────────────────┘         │
│                   │                                              │
│                   ▼                                              │
│          ┌─────────────────┐         ┌─────────────────┐         │
│          │      Redis      │         │   PostgreSQL    │         │
│          │      :6379      │         │      :5432      │         │
│          │ (command queue) │         │   (databases)   │         │
│          └─────────────────┘         └────────┬────────┘         │
│                                               │                  │
│                                               ▼                  │
│                                      ┌─────────────────┐         │
│                                      │    PgBouncer    │         │
│                                      │      :6432      │         │
│                                      │ (session mode)  │         │
│                                      └─────────────────┘         │
│                                                                  │
│       ┌─────────────────────────────────────────────────┐        │
│       │               UFW Firewall Rules                │        │
│       │  22/tcp (SSH), 80/tcp, 443/tcp, 4060/tcp        │        │
│       │  5432/tcp (internal only), 6432/tcp (internal)  │        │
│       └─────────────────────────────────────────────────┘        │
└──────────────────────────────────────────────────────────────────┘
Service Management
Service Inventory
| Service | Unit Name | Port | Purpose |
|---|---|---|---|
| AxiomDB Gateway | axiomdb-gateway | 4060 | REST API, branch provisioning, command queue |
| Ops Console | astradb-ops | 3000 | Web dashboard for management |
| PostgreSQL | postgresql | 5432 | Database engine |
| PgBouncer | pgbouncer | 6432 | Connection pooler |
| Nginx | nginx | 443/80 | TLS proxy, static assets |
| Redis | redis-server | 6379 | Command queue, caching |
Check All Services
#!/bin/bash
# Quick health check for all services
services=(
  "axiomdb-gateway"
  "astradb-ops"
  "postgresql"
  "pgbouncer"
  "nginx"
  "redis-server"
)
printf "%-25s %-10s %-10s\n" "SERVICE" "ACTIVE" "ENABLED"
printf "%-25s %-10s %-10s\n" "-------" "------" "-------"
for svc in "${services[@]}"; do
  # systemctl prints the state even when it exits non-zero, so fall back only when the output is empty
  active=$(systemctl is-active "$svc" 2>/dev/null || true)
  enabled=$(systemctl is-enabled "$svc" 2>/dev/null || true)
  printf "%-25s %-10s %-10s\n" "$svc" "${active:-not-found}" "${enabled:-not-found}"
done
Expected output:
SERVICE                   ACTIVE     ENABLED
-------                   ------     -------
axiomdb-gateway           active     enabled
astradb-ops               active     enabled
postgresql                active     enabled
pgbouncer                 active     enabled
nginx                     active     enabled
redis-server              active     enabled
Start / Stop / Restart Services
# Start a service
sudo systemctl start axiomdb-gateway
# Stop a service
sudo systemctl stop axiomdb-gateway
# Restart a service
sudo systemctl restart axiomdb-gateway
# Reload configuration (no downtime)
sudo systemctl reload nginx
sudo systemctl reload pgbouncer
# View service logs
sudo journalctl -u axiomdb-gateway -f --since "10 minutes ago"
sudo journalctl -u postgresql -f --since "1 hour ago"
Restart Order
When restarting multiple services, follow this order: (1) PostgreSQL, (2) PgBouncer, (3) Redis, (4) AxiomDB Gateway, (5) Ops Console, (6) Nginx. Reverse order for shutdowns.
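The ordering above can be sketched as a small helper. This is a minimal sketch, not an official AxiomDB script: the unit names are taken from the Service Inventory table, while the `SYSTEMCTL` override and the function names `stop_order` and `ordered_restart` are hypothetical and exist only so the sequence can be dry-run before it touches real services.

```shell
#!/bin/bash
# Start order from the runbook; unit names come from the Service Inventory table.
START_ORDER=(postgresql pgbouncer redis-server axiomdb-gateway astradb-ops nginx)

# Hypothetical override: set SYSTEMCTL=echo to print the sequence instead of running it.
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"

# Print the shutdown order (the start order reversed), one unit per line.
stop_order() {
  local i
  for (( i=${#START_ORDER[@]}-1; i>=0; i-- )); do
    printf '%s\n' "${START_ORDER[i]}"
  done
}

# Stop every unit in reverse order, then start every unit in forward order.
# Aborts on the first failing command so a broken dependency is not skipped over.
ordered_restart() {
  local svc
  for svc in $(stop_order); do
    $SYSTEMCTL stop "$svc" || return 1
  done
  for svc in "${START_ORDER[@]}"; do
    $SYSTEMCTL start "$svc" || return 1
  done
}
```

To preview the sequence without changing anything, source the file, set `SYSTEMCTL=echo`, and call `ordered_restart`; unset the override to run it for real.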
Service Logs
# Gateway logs (Rust/Axum)
sudo journalctl -u axiomdb-gateway -n 100 --no-pager
# PostgreSQL logs
sudo tail -f /var/log/postgresql/postgresql-14-main.log
# PgBouncer logs
sudo journalctl -u pgbouncer -n 100 --no-pager
# Nginx access/error logs
sudo tail -f /var/log/nginx/access.log
sudo tail -f /var/log/nginx/error.log
# Redis logs
sudo journalctl -u redis-server -n 50 --no-pager
# Ops Console logs
sudo journalctl -u astradb-ops -n 50 --no-pager
Health Checks
Endpoint Health Checks
# Gateway health endpoint
curl -s http://127.0.0.1:4060/health | jq .
# Expected response:
# {
# "status": "healthy",
# "version": "1.2.3",
# "uptime_seconds": 86400,
# "postgres": "connected",
# "redis": "connected"
# }
# Ops Console health
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:3000/api/health
# Nginx health
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:80/health
# External endpoint (through Nginx TLS)
curl -s -o /dev/null -w "%{http_code}" https://your-domain.com/health
Database Health Checks
-- Basic connectivity
SELECT 1 AS health_check;
-- PostgreSQL version and uptime
SELECT
version(),
pg_postmaster_start_time() AS started_at,
now() - pg_postmaster_start_time() AS uptime;
-- Check for replication (if applicable)
SELECT
client_addr,
state,
sent_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;
-- Database list with sizes
SELECT
datname,
pg_size_pretty(pg_database_size(datname)) AS size,
numbackends AS connections
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY datname;
PgBouncer Health
# PgBouncer pool status
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW POOLS;"
# PgBouncer version
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW VERSION;"
Redis Health
# Redis ping
redis-cli ping
# Redis info
redis-cli info server | head -20
# Check queue depth
redis-cli llen axiomdb:commands
Automated Health Check Script
#!/bin/bash
# /opt/axiomdb/scripts/health-check.sh
set -euo pipefail
FAILURES=0
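check() {
  local name="$1"
  local cmd="$2"
  if eval "$cmd" > /dev/null 2>&1; then
    echo "✓ $name"
  else
    echo "✗ $name"
    # Plain assignment: ((FAILURES++)) returns non-zero when FAILURES is 0, which aborts the script under set -e
    FAILURES=$((FAILURES + 1))
  fi
}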
check "Gateway health" "curl -sf http://127.0.0.1:4060/health"
check "Ops Console" "curl -sf http://127.0.0.1:3000/api/health"
check "PostgreSQL" "psql -h 127.0.0.1 -p 5432 -U axiomdb -c 'SELECT 1'"
check "PgBouncer" "psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c 'SHOW VERSION'"
check "Redis" "redis-cli ping"
check "Nginx" "curl -sf http://127.0.0.1:80/health"
if [ "$FAILURES" -gt 0 ]; then
  echo ""
  echo "FAILED: $FAILURES check(s)"
  exit 1
fi
echo ""
echo "All checks passed"
exit 0
Connectivity Tests
End-to-End Connectivity
Test the full request path from client through every layer:
# 1. External TLS connectivity
curl -v https://your-domain.com/health 2>&1 | grep -E "SSL|HTTP|Connected"
# 2. Gateway API (through Nginx proxy)
curl -s https://your-domain.com/api/branches | jq .
# 3. Gateway API (direct, bypassing Nginx)
curl -s http://127.0.0.1:4060/api/branches | jq .
# 4. PostgreSQL (direct connection)
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT current_database(), current_user;"
# 5. PostgreSQL (through PgBouncer)
psql -h 127.0.0.1 -p 6432 -U axiomdb -c "SELECT current_database(), current_user;"
# 6. Redis connectivity
redis-cli ping
redis-cli llen axiomdb:commands
Network Rule Verification
# Check UFW firewall status
sudo ufw status verbose
# Expected rules:
# Status: active
#
# To Action From
# -- ------ ----
# 22/tcp ALLOW IN Anywhere
# 80/tcp ALLOW IN Anywhere
# 443/tcp ALLOW IN Anywhere
# 4060/tcp ALLOW IN <gateway-ip>
# 5432/tcp ALLOW IN 127.0.0.1
# 6432/tcp ALLOW IN 127.0.0.1
# Check if a specific port is reachable
nc -zv 127.0.0.1 5432
nc -zv 127.0.0.1 6432
nc -zv 127.0.0.1 4060
# Check Nginx proxy configuration
sudo nginx -t
# Check the Gateway (the Nginx upstream) directly
curl -s http://127.0.0.1:4060/health
DNS and TLS Verification
# Check DNS resolution
dig +short your-domain.com
# Check TLS certificate
echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
openssl x509 -noout -dates -subject
# Check certificate chain
echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
grep -E "depth=|verify"
# Check for certificate expiry (warn if < 30 days)
EXPIRY=$(echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || date -jf "%b %d %T %Y %Z" "$EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
echo "Certificate expires in $DAYS_LEFT days"
if [ "$DAYS_LEFT" -lt 30 ]; then
  echo "WARNING: certificate expires in less than 30 days"
fi
Incident Response
Incident: High Connection Count
Symptoms: Application errors, slow queries, FATAL: too many connections for role
Diagnosis:
-- Check total connections
SELECT count(*) FROM pg_stat_activity;
-- Connections by database and state
SELECT
datname,
state,
count(*) AS count
FROM pg_stat_activity
WHERE datname NOT IN ('template0', 'template1', 'postgres', 'pgbouncer')
GROUP BY datname, state
ORDER BY datname, count DESC;
-- Connections by client IP
SELECT
client_addr,
usename,
datname,
count(*) AS count
FROM pg_stat_activity
WHERE client_addr IS NOT NULL
GROUP BY client_addr, usename, datname
ORDER BY count DESC;
-- Long-running idle connections
SELECT
pid,
usename,
datname,
state,
now() - state_change AS idle_duration,
query
FROM pg_stat_activity
WHERE state = 'idle'
AND now() - state_change > interval '10 minutes'
ORDER BY idle_duration DESC;
Resolution:
# Option 1: Kill idle connections older than 30 minutes
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND now() - state_change > interval '30 minutes'
AND datname NOT IN ('template0', 'template1', 'postgres', 'pgbouncer');
"
# Option 2: Restart PgBouncer to reset all pooled connections
sudo systemctl restart pgbouncer
# Option 3: Increase max_connections (requires restart)
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "ALTER SYSTEM SET max_connections = 300;"
sudo systemctl restart postgresql
Incident: Migration Failures
Symptoms: Branch stuck in provisioning, _prisma_migrations shows incomplete entries
Diagnosis:
-- Find stuck migrations
SELECT
migration_name,
started_at,
finished_at,
now() - started_at AS duration
FROM _prisma_migrations
WHERE finished_at IS NULL
ORDER BY started_at DESC;
-- Check for locks blocking DDL
SELECT
l.pid,
l.locktype,
l.mode,
l.granted,
a.usename,
a.query,
a.state
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted
ORDER BY l.pid;
-- Check for active DDL statements
SELECT
pid,
usename,
query,
state,
now() - query_start AS duration
FROM pg_stat_activity
WHERE query ~* 'ALTER|CREATE|DROP|CREATE INDEX'
AND state = 'active';
Resolution:
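# Step 1: Identify the blocked sessions and the PIDs blocking them
# (sessions with ungranted locks are the waiters; pg_blocking_pids() names the blockers)
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
SELECT pid, pg_blocking_pids(pid) AS blocked_by, query, state
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"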
# Step 2: Terminate the blocking process
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_terminate_backend(<PID>);"
# Step 3: If the migration itself is stuck, mark it as failed
psql -h 127.0.0.1 -p 5432 -U axiomdb -d <branch_db> -c "
UPDATE _prisma_migrations
SET finished_at = now(),
logs = 'Manually terminated due to stuck migration'
WHERE migration_name = '<migration_name>'
AND finished_at IS NULL;
"
# Step 4: Retry the migration via Gateway
curl -X POST http://127.0.0.1:4060/api/branches/<branch_id>/migrations/run \
-H "Content-Type: application/json" \
-d '{"migration": "<migration_name>"}'
Never Force-Kill PostgreSQL Directly
Always use pg_terminate_backend() rather than kill -9. Force-killing a PostgreSQL process forces crash recovery, which drops every open session, and risks data corruption.
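One way to make the safe path the easy path is a small wrapper that builds the SQL for a single PID. This is a minimal sketch under stated assumptions: the function name `backend_stop_sql` and its `cancel`/`terminate` modes are hypothetical, while `pg_cancel_backend()` and `pg_terminate_backend()` are standard PostgreSQL functions. It only prints the statement and refuses input that is not a positive integer, such as a leftover `<PID>` placeholder.

```shell
#!/bin/bash
# Hypothetical helper: print the SQL that stops one backend by PID.
# mode "cancel" interrupts the current query; mode "terminate" ends the whole session.
backend_stop_sql() {
  local pid="$1" mode="${2:-terminate}"
  # Accept only a plain positive integer so placeholders and injected text are rejected.
  if ! [[ "$pid" =~ ^[1-9][0-9]*$ ]]; then
    echo "invalid PID: $pid" >&2
    return 1
  fi
  case "$mode" in
    cancel)    echo "SELECT pg_cancel_backend($pid);" ;;
    terminate) echo "SELECT pg_terminate_backend($pid);" ;;
    *)         echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

A typical call would try the gentler option first, for example `psql -h 127.0.0.1 -p 5432 -U axiomdb -c "$(backend_stop_sql 12345 cancel)"`, and fall back to `terminate` only if the query does not stop.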
Incident: Network Rule Failures
Symptoms: Branch connections refused, pg_hba.conf errors, firewall blocking legitimate traffic
Diagnosis:
# Check pg_hba.conf for the branch's network rules
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
SELECT type, database, user_name, address, netmask, auth_method
FROM pg_hba_file_rules
ORDER BY type, database;
"
# Check UFW rules
sudo ufw status numbered
# Test connectivity from the application's perspective
nc -zv <gateway-ip> 4060
# Check Nginx upstream errors
sudo tail -20 /var/log/nginx/error.log | grep -E "upstream|connect"
Resolution:
# Step 1: Verify and reload pg_hba.conf
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_reload_conf();"
# Step 2: Add missing network grant via AxiomDB
curl -X POST http://127.0.0.1:4060/api/branches/<branch_id>/network-grants \
-H "Content-Type: application/json" \
-d '{"cidr": "10.0.0.0/8", "description": "internal network"}'
# Step 3: If UFW is blocking, add the rule
sudo ufw allow from 10.0.0.0/8 to any port 5432
sudo ufw allow from 10.0.0.0/8 to any port 6432
# Step 4: Reload PostgreSQL to pick up pg_hba changes
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_reload_conf();"
Incident: Backup Failures
Symptoms: pgBackRest errors, WAL archiving lag, backup age > 24 hours
Diagnosis:
# Check pgBackRest status
pgbackrest --stanza=axiomdb info
# Check WAL archiving
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
SELECT
archived_count,
failed_count,
last_archived_wal,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
"
# Check archive_command in postgresql.conf
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SHOW archive_command;"
# Check pgBackRest logs
sudo tail -50 /var/log/pgbackrest/*.log
Resolution:
# If archive_command is failing
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SHOW archive_mode;"
# Test archive manually
pgbackrest --stanza=axiomdb archive-push pg_wal/000000010000000000000001
# If stanza needs repair
pgbackrest --stanza=axiomdb stanza-upgrade
# Retry the backup
pgbackrest --stanza=axiomdb --type=full backup
Incident: Disk Space Emergency
Symptoms: No space left on device, PostgreSQL refusing writes, WAL accumulation
Immediate Actions:
# 1. Identify largest files
sudo du -sh /var/lib/postgresql/14/main/* | sort -rh | head -10
sudo du -sh /var/lib/pgbackrest/* | sort -rh | head -10
sudo du -sh /var/log/* | sort -rh | head -10
# 2. Clean old logs
sudo journalctl --vacuum-time=3d
sudo find /var/log -name "*.gz" -mtime +7 -delete
# 3. Clean old pgBackRest backups (keep last 3 full)
pgbackrest --stanza=axiomdb --repo1-retention-full=3 expire
# 4. Remove old backup history files (never delete WAL segment files from pg_wal by hand)
sudo find /var/lib/postgresql/14/main/pg_wal -name "*.backup" -mtime +7 -delete
# 5. Check for large temp files
sudo find /tmp -size +100M -type f -ls
Maintenance Windows
Pre-Maintenance Checklist
□ Schedule maintenance window (low traffic period)
□ Notify users via status page
□ Create a pgBackRest snapshot
□ Verify current backup is fresh: pgbackrest --stanza=axiomdb info
□ Document current service states
□ Prepare rollback plan
□ Ensure SSH access is stable
□ Check disk space is sufficient for any operations
PostgreSQL Minor Version Upgrade
# 1. Create backup
pgbackrest --stanza=axiomdb --type=full backup
# 2. Stop services (in order)
sudo systemctl stop astradb-ops
sudo systemctl stop axiomdb-gateway
sudo systemctl stop pgbouncer
sudo systemctl stop postgresql
# 3. Upgrade packages
sudo apt update
sudo apt install postgresql-14
# 4. Start PostgreSQL
sudo systemctl start postgresql
# 5. Verify
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT version();"
# 6. Start remaining services
sudo systemctl start pgbouncer
sudo systemctl start axiomdb-gateway
sudo systemctl start astradb-ops
# 7. Run health checks
/opt/axiomdb/scripts/health-check.sh
Post-Maintenance Verification
# Run full health check
/opt/axiomdb/scripts/health-check.sh
# Verify all branches are accessible
curl -s http://127.0.0.1:4060/api/branches | jq '.branches[] | {name, status}'
# Check for errors in logs
sudo journalctl -u axiomdb-gateway --since "15 minutes ago" | grep -i "error\|fatal\|panic"
sudo journalctl -u postgresql --since "15 minutes ago" | grep -i "error\|fatal"
sudo journalctl -u pgbouncer --since "15 minutes ago" | grep -i "error\|fatal"
Quick Reference
Common Commands
# Service status
systemctl status axiomdb-gateway postgresql pgbouncer nginx redis-server
# Gateway API calls
curl -s http://127.0.0.1:4060/health | jq .
curl -s http://127.0.0.1:4060/api/branches | jq .
curl -s http://127.0.0.1:4060/api/branches/<id> | jq .
# Database operations
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT * FROM pg_stat_activity;"
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_database_size(datname), datname FROM pg_database;"
# PgBouncer admin
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW POOLS;"
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW STATS;"
# Backup operations
pgbackrest --stanza=axiomdb info
pgbackrest --stanza=axiomdb --type=full backup
pgbackrest --stanza=axiomdb verify
# Firewall
sudo ufw status
sudo ufw allow from 10.0.0.0/8 to any port 5432
# Logs
journalctl -u axiomdb-gateway -f
tail -f /var/log/postgresql/postgresql-14-main.log
tail -f /var/log/nginx/error.log
# square-dbctl provisioning
square-dbctl provision --branch <name> --database <db>
square-dbctl status --branch <name>
Emergency Contacts
| Role | Contact | When to Escalate |
|---|---|---|
| On-call engineer | Slack #oncall | Any production incident |
| DBA | Slack #dba-team | Data corruption, performance degradation |
| Infrastructure | Slack #infra | VPS issues, network outages |
| Security | Slack #security | Credential compromise, unauthorized access |
Appendix: Port Reference
| Port | Service | Binding | Access |
|---|---|---|---|
| 22 | SSH | 0.0.0.0 | Restricted by IP |
| 80 | Nginx (HTTP→HTTPS redirect) | 0.0.0.0 | Public |
| 443 | Nginx (TLS) | 0.0.0.0 | Public |
| 3000 | Ops Console | 127.0.0.1 | Internal (proxied via Nginx) |
| 4060 | AxiomDB Gateway | 127.0.0.1 | Internal (proxied via Nginx) |
| 5432 | PostgreSQL | 127.0.0.1 | Internal only |
| 6379 | Redis | 127.0.0.1 | Internal only |
| 6432 | PgBouncer | 127.0.0.1 | Internal only |
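The Binding column can be checked against a live host by parsing listener output. This is a minimal sketch, assuming `ss -Hltn` from iproute2 is available and prints the local address and port in its fourth column; the function name `check_binding` is hypothetical. It reads listener lines on stdin and reports whether a port is bound only to the expected address.

```shell
#!/bin/bash
# Hypothetical helper: read `ss -Hltn` output on stdin and check that PORT
# is listening, and only on EXPECTED_ADDR.
# Exit status: 0 = ok, 1 = bound to another address, 2 = not listening.
check_binding() {
  local port="$1" expected="$2"
  awk -v port="$port" -v expected="$expected" '
    {
      local_addr = $4                          # e.g. 127.0.0.1:5432 or [::1]:5432
      n = split(local_addr, parts, ":")
      if (parts[n] != port) next               # listener on a different port
      found = 1
      # Strip the trailing ":PORT" to get the bound address.
      addr = substr(local_addr, 1, length(local_addr) - length(port) - 1)
      if (addr != expected) bad = 1
    }
    END {
      if (!found) { print port ": not listening"; exit 2 }
      if (bad)    { print port ": unexpected binding"; exit 1 }
      print port ": ok (" expected ")"
    }'
}
```

For example, `ss -Hltn | check_binding 5432 127.0.0.1` should print `5432: ok (127.0.0.1)` on a host that matches the table, and exit non-zero if PostgreSQL is also listening on a public interface.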