Production Runbook

Operational procedures for maintaining, troubleshooting, and responding to incidents on AxiomDB infrastructure

This runbook covers day-to-day operations, health checks, connectivity tests, and incident response procedures for the AxiomDB control plane.


Service Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         Internet                                  │
│                            │                                      │
│                            ▼                                      │
│                     ┌─────────────┐                               │
│                     │   Nginx     │  TLS termination              │
│                     │   :443      │  Reverse proxy                │
│                     └──────┬──────┘                               │
│                            │                                      │
│              ┌─────────────┼─────────────┐                       │
│              ▼                           ▼                       │
│     ┌─────────────────┐       ┌─────────────────┐               │
│     │  AxiomDB        │       │  Ops Console    │               │
│     │  Gateway        │       │  (Next.js)      │               │
│     │  :4060          │       │  :3000          │               │
│     └────────┬────────┘       └─────────────────┘               │
│              │                                                   │
│              ▼                                                   │
│     ┌─────────────────┐       ┌─────────────────┐               │
│     │  Redis          │       │  PostgreSQL     │               │
│     │  :6379          │       │  :5432          │               │
│     │  (command queue)│       │  (databases)    │               │
│     └─────────────────┘       └────────┬────────┘               │
│                                        │                         │
│                                        ▼                         │
│                               ┌─────────────────┐               │
│                               │  PgBouncer      │               │
│                               │  :6432          │               │
│                               │  (session mode) │               │
│                               └─────────────────┘               │
│                                                                  │
│     ┌─────────────────────────────────────────────────┐         │
│     │  UFW Firewall Rules                             │         │
│     │  22/tcp (SSH), 80/tcp, 443/tcp, 4060/tcp        │         │
│     │  5432/tcp (internal only), 6432/tcp (internal)  │         │
│     └─────────────────────────────────────────────────┘         │
└──────────────────────────────────────────────────────────────────┘

Service Management

Service Inventory

Service          Unit Name        Port    Purpose
AxiomDB Gateway  axiomdb-gateway  4060    REST API, branch provisioning, command queue
Ops Console      astradb-ops      3000    Web dashboard for management
PostgreSQL       postgresql       5432    Database engine
PgBouncer        pgbouncer        6432    Connection pooler
Nginx            nginx            443/80  TLS proxy, static assets
Redis            redis-server     6379    Command queue, caching

Check All Services

#!/bin/bash
# Quick health check for all services

services=(
    "axiomdb-gateway"
    "astradb-ops"
    "postgresql"
    "pgbouncer"
    "nginx"
    "redis-server"
)

printf "%-25s %-10s %-10s\n" "SERVICE" "ACTIVE" "ENABLED"
printf "%-25s %-10s %-10s\n" "-------" "------" "-------"

for svc in "${services[@]}"; do
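    # systemctl prints the state (e.g. "inactive") and exits non-zero when a unit
    # is not running, so only fall back to "not-found" when nothing was printed.
    active=$(systemctl is-active "$svc" 2>/dev/null) || true
    enabled=$(systemctl is-enabled "$svc" 2>/dev/null) || true
    printf "%-25s %-10s %-10s\n" "$svc" "${active:-not-found}" "${enabled:-not-found}"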
done

Expected output:

SERVICE                   ACTIVE     ENABLED
-------                   ------     -------
axiomdb-gateway           active     enabled
astradb-ops               active     enabled
postgresql                active     enabled
pgbouncer                 active     enabled
nginx                     active     enabled
redis-server              active     enabled

Start / Stop / Restart Services

# Start a service
sudo systemctl start axiomdb-gateway

# Stop a service
sudo systemctl stop axiomdb-gateway

# Restart a service
sudo systemctl restart axiomdb-gateway

# Reload configuration (no downtime)
sudo systemctl reload nginx
sudo systemctl reload pgbouncer

# View service logs
sudo journalctl -u axiomdb-gateway -f --since "10 minutes ago"
sudo journalctl -u postgresql -f --since "1 hour ago"

Restart Order

When restarting multiple services, follow this order: (1) PostgreSQL, (2) PgBouncer, (3) Redis, (4) AxiomDB Gateway, (5) Ops Console, (6) Nginx. Reverse the order when shutting services down.
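
The ordering above can be scripted so it is not typed from memory during an incident. The following is a minimal sketch, not an official AxiomDB tool: the function names and the SYSTEMCTL override (for dry runs) are assumptions made for illustration, while the unit names come from the Service Inventory.

```shell
#!/bin/bash
# Sketch: restart services in dependency order.
# SYSTEMCTL is a hypothetical override; set SYSTEMCTL=echo for a dry run.

RESTART_ORDER="postgresql pgbouncer redis-server axiomdb-gateway astradb-ops nginx"

restart_in_order() {
    local systemctl_cmd="${SYSTEMCTL:-sudo systemctl}"
    local svc
    for svc in $RESTART_ORDER; do
        echo "Restarting $svc"
        # Word splitting on $systemctl_cmd is intentional ("sudo systemctl").
        $systemctl_cmd restart "$svc" || {
            echo "Failed to restart $svc; stopping here" >&2
            return 1
        }
    done
}

# Print the reverse order on one line, for planned shutdowns.
shutdown_order() {
    local reversed="" svc
    for svc in $RESTART_ORDER; do
        reversed="$svc $reversed"
    done
    echo "${reversed% }"
}
```

Running with SYSTEMCTL=echo prints each command instead of executing it, which is a cheap way to review the sequence before touching production units.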

Service Logs

# Gateway logs (Rust/Axum)
sudo journalctl -u axiomdb-gateway -n 100 --no-pager

# PostgreSQL logs
sudo tail -f /var/log/postgresql/postgresql-14-main.log

# PgBouncer logs
sudo journalctl -u pgbouncer -n 100 --no-pager

# Nginx access/error logs
sudo tail -f /var/log/nginx/access.log
sudo tail -f /var/log/nginx/error.log

# Redis logs
sudo journalctl -u redis-server -n 50 --no-pager

# Ops Console logs
sudo journalctl -u astradb-ops -n 50 --no-pager

Health Checks

Endpoint Health Checks

# Gateway health endpoint
curl -s http://127.0.0.1:4060/health | jq .

# Expected response:
# {
#   "status": "healthy",
#   "version": "1.2.3",
#   "uptime_seconds": 86400,
#   "postgres": "connected",
#   "redis": "connected"
# }

# Ops Console health
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:3000/api/health

# Nginx health
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:80/health

# External endpoint (through Nginx TLS)
curl -s -o /dev/null -w "%{http_code}" https://your-domain.com/health
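
After a restart, a single curl can fail simply because the service is still starting. One possible approach is to poll the health endpoint a few times before declaring failure. The helper name wait_for_health and its default attempt count and delay are illustrative assumptions, not part of AxiomDB.

```shell
#!/bin/bash
# Sketch: poll a health URL until it returns a 2xx status or attempts run out.
# Usage: wait_for_health <url> [attempts] [delay_seconds]

wait_for_health() {
    local url="$1"
    local attempts="${2:-10}"
    local delay="${3:-2}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl exit non-zero on HTTP errors (4xx/5xx).
        if curl -sf -o /dev/null "$url"; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still unhealthy after $attempts attempt(s): $url" >&2
    return 1
}
```

For example, `wait_for_health http://127.0.0.1:4060/health 15 2` gives the Gateway up to roughly 30 seconds to come back.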

Database Health Checks

-- Basic connectivity
SELECT 1 AS health_check;

-- PostgreSQL version and uptime
SELECT
    version(),
    pg_postmaster_start_time() AS started_at,
    now() - pg_postmaster_start_time() AS uptime;

-- Check for replication (if applicable)
SELECT
    client_addr,
    state,
    sent_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- Database list with sizes
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) AS size,
    numbackends AS connections
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY datname;

PgBouncer Health

# PgBouncer pool status
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW POOLS;"

# PgBouncer version
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW VERSION;"
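
The most actionable number in SHOW POOLS is usually cl_waiting: clients queued because no server connection is free. A small filter can surface only the affected pools. This sketch assumes psql's unaligned output (-At, fields separated by "|") and that cl_waiting is the fourth column (database, user, cl_active, cl_waiting); verify the column order against your PgBouncer version. The function name is hypothetical.

```shell
#!/bin/bash
# Sketch: read SHOW POOLS rows ("|"-separated) on stdin and print pools
# that have clients waiting for a server connection.
# Assumption: column 4 is cl_waiting.

pools_with_waiting_clients() {
    awk -F'|' '$4 + 0 > 0 { printf "%s/%s: %d client(s) waiting\n", $1, $2, $4 }'
}
```

Typical usage would be:

    psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -At -c "SHOW POOLS;" | pools_with_waiting_clients

No output means no pool currently has queued clients.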

Redis Health

# Redis ping
redis-cli ping

# Redis info
redis-cli info server | head -20

# Check queue depth
redis-cli llen axiomdb:commands
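
A raw queue depth is easier to act on when it is mapped to a severity. The sketch below classifies the value returned by llen; the thresholds WARN_DEPTH and CRIT_DEPTH are illustrative assumptions, not values defined by AxiomDB, and should be tuned to normal traffic.

```shell
#!/bin/bash
# Sketch: classify the command queue depth.
# WARN_DEPTH and CRIT_DEPTH are hypothetical thresholds.

WARN_DEPTH="${WARN_DEPTH:-100}"
CRIT_DEPTH="${CRIT_DEPTH:-1000}"

classify_queue_depth() {
    local depth="$1"
    if [ "$depth" -ge "$CRIT_DEPTH" ]; then
        echo "critical"
    elif [ "$depth" -ge "$WARN_DEPTH" ]; then
        echo "warning"
    else
        echo "ok"
    fi
}
```

Example: `classify_queue_depth "$(redis-cli llen axiomdb:commands)"`. A depth that keeps growing across several checks usually matters more than any single reading.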

Automated Health Check Script

#!/bin/bash
# /opt/axiomdb/scripts/health-check.sh

set -euo pipefail

FAILURES=0

check() {
    local name="$1"
    local cmd="$2"
    if eval "$cmd" > /dev/null 2>&1; then
        echo "✓ $name"
    else
        echo "✗ $name"
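        # Not ((FAILURES++)): that returns a non-zero status when FAILURES is 0,
        # which would abort the script under "set -e" on the first failed check.
        FAILURES=$((FAILURES + 1))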
    fi
}

check "Gateway health"    "curl -sf http://127.0.0.1:4060/health"
check "Ops Console"       "curl -sf http://127.0.0.1:3000/api/health"
check "PostgreSQL"        "psql -h 127.0.0.1 -p 5432 -U axiomdb -c 'SELECT 1'"
check "PgBouncer"         "psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c 'SHOW VERSION'"
check "Redis"             "redis-cli ping"
check "Nginx"             "curl -sf http://127.0.0.1:80/health"

if [ "$FAILURES" -gt 0 ]; then
    echo ""
    echo "FAILED: $FAILURES check(s)"
    exit 1
fi

echo ""
echo "All checks passed"
exit 0

Connectivity Tests

End-to-End Connectivity

Test the full request path from client through every layer:

# 1. External TLS connectivity
curl -v https://your-domain.com/health 2>&1 | grep -E "SSL|HTTP|Connected"

# 2. Gateway API (through Nginx proxy)
curl -s https://your-domain.com/api/branches | jq .

# 3. Gateway API (direct, bypassing Nginx)
curl -s http://127.0.0.1:4060/api/branches | jq .

# 4. PostgreSQL (direct connection)
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT current_database(), current_user;"

# 5. PostgreSQL (through PgBouncer)
psql -h 127.0.0.1 -p 6432 -U axiomdb -c "SELECT current_database(), current_user;"

# 6. Redis connectivity
redis-cli ping
redis-cli llen axiomdb:commands

Network Rule Verification

# Check UFW firewall status
sudo ufw status verbose

# Expected rules:
# Status: active
#
# To                         Action      From
# --                         ------      ----
# 22/tcp                     ALLOW IN    Anywhere
# 80/tcp                     ALLOW IN    Anywhere
# 443/tcp                    ALLOW IN    Anywhere
# 4060/tcp                   ALLOW IN    <gateway-ip>
# 5432/tcp                   ALLOW IN    127.0.0.1
# 6432/tcp                   ALLOW IN    127.0.0.1

# Check if a specific port is reachable
nc -zv 127.0.0.1 5432
nc -zv 127.0.0.1 6432
nc -zv 127.0.0.1 4060

# Check Nginx proxy configuration
sudo nginx -t

# Confirm the Nginx upstream (Gateway) responds when queried directly
curl -s http://127.0.0.1:4060/health

DNS and TLS Verification

# Check DNS resolution
dig +short your-domain.com

# Check TLS certificate
echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
    openssl x509 -noout -dates -subject

# Check certificate chain
echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
    grep -E "depth=|verify"

# Check for certificate expiry (warn if < 30 days)
EXPIRY=$(echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
    openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || date -jf "%b %d %T %Y %Z" "$EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
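echo "Certificate expires in $DAYS_LEFT days"
if [ "$DAYS_LEFT" -lt 30 ]; then
    echo "WARNING: certificate expires in under 30 days; schedule a renewal" >&2
fi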

Incident Response

Incident: High Connection Count

Symptoms: Application errors, slow queries, FATAL: too many connections for role

Diagnosis:

-- Check total connections
SELECT count(*) FROM pg_stat_activity;

-- Connections by database and state
SELECT
    datname,
    state,
    count(*) AS count
FROM pg_stat_activity
WHERE datname NOT IN ('template0', 'template1', 'postgres', 'pgbouncer')
GROUP BY datname, state
ORDER BY datname, count DESC;

-- Connections by client IP
SELECT
    client_addr,
    usename,
    datname,
    count(*) AS count
FROM pg_stat_activity
WHERE client_addr IS NOT NULL
GROUP BY client_addr, usename, datname
ORDER BY count DESC;

-- Long-running idle connections
SELECT
    pid,
    usename,
    datname,
    state,
    now() - state_change AS idle_duration,
    query
FROM pg_stat_activity
WHERE state = 'idle'
  AND now() - state_change > interval '10 minutes'
ORDER BY idle_duration DESC;
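
To judge how close the server is to its limit, compare the number of client backends with max_connections. The following is a rough sketch; the function names and the 80% warning threshold (WARN_PCT) are assumptions chosen for illustration.

```shell
#!/bin/bash
# Sketch: report connection usage as a percentage of max_connections.
# WARN_PCT is a hypothetical threshold.
#
# Inputs could be gathered with:
#   used=$(psql -h 127.0.0.1 -p 5432 -U axiomdb -At -c \
#       "SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend';")
#   max=$(psql -h 127.0.0.1 -p 5432 -U axiomdb -At -c "SHOW max_connections;")

WARN_PCT="${WARN_PCT:-80}"

# Integer percentage, rounded down.
conn_usage_pct() {
    local used="$1"
    local max="$2"
    echo $(( $used * 100 / $max ))
}

conn_usage_report() {
    local used="$1"
    local max="$2"
    local pct
    pct=$(conn_usage_pct "$used" "$max")
    if [ "$pct" -ge "$WARN_PCT" ]; then
        echo "WARNING: $used/$max connections in use (${pct}%)"
    else
        echo "ok: $used/$max connections in use (${pct}%)"
    fi
}
```

Filtering on backend_type = 'client backend' keeps background workers out of the count, since they do not consume max_connections slots.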

Resolution:

# Option 1: Kill idle connections older than 30 minutes
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle'
      AND now() - state_change > interval '30 minutes'
      AND datname NOT IN ('template0', 'template1', 'postgres', 'pgbouncer');
"

# Option 2: Restart PgBouncer to reset all pooled connections
sudo systemctl restart pgbouncer

# Option 3: Increase max_connections (requires restart)
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "ALTER SYSTEM SET max_connections = 300;"
sudo systemctl restart postgresql

Incident: Migration Failures

Symptoms: Branch stuck in provisioning, _prisma_migrations shows incomplete entries

Diagnosis:

-- Find stuck migrations
SELECT
    migration_name,
    started_at,
    finished_at,
    now() - started_at AS duration
FROM _prisma_migrations
WHERE finished_at IS NULL
ORDER BY started_at DESC;

-- Check for lock requests that are waiting (not yet granted)
SELECT
    l.pid,
    l.locktype,
    l.mode,
    l.granted,
    a.usename,
    a.query,
    a.state
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted
ORDER BY l.pid;

-- Check for active DDL statements
SELECT
    pid,
    usename,
    query,
    state,
    now() - query_start AS duration
FROM pg_stat_activity
WHERE query ~* '^\s*(ALTER|CREATE|DROP)\s'
  AND state = 'active';

Resolution:

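# Step 1: Identify the blocking PID(s)
# Rows in pg_locks with granted = false are the waiters, so resolve each
# waiter to the sessions that block it with pg_blocking_pids().
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
    SELECT pid, query, state FROM pg_stat_activity
    WHERE pid IN (
        SELECT unnest(pg_blocking_pids(l.pid))
        FROM pg_locks l
        WHERE NOT l.granted
    );
"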

# Step 2: Terminate the blocking process
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_terminate_backend(<PID>);"

# Step 3: If the migration itself is stuck, mark it as failed
psql -h 127.0.0.1 -p 5432 -U axiomdb -d <branch_db> -c "
    UPDATE _prisma_migrations
    SET finished_at = now(),
        logs = 'Manually terminated due to stuck migration'
    WHERE migration_name = '<migration_name>'
      AND finished_at IS NULL;
"

# Step 4: Retry the migration via Gateway
curl -X POST http://127.0.0.1:4060/api/branches/<branch_id>/migrations/run \
  -H "Content-Type: application/json" \
  -d '{"migration": "<migration_name>"}'

Never Force-Kill PostgreSQL Directly

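Always use pg_terminate_backend() rather than kill -9. Sending SIGKILL to any PostgreSQL backend makes the postmaster terminate every other session and run crash recovery, so one stuck query becomes an outage for every connected client.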
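
A gentler first step than terminating a session is pg_cancel_backend(), which cancels only the running query and leaves the connection open. The helper below is a sketch with a hypothetical name. It only builds the SQL and validates the PID, so the statement can be reviewed before it is passed to psql.

```shell
#!/bin/bash
# Sketch: build the SQL to stop a backend, defaulting to the gentler cancel.
# Usage: stop_backend_sql <pid> [cancel|terminate]

stop_backend_sql() {
    local pid="$1"
    local mode="${2:-cancel}"
    # Reject anything that is not a plain positive integer.
    case "$pid" in
        ''|*[!0-9]*)
            echo "pid must be numeric: $pid" >&2
            return 1
            ;;
    esac
    case "$mode" in
        cancel)    echo "SELECT pg_cancel_backend($pid);" ;;
        terminate) echo "SELECT pg_terminate_backend($pid);" ;;
        *)
            echo "mode must be cancel or terminate: $mode" >&2
            return 1
            ;;
    esac
}
```

Example: `psql -h 127.0.0.1 -p 5432 -U axiomdb -c "$(stop_backend_sql 12345 cancel)"`, escalating to `terminate` only if the query does not stop.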

Incident: Network Rule Failures

Symptoms: Branch connections refused, pg_hba.conf errors, firewall blocking legitimate traffic

Diagnosis:

# Check pg_hba.conf for the branch's network rules
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
    SELECT type, database, user_name, address, netmask, auth_method
    FROM pg_hba_file_rules
    ORDER BY type, database;
"

# Check UFW rules
sudo ufw status numbered

# Test connectivity from the application's perspective
nc -zv <gateway-ip> 4060

# Check Nginx upstream errors
sudo tail -20 /var/log/nginx/error.log | grep -E "upstream|connect"

Resolution:

# Step 1: Verify and reload pg_hba.conf
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_reload_conf();"

# Step 2: Add missing network grant via AxiomDB
curl -X POST http://127.0.0.1:4060/api/branches/<branch_id>/network-grants \
  -H "Content-Type: application/json" \
  -d '{"cidr": "10.0.0.0/8", "description": "internal network"}'

# Step 3: If UFW is blocking, add the rule
sudo ufw allow from 10.0.0.0/8 to any port 5432
sudo ufw allow from 10.0.0.0/8 to any port 6432

# Step 4: Reload PostgreSQL to pick up pg_hba changes
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_reload_conf();"

Incident: Backup Failures

Symptoms: pgBackRest errors, WAL archiving lag, backup age > 24 hours

Diagnosis:

# Check pgBackRest status
pgbackrest --stanza=axiomdb info

# Check WAL archiving
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
    SELECT
        archived_count,
        failed_count,
        last_archived_wal,
        last_failed_wal,
        last_failed_time
    FROM pg_stat_archiver;
"

# Check archive_command in postgresql.conf
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SHOW archive_command;"

# Check pgBackRest logs
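# (run the glob inside a root shell in case the log directory is not readable by your user)
sudo sh -c 'tail -n 50 /var/log/pgbackrest/*.log'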
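
The "backup age > 24 hours" symptom can be checked mechanically once the stop time of the latest backup is known as a Unix timestamp. The sketch below does only the age arithmetic; the function name and the MAX_AGE_HOURS variable are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: flag a backup as stale when it is older than MAX_AGE_HOURS.
# Usage: backup_age_status <stop_epoch> [now_epoch]

MAX_AGE_HOURS="${MAX_AGE_HOURS:-24}"

backup_age_status() {
    local stop_epoch="$1"
    local now_epoch="${2:-$(date +%s)}"
    local age_hours=$(( ($now_epoch - $stop_epoch) / 3600 ))
    if [ "$age_hours" -gt "$MAX_AGE_HOURS" ]; then
        echo "stale: last backup finished ${age_hours}h ago"
        return 1
    fi
    echo "fresh: last backup finished ${age_hours}h ago"
}
```

One way to obtain the timestamp, assuming jq is installed and that the JSON layout of `pgbackrest --stanza=axiomdb --output=json info` exposes it as `.[0].backup[-1].timestamp.stop`, is to pipe that command through jq and pass the result to backup_age_status. Check the layout against your pgBackRest version before relying on it.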

Resolution:

# If archive_command is failing
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SHOW archive_mode;"

# Test archive manually
pgbackrest --stanza=axiomdb archive-push pg_wal/000000010000000000000001

# If stanza needs repair
pgbackrest --stanza=axiomdb stanza-upgrade

# Retry the backup
pgbackrest --stanza=axiomdb --type=full backup

Incident: Disk Space Emergency

Symptoms: No space left on device, PostgreSQL refusing writes, WAL accumulation

Immediate Actions:

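# 1. Identify largest files
# (the glob must expand inside a root shell: these directories are normally
#  not readable by your login user, so "sudo du .../*" would see a literal *)
sudo sh -c 'du -sh /var/lib/postgresql/14/main/* | sort -rh | head -10'
sudo sh -c 'du -sh /var/lib/pgbackrest/* | sort -rh | head -10'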
sudo du -sh /var/log/* | sort -rh | head -10

# 2. Clean old logs
sudo journalctl --vacuum-time=3d
sudo find /var/log -name "*.gz" -mtime +7 -delete

# 3. Clean old pgBackRest backups (keep last 3 full)
pgbackrest --stanza=axiomdb --repo1-retention-full=3 expire

# 4. Remove stale backup history files (*.backup) from pg_wal; never delete WAL segments by hand
sudo find /var/lib/postgresql/14/main/pg_wal -name "*.backup" -mtime +7 -delete

# 5. Check for large temp files
sudo find /tmp -size +100M -type f -ls
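
Disk emergencies are easier to avoid than to fix, so it may help to check usage against a limit before it reaches 100%. The following sketch reads the POSIX-format output of df; the function names and the 85% default limit are illustrative assumptions.

```shell
#!/bin/bash
# Sketch: alert when the filesystem holding a path exceeds a usage limit.
# Usage: check_disk <path> [limit_percent]

# Print the used percentage (digits only) for the filesystem containing $1.
disk_used_pct() {
    # df -P guarantees one line per filesystem; column 5 is e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_disk() {
    local path="$1"
    local limit="${2:-85}"
    local pct
    pct=$(disk_used_pct "$path")
    if [ "$pct" -ge "$limit" ]; then
        echo "ALERT: $path is ${pct}% full (limit ${limit}%)"
        return 1
    fi
    echo "ok: $path is ${pct}% full (limit ${limit}%)"
}
```

Example: `check_disk /var/lib/postgresql 85`. Because the function returns non-zero on an alert, it can be dropped into the automated health check as an additional check.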

Maintenance Windows

Pre-Maintenance Checklist

□ Schedule maintenance window (low traffic period)
□ Notify users via status page
□ Create a pgBackRest snapshot
□ Verify current backup is fresh: pgbackrest --stanza=axiomdb info
□ Document current service states
□ Prepare rollback plan
□ Ensure SSH access is stable
□ Check disk space is sufficient for any operations
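
The "Document current service states" item can be captured in a file so the post-maintenance state can be diffed against it. This is a minimal sketch: the function name, the SERVICES variable, and the SYSTEMCTL override (useful for testing) are assumptions, and the unit names come from the Service Inventory.

```shell
#!/bin/bash
# Sketch: write "<unit> <state>" lines for every service to a file.
# Usage: snapshot_service_states <output_file>
# SYSTEMCTL is a hypothetical override for testing.

SERVICES="postgresql pgbouncer redis-server axiomdb-gateway astradb-ops nginx"

snapshot_service_states() {
    local out="$1"
    local systemctl_cmd="${SYSTEMCTL:-systemctl}"
    local svc state
    : > "$out"
    for svc in $SERVICES; do
        # is-active exits non-zero for stopped units but still prints the state.
        state=$($systemctl_cmd is-active "$svc" 2>/dev/null) || true
        echo "$svc ${state:-unknown}" >> "$out"
    done
}
```

Taking one snapshot before the window and another afterwards, then running diff on the two files, shows at a glance whether any unit failed to come back.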

PostgreSQL Minor Version Upgrade

# 1. Create backup
pgbackrest --stanza=axiomdb --type=full backup

# 2. Stop services (in order)
sudo systemctl stop astradb-ops
sudo systemctl stop axiomdb-gateway
sudo systemctl stop pgbouncer
sudo systemctl stop postgresql

# 3. Upgrade packages
sudo apt update
sudo apt install postgresql-14

# 4. Start PostgreSQL
sudo systemctl start postgresql

# 5. Verify
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT version();"

# 6. Start remaining services
sudo systemctl start pgbouncer
sudo systemctl start axiomdb-gateway
sudo systemctl start astradb-ops

# 7. Run health checks
/opt/axiomdb/scripts/health-check.sh

Post-Maintenance Verification

# Run full health check
/opt/axiomdb/scripts/health-check.sh

# Verify all branches are accessible
curl -s http://127.0.0.1:4060/api/branches | jq '.branches[] | {name, status}'

# Check for errors in logs
sudo journalctl -u axiomdb-gateway --since "15 minutes ago" | grep -i "error\|fatal\|panic"
sudo journalctl -u postgresql --since "15 minutes ago" | grep -i "error\|fatal"
sudo journalctl -u pgbouncer --since "15 minutes ago" | grep -i "error\|fatal"

Quick Reference

Common Commands

# Service status
systemctl status axiomdb-gateway astradb-ops postgresql pgbouncer nginx redis-server

# Gateway API calls
curl -s http://127.0.0.1:4060/health | jq .
curl -s http://127.0.0.1:4060/api/branches | jq .
curl -s http://127.0.0.1:4060/api/branches/<id> | jq .

# Database operations
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT * FROM pg_stat_activity;"
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_database_size(datname), datname FROM pg_database;"

# PgBouncer admin
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW POOLS;"
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW STATS;"

# Backup operations
pgbackrest --stanza=axiomdb info
pgbackrest --stanza=axiomdb --type=full backup
pgbackrest --stanza=axiomdb verify

# Firewall
sudo ufw status
sudo ufw allow from 10.0.0.0/8 to any port 5432

# Logs
journalctl -u axiomdb-gateway -f
tail -f /var/log/postgresql/postgresql-14-main.log
tail -f /var/log/nginx/error.log

# square-dbctl provisioning
square-dbctl provision --branch <name> --database <db>
square-dbctl status --branch <name>

Emergency Contacts

Role              Contact          When to Escalate
On-call engineer  Slack #oncall    Any production incident
DBA               Slack #dba-team  Data corruption, performance degradation
Infrastructure    Slack #infra     VPS issues, network outages
Security          Slack #security  Credential compromise, unauthorized access

Appendix: Port Reference

Port  Service                      Binding    Access
22    SSH                          0.0.0.0    Restricted by IP
80    Nginx (HTTP→HTTPS redirect)  0.0.0.0    Public
443   Nginx (TLS)                  0.0.0.0    Public
3000  Ops Console                  127.0.0.1  Internal (proxied via Nginx)
4060  AxiomDB Gateway              127.0.0.1  Internal (proxied via Nginx)
5432  PostgreSQL                   127.0.0.1  Internal only
6379  Redis                        127.0.0.1  Internal only
6432  PgBouncer                    127.0.0.1  Internal only
