Operational procedures for maintaining, troubleshooting, and responding to incidents on AxiomDB infrastructure

Production Runbook

This runbook covers day-to-day operations, health checks, connectivity tests, and incident response procedures for the AxiomDB control plane.

Service Management

Starting, stopping, and checking status of all AxiomDB services

Health Checks

Automated and manual health verification procedures

Connectivity Tests

Verifying end-to-end connectivity through every layer

Incident Response

Playbooks for common production incidents

Maintenance

Planned maintenance procedures and checklists

Runbook Commands

Copy-paste commands for common operations

Service Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         Internet                                  │
│                            │                                      │
│                            ▼                                      │
│                     ┌─────────────┐                               │
│                     │   Nginx     │  TLS termination              │
│                     │   :443      │  Reverse proxy                │
│                     └──────┬──────┘                               │
│                            │                                      │
│              ┌─────────────┼─────────────┐                       │
│              ▼                           ▼                       │
│     ┌─────────────────┐       ┌─────────────────┐               │
│     │  AxiomDB        │       │  Ops Console    │               │
│     │  Gateway        │       │  (Next.js)      │               │
│     │  :4060          │       │  :3000          │               │
│     └────────┬────────┘       └─────────────────┘               │
│              │                                                   │
│              ▼                                                   │
│     ┌─────────────────┐       ┌─────────────────┐               │
│     │  Redis          │       │  PostgreSQL     │               │
│     │  :6379          │       │  :5432          │               │
│     │  (command queue)│       │  (databases)    │               │
│     └─────────────────┘       └────────┬────────┘               │
│                                        │                         │
│                                        ▼                         │
│                               ┌─────────────────┐               │
│                               │  PgBouncer      │               │
│                               │  :6432          │               │
│                               │  (session mode) │               │
│                               └─────────────────┘               │
│                                                                  │
│     ┌─────────────────────────────────────────────────┐         │
│     │  UFW Firewall Rules                             │         │
│     │  22/tcp (SSH), 80/tcp, 443/tcp, 4060/tcp        │         │
│     │  5432/tcp (internal only), 6432/tcp (internal)  │         │
│     └─────────────────────────────────────────────────┘         │
└──────────────────────────────────────────────────────────────────┘

Service Management

Service Inventory

Service	Unit Name	Port	Purpose
AxiomDB Gateway	`axiomdb-gateway`	4060	REST API, branch provisioning, command queue
Ops Console	`astradb-ops`	3000	Web dashboard for management
PostgreSQL	`postgresql`	5432	Database engine
PgBouncer	`pgbouncer`	6432	Connection pooler
Nginx	`nginx`	443/80	TLS proxy, static assets
Redis	`redis-server`	6379	Command queue, caching

Check All Services

#!/bin/bash
# Quick health check for all services

services=(
    "axiomdb-gateway"
    "astradb-ops"
    "postgresql"
    "pgbouncer"
    "nginx"
    "redis-server"
)

printf "%-25s %-10s %-10s\n" "SERVICE" "ACTIVE" "ENABLED"
printf "%-25s %-10s %-10s\n" "-------" "------" "-------"

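for svc in "${services[@]}"; do
    # is-active/is-enabled exit non-zero for inactive or disabled units but still
    # print the state, so fall back to "not-found" only when nothing is printed.
    active=$(systemctl is-active "$svc" 2>/dev/null) || true
    enabled=$(systemctl is-enabled "$svc" 2>/dev/null) || true
    printf "%-25s %-10s %-10s\n" "$svc" "${active:-not-found}" "${enabled:-not-found}"
done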

Expected output:

SERVICE                   ACTIVE     ENABLED
-------                   ------     -------
axiomdb-gateway           active     enabled
astradb-ops               active     enabled
postgresql                active     enabled
pgbouncer                 active     enabled
nginx                     active     enabled
redis-server              active     enabled

Start / Stop / Restart Services

# Start a service
sudo systemctl start axiomdb-gateway

# Stop a service
sudo systemctl stop axiomdb-gateway

# Restart a service
sudo systemctl restart axiomdb-gateway

# Reload configuration (no downtime)
sudo systemctl reload nginx
sudo systemctl reload pgbouncer

# View service logs
sudo journalctl -u axiomdb-gateway -f --since "10 minutes ago"
sudo journalctl -u postgresql -f --since "1 hour ago"

Restart Order

When restarting multiple services, follow this order: (1) PostgreSQL, (2) PgBouncer, (3) Redis, (4) AxiomDB Gateway, (5) Ops Console, (6) Nginx. Reverse order for shutdowns.
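
The ordering above can be captured in a small helper so that the stop and start sequences never drift apart. This is a minimal sketch, not part of the AxiomDB tooling: the names `stop_order`, `restart_all`, and `DRY_RUN` are assumptions made for illustration, and the unit names are taken from the Service Inventory table.

```shell
#!/bin/bash
# Sketch only: START_ORDER mirrors the restart order given above.
START_ORDER="postgresql pgbouncer redis-server axiomdb-gateway astradb-ops nginx"

# Print the shutdown order (START_ORDER reversed), one unit per line.
stop_order() {
    local svc reversed=""
    for svc in $START_ORDER; do
        reversed="$svc $reversed"
    done
    # Unquoted on purpose: split the list back into one word per unit.
    printf '%s\n' $reversed
}

# Stop every unit in reverse order, then start them in forward order.
# With DRY_RUN=1 the systemctl commands are printed instead of executed.
restart_all() {
    local svc run="sudo systemctl"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo systemctl"
    fi
    for svc in $(stop_order); do
        $run stop "$svc"
    done
    for svc in $START_ORDER; do
        $run start "$svc"
    done
}
```

Running `DRY_RUN=1 restart_all` first is a cheap way to review the exact sequence before touching production.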

Service Logs

# Gateway logs (Rust/Axum)
sudo journalctl -u axiomdb-gateway -n 100 --no-pager

# PostgreSQL logs
sudo tail -f /var/log/postgresql/postgresql-14-main.log

# PgBouncer logs
sudo journalctl -u pgbouncer -n 100 --no-pager

# Nginx access/error logs
sudo tail -f /var/log/nginx/access.log
sudo tail -f /var/log/nginx/error.log

# Redis logs
sudo journalctl -u redis-server -n 50 --no-pager

# Ops Console logs
sudo journalctl -u astradb-ops -n 50 --no-pager

Health Checks

Endpoint Health Checks

# Gateway health endpoint
curl -s http://127.0.0.1:4060/health | jq .

# Expected response:
# {
#   "status": "healthy",
#   "version": "1.2.3",
#   "uptime_seconds": 86400,
#   "postgres": "connected",
#   "redis": "connected"
# }

# Ops Console health
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:3000/api/health

# Nginx health
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:80/health

# External endpoint (through Nginx TLS)
curl -s -o /dev/null -w "%{http_code}" https://your-domain.com/health

Database Health Checks

-- Basic connectivity
SELECT 1 AS health_check;

-- PostgreSQL version and uptime
SELECT
    version(),
    pg_postmaster_start_time() AS started_at,
    now() - pg_postmaster_start_time() AS uptime;

-- Check for replication (if applicable)
SELECT
    client_addr,
    state,
    sent_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- Database list with sizes
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) AS size,
    numbackends AS connections
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY datname;

PgBouncer Health

# PgBouncer pool status
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW POOLS;"

# PgBouncer version
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW VERSION;"

Redis Health

# Redis ping
redis-cli ping

# Redis info
redis-cli info server | head -20

# Check queue depth
redis-cli llen axiomdb:commands
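
A raw queue depth is easier to act on when it is mapped to a status. The sketch below shows one way to do that; the function name `classify_queue_depth` and both thresholds are illustrative assumptions, not AxiomDB defaults, and should be tuned to the normal backlog of your deployment.

```shell
#!/bin/bash
# Sketch only: thresholds are illustrative assumptions.
WARN_DEPTH=${WARN_DEPTH:-100}
CRIT_DEPTH=${CRIT_DEPTH:-1000}

# Classify a queue depth as OK, WARN, or CRIT.
classify_queue_depth() {
    local depth="$1"
    if [ "$depth" -ge "$CRIT_DEPTH" ]; then
        echo "CRIT"
    elif [ "$depth" -ge "$WARN_DEPTH" ]; then
        echo "WARN"
    else
        echo "OK"
    fi
}

# Example wiring (requires a reachable Redis):
#   depth=$(redis-cli llen axiomdb:commands)
#   echo "queue depth $depth: $(classify_queue_depth "$depth")"
```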

Automated Health Check Script

#!/bin/bash
# /opt/axiomdb/scripts/health-check.sh

set -euo pipefail

FAILURES=0

check() {
    local name="$1"
    local cmd="$2"
    if eval "$cmd" > /dev/null 2>&1; then
        echo "✓ $name"
    else
        echo "✗ $name"
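        # Not ((FAILURES++)): that returns status 1 when FAILURES is 0 and aborts the script under set -e
        FAILURES=$((FAILURES + 1))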
    fi
}

check "Gateway health"    "curl -sf http://127.0.0.1:4060/health"
check "Ops Console"       "curl -sf http://127.0.0.1:3000/api/health"
check "PostgreSQL"        "psql -h 127.0.0.1 -p 5432 -U axiomdb -c 'SELECT 1'"
check "PgBouncer"         "psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c 'SHOW VERSION'"
check "Redis"             "redis-cli ping"
check "Nginx"             "curl -sf http://127.0.0.1:80/health"

if [ "$FAILURES" -gt 0 ]; then
    echo ""
    echo "FAILED: $FAILURES check(s)"
    exit 1
fi

echo ""
echo "All checks passed"
exit 0
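
Immediately after a restart, a service may need a few seconds before its health endpoint answers, so a single probe can report a false failure. One possible mitigation is a retry wrapper around each probe. The `retry` function below is a hypothetical helper written for this example, not part of the script above.

```shell
#!/bin/bash
# Sketch only: retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempt budget is exhausted.
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Example: allow the Gateway up to 5 attempts, 2 seconds apart.
#   retry 5 2 curl -sf http://127.0.0.1:4060/health
```

The command is passed as separate arguments rather than as a string, which avoids the quoting pitfalls of `eval`.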

Connectivity Tests

End-to-End Connectivity

Test the full request path from client through every layer:

# 1. External TLS connectivity
curl -v https://your-domain.com/health 2>&1 | grep -E "SSL|HTTP|Connected"

# 2. Gateway API (through Nginx proxy)
curl -s https://your-domain.com/api/branches | jq .

# 3. Gateway API (direct, bypassing Nginx)
curl -s http://127.0.0.1:4060/api/branches | jq .

# 4. PostgreSQL (direct connection)
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT current_database(), current_user;"

# 5. PostgreSQL (through PgBouncer)
psql -h 127.0.0.1 -p 6432 -U axiomdb -c "SELECT current_database(), current_user;"

# 6. Redis connectivity
redis-cli ping
redis-cli llen axiomdb:commands

Network Rule Verification

# Check UFW firewall status
sudo ufw status verbose

# Expected rules:
# Status: active
#
# To                         Action      From
# --                         ------      ----
# 22/tcp                     ALLOW IN    Anywhere
# 80/tcp                     ALLOW IN    Anywhere
# 443/tcp                    ALLOW IN    Anywhere
# 4060/tcp                   ALLOW IN    <gateway-ip>
# 5432/tcp                   ALLOW IN    127.0.0.1
# 6432/tcp                   ALLOW IN    127.0.0.1

# Check if a specific port is reachable
nc -zv 127.0.0.1 5432
nc -zv 127.0.0.1 6432
nc -zv 127.0.0.1 4060

# Check Nginx proxy configuration
sudo nginx -t

# Confirm the Gateway upstream that Nginx proxies to is responding
curl -s http://127.0.0.1:4060/health

DNS and TLS Verification

# Check DNS resolution
dig +short your-domain.com

# Check TLS certificate
echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
    openssl x509 -noout -dates -subject

# Check certificate chain
echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
    grep -E "depth=|verify"

# Check for certificate expiry (warn if < 30 days)
EXPIRY=$(echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
    openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || date -jf "%b %d %T %Y %Z" "$EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
echo "Certificate expires in $DAYS_LEFT days"
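
The snippet above prints the remaining days but leaves the 30-day warning to the reader. A minimal sketch of turning that number into a status line and an exit code follows; the function name `cert_status` and the exit-code convention (0 OK, 1 warning, 2 expired) are assumptions made for this example.

```shell
#!/bin/bash
# Sketch only: WARN_DAYS=30 follows the "warn if < 30 days" note above.
WARN_DAYS=${WARN_DAYS:-30}

# Print a status line for the given days-remaining value.
# Returns 0 when healthy, 1 inside the warning window, 2 when expired.
cert_status() {
    local days_left="$1"
    if [ "$days_left" -lt 0 ]; then
        echo "CRITICAL: certificate expired $(( -days_left )) day(s) ago"
        return 2
    elif [ "$days_left" -lt "$WARN_DAYS" ]; then
        echo "WARNING: certificate expires in $days_left day(s)"
        return 1
    fi
    echo "OK: certificate expires in $days_left day(s)"
    return 0
}

# Example: cert_status "$DAYS_LEFT"
```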

Incident Response

Incident: High Connection Count

Symptoms: Application errors, slow queries, FATAL: too many connections for role

Diagnosis:

-- Check total connections
SELECT count(*) FROM pg_stat_activity;

-- Connections by database and state
SELECT
    datname,
    state,
    count(*) AS count
FROM pg_stat_activity
WHERE datname NOT IN ('template0', 'template1', 'postgres', 'pgbouncer')
GROUP BY datname, state
ORDER BY datname, count DESC;

-- Connections by client IP
SELECT
    client_addr,
    usename,
    datname,
    count(*) AS count
FROM pg_stat_activity
WHERE client_addr IS NOT NULL
GROUP BY client_addr, usename, datname
ORDER BY count DESC;

-- Long-running idle connections
SELECT
    pid,
    usename,
    datname,
    state,
    now() - state_change AS idle_duration,
    query
FROM pg_stat_activity
WHERE state = 'idle'
  AND now() - state_change > interval '10 minutes'
ORDER BY idle_duration DESC;

Resolution:

# Option 1: Kill idle connections older than 30 minutes
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle'
      AND now() - state_change > interval '30 minutes'
      AND datname NOT IN ('template0', 'template1', 'postgres', 'pgbouncer');
"

# Option 2: Restart PgBouncer to reset all pooled connections
sudo systemctl restart pgbouncer

# Option 3: Increase max_connections (requires restart)
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "ALTER SYSTEM SET max_connections = 300;"
sudo systemctl restart postgresql
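
Before choosing between these options, it helps to know how close the server is to its limit. The sketch below computes the share of `max_connections` in use; `conn_usage_pct` is a hypothetical helper. The figure is approximate, because `pg_stat_activity` also lists background processes.

```shell
#!/bin/bash
# Sketch only: integer percentage of max_connections currently in use.
conn_usage_pct() {
    local current="$1" max="$2"
    if [ "$max" -le 0 ]; then
        echo "max must be positive" >&2
        return 1
    fi
    echo $(( current * 100 / max ))
}

# Example wiring (requires database access):
#   current=$(psql -h 127.0.0.1 -p 5432 -U axiomdb -Atc "SELECT count(*) FROM pg_stat_activity;")
#   max=$(psql -h 127.0.0.1 -p 5432 -U axiomdb -Atc "SHOW max_connections;")
#   echo "connections: $(conn_usage_pct "$current" "$max")% of max_connections"
```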

Incident: Migration Failures

Symptoms: Branch stuck in provisioning, _prisma_migrations shows incomplete entries

Diagnosis:

-- Find stuck migrations
SELECT
    migration_name,
    started_at,
    finished_at,
    now() - started_at AS duration
FROM _prisma_migrations
WHERE finished_at IS NULL
ORDER BY started_at DESC;

-- Check for lock requests that are still waiting (not yet granted)
SELECT
    l.pid,
    l.locktype,
    l.mode,
    l.granted,
    a.usename,
    a.query,
    a.state
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted
ORDER BY l.pid;

-- Check for active DDL statements
SELECT
    pid,
    usename,
    query,
    state,
    now() - query_start AS duration
FROM pg_stat_activity
WHERE query ~* '\m(ALTER|CREATE|DROP)\M'
  AND state = 'active';

Resolution:

# Step 1: Identify the blocking PIDs (blocked_by) for each waiting session
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
    SELECT pid, pg_blocking_pids(pid) AS blocked_by, query, state
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;
"

# Step 2: Terminate the blocking process
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_terminate_backend(<PID>);"

# Step 3: If the migration itself is stuck, mark it as failed
psql -h 127.0.0.1 -p 5432 -U axiomdb -d <branch_db> -c "
    UPDATE _prisma_migrations
    SET finished_at = now(),
        logs = 'Manually terminated due to stuck migration'
    WHERE migration_name = '<migration_name>'
      AND finished_at IS NULL;
"

# Step 4: Retry the migration via Gateway
curl -X POST http://127.0.0.1:4060/api/branches/<branch_id>/migrations/run \
  -H "Content-Type: application/json" \
  -d '{"migration": "<migration_name>"}'

Never Force-Kill PostgreSQL Directly

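Always use pg_terminate_backend() rather than kill -9. When a backend is killed with SIGKILL, the postmaster terminates every other session and runs crash recovery. Force-killing the postmaster itself can leave orphaned backends and, in the worst case, corrupt data.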

Incident: Network Rule Failures

Symptoms: Branch connections refused, pg_hba.conf errors, firewall blocking legitimate traffic

Diagnosis:

# Check pg_hba.conf for the branch's network rules
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
    SELECT type, database, user_name, address, netmask, auth_method
    FROM pg_hba_file_rules
    ORDER BY type, database;
"

# Check UFW rules
sudo ufw status numbered

# Test connectivity from the application's perspective
nc -zv <gateway-ip> 4060

# Check Nginx upstream errors
sudo tail -20 /var/log/nginx/error.log | grep -E "upstream|connect"

Resolution:

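# Step 1: Verify that pg_hba.conf parses cleanly (any row returned is a broken line), then reload
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
    SELECT line_number, error FROM pg_hba_file_rules WHERE error IS NOT NULL;
"
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_reload_conf();"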

# Step 2: Add missing network grant via AxiomDB
curl -X POST http://127.0.0.1:4060/api/branches/<branch_id>/network-grants \
  -H "Content-Type: application/json" \
  -d '{"cidr": "10.0.0.0/8", "description": "internal network"}'

# Step 3: If UFW is blocking, add the rule
sudo ufw allow from 10.0.0.0/8 to any port 5432
sudo ufw allow from 10.0.0.0/8 to any port 6432

# Step 4: Reload PostgreSQL to pick up pg_hba changes
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_reload_conf();"

Incident: Backup Failures

Symptoms: pgBackRest errors, WAL archiving lag, backup age > 24 hours

Diagnosis:

# Check pgBackRest status
pgbackrest --stanza=axiomdb info

# Check WAL archiving
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "
    SELECT
        archived_count,
        failed_count,
        last_archived_wal,
        last_failed_wal,
        last_failed_time
    FROM pg_stat_archiver;
"

# Check archive_command in postgresql.conf
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SHOW archive_command;"

# Check pgBackRest logs
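# The glob runs inside the root shell because the log directory may not be readable by your user
sudo sh -c 'tail -n 50 /var/log/pgbackrest/*.log'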

Resolution:

# If archive_command is failing, first confirm that archiving is enabled (archive_mode must be on)
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SHOW archive_mode;"

# Test archive manually
pgbackrest --stanza=axiomdb archive-push pg_wal/000000010000000000000001

# If the stanza no longer matches the cluster (for example after a PostgreSQL major-version upgrade)
pgbackrest --stanza=axiomdb stanza-upgrade

# Retry the backup
pgbackrest --stanza=axiomdb --type=full backup

Incident: Disk Space Emergency

Symptoms: No space left on device, PostgreSQL refusing writes, WAL accumulation

Immediate Actions:

# 1. Identify largest files
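# The globs run inside the root shell because these directories are not readable by ordinary users
sudo sh -c 'du -sh /var/lib/postgresql/14/main/* | sort -rh | head -10'
sudo sh -c 'du -sh /var/lib/pgbackrest/* | sort -rh | head -10'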
sudo du -sh /var/log/* | sort -rh | head -10

# 2. Clean old logs
sudo journalctl --vacuum-time=3d
sudo find /var/log -name "*.gz" -mtime +7 -delete

# 3. Clean old pgBackRest backups (keep last 3 full)
pgbackrest --stanza=axiomdb --repo1-retention-full=3 expire

# 4. Remove stale backup history files (*.backup) from pg_wal; never delete WAL segment files by hand
sudo find /var/lib/postgresql/14/main/pg_wal -name "*.backup" -mtime +7 -delete

# 5. Check for large temp files
sudo find /tmp -size +100M -type f -ls
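
Disk emergencies are easier to avoid than to fix, so a threshold check that runs on a schedule can complement the manual steps above. This is a sketch under stated assumptions: `disk_used_pct` and `disk_ok` are hypothetical helpers, and the default limit of 85 percent is illustrative.

```shell
#!/bin/bash
# Sketch only: report the used percentage of the filesystem holding a path.
# df -P selects the POSIX output format so the column layout is stable.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when usage is below the limit, 1 otherwise.
# The default limit of 85 is an illustrative assumption.
disk_ok() {
    local path="$1" limit="${2:-85}"
    local used
    used=$(disk_used_pct "$path")
    [ "$used" -lt "$limit" ]
}

# Example:
#   disk_ok /var/lib/postgresql || echo "PostgreSQL volume is above the limit"
```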

Maintenance Windows

Pre-Maintenance Checklist

□ Schedule maintenance window (low traffic period)
□ Notify users via status page
□ Create a pgBackRest snapshot
□ Verify current backup is fresh: pgbackrest --stanza=axiomdb info
□ Document current service states
□ Prepare rollback plan
□ Ensure SSH access is stable
□ Check disk space is sufficient for any operations
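
The "backup is fresh" item can be checked mechanically instead of by reading the `info` output. The sketch below compares two epoch timestamps; `backup_is_fresh` is a hypothetical helper, and the 24-hour default mirrors the backup-age symptom listed under Backup Failures. The commented wiring assumes that `jq` is installed and that the JSON layout of `pgbackrest info --output=json` matches your pgBackRest version.

```shell
#!/bin/bash
# Sketch only: succeed when the newest backup is younger than max_age seconds.
backup_is_fresh() {
    local last_backup_epoch="$1" now_epoch="$2" max_age="${3:-86400}"
    local age=$(( now_epoch - last_backup_epoch ))
    # A negative age means the timestamp is in the future; treat that as stale.
    [ "$age" -ge 0 ] && [ "$age" -lt "$max_age" ]
}

# Example wiring (assumption: the last backup's stop time is at this JSON path):
#   last=$(pgbackrest --stanza=axiomdb --output=json info | jq '.[0].backup[-1].timestamp.stop')
#   backup_is_fresh "$last" "$(date +%s)" || echo "backup is stale"
```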

PostgreSQL Minor Version Upgrade

# 1. Create backup
pgbackrest --stanza=axiomdb --type=full backup

# 2. Stop services (in order)
sudo systemctl stop astradb-ops
sudo systemctl stop axiomdb-gateway
sudo systemctl stop pgbouncer
sudo systemctl stop postgresql

# 3. Upgrade packages
sudo apt update
sudo apt install postgresql-14

# 4. Start PostgreSQL
sudo systemctl start postgresql

# 5. Verify
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT version();"

# 6. Start remaining services
sudo systemctl start pgbouncer
sudo systemctl start axiomdb-gateway
sudo systemctl start astradb-ops

# 7. Run health checks
/opt/axiomdb/scripts/health-check.sh

Post-Maintenance Verification

# Run full health check
/opt/axiomdb/scripts/health-check.sh

# Verify all branches are accessible
curl -s http://127.0.0.1:4060/api/branches | jq '.branches[] | {name, status}'

# Check for errors in logs
sudo journalctl -u axiomdb-gateway --since "15 minutes ago" | grep -i "error\|fatal\|panic"
sudo journalctl -u postgresql --since "15 minutes ago" | grep -i "error\|fatal"
sudo journalctl -u pgbouncer --since "15 minutes ago" | grep -i "error\|fatal"
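
The grep commands above print matching lines for a human to read. For an automated post-maintenance gate, counting the matches is often enough. The following sketch does that; `count_log_errors` is a hypothetical helper that reads log text on standard input.

```shell
#!/bin/bash
# Sketch only: count log lines on stdin that mention error, fatal, or panic.
count_log_errors() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps that case from looking like a failure.
    grep -ciE 'error|fatal|panic' || true
}

# Example wiring:
#   n=$(sudo journalctl -u axiomdb-gateway --since "15 minutes ago" --no-pager | count_log_errors)
#   [ "$n" -eq 0 ] || echo "axiomdb-gateway logged $n suspicious line(s)"
```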

Quick Reference

Common Commands

# Service status
systemctl status axiomdb-gateway postgresql pgbouncer nginx redis-server

# Gateway API calls
curl -s http://127.0.0.1:4060/health | jq .
curl -s http://127.0.0.1:4060/api/branches | jq .
curl -s http://127.0.0.1:4060/api/branches/<id> | jq .

# Database operations
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT * FROM pg_stat_activity;"
psql -h 127.0.0.1 -p 5432 -U axiomdb -c "SELECT pg_database_size(datname), datname FROM pg_database;"

# PgBouncer admin
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW POOLS;"
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW STATS;"

# Backup operations
pgbackrest --stanza=axiomdb info
pgbackrest --stanza=axiomdb --type=full backup
pgbackrest --stanza=axiomdb verify

# Firewall
sudo ufw status
sudo ufw allow from 10.0.0.0/8 to any port 5432

# Logs
journalctl -u axiomdb-gateway -f
tail -f /var/log/postgresql/postgresql-14-main.log
tail -f /var/log/nginx/error.log

# square-dbctl provisioning
square-dbctl provision --branch <name> --database <db>
square-dbctl status --branch <name>

Emergency Contacts

Role	Contact	When to Escalate
On-call engineer	Slack #oncall	Any production incident
DBA	Slack #dba-team	Data corruption, performance degradation
Infrastructure	Slack #infra	VPS issues, network outages
Security	Slack #security	Credential compromise, unauthorized access

Appendix: Port Reference

Port	Service	Binding	Access
22	SSH	0.0.0.0	Restricted by IP
80	Nginx (HTTP→HTTPS redirect)	0.0.0.0	Public
443	Nginx (TLS)	0.0.0.0	Public
3000	Ops Console	127.0.0.1	Internal (proxied via Nginx)
4060	AxiomDB Gateway	127.0.0.1	Internal (proxied via Nginx)
5432	PostgreSQL	127.0.0.1	Internal only
6379	Redis	127.0.0.1	Internal only
6432	PgBouncer	127.0.0.1	Internal only
