Observability

AionDB starts an HTTP observability server for local health and metrics.

Defaults

AIONDB_OBSERVABILITY_BIND=127.0.0.1
AIONDB_OBSERVABILITY_PORT=9187
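
As a convenience for scripts, the two variables above can be turned into endpoint URLs with a small shell function. This is only a sketch: the name obs_url is hypothetical and not part of AionDB, and it assumes the scripts see the same environment as the server.

```shell
# Hypothetical helper (not part of AionDB): build an observability URL
# from the environment, falling back to the documented defaults.
# Usage: obs_url [path]    (path defaults to /livez)
obs_url() {
  printf 'http://%s:%s%s\n' \
    "${AIONDB_OBSERVABILITY_BIND:-127.0.0.1}" \
    "${AIONDB_OBSERVABILITY_PORT:-9187}" \
    "${1:-/livez}"
}
```

With the defaults in place, curl -s "$(obs_url /readyz)" is equivalent to the explicit curl http://127.0.0.1:9187/readyz.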

Endpoints: /livez, /healthz, /readyz, /info, /metrics

Local checks

curl http://127.0.0.1:9187/livez
curl http://127.0.0.1:9187/healthz
curl http://127.0.0.1:9187/readyz
curl http://127.0.0.1:9187/info
curl http://127.0.0.1:9187/metrics

Startup behavior

By default, the database server can keep running in degraded mode when the observability server fails to start. To make that failure abort startup instead, set:

AIONDB_OBSERVABILITY_FAIL_FAST=true

Security posture

The server treats public observability exposure as unsafe in v0.1. Keep observability on loopback unless the environment is explicitly secured.

What to check first

Use /livez to check whether the observability HTTP process is alive. Use /readyz for supervisors and load balancers that need a pgwire readiness gate. /healthz is kept as a compatibility alias for readiness. Use /info to inspect basic runtime identity and configuration. Use /metrics for counters and gauges that are useful during local evaluation.
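
A supervisor script that has to wait for the pgwire readiness gate can poll /readyz in a loop. The following is a minimal sketch, not an official tool: the function name wait_ready and its arguments are hypothetical, and it assumes that /readyz answers with a non-2xx status while the server is not ready, which this page does not state explicitly.

```shell
# Hypothetical helper: poll /readyz until it answers successfully.
# Assumption: "not ready" is reported as an HTTP error status.
# Usage: wait_ready [attempts] [delay_seconds] [base_url]
# Note: POSIX sh has no local variables, so these names are global.
wait_ready() {
  attempts="${1:-30}"
  delay="${2:-1}"
  base="${3:-http://127.0.0.1:9187}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP status codes >= 400;
    # connection failures also give a non-zero exit status.
    if curl -fsS -o /dev/null "$base/readyz"; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

For example, wait_ready 10 2 tries up to ten times, two seconds apart, and returns a non-zero status if the endpoint never succeeds.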

Exposed metric families

/metrics emits Prometheus-compatible plain-text counters and gauges. The metric names below are stable enough to be used by local evaluation dashboards, but the exact set may grow between releases.

Family: Names
Query lifecycle: aiondb_queries_total, aiondb_queries_failed_total, aiondb_rows_returned_total, aiondb_rows_affected_total
Query latency: aiondb_query_duration_micros_total, aiondb_query_duration_micros_bucket{le="..."}, aiondb_query_duration_micros_sum, aiondb_query_duration_micros_count, aiondb_query_duration_micros_p50, aiondb_query_duration_micros_p95, aiondb_query_duration_micros_p99
Concurrency: aiondb_query_queue_depth_current, aiondb_query_queue_depth_peak, aiondb_session_lock_wait_total, aiondb_session_lock_wait_micros_total, aiondb_session_lock_wait_micros_max
Graph DDL: aiondb_graph_ddl_operations_total
Distributed execution: aiondb_distributed_fragments_total, aiondb_distributed_fragment_errors_total
pgwire listener: aiondb_pgwire_connections_total, aiondb_pgwire_connections_active, aiondb_pgwire_queries_total, aiondb_pgwire_successful_startups_total, aiondb_pgwire_failed_startups_total, aiondb_pgwire_authentication_failures_total
Product contract: aiondb_product_single_node_mode, aiondb_product_clustering_supported, aiondb_product_encryption_at_rest_supported, aiondb_product_backup_restore_supported
Distributed topology: aiondb_distributed_remote_nodes_total, aiondb_distributed_remote_nodes_available, aiondb_distributed_remote_circuits_open, aiondb_distributed_remote_circuits_half_open, aiondb_distributed_remote_node_available{node=...}, aiondb_distributed_remote_node_circuit_state{node=...}, aiondb_distributed_remote_node_consecutive_failures{node=...}
Control plane: aiondb_distributed_control_plane_nodes_total, aiondb_distributed_control_plane_nodes_live, aiondb_distributed_control_plane_node_live{node=...}, aiondb_distributed_control_plane_shards_total, aiondb_distributed_control_plane_placement_epoch
Distributed replication: aiondb_distributed_replication_shards_total, aiondb_distributed_replication_shards_with_live_quorum, aiondb_distributed_replication_shards_without_live_quorum, aiondb_distributed_replication_under_replicated_shards, aiondb_distributed_replication_shards_with_down_voters, aiondb_distributed_replication_shards_with_learners, aiondb_distributed_replication_learner_replicas, aiondb_distributed_replication_shard_live_quorum{shard_id=...}, aiondb_distributed_replication_node_leaders{node_id=...}, aiondb_distributed_replication_node_voters{node_id=...}, aiondb_distributed_replication_node_learners{node_id=...}
Replica runtime: aiondb_replica_runtime_sessions_started, aiondb_replica_runtime_sessions_succeeded, aiondb_replica_runtime_sessions_failed, aiondb_replica_runtime_reconnects, aiondb_replica_runtime_wal_bytes_received, aiondb_replica_runtime_standby_status_updates_sent, aiondb_replica_runtime_last_session_started_at_us
Replica WAL receiver: aiondb_replica_wal_receiver_write_lsn, aiondb_replica_wal_receiver_flush_lsn, aiondb_replica_wal_receiver_apply_lsn, aiondb_replica_wal_receiver_write_apply_lag_lsn, aiondb_replica_wal_receiver_flush_apply_lag_lsn
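
Because /metrics is Prometheus-compatible plain text, unlabelled metrics can be read with standard tools. The sketch below shows one way to pull a single value and to derive a query failure ratio from the query lifecycle counters. The function names are hypothetical, and the ratio assumes the counters are cumulative since process start, as Prometheus counters normally are.

```shell
# Hypothetical helper: print the value of an unlabelled metric from
# Prometheus text exposition read on stdin. Exits non-zero if the
# metric is absent. "# HELP" and "# TYPE" lines never match because
# their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2; found = 1; exit }
                    END { exit !found }'
}

# Sketch: fraction of queries that failed, computed from
# aiondb_queries_failed_total / aiondb_queries_total.
# Prints "n/a" when no queries have been counted.
failure_ratio() {
  awk '$1 == "aiondb_queries_total"        { total = $2 }
       $1 == "aiondb_queries_failed_total" { failed = $2 }
       END {
         if (total > 0) printf "%.4f\n", failed / total
         else print "n/a"
       }'
}
```

Typical use is curl -s http://127.0.0.1:9187/metrics | metric_value aiondb_pgwire_connections_active, or the same pipeline ending in failure_ratio. Labelled series such as the {node=...} gauges need a different match and are not handled here.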

The aiondb_product_* gauges are dimensional booleans that describe what the running binary actually supports. They are useful for dashboards that need to refuse production-readiness claims a build cannot back.
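
A dashboard or CI gate could use these gauges to refuse a run when the build lacks a capability. The following is a sketch under two assumptions that this page does not confirm: a value of 1 means "supported" and 0 means "not supported". The function name require_product_flags is hypothetical.

```shell
# Hypothetical gate: succeed only if every named aiondb_product_* gauge
# is present on stdin with value 1.
# Assumption: 1 = supported, 0 = not supported.
# Usage: require_product_flags gauge_name... < metrics.txt
require_product_flags() {
  awk -v want="$*" '
    BEGIN { n = split(want, names, " ") }
    { value[$1] = $2 }          # remember the last value seen per name
    END {
      status = 0
      for (i = 1; i <= n; i++) {
        # A missing gauge evaluates to 0 and is treated as unsupported.
        if (value[names[i]] + 0 != 1) {
          print "missing or disabled: " names[i] > "/dev/stderr"
          status = 1
        }
      }
      exit status
    }'
}
```

For example, curl -s http://127.0.0.1:9187/metrics | require_product_flags aiondb_product_backup_restore_supported exits non-zero when the running binary does not report backup and restore support.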

During a benchmark or compatibility run, record:

That information makes a performance or reliability report useful after the machine has been shut down.

Local debugging pattern

Start the server in one terminal:

AIONDB_BOOTSTRAP_USER=dev \
AIONDB_BOOTSTRAP_PASSWORD='DevPassword42!' \
cargo run -p aiondb-server --bin aiondb -- --ephemeral

Check health from another terminal:

curl -s http://127.0.0.1:9187/livez
curl -s http://127.0.0.1:9187/healthz
curl -s http://127.0.0.1:9187/readyz
curl -s http://127.0.0.1:9187/info

If the database accepts client connections but observability does not respond, check the bind address, port, and whether another process already owns the port.
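
One way to check for a port conflict on Linux is to filter the listening-socket table. This sketch assumes the iproute2 ss tool is installed and that the local address is the fourth column of ss -ltnH output. The function name listeners_on_port is hypothetical.

```shell
# Hypothetical helper: read `ss -ltnH` output on stdin and print every
# listening socket whose local port matches the argument.
# Assumption: field 4 is the local "address:port" column.
# Exits non-zero when nothing is listening on that port.
listeners_on_port() {
  awk -v port="$1" '$4 ~ (":" port "$") { print; found = 1 }
                    END { exit !found }'
}
```

Running ss -ltnpH | listeners_on_port 9187 lists whatever is bound to the default port. The -p flag adds the owning process, but it may need elevated privileges to show processes owned by other users.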

Production-style guidance

For v0.1, do not expose observability directly to the public internet. Put it behind local networking, firewall rules, or a trusted collection agent. Treat metrics as operational data: they may reveal database names, runtime shape, workload volume, or error patterns.

What is not covered yet

The v0.1 observability story is intentionally small. A mature deployment story would also need structured tracing, stable metric names, documented alert thresholds, log redaction policy, dashboard examples, and integration tests for degraded observability startup.