Metrics
The Resonate server exposes a Prometheus-compatible metrics endpoint at :9090/metrics.
The aio prefix covers operations that go out of the server, such as requests to the store and tasks sent to nodes.
The coroutines prefix refers to the server's internal units of business logic.
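A quick way to see what is exposed is to curl the endpoint; the fragment below is illustrative output in the Prometheus text exposition format (actual values depend on your workload):
curl http://localhost:9090/metrics

# HELP aio_worker_count Number of aio subsystem workers.
# TYPE aio_worker_count gauge
aio_worker_count{type="store:sqlite"} 1
# HELP coroutines_total Total number of coroutines.
# TYPE coroutines_total counter
coroutines_total{type="SchedulePromises"} 42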
Metrics exposed
aio_connection
Number of aio subsystem connections.
- gauge
aio_connection{type="sender:poll"} 0
aio_in_flight_submissions
Number of in flight aio submissions.
- gauge
aio_in_flight_submissions{type="store"} 0
aio_total_submissions
Total number of aio submissions.
- counter
aio_total_submissions{status="success",type="store"} 0
aio_worker_count
Number of aio subsystem workers.
- gauge
aio_worker_count{type="router"} 0
aio_worker_count{type="sender"} 0
aio_worker_count{type="sender:http"} 0
aio_worker_count{type="sender:poll"} 0
aio_worker_count{type="store:sqlite"} 0
aio_worker_in_flight_submissions
Number of in flight aio submissions per worker.
- gauge
aio_worker_in_flight_submissions{type="router",worker="0"} 0
aio_worker_in_flight_submissions{type="sender",worker="0"} 0
aio_worker_in_flight_submissions{type="sender:http",worker="0"} 0
aio_worker_in_flight_submissions{type="sender:poll",worker="0"} 0
aio_worker_in_flight_submissions{type="store:sqlite",worker="0"} 0
coroutines_in_flight
Number of in flight coroutines.
- gauge
coroutines_in_flight{type="EnqueueTasks"} 0
coroutines_in_flight{type="SchedulePromises"} 0
coroutines_in_flight{type="TimeoutLocks"} 0
coroutines_in_flight{type="TimeoutPromises"} 0
coroutines_in_flight{type="TimeoutTasks"} 0
coroutines_total
Total number of coroutines.
- counter
coroutines_total{type="EnqueueTasks"} 0
coroutines_total{type="SchedulePromises"} 0
coroutines_total{type="TimeoutLocks"} 0
coroutines_total{type="TimeoutPromises"} 0
coroutines_total{type="TimeoutTasks"} 0
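The per-worker gauges can be rolled up with PromQL when only subsystem totals matter; a small sketch using the labels listed above:
# In-flight submissions per subsystem, summed across workers
sum by (type) (aio_worker_in_flight_submissions)

# In-flight submissions across the whole aio layer
sum(aio_in_flight_submissions)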
Using Prometheus
Quick start
- Download Prometheus: https://prometheus.io/download/
- Configure Prometheus to scrape Resonate metrics:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "resonate-server"
    static_configs:
      - targets: ["localhost:9090"] # Resonate metrics endpoint
        labels:
          app: "resonate"
          env: "production"
- Start Prometheus:
# Run on port 9091 to avoid conflict with Resonate (which uses 9090)
./prometheus --config.file=prometheus.yml --web.listen-address=:9091
- Access Prometheus UI:
Open http://localhost:9091 to query metrics and build dashboards.
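Optionally, validate the configuration before starting Prometheus with promtool, which ships in the same download:
./promtool check config prometheus.yml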
Prometheus in Docker
version: '3.8'

services:
  resonate-server:
    image: resonatehq/resonate:latest
    ports:
      - "8001:8001"
      - "9090:9090" # Metrics endpoint

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

volumes:
  prometheus-data:
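Assuming prometheus.yml sits next to the compose file, bring the stack up and confirm the scrape target is healthy:
docker compose up -d
# Resonate metrics:        http://localhost:9090/metrics
# Prometheus targets page: http://localhost:9091/targets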
Prometheus in Kubernetes
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'resonate'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - default
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            regex: resonate-server
            action: keep
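The relabel rule keeps only pods labeled app: resonate-server, so the Resonate Deployment's pod template must carry that label and declare the metrics port. A minimal, illustrative snippet (names and ports are assumptions based on the setup above):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resonate-server
spec:
  selector:
    matchLabels:
      app: resonate-server
  template:
    metadata:
      labels:
        app: resonate-server # matched by the relabel rule above
    spec:
      containers:
        - name: resonate
          image: resonatehq/resonate:latest
          ports:
            - containerPort: 8001 # API
            - containerPort: 9090 # metrics endpoint scraped by Prometheus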
Using Grafana
Grafana visualizes metrics from Prometheus.
Quick start
- Download Grafana: https://grafana.com/grafana/download
- Start Grafana:
./grafana server
- Add Prometheus as a data source:
  - Open http://localhost:3000 (default Grafana UI)
  - Go to Configuration → Data Sources
  - Add Prometheus
  - URL: http://localhost:9091 (your Prometheus instance)
- Create dashboards using PromQL queries.
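The data source can also be provisioned from a file instead of through the UI; a minimal sketch for Grafana's provisioning directory (the path and URL are assumptions matching the setup above):
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9091
    isDefault: true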
Example dashboard panels
Promise workload:
# Total pending promises
promises_total{state="pending"}
# Promise rate by state
rate(promises_total[5m])
Task processing:
# Total tasks by state
tasks_total{state="claimed"}
tasks_total{state="completed"}
# Task completion rate
rate(tasks_total{state="completed"}[5m])
Server load:
# API request rate
rate(api_requests_total[5m])
# Coroutines in flight (internal queue depth)
sum(coroutines_in_flight)
Key metrics to track
Workload metrics
Promise rate:
rate(promises_total[5m])
Indicates incoming workload. Track trends and spikes across all promise states.
Pending promises:
promises_total{state="pending"}
Backlog of work. Should stay near zero. Growth indicates insufficient worker capacity or processing issues.
Promise state distribution:
promises_total{state="resolved"}
promises_total{state="rejected"}
promises_total{state="canceled"}
Track outcomes to understand success/failure patterns.
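If a single number is easier to watch than three counters, a failure ratio can be derived; a hedged sketch that treats promises_total as a counter per state, as the rate() queries above already do:
# Fraction of finished promises that were rejected or canceled (last 5 minutes)
sum(rate(promises_total{state=~"rejected|canceled"}[5m]))
/
sum(rate(promises_total{state=~"resolved|rejected|canceled"}[5m]))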
Task metrics
Task completion rate:
rate(tasks_total{state="completed"}[5m])
How fast tasks are being processed. Compare with promise rate to understand throughput.
Claimed vs completed:
tasks_total{state="claimed"}
tasks_total{state="completed"}
Large gap between claimed and completed indicates tasks are stalling.
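Because the two series carry different state label values, they must be aggregated before PromQL can subtract them; a sketch of the gap as a single number:
# Tasks claimed but not yet completed
sum(tasks_total{state="claimed"}) - sum(tasks_total{state="completed"})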
Server metrics
API request rate:
rate(api_requests_total[5m])
Overall server load. High values indicate heavy traffic.
API request latency:
histogram_quantile(0.95, rate(api_duration_seconds_bucket[5m]))
P95 latency for API requests. High values indicate performance bottlenecks.
HTTP requests:
rate(http_requests_total[5m])
HTTP API usage. Track by method and path to understand traffic patterns.
Coroutines in flight:
sum(coroutines_in_flight)
Server's internal work queue. High values indicate server capacity issues.
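When this number climbs, topk() shows which coroutine type is responsible:
# The three busiest coroutine types right now
topk(3, coroutines_in_flight)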
AIO submissions:
rate(aio_total_submissions{status="success"}[5m])
rate(aio_total_submissions{status="failure"}[5m])
Server's asynchronous I/O operations (database, worker communication). Failures indicate infrastructure issues.
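A failure ratio per subsystem is often more telling than the raw failure rate; a hedged sketch using the status and type labels shown earlier:
# Fraction of aio submissions failing, per subsystem type
sum by (type) (rate(aio_total_submissions{status="failure"}[5m]))
/
sum by (type) (rate(aio_total_submissions[5m]))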
Alerting rules
Set up alerts for critical conditions:
groups:
  - name: resonate-alerts
    rules:
      - alert: ResonateServerDown
        expr: up{job="resonate-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Resonate server is down"
          description: "Resonate server has been down for more than 1 minute"

      - alert: HighAPIErrorRate
        expr: rate(api_requests_total{status=~"5.."}[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API error rate"
          description: "More than 10 API errors per second"

      - alert: PromiseBacklogGrowing
        expr: increase(promises_total{state="pending"}[5m]) > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Promise backlog is growing"
          description: "Pending promises are accumulating - scale workers or check processing"

      - alert: TasksNotCompleting
        expr: sum(tasks_total{state="claimed"}) - sum(tasks_total{state="completed"}) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tasks claimed but not completing"
          description: "Large gap between claimed and completed tasks"

      - alert: HighAPILatency
        expr: histogram_quantile(0.95, rate(api_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: "P95 API latency is over 5 seconds"
Load alerting rules in Prometheus:
global:
  scrape_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093'] # Alertmanager endpoint

scrape_configs:
  - job_name: "resonate-server"
    static_configs:
      - targets: ["localhost:9090"]
Cloud provider integration
AWS CloudWatch
Export metrics to CloudWatch using a bridge, such as the CloudWatch agent running as a sidecar with Prometheus scraping enabled. (Note that the prom/cloudwatch-exporter image works in the opposite direction: it imports CloudWatch metrics into Prometheus.)
# Sidecar running the CloudWatch agent, configured to scrape the Resonate metrics endpoint
- name: cloudwatch-agent
  image: amazon/cloudwatch-agent:latest
Or use Amazon Managed Service for Prometheus (AMP):
- Provides a managed, Prometheus-compatible workspace
- Ingests metrics scraped from ECS/EKS (via the AWS managed collector or an agent you run)
- Pairs with Amazon Managed Grafana for dashboards
Google Cloud Monitoring
Use Google Cloud Managed Prometheus:
global:
  external_labels:
    cluster: 'my-cluster'
    project_id: 'my-project'

scrape_configs:
  - job_name: 'resonate'
    kubernetes_sd_configs:
      - role: pod
Google Cloud Monitoring automatically imports Prometheus metrics.
Datadog
Use the Datadog Agent to scrape Prometheus metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-config
data:
  prometheus.yaml: |
    - prometheus_url: http://resonate-server:9090/metrics
      namespace: resonate
      metrics:
        - aio_*
        - coroutines_*
        - promises_*
        - tasks_*
        - api_*
        - http_*
Best practices
- Set retention policies - Prometheus defaults to 15 days. Adjust based on needs:
./prometheus --storage.tsdb.retention.time=90d
- Use recording rules for expensive queries:
groups:
  - name: resonate-recording-rules
    interval: 1m
    rules:
      - record: resonate:promises:rate5m
        expr: rate(promises_total[5m])
- Monitor Prometheus itself:
prometheus_tsdb_storage_blocks_bytes # Storage usage
prometheus_target_scrapes_exceeded_sample_limit_total # Cardinality issues
- Use labels strategically - Don't add high-cardinality labels (e.g., user IDs, promise IDs)
- Alert on trends, not thresholds - Use rate() and time windows to detect anomalies
- Test alerts - Trigger alerts intentionally to verify they work (see the promtool sketch below)
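One way to test alert logic without taking the server down is a promtool unit test against the rule file; a minimal sketch (file names are illustrative):
# alerts_test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate the Resonate scrape target being down for five minutes
      - series: 'up{job="resonate-server"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 3m
        alertname: ResonateServerDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: resonate-server
            exp_annotations:
              summary: "Resonate server is down"
              description: "Resonate server has been down for more than 1 minute"
Run it with:
./promtool test rules alerts_test.yml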
Troubleshooting metrics
Metrics endpoint not accessible
Check server is running:
curl http://localhost:9090/metrics
Should return Prometheus-formatted metrics.
Check metrics port configuration:
api:
  metrics:
    port: 9090 # Must be accessible to Prometheus
Prometheus not scraping metrics
Check Prometheus targets:
- Open http://localhost:9091/targets
- Look for your Resonate server job
- Status should be "UP" with green indicator
Common issues:
- Wrong target address (hostname/port)
- Network/firewall blocking access
- Prometheus config not loaded (restart Prometheus)
Missing metrics
Not all metrics appear immediately. Some are only emitted when events occur:
- promises_total - Only increments when promises are created/resolved
- tasks_total - Only increments when tasks are created/completed
- http_requests_total - Only increments when HTTP requests are made
Run workload to generate metrics.
Metrics not currently exposed:
- Worker heartbeat failures (monitor worker pod restarts at infrastructure level)
- Database connection pool stats (PostgreSQL exposes these separately)
- Per-worker execution metrics (not tracked at server level)
Summary
For development:
- Run Prometheus locally
- Use Prometheus UI for ad-hoc queries
- Focus on understanding baseline metrics
For production:
- Use managed Prometheus (AWS AMP, GCP Managed Prometheus, Grafana Cloud)
- Set up Grafana dashboards
- Configure alerting for critical conditions
- Monitor trends, not just point-in-time values
- Integrate with your existing observability stack
Key metrics to always track:
- promises_total by state (workload)
- tasks_total by state (processing throughput)
- api_requests_total (server load)
- coroutines_in_flight (server capacity)
What is not exposed as metrics:
- Individual worker health (monitor at infrastructure level: K8s pod restarts, container health)
- Database connection pools (use PostgreSQL's own metrics)
- Promise/task latency histograms (not currently exposed)
Metrics give you visibility into system health. Set them up before you need them.