Observability & Monitoring

Monitoring Terragnos Core is essential for maintaining a healthy deployment. This guide covers health checks, metrics, logging, and recommended monitoring strategies.

Health Checks

Terragnos Core provides two health check endpoints:

Liveness Check

GET /v1/health/live

Returns 200 OK if the API process is running. Use this for basic process monitoring.

Readiness Check

GET /v1/health/ready

Returns 200 OK when the API is ready to accept traffic (database connected, license loaded, etc.). Use this for load balancer health checks.

Response:

{
  "status": "ready",
  "checks": {
    "database": "ok",
    "license": "ok"
  }
}
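
For scripted probes outside a load balancer, the readiness response can be checked programmatically. The following is a minimal sketch, not an official client: the function names `is_ready` and `probe` are hypothetical, and it assumes that a failing dependency is reported with a value other than "ok" in `checks`, which this guide does not specify.

```python
import json
import urllib.error
import urllib.request


def is_ready(body: str) -> bool:
    """Return True when the payload says "ready" and every check is "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    checks = payload.get("checks", {})
    if not isinstance(checks, dict):
        return False
    # Assumption: any value other than "ok" means the dependency is unhealthy.
    return payload.get("status") == "ready" and all(
        value == "ok" for value in checks.values()
    )


def probe(base_url: str, timeout: float = 2.0) -> bool:
    """Call GET /v1/health/ready; errors and non-200 responses count as not ready."""
    url = f"{base_url}/v1/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and is_ready(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        # Covers connection failures, timeouts, and HTTP error statuses.
        return False
```

A call such as `probe("http://api:3000")` would return a boolean suitable for a cron job or deployment script; the host and port here are borrowed from the Prometheus example in this guide and may differ in your deployment.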

Metrics

Prometheus Metrics

Terragnos Core exposes Prometheus-compatible metrics at:

GET /metrics

Key metrics include:

  • http_requests_total – Total HTTP requests
  • http_request_duration_seconds – Request duration histogram
  • license_limit_exceeded_total{limitCode} – License limit violations
  • workflow_transitions_total{workflowId,state} – Workflow transition counts
  • automation_rules_executed_total{ruleId} – Automation rule executions
  • database_query_duration_seconds – Database query performance
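
To spot-check a counter without a full Prometheus server, the text returned by /metrics can be parsed directly. The sketch below sums all samples of one metric across its label sets. The sample payload is invented for illustration (the `method` and `status` labels and the `max_objects` limit code are assumptions, not documented labels), and the parser handles only simple sample lines.

```python
import re

# Matches "name{labels} value [timestamp]" lines of the Prometheus text format.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+-?\d+)?\s*$'
)

# Hypothetical scrape output; real label names may differ.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 120
http_requests_total{method="POST",status="500"} 3
license_limit_exceeded_total{limitCode="max_objects"} 2
"""


def sum_metric(exposition: str, metric_name: str) -> float:
    """Sum the values of every sample whose metric name matches exactly."""
    total = 0.0
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match and match.group("name") == metric_name:
            total += float(match.group("value"))
    return total
```

For example, `sum_metric(SAMPLE, "http_requests_total")` adds the two request samples together regardless of their labels.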

Example Prometheus Configuration

scrape_configs:
  - job_name: 'terragnos-core'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['api:3000']

Logging

Log Levels

Configure the log level via the LOG_LEVEL environment variable:

  • error – Only errors
  • warn – Warnings and errors
  • info – Informational messages (default)
  • debug – Detailed debugging information

Log Format

Logs are structured JSON by default:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "Object created",
  "context": "ObjectService",
  "objectId": "obj-123",
  "tenant": "default"
}
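
Because each entry is JSON, logs can be filtered by severity with a few lines of code. This sketch assumes one JSON object per line (the example above is pretty-printed for readability) and uses the four level names listed under Log Levels; the numeric ranks and the `filter_logs` helper are illustrative, not part of Terragnos Core.

```python
import json
from typing import Dict, Iterable, Iterator

# Illustrative ranking of the documented level names, lowest to highest.
SEVERITY = {"debug": 10, "info": 20, "warn": 30, "error": 40}


def filter_logs(lines: Iterable[str], min_level: str = "warn") -> Iterator[Dict]:
    """Yield parsed log records at or above min_level, skipping non-JSON lines."""
    threshold = SEVERITY[min_level]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output in the stream
        if not isinstance(record, dict):
            continue
        # Unknown or missing levels rank as 0 and are filtered out.
        if SEVERITY.get(str(record.get("level")), 0) >= threshold:
            yield record
```

Passing an open log file or `sys.stdin` as `lines` would stream only warnings and errors, which is often enough for a quick triage before a full log aggregation setup is in place.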

Key Log Events

Monitor these log events:

  • Authentication failures – 401 Unauthorized responses
  • License violations – 403 License limit exceeded
  • Database errors – Connection failures, query errors
  • Workflow transitions – State changes and guard failures
  • Automation executions – Rule matches and effect results

Monitoring Stack Recommendations

Basic Setup

  1. Health checks – Configure the load balancer to check /v1/health/ready
  2. Log aggregation – Use Docker logging drivers or log shippers (Fluentd, Filebeat)
  3. Metrics collection – Prometheus + Grafana

Advanced Setup

  1. Distributed tracing – OpenTelemetry for request tracing
  2. APM – Application Performance Monitoring (New Relic, Datadog, etc.)
  3. Alerting – Alertmanager or PagerDuty integration

Key Metrics to Monitor

API Performance

  • Request rate (requests/second)
  • Request latency (p50, p95, p99)
  • Error rate (4xx, 5xx responses)
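
As a reminder of what the latency percentiles mean, the sketch below computes them from raw request durations using the nearest-rank method. In a Prometheus setup these values would more typically be derived from the http_request_duration_seconds histogram (for example with PromQL's `histogram_quantile`), so treat this as an illustration of the definition rather than the recommended pipeline. The `percentile` helper and the sample durations are hypothetical.

```python
from typing import Sequence


def percentile(durations: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not durations:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(durations)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical request durations in seconds.
samples = [0.12, 0.08, 0.30, 1.40, 0.20]
p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
```

With only five samples, the single slow request dominates p95 while p50 stays low, which is why the tail percentiles are worth tracking alongside the median.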

License Usage

  • Current usage vs. limits
  • License expiration date
  • Limit violation frequency

Database Performance

  • Query duration
  • Connection pool usage
  • Transaction rate

Workflow Performance

  • Transition success rate
  • Average transition time
  • Guard failure rate

Automation Performance

  • Rule execution rate
  • Rule success/failure rate
  • Webhook call duration

Alerting Recommendations

Set up alerts for:

  1. Health check failures – API or worker unhealthy
  2. High error rate – > 5% 5xx responses
  3. License expiration – Expiring within 30 days
  4. License limits – > 80% of limit reached
  5. Database connection failures – Cannot connect to database
  6. High latency – p95 latency > 1 second
  7. Automation failures – Rule execution failure rate > 10%
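
The numeric thresholds above can be expressed as a small rule evaluator, which is handy for testing alert logic before encoding it in Alertmanager or another tool. This is a sketch under stated assumptions: the `Snapshot` fields and alert names are hypothetical, and how each value is collected (metrics, license API, and so on) is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Snapshot:
    """Hypothetical point-in-time values gathered from metrics and license data."""
    total_requests: int
    server_errors: int          # count of 5xx responses
    p95_latency_seconds: float
    license_usage: int
    license_limit: int
    license_expires: date
    rule_executions: int
    rule_failures: int


def ratio(part: int, whole: int) -> float:
    """Safe division that treats an empty denominator as 0%."""
    return part / whole if whole else 0.0


def firing_alerts(s: Snapshot, today: date) -> List[str]:
    """Return the names of alerts whose documented threshold is exceeded."""
    alerts = []
    if ratio(s.server_errors, s.total_requests) > 0.05:
        alerts.append("high_error_rate")
    if (s.license_expires - today).days <= 30:
        alerts.append("license_expiring")
    if ratio(s.license_usage, s.license_limit) > 0.80:
        alerts.append("license_limit_near")
    if s.p95_latency_seconds > 1.0:
        alerts.append("high_latency")
    if ratio(s.rule_failures, s.rule_executions) > 0.10:
        alerts.append("automation_failures")
    return alerts
```

Health check and database connection failures are omitted here because they are binary conditions rather than thresholds; they are better driven directly from the readiness endpoint.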

Log Aggregation

Docker Logging

Configure Docker logging drivers:

services:
  api:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

Centralized Logging

For production, use centralized logging:

  • ELK Stack – Elasticsearch, Logstash, Kibana
  • Loki – Grafana Loki for log aggregation
  • Cloud logging – AWS CloudWatch, Google Cloud Logging, Azure Monitor

Best Practices

  1. Monitor health endpoints – Set up alerts for health check failures
  2. Track key metrics – Monitor API performance, license usage, database health
  3. Centralize logs – Aggregate logs for easier troubleshooting
  4. Set up alerts – Configure alerts for critical issues
  5. Review regularly – Check metrics and logs for trends on a recurring schedule
  6. Test alerts – Ensure alerting systems work correctly
  7. Document runbooks – Create runbooks for common issues