Observability & Monitoring

Monitoring Terragnos Core is essential for maintaining a healthy deployment. This guide covers health checks, metrics, logging, and recommended monitoring strategies.

Health Checks

Terragnos Core provides two health check endpoints:

Liveness Check

GET /v1/health/live

Returns 200 OK if the API process is running. Use this for basic process monitoring.

Readiness Check

GET /v1/health/ready

Returns 200 OK when the API is ready to accept traffic (database connected, license loaded, etc.). Use this for load balancer health checks.

Response:

{
  "status": "ready",
  "checks": {
    "database": "ok",
    "license": "ok"
  }
}
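As an illustration of how a deployment script or custom probe might consume this response, here is a minimal Python sketch. The function names `failing_checks` and `is_ready` are hypothetical, and the sketch assumes that any check value other than "ok" signals a problem; this guide does not document the shape of a not-ready response, so treat that as an assumption.

```python
import json
import urllib.request


def failing_checks(body: str) -> list[str]:
    """Return the names of checks whose value is not "ok".

    Assumption: a failing dependency is reported with some value other
    than "ok" under the "checks" key.
    """
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(name for name, state in checks.items() if state != "ok")


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Probe /v1/health/ready; True only for a 200 with no failing checks."""
    url = f"{base_url}/v1/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
            return resp.status == 200 and not failing_checks(body)
    except (OSError, ValueError):
        # OSError covers connection errors, timeouts, and HTTP error statuses;
        # ValueError covers a body that is not valid JSON.
        return False
```

A caller could poll `is_ready("http://api:3000")` before routing traffic to a new instance, where the host and port are taken from the example Prometheus target in this guide.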

Metrics

Prometheus Metrics

Terragnos Core exposes Prometheus-compatible metrics at:

GET /metrics

Key metrics include:

http_requests_total – Total HTTP requests
http_request_duration_seconds – Request duration histogram
license_limit_exceeded_total{limitCode} – License limit violations
workflow_transitions_total{workflowId,state} – Workflow transition counts
automation_rules_executed_total{ruleId} – Automation rule executions
database_query_duration_seconds – Database query performance
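For quick ad hoc inspection without a Prometheus server, the text returned by /metrics can be parsed directly. The sketch below sums one counter per label value using a simplified reading of the Prometheus text exposition format. The function name `sum_by_label` and the label values in `SAMPLE` (`max_objects`, `max_users`) are made up for illustration; only the metric and label names come from the list above.

```python
import re
from collections import defaultdict

# Matches sample lines such as: name{label="value",other="x"} 12
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output; the limitCode values are invented.
SAMPLE = """\
# HELP license_limit_exceeded_total License limit violations
# TYPE license_limit_exceeded_total counter
license_limit_exceeded_total{limitCode="max_objects"} 3
license_limit_exceeded_total{limitCode="max_users"} 1
http_requests_total 120
"""


def sum_by_label(text: str, metric: str, label: str) -> dict[str, float]:
    """Sum the samples of `metric`, grouped by the value of `label`."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        # Samples without the requested label are grouped under "".
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)
```

For example, `sum_by_label(SAMPLE, "license_limit_exceeded_total", "limitCode")` groups the violation counts by limit code. For anything beyond spot checks, a Prometheus server and its query language are the better tool.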

Example Prometheus Configuration

scrape_configs:
  - job_name: 'terragnos-core'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['api:3000']

Logging

Log Levels

Set the log level with the LOG_LEVEL environment variable:

error – Errors only
warn – Warnings and errors
info – Informational messages, warnings, and errors (default)
debug – Detailed debugging information in addition to everything above

Log Format

Logs are structured JSON by default:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "Object created",
  "context": "ObjectService",
  "objectId": "obj-123",
  "tenant": "default"
}
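Because each log line is a JSON object, logs can be filtered with a few lines of code. The following sketch keeps entries at or above a chosen severity. The helper name `filter_logs` and the numeric severity ranking are assumptions made for this example; only the field names and level names come from the format shown above.

```python
import json

# Assumed severity ordering for the documented level names.
SEVERITY = {"debug": 10, "info": 20, "warn": 30, "error": 40}


def filter_logs(lines, min_level="warn"):
    """Return parsed log entries whose level is at least `min_level`."""
    threshold = SEVERITY[min_level]
    selected = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        level = entry.get("level")
        if isinstance(level, str) and SEVERITY.get(level, 0) >= threshold:
            selected.append(entry)
    return selected
```

Passing an open log file (or any iterable of lines) as `lines` returns the matching entries as dictionaries, so fields such as `context` or `tenant` can be inspected directly.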

Key Log Events

Monitor these log events:

Authentication failures – 401 Unauthorized responses
License violations – 403 License limit exceeded
Database errors – Connection failures, query errors
Workflow transitions – State changes and guard failures
Automation executions – Rule matches and effect results

Monitoring Stack Recommendations

Basic Setup

Health checks – Configure the load balancer to check /v1/health/ready
Log aggregation – Use Docker logging drivers or log shippers (Fluentd, Filebeat)
Metrics collection – Prometheus + Grafana

Advanced Setup

Distributed tracing – OpenTelemetry for request tracing
APM – Application Performance Monitoring (New Relic, Datadog, etc.)
Alerting – Alertmanager or PagerDuty integration

Key Metrics to Monitor

API Performance

Request rate (requests/second)
Request latency (p50, p95, p99)
Error rate (4xx, 5xx responses)

License Usage

Current usage vs. limits
License expiration date
Limit violation frequency

Database Performance

Query duration
Connection pool usage
Transaction rate

Workflow Performance

Transition success rate
Average transition time
Guard failure rate

Automation Performance

Rule execution rate
Rule success/failure rate
Webhook call duration

Alerting Recommendations

Set up alerts for:

Health check failures – API or worker unhealthy
High error rate – More than 5% of responses are 5xx
License expiration – License expires within 30 days
License limits – Usage above 80% of a limit
Database connection failures – Cannot connect to database
High latency – p95 latency above 1 second
Automation failures – Rule execution failure rate > 10%
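The thresholds above can be expressed as plain comparisons. The sketch below evaluates them against a snapshot of values; the `Snapshot` type, its field names, and the alert identifiers are all hypothetical and exist only to make the logic concrete. In a real deployment these conditions would more likely live in Prometheus alerting rules or the alerting tool of your choice.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Snapshot:
    """Hypothetical point-in-time values gathered from metrics and license data."""
    total_requests: int
    server_errors: int          # count of 5xx responses
    p95_latency_seconds: float
    license_usage: int
    license_limit: int
    license_expires: date
    rule_executions: int
    rule_failures: int


def ratio(part: float, whole: float) -> float:
    """Return part / whole, treating an empty denominator as 0."""
    return part / whole if whole else 0.0


def evaluate_alerts(s: Snapshot, today: date) -> list[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    if ratio(s.server_errors, s.total_requests) > 0.05:
        alerts.append("high_error_rate")
    if s.p95_latency_seconds > 1.0:
        alerts.append("high_latency")
    if (s.license_expires - today).days <= 30:
        alerts.append("license_expiring")
    if ratio(s.license_usage, s.license_limit) > 0.80:
        alerts.append("license_limit")
    if ratio(s.rule_failures, s.rule_executions) > 0.10:
        alerts.append("automation_failures")
    return alerts
```

Health check and database connection failures are left out because they are binary conditions rather than thresholds.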

Log Aggregation

Docker Logging

Configure Docker logging drivers:

services:
  api:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

Centralized Logging

For production, use centralized logging:

ELK Stack – Elasticsearch, Logstash, Kibana
Loki – Grafana Loki for log aggregation
Cloud logging – AWS CloudWatch, Google Cloud Logging, Azure Monitor

Best Practices

Monitor health endpoints – Set up alerts for health check failures
Track key metrics – Monitor API performance, license usage, database health
Centralize logs – Aggregate logs for easier troubleshooting
Set up alerts – Configure alerts for critical issues
Review regularly – Check metrics and logs for trends
Test alerts – Ensure alerting systems work correctly
Document runbooks – Create runbooks for common issues
