Observability & Monitoring
Monitoring Terragnos Core is essential for maintaining a healthy deployment. This guide covers health checks, metrics, logging, and recommended monitoring strategies.
Health Checks
Terragnos Core provides two health check endpoints:
Liveness Check
GET /v1/health/live
Returns 200 OK if the API process is running. Use this for basic process monitoring.
Readiness Check
GET /v1/health/ready
Returns 200 OK when the API is ready to accept traffic (database connected, license loaded, etc.). Use this for load balancer health checks.
Response:
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "license": "ok"
  }
}
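A load balancer only needs the status code, but a custom probe can also inspect the body to report which dependency is failing. The following is a minimal Python sketch under stated assumptions: it treats any status other than 200 as "not ready" (this guide does not say which code a non-ready instance returns), and the helper names `is_ready` and `failing_checks` are illustrative, not part of Terragnos Core.

```python
import json


def is_ready(status_code: int, body: str) -> bool:
    """Return True only for a 200 response whose status and checks all pass."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    checks = payload.get("checks")
    if not isinstance(checks, dict):
        return False
    # Field names ("status", "checks") follow the sample response above.
    return payload.get("status") == "ready" and all(
        result == "ok" for result in checks.values()
    )


def failing_checks(body: str) -> list[str]:
    """List the names of checks whose value is anything other than "ok"."""
    checks = json.loads(body).get("checks", {})
    return sorted(name for name, result in checks.items() if result != "ok")
```

Pass the status code and raw body from whichever HTTP client you use; keeping the parsing separate from the request makes the probe logic easy to unit test.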
Metrics
Prometheus Metrics
Terragnos Core exposes Prometheus-compatible metrics at:
GET /metrics
Key metrics include:
- http_requests_total – Total HTTP requests
- http_request_duration_seconds – Request duration histogram
- license_limit_exceeded_total{limitCode} – License limit violations
- workflow_transitions_total{workflowId,state} – Workflow transition counts
- automation_rules_executed_total{ruleId} – Automation rule executions
- database_query_duration_seconds – Database query performance
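For a quick check without a full Prometheus server, the text returned by /metrics can be parsed directly. The sketch below handles the common `name{label="value"} number` lines of the Prometheus text exposition format and sums a counter per label. It is a simplified parser for illustration, not a replacement for a real Prometheus client, and the function names are assumptions of this example.

```python
import re
from collections import defaultdict

# metric name, optional {label block}, then the sample value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total_by_label(samples, metric: str, label: str) -> dict[str, float]:
    """Sum one metric's values, grouped by the given label."""
    totals: dict[str, float] = defaultdict(float)
    for name, labels, value in samples:
        if name == metric and label in labels:
            totals[labels[label]] += value
    return dict(totals)
```

For example, `total_by_label(parse_samples(text), "license_limit_exceeded_total", "limitCode")` returns the number of violations per limit code.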
Example Prometheus Configuration
scrape_configs:
  - job_name: 'terragnos-core'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['api:3000']
Logging
Log Levels
Configure the log level via the LOG_LEVEL environment variable:
- error – Only errors
- warn – Warnings and errors
- info – Informational messages (default)
- debug – Detailed debugging information
Log Format
Logs are structured JSON by default:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "Object created",
  "context": "ObjectService",
  "objectId": "obj-123",
  "tenant": "default"
}
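Because each log line is a JSON object, filtering by severity takes only a few lines of code. The sketch below assumes one JSON object per line and uses the level names listed above; the `SEVERITY` ranking and the function names are illustrative choices, not part of Terragnos Core.

```python
import json

# Lower rank means more severe; names match the LOG_LEVEL values.
SEVERITY = {"error": 0, "warn": 1, "info": 2, "debug": 3}


def parse_log_line(line: str):
    """Return the log entry as a dict, or None if the line is not a JSON object."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    return entry if isinstance(entry, dict) else None


def at_least(lines, level: str) -> list[dict]:
    """Keep entries that are as severe as `level` or more severe."""
    threshold = SEVERITY[level]
    selected = []
    for line in lines:
        entry = parse_log_line(line)
        if entry is None:
            continue  # non-JSON output, such as a startup banner
        entry_level = entry.get("level")
        rank = SEVERITY.get(entry_level) if isinstance(entry_level, str) else None
        if rank is not None and rank <= threshold:
            selected.append(entry)
    return selected
```

Calling `at_least(lines, "warn")` keeps warnings and errors and drops info and debug entries, which is a reasonable starting filter for an alerting pipeline.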
Key Log Events
Monitor these log events:
- Authentication failures – 401 Unauthorized responses
- License violations – 403 License limit exceeded
- Database errors – Connection failures, query errors
- Workflow transitions – State changes and guard failures
- Automation executions – Rule matches and effect results
Monitoring Stack Recommendations
Basic Setup
- Health checks – Configure load balancer to check /v1/health/ready
- Log aggregation – Use Docker logging drivers or log shippers (Fluentd, Filebeat)
- Metrics collection – Prometheus + Grafana
Advanced Setup
- Distributed tracing – OpenTelemetry for request tracing
- APM – Application Performance Monitoring (New Relic, Datadog, etc.)
- Alerting – Alertmanager or PagerDuty integration
Key Metrics to Monitor
API Performance
- Request rate (requests/second)
- Request latency (p50, p95, p99)
- Error rate (4xx, 5xx responses)
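In a Prometheus setup these figures are normally derived from http_requests_total and the http_request_duration_seconds histogram. To make the definitions concrete, the sketch below computes them from raw samples using the nearest-rank percentile method. The function names and input shapes (a list of durations in seconds, a list of status codes) are assumptions of this example.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes: list[int], lower: int = 500, upper: int = 599) -> float:
    """Fraction of responses whose status falls in [lower, upper]."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if lower <= code <= upper)
    return errors / len(status_codes)


def latency_summary(durations: list[float]) -> dict[str, float]:
    """Return the p50, p95 and p99 latencies for a batch of request durations."""
    return {f"p{pct}": percentile(durations, pct) for pct in (50, 95, 99)}
```

Use `error_rate(codes, 400, 499)` to track client errors separately from server errors; mixing the two tends to hide real outages behind routine 4xx noise.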
License Usage
- Current usage vs. limits
- License expiration date
- Limit violation frequency
Database Performance
- Query duration
- Connection pool usage
- Transaction rate
Workflow Performance
- Transition success rate
- Average transition time
- Guard failure rate
Automation Performance
- Rule execution rate
- Rule success/failure rate
- Webhook call duration
Alerting Recommendations
Set up alerts for:
- Health check failures – API or worker unhealthy
- High error rate – > 5% 5xx responses
- License expiration – Expiring within 30 days
- License limits – > 80% of limit reached
- Database connection failures – Cannot connect to database
- High latency – p95 latency > 1 second
- Automation failures – Rule execution failure rate > 10%
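The thresholds above can be written down as a single evaluation function, which is useful for testing alert logic before encoding it in Alertmanager or another tool. This is a sketch under stated assumptions: the `Snapshot` fields and alert names are hypothetical, and the values would come from your own metrics pipeline rather than from any Terragnos Core API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Snapshot:
    """One point-in-time view of the deployment (hypothetical field names)."""
    healthy: bool
    database_reachable: bool
    error_rate_5xx: float           # fraction of responses, 0.0 to 1.0
    p95_latency_seconds: float
    license_usage_ratio: float      # current usage divided by the limit
    license_expires: date
    automation_failure_rate: float  # fraction of rule executions that failed


def firing_alerts(snapshot: Snapshot, today: date) -> list[str]:
    """Return the alerts that should fire, in the order listed above."""
    days_left = (snapshot.license_expires - today).days
    rules = [
        ("health_check_failure", not snapshot.healthy),
        ("high_error_rate", snapshot.error_rate_5xx > 0.05),
        ("license_expiration", days_left <= 30),
        ("license_limit", snapshot.license_usage_ratio > 0.80),
        ("database_connection_failure", not snapshot.database_reachable),
        ("high_latency", snapshot.p95_latency_seconds > 1.0),
        ("automation_failures", snapshot.automation_failure_rate > 0.10),
    ]
    return [name for name, triggered in rules if triggered]
```

Keeping the thresholds in one table makes them easy to review and adjust; treat the numbers as starting points and tune them to your traffic.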
Log Aggregation
Docker Logging
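Configure the Docker logging driver per service. The example below uses the json-file driver and keeps at most three log files of up to 10 MB each: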
services:
  api:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
Centralized Logging
For production, use centralized logging:
- ELK Stack – Elasticsearch, Logstash, Kibana
- Loki – Grafana Loki for log aggregation
- Cloud logging – AWS CloudWatch, Google Cloud Logging, Azure Monitor
Best Practices
- Monitor health endpoints – Set up alerts for health check failures
- Track key metrics – Monitor API performance, license usage, database health
- Centralize logs – Aggregate logs for easier troubleshooting
- Set up alerts – Configure alerts for critical issues
- Review regularly – Look through metrics and logs for trends
- Test alerts – Ensure alerting systems work correctly
- Document runbooks – Create runbooks for common issues