
Observability

Request correlation, error envelope, health endpoints, metrics, and how to triage a production incident.

Hatched bakes three primitives into every request so that a log line, an error returned to the client, and a metric emitted to Prometheus can always be correlated.

Request correlation

  • apps/api/src/common/interceptors/request-id.interceptor.ts either honors an incoming X-Request-Id header or generates a UUID v4.
  • The id is:
    • stored on request.requestId for downstream handlers,
    • echoed back on the response as the X-Request-Id header,
    • included in every log line produced by LoggingInterceptor,
    • surfaced to the client as error.requestId inside the canonical error envelope whenever an exception reaches GlobalExceptionFilter.
  • SDK clients (@hatched/sdk-js) expose it as HatchedError.requestId so downstream consumers can paste it directly into a support ticket or log search.
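The honor-or-generate decision described above can be sketched as a small helper. This is a minimal sketch, not the interceptor itself: the function name `resolveRequestId` is hypothetical, but the rule (reuse a non-empty incoming X-Request-Id, otherwise mint a UUID v4) is the one the docs describe.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper mirroring the core decision in
// request-id.interceptor.ts: honor a non-empty incoming
// X-Request-Id header value, otherwise generate a UUID v4.
function resolveRequestId(incoming: string | undefined): string {
  return incoming && incoming.trim().length > 0 ? incoming : randomUUID();
}
```

The interceptor then stores this value on `request.requestId` and echoes it back on the response as `X-Request-Id`.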

Error envelope

Every HTTP error — HatchedException, HttpException, or an unexpected exception — is serialized by GlobalExceptionFilter into:

{
  "error": {
    "code": "stable_snake_case_code",
    "message": "Human-readable message",
    "details": { "_": "optional structured context" },
    "requestId": "uuid-matching-X-Request-Id-header"
  }
}

See apps/api/src/common/exceptions/hatched.exception.ts for the typed exception hierarchy. Prefer throwing a specific subclass (ResourceNotFoundException, AuthException, RateLimitException, ValidationException, UpstreamImageException, ConfigVersionMismatchException) over HttpException so the envelope carries a stable code.
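As a rough sketch of how a typed exception maps onto the envelope: the class name `ResourceNotFoundException` comes from the list above, but its constructor signature and the `toErrorEnvelope` helper are assumptions for illustration only; the actual serialization lives in GlobalExceptionFilter.

```typescript
// Illustrative only: constructor shape and helper are assumptions,
// the envelope field names follow the canonical shape above.
class ResourceNotFoundException extends Error {
  constructor(
    public readonly code: string,
    message: string,
    public readonly details?: Record<string, unknown>,
  ) {
    super(message);
  }
}

// Sketch of the mapping GlobalExceptionFilter performs:
// exception + request id -> canonical error envelope.
function toErrorEnvelope(err: ResourceNotFoundException, requestId: string) {
  return {
    error: {
      code: err.code,
      message: err.message,
      ...(err.details ? { details: err.details } : {}),
      requestId,
    },
  };
}
```

Because the `code` comes from the exception subclass, clients can branch on it without parsing the human-readable message.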

Health endpoints

Endpoint             Status codes                           Consumer
GET /health          200                                    Human-readable status dashboard
GET /health/ready    200 when all deps up, 503 otherwise    Load balancer / Fly.io readiness probe
GET /health/live     200 as long as the process is alive    Load balancer liveness probe

/health/ready checks Postgres, Redis, BullMQ wait/active depths, and the primary image provider. A 503 response removes the instance from rotation.
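The readiness rule reduces to "200 only when every dependency reports up". A minimal sketch, assuming a hypothetical `CheckResult` shape (the individual check names below are illustrative, taken from the dependency list above):

```typescript
// One entry per dependency probed by /health/ready.
type CheckResult = { name: string; up: boolean };

// 200 only when every dependency is up; any failure flips to 503,
// which takes the instance out of rotation.
function readinessStatus(checks: CheckResult[]): 200 | 503 {
  return checks.every((c) => c.up) ? 200 : 503;
}
```

For example, `readinessStatus([{ name: "postgres", up: true }, { name: "redis", up: false }])` yields 503 even though Postgres is healthy.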

Metrics

GET /metrics emits Prometheus text-format counters/gauges. It is protected by X-Internal-Service-Token matching the INTERNAL_SERVICE_TOKEN env var — requests without the token receive 401 (or 403 when the token is not configured). Never expose this endpoint publicly; scrape it from a trusted network or a Prometheus instance that can attach the header.
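The status-code logic for the token guard can be written out explicitly. The header and env-var names are from the docs; the function itself is a hypothetical sketch, not the actual guard implementation:

```typescript
// Sketch of the /metrics auth decision:
// - 403 when INTERNAL_SERVICE_TOKEN is not configured server-side,
// - 401 when X-Internal-Service-Token is missing or wrong,
// - 200 when the tokens match.
function metricsAuthStatus(
  headerToken: string | undefined,
  configuredToken: string | undefined,
): 200 | 401 | 403 {
  if (!configuredToken) return 403;
  if (headerToken !== configuredToken) return 401;
  return 200;
}
```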

Logs

All HTTP access logs are emitted as single-line JSON by LoggingInterceptor with shape:

{
  "requestId": "...",
  "method": "POST",
  "path": "/api/v1/events",
  "statusCode": 200,
  "duration": 42,
  "ip": "...",
  "userAgent": "..."
}

Error logs additionally carry an error field with the exception message.
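The shape above lends itself to a typed builder. A sketch, assuming a hypothetical `AccessLog` interface and `formatAccessLog` helper (the real emitter is LoggingInterceptor; field names match the example above):

```typescript
// Fields mirror the JSON example above; `error` appears only on error logs.
interface AccessLog {
  requestId: string;
  method: string;
  path: string;
  statusCode: number;
  duration: number; // milliseconds
  ip: string;
  userAgent: string;
  error?: string;
}

// Single-line JSON: no pretty-printing, so each log entry is one
// greppable line keyed by requestId.
function formatAccessLog(entry: AccessLog): string {
  return JSON.stringify(entry);
}
```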

Diagnosing a production incident

  1. Grab the X-Request-Id the client received (or the requestId inside the error envelope / HatchedError).
  2. Grep logs for that id — you will find the access log, any error stack, and any downstream service calls that forwarded the id.
  3. Cross-reference with /metrics via Prometheus to see whether the request was part of a broader spike (check hatched_http_requests_total and queue depth gauges).
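Step 2 above works because every log entry is single-line JSON carrying a requestId. As a sketch, the grep is equivalent to this hypothetical filter over raw log text:

```typescript
// Parse single-line JSON logs and keep only entries for one request id —
// the programmatic equivalent of grepping logs for an X-Request-Id.
function logsForRequest(rawLines: string, requestId: string) {
  return rawLines
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as { requestId?: string; [k: string]: unknown })
    .filter((entry) => entry.requestId === requestId);
}
```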