CRD Health

Each operatorBox tracks its own health state independently using a CRDHealth instance. Health is updated on every reconcile cycle.


What CRD health tracks

Field             Description
started           Whether the reconciler has begun processing events
healthy           Whether the reconciler is currently considered healthy
totalReconciles   Total reconcile attempts
failedReconciles  Number of failed reconciles
consecutiveFails  Consecutive failure counter; drives degradation
lastError         Last error message
lastReconcile     Timestamp of last reconcile
startTime         When the reconciler first started

All fields are atomic and safe for concurrent updates from multiple workers.
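
As a rough illustration of what "atomic and safe for concurrent updates" can look like, here is a minimal Go sketch. The struct layout, the field types, and the hammer helper are assumptions made for this example; the real CRDHealth type may be organized differently. Timestamps are assumed to be stored as Unix nanoseconds so they fit in an atomic integer.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// CRDHealth is a hypothetical layout for the fields in the table above.
// Every field uses a type from sync/atomic, so no mutex is needed.
type CRDHealth struct {
	started          atomic.Bool
	healthy          atomic.Bool
	totalReconciles  atomic.Int64
	failedReconciles atomic.Int64
	consecutiveFails atomic.Int64
	lastError        atomic.Value // always holds a string (assumption)
	lastReconcile    atomic.Int64 // Unix nanoseconds (assumption)
	startTime        atomic.Int64 // Unix nanoseconds (assumption)
}

// hammer is an illustrative helper: it starts several workers that each
// bump totalReconciles perWorker times, then returns the final count.
func hammer(h *CRDHealth, workers, perWorker int) int64 {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				h.totalReconciles.Add(1) // atomic increment, no lost updates
			}
		}()
	}
	wg.Wait()
	return h.totalReconciles.Load()
}
```

With plain int64 fields and no synchronization, the same loop could lose increments under contention; with atomic counters the total always equals workers times perWorker.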


On success

RecordSuccess()
  • increments total reconciles
  • resets consecutive failures to zero
  • marks healthy
  • updates lastReconcile timestamp

On failure

RecordFailure(err, degradeThreshold)
  • increments total and failed reconcile counts
  • increments consecutive failures
  • stores lastError
  • marks unhealthy if consecutiveFails >= degradeThreshold
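
The two update paths above can be sketched in Go as follows. This is an illustration under stated assumptions, not the actual implementation: the reduced struct, the int64 type of degradeThreshold, the initial healthy state, and the replay helper are all hypothetical. Only the steps listed in the bullets are performed.

```go
package main

import (
	"errors"
	"sync/atomic"
	"time"
)

// CRDHealth here is a reduced, hypothetical version holding only the
// fields that the two record methods touch.
type CRDHealth struct {
	healthy          atomic.Bool
	totalReconciles  atomic.Int64
	failedReconciles atomic.Int64
	consecutiveFails atomic.Int64
	lastError        atomic.Value // holds a string
	lastReconcile    atomic.Int64 // Unix nanoseconds (assumption)
}

// RecordSuccess follows the "On success" steps.
func (h *CRDHealth) RecordSuccess() {
	h.totalReconciles.Add(1)
	h.consecutiveFails.Store(0)
	h.healthy.Store(true)
	h.lastReconcile.Store(time.Now().UnixNano())
}

// RecordFailure follows the "On failure" steps.
func (h *CRDHealth) RecordFailure(err error, degradeThreshold int64) {
	h.totalReconciles.Add(1)
	h.failedReconciles.Add(1)
	fails := h.consecutiveFails.Add(1)
	if err != nil {
		h.lastError.Store(err.Error())
	}
	if fails >= degradeThreshold {
		h.healthy.Store(false)
	}
}

// replay is an illustrative helper: it applies a sequence of outcomes
// (true = success, false = failure) to a fresh CRDHealth that is
// assumed to start out healthy.
func replay(outcomes []bool, degradeThreshold int64) *CRDHealth {
	h := &CRDHealth{}
	h.healthy.Store(true)
	for _, ok := range outcomes {
		if ok {
			h.RecordSuccess()
		} else {
			h.RecordFailure(errors.New("reconcile failed"), degradeThreshold)
		}
	}
	return h
}
```

With a threshold of 3, the sequence fail, fail, success, fail, fail leaves the CRD healthy because the success resets the consecutive counter, whereas three failures in a row mark it unhealthy. A single later success marks it healthy again.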

Degradation

A CRD becomes unhealthy when:

consecutiveFails >= degradeThreshold

The threshold is configurable per CRD in the Katalog. Unhealthy CRDs are visible in the Control Center and can trigger rollback if configured.


Health endpoints

Each CRD exposes its health through the operator’s health server:

GET /katalog/{crd}/health   — live health status (200 healthy, 503 unhealthy)
GET /katalog/{crd}          — configuration + health summary + provider stats
GET /katalog                — all CRDs with health

These endpoints power the Control Center dashboard, readiness checks, and any automation that needs to detect a failing CRD without watching the CR directly.
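
For automation that wants to consume the per-CRD health route, a small Go client might look like the sketch below. The function names and the base URL passed in by the caller are assumptions for illustration; only the route and the 200/503 status codes come from the list above. Any other status code is treated as an error, which is a design choice of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// healthURL builds the URL for GET /katalog/{crd}/health.
// baseURL is whatever address the operator's health server listens on.
func healthURL(baseURL, crd string) string {
	return strings.TrimRight(baseURL, "/") + "/katalog/" + url.PathEscape(crd) + "/health"
}

// interpretStatus maps the documented status codes to a health value.
// 200 means healthy and 503 means unhealthy; anything else is reported
// as an error (assumption of this sketch).
func interpretStatus(code int) (bool, error) {
	switch code {
	case http.StatusOK:
		return true, nil
	case http.StatusServiceUnavailable:
		return false, nil
	default:
		return false, fmt.Errorf("unexpected status %d from health endpoint", code)
	}
}

// CheckCRDHealth queries the health endpoint for one CRD.
func CheckCRDHealth(client *http.Client, baseURL, crd string) (bool, error) {
	resp, err := client.Get(healthURL(baseURL, crd))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return interpretStatus(resp.StatusCode)
}
```

A caller could run CheckCRDHealth on a timer with an http.Client that has a Timeout set, and alert when it returns false or an error.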