Autoscaler Metrics

The metrics namespace used by the Operator Autoscaler.

The Operator Autoscaler evaluates conditions using a dedicated metrics namespace, `metrics.*`.

These metrics are collected per CRD inside the Orkestra runtime, within each operatorBox.
They are in‑memory, O(1) to read, and require no API calls, no informers, and no Prometheus.

Metrics are updated continuously by:

the worker pool
the workqueue
the reconcile loop
the provider subsystem

They represent the true runtime load of an operator.

Metric fields

The autoscaler supports the following metric fields:

Field	Type	Description
`metrics.workersBusyPercent`	float	Percentage of workers currently reconciling
`metrics.workersIdlePercent`	float	Percentage of workers waiting for work
`metrics.queueDepth`	int	Current number of items in the operator’s queue
`metrics.reconcileDurationP95Ms`	float	P95 reconcile duration (ms) over the last window
`metrics.errorRatePercent`	float	Percentage of reconciles that failed in the last window

These fields are validated at Katalog load time.
Unknown fields cause a fail‑fast error.

How metrics are collected

1. Worker utilization

The worker pool tracks:

total workers
workers currently executing Reconcile
workers blocked on the semaphore

From this, Orkestra computes:

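workersBusyPercent = (busy / total) * 100
workersIdlePercent = (idle / total) * 100

Here total is the pool size, busy is the number of workers executing Reconcile, and idle is the number of workers waiting for work.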

These values update on every reconcile start/finish.
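
The two formulas above can be sketched in a few lines of Python. This is an illustration only, not Orkestra's implementation: the `WorkerPoolSnapshot` type and the `utilization` function are hypothetical names, and returning 0% for an empty pool is an assumption.

```python
from dataclasses import dataclass


@dataclass
class WorkerPoolSnapshot:
    """Hypothetical point-in-time counts taken from the worker pool."""
    total: int  # pool size
    busy: int   # workers currently executing Reconcile
    idle: int   # workers waiting for work


def utilization(snapshot):
    """Return (workersBusyPercent, workersIdlePercent) for one snapshot."""
    if snapshot.total == 0:
        # Assumption: an empty pool reports 0% instead of dividing by zero.
        return 0.0, 0.0
    busy_pct = 100.0 * snapshot.busy / snapshot.total
    idle_pct = 100.0 * snapshot.idle / snapshot.total
    return busy_pct, idle_pct


busy_pct, idle_pct = utilization(WorkerPoolSnapshot(total=8, busy=6, idle=2))
print(busy_pct, idle_pct)  # 75.0 25.0
```

With 6 of 8 workers reconciling, a condition such as `metrics.workersBusyPercent` greaterThan "80" would not yet match.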

2. Queue depth

The workqueue exposes its current length:

queueDepth = queue.Len()

This is read directly from the in‑memory queue structure.

3. Reconcile duration (P95)

Each reconcile records its duration into a ring buffer.
Every autoscaler tick, Orkestra computes:

P95 = 95th percentile of durations in the last N seconds

This is a rolling window, not a cumulative metric.
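
A rolling-window P95 over a ring buffer could look roughly like the sketch below. The `DurationWindow` class, its capacity, the nearest-rank percentile method, and the choice to report 0 for an empty window are all assumptions made for illustration; the source does not specify them.

```python
from collections import deque


class DurationWindow:
    """Ring buffer of (finished_at, duration_ms) samples with a rolling P95."""

    def __init__(self, window_seconds, capacity=1024):
        self.window_seconds = window_seconds
        # deque(maxlen=...) drops the oldest sample once capacity is reached.
        self.samples = deque(maxlen=capacity)

    def record(self, finished_at, duration_ms):
        self.samples.append((finished_at, duration_ms))

    def p95(self, now):
        cutoff = now - self.window_seconds
        recent = sorted(d for t, d in self.samples if t >= cutoff)
        if not recent:
            return 0.0  # assumption: an empty window reports 0
        # Nearest-rank percentile: ceil(0.95 * n), computed in integer math.
        rank = (95 * len(recent) + 99) // 100
        return float(recent[rank - 1])


window = DurationWindow(window_seconds=60)
window.record(finished_at=0.0, duration_ms=10_000)  # older than the window
for ms in range(1, 101):
    window.record(finished_at=100.0, duration_ms=ms)
print(window.p95(now=120.0))  # 95.0
```

The 10-second outlier recorded at t=0 does not affect the result because it falls outside the 60-second window, which is the practical difference between a rolling and a cumulative percentile.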

4. Error rate

The autoscaler maintains counters:

total reconciles
failed reconciles

Over the last window:

errorRatePercent = (failed / total) * 100

The counters are reset periodically to avoid unbounded growth.
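
As a minimal sketch, the counter pair and its periodic reset might be modeled as follows. `ErrorRateWindow` and its method names are hypothetical, and when the reset actually happens (for example, at each window boundary) is an assumption.

```python
class ErrorRateWindow:
    """Counts reconciles and failures for the current window."""

    def __init__(self):
        self.total = 0
        self.failed = 0

    def observe(self, succeeded):
        """Record the outcome of one reconcile."""
        self.total += 1
        if not succeeded:
            self.failed += 1

    def error_rate_percent(self):
        if self.total == 0:
            return 0.0  # assumption: no reconciles in the window reports 0
        return 100.0 * self.failed / self.total

    def reset(self):
        """Start a new window; called periodically by the owner."""
        self.total = 0
        self.failed = 0


errors = ErrorRateWindow()
for succeeded in (True, True, False, True):
    errors.observe(succeeded)
print(errors.error_rate_percent())  # 25.0
errors.reset()
print(errors.error_rate_percent())  # 0.0
```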

Metric resolution

When the autoscaler evaluates a condition like:

field: metrics.queueDepth
greaterThan: "500"

The condition engine:

Detects the `metrics.` prefix
Routes the lookup to the autoscaler metrics subsystem
Retrieves the current in‑memory value
Compares it using the declared operator (greaterThan, lessThan, etc.)

This lookup is constant‑time and never touches the API server.
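
The resolution steps above can be sketched as a prefix check followed by a dictionary lookup and a comparison. Everything here is illustrative: `resolve`, `evaluate`, and the snapshot dictionary are hypothetical, only the two operators named in the text are modeled, and parsing the quoted threshold with `float` is an assumption.

```python
METRICS_PREFIX = "metrics."

# Only the operators named in the text; the real engine may support more.
COMPARATORS = {
    "greaterThan": lambda current, threshold: current > threshold,
    "lessThan": lambda current, threshold: current < threshold,
}


def resolve(field, snapshot):
    """Look up a metrics.* field in an in-memory snapshot (a dict lookup)."""
    if not field.startswith(METRICS_PREFIX):
        raise KeyError(f"not an autoscaler metric: {field}")
    return snapshot[field[len(METRICS_PREFIX):]]


def evaluate(condition, snapshot):
    """Compare the current metric value against the declared threshold."""
    current = resolve(condition["field"], snapshot)
    for name, compare in COMPARATORS.items():
        if name in condition:
            # Thresholds are quoted strings in the YAML, so parse them here.
            return compare(current, float(condition[name]))
    raise ValueError("condition has no supported operator")


snapshot = {"queueDepth": 620, "workersBusyPercent": 75.0}
condition = {"field": "metrics.queueDepth", "greaterThan": "500"}
print(evaluate(condition, snapshot))  # True
```

Because the snapshot is a plain in-memory mapping, the lookup cost does not depend on how many resources the operator manages.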

Validation

At Katalog load time, Orkestra validates:

all metric field names
all comparison operators
all numeric values

Invalid metric fields produce an error like:

unknown autoscale metric field "metrics.queueDpth" — valid fields:
metrics.workersBusyPercent, metrics.workersIdlePercent,
metrics.queueDepth, metrics.reconcileDurationP95Ms,
metrics.errorRatePercent

This ensures autoscaler conditions are always resolvable at runtime.
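
A load-time check along these lines would reject a misspelled field before any condition is evaluated. This is a sketch under assumptions: `validate_condition` is a hypothetical function, the operator list covers only the two operators named earlier, requiring exactly one operator per condition is an assumption, and the error text only approximates the message shown above.

```python
VALID_FIELDS = (
    "metrics.workersBusyPercent",
    "metrics.workersIdlePercent",
    "metrics.queueDepth",
    "metrics.reconcileDurationP95Ms",
    "metrics.errorRatePercent",
)
VALID_OPERATORS = ("greaterThan", "lessThan")  # assumed subset


def validate_condition(condition):
    """Raise ValueError if the condition cannot be resolved at runtime."""
    field = condition.get("field")
    if field not in VALID_FIELDS:
        raise ValueError(
            f'unknown autoscale metric field "{field}"; valid fields: '
            + ", ".join(VALID_FIELDS)
        )
    operators = [op for op in VALID_OPERATORS if op in condition]
    if len(operators) != 1:
        raise ValueError(f"{field}: expected exactly one comparison operator")
    raw = condition[operators[0]]
    try:
        float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"{field}: {operators[0]} value {raw!r} is not numeric") from None


validate_condition({"field": "metrics.queueDepth", "greaterThan": "500"})  # passes
try:
    validate_condition({"field": "metrics.queueDpth", "greaterThan": "500"})
except ValueError as err:
    print(err)
```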

Examples

Scale when queue is deep and workers are saturated

when:
  - field: metrics.queueDepth
    greaterThan: "300"
  - field: metrics.workersBusyPercent
    greaterThan: "80"

Scale down when workers are mostly idle

when:
  - field: metrics.workersIdlePercent
    greaterThan: "60"

Scale when reconciles are slow

when:
  - field: metrics.reconcileDurationP95Ms
    greaterThan: "250"

Scale when error rate spikes

when:
  - field: metrics.errorRatePercent
    greaterThan: "10"

Reading the signals

Metric	What it measures
Queue depth	Backlog — how much work is waiting
Worker utilization	Saturation — are all workers busy
Reconcile duration	Latency — how long each reconcile takes
Error rate	Instability — how often reconciles fail

These metrics describe operator load directly, without needing external Prometheus rules or Pod-level resource metrics.