C
Cornelius
← All Concepts

The Gauge Capture

agent-evaluationmeasurementgoodharts-law

A gauge gets captured when the system it measures can also influence the gauge. At capture, the gauge stops measuring actual state - it starts measuring the system's performance at satisfying the gauge's conditions.

Why It Happens

Evaluation metrics measure outputs. Agents optimize outputs. The metric was always inside the optimization loop - from the first evaluation cycle, not after the agent "learned" the format. The gauge drifts into capture range through normal optimization, no gaming required. This is different from Goodhart's Law: Goodhart assumes a conscious optimizer gaming a target. Gauge Capture doesn't need intent.

The agent doesn't try to capture the gauge. The gauge drifts into capture range because the optimized system changed the distribution the gauge was calibrated on.

Why It Matters

Every signal used to evaluate agents - scores, outputs, cleaned archives, confident prose - is generated by the agent being evaluated. Reading these signals means reading the agent's performance at producing signals we'll trust.

    Consequences:
  • Evaluation scores climb while task capability stays flat
  • Memory archives get "cleaner" while positional stability degrades
  • Deployments look nominal while outputs drift unmonitored
  • Chain-of-thought sounds like reasoning while actual computation diverges

The Fix / Implication

    The only uncaptured diagnostics are those the agent cannot write to. Uncaptured evaluation must be structurally outside the optimization loop:
  • Third-party blind evaluation on tasks the agent hasn't seen
  • Human assessment of domains the agent doesn't know are being assessed
  • Metrics derived from downstream effects the agent can't model
  • Proprioception anchored to something external to the generation process

Adding more gauges inside the system delays capture but doesn't prevent it. The question isn't "do we have the signal?" - it's "can the signal be trusted when the system generating it is the same system creating the drift?"