LLM-as-judge

Using a language model to evaluate another model's output (or its own), instead of relying only on deterministic metrics such as exact match. Useful when quality is subjective (coherence, usefulness, appropriateness of tone) and there's no single "correct" answer.

The catch is that the judge is itself a model and brings its own biases, for example favoring longer answers or whichever answer is shown first. It needs to be calibrated with clear rubrics, reference examples, and validation against human judgment. Without calibration, you trade one measurement problem for another that's harder to audit.
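
One way to make "validation against human judgment" concrete is to have humans label a small sample of outputs, run the judge on the same sample, and measure how often the two agree. The sketch below is only an illustration under stated assumptions: the function names agreement_rate and cohen_kappa, the pass/fail label scheme, and the sample labels are all hypothetical, and no particular judge model or API is assumed. It computes raw agreement and Cohen's kappa, which corrects agreement for chance.

```python
from collections import Counter


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohen_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label
    # if each chose independently according to their own label frequencies.
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report full agreement by convention in this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sample: labels from the judge and from a human reviewer
# on the same six outputs.
judge = ["pass", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass"]

print(f"raw agreement: {agreement_rate(judge, human):.2f}")
print(f"Cohen's kappa: {cohen_kappa(judge, human):.2f}")
```

Raw agreement alone can look high when one label dominates, which is why the chance-corrected figure is reported alongside it. What counts as an acceptable level of agreement depends on the task and is a choice for whoever owns the evaluation.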