Failure-sliced eval

This is the article that should have been called “why one number is never enough.” If the rest of the evaluation lane is about turning failures into structured objects, this one is about the move that comes one step earlier — making sure the failures show up at all in the metric.

A model can look fine on aggregate accuracy and be quietly catastrophic on a category that matters. Negation. Long inputs. Compositional prompts. Transitive reasoning. Distractor-heavy passages. Specific tool-use paths. The number averages out. The category gets buried. By the time anyone notices, the system has been wrong about that category for weeks.

Sliced evaluation is the cheapest way to stop that from happening. Define the slices that matter to you. Report per-slice metrics alongside the aggregate. Treat any drop on a slice as data, not noise.
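
A minimal sketch of what "per-slice metrics alongside the aggregate" can look like. The record shape (`text`, `label`, `pred`), the slice names, and the predicates are all illustrative assumptions for this example, not the repo's actual API.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]  # hypothetical row shape: text, label, pred


def accuracy(rows: List[Record]) -> float:
    """Fraction of rows where the prediction equals the label."""
    if not rows:
        return float("nan")  # empty slice: report no estimate, not a fake 0
    return sum(r["pred"] == r["label"] for r in rows) / len(rows)


# A slice is a named boolean predicate over one record (names are made up).
SLICES: Dict[str, Callable[[Record], bool]] = {
    "negation": lambda r: " not " in f" {r['text'].lower()} ",
    "long_input": lambda r: len(r["text"].split()) > 8,
}


def sliced_report(rows: List[Record]) -> Dict[str, Dict[str, float]]:
    """Aggregate accuracy plus one entry per slice, each with its count."""
    report = {"aggregate": {"n": len(rows), "accuracy": accuracy(rows)}}
    for name, predicate in SLICES.items():
        subset = [r for r in rows if predicate(r)]
        report[name] = {"n": len(subset), "accuracy": accuracy(subset)}
    return report


# Toy data: the model is right everywhere except on negated sentences.
records: List[Record] = [
    {"text": "the movie was good", "label": 1, "pred": 1},
    {"text": "the movie was bad", "label": 0, "pred": 0},
    {"text": "the movie was not good", "label": 0, "pred": 1},
    {"text": "great acting and a great score", "label": 1, "pred": 1},
    {"text": "i did not like it", "label": 0, "pred": 1},
    {"text": "boring from start to finish", "label": 0, "pred": 0},
    {"text": "a long slow film that somehow earns every minute of its runtime",
     "label": 1, "pred": 1},
    {"text": "loved it", "label": 1, "pred": 1},
]

report = sliced_report(records)
for name, row in report.items():
    print(f"{name:<12} n={row['n']:<3} accuracy={row['accuracy']:.2f}")
```

On this toy data the aggregate is 0.75 while the negation slice is 0.00 on two examples. Reporting the count next to each accuracy matters: a slice with two rows is a prompt to collect more data, not a finished measurement.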

Why this exists

“How accurate was the model overall?” is the wrong first question.

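The first question is “where does it break, and how badly?” Aggregate accuracy hides structure on purpose; that’s what an aggregate is. If the slice you care about is 5% of your data and the model is 0% accurate on it, the aggregate drops by at most 5 points, a gap small enough to pass for noise or an ordinary difference between runs. The overall number gives you no way to tell that the whole loss sits in one category.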
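
The arithmetic is worth seeing once. This sketch treats the aggregate as a weighted mix of one slice and everything else. The 92% baseline is an assumed number for illustration only.

```python
def aggregate_accuracy(slice_share: float, slice_acc: float, rest_acc: float) -> float:
    """Overall accuracy as a weighted mix of one slice and the remaining data."""
    return slice_share * slice_acc + (1.0 - slice_share) * rest_acc


# Assumed baseline: 92% accuracy everywhere, including the 5% slice.
healthy = aggregate_accuracy(slice_share=0.05, slice_acc=0.92, rest_acc=0.92)

# Same model, but the 5% slice has collapsed to 0% accuracy.
broken = aggregate_accuracy(slice_share=0.05, slice_acc=0.0, rest_acc=0.92)

drop_in_points = (healthy - broken) * 100
print(f"healthy={healthy:.3f} broken={broken:.3f} drop={drop_in_points:.1f} points")
```

Under these assumptions a total failure on the slice costs about 4.6 points of aggregate accuracy. The drop can never exceed the slice's share of the data, which is exactly why a small, important category can fail completely without the headline number looking alarming.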

This is also the reason I keep saying static benchmarks are short-lived. A leaderboard with one number is a measurement artifact. The interesting question is whether the model gets better across a stratified set of slices, and whether the slices that matter most are the ones it improved on.

Basic flow

flowchart LR
  D[dataset] --> M["model_fn(x: str) -> int"]
  M --> R[evaluation records]
  R --> S[slices · boolean predicates]
  S --> PM[per-slice metrics]
  PM --> RPT[report]
  R --> AGG[aggregate metric]
  AGG --> RPT

The model interface is intentionally minimal:

def model_fn(x: str) -> int:
    ...

That keeps the harness portable. The same setup wraps sklearn classifiers, local models, API-backed LLMs, or anything else behind one surface. The point of the harness isn’t the model — it’s the per-slice math sitting in front of any model.
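
To make the "one surface" idea concrete, here is a rough sketch of adapting a predict-style object to the `str -> int` interface and turning a dataset into evaluation records. `KeywordClassifier`, `as_model_fn`, `run_eval`, and the `(text, label)` tuple layout are hypothetical names and shapes chosen for this example; the repo's own harness classes may look different.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

ModelFn = Callable[[str], int]


class KeywordClassifier:
    """Stand-in for any model with a batch predict() method (illustrative only)."""

    def predict(self, texts: List[str]) -> List[int]:
        return [1 if "good" in t.lower() else 0 for t in texts]


def as_model_fn(model: Any) -> ModelFn:
    """Adapt an object exposing predict(list) to the minimal str -> int surface."""

    def model_fn(x: str) -> int:
        return int(model.predict([x])[0])

    return model_fn


def run_eval(model_fn: ModelFn, dataset: Iterable[Tuple[str, int]]) -> List[Dict[str, Any]]:
    """Run the model over (text, label) pairs and keep one record per example."""
    records = []
    for text, label in dataset:
        pred = model_fn(text)
        records.append(
            {"text": text, "label": label, "pred": pred, "correct": pred == label}
        )
    return records


dataset = [("good film", 1), ("not good at all", 0), ("dull", 0)]
records = run_eval(as_model_fn(KeywordClassifier()), dataset)
for r in records:
    print(r)
```

Everything downstream (slices, per-slice metrics, the report) only needs the records, so swapping the model means writing a new adapter and nothing else.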

How it fits the broader system

Sliced metrics are the practical foundation for failure discovery work. You start with predefined slices (the ones you can already name) before you move to discovered slices (the ones the system finds for you in the failure data). Failure discovery on binary reasoning is a small controlled experiment grounded in the same slice-oriented view, where the validation question is whether unsupervised clusters align with known reasoning types.
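
One common way to ask whether clusters align with known types is cluster purity: for each cluster, count the most frequent true type, sum those counts, and divide by the number of items. The sketch below is a generic version of that calculation with made-up cluster IDs and type names; I'm not claiming it matches how the experiment script computes it.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def cluster_purity(cluster_ids: Sequence[Hashable], true_types: Sequence[Hashable]) -> float:
    """Share of items whose true type is the majority type of their cluster."""
    if len(cluster_ids) != len(true_types):
        raise ValueError("cluster_ids and true_types must have the same length")
    if not cluster_ids:
        return float("nan")

    # Tally the true types that landed in each cluster.
    by_cluster: Dict[Hashable, Counter] = defaultdict(Counter)
    for cluster, true_type in zip(cluster_ids, true_types):
        by_cluster[cluster][true_type] += 1

    # Each cluster contributes the size of its largest type group.
    majority_total = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Illustrative failures: cluster assignments vs. held-out reasoning types.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2]
true_types = [
    "negation", "negation", "transitive",
    "transitive", "transitive", "transitive", "negation",
    "compositional",
]

purity = cluster_purity(cluster_ids, true_types)
print(f"purity={purity:.2f}")
```

Purity rises trivially as the number of clusters grows (one item per cluster scores 1.0), so it only means something when read next to the cluster count.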

What ships in the repo

Repository: github.com/obversary/failure-sliced-eval

Sliced reporting. eval/harness.py provides SliceMetricsHarness (sklearn-style models → per-slice metrics) and EvalHarness (the minimal (str → int) interface plus a tuple-style dataset loader).
Failure episode log format (MVP). Each episode records config, decision_trace, output, success, and error_tag, and is validated with jsonschema. The schema lives at data/schema/failure_episode_mvp.schema.json. scripts/generate_mvp_grid.py emits a balanced shell JSONL with arms rotated; scripts/validate_episodes.py validates .json / .jsonl files against the schema.
A failure-discovery experiment. experiments/run_failure_discovery.py runs the binary-reasoning toy described in Failure discovery on binary reasoning: synthetic data → simple classifier → failure extraction → clustering → cluster purity validation against held-out reasoning types.
Slices as boolean predicates over result rows (long_input, low_confidence, errors, multi_step, etc.) in eval/slices.py.
Metrics: accuracy, Brier, normal-approx confidence intervals, counts (eval/metrics.py).
A pre-commitment to MVP-before-embeddings. docs/MVP_RESEARCH_DESIGN.md is the hypothesis and minimal schema. docs/PHASE2_CLUSTERING.md is the post-MVP roadmap (HDBSCAN/k-means, embeddings, three-view feature space). The Phase 2 work is intentionally not implemented yet — get the controlled dataset right first, then add the heavier machinery.
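
For reference, the three metrics named in the list above can be sketched in a few lines. These are textbook definitions written for this page, assuming binary labels and a predicted probability for class 1. The function names and signatures are mine and may not match eval/metrics.py.

```python
import math
from typing import Sequence, Tuple


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of predictions that match the label."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def brier_score(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared gap between predicted P(y=1) and the 0/1 label (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def normal_approx_ci(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) interval for a proportion, clipped to [0, 1]."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Illustrative slice of ten examples.
labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.9, 0.3, 0.2]
preds = [int(p >= 0.5) for p in probs]  # threshold at 0.5 (an assumption)

acc = accuracy(labels, preds)
brier = brier_score(labels, probs)
low, high = normal_approx_ci(acc, len(labels))
print(f"n={len(labels)} accuracy={acc:.2f} brier={brier:.3f} 95% CI=({low:.3f}, {high:.3f})")
```

With ten examples the interval runs from roughly 0.42 to 0.98, which is the useful part: it shows how little a small slice can tell you. The normal approximation is also known to behave poorly for small counts or accuracies near 0 or 1; a Wilson interval is a common alternative in those cases.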

That MVP-first discipline is the part I’m most attached to. It’s tempting to start with embeddings and let the clustering tell you what your slices should be. The cheaper move is to ship sliced metrics on slices you can already name, see what shows up, then decide whether discovered slices add anything the named ones don’t already cover.

What it isn’t

Not a leaderboard. Not a guarantee that any fixed slice set covers real deployment failures. It’s a scaffold for honest, stratified measurement — the layer underneath the failure-as-data work in the rest of the lane.