Failure-induced benchmarks
If failure is the second memory, and I think it is, then the obvious next question is what you do with it. This article and its repo answer one version of that. You take the failures the system already produced, treat them as evidence about where the system actually breaks, and use that evidence to build the next benchmark. You don’t pull yesterday’s static set off a shelf. You build one that targets the failure shapes you just observed.
That’s the bet of this repo. It’s a working scaffold for testing the bet, not a finished benchmark product.
Repository: github.com/obversary/failure-induced-benchmarks (currently at v0.2.0)
The reframe
Most evaluation looks like this:
flowchart LR
D["benchmark<br/>(fixed)"] --> M[model] --> S[score]
The dataset is fixed. The model is the variable. Failures are noise.
What I’m exploring looks like this:
flowchart LR
M[model f_θ] --> F["failures F<br/>(extraction operator)"]
F --> Mh["estimated structure M̂<br/>(features · clusters)"]
Mh --> G["induced benchmark D′<br/>(generator G)"]
G -.->|re-evaluate| M
Now the failures are the input and the benchmark is the variable. The central claim of the repo is one sentence:
Benchmarks should be definable as functions of model failure distributions, not only as fixed datasets.
If that’s true, evaluation isn’t passive measurement anymore. It’s inference over the geometry of how a system breaks. That reframe is the whole point, and it’s also the practical reason I think static benchmarks are going to be short-lived (see Why memory is the substrate).
Three hypotheses, all testable
The repo doesn’t ask you to take any of this on faith. It pre-commits to three concrete claims so the experiments can be honest about what’s being checked.
H1 — separation. A benchmark induced from failure structure produces more separation between capability levels than a static benchmark of comparable budget. Operationalized as model_separation_S — the population standard deviation of per-model accuracy on a slice. Higher means models are more distinguishable.
H2 — latent structure. An induced benchmark recovers latent reasoning structure (held-out structure_id labels in the toy data) more faithfully than a static one. Operationalized as macro-average cluster purity on failure clusters against those held-out labels.
H3 — convergence. Iterated failure feedback — feed the system, extract failures, induce a new slice, repeat — converges. Cluster geometry stabilizes across rounds. Items stop drifting. Operationalized in convergence.convergence_series as per-round separation, purity, Hungarian-matched centroid drift, and ARI between cluster assignments on overlapping failure ids round to round.
These are the three claims. None of them are proven yet. That’s the whole point of the harness — to make checking them mechanical instead of vibes-based. I’m putting them in writing because I’d rather be wrong about a specific claim in public than vague about a general one in private.
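Two of the H3 diagnostics named above, ARI between round-to-round cluster assignments and matched centroid drift, can be sketched in plain Python. This is a minimal illustration, not the repo's `convergence.convergence_series`: the function names are made up, the drift is assumed to be the mean distance over matched centroids, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimal matching, but only scales to small k).

```python
from itertools import permutations
from math import comb, dist


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two cluster assignments listed in the same id order."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("assignments must cover the same ids")
    if n < 2:
        return 1.0  # nothing to disagree about
    cells, rows, cols = {}, {}, {}
    for a, b in zip(labels_a, labels_b):
        cells[(a, b)] = cells.get((a, b), 0) + 1
        rows[a] = rows.get(a, 0) + 1
        cols[b] = cols.get(b, 0) + 1
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate partitions, e.g. one cluster in both rounds
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def matched_centroid_drift(prev, curr):
    """Mean distance between centroids under the cheapest one-to-one matching.

    Assumption: brute force over permutations replaces the Hungarian algorithm.
    """
    if len(prev) != len(curr):
        raise ValueError("both rounds need the same number of centroids")
    best_total = min(
        sum(dist(p, curr[j]) for p, j in zip(prev, perm))
        for perm in permutations(range(len(curr)))
    )
    return best_total / len(prev)


# Same partition with the cluster names swapped: ARI ignores the renaming.
ari = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # 1.0
# One centroid stayed put, the other moved by 0.5: mean matched drift is 0.25.
drift = matched_centroid_drift([(0.0, 0.0), (1.0, 1.0)], [(1.0, 1.0), (0.0, 0.5)])
```

Matching before measuring drift matters because cluster indices are arbitrary from round to round; without it, a pure relabeling would look like movement.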
A real H1 example, with the honesty
The repo ships a complete worked H1 example at data/reports/example_initial_h1/ — committed, replayable, with the full provenance block. The honest read of that example is that it does not support H1 yet, and the report says so explicitly. I’m walking through it here because the discipline of running the apparatus on real artifacts and writing down what it actually shows is exactly what I want this work to model.
The run used failure_geometry.harness.run_harness with:
two slices: static (a 3-row fixture) and carb (200 CARB items built in-process with seed=2026),
three stub predictors (discovery_skew, heuristic_a, heuristic_b),
budget 20 per slice per seed, 10 seeds, 500-iteration percentile bootstrap.
The headline numbers from harness_report.json (seed 0, both slices):
{
  "slice": "static",
  "n": 3,
  "accuracy_by_model": {
    "discovery_skew": 0.0,
    "heuristic_a": 0.667,
    "heuristic_b": 0.667
  },
  "model_separation_S": 0.314,
  "model_separation_S_ci": {
    "value": 0.314, "lo": 0.118, "hi": 0.471, "n_boot": 500
  }
}
{
  "slice": "carb",
  "n": 20,
  "accuracy_by_model": {
    "discovery_skew": 0.6,
    "heuristic_a": 1.0,
    "heuristic_b": 1.0
  },
  "model_separation_S": 0.189,
  "model_separation_S_ci": {
    "value": 0.189, "lo": 0.086, "hi": 0.293, "n_boot": 500
  }
}
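Both model_separation_S values can be reproduced from the reported per-model accuracies, reading S as the population standard deviation the way H1 defines it. A minimal sketch using the standard library; the function name `model_separation` is made up for illustration and is not the repo's API.

```python
from statistics import pstdev


def model_separation(accuracy_by_model):
    """Population standard deviation of per-model accuracy on one slice."""
    return pstdev(list(accuracy_by_model.values()))


static_acc = {"discovery_skew": 0.0, "heuristic_a": 0.667, "heuristic_b": 0.667}
carb_acc = {"discovery_skew": 0.6, "heuristic_a": 1.0, "heuristic_b": 1.0}

static_S = round(model_separation(static_acc), 3)  # 0.314
carb_S = round(model_separation(carb_acc), 3)      # 0.189
```

Note that with only three models, two of which tie, S is driven almost entirely by how far the one outlier sits from the other two.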
A separate static_vs_induced summary on the same seed produced static_separation_S = 0.314, failure_gen_separation_S = 0.157, n_failures = 7, n_generated = 3, failure_geometry_purity = 0.75.
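For reference, macro-average cluster purity can be sketched as follows, assuming "macro" means an unweighted mean over clusters of the share of members carrying the cluster's majority label. The function name and the toy labels are invented for illustration; they are not the report's data and this is not the repo's implementation.

```python
from collections import Counter, defaultdict


def macro_cluster_purity(cluster_ids, true_labels):
    """Unweighted mean, over clusters, of the majority-label share."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    purities = [
        Counter(labels).most_common(1)[0][1] / len(labels)
        for labels in members.values()
    ]
    return sum(purities) / len(purities)


# Toy example: cluster 0 is 3/4 pure, cluster 1 is 2/2 pure.
toy_clusters = [0, 0, 0, 0, 1, 1]
toy_labels = ["s1", "s1", "s1", "s2", "s3", "s3"]
toy_purity = macro_cluster_purity(toy_clusters, toy_labels)  # (0.75 + 1.0) / 2 = 0.875
```

Because the mean is unweighted, a tiny pure cluster counts as much as a large mixed one, which is one reason a purity figure on 7 failures should be read cautiously.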
What this does show:
The harness produces tidy multi-seed, budget-matched reports with bootstrap CIs.
The provenance block embeds git_sha, git_dirty, package version, Python and platform, dataset SHA256, model ids, seeds, and config, so the report can be replayed.
The event log captures every eval_record, every per-slice metric, and every aggregate, in JSONL.
failure_geometry_purity = 0.75 on a 7-failure clustering at k=2 is a sanity check that the geometry features and clustering wire together.
The static slice’s S_std = 0.0 confirms the budget-cap branch of the harness (with n=3 items, every seed sees the same 3 items by construction). The carb slice’s S_std = 0.055 over 10 seeds confirms the resampling branch.
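The two branches mentioned in the last point can be illustrated with a small sampling sketch. It assumes that a slice no larger than the budget is used whole and that a larger slice is sampled without replacement per seed; the function name and item ids are hypothetical, and the repo's harness may differ in the details.

```python
import random


def sample_slice(items, budget, seed):
    """Return the items one seed evaluates under a per-slice budget."""
    if len(items) <= budget:
        return list(items)  # budget-cap branch: every seed sees the whole slice
    return random.Random(seed).sample(items, budget)  # resampling branch


static_items = [f"static_{i}" for i in range(3)]
carb_items = [f"carb_{i}" for i in range(200)]

static_draws = [sample_slice(static_items, 20, seed) for seed in range(10)]
carb_draws = [sample_slice(carb_items, 20, seed) for seed in range(10)]
```

With identical draws on every seed, a deterministic predictor produces identical accuracies, so the across-seed spread of S on the 3-item slice has to be zero.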
What this does not show:
It does not support H1. Across seeds, S_mean(carb) < S_mean(static). That tells me nothing about whether induced benchmarks separate models better than static ones; it tells me the three rule-based stubs do not stratify on CARB the way they do on the seed-aligned static slice. H1 needs real models.
It does not support H2. A single failure_geometry_purity = 0.75 on 7 failures with a 2-cluster split has very wide CIs ([0.43, 0.92]).
It does not support H3. No iterative run is included in this example; the apparatus exists at python -m failure_geometry iterate.
It does not establish that CARB hardness axes are exercised. Stub predictors don’t “reason” on relations, distractors, or counterfactuals.
docs/reports/H1_initial.md writes that out plainly and lists the five things that would have to change for the report to be H1-relevant: real predictors, a static slice with n ≥ 200, more seeds, multiple sampling seeds at the model level, and pre-registration before looking at induced-slice accuracy. That’s the kind of receipt I want every claim on this site backed by.
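To make the CI fields concrete, here is a minimal percentile-bootstrap sketch for a separation statistic. It assumes the resampling unit is the item (drawn with replacement) and that every model is scored on the same resampled items; the repo's `stats.py` may resample differently. The function name is hypothetical, and the per-item correctness vectors are made up so that they reproduce the static slice's accuracies, not taken from the run.

```python
import random
from statistics import pstdev


def bootstrap_separation_ci(correct_by_model, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for separation S.

    correct_by_model maps model id -> list of 0/1 outcomes, aligned on items.
    """
    models = list(correct_by_model)
    n_items = len(correct_by_model[models[0]])
    rng = random.Random(seed)

    def separation(indices):
        accs = [
            sum(correct_by_model[m][i] for i in indices) / len(indices)
            for m in models
        ]
        return pstdev(accs)

    value = separation(list(range(n_items)))
    # Resample item indices with replacement and recompute S each time.
    stats = sorted(
        separation([rng.randrange(n_items) for _ in range(n_items)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return {"value": value, "lo": lo, "hi": hi, "n_boot": n_boot}


# Made-up outcomes with accuracies 0.0, 0.667, 0.667 on a 3-item slice.
toy_outcomes = {
    "discovery_skew": [0, 0, 0],
    "heuristic_a": [1, 1, 0],
    "heuristic_b": [1, 0, 1],
}
toy_ci = bootstrap_separation_ci(toy_outcomes, n_boot=500, seed=0)
```

On a 3-item slice there are only a handful of distinct resamples, so the interval reflects that coarseness more than anything about the models, which is part of why the report asks for a static slice with n ≥ 200.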
The observability layer
Every CLI run writes two things to data/reports/run_<timestamp>_<id>_<label>/:
events.jsonl: a typed JSONL event log. One line per typed event (run_started, harness_config, eval_record, metric_computed, run_finished). Sample lines from the H1 run:
{"ts_utc":"2026-04-29T17:34:57Z","run_id":"e71d99c4","stage":"run","event":"harness_config","payload":{"slices":["static","carb"],"sizes":{"static":3,"carb":200},"budget":20,"n_seeds":10,"n_boot":500}}
{"ts_utc":"2026-04-29T17:34:57Z","run_id":"e71d99c4","stage":"metric","event":"metric_computed","payload":{"name":"slice/carb/separation","value":0.189,"seed":0,"n":20}}
You can read events back with failure_geometry.run_log.read_events(run_dir) and post-process them with pandas / polars / duckdb.
A report (harness_report.json, summary.json, or iterate_report.json) with point estimates, bootstrap CIs, and a provenance block. The provenance block comes from failure_geometry.provenance.capture_provenance(...) and contains package_version, python_version, platform, git_sha, git_dirty, timestamp_utc, seeds, models, dataset_hashes (SHA256 per dataset path), and config. That’s what makes the report replayable. Without those fields, a number on a page is folklore.
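Because the log is plain JSONL, it can also be inspected with nothing but the standard library. The sketch below parses the two sample lines quoted above and pulls out one metric; the helper names are made up, and it deliberately does not call `read_events`, whose exact signature and return type are not shown here.

```python
import json

SAMPLE_LOG = """\
{"ts_utc":"2026-04-29T17:34:57Z","run_id":"e71d99c4","stage":"run","event":"harness_config","payload":{"slices":["static","carb"],"sizes":{"static":3,"carb":200},"budget":20,"n_seeds":10,"n_boot":500}}
{"ts_utc":"2026-04-29T17:34:57Z","run_id":"e71d99c4","stage":"metric","event":"metric_computed","payload":{"name":"slice/carb/separation","value":0.189,"seed":0,"n":20}}
"""


def parse_events(text):
    """One JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def metric_values(events, name):
    """Values of every metric_computed event with the given metric name."""
    return [
        event["payload"]["value"]
        for event in events
        if event["event"] == "metric_computed" and event["payload"].get("name") == name
    ]


events = parse_events(SAMPLE_LOG)
config = next(e["payload"] for e in events if e["event"] == "harness_config")
carb_separation = metric_values(events, "slice/carb/separation")  # [0.189]
```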
This isn’t ceremony. It’s the only way I trust my own runs after the fact. If the next run produces a different number, I want to know whether the code changed, the data changed, the seeds changed, or the model set changed — and the provenance block answers that question without me having to remember.
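That "what changed" question amounts to a field-by-field diff of two provenance blocks. A minimal sketch follows: the field names come from the list above, but the helper `provenance_diff` and all the values are invented for illustration.

```python
def provenance_diff(old, new):
    """Map each differing field to its (old, new) pair."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Hypothetical provenance blocks from two runs (values are placeholders).
run_a = {
    "package_version": "0.2.0",
    "git_sha": "aaaaaaa",
    "git_dirty": False,
    "seeds": [0, 1, 2],
    "models": ["discovery_skew", "heuristic_a", "heuristic_b"],
    "dataset_hashes": {"data/static/static_benchmark.jsonl": "hash-1"},
}
run_b = dict(run_a, git_sha="bbbbbbb", seeds=[0, 1, 2, 3])

changed = provenance_diff(run_a, run_b)
# changed has exactly two keys here: "git_sha" (code changed) and "seeds".
```

If the diff comes back empty and the numbers still differ, the remaining suspect is nondeterminism somewhere in the pipeline, which is worth knowing on its own.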
Why CARB exists
You can’t test any of this on tiny toy JSONL files where every model gets 100% accuracy. Saturated benchmarks don’t have failure structure to find. So the repo ships a generator I call CARB — Compositional Adversarial Reasoning Benchmark — that produces synthetic items with controlled difficulty along four axes:
Compositional depth — chains of relational premises (“A is heavier than B; B is heavier than C; …”) that get longer as depth increases.
Distractor density — irrelevant but plausible sentences mixed into the prompt.
Logical operator mixing — items that combine negation, quantifiers, and implication.
Counterfactual perturbation — hypothetical-vs-actual world traps.
Each generated item carries the design metadata you’d need to slice failures later — type, depth, distractor_count, operator_count. That metadata is the substrate clustering will work over.
A real row from the checked-in data/static/carb_sample.jsonl:
{
"id": "carb_logic_mix_1",
"x": "Not all cats are black. All black cats are small. If some cats are not small, does it imply that some cats are not black?",
"y": 1,
"type": "logic_mix",
"depth": 2,
"distractor_count": 0,
"operator_count": 4
}
And a deeper compositional one:
{
"id": "carb_chain_3",
"x": "A is taller than B. B is taller than C. C is taller than D. D is taller than E. E is taller than F. Is A taller than F?",
"y": 1,
"type": "compositional",
"depth": 5,
"distractor_count": 0,
"operator_count": 1
}
The generator is deterministic and seedable — the same seed produces the same items, which is the only way the comparisons later are honest. CARB is not a claim of “LLM-grade” coverage. It’s a controlled substrate where failures are less degenerate than on a tiny seed set, so the structure question can actually be tested.
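To show what "deterministic and seedable" looks like for the compositional axis, here is a toy generator that emits rows in the same shape as the chain example above. It is a sketch, not the repo's `carb.py`: the function names and the `toy_chain_` id prefix are made up, it covers only the depth axis, and it assumes `y = 1` encodes "yes".

```python
import random
import string


def make_chain_item(item_id, depth, rng):
    """Build one 'taller than' chain with `depth` premises and a yes/no question."""
    names = rng.sample(list(string.ascii_uppercase), depth + 1)
    premises = [f"{a} is taller than {b}." for a, b in zip(names, names[1:])]
    forward = rng.random() < 0.5  # ask in chain order (yes) or reversed (no)
    first, last = (names[0], names[-1]) if forward else (names[-1], names[0])
    question = f"Is {first} taller than {last}?"
    return {
        "id": item_id,
        "x": " ".join(premises + [question]),
        "y": int(forward),  # assumption: 1 means "yes"
        "type": "compositional",
        "depth": depth,
        "distractor_count": 0,
        "operator_count": 1,
    }


def build_toy_items(n, seed, max_depth=5):
    """All randomness flows through one seeded RNG, so output is reproducible."""
    rng = random.Random(seed)
    return [
        make_chain_item(f"toy_chain_{i}", rng.randint(1, max_depth), rng)
        for i in range(n)
    ]


items_a = build_toy_items(5, seed=42)
items_b = build_toy_items(5, seed=42)  # identical to items_a
```

Keeping every random draw on a single `random.Random(seed)` instance, rather than the module-level functions, is what prevents unrelated code from perturbing the item stream.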
Running the whole thing
A single CLI now drives every stage:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest
# Build CARB
python -m failure_geometry build-carb --n 2000 --seed 42 --output data/static/carb_benchmark.jsonl
# Static eval with bootstrap CI + provenance + event log
python -m failure_geometry eval-static --dataset data/static/static_benchmark.jsonl
# Static vs failure-induced comparison
python -m failure_geometry compare
# Multi-seed budget-matched harness for H1
python -m failure_geometry harness \
  --slice static=data/static/static_benchmark.jsonl \
  --slice carb=data/static/carb_benchmark.jsonl \
  --budget 200 --n-seeds 10 --n-boot 1000
# H3 iterative-feedback diagnostics
python -m failure_geometry iterate --rounds 5 --update substitute
No GPUs, no API keys. The point of the whole stack is that the apparatus runs locally so the expensive part of an honest experiment — the model — is the only thing you have to swap in.
What’s actually in the repo
Concrete things:
src/failure_geometry/: core library. schemas.py, dataset.py, evaluate.py, failures.py for the eval core; geometry.py, embedding.py, clustering.py, slices.py, profiles.py, discovery.py for failure-mode discovery; generator.py, experiment.py for induce-and-iterate loops; carb.py, adapters.py for the CARB substrate.
Observability: provenance.py, run_log.py, stats.py (bootstrap CIs).
Harness: harness.py (multi-seed budget-matched runner) and convergence.py (round-over-round H3 diagnostics).
CLI: cli.py / __main__.py (python -m failure_geometry ...).
Tests: the suite covers purity, kmeans, embedding, slices, profiles, CARB diversity, provenance, run-log, stats, harness, convergence, determinism, dataset metadata. The hypotheses don’t get to be vague; the code paths do get to be tested.
docs/: failure_manifold_theory.md (the failure manifold / benchmark induction / cross-model invariance hypotheses written as testable claims), carb_design.md, failure_discovery.md, experiment_design.md, metrics.md, observability.md, limitations.md, reports/H1_initial.md.
data/reports/example_initial_h1/: the H1 example walked through above. Committed. Replayable.
Where the math lives
The formal sketch in the repo (kept short on purpose):
Static benchmark: $D_{\text{static}} \sim P(X, Y)$
Failure-induced benchmark: $D_{\text{fg}} = G(F(f_\theta, D_{\text{seed}}))$
Where $F$ is the failure extraction operator, $G$ is a generator conditioned on estimated failure geometry, $f_\theta$ is the system being evaluated, and $D_{\text{seed}}$ is the input set. The point of writing it that way isn’t to dress the work up. It’s to make the operators explicit so when something doesn’t work, you know whether $F$ or $G$ broke. Most evaluation pipelines blur the two together and then can’t debug their own conclusions.
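Keeping the operators separate is easy to see in code. The toy below composes a failure extraction step and a templated generator as $G(F(f_\theta, D_{\text{seed}}))$; everything in it (the function names, the depth-only item schema, the stub model that fails past depth 2, the depth histogram used as estimated structure) is invented for illustration and is not how the repo implements either operator.

```python
from collections import Counter


def extract_failures(model, seed_items):
    """F: the seed items the model answers incorrectly."""
    return [item for item in seed_items if model(item) != item["y"]]


def induce_benchmark(failures, per_failure=2):
    """G: templated items conditioned on estimated failure structure.

    The 'structure' here is just a histogram of failure depth.
    """
    structure = Counter(item["depth"] for item in failures)
    induced = []
    for depth, count in sorted(structure.items()):
        for k in range(count * per_failure):
            induced.append({"id": f"induced_d{depth}_{k}", "depth": depth, "y": k % 2})
    return induced


def accuracy(model, items):
    return sum(model(item) == item["y"] for item in items) / len(items)


def stub_model(item):
    """Toy f_theta: right on shallow items, wrong on anything deeper than 2."""
    return item["y"] if item["depth"] <= 2 else 1 - item["y"]


seed_set = [{"id": f"seed_{d}", "depth": d, "y": d % 2} for d in range(1, 5)]
failures = extract_failures(stub_model, seed_set)  # depths 3 and 4
induced_set = induce_benchmark(failures)           # 4 items, all at depths 3 and 4

seed_accuracy = accuracy(stub_model, seed_set)        # 0.5
induced_accuracy = accuracy(stub_model, induced_set)  # 0.0
```

Because the two steps are separate functions, an empty or odd-looking induced set can be traced by inspecting `failures` first: if that list is wrong the problem is in $F$, otherwise it is in $G$.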
The longer version of the framing — failure manifold hypothesis, benchmark induction hypothesis, cross-model invariance hypothesis — lives in docs/failure_manifold_theory.md in the repo, next to the code that tests it.
What this isn’t
Brief, because every page on the site has been over-defending its scope.
It’s not a leaderboard. The current predictors are rule-based stubs, not language models. The generator $G$ doesn’t call an LLM yet — induced items are templated, on purpose, so the repo stays small and auditable. The H1 example walked through above is an example of the apparatus working, not evidence for H1. The point is to make the pipeline that would prove or disprove the hypotheses runnable, with the receipts to back any number it produces.
How this connects
This work doesn’t stand alone. It assumes the rest of the substrate exists.
Structured failure traces define the comparable record format that makes any of this possible.
Failure-sliced eval supplies the slice-level measurement upstream of benchmark synthesis.
Failure clusters as interventions is the version of this loop where you don’t induce a new benchmark — you change the system. Same input data, different output.
Memory-guided evaluation is where the failure traces come from in the first place.
Why memory is the substrate is the broader argument: cognitive range is what matters, static benchmarks measure the wrong thing, and induced benchmarks are one version of the fix.
The whole arc is one idea: failure is data, that data has structure, and the structure is good enough to act on — either by changing the system or by building harder questions for it. Both are the same loop, run in different directions.