Structured failure traces

Most evaluation logs treat a failure as a bad output — wrong label, failed assertion, low score. Easy to store, weak for diagnosis. The path that produced the output — which model the system picked, which tool it called, which memory it consulted or skipped — usually disappears.

This page describes a different unit of record: a structured failure trace — one trajectory through the system that ended in failure, with fields that stay stable across runs so traces stay comparable to each other.

For machine validation, the canonical home of the schema is the memoryevalguided repository. This page is the human-readable companion.

Repository: github.com/obversary/memoryevalguided

After pip install jsonschema, you can validate one example yourself from the repository root:

python -m jsonschema schemas/failure_episode_trace.schema.json \
  -i schemas/examples/retrieval_omission.example.json

Why failures need structure

Unstructured logs mix signal with noise. Without shared fields you can’t reliably ask:

  • Which configurations or trajectory shapes tend to fail together?

  • Did the system fail early (bad premise) or late (reasonable steps, wrong finish)?

  • Did memory state contribute, or did the failure happen elsewhere?

A schema makes failures comparable objects — same slots, different contents. That’s the prerequisite for clustering, mining patterns, and turning failures into failure-induced benchmarks. It’s the discipline that lets the second memory — the memory of mistakes — actually accumulate into something useful.

A real trace, end to end

This is schemas/examples/retrieval_omission.example.json, checked into the memoryevalguided repo and validated against the schema in CI. I’m reproducing the full document here because the shape is easier to read than the prose:

{
  "episode_id": "fet_retrieval_001",
  "timestamp": "2026-04-27T00:00:00Z",
  "task": {
    "input": "Summarize the prior notes about edge inference constraints.",
    "task_type": "retrieval",
    "constraints": {
      "latency_budget": null,
      "tool_constraints": ["local_memory_search"],
      "context_limits": 8000
    }
  },
  "system_state": {
    "model": "local_test_model",
    "routing_policy_version": "router_v0.1",
    "bandit_weights_snapshot": {
      "memory_search": 0.25,
      "direct_answer": 0.75
    },
    "active_strategy": "direct_answer",
    "memory_policy_version": "memory_v0.1",
    "toolset_available": ["local_memory_search"],
    "context_window_snapshot": "short_context"
  },
  "decision_trace": [
    {
      "step": 1,
      "decision_type": "strategy_choice",
      "options_considered": ["memory_search", "direct_answer"],
      "selected": "direct_answer",
      "probabilities": {"memory_search": 0.25, "direct_answer": 0.75},
      "context_features": {"query_mentions_prior_notes": true}
    }
  ],
  "execution_trace": [
    {
      "step": 1,
      "action": "answer_without_retrieval",
      "input": "Summarize the prior notes about edge inference constraints.",
      "output": "Generic answer about edge inference without specific prior constraints.",
      "latency": 0.42,
      "error": null
    }
  ],
  "memory_state": {
    "retrieved_memories": [],
    "memory_updates_applied": [],
    "memory_conflicts": [],
    "recency_bias_indicators": {
      "recent_memory_available": true,
      "retrieval_skipped": true
    }
  },
  "evaluation": {
    "success": false,
    "score": 0.2,
    "rubric_scores": {
      "correctness": 0.2,
      "reasoning_quality": 0.4,
      "tool_accuracy": 0.0,
      "efficiency": 0.7
    },
    "failure_mode_label": "retrieval_omission"
  },
  "failure_annotation": {
    "primary_failure_mode": "retrieval_omission",
    "secondary_failure_modes": ["under_routing_to_memory"],
    "suspected_cause": "routing",
    "human_interpretable_summary": "The system answered directly even though the task required prior memory retrieval.",
    "confidence": 0.9
  }
}

That’s one failure. Read what it tells you.

The task was to summarize prior notes about edge inference. The router had bandit weights of {memory_search: 0.25, direct_answer: 0.75} and picked direct_answer, even though the context features explicitly noted that the query mentioned prior notes. The execution layer answered without retrieval. The evaluator gave it 0.2 on correctness, 0.0 on tool accuracy, and 0.7 on efficiency (efficient only because it skipped the retrieval step it needed). The annotation labels this retrieval_omission, with suspected cause routing and confidence 0.9.
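
The same reading can be condensed into one triage line per trace. This is a minimal sketch, not part of the repository: read_trace is a hypothetical helper, and the dictionary is a trimmed copy of the example above holding only the fields the helper touches.

```python
# Trimmed copy of the example trace: only the fields this sketch reads.
trace = {
    "episode_id": "fet_retrieval_001",
    "decision_trace": [
        {
            "step": 1,
            "decision_type": "strategy_choice",
            "selected": "direct_answer",
            "probabilities": {"memory_search": 0.25, "direct_answer": 0.75},
        }
    ],
    "execution_trace": [{"step": 1, "action": "answer_without_retrieval"}],
    "failure_annotation": {
        "primary_failure_mode": "retrieval_omission",
        "suspected_cause": "routing",
        "confidence": 0.9,
    },
}


def read_trace(t: dict) -> str:
    """Condense one failure trace into a single line (hypothetical helper)."""
    # What the system chose at each decision step, and with what probability.
    decisions = ", ".join(
        f"{d['decision_type']}={d['selected']} (p={d['probabilities'][d['selected']]:.2f})"
        for d in t["decision_trace"]
    )
    # What it actually did, in order.
    actions = " > ".join(step["action"] for step in t["execution_trace"])
    ann = t["failure_annotation"]
    return (
        f"{t['episode_id']}: {decisions}; ran {actions}; "
        f"{ann['primary_failure_mode']} "
        f"(cause={ann['suspected_cause']}, confidence={ann['confidence']})"
    )


print(read_trace(trace))
```

The line keeps decision, execution, and annotation side by side, which is the order the prose above walks through.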

Collect many traces with that shape and you can ask sharp questions. Across all retrieval_omission failures, what bandit weights were active? What routing policy version? Which context features got missed? Compare two clusters of failures and you can ask what changed between policy v0.1 and v0.2. That’s diagnosis. None of it is possible unless the trace has the layered structure above.
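
Here is a rough sketch of the first two questions, using only the standard library. Assumptions: make_trace and profile_failure_mode are hypothetical helpers, and every trace except fet_retrieval_001 (including the tool_misuse label and router_v0.2 values) is made up for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean


def make_trace(episode_id, mode, policy_version, weights):
    """Build a trimmed trace holding only the fields read below."""
    return {
        "episode_id": episode_id,
        "system_state": {
            "routing_policy_version": policy_version,
            "bandit_weights_snapshot": weights,
        },
        "failure_annotation": {"primary_failure_mode": mode},
    }


def profile_failure_mode(traces, mode):
    """Summarize system_state across traces sharing one primary failure mode."""
    matching = [
        t for t in traces
        if t["failure_annotation"]["primary_failure_mode"] == mode
    ]
    # Which routing policy versions were active when this mode occurred?
    versions = Counter(
        t["system_state"]["routing_policy_version"] for t in matching
    )
    # What did the bandit weights look like, per strategy?
    weights = defaultdict(list)
    for t in matching:
        for strategy, weight in t["system_state"]["bandit_weights_snapshot"].items():
            weights[strategy].append(weight)
    return {
        "count": len(matching),
        "routing_policy_versions": dict(versions),
        "mean_bandit_weights": {s: mean(ws) for s, ws in weights.items()},
    }


traces = [
    make_trace("fet_retrieval_001", "retrieval_omission", "router_v0.1",
               {"memory_search": 0.25, "direct_answer": 0.75}),
    # Invented traces, for illustration only.
    make_trace("fet_retrieval_002", "retrieval_omission", "router_v0.1",
               {"memory_search": 0.5, "direct_answer": 0.5}),
    make_trace("fet_tool_001", "tool_misuse", "router_v0.2",
               {"memory_search": 0.5, "direct_answer": 0.5}),
]

profile = profile_failure_mode(traces, "retrieval_omission")
print(profile)
```

The query is only a few lines because every trace exposes system_state and failure_annotation in the same place. With free-form logs the same question would start with parsing.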

The layers, visually

The schema is layered on purpose, in the order events happen during a run:

flowchart TB
  task["task<br/>input · task_type · constraints"]
  state["system_state<br/>model · routing_policy · bandit_weights<br/>active_strategy · memory_policy · toolset"]
  decision["decision_trace<br/>step · decision_type · options · selected · probabilities"]
  exec["execution_trace<br/>step · action · input · output · latency · error"]
  mem["memory_state<br/>retrieved · updates · conflicts · recency"]
  eval["evaluation<br/>success · score · rubric · failure_mode_label"]
  ann["failure_annotation<br/>primary · secondary · suspected_cause · summary · confidence"]
  embed["embedding_fields<br/>task · trajectory · decision · failure · memory"]

  task --> state --> decision --> exec --> mem --> eval --> ann
  ann -. companion signal .-> embed

Each block is a slot in the schema, and each keeps the same shape across runs, which is what makes traces comparable objects. The same layers in prose:

  • task — what was asked. Without a stable description of inputs and success criteria, “failure” is ambiguous. Includes input, task_type (reasoning, retrieval, tool use, planning, coding, hybrid), constraints (latency budget, tool constraints, context limits).

  • system_state — identity and knobs at the boundaries that matter. model, routing_policy_version, bandit_weights_snapshot, active_strategy, memory_policy_version, toolset_available, context_window_snapshot. Not every hyperparameter, only the ones that would change conclusions if swapped.

  • decision_trace — ordered control flow above raw generations. The discrete steps where the system chose a branch, before execution produced any concrete effects: step, decision_type (model_choice, tool_choice, strategy_choice), options_considered, selected, probabilities, context_features.

  • execution_trace — concrete operations. step, action, input, output, latency, error (nullable). Tool calls, answer actions, environment steps, with a forensic trail.

  • memory_state — what the system used or skipped from memory. retrieved_memories, memory_embedding_snapshot, memory_updates_applied, memory_conflicts, recency_bias_indicators. Misleading or skipped memory is its own failure class; these fields separate retrieve faults from reason faults when the evidence supports the split.

  • evaluation — how failure was adjudicated. success, score, rubric_scores (correctness, reasoning quality, tool accuracy, efficiency), failure_mode_label (nullable).

  • failure_annotation — interpretive layer on top of raw scores. primary_failure_mode, secondary_failure_modes, suspected_cause (routing, memory, model, tool, evaluation, interaction), human_interpretable_summary, confidence.

  • embedding_fields — multi-view slots for clustering and similarity. task_embedding, trajectory_embedding, decision_embedding, failure_embedding, memory_embedding. Sibling signal at scale, not a substitute for reading traces.

The schema file linked above is authoritative for required vs optional fields and for embedding shapes (see $defs.embeddingValue).
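
Before running the real validator, a quick pre-check can tell you which layers a trace is missing. The sketch below is a convenience under stated assumptions, not a replacement for the schema: layer_report is a hypothetical helper, the layer names are taken from this page, and treating the first seven as expected and embedding_fields as optional is my assumption (the example trace above has no embedding_fields).

```python
# Layer names as listed on this page; the schema file remains authoritative
# for what is actually required.
EXPECTED_LAYERS = (
    "task",
    "system_state",
    "decision_trace",
    "execution_trace",
    "memory_state",
    "evaluation",
    "failure_annotation",
)
OPTIONAL_LAYERS = ("embedding_fields",)  # assumption: absent from the example
ENVELOPE_KEYS = ("episode_id", "timestamp")


def layer_report(trace: dict) -> dict:
    """Report expected layers that are absent and top-level keys not listed here."""
    known = set(EXPECTED_LAYERS) | set(OPTIONAL_LAYERS) | set(ENVELOPE_KEYS)
    return {
        "missing": [name for name in EXPECTED_LAYERS if name not in trace],
        "unexpected": sorted(key for key in trace if key not in known),
    }


# A deliberately incomplete, made-up trace.
partial = {
    "episode_id": "fet_demo_001",
    "task": {},
    "system_state": {},
    "evaluation": {},
    "notes": "free text",
}

report = layer_report(partial)
print(report)
```

The missing list comes back in run order (decisions before execution before memory), so gaps read in the same sequence as the layers themselves.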

Why this matters for clustering

Structured traces partition failure space far beyond wrong vs right. The same superficial wrong answer — “generic edge inference summary” — can unpack into:

  • a routing cluster (the bandit kept picking direct_answer when memory should have been consulted),

  • a memory cluster (retrieval ran but missed the right notes),

  • a tool cluster (the right tool was called but produced bad output),

  • a model cluster (everything upstream worked, the model just hallucinated).

These are different problems with different fixes. The trace is what makes telling them apart possible. What telling them apart turns into next is the subject of Failure clusters as interventions.
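
As a first pass at that partition, you can bucket traces by annotated cause. To be clear about what this is: it groups on failure_annotation.suspected_cause, not on embeddings, so it is a sketch of the idea rather than a clustering method. bucket_by_cause, the 0.7 confidence threshold, the needs_review bucket, and all fet_demo_* traces are assumptions made for illustration.

```python
from collections import defaultdict


def bucket_by_cause(traces, min_confidence=0.7):
    """Group episode ids by suspected cause; park low-confidence ones for review."""
    buckets = defaultdict(list)
    for t in traces:
        ann = t["failure_annotation"]
        # Low-confidence annotations are kept apart, not trusted as a cause.
        if ann["confidence"] >= min_confidence:
            key = ann["suspected_cause"]
        else:
            key = "needs_review"
        buckets[key].append(t["episode_id"])
    return dict(buckets)


def annotated(episode_id, cause, confidence):
    """Build a trimmed trace holding only the fields read above."""
    return {
        "episode_id": episode_id,
        "failure_annotation": {"suspected_cause": cause, "confidence": confidence},
    }


# Same surface symptom (a generic summary), different annotated causes.
traces = [
    annotated("fet_retrieval_001", "routing", 0.9),
    annotated("fet_demo_002", "memory", 0.8),
    annotated("fet_demo_003", "tool", 0.85),
    annotated("fet_demo_004", "model", 0.4),
    annotated("fet_demo_005", "routing", 0.75),
]

buckets = bucket_by_cause(traces)
print(buckets)
```

Keeping low-confidence annotations in their own bucket stops a guess from being counted as a routing or model problem.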

Research takeaway

  1. Persist full trajectories that ended in failure — not just summarized outputs.

  2. Keep consistent layers: task → state → decisions → execution → memory → eval → annotation.

  3. Treat embedding slots as versioned companions to the narrative fields, not replacements.

A corpus shaped this way produces clusters that point to repeatable failure geometries — raw material for failure-induced benchmarks, harder synthetic sets, or targeted retrains. Not just another aggregate error rate.

What this isn’t

Automatic diagnosis. The trace makes diagnosis possible. It doesn’t guarantee the correct root cause without further evidence — that’s why the schema has both evaluation.failure_mode_label (machine-assignable) and failure_annotation.suspected_cause (interpretive, with confidence). The two layers are intentionally separate, because conflating them is exactly the move that makes failure analysis stop working.