Memory-guided evaluation

The substrate (Memory Dropbox) is where this stack wants durable memory to live. The runtime (Obversary-OS) is where decisions get emitted as events. Memory-guided evaluation is the layer that asks the next question:

Given a memory of past runs, which next move should the system pick — and how does it learn from the outcome without collapsing routing and learning into the same opaque loop?

That’s it. That’s the whole layer. Routing is one job. Learning is another job. Treating them as the same job is how most “self-improving” agents become unobservable.

There are two repos in this lane and they cover different angles:

memory-guided-eval — fast loop / slow loop separation, with priors, weights, and a constraint optimizer.
memoryevalguided — the canonical home of the failure-trace JSON Schema and validated examples (see Structured failure traces).

This article is about the first one. The second has its own page.

Repository

Browse: github.com/obversary/memory-guided-eval
Clone: git clone https://github.com/obversary/memory-guided-eval.git

The core loop

flowchart LR
  M[(memory<br/>priors · weights · failure modes)]
  S[suggest · ranked pipelines]
  C[user choice]
  T[transform · pipeline run]
  E[evaluate · log outcome]
  L[learn · update weights]

  M -->|read only| S
  S --> C
  C --> T
  T --> E
  E --> L
  L -->|slow loop · only writer| M

  classDef fast fill:#1f2937,stroke:#60a5fa,color:#e5e7eb;
  classDef slow fill:#1f2937,stroke:#f59e0b,color:#e5e7eb;
  class S,C,T,E fast;
  class L slow;

Blue is the fast loop, which only reads memory. Amber is the slow loop, the only thing that writes to it. The loop has two distinct halves, and the discipline of the repo is keeping them separate.

Piece	Role	May mutate long-term memory?
Fast loop	Primer + scoring → ranked pipeline suggestions	No
Slow loop	Aggregate logs → update weights / failure modes	Yes
Memory	Priors, weights, failure counts (shared snapshot)	Updated only by slow loop

The design rule the repo enforces: the fast loop does not call memory.update(...). It ranks and suggests; execution and logging feed the slow loop. That’s the only way the routing decisions stay auditable — if the router is also rewriting its own priors mid-run, you can’t tell what the router was looking at when it decided.
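One way to make that rule hard to break by accident is to hand the fast loop a read-only view of memory. The following is a minimal sketch, assuming memory is a flat mapping of pipeline name to weight. `freeze` and `rank_pipelines` are hypothetical names for illustration, not functions from the repo, and the weights are made up.

```python
from types import MappingProxyType


def freeze(memory: dict) -> MappingProxyType:
    """Return a read-only view of a (hypothetical) memory snapshot."""
    return MappingProxyType(dict(memory))  # shallow copy, then read-only view


def rank_pipelines(weights, candidates):
    """Fast loop: rank candidates by stored weight, highest first. No writes."""
    return sorted(candidates, key=lambda name: weights.get(name, 0.0), reverse=True)


# Illustrative weights only; the repo's real priors live in JSON on disk.
snapshot = freeze({"extract": 0.2, "chunk": 0.5, "summarize": 0.9})
ranked = rank_pipelines(snapshot, ["extract", "chunk", "summarize", "benchmark"])

try:
    snapshot["chunk"] = 1.0  # any write attempted from the fast loop fails loudly
except TypeError:
    write_blocked = True
```

The view is shallow: nested dicts inside the snapshot would still be mutable, so a real guard would need a deep freeze or a convention that the router only ever receives copies.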

Why this matters

Most agent stacks I’ve read collapse the two loops. The model decides what to do next, and the same call updates whatever passes for memory. That works as a demo and falls apart as evaluation, because there’s no fixed reference frame. What was the prior the router was looking at when it picked this branch? You can’t answer that question after the fact if the prior got rewritten in the same step.

Splitting fast from slow is the cheapest way I know to make routing auditable. You read priors. You score. You suggest. You execute. You log. Some time later (next batch, next day, next pre-registered cycle), the slow loop reads the logs and updates the priors. The fast loop in the next run operates on a different prior, and you know exactly which one because the slow loop is the only thing that wrote to it.
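The "you know exactly which prior" property can be made checkable by stamping every logged decision with a fingerprint of the priors it read. This is a sketch under assumptions: the record fields and the `fingerprint` and `decision_record` helpers are illustrative, not the repo's actual log format.

```python
import hashlib
import json


def fingerprint(priors: dict) -> str:
    """Stable short hash of a priors snapshot (keys sorted so order is irrelevant)."""
    canonical = json.dumps(priors, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def decision_record(priors: dict, ranked: list, chosen: str) -> dict:
    """What the fast loop logs: its suggestions, the choice, and which priors it saw."""
    return {"priors_id": fingerprint(priors), "ranked": ranked, "chosen": chosen}


# Hypothetical priors before and after one slow-loop update.
priors_v1 = {"summarize": 0.9, "chunk": 0.5}
record = decision_record(priors_v1, ["summarize", "chunk"], "summarize")
priors_v2 = {"summarize": 0.8, "chunk": 0.6}

# Old records keep pointing at the snapshot they were made under.
same_snapshot = record["priors_id"] == fingerprint(priors_v1)
changed_after_update = record["priors_id"] != fingerprint(priors_v2)
```

If each slow-loop write also archives the snapshot under its fingerprint, any past routing decision can be replayed against the exact priors that produced it.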

What ships in `memory-guided-eval`

The repo is small on purpose. It’s an inspectable scaffold, not a framework.

memory-guided-eval/
├── docs/             # design notes (e.g., failure clusters as interventions)
├── memory/           # priors, clippings, failure modes (on disk JSON)
│   ├── priors/edge_inference.json
│   ├── failure_modes/failure_modes.json
│   └── clippings/
├── eval/             # experiments and logs
├── interventions/    # candidate policies from cluster signals
│   └── failure_cluster_interventions.py
├── pipelines/        # transformation stubs — replace with your steps
│   ├── extract.py
│   ├── chunk.py
│   ├── summarize.py
│   └── benchmark.py
├── router/           # fast loop
│   └── attention_router.py
├── optimize/         # slow loop
│   └── constraint_optimizer.py
└── run_eval.py       # minimal end-to-end demo

Memory is just JSON on disk. priors/edge_inference.json is read by the router; failure_modes/failure_modes.json is the count log the slow loop maintains. That’s it. The substrate’s the substrate; the eval layer doesn’t reinvent it.

Router (router/attention_router.py) is the fast loop. Reads memory, scores candidate pipelines from pipelines/, returns a ranked list of suggestions. Read-only.

Optimizer (optimize/constraint_optimizer.py) is the slow loop. Reads logged outcomes, updates pipeline weights and failure-mode counts. Mutates memory, but only here.
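To make the slow loop's job concrete, here is a hedged sketch of a single update step. It assumes each log entry carries a `pipeline` name, a `success` flag, and an optional `failure_mode` label, and it nudges each weight toward the observed success rate. Both the log shape and the update rule are assumptions chosen for illustration; they are not taken from `constraint_optimizer.py`.

```python
from collections import Counter


def slow_loop_step(memory: dict, logs: list, lr: float = 0.5) -> dict:
    """Return a NEW memory dict; the input snapshot is left untouched."""
    weights = dict(memory.get("weights", {}))
    failures = Counter(memory.get("failure_modes", {}))

    # Aggregate outcomes per pipeline before changing any weight.
    runs, wins = Counter(), Counter()
    for entry in logs:
        runs[entry["pipeline"]] += 1
        if entry["success"]:
            wins[entry["pipeline"]] += 1
        else:
            failures[entry.get("failure_mode", "unknown")] += 1

    # Move each weight part of the way toward the observed success rate.
    for name, n in runs.items():
        observed = wins[name] / n
        old = weights.get(name, 0.5)  # 0.5 is an arbitrary default for unseen pipelines
        weights[name] = old + lr * (observed - old)

    return {"weights": weights, "failure_modes": dict(failures)}


# Hypothetical memory and log lines.
memory = {"weights": {"summarize": 0.8}, "failure_modes": {}}
logs = [
    {"pipeline": "summarize", "success": True},
    {"pipeline": "summarize", "success": False, "failure_mode": "truncated_output"},
]
updated = slow_loop_step(memory, logs)
```

Returning a new dict instead of mutating in place keeps the previous snapshot available for comparison, and leaves the actual write to disk as one explicit step owned by the slow loop.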

Interventions (interventions/failure_cluster_interventions.py) is the bridge between failure clustering (Failure clusters as interventions) and the router. A cluster doesn’t just describe failures — it can tighten a tool constraint, flip a guardrail, or shift a bandit weight.
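As a rough illustration of that bridge, the sketch below maps one failure cluster to a candidate intervention of one of the three kinds named above. The cluster fields, the labels (`tool_misuse`, `unsafe_output`), the threshold, and the `Intervention` and `propose` names are all hypothetical; the repo's `failure_cluster_interventions.py` may be structured quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    kind: str      # "tool_constraint", "guardrail", or "weight_shift"
    target: str    # pipeline or tool the change would apply to
    detail: str    # human-readable description for review


def propose(cluster: dict, min_count: int = 3) -> list:
    """Turn one failure cluster into candidate interventions (never applied here)."""
    if cluster["count"] < min_count:
        return []  # too few failures to justify a change
    rules = {
        "tool_misuse": ("tool_constraint", "tighten allowed arguments"),
        "unsafe_output": ("guardrail", "enable output check"),
    }
    # Anything without a specific rule falls back to a routing weight shift.
    kind, detail = rules.get(cluster["label"], ("weight_shift", "lower routing weight"))
    return [Intervention(kind, cluster["pipeline"], detail)]


cluster = {"label": "tool_misuse", "pipeline": "extract", "count": 5}
candidates = propose(cluster)
```

The function only proposes. Applying a candidate stays a slow-loop write, so the fast/slow separation holds even when the signal comes from clustering.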

Running it

From the repo root:

python run_eval.py

That prints ranked suggestions for a sample document, then applies one slow-loop step from a synthetic log line. End to end on one console screen. No hidden state, no API keys.

What this is and isn’t

What it is:

A small scaffold that demonstrates fast-loop / slow-loop separation as a real architectural rule.
A place to test how clustered failures can become routing or memory interventions.
A reference for what “memory-guided” should mean operationally — guided by a substrate the router didn’t write itself.

What it isn’t:

Production routing. The scoring rules are toy. The priors are hand-authored.
A full bandit / RL stack. There’s no regret minimization, no exploration policy, no off-policy evaluation. The repo is explicit about that being roadmap, not done.
An LLM router. The pipelines are stubs.

The honest evidence posture, paraphrased from the repo’s own README: run_eval.py runs, and the loops are deterministic for the demo inputs. Regret-based updates, richer priors, embedding-based retrieval, and real pipeline execution are described but not verified; they’re the roadmap, not results.

How this fits

This layer sits between the substrate and the failure-driven feedback loops above it.

In the current repo, it reads priors and failure counts from JSON on disk. In the stack, that is the same boundary Memory Dropbox is meant to serve once the substrate integration is real.
The trajectories it produces become structured failure traces with a documented schema (in memoryevalguided), so failed runs become comparable objects rather than ad-hoc strings.
Failure clustering on those traces feeds back through Failure clusters as interventions, which is the bridge between “we noticed this pattern” and “we changed routing.”
When the same data is used to generate harder questions instead of changing routing, that’s failure-induced benchmarks: the same substrate, the other direction.

The thread connecting all of these: aggregate scores hide structure. The only way to know which part of a memory-augmented or tool-using system broke — routing, retrieval, tool misuse, or model limits — is to make the layers addressable in the data you store. This repo demonstrates that discipline at scaffold scale.