Research Engineering Map

This is how the work splits up in my head, and why it splits that way.

I started this site because I needed somewhere to put my notes that wasn’t a private folder, but the longer I spent on it the more obvious it became that the work isn’t a list of repos — it’s a stack. Every layer assumes the one underneath. If I describe the layers in the wrong order, none of it makes sense. So this page is the order.

If you want the bet the stack is built on, that’s Why memory is the substrate. This page is the architecture diagram.

The stack

flowchart TB
  subgraph adapt["adaptive feedback"]
    fib["failure-induced-benchmarks"]
  end

  subgraph evalLane["evaluation"]
    mge["memory-guided-eval"]
    meg["memoryevalguided · trace schema"]
    fse["failure-sliced-eval"]
  end

  subgraph runtime["runtime / orchestration"]
    oos["Obversary-OS · agents · workflows · APIs"]
  end

  subgraph ingestion["applied ingestion"]
    pdf["pdf-intelligence-core"]
  end

  subgraph substrate["memory substrate · sibling layers"]
    ed["earth-database<br/>local canonical core<br/>(SQLite · FTS5 · trust · observability)"]
    md["memory-dropbox<br/>event-sourced substrate experiment<br/>(Postgres · Redis · Qdrant)"]
  end

  subgraph foundations["foundations"]
    math["math-boundaries-ai-systems"]
  end

  ingestion --> ed
  ingestion --> md
  ed <-.-> md
  substrate --> runtime
  runtime --> evalLane
  evalLane --> adapt
  adapt -. failure as second memory .-> substrate
  foundations -. translation layer .-> runtime
  foundations -. translation layer .-> evalLane

The two memory repos sit at the same layer as deliberate siblings, not as “primary and backup.” earth-database is the small local canonical core where provenance, trust-boundary work, and observability discipline can ride close to the hot path. memory-dropbox is the larger event-sourced substrate experiment where derived memories, observation memories, and agent-facing workflows get tested at larger scope. Obversary-OS is the runtime layer above them. In the current prototype it records decisions through its own event-memory surface; the substrate adapters are the boundary this stack is built toward, not something I want to pretend is finished.

The reading order is bottom to top. You can start anywhere, but the dependencies flow in only one direction.

Substrate — memory, as two sibling repos

The bottom of the stack is the memory substrate, and I want to be specific about how it splits.

earth-database is the local canonical core. SQLite in WAL mode, FTS5, provenance, deterministic trust utilities, JSONL event logging, a scheduler table for slow background work. One process. One hot path. Small enough that provenance, trust labels, and observability can stay attached to the record instead of being explained later by a sidecar. If I had to pick the canonical memory substrate, it’s this one, because it’s the smallest honest version of the idea.
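To make the shape above concrete, here is a minimal sketch using only Python's standard `sqlite3` module. It is not earth-database's actual schema or API: the table names, the `source` and `trust` columns, and the `open_core`, `remember`, and `search` helpers are all assumptions for illustration. It also assumes the local SQLite build ships with FTS5, which most CPython builds do.

```python
import json
import sqlite3
import time
from pathlib import Path


def open_core(db_path: Path) -> sqlite3.Connection:
    """Open a single-file store in WAL mode with an FTS5 index (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    # Provenance (source) and the trust label live on the record itself.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id INTEGER PRIMARY KEY, body TEXT, source TEXT, trust TEXT)"
    )
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS records_fts USING fts5(body)")
    return conn


def remember(conn: sqlite3.Connection, log_path: Path,
             body: str, source: str, trust: str) -> int:
    """Store one record, index it, and append one event line to the JSONL log."""
    with conn:  # one transaction for the row and its index entry
        cur = conn.execute(
            "INSERT INTO records (body, source, trust) VALUES (?, ?, ?)",
            (body, source, trust),
        )
        record_id = cur.lastrowid
        conn.execute(
            "INSERT INTO records_fts (rowid, body) VALUES (?, ?)",
            (record_id, body),
        )
    event = {"ts": time.time(), "op": "remember", "id": record_id,
             "source": source, "trust": trust}
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")
    return record_id


def search(conn: sqlite3.Connection, query: str) -> list[tuple]:
    """Full-text search that returns provenance and trust alongside the text."""
    rows = conn.execute(
        "SELECT records.id, records.body, records.source, records.trust "
        "FROM records_fts JOIN records ON records.id = records_fts.rowid "
        "WHERE records_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()
```

The design choice the sketch tries to show is that every search hit already carries its source and trust label, so nothing has to be reconstructed later from a sidecar.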

Memory Dropbox is the event-sourced substrate experiment. Postgres as the system of record, Redis as the queue, Qdrant as the vector store, a worker process for slow derived work. It’s where the same discipline — capture, events, derived memory, observation memory, inspectable retrieval — gets tested at a larger scope than fits in one embedded database. Derived memories, observation memories, and agent-facing memory experiments live here as first-class layers.
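The event-sourced idea can be sketched without any of the infrastructure. The example below is an in-memory illustration only, not Memory Dropbox's code: the `Event` and `MemoryView` types and the `"captured"` and `"derived"` event kinds are hypothetical names. The point is that the log is the source of truth and every view is rebuilt by replaying it.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Event:
    seq: int                  # position in the append-only log
    kind: str                 # hypothetical kinds: "captured", "derived"
    payload: dict[str, Any]


@dataclass
class MemoryView:
    captured: dict[str, str] = field(default_factory=dict)
    derived: dict[str, list[str]] = field(default_factory=dict)


def apply(view: MemoryView, event: Event) -> MemoryView:
    """Fold a single event into the view; unknown kinds are ignored."""
    if event.kind == "captured":
        view.captured[event.payload["id"]] = event.payload["text"]
    elif event.kind == "derived":
        view.derived.setdefault(event.payload["from"], []).append(
            event.payload["summary"]
        )
    return view


def replay(events: list[Event]) -> MemoryView:
    """Rebuild the whole view from scratch by replaying events in log order."""
    view = MemoryView()
    for event in sorted(events, key=lambda e: e.seq):
        apply(view, event)
    return view


log = [
    Event(1, "captured", {"id": "n1", "text": "Qdrant holds the vectors."}),
    Event(2, "derived", {"from": "n1", "summary": "vector store = Qdrant"}),
]
view = replay(log)
```

Because derived memory is just another event, it can be thrown away and recomputed, and retrieval stays inspectable: any entry in the view traces back to the events that produced it.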

The two are deliberate siblings. Neither subsumes the other. Keeping them separate is the design.

I put memory first on purpose. Most stacks I see start with the model and bolt memory on as a side feature. I think that’s backward. The substrate is the part that makes a model into a system, and a system into something that can develop instincts.

Runtime — what to do, given the substrate

Obversary-OS sits conceptually on top of memory. It’s the layer that decides what to do when something happens. Roles, workflows, signals, tools, model interfaces. The thing that turns “we received a request” into “we picked a role, ran a workflow, called a tool, and recorded what happened.” Today that record lives in the prototype’s own event-memory view; the next honest step is wiring those events into the substrate below instead of letting the runtime pretend it owns memory.
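As a rough illustration of "picked a role, ran a workflow, called a tool, and recorded what happened", here is a small sketch. None of it is Obversary-OS code: the `Decision` record, the `handle` function, the routing rule, and the role, workflow, and tool names are all made up for the example.

```python
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    request: str
    role: str
    workflow: str
    tool: str
    result: str


def handle(request: str,
           tools: dict[str, Callable[[str], str]],
           log: list[dict]) -> Decision:
    """Route one request and record every choice that was made along the way."""
    # Toy routing rule (an assumption): questions go to a researcher role.
    role = "researcher" if request.rstrip().endswith("?") else "operator"
    workflow = "answer" if role == "researcher" else "execute"
    tool_name = "search" if workflow == "answer" else "echo"
    result = tools[tool_name](request)
    decision = Decision(request, role, workflow, tool_name, result)
    log.append(asdict(decision))  # the record is what a substrate would remember
    return decision


tools = {
    "search": lambda q: f"searched: {q}",
    "echo": lambda q: q,
}
event_log: list[dict] = []
decision = handle("What is WAL mode?", tools, event_log)
```

The routing rule is deliberately trivial. What matters is that each step is a named field in the record rather than something buried inside an agent loop.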

This is where modularity earns its keep. Not because separation of concerns is academically clean, but because the runtime needs to be the thing that makes the system’s own decisions inspectable. If the runtime hides its decisions inside one big agent loop, the substrate has nothing to remember.

Applied ingestion — does the substrate work on real input

PDF Intelligence Core is the first applied ingestion lane. Documents are common, messy, and structurally honest about being hard. They force the stack to deal with extraction, normalization, provenance, chunking, indexing, and evaluation — all the things a memory substrate has to ingest cleanly before it can claim to be useful on real input.
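The normalization, chunking, and provenance steps can be sketched on plain text, assuming extraction has already happened. This is not PDF Intelligence Core's implementation: the `Chunk` type, the whitespace-only normalization, and the character-based window sizes are simplifying assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_sha256: str   # hash of the normalized document (provenance)
    start: int        # character offsets into the normalized text
    end: int
    text: str


def normalize(text: str) -> str:
    """Collapse all runs of whitespace into single spaces."""
    return " ".join(text.split())


def chunk(text: str, size: int = 200, overlap: int = 20) -> list[Chunk]:
    """Split normalized text into overlapping windows that remember their origin."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    clean = normalize(text)
    doc_hash = hashlib.sha256(clean.encode("utf-8")).hexdigest()
    chunks: list[Chunk] = []
    for start in range(0, len(clean), size - overlap):
        end = min(start + size, len(clean))
        chunks.append(Chunk(doc_hash, start, end, clean[start:end]))
        if end == len(clean):
            break  # the last window reached the end of the document
    return chunks
```

Each chunk carries the document hash and its offsets, so anything indexed or retrieved later can be traced back to the exact span it came from.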

The same pattern can extend later to chats, notes, datasets, agent traces. PDFs are first because if the pipeline can’t survive PDFs, it won’t survive anything else.

Evaluation — making behavior measurable

This lane has three repos, each covering a different angle:

memory-guided-eval — fast-loop routing and slow-loop learning. Decision-time vs learning-time, kept separate so you can debug the system. Article: Memory-guided evaluation.
memoryevalguided — the canonical home of the failure-trace JSON Schema and example traces. Article: Structured failure traces.
failure-sliced-eval — sliced metrics, the MVP failure-episode log format, and the controlled binary-reasoning failure-discovery toy. Article: Failure-sliced eval.

The lane hub article is Evaluation Systems if you want to start there.

The thread connecting all three: aggregate scores hide structure. If you only have a number, you only know something is wrong. If you have traces, slices, and structured failures, you can find out what’s wrong, and you can do something about it.
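A toy example of how an aggregate hides structure. The data is invented for illustration, and the `slice` and `correct` field names are assumptions, not the failure-sliced-eval log format.

```python
from collections import defaultdict


def accuracy_by_slice(results: list[dict]) -> dict[str, float]:
    """Compute accuracy separately for each value of the 'slice' field."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [hits, count]
    for r in results:
        bucket = totals[r["slice"]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {name: hits / count for name, (hits, count) in totals.items()}


# Invented results: 9 of 10 short inputs correct, 1 of 5 long inputs correct.
results = (
    [{"slice": "short", "correct": True}] * 9
    + [{"slice": "short", "correct": False}]
    + [{"slice": "long", "correct": True}]
    + [{"slice": "long", "correct": False}] * 4
)

overall = sum(r["correct"] for r in results) / len(results)
by_slice = accuracy_by_slice(results)
```

Here the overall accuracy is 10 out of 15, which looks mediocre but unremarkable. The sliced view shows 0.9 on short inputs and 0.2 on long ones, which is a specific failure you can go and investigate.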

Adaptive feedback — failure as the second memory

Failure-induced benchmarks is where failures become the next test. Instead of running yesterday’s static benchmark forever, you build a benchmark that targets the failure shapes the system actually produced. The repo frames separation, latent structure, and convergence as hypotheses to measure, not conclusions already won; the value is the harness, provenance, and run artifacts that make the question inspectable.
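One simple way to picture "failures become the next test" is to allocate the next benchmark's item budget in proportion to the failure shapes that were observed. This is a sketch of that one idea, not the repo's harness: the `shape` field, the `induce_benchmark` function, the prompt generators, and the shape names are all hypothetical.

```python
from collections import Counter
from typing import Callable


def induce_benchmark(failures: list[dict],
                     generators: dict[str, Callable[[int], str]],
                     budget: int) -> list[dict]:
    """Build the next test suite, weighting each failure shape by how often it occurred."""
    counts = Counter(f["shape"] for f in failures)
    total = sum(counts.values())
    if total == 0:
        return []  # no observed failures, nothing to induce
    suite: list[dict] = []
    for shape, n in counts.most_common():
        quota = max(1, round(budget * n / total))  # every seen shape gets at least one item
        make_prompt = generators[shape]
        suite.extend(
            {"shape": shape, "prompt": make_prompt(i), "induced_from": n}
            for i in range(quota)
        )
    return suite[:budget]


# Invented failure log: three negation failures, one carry failure.
failures = [{"shape": "negation"}] * 3 + [{"shape": "carry"}]
generators = {
    "negation": lambda i: f"negation case {i}",
    "carry": lambda i: f"carry case {i}",
}
suite = induce_benchmark(failures, generators, budget=8)
```

Each item keeps an `induced_from` count so the provenance of the benchmark stays visible. Whether such a suite actually separates systems better than a static one is exactly the kind of hypothesis the harness is meant to measure.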

This is also where the long-term thesis on this site gets tested. If static benchmarks are short-lived — which is what I think — then induced benchmarks are one of the concrete ways to do better.

Security research — the adversary layer

The lane where the rest of the stack meets the adversary. Privacy as freedom has been the throughline of my thinking from before any of the AI work, and the AI frontier made that throughline practical in a new way: prompt injection, agent sandbox escapes, trust-boundary failures. This lane covers the architecture of that problem.

Security Research — the lane hub.
Prompt injection — doctrine article. Direct vs indirect injection, why it’s especially dangerous for agents, the three recent public disclosures (Google Antigravity, OpenAI Atlas, Five Eyes guidance), and the layered-defense playbook.

The implementation of that doctrine lives inside the canonical memory core itself. earth-database carries a trust/ module — deterministic classifier, injection scanner, allowlist-based policy gate, retrieval wrappers — and every decision it makes rides on the same JSONL event log as every other substrate operation. Trust isn’t a bolt-on layer. It’s part of what the canonical substrate is.
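A minimal sketch of what a deterministic gate with logged decisions could look like. It is not the earth-database `trust/` module: the regex patterns, the allowlist entries, and the `scan` and `gate` functions are assumptions for illustration, and a two-pattern scanner is nowhere near a real injection defense.

```python
import json
import re
from typing import TextIO

# Hypothetical patterns; a real scanner would need far broader coverage.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

# Hypothetical allowlist of sources that may reach retrieval.
ALLOWED_SOURCES = frozenset({"local-notes", "curated-docs"})


def scan(text: str) -> list[str]:
    """Return the patterns that matched; the same input always gives the same output."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]


def gate(text: str, source: str, log: TextIO) -> bool:
    """Allow text only from allowlisted sources with no scanner hits, and log the decision."""
    hits = scan(text)
    allowed = source in ALLOWED_SOURCES and not hits
    decision = {"op": "trust.gate", "source": source,
                "allowed": allowed, "hits": hits}
    log.write(json.dumps(decision) + "\n")  # one JSONL line per decision
    return allowed
```

The gate writes a log line whether it allows or denies, so a refusal is as inspectable as any other substrate operation.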

Foundations — knowing where math stops

Math Boundaries for AI Systems is the layer underneath the stack that I keep available so I don’t lie to myself about where engineering judgment takes over from formalism. Math defines the optimization problem. It doesn’t decide whether the problem was worth optimizing. That distinction matters everywhere in the rest of the stack, especially in evaluation.

I built it as a translator for myself when I was reading papers and code at the same time and trying to figure out which lines of one corresponded to which lines of the other. Other people might find it useful. Mostly it’s there so I have something to point to when I catch myself dressing up a hand-wave as math.

Why this is a stack and not a list

The reason I keep insisting it’s a stack:

A model on its own is just a function. A model with a memory substrate starts to look like a system. A runtime that records its decisions gives evaluation something real to inspect. Evaluation turns failures into harder questions. That feedback loop is the difference between “AI tool” and “research substrate.” Some of that loop is implemented now, and some of it is still the roadmap. The point of this map is to keep the boundary visible instead of hiding the unfinished parts behind a clean diagram.

That’s the architecture. The rest of the docs walk through each layer one at a time.

The recurring question

The question I keep coming back to, the one that sits underneath all of this:

How do we build AI systems that can preserve experience, evaluate their own behavior, learn from where they break, and stay legible while doing it?

That’s the thread. Every page here is one attempt at one piece of the answer.