PDF Intelligence Core

This is the first applied ingestion lane in the stack. The substrate (Memory Dropbox) needs real input to be tested against, and PDFs are the right first stress test — they’re common, they’re messy, and they’re structurally honest about being hard.

The point of this repo isn’t to be a “PDF tool.” It’s to be a verifiable spine for document intelligence: PDF in, layered artifacts out, every step inspectable, every transformation written to disk where you can open it.

Repository: github.com/obversary/pdf-intelligence-core

Three phases, in order

The pipeline is intentionally split into three phases, and the order matters:

  1. Core ingestion — PDF → validated Markdown + audit.

  2. Indexing / memory layer — Markdown → chunks → embeddings → FAISS index → trace artifacts.

  3. Deterministic graph structure — chunks → entities → co-occurrence edges → graph JSON with per-chunk provenance.

That’s the public story of the repo, in the order I want a reader to encounter it. It reads less like “I made a PDF tool” and more like “I’m building observable document intelligence from first principles.” Every phase adds structure you can open, diff, and rebuild from artifacts.

Artifact flow

flowchart LR
  PDF[("PDF<br/>data/inbox/")] -->|phase 1<br/>ingest| MD[("Markdown<br/>+ audit JSON")]
  MD -->|phase 2<br/>chunk| CHUNKS[("chunks<br/>+ trace")]
  CHUNKS -->|embed| EMB[("embeddings<br/>+ trace")]
  EMB -->|index| VEC[("FAISS index<br/>+ map.json")]
  CHUNKS -->|phase 3<br/>regex extract| GRAPH[("nodes + edges<br/>+ per-chunk provenance")]

Each arrow writes files to disk under predictable paths. That isn’t decoration. It’s the contract: if a stage doesn’t leave a file behind, it didn’t really happen.

data/inbox/        # raw PDFs in
data/markdown/     # extracted Markdown
data/audit/        # ingestion audit JSON
data/chunks/       # chunked Markdown with metadata
data/embeddings/   # per-chunk embedding payloads
data/vectors/      # FAISS index + map.json
data/traces/       # per-stage trace files
data/graphs/       # nodes.json, edges.json, graph_index.json
data/graphs/traces/  # per-chunk extraction traces

That layout is what makes the repo a substrate-friendly ingestion lane instead of a black box.
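
The "no file, didn’t happen" contract can be checked mechanically. The following is a minimal sketch, assuming each per-document stage names its output after the PDF stem as the layout above suggests. The helper name missing_artifacts and the PER_DOC_ARTIFACTS tuple are hypothetical and not part of the repo.

```python
from pathlib import Path

# Per-document artifacts, relative to the data root. The paths follow the
# layout listed above; the helper itself is illustrative, not repo code.
PER_DOC_ARTIFACTS = (
    "markdown/{stem}.md",
    "audit/{stem}.json",
    "chunks/{stem}.json",
    "embeddings/{stem}.json",
)


def missing_artifacts(data_root: Path, stem: str) -> list[str]:
    """Return the expected per-document files that are not on disk."""
    expected = [template.format(stem=stem) for template in PER_DOC_ARTIFACTS]
    return [rel for rel in expected if not (data_root / rel).is_file()]
```

Calling missing_artifacts(Path("data"), "report") after a run returns an empty list only if every per-document stage left its file behind for report.pdf.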

Phase 1 — core ingestion

pdf-core-ingest (or python -m cli ingest) walks every *.pdf in data/inbox/. Per file:

  1. Validate — PyMuPDF opens the file, counts pages, computes a SHA-256 digest of the bytes.

  2. Extract — pymupdf4llm.to_markdown is the preferred path for structured text. On failure, the pipeline falls back to pdfplumber page text and records the fallback reason in the audit.

  3. Normalize — collapse whitespace, join hyphenated line breaks, deduplicate exact repeated lines. Lossy by design — noisy PDFs need a deterministic normalizer more than they need a faithful one.

  4. Persist — Markdown to data/markdown/{stem}.md; audit to data/audit/{stem}.json with a timestamp.
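
The normalize step can be sketched with the standard library alone. This is an illustration of the three operations named above, not the repo’s implementation: the function name normalize, the choice to deduplicate lines across the whole document (rather than only adjacent repeats), and the handling of blank lines are all assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lossy, deterministic cleanup: the same input always gives the same output."""
    # Join words hyphenated across a line break: "inges-\ntion" -> "ingestion".
    # This also merges genuine compounds that happen to break at a hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    seen: set[str] = set()
    kept: list[str] = []
    for raw_line in text.splitlines():
        # Collapse runs of spaces and tabs, then trim both ends.
        line = re.sub(r"[ \t]+", " ", raw_line).strip()
        if not line:
            # Keep at most one blank line in a row as a paragraph break.
            if kept and kept[-1] != "":
                kept.append("")
            continue
        # Drop exact repeats (for example running headers); keep the first copy.
        if line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept).strip()
```

The sketch is idempotent: running it on its own output changes nothing, which is a cheap property to assert when checking that a normalizer is deterministic.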

Two things to notice. First, the audit captures which extractor ran and why. That means later, if the chunker produces weird output for a specific document, you can trace it back to whether pdfplumber got involved. Second, the SHA-256 of the input bytes ends up in the audit, which means the substrate downstream can hash-link a chunk all the way back to a specific binary file.
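
A sketch of how the extractor choice, the fallback reason, and the digest could land in one audit record. The audit field names, the function names, and the decision to treat any exception as a fallback trigger are assumptions for illustration; only pymupdf4llm.to_markdown, pdfplumber.open, and page.extract_text are real library calls. The extractors are passed in as parameters so the control flow can be exercised without either library installed.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable

Extractor = Callable[[Path], str]


def extract_primary(pdf_path: Path) -> str:
    import pymupdf4llm  # imported lazily so the sketch loads without it
    return pymupdf4llm.to_markdown(str(pdf_path))


def extract_fallback(pdf_path: Path) -> str:
    import pdfplumber
    with pdfplumber.open(str(pdf_path)) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_audit(
    pdf_path: Path,
    primary: Extractor = extract_primary,
    fallback: Extractor = extract_fallback,
) -> tuple[str, dict]:
    """Return (text, audit); the audit records which extractor ran and why."""
    audit = {
        "source": pdf_path.name,
        "sha256": hashlib.sha256(pdf_path.read_bytes()).hexdigest(),
        "extractor": "pymupdf4llm",
        "fallback_reason": None,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    try:
        text = primary(pdf_path)
    except Exception as exc:  # assumed policy: any failure triggers fallback
        audit["extractor"] = "pdfplumber"
        audit["fallback_reason"] = f"{type(exc).__name__}: {exc}"
        text = fallback(pdf_path)
    return text, audit
```

Because the digest is computed from the raw bytes before any extraction runs, it identifies the input file regardless of which extractor ends up producing the text.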

This phase doesn’t OCR raster pages. Image-only PDFs often yield empty or thin text. That’s a real limitation, called out in the phase doc, not papered over.

Phase 2 — indexing / memory layer

pdf-core-index (or python -m cli index) walks the Markdown produced by Phase 1 and turns each document into:

  • data/chunks/{stem}.json — chunked text with metadata.

  • data/embeddings/{stem}.json — one embedding payload per chunk.

  • data/vectors/index.faiss and data/vectors/map.json — the searchable index plus the chunk-id mapping.

  • data/traces/*.json — per-step trace artifacts.
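
The chunking strategy is not specified here, so the following is only a sketch of what "chunked text with metadata" could look like. It assumes paragraph packing under a character budget; the function names, the max_chars default, and the metadata keys are all hypothetical.

```python
import json
from pathlib import Path


def chunk_markdown(doc_id: str, markdown: str, max_chars: int = 800) -> list[dict]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []

    def flush() -> None:
        # Emit the paragraphs gathered so far as one chunk, then reset.
        if current:
            text = "\n\n".join(current)
            chunks.append({"doc_id": doc_id, "chunk_id": len(chunks),
                           "text": text, "char_count": len(text)})
            current.clear()

    for para in paragraphs:
        candidate_len = len("\n\n".join(current + [para]))
        if current and candidate_len > max_chars:
            flush()
        current.append(para)
    flush()
    return chunks


def write_chunks(data_root: Path, stem: str, chunks: list[dict]) -> Path:
    """Persist chunks to data/chunks/{stem}.json so the stage is inspectable."""
    out = data_root / "chunks" / f"{stem}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(chunks, indent=2), encoding="utf-8")
    return out
```

Sequential integer chunk ids keep the output stable across runs for the same Markdown, which is what the later phases need in order to reference a chunk by id.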

pdf-core-query "what is this document about?" runs a query against the built index. The interface is small on purpose; the value is the artifacts under it.
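
To show the role of the chunk-id mapping at query time, here is a dependency-free sketch. It deliberately swaps FAISS for an exact brute-force cosine search and takes vectors as plain lists, so it is a stand-in for the technique rather than the repo’s query path. It also assumes map.json amounts to a list where position i names the chunk stored in index row i, which the source does not confirm.

```python
import math


def top_k(query: list[float], vectors: list[list[float]],
          row_to_chunk: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Exact cosine search; row i of vectors belongs to row_to_chunk[i]."""

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = [(row_to_chunk[i], cosine(query, vec)) for i, vec in enumerate(vectors)]
    # Sort by score descending, then by chunk id so ties resolve deterministically.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]
```

A vector index only returns row numbers; without the mapping there is no way to get from a hit back to a chunk, and from there back to a document.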

Phase 3 — deterministic graph structure

This is the phase I care about most as a research design choice. The inputs are chunks only — no PDF reload, no embeddings passed in, no LLM edge synthesis. Per chunk (doc_id, chunk_id, text):

  1. regex_capital_phrase collects multi-token capital phrases (ordered, unique).

  2. Nodes are emitted, one per (entity × chunk) pair, with source_chunk_id preserved as "{doc_id}:{chunk_id}".

  3. co_occurs_in edges are emitted between every unordered distinct pair of entities that appeared in the same chunk text, stamped with source_chunk_id and doc_id.

The actual node and edge dataclasses from src/pdf_core/graph/schema.py:

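from dataclasses import dataclass


@dataclass
class Node:
    id: str
    label: str            # the extracted entity phrase
    source_chunk_id: str  # "{doc_id}:{chunk_id}"


@dataclass
class Edge:
    id: str
    source: str
    target: str
    relation: str         # "co_occurs_in" in this phase
    source_chunk_id: str  # "{doc_id}:{chunk_id}" of the originating chunk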

That source_chunk_id field is the entire point. Every edge in the graph can be traced back to the exact chunk it came from. There’s no opaque step between “raw text” and “graph link” — the rule is a regex, the relation is co_occurs_in, and the provenance is structural.
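
The three per-chunk steps can be sketched end to end. The dataclasses mirror the fields shown above; everything else is an assumption made for illustration: the exact regex (here, two or more consecutive capitalized words), the id formats, the use of node ids as edge endpoints, and the function name extract_graph.

```python
import re
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Node:
    id: str
    label: str
    source_chunk_id: str


@dataclass
class Edge:
    id: str
    source: str
    target: str
    relation: str
    source_chunk_id: str


# Assumed pattern: two or more consecutive capitalized words on one line.
CAPITAL_PHRASE = re.compile(r"\b[A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)+\b")


def regex_capital_phrase(text: str) -> list[str]:
    """Multi-token capital phrases, in order of first appearance, unique."""
    return list(dict.fromkeys(CAPITAL_PHRASE.findall(text)))


def extract_graph(doc_id: str, chunk_id: int, text: str) -> tuple[list[Node], list[Edge]]:
    source = f"{doc_id}:{chunk_id}"
    entities = regex_capital_phrase(text)
    # One node per (entity, chunk) pair; the id scheme is hypothetical.
    nodes = [Node(id=f"{source}:{label}", label=label, source_chunk_id=source)
             for label in entities]
    # Sorting before pairing fixes the orientation of each unordered pair.
    edges = [
        Edge(id=f"{source}:{a}|{b}", source=f"{source}:{a}", target=f"{source}:{b}",
             relation="co_occurs_in", source_chunk_id=source)
        for a, b in combinations(sorted(entities), 2)
    ]
    return nodes, edges
```

With n distinct entities in a chunk, this emits n nodes and n(n-1)/2 edges, and running it twice on the same chunk yields equal output. A pattern this simple will also glue a capitalized sentence opener onto a following phrase; that is a visible, documented failure mode rather than a hidden one.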

Outputs:

  • data/graphs/nodes.json

  • data/graphs/edges.json

  • data/graphs/graph_index.json — counts and paths

  • data/graphs/traces/<doc_id>_chunk_<id>.json — method, extracted entities, and edge count; one trace per indexed chunk occurrence.
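
A sketch of writing one such per-chunk trace. The file name follows the pattern listed above and the recorded items (method, entities, edge count) come from the same list; the JSON key names and the function name write_chunk_trace are assumptions.

```python
import json
from pathlib import Path


def write_chunk_trace(graph_root: Path, doc_id: str, chunk_id: int,
                      entities: list[str], edge_count: int) -> Path:
    """Write traces/<doc_id>_chunk_<id>.json under graph_root and return its path."""
    trace = {
        "doc_id": doc_id,
        "chunk_id": chunk_id,
        "method": "regex_capital_phrase",
        "entities": entities,
        "edge_count": edge_count,
    }
    out = graph_root / "traces" / f"{doc_id}_chunk_{chunk_id}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and a fixed indent keep the file byte-stable across runs,
    # so a plain diff shows real changes only.
    out.write_text(json.dumps(trace, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return out
```

Since every unordered pair of distinct entities produces one edge, a trace with n entities should report n(n-1)/2 edges; a mismatch is an easy signal that extraction and edge emission have drifted apart.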

pdf-core-graph (or python -m cli graph) runs the whole thing.

Why no LLM edges (yet)

The graph phase deliberately doesn’t call an LLM. That’s not because LLM-generated edges are bad — it’s because they belong in a layer above this one. The point of this repo is to build a graph whose every edge can be audited against a documented rule. Once that auditable spine exists, you can layer LLM-derived edges on top with their own provenance. If you start with LLM edges, you can never go back and say “this edge came from regex rule X on chunk Y.” You’ve already lost the ground truth.

That’s the same discipline the rest of the stack uses: the substrate has to be honest before the speculative layers above it are allowed to be useful.

Running the whole thing

Native:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e .

mkdir -p data/inbox
cp path/to/file.pdf data/inbox/
pdf-core-ingest
pdf-core-index
pdf-core-query "what is this document about?"
pdf-core-graph

Docker:

docker compose build
docker compose up -d
docker exec -it pdf_intelligence_core bash

python -m cli ingest
python -m cli index
python -m cli query "what is this document about?"
python -m cli graph

docker compose down

The container is the execution layer; the host-mounted ./data/ tree is the persistent artifact layer. The image pins Python, system packages, and PDF-parsing dependencies so python -m cli ... behaves the same across machines. You can stop or rebuild the container without losing files, as long as data/ stays on the host.

Design principles

  • Deterministic first. Same chunks → same graph (regex entities, enumerated co-occurrence edges). Randomness is not the organizing idea.

  • Observable transformations. Files on disk (Markdown, JSON, FAISS, traces) over hidden in-memory-only state where practical.

  • Provenance-bearing edges. source_chunk_id ties every edge back to its origin, so you can answer “why does this link exist?”

  • No LLM-generated graph edges in v0.1. Auditability stays in the v0.1 layer; speculative LLM steps belong above it.
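
The provenance principle can be made concrete with a lookup that goes from an edge back to its evidence. This sketch assumes doc_id matches the file stem under data/chunks/, and that each chunk file is a JSON list of objects with chunk_id and text keys; neither detail is confirmed by the source, and the function name is hypothetical.

```python
import json
from pathlib import Path


def why_does_this_link_exist(edge: dict, data_root: Path) -> dict:
    """Resolve an edge's source_chunk_id back to the chunk text it came from."""
    # "{doc_id}:{chunk_id}" -> split on the last colon in case doc_id has one.
    doc_id, _, chunk_id = edge["source_chunk_id"].rpartition(":")
    chunk_file = data_root / "chunks" / f"{doc_id}.json"
    chunks = json.loads(chunk_file.read_text(encoding="utf-8"))
    for chunk in chunks:
        if str(chunk["chunk_id"]) == chunk_id:
            return {
                "rule": edge["relation"],
                "doc_id": doc_id,
                "chunk_id": chunk_id,
                "evidence": chunk["text"],
            }
    raise KeyError(f"chunk {chunk_id!r} not found in {chunk_file}")
```

The answer is always of the same shape: a named rule, a chunk address, and the text that satisfied the rule. That is what makes an edge auditable without rerunning the pipeline.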

What this isn’t

  • Not a production RAG platform.

  • Not a full knowledge graph product. The graph is structural scaffolding with provenance, not an exhaustive ontology.

  • Not an autonomous agent. The pipeline exposes deterministic transformations and searchable indices.

  • Not a hosted UI product. The value is the cloneable pipeline and the artifacts.

If you need polish or a single-button demo, that’s out of scope. Wrap or extend deliberately.

Where this fits

  • The artifacts produced here (Markdown, chunks, traces, vectors, graphs) are exactly the kind of input the substrate is designed to ingest. This lane is the first concrete source of real-world artifacts for testing whether Memory Dropbox holds up beyond hand-authored text.

  • The audit JSON and per-chunk traces are also exactly the shape that the failure-tracing work in Memory-guided evaluation and Structured failure traces can latch onto when something goes wrong downstream.

  • For tool selection (which extractor for which document, why no single converter wins), see the companion article: PDF to Markdown Tools for AI Pipelines.

  • The thesis (Why memory is the substrate) is the long version of why this lane matters: ingestion isn’t a side feature, it’s how the substrate accumulates the experience that everything else gets built from.

Research takeaway

Document intelligence has a verifiable spine: PDF in, layered outputs out, each step explicit, every edge tied back to source text. This phase makes intermediate results inspectable. It doesn’t, by itself, claim the stack is “complete” or “intelligent.” It’s a deterministic skeleton meant to compose with honest evaluation downstream.