PDF staging workspace¶
A staging layout keeps raw PDFs, intermediate files, and outputs separated before or beside a full pipeline run. This page is a small directory convention, nothing more — PDF to Markdown Tools for AI Pipelines covers converter choice and evaluation in depth, and PDF Intelligence Core is the runnable pipeline this convention feeds into.
Why have one at all¶
Document work fails quietly when inputs, half-finished Markdown, and logs share one folder. Two days in and you can’t tell what’s safe to delete, what’s an intermediate, or which version of a converted document the downstream system actually saw. A small directory convention (inbox, processing, markdown, logs) makes that obvious, and it keeps the staging discipline the same whether you stop at Markdown or push files into a real pipeline like pdf-intelligence-core.
Suggested layout¶
pdf-staging/
├── inbox/ # raw PDF input
├── processing/ # intermediate files
├── markdown/ # Markdown output
├── logs/ # conversion logs
├── scripts/ # pipeline code
└── env/ # isolated Python environment
Minimal install notes live in the practical setup section of the PDF–MD tools article — same install, same intent.
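The tree under Suggested layout can be created with a few lines of Python. This is a minimal sketch, not part of any published tool: the function name `create_staging_tree` and the constant `STAGE_DIRS` are made up for illustration, and `env/` is deliberately skipped on the assumption that a virtual-environment tool creates it.

```python
from pathlib import Path

# Stage directories taken from the suggested layout. env/ is left out on
# purpose (assumption: a tool such as `python -m venv` creates it).
STAGE_DIRS = ("inbox", "processing", "markdown", "logs", "scripts")


def create_staging_tree(root: Path) -> list[Path]:
    """Create the staging directories under root and return their paths."""
    created = []
    for name in STAGE_DIRS:
        path = root / name
        # exist_ok makes the call safe to repeat on an existing workspace.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_staging_tree(Path("pdf-staging"))` from the parent directory builds the tree; running it again on an existing workspace changes nothing.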
Operating conventions¶
Treat the tree as one-way as much as you can: raw files land in inbox/, work happens in processing/, finished Markdown lands in markdown/. If you overwrite outputs in place while a downstream job still has an old path open, you get “it worked yesterday” ghosts, so prefer copying a file forward into the next stage over editing the same filename in multiple stages.
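One way to enforce the one-way flow is a small helper that copies a file into the next stage and refuses to overwrite anything already there. The sketch below is an assumption about how you might script it; `copy_forward` is a hypothetical name, not something pdf-intelligence-core provides.

```python
import shutil
from pathlib import Path


def copy_forward(source: Path, dest_dir: Path) -> Path:
    """Copy source into dest_dir, refusing to overwrite an existing file."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / source.name
    if target.exists():
        # Overwriting in place is what the convention tries to avoid, so
        # fail loudly and let the caller pick a new, versioned filename.
        raise FileExistsError(f"{target} already exists; choose a new name")
    # copy2 also preserves timestamps; the source file stays where it is.
    shutil.copy2(source, target)
    return target
```

The existence check and the copy are separate steps, so this is only suitable for a single-user local workspace, which is the scope of this page anyway.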
Logs: write one log file per batch or per document with a timestamp in the name; when conversion fails halfway, the log tells you whether the crash was ingestion, OCR, or a downstream sanitizer without reopening fifteen terminals.
Naming: an optional but useful prefix, YYYYMMDD- or a short content id, keeps two versions of the “same” PDF from clobbering each other in markdown/.
Suspect inputs: untrusted uploads and files still waiting on a malware scan belong in their own subdirectory (for example inbox/quarantine/) until you’ve decided they’re allowed to touch your toolchain. Same layout, stricter ingress.
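The logging and naming conventions above can be sketched as two filename builders. Everything here is illustrative: the function names are hypothetical, the content id is assumed to be a truncated SHA-256 of the PDF bytes, and the sketch combines the date prefix and the content id even though the convention only asks for one of them.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def short_content_id(pdf_path: Path, length: int = 8) -> str:
    """Return the first hex characters of the file's SHA-256 digest."""
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    return digest[:length]


def staged_markdown_name(pdf_path: Path, when: datetime) -> str:
    """Build a <YYYYMMDD>-<content id>-<stem>.md name for markdown/."""
    return f"{when:%Y%m%d}-{short_content_id(pdf_path)}-{pdf_path.stem}.md"


def log_path(logs_dir: Path, pdf_path: Path, when: datetime) -> Path:
    """Build a per-document log path with a timestamp in the filename."""
    return logs_dir / f"{when:%Y%m%dT%H%M%S}-{pdf_path.stem}.log"
```

Because the id is derived from the file contents, a re-exported PDF with the same filename gets a different Markdown name, while re-converting an unchanged PDF on the same day produces the same name and is easy to spot as a duplicate.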
None of this replaces provenance inside a pipeline like pdf-intelligence-core; it keeps your local workspace legible enough that promoting files into that pipeline doesn’t inherit a messy history.
How this fits¶
The staging workspace is the local organization layer around extraction. pdf-intelligence-core uses its own data/ tree (inbox, markdown, audit, chunks, vectors, graphs) as the persistent artifact layer for a runnable pipeline. Staging is a lightweight precursor to that, or a manual drop zone before files move into the structured pipeline.
Repository¶
There is no separate repository for “staging” alone. It’s a convention. Runnable pipeline code lives in pdf-intelligence-core.
It’s not a hosted product or a substitute for proper provenance in production. It’s a practical default for local experimentation that keeps you from making a mess of your own desk.