PDF to Markdown Tools for AI Pipelines
Turning PDFs into clean Markdown is the unglamorous step almost every document-AI pipeline depends on, and it’s where most of them quietly fail. This article is the tool landscape: what exists, what the trade-offs are, and how to evaluate converters on your files instead of generic ones. PDF Intelligence Core is the runnable companion — an assembled pipeline built from these ideas, with pdf-intelligence-core on GitHub as the cloneable codebase.
Treat PDF ingestion as infrastructure, not a script you wrote once on a Friday. Conversion is rarely one checkbox. It’s a pipeline:
flowchart LR
    P[PDF parsing] --> L[layout detection]
    L --> O[reading order]
    O --> T[table extraction]
    T --> I[image / diagram handling]
    I --> SC[semantic cleanup]
    SC --> MD[Markdown formatting]
    MD --> CH[chunking]
    CH --> ST[storage]
    ST --> E[evaluation]
Markdown helps because it gives predictable headings, lists, and code blocks: chunks can align with sections, navigation stays simple, and files version like normal source. PDFs rarely arrive with that structure intact, so quality is mostly won or lost at extraction and cleanup.
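As a rough illustration of section-aligned chunking, the sketch below splits Markdown at heading lines. The function name `split_by_headings` and the `max_level` parameter are hypothetical, and the sketch deliberately ignores fenced code blocks that might contain lines starting with `#`.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str, max_level: int = 3) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at headings up to max_level.

    Hypothetical helper: text before the first heading gets an empty heading.
    """
    chunks: list[tuple[str, str]] = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            # Close the previous chunk before starting a new section.
            if heading or any(part.strip() for part in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            # Deeper headings and ordinary text stay inside the current chunk.
            body.append(line)
    if heading or any(part.strip() for part in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks
```

Keeping deeper headings inside their parent chunk is one design choice among several; a retrieval setup with short context windows might prefer a higher `max_level`.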
Brief extraction modes you will bump into:
Text-first: fast when the PDF is mostly linear text; tables, footnotes, and multi-column layouts often collapse into the wrong order.
Layout-aware: block- and reading-order-aware tools cost more CPU or dependencies but map better to ##/### in Markdown.
OCR-backed: scanned PDFs need OCR first; quality drives everything downstream.
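One way to route documents between these modes is a cheap pre-check on per-page statistics. The sketch below is an assumption-heavy illustration: `PageStats`, `pick_mode`, and both thresholds are hypothetical, and you would fill the statistics from whichever extractor you already run.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page measurements collected from your extractor of choice."""
    char_count: int        # characters found in the text layer
    image_coverage: float  # fraction of the page area covered by images (0..1)
    column_count: int      # estimated number of text columns


def pick_mode(pages: list[PageStats], min_chars: int = 50) -> str:
    """Return 'ocr', 'layout-aware', or 'text-first' for a document.

    The thresholds are illustrative guesses; tune them on your own PDFs.
    """
    if not pages:
        raise ValueError("no pages to inspect")
    # Pages with almost no text but a large image are likely scans.
    scanned = [p for p in pages if p.char_count < min_chars and p.image_coverage > 0.5]
    if len(scanned) > len(pages) / 2:
        return "ocr"
    # Multi-column pages are where linear text extraction tends to scramble order.
    if any(p.column_count > 1 for p in pages):
        return "layout-aware"
    return "text-first"
```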
No single converter handles every document. Evaluate on your own PDFs (legal scans, decks, datasheets, and papers all behave differently), and include a small set of deliberately “nasty” files rather than only an average-case sample.
Why PDF conversion is a pipeline
“PDF-to-Markdown” is not one task. Vendor demos often hide the steps above; production systems feel every layer. Lightweight tools skim text quickly; heavier stacks try to recover blocks, order, and tables. The right investment depends on what must be true for downstream chunking, search, or models—not on the brand name printed on the box.
Before you trust a corpus: spot-check structure (headings, lists, tables), and if you chunk for retrieval, inspect what the model actually sees—garbled order hurts more than small typos.
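A structure spot-check can start as a simple count of Markdown markers per file. The sketch below is a minimal version under stated assumptions: `structure_report` is a hypothetical helper, the 1000-character cutoff for "suspiciously long" lines is arbitrary, and table separator rows are counted as table rows.

```python
import re


def structure_report(markdown: str) -> dict[str, int]:
    """Count basic structural markers in converted Markdown."""
    report = {"headings": 0, "list_items": 0, "table_rows": 0, "long_lines": 0}
    for line in markdown.splitlines():
        stripped = line.strip()
        if re.match(r"#{1,6}\s", stripped):
            report["headings"] += 1
        elif re.match(r"([-*+]|\d+[.)])\s", stripped):
            report["list_items"] += 1
        elif stripped.startswith("|") and stripped.endswith("|") and len(stripped) > 1:
            # Includes the |---|---| separator row of each table.
            report["table_rows"] += 1
        # Very long lines often mean paragraphs or columns were merged.
        if len(stripped) > 1000:
            report["long_lines"] += 1
    return report
```

A report with zero headings for a document you know has sections is a quick signal that the converter flattened the structure, before any chunking happens.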
How to choose a tool
The correct choice depends on the structure and quality of the PDF, not the logo on the GitHub repo. Use this as a starting map, then validate on your files.
| Situation | Start with | Why |
|---|---|---|
| Simple text PDF | pypdf or PyMuPDF | Lightweight extraction |
| Research paper | Marker or PyMuPDF4LLM | Better reading order and structure |
| Table-heavy report | pdfplumber | Strong table and bounding-box handling |
| RAG ingestion | PyMuPDF4LLM, Unstructured, or Docling | Structured output for downstream systems |
| Enterprise document pipeline | Docling or Unstructured | Semantic document blocks |
| Benchmarking converters | pdfsmith | Compare multiple backends |
| Large-scale local pipeline | PyMuPDF, PyMuPDF4LLM, or PDF Oxide | Speed and local control |
| Messy scanned or visual PDFs | OCR-capable tool or LLM cleanup layer | Extraction alone may fail |
Exact winners still depend on license constraints, table and math needs, and ops budget. Teams often mix CLI or library extractors, hosted APIs where operational overhead should stay low, and manual spot fixes for the few pages that drive most of the value.
Practical setup
A minimal staging layout keeps raw input, intermediates, and outputs separated without turning this article into a full workspace guide:
mkdir -p ~/pdf-staging
cd ~/pdf-staging
mkdir -p inbox processing markdown logs scripts env
python3 -m venv env
source env/bin/activate
python -m pip install --upgrade pip
python -m pip install pymupdf pymupdf4llm pdfplumber unstructured marker-pdf watchdog rich
Folder roles:
pdf-staging/
├── inbox/ # raw PDF input
├── processing/ # intermediate files
├── markdown/ # final Markdown output
├── logs/ # conversion logs
├── scripts/ # Python pipeline code
└── env/ # isolated Python environment
Install only what you need for the path you are testing; the list above is a reasonable batch for exploring several backends locally.
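To show how the folder roles might translate into code, here is a small sketch that maps an inbox PDF to its paths in each folder and lists PDFs that still lack Markdown output. The helper names and the `.json` and `.log` extensions are assumptions for illustration, not part of any tool above.

```python
from pathlib import Path


def plan_outputs(root: Path, pdf_name: str) -> dict[str, Path]:
    """Map one inbox PDF to the paths used in each staging folder."""
    stem = Path(pdf_name).stem
    return {
        "source": root / "inbox" / pdf_name,
        "intermediate": root / "processing" / f"{stem}.json",  # assumed format
        "markdown": root / "markdown" / f"{stem}.md",
        "log": root / "logs" / f"{stem}.log",
    }


def pending_pdfs(root: Path) -> list[Path]:
    """List inbox PDFs that have no Markdown output yet."""
    done = {p.stem for p in (root / "markdown").glob("*.md")}
    return sorted(p for p in (root / "inbox").glob("*.pdf") if p.stem not in done)
```

Matching by file stem keeps reruns idempotent: a PDF is skipped once its Markdown file exists, and deleting that file queues it again.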
Quick examples by tool
The snippets below are minimal “does it run?” examples. Extend them with your own error handling, paths, and logging (see What to log).
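PyMuPDF
import fitz  # PyMuPDF's legacy import name; recent releases also accept "import pymupdf"

doc = fitz.open("file.pdf")
for page in doc:
    # "dict" returns blocks, lines, and spans with coordinates and font data
    data = page.get_text("dict")
    print(data)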
PyMuPDF shines when you want speed, control, and access to blocks, spans, coordinates, and custom processing.
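The dictionary returned by `get_text("dict")` nests blocks, lines, and spans, and each span carries its text and font size. The sketch below walks that nesting and marks unusually large lines as headings. The function `dict_to_lines`, the `heading_ratio` parameter, and the font-size heuristic are hypothetical illustrations rather than anything PyMuPDF provides; the sample data in the usage note is hand-built so the code runs without a PDF.

```python
from collections import Counter


def dict_to_lines(page_dict: dict, heading_ratio: float = 1.2) -> list[str]:
    """Flatten a get_text("dict") result into lines, marking large text as headings.

    Assumes the blocks -> lines -> spans nesting; the font-size heuristic
    and heading_ratio are illustrative, not part of PyMuPDF.
    """
    rows = []  # (largest span size in the line, joined line text)
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 is a text block; 1 is an image
            continue
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            text = "".join(span["text"] for span in spans).strip()
            if text:
                rows.append((max(span["size"] for span in spans), text))
    if not rows:
        return []
    # Treat the most common font size as body text.
    body_size = Counter(round(size, 1) for size, _ in rows).most_common(1)[0][0]
    return [
        f"## {text}" if size >= body_size * heading_ratio else text
        for size, text in rows
    ]
```

With a real page you would call `dict_to_lines(page.get_text("dict"))`. Expect to tune the ratio per document family, since slide decks and datasheets use font sizes very differently from papers.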
pdfminer.six
from pdfminer.high_level import extract_text
text = extract_text("file.pdf")
print(text)
pdfminer.six is often slower but useful when text layout and extraction detail matter.
pypdf
from pypdf import PdfReader

reader = PdfReader("file.pdf")
for page in reader.pages:
    print(page.extract_text())
pypdf is handy for lightweight preprocessing, splitting, merging, and simple text extraction.
pdfplumber
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    tables = page.extract_tables()
    print(text)
    print(tables)
pdfplumber is a strong first test for table-heavy PDFs.
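`extract_tables()` returns each table as a list of rows, with `None` for empty cells, so turning one into Markdown is mostly cleanup. The sketch below is one possible rendering; `table_to_markdown` is a hypothetical helper, and it assumes the first row is the header, which is not always true for extracted tables.

```python
from typing import Optional


def table_to_markdown(table: list[list[Optional[str]]]) -> str:
    """Render one extracted table (rows of cells) as a Markdown pipe table."""
    if not table:
        return ""
    width = max(len(row) for row in table)

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; newlines and pipes would break the row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    # Pad short rows so every row has the same number of columns.
    rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

In the pdfplumber example above, this would be applied per table, for instance `[table_to_markdown(t) for t in tables]`.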
PyMuPDF4LLM
import pymupdf4llm

markdown = pymupdf4llm.to_markdown("file.pdf")
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown)
For many teams, PyMuPDF4LLM is one of the best practical defaults for local PDF-to-Markdown conversion.
Unstructured
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("file.pdf")
for element in elements:
    print(type(element).__name__, element.text[:200])
Unstructured is useful when the goal is semantic chunks: titles, narrative text, lists, tables, and related block types—not a single unstructured string.
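Once elements are typed, a thin mapping layer can turn them into Markdown. The sketch below works on plain `(type_name, text)` pairs, the same two values the loop above prints, so it runs without Unstructured installed. The function name and the mapping itself are assumptions: every `Title` becomes a level-2 heading because the type name alone carries no heading depth.

```python
def elements_to_markdown(elements: list[tuple[str, str]]) -> str:
    """Render (type_name, text) pairs as Markdown.

    Illustrative mapping only: Title -> "## ", ListItem -> "- ",
    everything else is emitted as a plain paragraph.
    """
    out: list[str] = []
    for type_name, text in elements:
        text = text.strip()
        if not text:
            continue
        if type_name == "Title":
            out.append(f"## {text}")
        elif type_name == "ListItem":
            out.append(f"- {text}")
        else:
            # NarrativeText, Table text, and unknown types fall through as paragraphs.
            out.append(text)
    # Keep consecutive list items tight; separate everything else with a blank line.
    lines: list[str] = []
    for i, item in enumerate(out):
        if i and not (item.startswith("- ") and out[i - 1].startswith("- ")):
            lines.append("")
        lines.append(item)
    return "\n".join(lines)
```

With real output you could build the pairs as `[(type(e).__name__, e.text) for e in elements]`.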
Marker
Shell-style invocation (command names and flags vary by release, so check marker --help or the docs for your installed version before relying on this):
marker_single file.pdf --output_dir markdown/
Docling
Docling fits best when you treat it as document understanding, not merely “dump text”:
Input PDF → Docling conversion → structured document object → Markdown or JSON export.
Precise CLI and Python entry points evolve; follow the upstream documentation for installation and APIs. Docling is a structured layer—plan for exporting to Markdown or JSON and for integrating with downstream storage or search.
Other tools referenced elsewhere in ecosystem discussions—PDF Oxide, pdfsmith, pdf-to-md-llm, markdrop, appjsonify, pdf2markdown—fill niches such as Rust-backed speed, multi-backend benchmarking, or opinionated pipelines. Evaluate them against the same PDF set you use for your primary extractor.
Minimal adapter pattern
Avoid hard-coding one converter forever. A small Protocol-based boundary leaves room to swap backends, add benchmarks, or fall back when a PDF defeats your first choice:
from pathlib import Path
from typing import Protocol


class PDFConverter(Protocol):
    name: str

    def convert(self, pdf_path: Path) -> str:
        ...


class PyMuPDF4LLMConverter:
    name = "pymupdf4llm"

    def convert(self, pdf_path: Path) -> str:
        import pymupdf4llm
        return pymupdf4llm.to_markdown(str(pdf_path))


class PdfPlumberConverter:
    name = "pdfplumber"

    def convert(self, pdf_path: Path) -> str:
        import pdfplumber
        pages = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    pages.append(text)
        return "\n\n".join(pages)


def convert_with_fallbacks(pdf_path: Path, converters: list[PDFConverter]) -> tuple[str, str]:
    errors = []
    for converter in converters:
        try:
            return converter.convert(pdf_path), converter.name
        except Exception as error:
            errors.append(f"{converter.name}: {error}")
    raise RuntimeError("All converters failed:\n" + "\n".join(errors))
This is the beginning of a serious ingestion layer:
PDF input → selected backend → normalized Markdown → metadata → logs → evaluation.
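The same boundary also supports side-by-side comparison rather than first-success fallback. The sketch below runs every converter on one file and returns comparable records. `compare_converters` and the two stand-in converters are hypothetical; the stand-ins exist only so the example runs without any PDF library installed.

```python
import time
from pathlib import Path
from typing import Protocol


class PDFConverter(Protocol):
    name: str

    def convert(self, pdf_path: Path) -> str: ...


def compare_converters(pdf_path: Path, converters: list[PDFConverter]) -> list[dict]:
    """Run every converter on one PDF and collect comparable records."""
    results = []
    for converter in converters:
        started = time.perf_counter()
        try:
            markdown = converter.convert(pdf_path)
            record = {"method": converter.name, "status": "success", "chars": len(markdown)}
        except Exception as error:  # record the failure and keep going
            record = {"method": converter.name, "status": "failed", "notes": str(error)}
        record["seconds"] = round(time.perf_counter() - started, 3)
        results.append(record)
    return results


# Stand-in converters so the sketch runs without real backends.
class EchoConverter:
    name = "echo"

    def convert(self, pdf_path: Path) -> str:
        return f"# {pdf_path.stem}\n"


class BrokenConverter:
    name = "broken"

    def convert(self, pdf_path: Path) -> str:
        raise ValueError("cannot parse")


results = compare_converters(Path("paper.pdf"), [EchoConverter(), BrokenConverter()])
for row in results:
    print(row)
```

Character count and wall-clock time are crude proxies; they flag empty or suspiciously short output, but they do not replace reading the Markdown for the files that matter.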
What to log
Observable pipelines record what ran, whether it succeeded, and where output landed. Example JSON-shaped record:
{
  "source": "paper.pdf",
  "output": "paper.md",
  "method": "pymupdf4llm",
  "status": "success",
  "notes": "tables preserved"
}
Logging method, status, output path, and failure messages makes regressions and odd PDFs tractable—you can correlate chunk quality back to extractor choice.
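A low-ceremony way to keep such records is one JSON object per line in an append-only file. The sketch below assumes that JSON Lines layout; `log_run`, `read_runs`, and the added `timestamp` field are illustrative choices, not part of the record shown above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_run(log_path: Path, source: str, output: str, method: str,
            status: str, notes: str = "") -> dict:
    """Append one conversion record to a JSON Lines file and return it."""
    record = {
        "source": source,
        "output": output,
        "method": method,
        "status": status,
        "notes": notes,
        # Extra field, not in the example record: when the run happened (UTC).
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_runs(log_path: Path) -> list[dict]:
    """Load every record back, one JSON object per line."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Appending rather than rewriting means a crashed run cannot corrupt earlier records, and filtering `read_runs(...)` by `method` or `status` is enough to see which backend fails on which files.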
Where this connects next
Primary bridge: PDF Intelligence Core describes how the concepts above assemble into an inspectable pipeline, from ingestion through indexing and a deterministic graph layer, with pdf-intelligence-core on GitHub carrying the runnable implementation and disk artifacts. Read this article for tool choices; read that page for system shape.
Other topics belong in separate articles when you ship them, for example:
PDF staging workspace — directory layout and staging discipline
Chunk-only retrieval patterns beyond what the core repo already stages
Broader evaluation and memory-guided workflows tied to traced failures
Treat this article as the tool map; PDF Intelligence Core is the runnable core anchored to code you can fork or audit.
Final takeaway
PDF ingestion is infrastructure. Pick tools from document shape and downstream requirements, log runs honestly, and keep a swap-friendly boundary so one bad converter does not own your pipeline forever. Wit is allowed; letting the toolchain become mystery meat is not.