PDF to Markdown Tools for AI Pipelines

Turning PDFs into clean Markdown is the unglamorous step almost every document-AI pipeline depends on, and it’s where most of them quietly fail. This article is the tool landscape: what exists, what the trade-offs are, and how to evaluate converters on your files instead of generic ones. PDF Intelligence Core is the runnable companion — an assembled pipeline built from these ideas, with pdf-intelligence-core on GitHub as the cloneable codebase.

Treat PDF ingestion as infrastructure, not a script you wrote once on a Friday. Conversion is rarely one checkbox. It’s a pipeline:

flowchart LR
  P[PDF parsing] --> L[layout detection]
  L --> O[reading order]
  O --> T[table extraction]
  T --> I[image / diagram handling]
  I --> SC[semantic cleanup]
  SC --> MD[Markdown formatting]
  MD --> CH[chunking]
  CH --> ST[storage]
  ST --> E[evaluation]

Markdown helps because it gives predictable headings, lists, and code blocks: chunks can align with sections, navigation works, and files version like normal source. PDFs rarely arrive with that structure intact, so quality is mostly won or lost at extraction and cleanup.
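To make the point about section-aligned chunking concrete, here is a minimal sketch of heading-based splitting. The name split_by_headings and the max_level parameter are illustrative, not from any library, and the sketch deliberately ignores edge cases such as # lines inside code fences.

```python
import re

# Matches ATX headings such as "# Title" or "## Section" (levels 1-6).
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str, max_level: int = 2) -> list[dict]:
    """Split Markdown into chunks that start at headings of level <= max_level.

    Hypothetical helper for illustration. Text before the first heading
    becomes a chunk with an empty title; deeper headings stay in the body.
    """
    chunks = []
    current = {"title": "", "lines": []}

    def has_content(chunk: dict) -> bool:
        return bool(chunk["title"]) or any(l.strip() for l in chunk["lines"])

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            # A new section starts: close the current chunk if it holds anything.
            if has_content(current):
                chunks.append(current)
            current = {"title": match.group(2), "lines": []}
        else:
            current["lines"].append(line)

    if has_content(current):
        chunks.append(current)

    return [
        {"title": c["title"], "body": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
```

If a converter drops or mangles headings, this kind of splitter returns one oversized chunk, which is an easy signal to check for.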

Brief extraction modes you will bump into:

Text-first — fast when the PDF is mostly linear text; tables, footnotes, and multi-column layouts often collapse into the wrong order.
Layout-aware — block- and reading-order-aware tools cost more CPU or dependencies but map better to Markdown heading levels (## / ###).
OCR-backed — scanned PDFs need OCR first; quality drives everything downstream.
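One cheap way to route a file between these modes is to look at how much text the text layer yields per page. The sketch below assumes you already have one extracted string per page from any text-first tool; the function name needs_ocr and both thresholds are assumptions to tune on your own files, not recommended values.

```python
def needs_ocr(
    page_texts: list[str],
    min_chars: int = 50,
    max_sparse_ratio: float = 0.5,
) -> bool:
    """Guess whether a PDF needs OCR from per-page text-layer output.

    A page counts as sparse when it yields fewer than min_chars characters.
    The file is flagged when more than max_sparse_ratio of its pages are sparse.
    """
    if not page_texts:
        return True  # nothing was extracted at all

    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse / len(page_texts) > max_sparse_ratio
```

A heuristic like this only catches missing text layers. A scanned PDF with a poor embedded OCR layer will pass it and still produce bad Markdown.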

No single converter handles every document. Evaluate on your own PDFs, because legal scans, decks, datasheets, and papers behave differently, and include a small set of deliberately “nasty” files rather than only an average-case sample.

Why PDF conversion is a pipeline

“PDF-to-Markdown” is not one task. Vendor demos often hide the steps above; production systems feel every layer. Lightweight tools skim text quickly; heavier stacks try to recover blocks, order, and tables. The right investment depends on what must be true for downstream chunking, search, or models—not on the brand name printed on the box.

Before you trust a corpus: spot-check structure (headings, lists, tables), and if you chunk for retrieval, inspect what the model actually sees—garbled order hurts more than small typos.
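A structural spot-check can be partly automated. The sketch below counts headings, list items, and pipe-table rows in converter output so you can compare two converters on the same file. The name structure_stats is hypothetical and the patterns are rough approximations of Markdown syntax, not a parser.

```python
import re

FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def structure_stats(markdown: str) -> dict[str, int]:
    """Count rough structural markers in Markdown, skipping fenced code."""
    stats = {"headings": 0, "list_items": 0, "table_rows": 0}
    in_fence = False

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue

        if re.match(r"#{1,6}\s+\S", stripped):
            stats["headings"] += 1
        elif re.match(r"([-*+]|\d+[.)])\s+\S", stripped):
            stats["list_items"] += 1
        elif len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            # Counts every pipe-delimited line, including the separator row.
            stats["table_rows"] += 1

    return stats
```

A report that visibly contains tables but scores zero table rows is worth opening by hand before it reaches the index.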

How to choose a tool

The correct choice depends on the structure and quality of the PDF, not the logo on the GitHub repo. Use this as a starting map, then validate on your files.

Situation	Start with	Why
Simple text PDF	pypdf or PyMuPDF	Lightweight extraction
Research paper	Marker or PyMuPDF4LLM	Better reading order and structure
Table-heavy report	pdfplumber	Strong table and bounding-box handling
RAG ingestion	PyMuPDF4LLM, Unstructured, or Docling	Structured output for downstream systems
Enterprise document pipeline	Docling or Unstructured	Semantic document blocks
Benchmarking converters	pdfsmith	Compare multiple backends
Large-scale local pipeline	PyMuPDF, PyMuPDF4LLM, or PDF Oxide	Speed and local control
Messy scanned or visual PDFs	OCR-capable tool or LLM cleanup layer	Extraction alone may fail

Exact winners still depend on license constraints, table and math needs, and ops budget. Teams often mix CLI or library extractors, hosted APIs where they want to keep operational overhead low, and manual spot fixes for the few pages that drive most of the value.

Practical setup

A minimal staging layout keeps raw input, intermediates, and outputs separated without turning this article into a full workspace guide:

mkdir -p ~/pdf-staging
cd ~/pdf-staging

mkdir -p inbox processing markdown logs scripts env

python3 -m venv env
source env/bin/activate

python -m pip install --upgrade pip
python -m pip install pymupdf pymupdf4llm pdfplumber pdfminer.six pypdf "unstructured[pdf]" marker-pdf watchdog rich

Folder roles:

pdf-staging/
├── inbox/          # raw PDF input
├── processing/     # intermediate files
├── markdown/       # final Markdown output
├── logs/           # conversion logs
├── scripts/        # Python pipeline code
└── env/            # isolated Python environment

Install only what you need for the path you are testing; the list above is a reasonable batch for exploring several backends locally.
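As a small illustration of how the folders work together, the sketch below pairs each PDF in inbox/ with its target in markdown/ and skips files that were already converted. The function name plan_conversions and the skip-if-exists rule are assumptions for this example. It only plans work and does not convert or move anything.

```python
from pathlib import Path


def plan_conversions(root: Path) -> list[tuple[Path, Path]]:
    """Pair each PDF in inbox/ with its target path in markdown/.

    PDFs whose Markdown output already exists are skipped so reruns stay cheap.
    """
    inbox = root / "inbox"
    out_dir = root / "markdown"
    plan = []

    for pdf in sorted(inbox.glob("*.pdf")):
        target = out_dir / (pdf.stem + ".md")
        if not target.exists():
            plan.append((pdf, target))

    return plan
```

Called as plan_conversions(Path.home() / "pdf-staging"), it returns the pending work list for whichever converter you are testing.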

Quick examples by tool

The snippets below are minimal “does it run?” examples. Extend them with your own error handling, paths, and logging (see What to log).

PyMuPDF

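import fitz  # legacy import name; recent PyMuPDF releases also accept "import pymupdf"

doc = fitz.open("file.pdf")

for page in doc:
    # "dict" returns blocks, lines, and spans with coordinates and font info
    data = page.get_text("dict")
    print(data)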

PyMuPDF shines when you want speed, control, and access to blocks, spans, coordinates, and custom processing.
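The dictionary returned by get_text("dict") nests blocks, lines, and spans, and each span carries its text and font size. The sketch below walks that structure and flags spans that are noticeably larger than the body text as heading candidates. The helper names and the 1.2 size ratio are assumptions for illustration, and font size alone is a weak heading signal on many documents.

```python
from collections import Counter
from typing import Iterator


def iter_spans(page_dict: dict) -> Iterator[dict]:
    """Yield non-empty text spans from a page.get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if span.get("text", "").strip():
                    yield span


def guess_headings(page_dict: dict, ratio: float = 1.2) -> list[str]:
    """Return span texts at least `ratio` times larger than the body font size.

    Body size is taken as the font size that covers the most characters.
    """
    spans = list(iter_spans(page_dict))
    if not spans:
        return []

    weights = Counter()
    for span in spans:
        weights[round(span["size"], 1)] += len(span["text"])
    body_size = weights.most_common(1)[0][0]

    return [s["text"].strip() for s in spans if s["size"] >= body_size * ratio]
```

Because the helpers take plain dictionaries, they can be exercised on a hand-built structure without opening a PDF.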

pdfminer.six

from pdfminer.high_level import extract_text

text = extract_text("file.pdf")
print(text)

pdfminer.six is often slower than PyMuPDF, but it is useful when you need fine-grained control over layout analysis and character-level extraction detail.

pypdf

from pypdf import PdfReader

reader = PdfReader("file.pdf")

for page in reader.pages:
    print(page.extract_text())

pypdf is handy for lightweight preprocessing, splitting, merging, and simple text extraction.

pdfplumber

import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]

    text = page.extract_text()
    tables = page.extract_tables()

    print(text)
    print(tables)

pdfplumber is a strong first test for table-heavy PDFs.
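extract_tables() returns each table as a list of rows, with None for empty cells, so it still needs a rendering step before it becomes Markdown. The sketch below is one way to do that. The name table_to_markdown is hypothetical, and treating the first row as the header is an assumption, since pdfplumber does not mark header rows.

```python
from typing import Optional


def table_to_markdown(rows: list[list[Optional[str]]]) -> str:
    """Render one extracted table (a list of rows) as a Markdown pipe table."""
    if not rows:
        return ""

    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # None marks an empty cell; newlines and pipes would break the row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def render(row: list[Optional[str]]) -> str:
        # Pad short rows so every row has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [render(row) for row in rows[1:]]
    return "\n".join(lines)
```

With the earlier snippet, "\n\n".join(table_to_markdown(t) for t in tables) would produce one Markdown table per extracted table.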

PyMuPDF4LLM

import pymupdf4llm

markdown = pymupdf4llm.to_markdown("file.pdf")

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown)

For many teams, PyMuPDF4LLM is one of the best practical defaults for local PDF-to-Markdown conversion.

Unstructured

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("file.pdf")

for element in elements:
    print(type(element).__name__, element.text[:200])

Unstructured is useful when the goal is semantic chunks: titles, narrative text, lists, tables, and related block types—not a single unstructured string.
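Typed elements still have to be turned into Markdown at some point. The sketch below takes (type name, text) pairs, the same two values the snippet above prints, and applies a simple mapping. The function name elements_to_markdown is hypothetical, and rendering every Title as a level-2 heading is a simplifying assumption, since heading depth is not inferred here.

```python
def elements_to_markdown(pairs: list[tuple[str, str]]) -> str:
    """Render (element type name, text) pairs as Markdown.

    Title becomes a "##" heading, ListItem becomes a bullet, and every
    other type is emitted as a plain paragraph.
    """
    out = []
    prev_kind = None

    for kind, text in pairs:
        text = text.strip()
        if not text:
            continue

        if kind == "Title":
            rendered = f"## {text}"
        elif kind == "ListItem":
            rendered = f"- {text}"
        else:
            rendered = text

        if out:
            # Keep consecutive list items tight; separate everything else.
            tight = kind == "ListItem" and prev_kind == "ListItem"
            out.append("\n" if tight else "\n\n")
        out.append(rendered)
        prev_kind = kind

    return "".join(out)
```

With real output, the input can be built as [(type(e).__name__, e.text) for e in elements].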

Marker

Shell-style invocation. Command names and flags differ across releases, so check marker_single --help or the docs for your installed version before relying on it:

marker_single file.pdf --output_dir markdown/

Docling

Docling fits best when you treat it as document understanding, not merely “dump text”:

Input PDF → Docling conversion → structured document object → Markdown or JSON export.

Precise CLI and Python entry points evolve; follow the upstream documentation for installation and APIs. Docling is a structured layer—plan for exporting to Markdown or JSON and for integrating with downstream storage or search.

Other tools referenced elsewhere in ecosystem discussions—PDF Oxide, pdfsmith, pdf-to-md-llm, markdrop, appjsonify, pdf2markdown—fill niches such as Rust-backed speed, multi-backend benchmarking, or opinionated pipelines. Evaluate them against the same PDF set you use for your primary extractor.

Minimal adapter pattern

Avoid hard-coding one converter forever. A small Protocol-based boundary leaves room to swap backends, add benchmarks, or fall back when a PDF defeats your first choice:

from pathlib import Path
from typing import Protocol


class PDFConverter(Protocol):
    name: str

    def convert(self, pdf_path: Path) -> str:
        ...


class PyMuPDF4LLMConverter:
    name = "pymupdf4llm"

    def convert(self, pdf_path: Path) -> str:
        import pymupdf4llm

        return pymupdf4llm.to_markdown(str(pdf_path))


class PdfPlumberConverter:
    name = "pdfplumber"

    def convert(self, pdf_path: Path) -> str:
        import pdfplumber

        pages = []

        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    pages.append(text)

        return "\n\n".join(pages)


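def convert_with_fallbacks(pdf_path: Path, converters: list[PDFConverter]) -> tuple[str, str]:
    """Try converters in order; return (markdown, converter name) for the first that works."""
    errors = []

    for converter in converters:
        try:
            markdown = converter.convert(pdf_path)
        except Exception as error:
            errors.append(f"{converter.name}: {error}")
            continue

        # Empty output (for example, a scanned PDF with no text layer) counts as
        # a failure; otherwise the chain stops at a converter that returned nothing.
        if markdown.strip():
            return markdown, converter.name
        errors.append(f"{converter.name}: empty output")

    raise RuntimeError("All converters failed:\n" + "\n".join(errors))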

This is the beginning of a serious ingestion layer:

PDF input → selected backend → normalized Markdown → metadata → logs → evaluation.
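The same boundary also supports side-by-side benchmarking. The sketch below runs every converter on one file and records status, output size, and wall-clock time. The benchmark function and the two stand-in converters are hypothetical and exist only so the example runs without any PDF library installed. The Protocol is repeated to keep the block self-contained.

```python
import time
from pathlib import Path
from typing import Protocol


class PDFConverter(Protocol):
    name: str

    def convert(self, pdf_path: Path) -> str:
        ...


def benchmark(pdf_path: Path, converters: list[PDFConverter]) -> list[dict]:
    """Run every converter on one PDF and record timing, size, and errors."""
    results = []

    for converter in converters:
        started = time.perf_counter()
        try:
            output = converter.convert(pdf_path)
            record = {"method": converter.name, "status": "success",
                      "chars": len(output), "error": None}
        except Exception as error:
            record = {"method": converter.name, "status": "failed",
                      "chars": 0, "error": str(error)}
        record["seconds"] = round(time.perf_counter() - started, 3)
        results.append(record)

    return results


# Stand-in converters (not real backends) used to exercise the harness.
class EchoConverter:
    name = "echo"

    def convert(self, pdf_path: Path) -> str:
        return f"# {pdf_path.stem}\n"


class BrokenConverter:
    name = "broken"

    def convert(self, pdf_path: Path) -> str:
        raise ValueError("no text layer")
```

Character count and timing are crude proxies. They show which converter failed or returned almost nothing, not which one preserved reading order.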

What to log

Observable pipelines record what ran, whether it succeeded, and where output landed. Example JSON-shaped record:

{
  "source": "paper.pdf",
  "output": "paper.md",
  "method": "pymupdf4llm",
  "status": "success",
  "notes": "tables preserved"
}

Logging method, status, output path, and failure messages makes regressions and odd PDFs tractable—you can correlate chunk quality back to extractor choice.
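One low-friction way to keep such records is an append-only JSON Lines file in logs/. The sketch below writes one record per conversion using the fields from the example above. The function name log_conversion and the added timestamp field are assumptions, not part of any existing tool.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def log_conversion(
    log_path: Path,
    *,
    source: str,
    output: Optional[str],
    method: str,
    status: str,
    notes: str = "",
) -> dict:
    """Append one conversion record to a JSON Lines file and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # added field (assumption)
        "source": source,
        "output": output,
        "method": method,
        "status": status,
        "notes": notes,
    }

    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")

    return record
```

For a failed run, pass output=None and put the error message in notes so the failure stays searchable later.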

Where this connects next

Primary bridge: PDF Intelligence Core describes how the concepts above assemble into an inspectable pipeline, from ingestion through indexing and a deterministic graph layer, with pdf-intelligence-core on GitHub carrying the runnable implementation and disk artifacts. Read this article for tool choices; read that page for system shape.

Other topics belong in separate articles, for example:

PDF staging workspace — directory layout and staging discipline
Chunk-only retrieval patterns beyond what the core repo already stages
Broader evaluation and memory-guided workflows tied to traced failures

Treat this article as the tool map; PDF Intelligence Core is the runnable core anchored to code you can fork or audit.

Final takeaway

PDF ingestion is infrastructure. Pick tools from document shape and downstream requirements, log runs honestly, and keep a swap-friendly boundary so one bad converter does not own your pipeline forever. Wit is allowed; letting the toolchain become mystery meat is not.