PDF to Markdown for RAG (Retrieval-Augmented Generation)

RAG quality starts at ingestion. Convert PDFs to Markdown first and your chunks break on real boundaries — headings, paragraphs — instead of mid-sentence column wraps. Retrieval scores climb, hallucinations drop.

Why convert PDFs to Markdown for this?

Retrieval-augmented generation lives or dies on chunk quality. If your chunks are coherent (one idea per chunk, clean boundaries), retrieval finds the right one and the LLM has clean context to answer from. If your chunks are noisy (PDF page numbers in the middle, sentences split across chunks, columns interleaved), retrieval scores degrade and the LLM gets confusing input. PDFs are notoriously bad source material for this because raw text extraction yields exactly those problems. Converting to Markdown first solves it at the source: t0md emits proper headings (which most chunkers split on), preserves paragraph boundaries, and drops the layout cruft. With the same RAG pipeline and the same embedder, Markdown chunks tend to give better recall than raw-PDF chunks.

How to use t0md

Run your PDFs through t0md once at ingestion time, then chunk the Markdown. For LangChain, use `MarkdownHeaderTextSplitter` to split on H1/H2/H3; that gives you semantically meaningful chunks, each tagged with its section headings as metadata. For LlamaIndex, the equivalent is `MarkdownNodeParser`. For custom pipelines, split on `\n\n## ` or any heading level that matches your document depth. Embed the resulting chunks with your usual model (OpenAI text-embedding-3-small, Cohere embed-v3, BGE, etc.) and you're done. Retrieval becomes more precise because each chunk has a clean topic boundary instead of an arbitrary token-count cut.

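# LangChain example
# Newer LangChain releases ship the splitters in the langchain-text-splitters package;
# on older releases, use `from langchain.text_splitter import MarkdownHeaderTextSplitter`.
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Each tuple is (heading marker, metadata key stored on the resulting chunk).
headers_to_split_on = [("#", "H1"), ("##", "H2"), ("###", "H3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# markdown_text is the Markdown string produced by t0md.
chunks = splitter.split_text(markdown_text)  # list of Documents with heading metadata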
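
For a custom pipeline with no framework, the heading split described above can be sketched in plain Python. This is a minimal illustration, not part of t0md: the function name `split_on_headings` and the chunk dictionary shape are hypothetical, and the regex does not skip `#` lines inside fenced code blocks, so treat it as a starting point.

```python
import re


def split_on_headings(markdown_text: str, level: int = 2) -> list[dict]:
    """Split Markdown into chunks that each start at a heading of `level`.

    Hypothetical helper for illustration; deeper headings (e.g. ###) stay
    inside the chunk of the enclosing section.
    """
    # Match a line that starts with exactly `level` hashes followed by a space.
    pattern = re.compile(rf"^#{{{level}}} +(.+)$", re.MULTILINE)
    matches = list(pattern.finditer(markdown_text))
    chunks = []

    # Text before the first matching heading (title, intro) is its own chunk.
    preamble_end = matches[0].start() if matches else len(markdown_text)
    preamble = markdown_text[:preamble_end].strip()
    if preamble:
        chunks.append({"heading": None, "text": preamble})

    # Each section runs from its heading to the next heading of the same level.
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        chunks.append({
            "heading": match.group(1).strip(),
            "text": markdown_text[match.start():end].strip(),
        })
    return chunks


sample = (
    "# Guide\n\nIntro text.\n\n"
    "## Install\n\nRun the installer.\n\n### Options\n\nUse defaults.\n\n"
    "## Usage\n\nCall the tool.\n"
)
chunks = split_on_headings(sample)
print([c["heading"] for c in chunks])  # [None, 'Install', 'Usage']
```

Keeping the heading text alongside each chunk gives you something to store as metadata next to the embedding, similar to what the framework splitters attach.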

Frequently asked questions

Can I skip Markdown and just chunk PDF text directly?

You can, but you'll fight artefacts: page numbers, headers and footers repeated on every page, interleaved columns. Converting to Markdown removes those at ingestion so your chunker only sees signal.

What chunk size should I use?

Start with header-based splitting and add a max-token cap of 800–1200 tokens (model-dependent). Header-split chunks are usually well-sized to begin with; the cap is just for the occasional long section.
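
One way to apply such a cap is to leave short header chunks alone and break only the long ones on paragraph boundaries. The sketch below is an assumption-laden illustration: `cap_chunk` is a hypothetical helper, and it approximates tokens by whitespace-separated words, so pass your embedder's real tokenizer as `count_tokens` if you need exact limits.

```python
def cap_chunk(text: str, max_tokens: int = 1000, count_tokens=None) -> list[str]:
    """Split one chunk on blank lines so each piece stays under max_tokens.

    Hypothetical helper. A single paragraph longer than max_tokens is kept
    whole rather than cut mid-sentence.
    """
    # Assumption: word count stands in for a real tokenizer by default.
    count = count_tokens or (lambda s: len(s.split()))
    if count(text) <= max_tokens:
        return [text]

    pieces, current, current_tokens = [], [], 0
    for paragraph in text.split("\n\n"):
        n = count(paragraph)
        # Flush the running piece when the next paragraph would exceed the cap.
        if current and current_tokens + n > max_tokens:
            pieces.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        pieces.append("\n\n".join(current))
    return pieces


# Five 40-word paragraphs with a 100-token cap.
long_section = "\n\n".join(" ".join(["word"] * 40) for _ in range(5))
pieces = cap_chunk(long_section, max_tokens=100)
print([len(p.split()) for p in pieces])  # [80, 80, 40]
```

Splitting on paragraph boundaries keeps the benefit of the Markdown conversion: even the capped pieces end where the author ended a thought.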

Does this work with LlamaIndex / Haystack / custom RAG stacks?

Yes. Markdown is a widely supported interchange format for RAG ingestion. LangChain, LlamaIndex, and Haystack can all ingest Markdown, and the t0md output drops in as-is.