RAG quality starts at ingestion. Convert PDFs to Markdown first and your chunks break on real boundaries — headings, paragraphs — instead of mid-sentence column wraps. Retrieval scores climb, hallucinations drop.
Retrieval-augmented generation lives or dies on chunk quality. If your chunks are coherent (one idea per chunk, clean boundaries), retrieval finds the right one and the LLM has clean context to answer from. If your chunks are noisy (PDF page numbers in the middle, sentences split across chunks, columns interleaved), retrieval scores degrade and the LLM gets confusing input. PDFs are notoriously bad source material for this because raw text extraction yields exactly those problems. Converting to Markdown first solves it at the source: t0md emits proper headings (which most chunkers split on), preserves paragraph boundaries, and drops the layout cruft. With the same RAG pipeline and the same embedder, Markdown chunks typically give better recall than raw-PDF chunks.
Run your PDFs through t0md once at ingestion time, then chunk the Markdown. For LangChain, use `MarkdownHeaderTextSplitter` to split on H1/H2/H3; that gives you semantically meaningful chunks, each tagged with its section headings as metadata. For LlamaIndex, the equivalent is `MarkdownNodeParser`. For custom pipelines, split on `\n\n## ` or whichever heading level matches your document depth. Embed the resulting chunks with your usual model (OpenAI text-embedding-3, Cohere embed-v3, BGE, etc.) and you're done. Retrieval becomes more precise because each chunk has a clean topic boundary instead of an arbitrary token-count cut.
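```python
# LangChain example
# Recent LangChain releases ship splitters in the langchain-text-splitters package.
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Map each heading marker to the metadata key stored on the resulting chunks.
headers_to_split_on = [("#", "H1"), ("##", "H2"), ("###", "H3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)  # markdown_text: the Markdown string from t0md
```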
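For a custom pipeline without a framework, the heading split described above can be sketched with the standard library alone. This is a minimal illustration, not part of t0md: the function name `split_on_headings` and the returned dict layout are assumptions made for this example. It also assumes ATX-style headings (`#`, `##`, `###`) and does not skip `#` lines inside fenced code blocks, which a production splitter would need to handle.

```python
import re

# Matches ATX headings of level 1-3 at the start of a line, e.g. "## Setup".
# Deeper headings (####) are not matched and stay inside their parent section.
HEADING_RE = re.compile(r"^(#{1,3})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def split_on_headings(markdown_text):
    """Split Markdown into one chunk per H1-H3 section.

    Returns a list of dicts holding the heading level, the heading title,
    and the section body. Text before the first heading gets level 0 and
    an empty title. (Hypothetical helper, for illustration only.)
    """
    chunks = []
    matches = list(HEADING_RE.finditer(markdown_text))

    # Keep any preamble that appears before the first heading.
    first_start = matches[0].start() if matches else len(markdown_text)
    preamble = markdown_text[:first_start].strip()
    if preamble:
        chunks.append({"level": 0, "title": "", "text": preamble})

    for i, match in enumerate(matches):
        # A section body runs from the end of its heading line
        # to the start of the next matched heading (or end of text).
        body_start = match.end()
        if i + 1 < len(matches):
            body_end = matches[i + 1].start()
        else:
            body_end = len(markdown_text)
        chunks.append({
            "level": len(match.group(1)),
            "title": match.group(2),
            "text": markdown_text[body_start:body_end].strip(),
        })
    return chunks
```

Keeping the title and level alongside the body means the section name can be stored as metadata or prepended to the text before embedding, which is roughly what the framework splitters do for you.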
You can chunk raw PDF text directly, but you'll fight artefacts: page numbers, headers and footers repeated on every page, and interleaved columns. Converting to Markdown removes those at ingestion, so your chunker only sees signal.
Start with header-based splitting and add a max-token cap of 800–1200 tokens (model-dependent). Header-split chunks are usually well-sized to begin with; the cap is just for the occasional long section.
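One way to apply such a cap is to leave header-split chunks alone unless they exceed the limit, and split only the oversized ones on paragraph boundaries. The sketch below is an assumption-laden illustration: `cap_chunk` and `approx_tokens` are hypothetical names, and the whitespace word count is only a rough stand-in for a real tokenizer, so swap in the tokenizer that matches your embedding model for accurate limits.

```python
def approx_tokens(text):
    # Rough proxy (assumption): count whitespace-separated words.
    # Real token counts depend on the embedding model's tokenizer.
    return len(text.split())


def cap_chunk(text, max_tokens=1000, count_tokens=approx_tokens):
    """Split one oversized chunk on paragraph boundaries.

    Chunks at or under max_tokens are returned unchanged. Otherwise
    paragraphs are packed greedily until adding the next one would
    exceed the cap. A single paragraph longer than the cap is kept
    whole rather than cut mid-sentence.
    """
    if count_tokens(text) <= max_tokens:
        return [text]

    pieces = []
    current = []      # paragraphs collected for the piece being built
    current_len = 0   # running token count of `current`
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n = count_tokens(para)
        # Flush the current piece if this paragraph would overflow it.
        if current and current_len + n > max_tokens:
            pieces.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```

Splitting on blank lines keeps each piece aligned with the paragraph boundaries the Markdown already preserves, so the cap only kicks in for the occasional long section and never cuts inside a paragraph.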
Yes, the output works with any major RAG framework. Markdown is a de facto interchange format for RAG ingestion: every major framework has a Markdown parser, and the t0md output drops in as-is.