PDF to Markdown for Vector Databases

Embeddings inherit the quality of their input. A chunk that contains a page footer and a half-sentence embeds badly; a chunk that's one clean Markdown section embeds well. Convert your PDFs to Markdown first.

Why convert PDFs to Markdown before embedding?

Vector databases (Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus) all do essentially the same job: store an embedding per chunk and retrieve by vector similarity, typically cosine. What varies is everything that happens before the vector, and that's where PDF input fails. Raw PDF text dumps include page-break noise, column-wrap artefacts, and headers and footers repeated on every page. Each of those pollutes the embedding: part of the 768-, 1024-, or 1536-dimensional vector ends up encoding "page 14 of 87" instead of the semantic content. Converting to Markdown strips that noise. Each chunk you embed becomes a tight representation of one section's meaning, which is exactly what nearest-neighbour retrieval needs.
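To make "retrieve by cosine similarity" concrete, here is a minimal sketch in plain Python. The chunk ids and the three-dimensional vectors are made-up toy values, not real embeddings, and a real vector database uses an approximate index rather than the brute-force scan shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=1):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical toy index: chunk id -> embedding
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "warranty-terms": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]
print(top_k(query, index, k=2))
```

The closer a stored vector points in the same direction as the query, the higher it ranks, which is why noise mixed into a chunk's vector hurts retrieval.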

How to use t0md

Convert each source PDF to Markdown once at ingest. Chunk on heading boundaries (with a token cap as fallback). Embed each chunk with your provider's API. Store the vector with the chunk text and a metadata blob — the document filename, the section heading, the page range — so retrieval can return both the match and where it came from. t0md's HTTP API is callable from any ingest pipeline; the same call works whether you're indexing 10 PDFs or 10,000.

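# Pseudo-pipeline (t0md_convert, split_on_headings, embed, and store are placeholders)
for pdf in pdfs:
    markdown = t0md_convert(pdf)             # HTTP POST /convert
    chunks   = split_on_headings(markdown)   # heading boundaries, token cap as fallback
    for chunk in chunks:
        vec  = embed(chunk.text)             # OpenAI / Cohere / local
        # Keep the chunk text and its provenance alongside the vector
        store.upsert(id=..., vec=vec,
                     meta={"src": pdf, "heading": chunk.heading, "text": chunk.text})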
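The split_on_headings step in the pipeline is left undefined, so here is one way it could look: a minimal sketch using only the standard library, not t0md's own implementation. It assumes ATX-style headings (lines starting with #), approximates tokens by whitespace-separated words, and uses a hypothetical Chunk type with heading and text fields. The max_tokens default of 400 is an arbitrary illustration.

```python
import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^#{1,6}\s+(.*\S)\s*$")

@dataclass
class Chunk:
    heading: str
    text: str

def count_tokens(text):
    # Crude stand-in: whitespace-separated words. Swap in your model's tokenizer.
    return len(text.split())

def _pack(paragraphs, max_tokens):
    """Greedily pack paragraphs into pieces of at most max_tokens each."""
    pieces, current, size = [], [], 0
    for para in paragraphs:
        words = para.split()
        # A single oversized paragraph is cut into word windows (line breaks are lost).
        parts = [para] if len(words) <= max_tokens else [
            " ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)
        ]
        for part in parts:
            n = count_tokens(part)
            if current and size + n > max_tokens:
                pieces.append("\n\n".join(current))
                current, size = [], 0
            current.append(part)
            size += n
    if current:
        pieces.append("\n\n".join(current))
    return pieces

def split_on_headings(markdown, max_tokens=400):
    """Split on headings first; fall back to the token cap inside long sections."""
    sections, heading, body, in_fence = [], "", [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside fenced code
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            sections.append((heading, body))
            heading, body = match.group(1), []
        else:
            body.append(line)
    sections.append((heading, body))

    chunks = []
    for heading, body in sections:
        text = "\n".join(body).strip()
        if not text:
            continue  # skip headings with no body of their own
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
        for piece in _pack(paragraphs, max_tokens):
            chunks.append(Chunk(heading=heading, text=piece))
    return chunks

sample = "# Handbook\n\nIntro paragraph.\n\n## Refunds\n\nRefunds take five days.\n"
for chunk in split_on_headings(sample):
    print(repr(chunk.heading), "->", chunk.text)
```

Here the heading travels as a separate field so it can go into the metadata blob; if you want the heading inside the embedded text as well, prepend it to chunk.text before calling your embedding API.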

Frequently asked questions

Which vector database works best with Markdown chunks?

All of them. The database doesn't care about input format, only that the vectors have the dimensionality the index expects. The win is on the ingestion side: Markdown chunks produce better embeddings regardless of where you store them.

Should I embed Markdown formatting or strip it first?

Embed it. Many modern embedding models have seen plenty of Markdown-flavoured text in training, and the structure generally helps. Stripping to bare prose loses signal about what's a heading vs. body.

What about hybrid search (vectors + keywords)?

Markdown is also better for keyword search: cleaner tokens, less garbage. The same converted file feeds both your vector index and your BM25 / full-text index without separate preprocessing.
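As a rough illustration of the keyword side, the sketch below builds a toy inverted index from the same chunk text you would pass to the embedding call. It is a plain AND-match lookup, not BM25 scoring, and the chunk ids and sample text are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercased alphanumeric runs; Markdown punctuation such as '#' and '*' drops out.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_keyword_index(chunks):
    """Map each token to the set of chunk ids that contain it."""
    index = defaultdict(set)
    for chunk_id, text in chunks.items():
        for token in tokenize(text):
            index[token].add(chunk_id)
    return index

def keyword_search(index, query):
    """Return the ids of chunks containing every token in the query."""
    token_sets = [index.get(tok, set()) for tok in tokenize(query)]
    return set.intersection(*token_sets) if token_sets else set()

# Hypothetical chunks: the same text would also be sent to the embedding model
chunks = {
    "handbook#refunds": "## Refunds\n\nRefunds are issued within **five** business days.",
    "handbook#shipping": "## Shipping\n\nOrders ship within two business days.",
}
keyword_index = build_keyword_index(chunks)
print(keyword_search(keyword_index, "business days"))
print(keyword_search(keyword_index, "refunds"))
```

In production you would hand the chunk text to a full-text engine instead; the point is only that no second cleaning pass is needed between the two indexes.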