Embeddings inherit the quality of their input. A chunk that contains a page footer and a half-sentence embeds badly; a chunk that's one clean Markdown section embeds well. Convert your PDFs to Markdown first.
Vector databases (Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus) all do the same core job: store an embedding per chunk and retrieve by vector similarity, typically cosine. What varies is everything that happens before the vector, and that's where PDF input fails. Raw PDF text dumps include page-break noise, column-wrap artefacts, and headers and footers repeated on every page. Each of those pollutes the embedding: the model spends part of the 768/1024/1536-dimensional vector encoding "page 14 of 87" instead of the semantic content. Converting to Markdown strips all of that. Each chunk you embed becomes a tight representation of one section's meaning, which is exactly what nearest-neighbour retrieval needs.
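To make the retrieval step concrete, here is a minimal sketch of cosine-similarity lookup in plain Python. The three-dimensional vectors and chunk names are made up for illustration; real embeddings have hundreds of dimensions, and a real vector database uses an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """Return the k (chunk_id, score) pairs most similar to query_vec.

    index maps chunk_id -> embedding vector.
    """
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (hypothetical values).
index = {
    "intro": [0.9, 0.1, 0.0],
    "pricing": [0.1, 0.9, 0.1],
    "footer-noise": [0.3, 0.3, 0.9],
}
best = top_k([1.0, 0.0, 0.0], index, k=1)
```

Here `best` holds the single closest chunk. The point of the sketch is that ranking depends only on the vectors, so any noise baked into a chunk's vector at ingest directly shifts its rank.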
Convert each source PDF to Markdown once at ingest. Chunk on heading boundaries (with a token cap as fallback). Embed each chunk with your provider's API. Store the vector with the chunk text and a metadata blob — the document filename, the section heading, the page range — so retrieval can return both the match and where it came from. t0md's HTTP API is callable from any ingest pipeline; the same call works whether you're indexing 10 PDFs or 10,000.
# Pseudo-pipeline
for pdf in pdfs:
    markdown = t0md_convert(pdf)  # HTTP POST /convert
    chunks = split_on_headings(markdown)
    for chunk in chunks:
        vec = embed(chunk.text)  # OpenAI / Cohere / local
        store.upsert(id=..., vec=vec, meta={"src": pdf, "heading": chunk.heading})
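The pipeline leaves `split_on_headings` undefined. One possible sketch follows, assuming ATX-style headings (`#` through `######`) and approximating the token cap with a whitespace word count. The `Chunk` class, the `max_tokens` parameter and its default are assumptions made for illustration, not part of t0md; a production pipeline would more likely count tokens with the embedding provider's own tokenizer.

```python
import re
from dataclasses import dataclass

# ATX heading: one to six '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading: str  # nearest heading above the text ("" for text before any heading)
    text: str     # section body, without the heading line


def split_on_headings(markdown, max_tokens=512):
    """Split Markdown at headings; cut oversized sections into word windows."""
    sections = []  # list of (heading, body_lines)
    heading, lines = "", []
    in_fence = False
    for line in markdown.splitlines():
        # Lines inside fenced code blocks are never headings, even if they start with '#'.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            sections.append((heading, lines))
            heading, lines = match.group(2), []
        else:
            lines.append(line)
    sections.append((heading, lines))

    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        words = body.split()
        if not words:
            continue  # skip headings with no body of their own
        if len(words) <= max_tokens:
            chunks.append(Chunk(heading, body))
            continue
        # Fallback: the section exceeds the cap, so emit fixed-size word windows.
        # This flattens line breaks inside the oversized section.
        for start in range(0, len(words), max_tokens):
            window = " ".join(words[start:start + max_tokens])
            chunks.append(Chunk(heading, window))
    return chunks
```

Each fallback window keeps its section's heading, so the metadata stored alongside the vector still points back to the right place in the document.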
This approach works with all of these vector databases: the database doesn't care about input format, only the embedding shape. The win is on the ingestion side: Markdown chunks produce better embeddings regardless of where you store them.
Embed the Markdown as-is rather than stripping it to plain text first. Most modern embedding models have seen plenty of Markdown-flavoured text in training, and the structure helps. Stripping to bare prose loses signal about what's a heading vs. body.
Markdown is also better for keyword search: cleaner tokens, less garbage. The same converted file feeds both your vector index and your BM25 / full-text index without separate preprocessing.
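As a rough illustration of the keyword side, the sketch below tokenizes Markdown chunks and builds a simple inverted index. It is not BM25, which adds term-frequency and length weighting on top of an index like this, and the chunk ids and sample text are invented for the example.

```python
import re
from collections import defaultdict

# Runs of letters and digits; Markdown markers such as '#' and '*' fall away.
TOKEN_RE = re.compile(r"[^\W_]+")


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_inverted_index(chunks):
    """Map each token to the set of chunk ids that contain it.

    chunks maps chunk_id -> Markdown text.
    """
    index = defaultdict(set)
    for chunk_id, text in chunks.items():
        for token in tokenize(text):
            index[token].add(chunk_id)
    return index


# Hypothetical chunks, as they might come out of the heading splitter.
chunks = {
    "doc1#refunds": "## Refunds\nRefunds are issued within **14 days**.",
    "doc1#shipping": "## Shipping\nOrders ship in 2 days.",
}
inverted = build_inverted_index(chunks)
```

Because the input already has no footers or page numbers, strings such as "page 14 of 87" never enter the index in the first place.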