657 thousand documents later: how we converted ALL of the DOF to Markdown
From 657,867 .doc files of Mexico's Official Journal of the Federation (DOF) to clean Markdown: tools, results, and what's left to do.
DOF-RAG Team
3 posts with this tag
From the downloaded Word file to structured Markdown ready for embeddings: a walkthrough of our complete processing pipeline, covering LibreOffice conversion, custom Lua filters, Gemini image analysis, and a robust directory architecture.
A comparative analysis of tools for converting PDFs to Markdown, and why we chose Marker for our DOF-RAG project.