657 thousand documents later: how we converted ALL of the DOF to Markdown
Converting 657,867 .doc files from Mexico's Official Journal of the Federation (DOF) into clean Markdown: tools, results, and what's left to do.
DOF-RAG Team
From the downloaded Word file to structured Markdown ready for embeddings: a walkthrough of our complete processing pipeline, covering LibreOffice conversion, custom Lua filters, Gemini image analysis, and a robust directory architecture.
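
To make the first two stages concrete, here is a minimal sketch of how the conversion commands for one file might be assembled. It is not the team's actual code: it assumes LibreOffice is driven through `soffice --headless`, that the Lua filters are applied with pandoc (the teaser mentions Lua filters but does not name the tool), and that GitHub-flavored Markdown is the output format. The function names, the file names, and the `.docx` intermediate step are all hypothetical. The Gemini image-analysis stage is left out.

```python
from pathlib import Path


def libreoffice_cmd(doc_path: Path, out_dir: Path) -> list[str]:
    """Build a headless LibreOffice call that converts a .doc file to .docx."""
    return [
        "soffice", "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def pandoc_cmd(docx_path: Path, md_path: Path, lua_filter: Path) -> list[str]:
    """Build a pandoc call that applies a Lua filter (assumed tool, see lead-in)."""
    return [
        "pandoc", str(docx_path),
        "--from=docx",
        "--to=gfm",
        f"--lua-filter={lua_filter}",
        f"--output={md_path}",
    ]


def plan_conversion(doc_path: Path, work_dir: Path, lua_filter: Path) -> list[list[str]]:
    """Return the two commands, in order, for a single source document."""
    docx_path = work_dir / f"{doc_path.stem}.docx"
    md_path = work_dir / f"{doc_path.stem}.md"
    return [
        libreoffice_cmd(doc_path, work_dir),
        pandoc_cmd(docx_path, md_path, lua_filter),
    ]


if __name__ == "__main__":
    # Hypothetical paths, used only to show the shape of the commands.
    for cmd in plan_conversion(Path("in/example.doc"), Path("work"), Path("filters/clean.lua")):
        print(" ".join(cmd))
```

The sketch only builds argument lists, so each command can be inspected or logged before being passed to `subprocess.run(cmd, check=True)`.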