Who Describes DOF Images Best?
We compare six vision models (from the Gemini, GPT, Qwen, and Claude families) on the task of generating image descriptions for RAG indexing of Mexico's Official Journal of the Federation (DOF).
Tracking and documenting progress in the development of retrieval-augmented generation (RAG) systems for the Official Journal of the Federation.
Turning 657,867 .doc files from Mexico's Official Journal of the Federation into clean Markdown: tools, results, and what's left to do.
From the downloaded Word file to structured Markdown ready for embeddings: a walkthrough of our complete processing pipeline, covering LibreOffice conversion, custom Lua filters, Gemini image analysis, and a robust directory architecture.
How the reality of large-scale document processing led us to rethink our strategy: from downloading complete PDFs to obtaining segmented Word files, dramatically reducing computational cost without sacrificing quality.
A detailed analysis of storage projections for the DOF-RAG project, evaluating different embedding dimensions and their scalability implications over a 25-year horizon.
A comparative analysis of three embedding models (Nomic Embed, Gemini, Jina) evaluating speed, quality, and stability in vector search for Mexican official documents.
A comparative analysis of tools for converting PDFs to Markdown, and why we chose Marker for our DOF-RAG project.
An analysis of the challenges encountered while integrating Google's AI models into the DOF-RAG project, from managing evolving libraries to solving API-related issues.
How we solved the lack of context in text chunks to improve the accuracy of our RAG system.
A comparative analysis of different AI models on the task of describing images for the DOF-RAG project.
An initiative to improve the accessibility and understanding of information from the Official Journal of the Federation.