DOF-RAG

Published on

Written by

DOF-RAG Team

Share

When AI Models Describe Images: The Brilliant, The Absurd, and The Comical

A comparative analysis of different AI models in the task of describing images for the DOF-RAG project.

Image for When AI Models Describe Images: The Brilliant, The Absurd, and The Comical

The Challenge of Invisible Information in Official Documents

The Official Journal of the Federation (DOF) publishes thousands of documents each year that contain critical information for citizens, businesses, and government entities. A particular challenge we have identified is the large amount of valuable information contained exclusively in images: data tables, statistical graphs, explanatory infographics, technical diagrams, and maps. This visual information remains “invisible” to traditional search and information retrieval systems.

When building Retrieval Augmented Generation (RAG) systems like DOF-RAG, we face a fundamental problem: if important information is contained in images, but our system can only process text, how can we make that information accessible?

Why Do We Need Textual Descriptions of Images?

The DOF’s digital historical archive contains more than 25,000 images accumulated over approximately 20 years. These images are not simple decorative elements, but frequently contain crucial government information that must be accessible to both humans and automated systems.

There are several approaches to address this problem:

  1. Automatic description extraction (our approach): Using multimodal AI models to generate detailed textual descriptions of the content of each image, which are then indexed along with the document text.
  2. Image embedding systems: Alternatives such as CLIP from Jina AI or Nomic Embed Vision that generate vector representations of images, allowing semantic searches directly on the visual content without the need for intermediate textual descriptions.
  3. Specialized OCR: For images that mainly contain text (such as tables or textual infographics), advanced optical character recognition techniques could be applied.

For our project, we have opted for the first approach: automatic extraction of textual descriptions. This approach allows us to:

  • Easily integrate visual information into our existing text indexes
  • Maintain a coherent processing flow for all types of content
  • Leverage the multimodal comprehension capabilities of the most recent AI models

In our project of automatic extraction of image descriptions, we have tested several AI models to evaluate their ability to generate accurate and useful descriptions. The results have been fascinating, from incredibly detailed descriptions to interpretations that have made us question whether the AI and we were looking at the same image.

The Case of the Mysterious Image

While testing how AI models describe images, we analyzed several photos from a DOF file. For this example, we’ll use a single image (the one we call “_page_337_Figure_4.jpeg”). The results were curious: each model gave very different descriptions of the same image. This made us wonder if these systems are really reliable when used in real situations.

The image in question is the following:

Guarantee Prices Program for Basic Food Products - SEGALMEX

As can be seen, it is an infographic about a guarantee price program for basic food products (corn and beans) from the Mexican government, through SEGALMEX. However, the interpretations we received from the different models were interestingly diverse.

Below are the results obtained with different models, analyzing both semantic accuracy and processing times:

Note on metrics: For each model we present two key values: Time (seconds): How long it took the model to generate the description. Tokens: Basic units of generated text (words or parts of words). The higher the number, the more detailed the response tends to be.

Ollama with gemma3:4b - The Compassionate but Slow Model

Question: What can be observed in this image?

Description:

The image shows a photograph of a young person, presumably a child, sitting in a wheelchair. The child has a serious expression and appears focused. The photo seems to be from an educational or informative context, possibly related to disability awareness or child mobility. No specific place or region is identified, but the composition suggests a message of support, understanding, and possibly, promotion of inclusion.METRICS: Time: 129.4466s | Tokens: 110

The description generated by gemma3:4b presents a coherent and contextually plausible narrative about inclusion and disability. However, after comparing with the original image, we found that the image does not contain any child in a wheelchair. This phenomenon, known in the field as “hallucination,” is particularly concerning when more than two minutes are spent generating completely fictional information.

Same Model, Second Chance

In a second attempt with the same model (gemma3:4b), we obtained:

Description:

The image shows a flowchart that represents the software development process. A life cycle is distinguished that includes phases such as planning, analysis, design, implementation, testing, and maintenance. The diagram highlights continuous iteration and the importance of feedback at each stage of development.METRICS: Time: 121.9237s | Tokens: 137

It is notable to observe how, given the same image, the model goes from describing “a child in a wheelchair” to “a software development flowchart.” This fundamental inconsistency between sequential executions of the same model with identical parameters suggests a significant instability in the visual interpretation process, in addition to maintaining processing times exceeding two minutes.

Gemini - Fast and Accurate

The Gemini model offered a substantially different description:

Description:

The document is a “PROGRAM OF GUARANTEE PRICES FOR BASIC FOOD PRODUCTS: CORN AND BEANS.” The SEGALMEX logo appears with the slogan “MEXICAN FOOD SECURITY.” The document is organized into eight numbered sections that explain that the program seeks to incentivize the income of small and medium agricultural producers and is aimed at corn and bean producers who meet eligibility criteria.METRICS: Time: 3.8s | Tokens: 187

After verification with the original image, we confirm that Gemini’s description corresponds with high accuracy to the actual content of the image: an informative document about a government agricultural support program. In addition to its semantic accuracy, the exceptionally low processing time (less than 4 seconds) and the generation of a higher number of tokens than the local alternatives stand out.

In additional tests with other images, Gemini consistently maintained this level of performance, combining detailed and accurate descriptions with notably efficient response times. During the time that local models required to describe a single image (approximately 2 minutes), Gemini could effectively process about 30 images.

However, it is important to note that the use of Gemini is subject to the limitations of its API, which currently allows only 15 requests per minute. This restriction forced us to implement a batch processing system with cooling periods between them. Despite this limitation, Gemini’s speed and accuracy remain far superior to local alternatives, making it the preferred option for productive environments where descriptive quality is a priority.

When Models Answer the Question… with Another Question

Some Ollama models decided it was more interesting to give us instructions on how we should respond, instead of describing the image:

Response:

Summarize its content because it has a more complete answer. Describe the type of image and its likely purpose, but don’t go into minor visual details, but make sure to IDENTIFY all important elements.METRICS: Time: 87.6s | Tokens: 43

It’s as if the model were saying: “Describe the image? No, thanks. I prefer to tell you how you should do it.” The audacity! And to top it off, it took almost 1.5 minutes to give us these uninvited instructions.

The Language Barrier: Moondream Only Speaks English (and Slowly)

Another challenge we faced was with the Moondream models, which only generate descriptions in English. We had to perform post-processing to translate their responses to Spanish. An example of their original descriptions:

The image presents a vibrant infographic that is a riot of colors and shapes. At the heart of the infographic, a person is depicted in a dynamic pose, suggesting movement or action. The person’s head is tilted to one side, adding a dynamic element to the otherwise static structure.METRICS: Time: 145.7s | Tokens: 63

Which translated would be: “The image presents a vibrant infographic that is a burst of colors and shapes. At the heart of the infographic, a person is depicted in a dynamic pose, suggesting movement or action. The person’s head is tilted to one side, adding a dynamic element to the otherwise static structure.”

Burst of colors? Person in a dynamic pose? Are we sure this model wasn’t describing a Pepsi ad from the 90s? And to make matters worse, it took almost 2.5 minutes to generate this colorful fantasy.

Performance Comparison: David vs. Goliath

When comparing local models (Ollama) with cloud services (Gemini), the difference is overwhelming:

ModelAverage TimeTokens GeneratedEstimated AccuracyLimitations
Gemini3-5 seconds150-20085-95%• 15 requests/minute
• 1500 requests/day
• Requires internet
Ollama (gemma3:4b)120-130 seconds100-14020-30%• No API limits
• Performance limited by local hardware
Moondream140-150 seconds60-8015-25%• No API limits
• Only generates in English

These metrics demonstrate not only the superiority in speed of cloud models (30-40 times faster), but also in the quality and accuracy of the generated descriptions. Local models, although brave in their attempt, seem to be playing in a completely different league.

Despite the limitation of 15 requests per minute imposed by the Gemini API, its performance remains far superior. In an hour of processing with cooling periods included, Gemini can process approximately 750-800 images with accurate descriptions, while a local model would only manage to process about 25-30 images in the same period, with significantly less reliable results.

However, Gemini’s daily limit of 1500 requests represents a significant obstacle for large-scale projects. In our tests with DOF documents from the last three years, we identified more than 4,000 images that would require processing. Considering that the DOF has been published in digital format for approximately 20 years, we estimate that the total volume could exceed 25,000-30,000 images, which would make it unfeasible to exclusively use the free version of Gemini for a complete project.

The following table illustrates the estimated time it would take to process different volumes of images with each model:

Image VolumeGeminiOllama (gemma3:4b)Hybrid
1,000 images~1.5 days*~36 hours~20 hours
5,000 images~3.5 days*~7.5 days~4 days
30,000 images~20 days*~45 days~30 days

*Limited by the daily quota of 1500 requests

This reality leads us to consider hybrid approaches that combine the accuracy of cloud models for critical documents, with the massive processing of local models for the rest of the content.

Technical Aspects: How the Image Description System Works

To better understand the results we have presented, it is important to know the architecture of the system we used to test the different AI models. This analysis allows us to understand not only the results, but also the internal functioning of the solution.

General System Architecture

Our system follows an interface-oriented design pattern that allows different AI providers to be interchanged without modifying the main code. This flexible architecture has allowed us to compare several models under the same test conditions.

Main Components:

  1. Abstract client interface (AbstractAIClient):
    • Defines the standard methods that all AI providers must implement
    • Ensures consistency in the way images are processed and descriptions generated
    • Provides methods to configure parameters such as temperature, maximum tokens, etc.
  2. Concrete implementations:
    • GeminiClient: Client for Google Gemini, optimized for visible text extraction and official documents
    • OllamaClient: Allows using local models like gemma3:4b, eliminating dependency on external APIs
    • OpenAIClient: Although we implemented the connection with GPT-4o models for image analysis, it is important to highlight that we did not use this implementation in our final tests due to budget limitations. The cost per request of the OpenAI API was prohibitive for a project of this scale, especially considering the high volume of images we needed to process. We focused on free or lower-cost alternatives that could scale to our needs.
  3. Processing utilities:
    • File management system for batch processing
    • Checkpoint mechanism to resume interrupted processing
    • Detailed metrics logging to analyze performance

Technical Performance Analysis

When examining the code and execution time metrics, we found technical explanations for the performance differences:

  1. Local vs. cloud models:
# Example of Ollama client configuration (local)
client = OllamaClient(
    model="gemma3:4b",          # Local model
    max_tokens=512,             # Maximum output tokens
    temperature=0.5,            # Creativity control
    num_ctx=8192                # Context size
)

Local models like gemma3:4b process everything on local hardware, which explains the processing times of 120-130 seconds per image, while cloud models use specialized high-performance infrastructure.

  1. Resource locking:
api_lock = threading.Lock()

# Inside the process_image method
with api_lock:
    # API call

This synchronization is necessary for cloud API clients, but not for local models, which can affect performance in batch processing.

  1. Parameter configuration:
ParameterEffect on generation
TemperatureLower values (0.2-0.4) produce more deterministic responses
Max tokensLimits the maximum length of the response
Top_pControls diversity through nucleus sampling
Top_kRestricts vocabulary to the k most probable words

During our tests, we kept these parameters constant to ensure a fair comparison between models.

How We Process Images

Image processing follows this flow:

  1. Image preparation: Conversion to byte format compatible with each API
  2. Prompt generation: We use a standardized prompt in Spanish for all models
  3. Model call: Sending the image and prompt to the corresponding model
  4. Response processing: Validation and formatting of the generated description
  5. Storage: Saving the description in a .txt file alongside the original image

For the Gemini model, which proved to be the most efficient, the actual code we used is the following:

# Generation process with Gemini (extract from actual code)
try:
    # Process the image to get bytes and mime type
    image_bytes, mime_type = self.process_image_to_bytes(image_path)
    
    # Prepare content parts
    contents = [
        # First part is the text prompt
        types.Part.from_text(text=self.question),
        # Second part is the image
        types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
    ]
    
    with api_lock:
        # Use non-streaming API
        response = self._client.models.generate_content(
            model=self.model,
            contents=contents,
            config=self.get_generate_config()
        )
        description = response.text if hasattr(response, 'text') else str(response)
except Exception as e:
    error_msg = str(e)
    result["error"] = f"Error processing image with Gemini: {error_msg}"

To manage the limitations of the Gemini API (15 requests per minute), we implemented careful batch processing handling in our FileUtil system:

# Actual extract from process_images in FileUtil
for batch_start in range(start_index, len(image_paths), self.batch_size):
    batch_end = min(batch_start + self.batch_size, len(image_paths))
    
    print(f"\nProcessing batch {batch_start//self.batch_size + 1} ({batch_end}/{len(image_paths)} images)")
    
    # Process each image in the batch
    for i, img_path in enumerate(batch):
        # Code for processing each image...
    
    # Cooling period between batches, except if it's the last batch
    if batch_end < len(image_paths):
        print(f"\nCooling down for {self.cooling_period} seconds...")
        # Check for interruption during cooling every second
        for i in range(self.cooling_period):
            if interrupt_flag_file and os.path.exists(interrupt_flag_file):
                # Interruption handling...
                pass
            time.sleep(1)

This approach allows us to maximize API usage without exceeding its limits, maintaining efficient and continuous processing.

Technical Challenges and Solutions

During our tests we encountered several technical challenges:

  1. Models that don’t answer the question: Some Ollama models interpreted the prompt as instructions on how to respond, rather than answering the question. This is due to problems in the instruction format and limitations in context understanding by smaller models.
  2. Language barrier: Moondream models only respond in English, which required implementing a post-processing system to translate responses to Spanish.
  3. Accuracy vs. speed: The most accurate models (Gemini) were also the fastest, while local models, despite being slower, produced less accurate responses.
  4. Gemini API limits:
    • Per-minute limit: The Gemini API allows only 15 requests per minute, which forced us to implement considerable cooling periods between batches to avoid errors. In our tests, a 30-second cooling after each batch of 10 images proved optimal.
    • Daily limit: Beyond the per-minute limit, there is a 1500 daily request limit that represents a significant barrier for large-scale projects. In a context like the DOF, where we have identified more than 4,000 images just in the documents of the last 3 years, this limit is insufficient.
    • Historical volume: Considering that the DOF has been operating for approximately 20 years in digital format, the total estimated volume of images could exceed 25,000-30,000, which would make complete processing with the free Gemini API impossible in a reasonable time (it would require more than 20 days dedicated exclusively to this task).
  5. API failure management: We implemented checkpoint and recovery mechanisms to manage possible failures or interruptions during processing, which is critical in long-duration processes with API limitations.

These challenges led us to consider a hybrid approach, using Gemini for critical documents where accuracy is fundamental, and local models for the initial massive processing of historical documents, despite their lower accuracy.

This hybrid approach, which is currently in the prototype phase, would be implemented based on the current architecture of our system, which follows an interface-oriented design pattern as shown in our code:

# Configuration of different models (based on our actual code)
gemini_client = GeminiClient(
    model="gemini-2.0-flash",
    max_tokens=256,
    temperature=0.6,
    top_p=0.6,
    top_k=20,
    response_mime_type="text/plain",
    api_key=os.getenv("GEMINI_API_KEY")
)

ollama_client = OllamaClient(
    model="gemma3:4b",
    max_tokens=512,
    temperature=0.5,
    top_p=0.5,
    top_k=20,
    num_ctx=8192,
    api_key=None
)

The implementation of this mixed strategy allows us to optimize the use of resources while complying with the limits established by the APIs of cloud services.

Conclusions: AI and the Art of “Creative Interpretation”

Our tests with automated image description systems have shown that, despite significant advances in multimodal models, the reliability and consistency of these systems continues to be a notable challenge in the current state of technology.

On one hand, Gemini demonstrates impressive capabilities, generating fast, detailed descriptions with high semantic accuracy. This superior performance positions it as the preferred option for production applications where precision and efficiency are critical. However, the limitations of its API (15 requests per minute and especially the limit of 1500 daily requests) represent a significant obstacle for large-scale projects, such as the complete processing of the DOF’s historical archive with its thousands of images accumulated over two decades.

In contrast, local models such as those executed through Ollama, despite the inherent advantages of their independence from external services and having no per-minute request limitations, present substantial limitations in both speed and interpretive accuracy. Our empirical data demonstrates that these models frequently confabulate non-existent information or radically reinterpret visual elements between consecutive executions.

From a practical implementation perspective, the decision between cloud or local models must carefully consider the balance between accuracy, speed, API limits, and data privacy. For extensive projects like ours, a hybrid approach seems to be the most viable solution: using Gemini for critical or recent documents where accuracy is fundamental, complemented with local models for the massive processing of historical files where we can accept a greater margin of error in exchange for overcoming quota limitations.

And, as a final note, the enigma remains of what “_page_337_Figure_4.jpeg” really showed. Did it contain a moving portrait of a child in a wheelchair? A sophisticated flowchart about software development? Or perhaps a government document about agricultural guarantees for corn and bean producers? The evidence suggests that the third option is correct, but it is fascinating to contemplate that, in the interpretive universe of some AI models, all three realities could coexist simultaneously in a sort of quantum superposition of visual meanings that only artificial intelligence can perceive.


This post is part of our project of automatic extraction of image descriptions for documents from the Official Journal of the Federation. For more information about the system’s architecture and operation, check our repository.

Comentarios