Gao Dalie (Ilyass)

Gemma 3 + MistralOCR + RAG Just Revolutionized Agent OCR Forever

Not a month ago, I made a video about Ollama-OCR, and many of you liked it.

One follower ran into a problem with an OCR chatbot and asked me if I could help, and I thought this video might be useful to many developers.

Good news! Mistral AI released Mistral OCR, a new product billed as “the best OCR model in the world.”

Mistral OCR is an optical character recognition API that sets a new standard for document understanding. Unlike other models, Mistral OCR understands every document element (media, text, tables, formulas) with unprecedented accuracy and cognitive capabilities. It takes images and PDFs as input and extracts their content as ordered, interleaved text and images.

Mistral OCR is, therefore, an ideal model to be used in conjunction with RAG systems that take multimodal documents such as slides or complex PDFs as input.

“But Gao, Mistral-OCR alone is not enough to create a powerful OCR agent.”

I know, and you are right. Google has just updated its Gemma series of open-source models: Gemma 3, optimized for multimodality and long context, has been released, and the 27B version performs comparably to Gemini-1.5-Pro.

Google claims that Gemma 3 is the “world’s best single-accelerator model,” outperforming competitors such as Meta, DeepSeek, and OpenAI, particularly when running on a host with a single GPU. The new model’s visual encoder has also been enhanced to support high-resolution and non-square images.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Check my video on YouTube

I open the Streamlit app and enter API keys for the Mistral and Google APIs via the sidebar. If they are valid, the Mistral client is initialized and the Google API key is verified.

Once the APIs are connected, I upload a PDF containing a table, an invoice, text, and charts (in the video I also demo an image). After the upload, I click the Process PDF button, the PDF is displayed in the sidebar, and the app creates a temporary directory to manage files.

If anything goes wrong during the upload, the app catches the exception and raises a ValueError with a clear message. If you upload an image instead, it converts the image into Markdown and loops through each key-value pair in images_dict; after replacing all image placeholders, it returns the modified Markdown string with the embedded base64 images.

Then, it processes multiple pages of OCR-extracted Markdown and their respective images. It creates an empty list, markdowns, to store the processed Markdown content from each page. It iterates through each page's images, extracting each image's ID as the key and its base64-encoded string as the value, and appends the updated Markdown to the markdowns list. It then combines all processed Markdown sections, ensuring a clean separation between pages. Finally, it inspects the document source type to determine how to process the document.

By the end of this video, you will understand what makes Mistral-OCR and Gemma 3 unique, how Gemma 3 is trained, and how we can combine Gemma 3, Mistral-OCR, and RAG to create a powerful OCR agent.

Mistral OCR is naturally multilingual and multimodal, and its lightweight design makes it much faster than similar models: a single node can process up to 2,000 pages of documents per minute, and local deployment options keep sensitive data in-house.

What’s more, it can convert the extracted data into Markdown. This is revolutionary because AI models can easily understand Markdown, so they can make much better sense of document data.
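To make this concrete, here is a hypothetical sketch of what one page of OCR output might look like as Markdown (illustrative only, not real API output). Note the ![id](id) image placeholder convention, which the helper functions later in this article rely on:

# Hypothetical OCR output for one invoice page, rendered as Markdown
# (illustrative only; the exact text depends on your document).
sample_page_markdown = """# INVOICE #2024-001

| Item       | Qty | Unit Price |
| ---------- | --- | ---------- |
| Consulting | 10  | $150       |

![img-0.jpeg](img-0.jpeg)
"""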

What Makes Gemma 3 Unique?

Gemma 3 is not just about size; it delivers state-of-the-art performance for its scale. In preliminary evaluations, it supports over 35 languages out of the box and is pre-trained on over 140 languages. It can seamlessly analyze images, text, and short videos, and its massive 128K-token context window allows your applications to process and understand large amounts of data at once.

How Gemma 3 Is Trained

Gemma 3 is trained with distillation techniques and is optimized through reinforcement learning and model merging during pre-training and post-training.

This approach improves performance in mathematics, coding, and instruction following.

Moreover, Gemma 3 uses a brand new tokenizer, provides support for more than 140 languages, and is trained on Google TPUs using the JAX framework: the 1B model on 2T tokens, the 4B on 4T, the 12B on 12T, and the 27B on 14T.

In the post-training stage, Gemma 3 mainly uses 4 components:

Distillation from a larger instruct model into the Gemma 3 pre-trained checkpoint
Reinforcement learning from human feedback (RLHF) to align model predictions with human preferences
Reinforcement learning from machine feedback (RLMF) to enhance mathematical reasoning
Reinforcement learning from execution feedback (RLEF) to improve coding ability

These updates significantly improved the model’s math, programming, and instruction-following capabilities, allowing Gemma 3 to score 1338 on LMArena.

The instruction-tuned version of Gemma 3 uses the same dialogue format as Gemma 2, so developers do not need to update their tooling and can feed it plain text directly.
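For reference, here is a minimal sketch of that dialogue format, using the <start_of_turn>/<end_of_turn> control tokens from Google's Gemma documentation (SDKs and chat APIs normally apply these for you):

# Minimal sketch of the Gemma dialogue format shared by Gemma 2 and Gemma 3.
# The control tokens come from Google's Gemma docs; most SDKs add them for you.
def format_gemma_prompt(user_message: str) -> str:
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(format_gemma_prompt("Summarize this invoice."))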

Let’s start coding

Let us now walk through, step by step, how to build this app. First, we install the libraries that support the models by running pip against the requirements file:

pip install -r requirements.txt 
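The article doesn’t show the contents of requirements.txt, but based on the imports used below, a plausible version (package names only; pin versions as needed) would be:

streamlit
mistralai>=1.0
google-generativeai
Pillow
python-dotenv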

Once installed, we import the important dependencies, such as mistralai, google-generativeai, and streamlit:

import streamlit as st
import base64
import tempfile
import os
from mistralai import Mistral
from PIL import Image
import io
from mistralai import DocumentURLChunk, ImageURLChunk
from mistralai.models import OCRResponse
from dotenv import find_dotenv, load_dotenv
import google.generativeai as genai

I designed this function to securely upload a PDF to Mistral’s OCR API and get a signed URL for further processing. I first check if the client object is provided; if it’s None, I raise an error since the function requires a properly initialized Mistral API client.

I create a temporary directory, define a file path, and write the PDF content. I then open the file in "rb" mode and upload it to the Mistral API using the client, specifying the filename, content, and "purpose" as "ocr".

After a successful upload, I retrieve a signed URL from the client, enabling access to the file. If an error occurs, I catch the exception and raise a ValueError with a clear message. Finally, I ensure cleanup by deleting the temporary file if it exists.

OCR Processing Functions

def upload_pdf(client, content, filename):
    """Uploads a PDF to Mistral's API and retrieves a signed URL for processing."""
    if client is None:
        raise ValueError("Mistral client is not initialized")
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = os.path.join(temp_dir, filename)
        with open(temp_path, "wb") as tmp:
            tmp.write(content)
        try:
            with open(temp_path, "rb") as file_obj:
                file_upload = client.files.upload(
                    file={"file_name": filename, "content": file_obj},
                    purpose="ocr"
                )
            signed_url = client.files.get_signed_url(file_id=file_upload.id)
            return signed_url.url
        except Exception as e:
            raise ValueError(f"Error uploading PDF: {str(e)}")
        finally:
            if os.path.exists(temp_path):
                os.remove(temp_path)

Then I created the replace_images_in_markdown function, which takes a Markdown string and a dictionary mapping image names to base64-encoded images. I iterate through the dictionary, where each key represents an image placeholder and each value contains the corresponding base64 string.

I use .replace() to find each ![img_name](img_name) placeholder in the Markdown and replace it with ![img_name](base64_str). This ensures that placeholders are converted into embedded base64 images. Finally, I return the updated Markdown string with the images replaced.

def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
    """Replace image placeholders with base64 encoded images in markdown."""
    for img_name, base64_str in images_dict.items():
        markdown_str = markdown_str.replace(
            f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
        )
    return markdown_str
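A quick usage sketch with made-up values (the placeholder name and truncated base64 string are hypothetical):

md = "Here is a chart: ![img-0.jpeg](img-0.jpeg)"
images = {"img-0.jpeg": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."}
print(replace_images_in_markdown(md, images))
# Here is a chart: ![img-0.jpeg](data:image/jpeg;base64,/9j/4AAQSkZJRg...)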

Let’s define the get_combined_markdown function to process multiple pages of OCR-extracted Markdown together with their images. I create an empty list, markdowns, to store the processed Markdown from each page, then iterate through ocr_response.pages, collecting image data by mapping image IDs to their base64-encoded representations.

I replace image placeholders in each page’s Markdown using replace_images_in_markdown and append the modified content to markdowns. Finally, I join all processed Markdown sections with "\n\n".join(markdowns), ensuring a clear separation between pages.

def get_combined_markdown(ocr_response: OCRResponse) -> str:
    """Combine markdown from all pages with their respective images."""
    markdowns: list[str] = []
    for page in ocr_response.pages:
        image_data = {}
        for img in page.images:
            image_data[img.id] = img.image_base64
        markdowns.append(replace_images_in_markdown(page.markdown, image_data))
    return "\n\n".join(markdowns)

Then, I create the process_ocr function, which checks whether the client is provided; if not, I raise an error because an initialized Mistral client is required. I inspect document_source to determine whether to process a document URL or an image URL: if the type is "document_url", I call client.ocr.process() with a DocumentURLChunk, and if it is "image_url", I use an ImageURLChunk.

I specify "mistral-ocr-latest" as the model and set include_image_base64=True so the response includes base64-encoded images. If the source type is unrecognized, I raise a ValueError with a clear error message.

def process_ocr(client, document_source):
    """Process document with OCR API based on source type"""
    if client is None:
        raise ValueError("Mistral client is not initialized")
    if document_source["type"] == "document_url":
        return client.ocr.process(
            document=DocumentURLChunk(document_url=document_source["document_url"]),
            model="mistral-ocr-latest",
            include_image_base64=True
        )
    elif document_source["type"] == "image_url":
        return client.ocr.process(
            document=ImageURLChunk(image_url=document_source["image_url"]),
            model="mistral-ocr-latest",
            include_image_base64=True
        )
    else:
        raise ValueError(f"Unsupported document source type: {document_source['type']}")
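Putting these pieces together, a hypothetical end-to-end call might look like this, where mistral_client, signed_url, and img_str stand in for values produced earlier:

# PDF: pass the signed URL returned by upload_pdf()
pdf_source = {"type": "document_url", "document_url": signed_url}
# Image: pass a base64 data URL instead
img_source = {"type": "image_url", "image_url": f"data:image/png;base64,{img_str}"}

ocr_response = process_ocr(mistral_client, pdf_source)
document_markdown = get_combined_markdown(ocr_response)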

Let’s initialize the Google Gemini API by configuring it with an API key. I check whether the context is empty or too short (fewer than 10 characters) and return an error if so. I then create a prompt that includes the document’s context and the query to guide the model’s response.

I configure the model with parameters like temperature, top_p, and safety settings, and generate the response using model.generate_content(). If an error occurs, I catch it, print the error details, and return an error message.

def generate_response(context, query):
    """Generate a response using Google Gemini API"""
    try:
        # Initialize the Google Gemini API
        genai.configure(api_key=google_api_key)

        # Check for empty context
        if not context or len(context) < 10:
            return "Error: No document content available to answer your question."

        # Create a prompt with the document content and query
        prompt = f"""I have a document with the following content:

{context}

Based on this document, please answer the following question:

{query}

If you can find information related to the query in the document, please answer based on that information. If the document doesn't specifically mention the exact information asked, please try to infer from related content or clearly state that the specific information isn't available in the document.
"""

        # Print for debugging
        print(f"Sending prompt with {len(context)} characters of context")
        print(f"First 500 chars of context: {context[:500]}...")

        # Generate response
        model = genai.GenerativeModel('gemma-3-27b-it')
        generation_config = {
            "temperature": 0.4,
            "top_p": 0.8,
            "top_k": 40,
            "max_output_tokens": 2048,
        }
        safety_settings = [
            {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_ONLY_HIGH"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_ONLY_HIGH"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_ONLY_HIGH"},
        ]
        response = model.generate_content(
            prompt,
            generation_config=generation_config,
            safety_settings=safety_settings
        )
        return response.text
    except Exception as e:
        print(f"Error generating response: {str(e)}")
        import traceback
        print(traceback.format_exc())
        return f"Error generating response: {str(e)}"
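With the document Markdown in hand (document_markdown from the sketch above), querying it is a single call; the question is, of course, just an example:

answer = generate_response(document_markdown, "What is the total amount on the invoice?")
print(answer)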

Finally, we create the Streamlit app that enables users to upload documents or images for OCR processing. Users provide API keys for the Mistral and Google APIs via the sidebar; if they are valid, the Mistral client is initialized, the Google API key is verified, and documents can be uploaded as a PDF, an image, or a URL. The app processes the content with OCR, extracting the text of each page and storing it for later use.

Once a document is loaded, users can ask questions about the content, and the app generates responses using the Google Gemini API. All chat messages are stored in the session state. Streamlit also handles errors, such as missing API keys or processing failures, and provides warnings when features are incomplete.
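Note that main() below calls three helpers, initialize_mistral_client, test_google_api, and display_pdf, and relies on module-level api_key / google_api_key globals, none of which appear elsewhere in this excerpt. Here is a minimal sketch of what they might look like, under my own assumptions about their behavior:

# Assumed module-level setup: load keys from a .env file (dotenv is imported above).
load_dotenv(find_dotenv())
api_key = os.getenv("MISTRAL_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")

def initialize_mistral_client(key):
    """Create a Mistral client; return None (and show an error) if the key fails."""
    try:
        client = Mistral(api_key=key)
        client.models.list()  # lightweight call to verify the key actually works
        return client
    except Exception as e:
        st.sidebar.error(f"Mistral initialization failed: {e}")
        return None

def test_google_api(key):
    """Return (is_valid, message) by trying to list models with this key."""
    try:
        genai.configure(api_key=key)
        models = list(genai.list_models())
        return True, f"connected ({len(models)} models visible)"
    except Exception as e:
        return False, str(e)

def display_pdf(pdf_path):
    """Render a PDF inline by embedding it as a base64-encoded iframe."""
    with open(pdf_path, "rb") as f:
        base64_pdf = base64.b64encode(f.read()).decode("utf-8")
    st.markdown(
        f'<iframe src="data:application/pdf;base64,{base64_pdf}" '
        f'width="100%" height="600" type="application/pdf"></iframe>',
        unsafe_allow_html=True,
    )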

def main():
    global api_key, google_api_key
    st.set_page_config(page_title="Document OCR & Chat", layout="wide")

    # Sidebar: Authentication for API keys
    with st.sidebar:
        st.header("Settings")

        # API key inputs
        api_key_tab1, api_key_tab2 = st.tabs(["Mistral API", "Google API"])
        with api_key_tab1:
            # Get Mistral API key from environment or user input
            user_api_key = st.text_input("Mistral API Key", value=api_key if api_key else "", type="password")
            if user_api_key:
                api_key = user_api_key
                os.environ["MISTRAL_API_KEY"] = api_key
        with api_key_tab2:
            # Get Google API key
            user_google_api_key = st.text_input(
                "Google API Key",
                value=google_api_key if google_api_key else "",
                type="password",
                help="API key for Google Gemini to use for response generation"
            )
            if user_google_api_key:
                google_api_key = user_google_api_key
                os.environ["GOOGLE_API_KEY"] = google_api_key

        # Initialize Mistral client with the API key
        mistral_client = None
        if api_key:
            mistral_client = initialize_mistral_client(api_key)
            if mistral_client:
                st.sidebar.success("✅ Mistral API connected successfully")

        # Google API key validation
        if google_api_key:
            is_valid, message = test_google_api(google_api_key)
            if is_valid:
                st.sidebar.success(f"✅ Google API {message}")
            else:
                st.sidebar.error(f"❌ Google API: {message}")
                google_api_key = None

        # Display warnings for missing API keys
        if not api_key or mistral_client is None:
            st.sidebar.warning("⚠️ Valid Mistral API key required for document processing")
        if not google_api_key:
            st.sidebar.warning("⚠️ Google API key required for chat functionality")

        # Initialize session state
        if "messages" not in st.session_state:
            st.session_state.messages = []
        if "document_content" not in st.session_state:
            st.session_state.document_content = ""
        if "document_loaded" not in st.session_state:
            st.session_state.document_loaded = False

        # Document upload section
        st.subheader("Document Upload")

        # Only show document upload if Mistral client is initialized
        if mistral_client:
            input_method = st.radio("Select Input Type:", ["PDF Upload", "Image Upload", "URL"])
            document_source = None

            if input_method == "URL":
                url = st.text_input("Document URL:")
                if url and st.button("Load Document from URL"):
                    document_source = {"type": "document_url", "document_url": url}

            elif input_method == "PDF Upload":
                uploaded_file = st.file_uploader("Choose PDF file", type=["pdf"])
                if uploaded_file and st.button("Process PDF"):
                    content = uploaded_file.read()
                    # Save the uploaded PDF temporarily for display purposes
                    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
                        tmp.write(content)
                        pdf_path = tmp.name
                    try:
                        # Prepare document source for OCR processing
                        document_source = {
                            "type": "document_url",
                            "document_url": upload_pdf(mistral_client, content, uploaded_file.name)
                        }
                        # Display the uploaded PDF
                        st.header("Uploaded PDF")
                        display_pdf(pdf_path)
                    except Exception as e:
                        st.error(f"Error processing PDF: {str(e)}")
                    # Clean up the temporary file
                    if os.path.exists(pdf_path):
                        os.unlink(pdf_path)

            elif input_method == "Image Upload":
                uploaded_image = st.file_uploader("Choose Image file", type=["png", "jpg", "jpeg"])
                if uploaded_image and st.button("Process Image"):
                    try:
                        # Display the uploaded image
                        image = Image.open(uploaded_image)
                        st.image(image, caption="Uploaded Image", use_column_width=True)
                        # Convert image to base64
                        buffered = io.BytesIO()
                        image.save(buffered, format="PNG")
                        img_str = base64.b64encode(buffered.getvalue()).decode()
                        # Prepare document source for OCR processing
                        document_source = {
                            "type": "image_url",
                            "image_url": f"data:image/png;base64,{img_str}"
                        }
                    except Exception as e:
                        st.error(f"Error processing image: {str(e)}")

            # Process document if source is provided
            if document_source:
                with st.spinner("Processing document..."):
                    try:
                        ocr_response = process_ocr(mistral_client, document_source)
                        if ocr_response and ocr_response.pages:
                            # Extract all text without page markers for clean content
                            raw_content = []
                            for page in ocr_response.pages:
                                page_content = page.markdown.strip()
                                if page_content:  # Only add non-empty pages
                                    raw_content.append(page_content)
                            # Join all content into one clean string for the model
                            final_content = "\n\n".join(raw_content)
                            # Also create a display version with page numbers for the UI
                            display_content = []
                            for i, page in enumerate(ocr_response.pages):
                                page_content = page.markdown.strip()
                                if page_content:
                                    display_content.append(f"Page {i+1}:\n{page_content}")
                            display_formatted = "\n\n----------\n\n".join(display_content)
                            # Store both versions
                            st.session_state.document_content = final_content  # Clean version for the model
                            st.session_state.display_content = display_formatted  # Formatted version for display
                            st.session_state.document_loaded = True
                            # Show success information about extracted content
                            st.success(f"Document processed successfully! Extracted {len(final_content)} characters from {len(raw_content)} pages.")
                        else:
                            st.warning("No content extracted from document.")
                    except Exception as e:
                        st.error(f"Processing error: {str(e)}")

    # Main area: Display chat interface
    st.title("Document OCR & Chat")

    # Document preview area
    if "document_loaded" in st.session_state and st.session_state.document_loaded:
        with st.expander("Document Content", expanded=False):
            # Show the display version with page numbers
            if "display_content" in st.session_state:
                st.markdown(st.session_state.display_content)
            else:
                st.markdown(st.session_state.document_content)

        # Chat interface
        st.subheader("Chat with your document")

        # Display chat messages
        for message in st.session_state.messages:
            with st.chat_message(message["role"]):
                st.markdown(message["content"])

        # Input for user query
        if prompt := st.chat_input("Ask a question about your document..."):
            # Check if Google API key is available
            if not google_api_key:
                st.error("Google API key is required for generating responses. Please add it in the sidebar settings.")
            else:
                # Add user message to chat history
                st.session_state.messages.append({"role": "user", "content": prompt})
                # Display user message
                with st.chat_message("user"):
                    st.markdown(prompt)
                # Show thinking spinner
                with st.chat_message("assistant"):
                    with st.spinner("Thinking..."):
                        # Get document content from session state
                        document_content = st.session_state.document_content
                        # Generate response directly
                        response = generate_response(document_content, prompt)
                        # Display response
                        st.markdown(response)
                # Add assistant message to chat history
                st.session_state.messages.append({"role": "assistant", "content": response})
    else:
        # Show a welcome message if no document is loaded
        st.info("👈 Please upload a document using the sidebar to start chatting.")

if __name__ == "__main__":
    main()

Conclusion

The release of Mistral OCR and Gemma 3 is not only a powerful move by Mistral AI and Google in the field of OCR and state-of-the-art AI performance; it also significantly improves memory efficiency and marks another leap forward for AI in document intelligence.

For developers, it is a powerful tool that can be used out of the box; for enterprises, it is the golden key to unlocking the value of unstructured data. For ordinary people like us, it is also a useful tool for recognizing manuscripts, invoices, contract photos, and more.

If you think this article could help your friends, please forward it to them.

🧙‍♂️ I am a Generative AI expert! If you want to collaborate on a project, drop an inquiry here or book a 1-on-1 Consulting Call With Me.

I would highly appreciate it if you:

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai

Support the Content (every Dollar goes back into the video): https://buymeacoffee.com/gaodalie98d

Subscribe to the Newsletter for free: https://substack.com/@gaodalie
