Posted on Jan 4

Try Multimodal Search with ColQwen2!

In this article, we introduce how to use ColQwen2.

ColQwen2 is based on Qwen2-VL-2B and generates ColBERT-style multi-vector representations, enabling highly accurate searches across text and image inputs.
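To make "ColBERT-style multi-vector" concrete: each query and each page is represented by several vectors rather than one, and relevance is computed by late interaction (often called MaxSim). The following is a minimal pure-Python sketch of that idea, assuming dot-product similarity. The function name maxsim_score and the toy 2-D vectors are made up for illustration; the real model produces many higher-dimensional vectors, and the library's score_multi_vector used later in this article works on batched tensors.

```python
def maxsim_score(query_vectors, page_vectors):
    """Late-interaction score: for each query vector, take its best
    dot-product match among the page vectors, then sum those maxima."""
    total = 0.0
    for q in query_vectors:
        best = max(sum(a * b for a, b in zip(q, p)) for p in page_vectors)
        total += best
    return total


# Toy 2-D embeddings (invented for illustration, not real model output)
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.3, 0.3], [0.1, 0.2]]

print(maxsim_score(query, page_a))  # best matches: 0.9 and 0.8
print(maxsim_score(query, page_b))  # best matches: 0.3 and 0.3
```

Because every query vector only needs one good match somewhere on the page, page_a scores higher than page_b here.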

We will test ColQwen2 using Google Colab with an A100 GPU.

Library Installation

First, install the necessary libraries:

!pip install git+https://github.com/illuin-tech/colpali
!pip install pymupdf

Preparing Image Data

Next, prepare the image data. For this tutorial, we’ll use the ColPali paper (arXiv 2407.01449v3) as a PDF file.

Using pymupdf, we’ll render each page of the PDF file as a PNG image:

import os

import pymupdf

# Constants
DPI = 350  # Can be modified as needed


def convert_pdf_to_images(pdf_path, output_dir):
    """
    Convert PDF pages to images.

    Args:
        pdf_path (str): Path to the PDF file.
        output_dir (str): Directory to save images.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    pdf_document = pymupdf.open(pdf_path)
    for page_number in range(pdf_document.page_count):
        page = pdf_document[page_number]
        pix = page.get_pixmap(dpi=DPI)
        output_file = os.path.join(output_dir, f'page_{page_number + 1:02}.png')
        pix.save(output_file)
    pdf_document.close()


pdf_path = "/content/2407.01449v3.pdf"
output_dir = "output_images"
convert_pdf_to_images(pdf_path, output_dir)

Images will be saved in the "output_images" folder.
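When choosing a DPI value, it helps to know roughly how large the rendered images will be. PDF page sizes are measured in points (1/72 inch), so the pixel size is approximately the point size multiplied by DPI / 72. Here is a small sketch of that arithmetic, assuming an A4 page of 595 × 842 points. The paper's actual page size may differ, pixel_size is a hypothetical helper, and PyMuPDF's own rounding may be off by a pixel from this estimate.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Approximate rendered size in pixels for a page measured in PDF points."""
    scale = dpi / 72  # 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)


# Assumed A4 page size in points (illustrative only)
A4_WIDTH_PT, A4_HEIGHT_PT = 595, 842

print(pixel_size(A4_WIDTH_PT, A4_HEIGHT_PT, 350))  # (2892, 4093)
print(pixel_size(A4_WIDTH_PT, A4_HEIGHT_PT, 72))   # (595, 842)
```

Since the search code in this article resizes every page to 512 × 512 before encoding, a lower DPI would likely be sufficient and would save disk space.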

Searching the Images

Now, let’s use ColQwen2. Refer to the model’s Hugging Face page for sample code.

After downloading the paper PDF, uploading it to Google Colab, and converting it to images as above, execute the following code:

import glob
import os

import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"cuda available: {torch.cuda.is_available()}")

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.bfloat16,
    device_map=device,  # or "mps" if on Apple Silicon
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Your inputs (sorted so that list index i corresponds to page i + 1)
image_paths = sorted(glob.glob(os.path.join(output_dir, "*.png")))
images = [Image.open(filepath) for filepath in image_paths]
queries = [
    "What is the architecture of ColPali?",
    "How does it differ from previous studies?",
]

# Process the queries (images are processed batch by batch below,
# so they are not all loaded onto the GPU at once)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass over the images
batch_size = 1  # Reduced batch size to 1
image_embeddings = []
for i in range(0, len(images), batch_size):
    batch = images[i : i + batch_size]
    resized_batch = [img.resize((512, 512)) for img in batch]  # Resize before processing
    batch_images = processor.process_images(resized_batch).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch_images)
    image_embeddings.extend(embeddings)

# Forward pass over the queries
with torch.no_grad():
    query_embeddings = model(**batch_queries)

# Stacking works because every page was resized to the same dimensions
image_embeddings = torch.stack(image_embeddings)
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)

The scores are returned as a 2-D tensor with one row per query and one column per page image:

tensor([[13.2500,  8.4375, 11.3750, 11.1875, 13.8125, 12.0000,  8.3125,  9.0000,
         10.4375,  8.7500, 10.4375, 11.6250,  7.8438,  7.4375,  9.9375,  8.0625,
          7.5000, 10.9375,  9.7500,  7.8750],
        [ 8.3750,  7.5000,  9.6250,  8.3125,  7.5625,  8.1250,  7.9688,  8.4375,
          8.5000,  9.0625,  7.7812,  8.3125,  7.5000,  7.9062,  8.6875,  7.9688,
          7.9062,  7.9688,  8.7500,  7.5000]])
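To read this matrix, rank the columns within each row. The sketch below does this in plain Python using the values copied from the output above, assuming column i corresponds to page i + 1 (which holds only if the image list is in page order). The helper name top_pages is made up for this example.

```python
# Scores copied from the output above (rows: queries, columns: pages 1-20)
scores = [
    [13.2500, 8.4375, 11.3750, 11.1875, 13.8125, 12.0000, 8.3125, 9.0000,
     10.4375, 8.7500, 10.4375, 11.6250, 7.8438, 7.4375, 9.9375, 8.0625,
     7.5000, 10.9375, 9.7500, 7.8750],
    [8.3750, 7.5000, 9.6250, 8.3125, 7.5625, 8.1250, 7.9688, 8.4375,
     8.5000, 9.0625, 7.7812, 8.3125, 7.5000, 7.9062, 8.6875, 7.9688,
     7.9062, 7.9688, 8.7500, 7.5000],
]


def top_pages(score_rows, k=2):
    """Return, for each query row, the 1-based page numbers of the k highest scores."""
    ranked = []
    for row in score_rows:
        # Column indices sorted by descending score
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        ranked.append([i + 1 for i in order[:k]])
    return ranked


print(top_pages(scores))  # [[5, 1], [3, 10]]
```

With the tensor itself, torch.topk along dim=1 gives the same ranking without copying values by hand.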

Visualizing Scores

Let’s visualize the scores:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

scores_df = pd.DataFrame(
    scores.cpu().numpy(),
    columns=[f'Image {i+1}' for i in range(scores.shape[1])],
).T
scores_df.index.name = 'Images'

# Create two separate bar plots side-by-side
plt.figure(figsize=(12, 6))

# First bar plot
plt.subplot(1, 2, 1)
sns.barplot(x=scores_df.index, y=scores_df[0], color="skyblue")
plt.title("Query: " + queries[0])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

# Second bar plot
plt.subplot(1, 2, 2)
sns.barplot(x=scores_df.index, y=scores_df[1], color="lightcoral")
plt.title("Query: " + queries[1])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

plt.tight_layout()
plt.show()

Inspecting the Top Results

Let’s check the top 2 results:

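from IPython.display import display

# Indices of the two highest-scoring pages for each query
highest_score_indices = scores.topk(2, dim=1).indices

for query, indices in zip(queries, highest_score_indices.tolist()):
    for rank, high_idx in enumerate(indices, start=1):
        print(f"{query} (rank {rank}): page {high_idx + 1}")
        # Display the image (file names are zero-padded to two digits)
        image_path = os.path.join(output_dir, f"page_{high_idx + 1:02}.png")
        display(Image.open(image_path))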

Query 1: What is the architecture of ColPali?

Top 2 results:

1st: Page 5

2nd: Page 1

Page 5, which scored highest, contains the word “Architecture.” However, page 2, which contains the architecture diagram, received a lower score.

Query 2: How does it differ from previous studies?

Top 2 results:

1st: Page 3

2nd: Page 10

Page 3 contains part of the “Related Work” section, but page 2, where that section begins, scored lower. Page 10, which includes references, also scored relatively high, as expected.

Conclusion

We tested image search using ColQwen2. Searching entire PDF pages proved challenging; for practical use, extracting figures as standalone images might improve results.

To extract text, images, and tables more effectively from PDFs, consider tools like pymupdf4llm.