In this article, we walk through how to use ColQwen2.
ColQwen2 is based on Qwen2-VL-2B and generates ColBERT-style multi-vector representations, enabling highly accurate searches across text and image inputs.
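As background, ColBERT-style late interaction scores a query against a page by comparing every query-token embedding with every image-patch embedding and summing the per-token maxima (MaxSim). Here is a minimal sketch of that scoring with dummy embeddings (the dimensions are illustrative, not ColQwen2’s actual ones, and normalization details are ignored):

```python
import torch

# Dummy multi-vector embeddings: one matrix of shape (num_tokens, dim) per side
query_embedding = torch.randn(8, 128)    # 8 query tokens
page_embedding = torch.randn(700, 128)   # 700 image patches

# MaxSim: for each query token, take its best-matching patch, then sum
similarity = query_embedding @ page_embedding.T  # (8, 700)
score = similarity.max(dim=1).values.sum()
print(score)
```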
We will test ColQwen2 using Google Colab with an A100 GPU.
Library Installation
First, install the necessary libraries:
```
!pip install git+https://github.com/illuin-tech/colpali
!pip install pymupdf
```
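If the installation succeeded, the classes used later in this article should import cleanly; a quick sanity check:

```python
from colpali_engine.models import ColQwen2, ColQwen2Processor
print("colpali_engine imports OK")
```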
Preparing Image Data
Next, prepare the image data. For this tutorial, we’ll use the ColPali paper.
Using `pymupdf`, we’ll extract images from the PDF file:
```python
import pymupdf
import os

# Constants
DPI = 350  # Can be modified as needed

def convert_pdf_to_images(pdf_path, output_dir):
    """
    Convert PDF pages to images.

    Args:
        pdf_path (str): Path to the PDF file.
        output_dir (str): Directory to save images.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    pdf_document = pymupdf.open(pdf_path)
    for page_number in range(pdf_document.page_count):
        page = pdf_document[page_number]
        pix = page.get_pixmap(dpi=DPI)
        output_file = os.path.join(output_dir, f'page_{page_number + 1:02}.png')
        pix.save(output_file)
    pdf_document.close()

pdf_path = "/content/2407.01449v3.pdf"
output_dir = "output_images"
convert_pdf_to_images(pdf_path, output_dir)
```
Images will be saved in the "output_images" folder.
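To sanity-check the conversion, you can list the generated files; the ColPali paper should yield one PNG per page:

```python
import os

page_files = sorted(os.listdir(output_dir))
print(f"{len(page_files)} pages converted")
print(page_files[:3])  # e.g. ['page_01.png', 'page_02.png', 'page_03.png']
```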
Searching the Images
Now, let’s use ColQwen2. Refer to the Hugging Face model page for sample code.
After downloading and uploading the paper PDF to Google Colab, execute the following code:
```python
import glob, os

import torch
from PIL import Image

from colpali_engine.models import ColQwen2, ColQwen2Processor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"cuda available: {torch.cuda.is_available()}")

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.bfloat16,
    device_map=device,  # or "mps" if on Apple Silicon
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Your inputs (sorted so list index i corresponds to page i + 1)
image_paths = sorted(glob.glob(os.path.join(output_dir, "*.png")))
images = [Image.open(filepath) for filepath in image_paths]
queries = [
    "What is the architecture of ColPali?",
    "How does it differ from previous studies?",
]

# Process the queries
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass over the images, one page at a time to limit GPU memory use
batch_size = 1
image_embeddings = []
for i in range(0, len(images), batch_size):
    batch = images[i : i + batch_size]
    resized_batch = [img.resize((512, 512)) for img in batch]  # Resize before processing
    batch_images = processor.process_images(resized_batch).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch_images)
    image_embeddings.extend(embeddings)

with torch.no_grad():
    query_embeddings = model(**batch_queries)

image_embeddings = torch.stack(image_embeddings)
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```
The scores are returned as a matrix with one row per query and one column per page:
```
tensor([[13.2500,  8.4375, 11.3750, 11.1875, 13.8125, 12.0000,  8.3125,  9.0000,
         10.4375,  8.7500, 10.4375, 11.6250,  7.8438,  7.4375,  9.9375,  8.0625,
          7.5000, 10.9375,  9.7500,  7.8750],
        [ 8.3750,  7.5000,  9.6250,  8.3125,  7.5625,  8.1250,  7.9688,  8.4375,
          8.5000,  9.0625,  7.7812,  8.3125,  7.5000,  7.9062,  8.6875,  7.9688,
          7.9062,  7.9688,  8.7500,  7.5000]])
```
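From this matrix you can derive a full per-query ranking of the pages. A minimal sketch for the first query (pages shown 1-indexed):

```python
# Rank all pages for the first query, best match first
ranking = scores[0].float().argsort(descending=True) + 1
print(ranking.tolist())
```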
Visualizing Scores
Let’s visualize the scores:
```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Cast to float32 first: numpy cannot represent bfloat16
scores_df = pd.DataFrame(
    scores.float().cpu().numpy(),
    columns=[f'Image {i+1}' for i in range(scores.shape[1])],
).T
scores_df.index.name = 'Images'

# Create two separate bar plots side-by-side
plt.figure(figsize=(12, 6))

# First bar plot
plt.subplot(1, 2, 1)
sns.barplot(x=scores_df.index, y=scores_df[0], color="skyblue")
plt.title("Query: " + queries[0])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

# Second bar plot
plt.subplot(1, 2, 2)
sns.barplot(x=scores_df.index, y=scores_df[1], color="lightcoral")
plt.title("Query: " + queries[1])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

plt.tight_layout()
plt.show()
```
Inspecting the Top Results
Let’s check the top 2 results:
```python
from IPython.display import display

# Indices of the top-2 scoring pages per query
_, highest_score_indices = scores.topk(2, dim=1)

for query, high_idxs in zip(queries, highest_score_indices.tolist()):
    print(f"{query}: pages {[idx + 1 for idx in high_idxs]}")
    for high_idx in high_idxs:
        # Display the image (file names are zero-padded, e.g. page_05.png)
        image_path = os.path.join(output_dir, f"page_{high_idx + 1:02}.png")
        display(Image.open(image_path))
```
Query 1: What is the architecture of ColPali?
Top 2 results:
1st: Page 5
2nd: Page 1
Page 5, which received the highest score, includes the word “Architecture.” However, the architecture diagram on page 2 received a lower score.
Query 2: How does it differ from previous studies?
Top 2 results:
1st: Page 3
2nd: Page 10
Page 3 contains the “Related Work” section, although the beginning of that section on page 2 scored lower. Page 10, which contains the references, also scored high; that is understandable, since the references list previous studies.
Conclusion
We tested image search using ColQwen2. Searching entire PDF pages proved challenging; for practical use, extracting figures as standalone images might improve results.
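As a rough sketch of that idea, `pymupdf` can pull raster images embedded in a PDF out directly (this assumes the figures are stored as embedded images, which is not always the case for vector graphics):

```python
import os
import pymupdf

def extract_embedded_images(pdf_path, output_dir):
    """Save each raster image embedded in the PDF as a standalone file."""
    os.makedirs(output_dir, exist_ok=True)
    pdf_document = pymupdf.open(pdf_path)
    for page_number in range(pdf_document.page_count):
        for img_index, img in enumerate(pdf_document[page_number].get_images(full=True)):
            xref = img[0]  # cross-reference number of the image object
            info = pdf_document.extract_image(xref)
            output_file = os.path.join(
                output_dir, f"page_{page_number + 1:02}_img_{img_index}.{info['ext']}"
            )
            with open(output_file, "wb") as f:
                f.write(info["image"])
    pdf_document.close()

extract_embedded_images("/content/2407.01449v3.pdf", "figure_images")
```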
To extract text, images, and tables more effectively from PDFs, consider tools like pymupdf4llm.