
Commit eaa3d4d

Vision docs 📝 (#42096)
* add mask generation fine-tuning docs
* initial commit
* update video text to text
* fix autoprocessor
* bump model, update API
* add torch.compile
* Add results
* Update docs/source/en/tasks/image_text_to_text.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/image_text_to_text.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/image_text_to_text.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/mask_generation.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/mask_generation.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/mask_generation.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/mask_generation.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/video_text_to_text.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/mask_generation.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/mask_generation.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/mask_generation.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/video_text_to_text.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/video_text_to_text.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/image_text_to_text.md (Co-authored-by: Steven Liu)
* Update docs/source/en/tasks/mask_generation.md (Co-authored-by: Steven Liu)
* Update image_text_to_text.md
* Update docs/source/en/tasks/video_text_to_text.md (Co-authored-by: Steven Liu)

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent 8c84144 commit eaa3d4d

File tree: 4 files changed, +475 -115 lines

docs/source/en/tasks/image_text_to_text.md

Lines changed: 48 additions & 21 deletions
@@ -33,7 +33,8 @@ This guide focuses on inference with an instruction-tuned model.
 Let's begin installing the dependencies.
 
 ```bash
-pip install -q transformers accelerate flash_attn
+pip install -q transformers accelerate
+pip install flash-attn --no-build-isolation
 ```
 
 Let's initialize the model and the processor.
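The diff splits flash-attn into its own `pip install flash-attn --no-build-isolation` step because it builds a CUDA extension that is not available everywhere. A minimal sketch (not part of the diff) of probing for the package and falling back to PyTorch's built-in SDPA attention when it is missing:

```python
import importlib.util

# flash-attn is an optional CUDA extension; fall back to PyTorch's
# scaled-dot-product attention when it is not installed.
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
print(f"attn_implementation={attn_impl}")
```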
@@ -45,12 +46,12 @@ import torch
 
 device = Accelerator().device
 model = AutoModelForImageTextToText.from_pretrained(
-    "HuggingFaceM4/idefics2-8b",
+    "Qwen/Qwen3-VL-4B-Instruct",
     dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
 ).to(device)
 
-processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
+processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")
 ```
 
 This model has a [chat template](./chat_templating) that helps user parse chat outputs. Moreover, the model can also accept multiple images as input in a single conversation or message. We will now prepare the inputs.
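The context line above mentions the model's chat template. A quick way to see what the template actually renders, before tokenization, is to call `apply_chat_template` with `tokenize=False` (a sketch assuming the `processor` and `messages` objects defined in the surrounding snippets; the exact special tokens depend on the checkpoint):

```python
# Render the conversation to a prompt string instead of token IDs, to see
# how the template inserts role markers and image placeholders.
prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt_text)
```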
@@ -65,24 +66,29 @@ The image inputs look like the following.
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg" alt="A bee on a pink flower"/>
 </div>
 
-```python
-from PIL import Image
-import requests
 
-img_urls =["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
-           "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"]
-images = [Image.open(requests.get(img_urls[0], stream=True).raw),
-          Image.open(requests.get(img_urls[1], stream=True).raw)]
+Structure your conversation as shown below for a single prompt with image and text inputs.
+
+```python
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"},
+            {"type": "text", "text": "What do we see in this image?"},
+        ]
+    }
+]
 ```
 
-Below is an example of the chat template. We can feed conversation turns and the last message as an input by appending it at the end of the template.
+Alternate between the `user` and `assistant` role to ground the model with prior context to generate better responses.
 
 ```python
 messages = [
     {
         "role": "user",
         "content": [
-            {"type": "image"},
+            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"},
             {"type": "text", "text": "What do we see in this image?"},
         ]
     },
@@ -95,7 +101,7 @@ messages = [
     {
         "role": "user",
         "content": [
-            {"type": "image"},
+            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
             {"type": "text", "text": "And how about this image?"},
         ]
     },
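The two hunks above only touch the `user` turns; the `assistant` turn between them falls outside the diff context. For orientation, a sketch of how that intermediate message is typically structured (the reply text is borrowed from the example output shown later in the diff; the exact wording in the file is not visible here):

```python
# Assistant turns follow the same message schema as user turns,
# with a single text entry carrying the model's earlier reply.
assistant_turn = {
    "role": "assistant",
    "content": [
        {"type": "text", "text": "In this image we can see two cats on the nets."},
    ],
}
```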
@@ -105,19 +111,20 @@ messages = [
 We will now call the processors' [`~ProcessorMixin.apply_chat_template`] method to preprocess its output along with the image inputs.
 
 ```python
-prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
-inputs = processor(text=prompt, images=[images[0], images[1]], return_tensors="pt").to(device)
+inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(device)
 ```
 
 We can now pass the preprocessed inputs to the model.
 
 ```python
+input_len = len(inputs.input_ids[0])
+
 with torch.no_grad():
-    generated_ids = model.generate(**inputs, max_new_tokens=500)
-    generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
+    generated_ids = model.generate(**inputs, max_new_tokens=200)
+    generated_texts = processor.batch_decode(generated_ids[:, input_len:], skip_special_tokens=True)
 
 print(generated_texts)
-## ['User: What do we see in this image? \nAssistant: In this image we can see two cats on the nets. \nUser: And how about this image? \nAssistant: In this image we can see flowers, plants and insect.']
+## ['In this image we can see flowers, plants and insect.']
 ```
 
 ## Pipeline
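The `## Pipeline` heading that closes this hunk belongs to a section the commit leaves untouched. For context, a minimal sketch of the same task through the high-level `pipeline` API (assuming the model id the diff switches to; the output structure can vary across transformers versions):

```python
from transformers import pipeline

# The image-text-to-text pipeline bundles processor, model, and generation.
pipe = pipeline("image-text-to-text", model="Qwen/Qwen3-VL-4B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    }
]
outputs = pipe(text=messages, max_new_tokens=50)
print(outputs[0]["generated_text"][-1]["text"])  # last message is the assistant reply
```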
@@ -289,19 +296,38 @@ VLMs are often large and need to be optimized to fit on smaller hardware. Transf
 First, install dependencies.
 
 ```bash
-pip install -U quanto bitsandbytes
+pip install -U optimum-quanto bitsandbytes
 ```
 
-To quantize a model during loading, we need to first create [`QuantoConfig`]. Then load the model as usual, but pass `quantization_config` during model initialization.
+To quantize a model during loading, we need to first create [`QuantoConfig`]. Then load the model as usual, but pass `quantization_config` during model initialization.
 
 ```python
 from transformers import AutoModelForImageTextToText, QuantoConfig
 
-model_id = "HuggingFaceM4/idefics2-8b"
+model_id = "Qwen/Qwen3-VL-4B-Instruct"
 quantization_config = QuantoConfig(weights="int8")
 quantized_model = AutoModelForImageTextToText.from_pretrained(
     model_id, device_map="auto", quantization_config=quantization_config
 )
+
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"},
+            {"type": "text", "text": "What do we see in this image?"},
+        ]
+    },
+]
+inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
+input_len = len(inputs.input_ids[0])
+
+with torch.no_grad():
+    generated_ids = model.generate(**inputs, cache_implementation="static", max_new_tokens=100)
+    generated_texts = processor.batch_decode(generated_ids[:, input_len:], skip_special_tokens=True)
+
+print(generated_texts[0])
+## ['In this image, we see two tabby cats resting on a large, tangled pile of fishing nets. The nets are a mix of brown, orange, and red colors, with some blue and green ropes visible in the background. The cats appear relaxed and comfortable, nestled into the fibers of the nets. One cat is in the foreground, looking slightly to the side, while the other is positioned further back, looking directly at the camera. The scene suggests a coastal or fishing-related setting, possibly near']
 ```
 
 And that's it, we can use the model the same way with no changes.
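The install line adds `bitsandbytes` next to `optimum-quanto`, but only `QuantoConfig` is exercised in the snippet. A hedged sketch of the bitsandbytes route, which swaps in `BitsAndBytesConfig` and otherwise loads the model the same way (the 4-bit NF4 settings here are illustrative defaults, not taken from the diff):

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# 4-bit NF4 weights with bfloat16 compute; cuts weight memory roughly 4x vs fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
quantized_model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct",
    device_map="auto",
    quantization_config=bnb_config,
)
```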
@@ -312,3 +338,4 @@ Here are some more resources for the image-text-to-text task.
 
 - [Image-text-to-text task page](https://huggingface.co/tasks/image-text-to-text) covers model types, use cases, datasets, and more.
 - [Vision Language Models Explained](https://huggingface.co/blog/vlms) is a blog post that covers everything about vision language models and supervised fine-tuning using [TRL](https://huggingface.co/docs/trl/en/index).
+- [Learn how to fine-tune vision language models using TRL](https://huggingface.co/blog/trl-vlm-alignment)
