|
# <img width="60" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/8529570/5aa4cda8-b453-40a0-9336-17012b430ae8"> InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks —— An Open-Source Alternative to ViT-22B
|
+\[[InternVL-Chat-V1.2 Blog](./BLOG.md)\] \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Quick Start](#quick-start-with-huggingface)\] \[[中文解读](https://mp.weixin.qq.com/s/bdfAJRqOF9tUk8Vy9KC_XQ)\]
+
## News🚀🚀🚀
|
+- `2024/02/12`: InternVL-Chat-V1.2 has been released, using [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the LLM. It achieves 51.6 on the MMMU val set and 82.3 on the MMBench test set. For more details, please refer to our [blog](BLOG.md) or try our [demo](https://internvl.opengvlab.com/). The model is now available on [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2), and both the training/evaluation data and scripts are open-sourced.
- `2024/02/04`: [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) achieves 44.67% on [MMVP](https://github.com/tsb0601/MMVP), higher than GPT-4V!
- `2024/01/27`: We release the 448-resolution model, which achieves 76.6 on the MMBench dev set; see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#-evaluation-chinese-models).
- `2024/01/24`: InternVL-Chat-V1.1 is released. It supports Chinese and has stronger OCR capability; see [here](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) or try our [demo](https://internvl.opengvlab.com/).
- `2024/01/16`: We release our [customized mmcv/mmsegmentation/mmdetection code](https://github.com/OpenGVLab/InternVL-MMDetSeg), integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.
|
## What is InternVL?
|
-\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Quick Start](#quick-start-with-huggingface)\] \[[中文解读](https://mp.weixin.qq.com/s/bdfAJRqOF9tUk8Vy9KC_XQ)\]
-
InternVL scales up the ViT to _**6B parameters**_ and aligns it with the LLM.
|
-It is _**the largest open-source vision/vision-language foundation model (14B)**_ to date, achieving _**32 state-of-the-art**_ performances on a wide range of tasks such as visual perception, cross-modal retrieval, multimodal dialogue, etc.
-
-<img width="1204" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/23737120/47878df8-2aec-446e-8a58-00640a2e1327">
-
-[](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-coco-2014?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-coco-cn?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-flickr30k?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/image-to-text-retrieval-on-flickr30k?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-flickr30k-cn?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/image-retrieval-on-flickr30k-cn?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-xtd10?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-cn?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-8?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-6?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-5?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-3?p=internvl-scaling-up-vision-foundation-models)
-[](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-1?p=internvl-scaling-up-vision-foundation-models)
-
## Model Zoo
|
-| Model              | Date       | Download                                                                       | Note                             |
-| ------------------ | ---------- | ------------------------------------------------------------------------------ | -------------------------------- |
-| InternViT-6B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px)              | vision foundation model          |
-| InternVL-14B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px)              | vision-language foundation model |
-| InternVL-Chat-13B  | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B)  | English multimodal dialogue      |
-| InternVL-Chat-19B  | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B) | English multimodal dialogue      |
-| InternVL-Chat-V1.1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1)      | support Chinese and stronger OCR |
-| InternViT-6B-448px | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px)              | 448 resolution                   |
+**Vision-Language Foundation Model**
+
+| Model                   | Date       | Download                                                               | Note                             |
+| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
+| InternViT-6B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px)      | vision foundation model          |
+| InternVL-14B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px)      | vision-language foundation model |
+| InternViT-6B-448px      | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px)      | 448 resolution                   |
+| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution (🔥new)           |
+
+**Vision Large Language Model**
+
+| Model                   | Date       | Download                                                                             | Note                              |
+| ----------------------- | ---------- | ------------------------------------------------------------------------------------ | --------------------------------- |
+| InternVL-Chat-13B       | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B)        | English multimodal dialogue       |
+| InternVL-Chat-19B       | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B)       | English multimodal dialogue       |
+| InternVL-Chat-19B-448px | 2024.02.03 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B-448px) | 448 resolution                    |
+| InternVL-Chat-V1.1      | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1)            | supports Chinese and stronger OCR |
+| InternVL-Chat-V1.2      | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2)            | scaling up LLM to 34B (🔥new)     |
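
For local experiments, any checkpoint in the tables above can be fetched from the HuggingFace Hub. The snippet below is a minimal sketch, assuming the `huggingface_hub` package is installed; the repo ID and target directory are examples, so substitute any HF link from the Model Zoo.

```python
from huggingface_hub import snapshot_download

# Download one of the Model Zoo checkpoints to a local folder.
# Replace repo_id with any model listed above,
# e.g. 'OpenGVLab/InternVL-Chat-Chinese-V1-2'.
local_path = snapshot_download(
    repo_id='OpenGVLab/InternViT-6B-448px-V1-2',
    local_dir='./pretrained/InternViT-6B-448px-V1-2',
)
print(f'Checkpoint downloaded to: {local_path}')
```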
|
## What can InternVL do?
|
@@ -174,6 +166,22 @@ It is _**the largest open-source vision/vision-language foundation model (14B)** |
<td>82.7</td>
<td>85.1</td>
</tr>
+ <tr align=center>
+ <td align=left>EVA-CLIP-8B</td>
+ <td>95.6</td>
+ <td>99.6</td>
+ <td>99.9</td>
+ <td>80.8</td>
+ <td>95.5</td>
+ <td>97.6</td>
+ <td>70.3</td>
+ <td>89.3</td>
+ <td>93.9</td>
+ <td>53.0</td>
+ <td>76.0</td>
+ <td>83.4</td>
+ <td>86.2</td>
+ </tr>
<tr align=center>
<td align=left>InternVL-C (ours)</td>
<td>94.7</td>
@@ -396,7 +404,6 @@ pixel_values = image_processor(images=image, return_tensors='pt').pixel_values |
pixel_values = pixel_values.to(torch.bfloat16).cuda()
|
outputs = model(pixel_values)
-
```
|
</details>
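
The hunk above shows only a fragment of the quick-start snippet that this commit touches. For orientation, here is a self-contained sketch of the same forward pass; the loading calls (`AutoModel.from_pretrained` with `trust_remote_code=True`, `CLIPImageProcessor`) and the image path are assumptions rather than lines taken from this commit.

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the InternViT-6B vision encoder and its image processor
# (assumed loading calls; see the model card on HuggingFace).
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

# Preprocess an example image (hypothetical path) and run the forward pass,
# mirroring the lines shown in the diff above.
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    outputs = model(pixel_values)
```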
|