
Commit 88a146c

czczup and whai362 authored

Release InternVL-Chat-V1.2 (#45)

* Add chat templates
* Update to llama2 flash attention
* Add zero3 deepspeed config
* Support DeepSpeed zero3
* Fix template bug
* Support internlm2
* Rename V1.1 to V1-1
* Update README.md
* Support device_map='auto'
* Compatible with transformers 4.36.2
* Add Hermes-2 template
* Support MMVP
* Support MathVista
* Clean code
* Update MMMU
* Update select_layer to save GPU memory
* Don't use beam search when model is too large
* Fix wrong calculation of total params
* Update trainer
* Add json2jsonl tool
* Update README.md
* Update
* Add shell scripts
* Rename
* Add shell
* Update
* Update BLOG.md
* Fix bug and support loading pretrained mlp
* Update README.md
* Update README.md
* Update
* Update BLOG.md
* Update README.md
* Update SEED

---------

Co-authored-by: Wenhai Wang <2294278735@qq.com>

1 parent ac6e5c9 · commit 88a146c


53 files changed (+1762, -576 lines)

BLOG.md (new file: 51 additions & 0 deletions)
# Blog

## InternVL-Chat-V1.2

> Date: 2024/02/12<br>
> Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang

In January 2024, we released [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1), featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM. In that version, we explored increasing the resolution to 448x448, enhancing OCR capabilities, and improving support for Chinese conversations. However, it still lagged behind the existing SOTA on some benchmarks.

<img width="600" alt="image" src="https://github.com/czczup/InternVL-MoE/assets/23737120/9b68aa35-40fd-4e81-9595-d404cbbfc6bd">

Today, we are excited to introduce InternVL-Chat-V1.2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model. From the experimental results, **we've observed that a stronger language model (34B) can better leverage the powerful capabilities of our vision foundation model ([InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)).**

For better training reproducibility, we follow a minimalist design and data-efficient recipe similar to LLaVA-NeXT. To reduce training costs, we provide a pre-trained MLP projector and only employ around 1 million visual instruction tuning samples for SFT. Our model has a total of 40 billion parameters and can be trained within 1.5 days using 32 A100 GPUs. The code, data, and model will be made publicly available.
### Data Preparation

Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1.2, utilizing approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. At a high level, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.

For more details about data preparation, please see [here](./internvl_chat#prepare-training-datasets).
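As a concrete illustration of assembling such a mixture (the commit adds a `json2jsonl` tool, but the file names and the LLaVA-style `image`/`conversations` schema below are assumptions, not the repo's actual layout), a minimal merging sketch:

```python
# Hedged sketch: merge several LLaVA-style instruction-tuning annotation files
# into one training JSONL. Paths and the sample schema are assumed, not the
# repository's actual file layout.
import json
from pathlib import Path

sources = [
    "sharegpt4v.json",   # hypothetical file names, one per dataset
    "dvqa.json",
    "chartqa.json",
    "ai2d.json",
    "docvqa.json",
    "geoqa_plus.json",
    "synthdog_en.json",
]

with open("sft_mixture.jsonl", "w", encoding="utf-8") as out:
    for src in sources:
        samples = json.loads(Path(src).read_text(encoding="utf-8"))
        for sample in samples:  # each sample assumed to hold "image" and "conversations"
            out.write(json.dumps(sample, ensure_ascii=False) + "\n")
```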
### Performance

\* Proprietary Model

| name | image size | MMMU<br>(val) | MMMU<br>(test) | MathVista<br>(testmini) | MMB<br>(test) | MMB−CN<br>(test) | MMVP | MME | ScienceQA<br>(image) | POPE | TextVQA | SEEDv1<br>(image) | VizWiz<br>(test) | GQA<br>(test) |
| ------------------ | ---------- | ------------- | -------------- | ----------------------- | ------------- | ---------------- | ---- | -------- | -------------------- | ---- | ------- | ----------------- | ---------------- | ------------- |
| GPT-4V\* | unknown | 56.8 | 55.7 | 49.9 | 77.0 | 74.4 | 38.7 | 1409/517 | - | - | 78.0 | 71.6 | - | - |
| Gemini Ultra\* | unknown | 59.4 | - | 53.0 | - | - | - | - | - | - | 82.3 | - | - | - |
| Gemini Pro\* | unknown | 47.9 | - | 45.2 | 73.6 | 74.3 | 40.7 | 1497/437 | - | - | 74.6 | 70.7 | - | - |
| Qwen-VL-Plus\* | unknown | 45.2 | 40.8 | 43.3 | 67.0 | 70.7 | - | 1681/502 | - | - | 78.9 | 65.7 | - | - |
| Qwen-VL-Max\* | unknown | 51.4 | 46.8 | 51.0 | 77.6 | 75.7 | - | - | - | - | 79.5 | - | - | - |
| | | | | | | | | | | | | | | |
| LLaVA-NeXT-34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
| InternVL-Chat-V1.2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1672/509 | 83.3 | 88.0 | 69.7 | 75.6 | 60.0 | 64.0 |

- MMBench results are collected from the [leaderboard](https://mmbench.opencompass.org.cn/leaderboard).
- In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.
### Training (SFT)

We provide [slurm scripts](./internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node, multi-GPU training. You can use either 32 or 64 GPUs to train this model. With 64 GPUs, training takes approximately 18 hours.

For more details about training, please see [here](./internvl_chat#start-training).

The hyperparameters used for fine-tuning are listed in the following table.

| Model | Trainable Param | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
| ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
| InternVL-Chat-V1.2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
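To make these numbers concrete, here is a minimal sketch, assuming HuggingFace `TrainingArguments` and an illustrative per-device batch size, of how the global batch size of 512 decomposes across GPUs. It is not the repo's actual launcher, and the 2048 max length would be enforced by the tokenizer/dataset rather than by these arguments.

```python
# Hedged sketch only: per_device_batch_size, output_dir, and bf16 are assumptions,
# not values confirmed by the blog.
from transformers import TrainingArguments

num_gpus = 32                    # 32 or 64 GPUs, per the training section above
per_device_batch_size = 4        # assumed, not stated in the blog
grad_accum = 512 // (num_gpus * per_device_batch_size)  # reach the global batch size of 512

args = TrainingArguments(
    output_dir="work_dirs/internvl_chat_v1_2",   # placeholder path
    per_device_train_batch_size=per_device_batch_size,
    gradient_accumulation_steps=grad_accum,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.05,
    bf16=True,                   # assumption; mixed precision is typical at 40B scale
)
```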

INSTALLATION.md (6 additions & 4 deletions)

@@ -23,9 +23,9 @@
  pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
  ```

- - Install `flash-attn==0.2.8` :
+ - Install `flash-attn==0.2.8` or `flash-attn==2.3.6`:

- If you want to fully replicate my results, please install `v0.2.8`, otherwise install the latest version.
+ If you want to fully replicate my results in the paper, please install `v0.2.8`; otherwise, install `v2.3.6`.

  This is because different versions of flash attention yield slight differences in results.

@@ -44,10 +44,10 @@
  mim install mmcv-full==1.6.2
  ```

- - Install `transformers==4.32.0`:
+ - Install `transformers==4.36.2`:

  ```bash
- pip install transformers==4.32.0
+ pip install transformers==4.36.2
  ```

  - Install `apex` (optional):

@@ -66,4 +66,6 @@

  ```bash
  pip install opencv-python termcolor yacs pyyaml scipy
+ pip install deepspeed==0.10.0
+ pip install pycocoevalcap tqdm
  ```
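As an aside, here is a small hedged sketch (not part of the repository) for checking that the versions pinned above are the ones actually installed in the environment:

```python
# Hedged sketch: report installed vs. expected versions for the packages pinned
# in this installation guide.
from importlib.metadata import version, PackageNotFoundError

expected_versions = {
    "torch": "2.0.1",
    "transformers": "4.36.2",
    "deepspeed": "0.10.0",
    "timm": "0.9.12",
}

for pkg, expected in expected_versions.items():
    try:
        print(f"{pkg}: installed {version(pkg)}, expected {expected}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```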

README.md (37 additions & 30 deletions)

@@ -1,47 +1,39 @@
  # <img width="60" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/8529570/5aa4cda8-b453-40a0-9336-17012b430ae8"> InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks —— An Open-Source Alternative to ViT-22B

+ \[[InternVL-Chat-V1.2 Blog](./BLOG.md)\] \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Quick Start](#quick-start-with-huggingface)\] \[[中文解读](https://mp.weixin.qq.com/s/bdfAJRqOF9tUk8Vy9KC_XQ)\]
+
  ## News🚀🚀🚀

+ - `2024/02/12`: InternVL-Chat-V1.2 has been released, utilizing [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the LLM. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our [blog](BLOG.md) or try our [demo](https://internvl.opengvlab.com/). The model is now available on [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2), and both training/evaluation data and scripts are open-sourced.
  - `2024/02/04`: [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) achieves 44.67% on [MMVP](https://github.com/tsb0601/MMVP), higher than GPT-4V!
  - `2024/01/27`: We release the 448 resolution model, achieving 76.6 on MMBench dev, see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#-evaluation-chinese-models).
  - `2024/01/24`: InternVL-Chat-V1.1 is released; it supports Chinese and has stronger OCR capability, see [here](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) or try our [demo](https://internvl.opengvlab.com/).
  - `2024/01/16`: We release our [customized mmcv/mmsegmentation/mmdetection code](https://github.com/OpenGVLab/InternVL-MMDetSeg), integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

  ## What is InternVL?

- \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Quick Start](#quick-start-with-huggingface)\] \[[中文解读](https://mp.weixin.qq.com/s/bdfAJRqOF9tUk8Vy9KC_XQ)\]
-
  InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.

- It is _**the largest open-source vision/vision-language foundation model (14B)**_ to date, achieving _**32 state-of-the-art**_ performances on a wide range of tasks such as visual perception, cross-modal retrieval, multimodal dialogue, etc.
-
- <img width="1204" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/23737120/47878df8-2aec-446e-8a58-00640a2e1327">
-
- [14 Papers with Code SOTA badge links]
-
  ## Model Zoo

- | Model | Date | Download | Note |
- | ------------------ | ---------- | ------------------------------------------------------------------------------ | -------------------------------- |
- | InternViT-6B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px) | vision foundation model |
- | InternVL-14B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px) | vision-language foundation model |
- | InternVL-Chat-13B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B) | English multimodal dialogue |
- | InternVL-Chat-19B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B) | English multimodal dialogue |
- | InternVL-Chat-V1.1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | support Chinese and stronger OCR |
- | InternViT-6B-448px | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px) | 448 resolution |
+ **Vision-Language Foundation Model**
+
+ | Model | Date | Download | Note |
+ | ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
+ | InternViT-6B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px) | vision foundation model |
+ | InternVL-14B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px) | vision-language foundation model |
+ | InternViT-6B-448px | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px) | 448 resolution |
+ | InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution (🔥new) |
+
+ **Vision Large Language Model**
+
+ | Model | Date | Download | Note |
+ | ----------------------- | ---------- | ------------------------------------------------------------------------------------ | -------------------------------- |
+ | InternVL-Chat-13B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B) | English multimodal dialogue |
+ | InternVL-Chat-19B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B) | English multimodal dialogue |
+ | InternVL-Chat-19B-448px | 2024.02.03 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B-448px) | 448 resolution |
+ | InternVL-Chat-V1.1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | support Chinese and stronger OCR |
+ | InternVL-Chat-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | scaling up LLM to 34B (🔥new) |

  ## What can InternVL do?

@@ -174,6 +166,22 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
      <td>82.7</td>
      <td>85.1</td>
    </tr>
+   <tr align=center>
+     <td align=left>EVA-CLIP-8B</td>
+     <td>95.6</td>
+     <td>99.6</td>
+     <td>99.9</td>
+     <td>80.8</td>
+     <td>95.5</td>
+     <td>97.6</td>
+     <td>70.3</td>
+     <td>89.3</td>
+     <td>93.9</td>
+     <td>53.0</td>
+     <td>76.0</td>
+     <td>83.4</td>
+     <td>86.2</td>
+   </tr>
    <tr align=center>
      <td align=left>InternVL-C (ours)</td>
      <td>94.7</td>
@@ -396,7 +404,6 @@ pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
  pixel_values = pixel_values.to(torch.bfloat16).cuda()

  outputs = model(pixel_values)
-
  ```

  </details>
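For context around the fragment above, here is a hedged sketch of a full forward pass. The model id comes from the Model Zoo table, while `trust_remote_code`, `device_map="auto"` (mentioned in the commit message), and the image path are assumptions rather than the repository's confirmed quick-start code.

```python
# Hedged sketch of loading an InternViT checkpoint and running the fragment above.
# Model id, processor choice, and flags are assumptions, not confirmed repo code.
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-6B-448px",   # id taken from the Model Zoo table
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,           # assumed: InternVL ships custom modeling code
    device_map="auto",                # assumed, per "Support device_map='auto'" in the commit
).eval()

image_processor = CLIPImageProcessor.from_pretrained("OpenGVLab/InternViT-6B-448px")

image = Image.open("example.jpg").convert("RGB")   # placeholder image path
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)
```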

classification/README.md (2 additions & 61 deletions)

@@ -10,66 +10,7 @@ InternViT-6B follows the structure of vanilla ViT, and its hyperparameters are l
  ## 🛠️ Installation

- > If you have already installed the environment as per the instructions in other folders, you can skip this section.
-
- - Clone this repository:
-
-   ```bash
-   git clone https://github.com/OpenGVLab/InternVL.git
-   cd InternVL/classification
-   ```
-
- - Create a conda virtual environment and activate it:
-
-   ```bash
-   conda create -n internvl python=3.9 -y
-   conda activate internvl
-   ```
-
- - Install `PyTorch>=2.0` and `torchvision>=0.15.2` with `CUDA>=11.6`:
-
-   For examples, to install `torch==2.0.1` with `CUDA==11.8`:
-
-   ```bash
-   conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
-   # or
-   pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
-   ```
-
- - Install `flash-attn==0.2.8` :
-
-   If you want to fully replicate my results, please install `v0.2.8`, otherwise install the latest version.
-
-   This is because different versions of flash attention yield slight differences in results.
-
-   ```bash
-   git clone https://github.com/Dao-AILab/flash-attention.git
-   cd flash-attention
-   git checkout v0.2.8
-   python setup.py install
-   ```
-
- - Install `timm==0.9.12` and `mmcv-full==1.6.2`:
-
-   ```bash
-   pip install -U openmim
-   pip install timm==0.9.12
-   mim install mmcv-full==1.6.2
-   ```
-
- - Install `apex`:
-
-   ```bash
-   git clone https://github.com/NVIDIA/apex.git
-   git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 # https://github.com/NVIDIA/apex/issues/1735
-   pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
-   ```
-
- - Install other requirements:
-
-   ```bash
-   pip install opencv-python termcolor yacs pyyaml scipy
-   ```
+ See [INSTALLATION.md](../INSTALLATION.md)

  ## 📦 Data Preparation

@@ -150,7 +91,7 @@ pretrained

  > Note, please install apex before training (see installation guide above for details).

- To train a linear classifier for `InternViT-6b` on ImageNet with 8 GPUs, run:
+ To train a linear classifier for `InternViT-6B` on ImageNet with 8 GPUs, run:

  ```bash
  python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --cfg configs/intern_vit_6b_1k_224.yaml
