93 changes: 64 additions & 29 deletions README.md
@@ -21,12 +21,14 @@

## 📋 Contents

1. [About](#-about)
2. [Getting Started](#-getting-started)
3. [Model and Benchmark](#-model-and-benchmark)
4. [TODO List](#-todo-list)
1. [About](#topic1)
2. [Getting Started](#topic2)
3. [MMScan API Tutorial](#topic3)
4. [MMScan Benchmark](#topic4)
5. [TODO List](#topic5)

## 🏠 About
<span id='topic1'/>

<!-- ![Teaser](assets/teaser.jpg) -->

@@ -55,7 +57,8 @@ Furthermore, we use this high-quality dataset to train state-of-the-art 3D visua
grounding and LLMs and obtain remarkable performance improvement both on
existing benchmarks and in-the-wild evaluation.

## 🚀 Getting Started:
## 🚀 Getting Started
<span id='topic2'/>

### Installation

@@ -98,6 +101,7 @@ existing benchmarks and in-the-wild evaluation.
Please refer to the [guide](data_preparation/README.md) here.

## 👓 MMScan API Tutorial
<span id='topic3'/>

The **MMScan Toolkit** provides comprehensive tools for dataset handling and model evaluation across its tasks.

@@ -137,39 +141,41 @@ Each dataset item is a dictionary containing key elements:

(1) 3D Modality

- **"ori_pcds"** (tuple\[tensor\]): Raw point cloud data from the `.pth` file.
- **"pcds"** (np.ndarray): Point cloud data, dimensions (\[n_points, 6(xyz+rgb)\]).
- **"instance_labels"** (np.ndarray): Instance IDs for each point.
- **"class_labels"** (np.ndarray): Class IDs for each point.
- **"bboxes"** (dict): Bounding boxes in the scan.
- **"ori_pcds"** (tuple\[tensor\]): Original point cloud data extracted from the .pth file.
- **"pcds"** (np.ndarray): Point cloud data with dimensions [n_points, 6(xyz+rgb)], representing the coordinates and color of each point.
- **"instance_labels"** (np.ndarray): Instance ID assigned to each point in the point cloud.
- **"class_labels"** (np.ndarray): Class IDs assigned to each point in the point cloud.
- **"bboxes"** (dict): Information about bounding boxes within the scan.

(2) Language Modality

- **"sub_class"**: Sample category.
- **"ID"**: Unique sample ID.
- **"scan_id"**: Corresponding scan ID.
- **--------------For Visual Grounding Task**
- **"target_id"** (list\[int\]): IDs of target objects.
- **"text"** (str): Grounding text.
- **"sub_class"**: The sample category of the sample.
- **"ID"**: A unique identifier for the sample.
- **"scan_id"**:Identifier corresponding to the related scan.

*For Visual Grounding Task*
- **"target_id"** (list\[int\]): IDs of target objects.
- **"text"** (str): Text used for grounding.
- **"target"** (list\[str\]): Types of target objects.
- **"anchors"** (list\[str\]): Types of anchor objects.
- **"anchor_ids"** (list\[int\]): IDs of anchor objects.
- **"tokens_positive"** (dict): Position indices of mentioned objects in the text.
- **--------------ForQuestion Answering Task**
- **"question"** (str): The question text.
- **"tokens_positive"** (dict): Indices of positions where mentioned objects appear in the text.

*For Question Answering Task*
- **"question"** (str): The text of the question.
- **"answers"** (list\[str\]): List of possible answers.
- **"object_ids"** (list\[int\]): Object IDs referenced in the question.
- **"object_names"** (list\[str\]): Types of referenced objects.
- **"input_bboxes_id"** (list\[int\]): IDs of input bounding boxes.
- **"input_bboxes"** (list\[np.ndarray\]): Input bounding boxes, 9 DoF.
- **"input_bboxes"** (list\[np.ndarray\]): Input bounding box data, with 9 degrees of freedom.

(3) 2D Modality

- **'img_path'** (str): Path to RGB image.
- **'depth_img_path'** (str): Path to depth image.
- **'intrinsic'** (np.ndarray): Camera intrinsic parameters for RGB images.
- **'depth_intrinsic'** (np.ndarray): Camera intrinsic parameters for depth images.
- **'extrinsic'** (np.ndarray): Camera extrinsic parameters.
- **'img_path'** (str): File path to the RGB image.
- **'depth_img_path'** (str): File path to the depth image.
- **'intrinsic'** (np.ndarray): Intrinsic parameters of the camera for RGB images.
- **'depth_intrinsic'** (np.ndarray): Intrinsic parameters of the camera for depth images.
- **'extrinsic'** (np.ndarray): Extrinsic parameters of the camera.
- **'visible_instance_id'** (list): IDs of visible objects in the image.

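To make the item structure concrete, below is a minimal, hypothetical access pattern. The class name `MMScan`, the constructor arguments, and the task string mirror the tutorial's general design but are assumptions to check against the actual toolkit API.

```python
# Hedged sketch: iterate MMScan items and read the keys described above.
# `MMScan`, its constructor arguments, and the task string are assumptions.
from mmscan import MMScan  # assumed import path

dataset = MMScan(version="v1", split="train", task="MMScan-QA")  # assumed args

item = dataset[0]

# 3D modality
points = item["pcds"]                    # [n_points, 6], xyz + rgb
instance_ids = item["instance_labels"]   # per-point instance IDs

# Language modality (QA task)
question = item["question"]
answers = item["answers"]

# 2D modality
rgb_path = item["img_path"]
extrinsic = item["extrinsic"]            # camera extrinsic matrix

print(points.shape, question, answers[:1], rgb_path)
```
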
### MMScan Evaluator
@@ -182,7 +188,9 @@ For the visual grounding task, our evaluator computes multiple metrics including

- **AP and AR**: These metrics calculate the precision and recall by considering each sample as an individual category.
- **AP_C and AR_C**: These variants group samples of the same subclass into one category and compute precision and recall over each group.
- **gtop-k**: An expanded metric that generalizes the traditional top-k metric, offering insights into broader performance aspects.
- **gTop-k**: A generalized version of the traditional Top-k metric that accommodates samples with multiple target objects (illustrated in the sketch below).

*Note:* Here, AP corresponds to AP<sub>sample</sub> in the paper, and AP_C corresponds to AP<sub>box</sub> in the paper.
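
As a rough illustration of the gTop-k idea, the sketch below checks whether every ground-truth target of a sample is matched among the top k·n ranked predictions (n = number of targets) at an IoU threshold of 0.25. Both the matching rule and the threshold are assumptions made for illustration; the official evaluator's exact definition may differ.

```python
import numpy as np

def gtop_k_hit(ious, k, thr=0.25):
    """Toy gTop-k check for one sample (assumption-based sketch).

    ious: [n_preds, n_targets] IoU matrix, rows ordered by descending
    prediction confidence.
    Returns True if every target is matched by at least one of the
    top k * n_targets predictions at IoU >= thr.
    """
    n_targets = ious.shape[1]
    top = ious[: k * n_targets]              # top k*n ranked predictions
    return bool((top >= thr).any(axis=0).all())

# Example: 2 targets, 4 ranked predictions.
ious = np.array([[0.60, 0.00],
                 [0.10, 0.10],
                 [0.00, 0.50],
                 [0.30, 0.00]])
print(gtop_k_hit(ious, k=1))  # top 2 rows: target 2 missed -> False
print(gtop_k_hit(ious, k=2))  # top 4 rows: both matched -> True
```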

Below is an example of how to utilize the Visual Grounding Evaluator:

@@ -301,11 +309,38 @@ The input structure remains the same as for the question answering evaluator:
]
```
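
For orientation, a typical evaluator loop might look like the sketch below. The class name `VisualGroundingEvaluator` and the `update` / `start_evaluation` methods are assumptions modeled on the tutorial; verify them against the devkit.

```python
# Hedged sketch of a typical evaluation loop; class and method names
# are assumptions based on the tutorial above.
from mmscan import VisualGroundingEvaluator  # assumed import

evaluator = VisualGroundingEvaluator(show_results=True)  # assumed flag

# Placeholder: fill with batches in the input format shown above.
model_outputs = []

for batch in model_outputs:
    evaluator.update(batch)

metrics = evaluator.start_evaluation()
print(metrics)  # e.g. gTop-1 / gTop-3 / AP / AR entries
```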

### Models
## 🏆 MMScan Benchmark

<span id='topic4'/>

### MMScan Visual Grounding Benchmark

We have adapted the MMScan API for some [models](./models/README.md).
| Methods | gTop-1 | gTop-3 | AP<sub>sample</sub> | AP<sub>box</sub> | AR | Release | Download |
|---------|--------|--------|---------------------|------------------|----|-------|----|
| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/Scanrefer) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) |
| MVT | 7.94 | 13.07 | 13.67 | 2.50 | 86.86 | ~ | ~ |
| BUTD-DETR | 15.24 | 20.68 | 18.58 | 9.27 | 66.62 | ~ | ~ |
| ReGround3D | 16.35 | 26.13 | 22.89 | 5.25 | 43.24 | ~ | ~ |
| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan/models/EmbodiedScan) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) |
| 3D-VisTA | 25.38 | 35.41 | 33.47 | 6.67 | 87.52 | ~ | ~ |
| ViL3DRef | **26.34** | **37.58** | **35.09** | 6.65 | 86.86 | ~ | ~ |

### MMScan Question Answering Benchmark
| Methods | Overall | ST-attr | ST-space | OO-attr | OO-space | OR| Advanced | Release | Download |
|---|--------|--------|--------|--------|--------|--------|-------|----|----|
| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LL3DA) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) |
| LEO | 54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LEO) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) |
| LLaVA-3D | **61.6** | 58.5 | 63.5 | 56.8 | 75.6 | 58.0 | 38.5 | ~ | ~ |

*Note:* These two tables only show results for the main metrics; see the paper for complete results.

We have released the codes of some models under [./models](./models/README.md).

## 📝 TODO List

- \[ \] More Visual Grounding baselines and Question Answering baselines.
<span id='topic5'/>

- \[ \] MMScan annotation and samples for ARKitScenes.
- \[ \] Online evaluation platform for the MMScan benchmark.
- \[ \] Codes of more MMScan Visual Grounding baselines and Question Answering baselines.
- \[ \] Full release and further updates.
21 changes: 20 additions & 1 deletion models/README.md
@@ -21,7 +21,11 @@ These are 3D visual grounding models adapted for the mmscan-devkit. Currently, t
```bash
python -u scripts/train.py --use_color --eval_only --use_checkpoint "path/to/pth"
```
#### ckpts & Logs

| Epoch | gTop-1 @ 0.25/0.50 | Config | Download |
| :-------: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 50 | 4.74 / 2.52 | [config](https://drive.google.com/file/d/1iJtsjt4K8qhNikY8UmIfiQy1CzIaSgyU/view?usp=drive_link) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) |

### EmbodiedScan

1. Follow the [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/blob/main/README.md) to setup the Env. Download the [Multi-View 3D Detection model's weights](https://download.openmmlab.com/mim-example/embodiedscan/mv-3ddet.pth) and change the "load_from" path in the config file under `configs/grounding` to the path where the weights are saved.
@@ -47,6 +51,11 @@ These are 3D visual grounding models adapted for the mmscan-devkit. Currently, t
# Multiple GPU testing
python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth --launcher="pytorch"
```
#### ckpts & Logs

| Input modality | Load pretrain | Epoch | gTop-1 @ 0.25/0.50 | Config | Download |
| :-------: | :----: | :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| Point cloud | True | 12 | 19.66 / 8.82 | [config](https://github.com/rbler1234/EmbodiedScan/blob/mmscan-devkit/models/EmbodiedScan/configs/grounding/pcd_4xb24_mmscan_vg_num256.py) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) |

## 3D Question Answering Models

@@ -84,6 +93,13 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently,
--tmp_path path/to/tmp --api_key your_api_key --eval_size -1
--nproc 4
```
#### ckpts & Logs

| Detector | Captioner | Iters | GPT score overall | Download |
| :-------: | :----: | :----: | :---------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| Vote2Cap-DETR | ll3da | 100k | 45.7 | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) |

### LEO

@@ -117,5 +133,8 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently,
--tmp_path path/to/tmp --api_key your_api_key --eval_size -1
--nproc 4
```
#### ckpts & Logs

*Note:* LEO may encounter a "NaN" error in the MultiHeadAttentionSpatial module when trained for more epochs, due to the training setup (no such problem when training for one epoch on 4 GPUs).
| LLM | 2d/3d backbones | epoch | GPT score overall | Config | Download |
| :-------: | :----: | :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| Vicuna7b | ConvNeXt / PointNet++ | 1 | 54.6 | [config](https://drive.google.com/file/d/1CJccZd4TOaT_JdHj073UKwdA5PWUDtja/view?usp=drive_link) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) |