🌏 Project Page • 🤗 Demo • ▶️ YouTube • 📺 Bilibili • 📀 Data • 📊 Benchmark • 📦 LAMM Models
📆 Coming Soon
📆 [2023-07-28]
📆 [2023-07-06]
📆 [2023-06-30]
📆 [2023-06-20]
📆 [2023-06-16]
📆 [2023-06-12]
GPT Evaluation part available.
Our paper will be released tomorrow. Please stay tuned!
📆 [2023-06-11]
LAMM code is available for the research community!
Try out the Interactive Demo on Hugging Face! (The time to build the app depends on server load.)
For 2D images, we provide an online demo deployed on Hugging Face Spaces.
Due to hardware limitations, the online version only supports a 7B LLM, and loading the pretrained model takes a few minutes.
We also provide a CLI demo for local testing. Point cloud data must be in npy format; we suggest using data from LAMM-Benchmark-3D.
```bash
cd ./src
python cli_demo.py \
    --model lamm_peft \
    --vision_type pcl or image \
    --encoder_pretrain epcl or clip \
    --encoder_ckpt_path $EPCL_CKPT_PATH or '' \
    --vicuna_ckpt_path $LLM_CKPT_PATH \
    --delta_ckpt_path $LAMM_CKPT_PATH
```
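If you need to prepare your own point cloud in the expected npy format, the sketch below shows one way to do it. This is a minimal sketch only: the (4096, 3) XYZ layout and the input file `my_scan.xyz` are assumptions based on the LAMM-Benchmark-3D file names such as `*_4096.npy`, so check the benchmark files for the exact shape and channels that cli_demo.py expects.

```python
# Minimal sketch: save a point cloud as .npy for the CLI demo.
# Assumption: a float32 array of XYZ coordinates with 4096 points,
# mirroring the LAMM-Benchmark-3D files (e.g. *_4096.npy).
import numpy as np

points = np.loadtxt("my_scan.xyz", dtype=np.float32)  # hypothetical input: one "x y z" line per point
points = points[:4096]                                # truncate to the expected point count
np.save("my_scan_4096.npy", points)                   # pass this .npy path to cli_demo.py
```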
Large language models have become a potential pathway toward achieving artificial general intelligence. Recent works on multi-modal large language models (MLLMs) have demonstrated their effectiveness in handling visual modalities. In this work, we extend the research of MLLMs to point clouds and present the LAMM-Dataset and LAMM-Benchmark for 2D image and 3D point cloud understanding. We also establish an extensible framework to facilitate the extension of MLLMs to additional modalities. Our main contribution is three-fold: 1) We present the LAMM-Dataset and LAMM-Benchmark, which cover almost all high-level vision tasks for 2D and 3D vision. Extensive experiments validate the effectiveness of our dataset and benchmark. 2) We demonstrate the detailed methods of constructing instruction-tuning datasets and benchmarks for MLLMs, which will enable future research on MLLMs to scale up and extend to other domains, tasks, and modalities faster. 3) We provide a preliminary yet promising MLLM training framework optimized for extension to new modalities. We also provide baseline models, comprehensive experimental observations, and analysis to accelerate future research.
LAMM-Dataset is a comprehensive multi-modal instruction tuning dataset containing 186K language-image instruction-response pairs and 10K language-3D instruction-response pairs. In LAMM-Dataset, the instruction-response pairs are gathered from 8 image datasets and 4 point cloud datasets. We design four types of multi-modal instruction-response pairs.
Download LAMM-Dataset from here.
If you would like to download the entire LAMM Dataset and LAMM Benchmark, you can do so from the opendatalab website using the provided LAMM link. Here is the table illustrating the correspondence between each Meta file and image collection in the LAMM dataset:
Instruction Data For Training
2D_Instruct data
3D_Instruct data
├── 2D_Instruct
│ ├── coco_images.zip
│ ├── bamboo_images.zip
│ ├── textvqa_images.zip
│ ├── locount_images.zip
│ └── meta_file
│ ├── daily_dialogue_49k.json
│ ├── detailed_description_49k.json
│ ├── factual_knowledge_dialogue_42k.json
│ └── vision_task_dialogue_46k.json
└── 3D_Instruct
├── 3rscan_pcls.zip
├── shapenet_pcls.zip
└── meta_file
└── LAMM_3dinstruct_10k.json
Meta file format
[ { "id": "000000019028", # image id "image": "coco_images/000000019028.jpg", # image path "conversations": [ { "from": "human", # instruction "value": "How is the kitchen in the image furnished?" }, { "from": "gpt", # response "value": "The kitchen in the image is furnished with white cabinets and white appliances. There is a dishwasher, a stove, and a sink. On the stove, a blue towel hangs on the handle. A cutting board is placed on the dishwasher. There are also additional elements like a bowl of apples on the counter and a beige rug on the floor." } ], "task_type": "conversation", # task type "src_image": "coco2017" # original dataset }, { ... } ]
[ { "pcl": "shapenet_pcls/04256520_cb71cb7b36dbcb6f826fc8d57346a2e4_4096.npy", "conversations": [ { "from": "human", "value": "What scenario does this point cloud belong to according to the model\u2019s prediction?" }, { "from": "gpt", "value": "Through meticulous analysis, it becomes evident that the point cloud aligns with the characteristics of sofa,couch,lounge s cenario." } ], "task_type": "classification3d", "src_dataset": "ShapeNet", "src_id": "04256520_cb71cb7b36dbcb6f826fc8d57346a2e4" }, { ... } ]
Notes:
Prerequisite packages: gcc <= 7.5.0; nvcc >= 11.1
```bash
conda create -n lamm python=3.10 -y
conda activate lamm
# Choose the torch version according to your CUDA version
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
```
Install required packages
```bash
pip install -r requirements.txt

# Optional; For 3D experiments ONLY
cd src/model/EPCL/third_party/pointnet2/
python setup.py install
cd ../../utils/
pip install cython
python cython_compile.py build_ext --inplace
```
Download required NLTK data
```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
```
Optional:
FlashAttention (v2)
Install FlashAttention (v2) if you are tight on GPU memory. Please refer to FlashAttention's installation instructions.
FlashAttention-2 currently supports Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).
xformers
Install xformers if you are tight on GPU memory and cannot use FlashAttention (e.g., on an NVIDIA V100). Please refer to xformers' installation instructions.
Data
Follow Download to download and prepare the data for 2D and 3D tasks. Put the downloaded data in the ./data folder.
├── data
│   ├── 2D_Instruct
│   ├── 3D_Instruct
Language Model: Vicuna
To prepare the pre-trained Vicuna model, please follow the instructions provided Here. Put the downloaded model in the ./model_zoo/vicuna_ckpt folder.
3D Encoder: EPCL
Download the pre-trained EPCL model (used to tokenize point clouds) from Here. Put the downloaded model in the ./ckpt folder.
```bash
cd src
sh scripts/train_lamm2d.sh
# or, for slurm:
sh scripts/train_lamm2d_slurm.sh
```
```bash
cd src
sh scripts/train_lamm3d.sh
# or, for slurm:
sh scripts/train_lamm3d_slurm.sh
```
You need to dive into the scripts to change the data path and other hyper-parameters.
For your reference, the GPU memory consumption of different models is shown below:
| Model Size | Sample Num/GPU | GPU Memory |
| --- | --- | --- |
| Vicuna_v0_7B | 1 | ~30GB |
| Vicuna_v0_7B | 2 | ~46GB |
| Vicuna_v0_13B | 1 | ~53GB |
| Vicuna_v0_13B | 2 | ~70GB |

LAMM-Benchmark evaluates 9 common image tasks using a total of 11 datasets with over 62,439 samples, and 3 common point cloud tasks using 3 datasets with over 12,788 samples, while existing works only provide quantitative results on fine-tuning and evaluating specific datasets such as ScienceQA, and most works only conduct demonstrations or user studies.
2D_Benchmark data
3D_Benchmark data
├── 2D_Benchmark
│ ├── ai2d_images.zip
│ ├── celeba_images.zip
│ ├── cifar10_images.zip
│ ├── flickr30k_images.zip
│ ├── fsc147_images.zip
│ ├── lsp_images.zip
│ ├── sqaimage_images.zip
│ ├── svt_images.zip
│ ├── ucmerced_images.zip
│ ├── voc2012_images.zip
│ └── meta_file
│ ├── Caption_flickr30k.json
│ ├── Classification_CIFAR10.json
│ ├── Counting_FSC147.json
│ ├── Detection_VOC2012.json
│ ├── Facial_Classification_CelebA(Hair).json
│ ├── Facial_Classification_CelebA(Smile).json
│ ├── Fine-grained_Classification_UCMerced.json
│ ├── Keypoints_Dectection_LSP.json
│ ├── Locating_FSC147.json
│ ├── Locating_LSP.json
│ ├── Locating_VOC2012.json
│ ├── OCR_SVT.json
│ ├── VQA_AI2D.json
│ └── VQA_SQAimage.json
└── 3D_Benchmark
├── scannet_pcls.zip
└── meta_file
├── Detection_ScanNet.json
├── VG_ScanRefer.json
└── VQA_ScanQA_multiplechoice.json
Model Preparation
Language Model: Vicuna
To prepare the pre-trained Vicuna model, please follow the instructions provided Here. Put the downloaded model in the ./model_zoo/vicuna_ckpt folder.
3D Encoder: EPCL
Download the pre-trained EPCL model (used to tokenize point clouds) from Here. Put the downloaded model in the ./model_zoo/epcl_ckpt folder.
LAMM Models
Download the LAMM model from Here. Put the downloaded models in the ./ckpt folder.
Or you can train your own LAMM model by following the instructions Here!
Other Models
```bash
cd src
sh scripts/inference_2D.sh
# or, for slurm:
sh scripts/inference_2D_slurm.sh
```
```bash
sh scripts/LAMM_2D_Evaluation.sh
# or, for slurm:
sh scripts/LAMM_2D_Evaluation_slurm.sh
```
```bash
cd src
sh scripts/inference_3D.sh
# or, for slurm:
sh scripts/inference_3D_slurm.sh
```
```bash
sh scripts/LAMM_3D_Evaluation.sh
# or, for slurm:
sh scripts/LAMM_3D_Evaluation_slurm.sh
```
Evaluation for other MLLM models.
Please refer to LLaVA, MiniGPT-4, and mPLUG-owl for inference, respectively. Save the answers in ./answers, and then run common_eval_2d.py for evaluation. For example, to evaluate LLaVA on VOC2012:
```bash
python common_eval_2d.py \
    --dataset-name VOC2012 \
    --answer-file ./answers/LLaVA \
    --base-data-path ./data/LAMM-Dataset/2D_Benchmark \
    2>&1 | tee ./results/LLaVA/eval_VOC2012.log
```
GPT Metric
Make sure that you have finished inference on all of the evaluation datasets for both your model (or the LAMM model) and the MLLM models you want to compare against. For example, to rank LAMM and LLaVA:
You may need to dive into the scripts to change the datasets to evaluate and the checkpoint folders to load.
Results of the LAMM model on selected 2D vision tasks:

| Task | Dataset | LAMM (Zero-Shot) | LAMM (Finetune) |
| --- | --- | --- | --- |
| Classification (Acc) | CIFAR10 | 37.90 | 91.2 |
| Object Detection (mAP@0.5) | VOC2012 | 7.20 | 13.48 |
| VQA (Acc) | SQAimage | 49.88 | 74.27 |

Results of 3D tasks by LAMM:

| Task | Dataset | SOTA | LAMM (Zero-Shot) | LAMM (Finetune) |
| --- | --- | --- | --- | --- |
| 3D Object Detection (mAP@0.5) | ScanNet | 63.2 | 8.2 | 11.89 |
| Visual Grounding (mAP@0.5) | ScanRefer | 54.59 | Failed | 3.38 |
| 3D VQA (Acc of multiple-choice problem) | ScanQA | N/A | 24.90 | 99.89 |

Comparison of results of the Binary Locating Metric and GPT Metric of existing MLLMs.

Comparison of Multimodal Large Language Models on 2D computer vision tasks. Bold font indicates the best results.
| Task | Dataset | Metric | SOTA | LLaVA | MiniGPT4 | mPLUG-owl | LAMM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Classification | CIFAR10 | Acc ↑ | 99.5 | 60.83 | 46.22 | 42.5 | 37.9 |
| Detection | VOC2012 | mAP ↑ | 97.2 | 1.42 | 0.92 | 0.158 | 7.20 |
| VQA | SQAimage | Acc ↑ | | | | | |

```
@article{yin2023lamm,
  title={LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark},
  author={Yin, Zhenfei and Wang, Jiong and Cao, Jianjian and Shi, Zhelun and Liu, Dingning and Li, Mukai and Sheng, Lu and Bai, Lei and Huang, Xiaoshui and Wang, Zhiyong and others},
  journal={arXiv preprint arXiv:2306.06687},
  year={2023}
}
```
The project is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The checkpoints are also CC BY NC 4.0 (allowing only non-commercial use).
We thank Hongxing Fan, Zeren Chen, and Zhen Wang for their support of the LAMM project.
We also thank the great works including CLIP, EPCL, LLaMA, Vicuna, and FlashAttention.