LAMM

🌏 Project Page 🤗 Demo ▶️ YouTube 📺 Bilibili 📀 Data 📊 Benchmark 📦 LAMM Models

📆 Coming Soon

  1. Code with lower GPU memory requirements will be released soon. Please stay tuned.

📆 [2023-07-28]

  1. Checkpoints of LAMM on Hugging Face have been updated for the new code base. LAMM performance numbers have also been updated; please check them out.

📆 [2023-07-06]

  1. Evaluation code for both 2D and 3D tasks is ready.
  2. 3D benchmark meta files & 2D instruction image files updated! Missing files and missing keys have been fixed. Please update accordingly.
  3. Updated scripts for the command-line demo.

📆 [2023-06-30]

  1. Watch the LAMM demo video on YouTube and Bilibili!

📆 [2023-06-20]

  1. The full paper with appendix is online.

📆 [2023-06-16]

  1. The LAMM dataset is available to the research community!

📆 [2023-06-12]

  1. GPT evaluation code is available.

  2. Our paper will be released tomorrow. Please stay tuned!

📆 [2023-06-11]

  1. LAMM code is available to the research community!

  2. Try out the interactive demo on Hugging Face! (Time to build the app depends on server load.)

For 2D images, we provide an online demo deployed on Hugging Face Spaces.

Due to limited hardware capacity, the online version only supports a 7B-parameter LLM, and loading the pretrained model takes a few minutes.

We also provide a CLI demo for local testing. Point cloud data must be in npy format; we suggest using data from LAMM-Benchmark-3D.

    cd ./src
    # --vision_type: pcl for point clouds, image for 2D images
    # --encoder_pretrain: epcl for point clouds, clip for images
    # --encoder_ckpt_path: path to the EPCL checkpoint for 3D, or '' for 2D
    python cli_demo.py \
        --model lamm_peft \
        --vision_type pcl \
        --encoder_pretrain epcl \
        --encoder_ckpt_path $EPCL_CKPT_PATH \
        --vicuna_ckpt_path $LLM_CKPT_PATH \
        --delta_ckpt_path $LAMM_CKPT_PATH
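
If you are unsure whether a point cloud file will load correctly, a minimal sanity check with numpy is sketched below. The path reuses the ShapeNet sample from the 3D meta file shown later in this README, and the expected (N, 3) or (N, 6) layout is an assumption rather than something documented here.

    import numpy as np

    # Hypothetical check: the path reuses the ShapeNet sample listed in the 3D meta file below.
    pcl_path = "shapenet_pcls/04256520_cb71cb7b36dbcb6f826fc8d57346a2e4_4096.npy"
    pcl = np.load(pcl_path)

    print(pcl.shape, pcl.dtype)  # e.g. (4096, 3) for xyz or (4096, 6) for xyz + rgb (assumed layouts)
    assert pcl.ndim == 2, "expected a 2D array of points"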

Large language models have become a potential pathway toward achieving artificial general intelligence. Recent works on multi-modal large language models (MLLMs) have demonstrated their effectiveness in handling visual modalities. In this work, we extend the research of MLLMs to point clouds and present the LAMM-Dataset and LAMM-Benchmark for 2D image and 3D point cloud understanding. We also establish an extensible framework to facilitate the extension of MLLMs to additional modalities. Our main contributions are three-fold: 1) We present the LAMM-Dataset and LAMM-Benchmark, which cover almost all high-level vision tasks for 2D and 3D vision. Extensive experiments validate the effectiveness of our dataset and benchmark. 2) We demonstrate the detailed methods of constructing instruction-tuning datasets and benchmarks for MLLMs, which will enable future research on MLLMs to scale up and extend to other domains, tasks, and modalities faster. 3) We provide a primary but potential MLLM training framework optimized for modality extension. We also provide baseline models, comprehensive experimental observations, and analysis to accelerate future research.

LAMM-Dataset is a comprehensive multi-modal instruction tuning dataset, which contains 186K language-image instruction-response pairs and 10K language-3D instruction-response pairs. In LAMM-Dataset, the instruction-response pairs are gathered from 8 image datasets and 4 point cloud datasets. We design four types of multi-modal instruction-response pairs.

Download LAMM-Dataset from here.

If you would like to download the entire LAMM Dataset and LAMM Benchmark, you can do so from the OpenDataLab website using the provided LAMM link. The structure below illustrates the correspondence between each meta file and image collection in the LAMM dataset:

Instruction Data For Training

Dataset Structure
├── 2D_Instruct  
│   ├── coco_images.zip  
│   ├── bamboo_images.zip  
│   ├── textvqa_images.zip  
│   ├── locount_images.zip  
│   └── meta_file  
│       ├── daily_dialogue_49k.json  
│       ├── detailed_description_49k.json  
│       ├── factual_knowledge_dialogue_42k.json  
│       └── vision_task_dialogue_46k.json  
└── 3D_Instruct  
    ├── 3rscan_pcls.zip  
    ├── shapenet_pcls.zip  
    └── meta_file  
        └── LAMM_3dinstruct_10k.json  
Meta file format
[
    {
    "id": "000000019028",  # image id
    "image": "coco_images/000000019028.jpg", # image path
    "conversations": [
        {
            "from": "human",  # instruction
            "value": "How is the kitchen in the image furnished?"
        },
        {
            "from": "gpt",  # response
            "value": "The kitchen in the image is furnished with white cabinets and white appliances. There is a dishwasher, a stove, and a sink. On the stove, a blue towel hangs on the handle. A cutting board is placed on the dishwasher. There are also additional elements like a bowl of apples on the counter and a beige rug on the floor."
        }
    ],
    "task_type": "conversation",  # task type
    "src_image": "coco2017" # original dataset
    },
    {
        ...
    }
]
[
    {
        "pcl": "shapenet_pcls/04256520_cb71cb7b36dbcb6f826fc8d57346a2e4_4096.npy",
        "conversations": [
                {
                    "from": "human",
                    "value": "What scenario does this point cloud belong to according to the model\u2019s prediction?"
                },
                {
                    "from": "gpt",
                    "value": "Through meticulous analysis, it becomes evident that the point cloud aligns with the characteristics of sofa,couch,lounge s       cenario."
                }
            ],
        "task_type": "classification3d",
        "src_dataset": "ShapeNet",
        "src_id": "04256520_cb71cb7b36dbcb6f826fc8d57346a2e4"
    },
    {
        ...
    }
]
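
Since the meta files are plain JSON lists like the examples above, a few lines of Python are enough to inspect one. This is only a reading sketch: the file name comes from the 2D_Instruct listing earlier, and the keys follow the 2D example shown above.

    import json

    # File name taken from the 2D_Instruct/meta_file listing above.
    with open("2D_Instruct/meta_file/daily_dialogue_49k.json") as f:
        samples = json.load(f)

    print(len(samples), "instruction-response pairs")

    sample = samples[0]
    print(sample["id"], sample["image"], sample["task_type"])
    for turn in sample["conversations"]:
        # "human" entries carry the instruction, "gpt" entries the response.
        print(f'{turn["from"]}: {turn["value"][:80]}')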

Notes

  1. If you want to work with a specific subset of the LAMM dataset, you will need to download both the corresponding meta file and the image collection.
  2. If you prefer to download the data from the official websites yourself, you can organize it in the same way as we have and run it successfully. For example, during the 2D instruction tuning stage, if you only want to run the daily_dialogue_49k.json file, you can download the COCO2017 dataset and organize it accordingly.

Prerequisite packages: gcc <= 7.5.0; nvcc >= 11.1

    conda create -n lamm python=3.10 -y
    conda activate lamm
    # Choose a different version of torch according to your CUDA version
    conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
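
Before installing the remaining packages, it can help to confirm that the torch build can see your GPU; this is a generic PyTorch check, not part of the LAMM scripts.

    import torch

    print(torch.__version__)          # expect 1.12.1 with the command above
    print(torch.version.cuda)         # expect 11.3
    print(torch.cuda.is_available())  # should print True on a CUDA machine
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))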

Install required packages

    pip install -r requirements.txt

    # Optional; For 3D experiments ONLY
    cd src/model/EPCL/third_party/pointnet2/
    python setup.py install
    cd ../../utils/
    pip install cython
    python cython_compile.py build_ext --inplace

Download required NLTK data

    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')

Optional:

Data & Model Preparation for Training

You need to dive into the scripts to change data paths and other hyper-parameters.

For your reference, GPU memory consumption for different models is shown as follows:

| Model Size | Sample Num/GPU | GPU Memory |
| --- | --- | --- |
| Vicuna_v0_7B | 1 | ~30GB |
| Vicuna_v0_7B | 2 | ~46GB |
| Vicuna_v0_13B | 1 | ~53GB |
| Vicuna_v0_13B | 2 | ~70GB |

LAMM-Benchmark evaluates 9 common image tasks using 11 datasets with over 62,439 samples, and 3 common point cloud tasks using 3 datasets with over 12,788 samples. In contrast, existing works only provide quantitative results from fine-tuning and evaluating on specific datasets such as ScienceQA, and most only conduct demonstrations or user studies.

Data & Model Preparation for LAMM-Benchmark

Benchmark Data For Evaluation

Dataset Structure
├── 2D_Benchmark  
│   ├── ai2d_images.zip  
│   ├── celeba_images.zip  
│   ├── cifar10_images.zip  
│   ├── flickr30k_images.zip  
│   ├── fsc147_images.zip  
│   ├── lsp_images.zip  
│   ├── sqaimage_images.zip  
│   ├── svt_images.zip  
│   ├── ucmerced_images.zip  
│   ├── voc2012_images.zip  
│   └── meta_file  
│       ├── Caption_flickr30k.json  
│       ├── Classification_CIFAR10.json  
│       ├── Counting_FSC147.json  
│       ├── Detection_VOC2012.json  
│       ├── Facial_Classification_CelebA(Hair).json  
│       ├── Facial_Classification_CelebA(Smile).json  
│       ├── Fine-grained_Classification_UCMerced.json  
│       ├── Keypoints_Dectection_LSP.json  
│       ├── Locating_FSC147.json  
│       ├── Locating_LSP.json  
│       ├── Locating_VOC2012.json  
│       ├── OCR_SVT.json  
│       ├── VQA_AI2D.json  
│       └── VQA_SQAimage.json  
└── 3D_Benchmark  
    ├── scannet_pcls.zip  
    └── meta_file  
        ├── Detection_ScanNet.json  
        ├── VG_ScanRefer.json  
        └── VQA_ScanQA_multiplechoice.json
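
To get a quick look at an evaluation split before running the benchmark scripts, you can load one of the meta files above. This sketch assumes the benchmark meta files use the same list-of-dicts JSON layout as the instruction data, which may not hold exactly.

    import json
    from collections import Counter

    # Assumes benchmark meta files are JSON lists like the instruction meta files;
    # adjust the keys if the actual schema differs.
    with open("2D_Benchmark/meta_file/VQA_SQAimage.json") as f:
        records = json.load(f)

    print(len(records), "evaluation samples")
    print(Counter(r.get("task_type", "unknown") for r in records))
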
Model Preparation

You may need to dive into the scripts to change the datasets to evaluate and the checkpoints folder to load.

Results of the LAMM model on selected 2D vision tasks:

| Task | Dataset | LAMM (Zero-Shot) | LAMM (Finetune) |
| --- | --- | --- | --- |
| Classification (Acc) | CIFAR10 | 37.90 | 91.2 |
| Object Detection (mAP@0.5) | VOC2012 | 7.20 | 13.48 |
| VQA (Acc) | SQAimage | 49.88 | 74.27 |

Results of 3D tasks by LAMM:

| Task | Dataset | SOTA | LAMM (Zero-Shot) | LAMM (Finetune) |
| --- | --- | --- | --- | --- |
| 3D Object Detection (mAP@0.5) | ScanNet | 63.2 | 8.2 | 11.89 |
| Visual Grounding (mAP@0.5) | ScanRefer | 54.59 | Failed | 3.38 |
| 3D VQA (Acc of multiple-choice problem) | ScanQA | N/A | 24.90 | 99.89 |

Comparison of results of the Binary Locating Metric and GPT Metric of existing MLLMs.

Comparison of Multimodal Large Language Models on 2D computer vision tasks.

Bold font indicates the best results.

| Task | Dataset | Metric | SOTA | LLaVA | MiniGPT4 | mPLUG-owl | LAMM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Classification | CIFAR10 | Acc ↑ | 99.5 | 60.83 | 46.22 | 42.5 | 37.9 |
| Detection | VOC2012 | mAP ↑ | 97.2 | 1.42 | 0.92 | 0.158 | 7.20 |
| VQA | SQAimage | Acc ↑ | 92.53 | 40.5 | 43.43 | 36.39 | 49.88 |
| VQA | AI2D | Acc ↑ | N/A | 18.13 | Failed | 19.31 | 20.92 |
| Image Caption | flickr30k | BLEU4 ↑ | 30.1 | 6.65 | 5.1 | 2.74 | 2.56 |
| Fine-grained Classification | UCMerced | Acc ↑ | 100 | 47 | 33.6 | 32.5 | 18.23 |
| Counting | FSC147 | MAE ↓ | 10.79 | 56.2 | Failed | 60.67 | 46.88 |
| OCR | SVT | Word Acc ↑ | 97.9 | 37.78 | 16.97 | 30.39 | 29.14 |
| Facial Classification | CelebA(Smile) | Acc ↑ | N/A | Failed | 66.36 | Failed | 57.60 |
| Facial Classification | CelebA(Hair) | Acc ↑ | N/A | 46.42 | 43.47 | 40.93 | 56.96 |
| Keypoints Detection | LSP | PCK ↑ | 99.5 | Failed | Failed | Failed | Failed |

LAMM Models

| # Training Samples | Vision Encoder | LLM | Training Data | LoRA Rank | Link |
| --- | --- | --- | --- | --- | --- |
| 98K | CLIP-ViT-L | Vicuna_v0_7B | LAMM-2D daily dialogue & description | 32 | Checkpoints |
| 186K | CLIP-ViT-L | Vicuna_v0_7B | LAMM-2D Instruction Data | 32 | Checkpoints |
| 98K | CLIP-ViT-L | Vicuna_v0_13B | LAMM-2D daily dialogue & description | 32 | Checkpoints |
| 186K | CLIP-ViT-L | Vicuna_v0_13B | LAMM-2D Instruction Data | 32 | Checkpoints |
| 10K | EPCL-ViT-L | Vicuna13B | LAMM-3D Instruction Data | 32 | Checkpoints |
    @article{yin2023lamm,
        title={LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark},
        author={Yin, Zhenfei and Wang, Jiong and Cao, Jianjian and Shi, Zhelun and Liu, Dingning and Li, Mukai and Sheng, Lu and Bai, Lei and Huang, Xiaoshui and Wang, Zhiyong and others},
        journal={arXiv preprint arXiv:2306.06687},
        year={2023}
    }

The project is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The checkpoints are also CC BY NC 4.0 (allowing only non-commercial use).

We thank Hongxing Fan, Zeren Chen, and Zhen Wang for their support of the LAMM project.

We also thank the great works including CLIP, EPCL, LLaMA, Vicuna, FlashAttention, xformers, and lightllm.

