ModelCloud/GPTQModel: Production-ready LLM model compression/quantization toolkit with HW-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang.

GPTQModel is a production-ready LLM model compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF Transformers, vLLM, and SGLang.

Public and ModelCloud internal tests have shown that GPTQ is on par with or exceeds other 4-bit quantization methods in terms of both quality recovery and production-level inference speed, for token latency and requests per second (rps) alike. GPTQ offers the blend of quality and inference speed needed in a real-world production deployment.

GPTQModel supports not only GPTQ but also QQQ, GPTQ v2, and EoRA, with more quantization methods and enhancements planned.

GPTQModel has a modular design supporting multiple quantization methods and feature extensions.

| Quantization Feature | GPTQModel | Transformers | vLLM | SGLang | Lora Training |
|---|---|---|---|---|---|
| GPTQ | ✅ | ✅ | ✅ | ✅ | ✅ |
| EoRA | ✅ | ✅ | ✅ | ✅ | x |
| GPTQ v2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| QQQ | ✅ | x | x | x | x |
| Rotation | ✅ | x | x | x | x |

Native support for some of the most popular multi-modal models:

| Multi-Modal | Support |
|---|---|
| Qwen 2.5 Omni | ✅ |
| Qwen2 VL | ✅ |
| Ovis 1.6 + 2 | ✅ |
| Phi-4 MultiModal | ✅ |

Quality: GPTQ 4bit (5.0 bpw) can match BF16:

🤗 ModelCloud quantized Vortex models on HF

Experimental GPTQ v2 quantization: users have reported that this quantization mode may or may not match the original GPTQ v1 implementation in terms of quality recovery.

Model support:

| Model | Model | Model | Model | Model |
|---|---|---|---|---|
| Baichuan ✅ | EXAONE 3.0 ✅ | InternLM 1/2.5 ✅ | OPT ✅ | StarCoder2 ✅ |
| Bloom ✅ | Falcon (H1) ✅ | Llama 1-3.3 ✅ | OLMo2 ✅ | TeleChat2 ✅ |
| ChatGLM ✅ | Gemma 1/2/3 ✅ | Llama 3.2 VL ✅ | Ovis 1.6/2 ✅ | Yi ✅ |
| CodeGen ✅ | GPTBigCode ✅ | LongLLaMA ✅ | Phi 1-4 ✅ | XVERSE ✅ |
| Cohere 1-2 ✅ | GPT-Neo/GPT-NeoX ✅ | MiniCPM3 ✅ | PanGu-α ✅ | |
| DBRX Converted ✅ | GPT-2 ✅ | Mistral ✅ | Qwen 1/2/3 ✅ | |
| Deci ✅ | GPT-J ✅ | Mixtral ✅ | Qwen 2/3 MoE ✅ | |
| DeepSeek-V2/V3/R1 ✅ | Granite ✅ | MobileLLM ✅ | Qwen 2/2.5 VL ✅ | |
| DeepSeek-V2-Lite ✅ | GRIN-MoE ✅ | MOSS ✅ | Qwen 2.5 Omni ✅ | |
| Dream ✅ | Hymba ✅ | MPT ✅ | RefinedWeb ✅ | |
| ERNIE 4.5 ✅ | Instella ✅ | Nemotron Ultra ✅ | StableLM ✅ | |

GPTQModel is validated on Linux, macOS, and Windows 11:

| Platform | Device | Optimized | Arch | Kernels |
|---|---|---|---|---|
| 🐧 Linux | Nvidia GPU | ✅ | Ampere+ | Marlin, Exllama V2, Exllama V1, Triton, Torch |
| 🐧 Linux | Intel XPU | ✅ | Arc, Datacenter Max | IPEX, Torch |
| 🐧 Linux | AMD GPU | ✅ | 7900XT+, ROCm 6.2+ | Exllama V2, Exllama V1, Torch |
| 🐧 Linux | Intel/AMD CPU | ✅ | AVX, AMX, XMX | IPEX, Torch |
| 🍎 macOS | GPU (Metal) / CPU | ✅ | Apple Silicon, M1+ | Torch, MLX via conversion |
| 🪟 Windows | GPU (Nvidia) / CPU | ✅ | Nvidia | Torch |
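
Kernel selection is normally automatic, but a specific kernel from the table can be requested at load time. The sketch below is a hedged illustration: the BACKEND enum export and the backend argument of GPTQModel.load are assumptions based on common GPTQModel usage and are not documented above.

# minimal sketch, assuming `BACKEND` is exported by gptqmodel and that
# GPTQModel.load accepts a `backend` argument; by default the best kernel
# for the platform is picked automatically
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",
    backend=BACKEND.MARLIN,  # e.g. force the Marlin kernel on Ampere+ Nvidia GPUs
)
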
Installation

# You can install optional modules like auto_round, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas,ipex,auto_round]
pip install -v gptqmodel --no-build-isolation
uv pip install -v gptqmodel --no-build-isolation
Install from source:

# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# pip: compile and install
# You can install optional modules like auto_round, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas,ipex,auto_round]
pip install -v . --no-build-isolation

Three-line API to use GPTQModel for GPTQ model inference:

from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output

To use models from ModelScope instead of HuggingFace Hub, set an environment variable:

export GPTQMODEL_USE_MODELSCOPE=True
from gptqmodel import GPTQModel
# load Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 from modelscope
model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
OpenAI API compatible endpoint
# load model using above inference guide first
model.serve(host="0.0.0.0", port="12345")
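
As a hedged client-side illustration, the running endpoint can be queried with the official openai Python client; the /v1 route, the placeholder API key, and the model name below are assumptions not documented above.

# minimal sketch, assuming the server exposes standard OpenAI /v1 routes on the
# host/port passed to model.serve() and does not enforce an API key
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:12345/v1", api_key="none")  # placeholder key

completion = client.chat.completions.create(
    model="gptqmodel",  # hypothetical model name; use the id the server actually reports
    messages=[{"role": "user", "content": "Uncovering deep insights begins with"}],
)
print(completion.choices[0].message.content)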

Basic example of using GPTQModel to quantize an LLM:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
  ).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)
Quantization using GPTQ V2

Enable GPTQ v2 quantization by setting v2 = True for potentially higher post-quantization accuracy recovery.

# note: v2 is currently experimental and requires 2-4x more vram to execute
# if you OOM on 1 GPU, set CUDA_VISIBLE_DEVICES=0,1 to expose 2 GPUs; gptqmodel will automatically use the second GPU
quant_config = QuantizeConfig(bits=4, group_size=128, v2=True)

Llama 3.1 8B-Instruct quantized using test/models/test_llama3_2.py

| Method | Bits / Group Size | ARC_CHALLENGE | GSM8K_Platinum_COT |
|---|---|---|---|
| GPTQ | 4 / 128 | 49.15 | 48.30 |
| GPTQ v2 | 4 / 128 | 49.74 👍 +1.20% | 61.46 🔥 +27.25% |
| GPTQ | 3 / 128 | 39.93 | 43.26 |
| GPTQ v2 | 3 / 128 | 41.13 👍 +3.01% | 50.54 🔥 +16.83% |
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
Quantization + EoRA Accuracy Recovery

GPTQModel now supports EoRA, a LoRA method that can further improve the accuracy of the quantized model.

# higher rank improves accuracy at the cost of vram usage
# suggestion: test rank 64 and 32 before 128 or 256 as the latter may overfit while increasing memory usage
from gptqmodel.adapter.adapter import Lora  # note: the Lora import path may differ across GPTQModel versions

eora = Lora(
  # for eora generation, path is the adapter save path; for load, it is the loading path
  path=f"{quant_path}/eora_rank32",
  rank=32,
)

# provide a previously gptq quantized model path
GPTQModel.adapter.generate(
  adapter=eora,
  model_id_or_path=model_id,
  quantized_model_id_or_path=quant_path,
  calibration_dataset=calibration_dataset,
  calibration_dataset_concat_size=0,
  auto_gc=False)

# post-eora inference
model = GPTQModel.load(
  model_id_or_path=quant_path,
  adapter=eora
)

tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)

print(f"Result: {result}")
# For more details on EoRA, please see GPTQModel/examples/eora
# Please use the benchmark tools in the later part of this README to evaluate EoRA effectiveness

For more advanced model quantization features, please refer to this script.

How to Add Support for a New Model

Read the gptqmodel/models/llama.py code, which explains in detail via comments how model support is defined. Use it as a guide to submit PRs for new models. Most models follow the same pattern.
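
As a rough illustration of that pattern, the sketch below shows the general shape of a model definition class. The attribute names mirror the Llama definition, and the base-class import path is an assumption that may differ between GPTQModel versions.

# illustrative sketch only; attribute names follow gptqmodel/models/llama.py and
# the import path below is assumed, so check your installed version
from gptqmodel.models.base import BaseGPTQModel

class MyModelGPTQ(BaseGPTQModel):
    # modules that live outside the repeating decoder layers
    base_modules = ["model.embed_tokens", "model.norm"]
    # module path that holds the list of repeating decoder layers
    layers_node = "model.layers"
    # class name of a single decoder layer
    layer_type = "MyModelDecoderLayer"
    # quantizable sub-modules, grouped in forward order
    layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.o_proj"],
        ["mlp.up_proj", "mlp.gate_proj"],
        ["mlp.down_proj"],
    ]

New definitions are then registered so GPTQModel.load can resolve the architecture; llama.py and its neighboring files show where that wiring lives.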

Evaluation and Quality Benchmarks

GPTQModel inference is integrated into both lm-eval and evalplus.
We highly recommend avoiding ppl and instead using lm-eval/evalplus to validate post-quantization model quality. ppl should only be used for regression tests, as it is not a good indicator of model output quality.

# gptqmodel is integrated into lm-eval >= v0.4.7
pip install "lm-eval>=0.4.7"
# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"

Below is a basic example using the GPTQModel.eval API:

from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"

# Use `lm-eval` as framework to evaluate the model
lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.LM_EVAL.ARC_CHALLENGE], output_file='lm-eval_result.json')

# Use `evalplus` as framework to evaluate the model
evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL.EVALPLUS.HUMAN], output_file='evalplus_result.json')
Dynamic Quantization (Per Module QuantizeConfig Override)

QuantizeConfig.dynamic provides dynamic control that allows specific matching modules to be skipped for quantization (negative matching) or to receive a unique [bits, group_size, sym, desc_act, mse, pack_dtype] property override per matching module versus the base QuantizeConfig (positive match with override).

Sample QuantizeConfig.dynamic usage:

dynamic = {
    # `.*\.` matches the layers_node prefix
    # layer index starts at 0

    # positive match: layer index 18, gate module
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},

    # positive match: layer index 19, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},

    # negative match: skip layer index 20, gate module
    r"-:.*\.20\..*gate.*": {},

    # negative match: skip all down modules for all layers
    r"-:.*down.*": {},
}
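
The dynamic dict is then attached to the quantization config before calling quantize. A minimal sketch, assuming dynamic is accepted as a QuantizeConfig constructor argument alongside the settings used in the earlier examples:

# attach the per-module overrides to the base quantization config
quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)
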
Group Aware Reordering (GAR)

Group Aware Reordering (GAR) is an enhanced activation reordering scheme designed to significantly improve the accuracy of quantized models without incurring additional inference overhead. Unlike traditional activation reordering, GAR restricts permutations to within individual groups or rearrangements of entire groups. This ensures each group's associated scales and zero-points remain efficiently accessible during inference, thereby avoiding any inference-time overhead.

How to enable GAR:

Set the hyb_act parameter to True and disable the default activation reordering by setting desc_act to False in your QuantizeConfig. For example:

quant_config = QuantizeConfig(bits=4, group_size=128, desc_act=False, hyb_act=True)

This feature is based on the method introduced in:

T Gafni, A Karnieli, Y Hanani, "Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference," CVPR Workshop, 2025.

Attribution of Quantization Methods:
# GPTQModel
@misc{qubitium2024gptqmodel,
  author = {ModelCloud.ai and qubitium@modelcloud.ai},
  title = {GPTQModel},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
  note = {Contact: qubitium@modelcloud.ai},
  year = {2024},
}

# GPTQ
@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}

# EoRA
@article{liu2024eora,
  title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author={Liu, Shih-Yang and Yang, Huck and Wang, Chien-Yi and Fung, Nai Chit and Yin, Hongxu and Sakr, Charbel and Muralidharan, Saurav and Cheng, Kwang-Ting and Kautz, Jan and Wang, Yu-Chiang Frank and others},
  journal={arXiv preprint arXiv:2410.21271},
  year={2024}
}

# Group Aware Reordering (GAR)
@article{gar,
  title={Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference},
  author={Gafni, T. and Karnieli, A. and Hanani, Y.},
  journal={arXiv preprint arXiv:2505.14638},
  note={CVPR Workshops 2025},
  year={2025}
}

# GPTQ Marlin Kernel
@article{frantar2024marlin,
  title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
  author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2408.11743},
  year={2024}
}

# QQQ 
@article{zhang2024qqq,
  title={QQQ: Quality Quattuor-Bit Quantization for Large Language Models},
  author={Ying Zhang and Peng Zhang and Mincong Huang and Jingyang Xiang and Yujie Wang and Chao Wang and Yineng Zhang and Lei Yu and Chuan Liu and Wei Lin},
  journal={arXiv preprint arXiv:2406.09904},
  year={2024}
}

# GPTQ v2
@article{li2025gptqv2,
  title={GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration}, 
  author={Yuhang Li and Ruokai Yin and Donghyun Lee and Shiting Xiao and Priyadarshini Panda},
  journal={arXiv preprint arXiv:2504.02692},
  year={2025}
}
