ModelCloud/GPTQModel: Production-ready LLM model compression/quantization toolkit with HW-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang.

GPTQModel is a production-ready LLM model compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF Transformers, vLLM, and SGLang.

Public and ModelCloud internal tests have shown that GPTQ is on par with or exceeds other 4-bit quantization methods in terms of both quality recovery and production-level inference speed, for token latency and requests per second (rps) alike. GPTQ offers the blend of quality and inference speed needed in a real-world production deployment.

GPTQModel supports not only GPTQ but also QQQ, GPTQ v2, and EoRA, with more quantization methods and enhancements planned.

GPTQModel has a modular design supporting multiple quantization methods and feature extensions.

| Quantization Feature | GPTQModel | Transformers | vLLM | SGLang | Lora Training |
|---|---|---|---|---|---|
| GPTQ | ✅ | ✅ | ✅ | ✅ | ✅ |
| EoRA | ✅ | ✅ | ✅ | ✅ | x |
| GPTQ v2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| QQQ | ✅ | x | x | x | x |
| Rotation | ✅ | x | x | x | x |

Native support for some of the most popular multi-modal models:

| Multi-Modal | Support |
|---|---|
| Qwen 2.5 Omni | ✅ |
| Qwen2 VL | ✅ |
| Ovis 1.6 + 2 | ✅ |
| Phi-4 MultiModal | ✅ |

Quality: GPTQ 4bit (5.0 bpw) can match BF16:

🤗 ModelCloud quantized Vortex models on HF

Experimental GPTQ v2 quantization: users have reported that this quantization mode may or may not match the original GPTQ v1 implementation in terms of quality recovery.

Model support:

| Model | Model | Model | Model | Model |
|---|---|---|---|---|
| Baichuan ✅ | EXAONE 3.0 ✅ | InternLM 1/2.5 ✅ | OPT ✅ | StarCoder2 ✅ |
| Bloom ✅ | Falcon (H1) ✅ | Llama 1-3.3 ✅ | OLMo2 ✅ | TeleChat2 ✅ |
| ChatGLM ✅ | Gemma 1/2/3 ✅ | Llama 3.2 VL ✅ | Ovis 1.6/2 ✅ | Yi ✅ |
| CodeGen ✅ | GPTBigCode ✅ | LongLLaMA ✅ | Phi 1-4 ✅ | XVERSE ✅ |
| Cohere 1-2 ✅ | GPT-Neo/GPT-NeoX ✅ | MiniCPM3 ✅ | PanGu-α ✅ | |
| DBRX Converted ✅ | GPT-2 ✅ | Mistral ✅ | Qwen 1/2/3 ✅ | |
| Deci ✅ | GPT-J ✅ | Mixtral ✅ | Qwen 2/3 MoE ✅ | |
| DeepSeek-V2/V3/R1 ✅ | Granite ✅ | MobileLLM ✅ | Qwen 2/2.5 VL ✅ | |
| DeepSeek-V2-Lite ✅ | GRIN-MoE ✅ | MOSS ✅ | Qwen 2.5 Omni ✅ | |
| Dream ✅ | Hymba ✅ | MPT ✅ | RefinedWeb ✅ | |
| ERNIE 4.5 ✅ | Instella ✅ | Nemotron Ultra ✅ | StableLM ✅ | |

GPTQModel is validated on Linux, macOS, and Windows 11:

| Platform | Device | Optimized | Arch | Kernels |
|---|---|---|---|---|
| 🐧 Linux | Nvidia GPU | ✅ | Ampere+ | Marlin, Exllama V2, Exllama V1, Triton, Torch |
| 🐧 Linux | Intel XPU | ✅ | Arc, Datacenter Max | IPEX, Torch |
| 🐧 Linux | AMD GPU | ✅ | 7900XT+, ROCm 6.2+ | Exllama V2, Exllama V1, Torch |
| 🐧 Linux | Intel/AMD CPU | ✅ | AVX, AMX, XMX | IPEX, Torch |
| 🍎 macOS | GPU (Metal) / CPU | ✅ | Apple Silicon, M1+ | Torch, MLX via conversion |
| 🪟 Windows | GPU (Nvidia) / CPU | ✅ | Nvidia | Torch |
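
Kernel selection is normally automatic, but a specific kernel from the table can be requested at load time. The sketch below is a hedged illustration: the BACKEND enum export and the backend argument of GPTQModel.load are assumptions based on common GPTQModel usage and are not documented above.

# minimal sketch, assuming `BACKEND` is exported by gptqmodel and that
# GPTQModel.load accepts a `backend` argument; by default the best kernel
# for the platform is picked automatically
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",
    backend=BACKEND.MARLIN,  # e.g. force the Marlin kernel on Ampere+ Nvidia GPUs
)
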
Installation

# You can install optional modules like auto_round, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas,ipex,auto_round]
pip install -v gptqmodel --no-build-isolation
uv pip install -v gptqmodel --no-build-isolation
Install from source:

# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# pip: compile and install
# You can install optional modules like auto_round, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas,ipex,auto_round]
pip install -v . --no-build-isolation

Three-line API to use GPTQModel for GPTQ model inference:

from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output

To use models from ModelScope instead of HuggingFace Hub, set an environment variable:

export GPTQMODEL_USE_MODELSCOPE=True
from gptqmodel import GPTQModel
# load Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 from modelscope
model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
OpenAI API compatible endpoint
# load model using above inference guide first
model.serve(host="0.0.0.0", port="12345")
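
As a hedged client-side illustration, the running endpoint can be queried with the official openai Python client; the /v1 route, the placeholder API key, and the model name below are assumptions not documented above.

# minimal sketch, assuming the server exposes standard OpenAI /v1 routes on the
# host/port passed to model.serve() and does not enforce an API key
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:12345/v1", api_key="none")  # placeholder key

completion = client.chat.completions.create(
    model="gptqmodel",  # hypothetical model name; use the id the server actually reports
    messages=[{"role": "user", "content": "Uncovering deep insights begins with"}],
)
print(completion.choices[0].message.content)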

Basic example of using GPTQModel to quantize an LLM:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
  ).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)
Quantization using GPTQ V2

Enable GPTQ v2 quantization by setting v2 = True for potentially higher post-quantization accuracy recovery.

# note: v2 is currently experimental and requires 2-4x more vram to execute
# if you OOM on 1 GPU, set CUDA_VISIBLE_DEVICES=0,1 to expose 2 GPUs; gptqmodel will automatically use the second GPU
quant_config = QuantizeConfig(bits=4, group_size=128, v2=True)

Llama 3.1 8B-Instruct quantized using test/models/test_llama3_2.py

| Method | Bits / Group Size | ARC_CHALLENGE | GSM8K_Platinum_COT |
|---|---|---|---|
| GPTQ | 4 / 128 | 49.15 | 48.30 |
| GPTQ v2 | 4 / 128 | 49.74 👍 +1.20% | 61.46 🔥 +27.25% |
| GPTQ | 3 / 128 | 39.93 | 43.26 |
| GPTQ v2 | 3 / 128 | 41.13 👍 +3.01% | 50.54 🔥 +16.83% |
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
Quantization + EoRA Accuracy Recovery

GPTQModel now supports EoRA, a LoRA method that can further improve the accuracy of the quantized model.

# higher rank improves accuracy at the cost of vram usage
# suggestion: test rank 64 and 32 before 128 or 256 as the latter may overfit while increasing memory usage
from gptqmodel.adapter.adapter import Lora  # note: the Lora import path may differ across GPTQModel versions

eora = Lora(
  # for eora generation, path is the adapter save path; for load, it is the loading path
  path=f"{quant_path}/eora_rank32",
  rank=32,
)

# provide a previously gptq quantized model path
GPTQModel.adapter.generate(
  adapter=eora,
  model_id_or_path=model_id,
  quantized_model_id_or_path=quant_path,
  calibration_dataset=calibration_dataset,
  calibration_dataset_concat_size=0,
  auto_gc=False)

# post-eora inference
model = GPTQModel.load(
  model_id_or_path=quant_path,
  adapter=eora
)

tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)

print(f"Result: {result}")
# For more details on EoRA, please see GPTQModel/examples/eora
# Please use the benchmark tools in the later part of this README to evaluate EoRA effectiveness

For more advanced model quantization features, please refer to this script.

How to Add Support for a New Model

Read the gptqmodel/models/llama.py code, which explains in detail via comments how model support is defined. Use it as a guide to submit PRs for new models. Most models follow the same pattern.
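
As a rough illustration of that pattern, the sketch below shows the general shape of a model definition class. The attribute names mirror the Llama definition, and the base-class import path is an assumption that may differ between GPTQModel versions.

# illustrative sketch only; attribute names follow gptqmodel/models/llama.py and
# the import path below is assumed, so check your installed version
from gptqmodel.models.base import BaseGPTQModel

class MyModelGPTQ(BaseGPTQModel):
    # modules that live outside the repeating decoder layers
    base_modules = ["model.embed_tokens", "model.norm"]
    # module path that holds the list of repeating decoder layers
    layers_node = "model.layers"
    # class name of a single decoder layer
    layer_type = "MyModelDecoderLayer"
    # quantizable sub-modules, grouped in forward order
    layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.o_proj"],
        ["mlp.up_proj", "mlp.gate_proj"],
        ["mlp.down_proj"],
    ]

New definitions are then registered so GPTQModel.load can resolve the architecture; llama.py and its neighboring files show where that wiring lives.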

Evaluation and Quality Benchmarks

GPTQModel inference is integrated into both lm-eval and evalplus.
We highly recommend avoiding ppl and instead using lm-eval/evalplus to validate post-quantization model quality. ppl should only be used for regression tests, as it is not a good indicator of model output quality.

# gptqmodel is integrated into lm-eval >= v0.4.7
pip install "lm-eval>=0.4.7"
# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"

Below is a basic example using the GPTQModel.eval API:

from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"

# Use `lm-eval` as framework to evaluate the model
lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.LM_EVAL.ARC_CHALLENGE], output_file='lm-eval_result.json')

# Use `evalplus` as framework to evaluate the model
evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL.EVALPLUS.HUMAN], output_file='evalplus_result.json')
Dynamic Quantization (Per Module QuantizeConfig Override)

QuantizeConfig.dynamic provides dynamic control that allows specific matching modules to be skipped for quantization (negative matching) or to receive a unique [bits, group_size, sym, desc_act, mse, pack_dtype] property override per matching module versus the base QuantizeConfig (positive match with override).

Sample QuantizeConfig.dynamic usage:

dynamic = {
    # `.*\.` matches the layers_node prefix
    # layer index starts at 0

    # positive match: layer index 18, gate module
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},

    # positive match: layer index 19, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},

    # negative match: skip layer index 20, gate module
    r"-:.*\.20\..*gate.*": {},

    # negative match: skip all down modules for all layers
    r"-:.*down.*": {},
}
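
The dynamic dict is then attached to the quantization config before calling quantize. A minimal sketch, assuming dynamic is accepted as a QuantizeConfig constructor argument alongside the settings used in the earlier examples:

# attach the per-module overrides to the base quantization config
quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)
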
Group Aware Reordering (GAR)

Group Aware Reordering (GAR) is an enhanced activation reordering scheme designed to significantly improve the accuracy of quantized models without incurring additional inference overhead. Unlike traditional activation reordering, GAR restricts permutations to within individual groups or rearrangements of entire groups. This ensures each group's associated scales and zero-points remain efficiently accessible during inference, thereby avoiding any inference-time overhead.

How to enable GAR:

Set the hyb_act parameter to True and disable the default activation reordering by setting desc_act to False in your QuantizeConfig. For example:

quant_config = QuantizeConfig(bits=4, group_size=128, desc_act=False, hyb_act=True)

This feature is based on the method introduced in:

T Gafni, A Karnieli, Y Hanani, "Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference," CVPR Workshop, 2025.

Attribution of Quantization Methods:
# GPTQModel
@misc{qubitium2024gptqmodel,
  author = {ModelCloud.ai and qubitium@modelcloud.ai},
  title = {GPTQModel},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
  note = {Contact: qubitium@modelcloud.ai},
  year = {2024},
}

# GPTQ
@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}

# EoRA
@article{liu2024eora,
  title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author={Liu, Shih-Yang and Yang, Huck and Wang, Chien-Yi and Fung, Nai Chit and Yin, Hongxu and Sakr, Charbel and Muralidharan, Saurav and Cheng, Kwang-Ting and Kautz, Jan and Wang, Yu-Chiang Frank and others},
  journal={arXiv preprint arXiv:2410.21271},
  year={2024}
}

# Group Aware Reordering (GAR)
@article{gar,
  title={Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference},
  author={Gafni, T. and Karnieli, A. and Hanani, Y.},
  journal={arXiv preprint arXiv:2505.14638},
  note={CVPR Workshops 2025},
  year={2025}
}

# GPTQ Marlin Kernel
@article{frantar2024marlin,
  title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
  author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2408.11743},
  year={2024}
}

# QQQ 
@article{zhang2024qqq,
  title={QQQ: Quality Quattuor-Bit Quantization for Large Language Models},
  author={Ying Zhang and Peng Zhang and Mincong Huang and Jingyang Xiang and Yujie Wang and Chao Wang and Yineng Zhang and Lei Yu and Chuan Liu and Wei Lin},
  journal={arXiv preprint arXiv:2406.09904},
  year={2024}
}

# GPTQ v2
@article{li2025gptqv2,
  title={GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration}, 
  author={Yuhang Li and Ruokai Yin and Donghyun Lee and Shiting Xiao and Priyadarshini Panda},
  journal={arXiv preprint arXiv:2504.02692},
  year={2025}
}
