Latest news 🔥
[04/2025] Support for SpD and multi-projection heads. Implemented post-attention hidden-size projections to speculate tokens ahead of the base model
[04/2025] QNN Compilation support for AutoModel classes. QNN compilation capabilities for multi-models, embedding models and causal models.
[04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for disaggregated serving.
[04/2025] SwiftKV: support for both continuous and non-continuous batching execution
[04/2025] Support for GGUF model execution (without quantized weights)
[04/2025] Enabled FP8 model support on replicate_kv_heads script
[04/2025] Added support for gradient checkpointing in the finetuning script
[04/2025] Added support for the model ibm-granite/granite-vision-3.2-2b
[03/2025] Added support for swiftkv model Snowflake/Llama-3.1-SwiftKV-8B-Instruct
[02/2025] VLMs support added for the models InternVL-1B, Llava and Mllama
[01/2025] Added support for inference of FP8 models
[01/2025] Added support for [Ibm-Granite](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)
[11/2024] Finite adapters support allows mixed adapter usage for PEFT models.
[11/2024] Speculative decoding TLM: a QEFFAutoModelForCausalLM model can be compiled to return more than one logit per decode step for the target language model (TLM).
[11/2024] Added support for Meta-Llama-3.3-70B-Instruct, Meta-Llama-3.2-1B and Meta-Llama-3.2-3B
[09/2024] Now we support PEFT models
[01/2025] Added support for [Ibm-Granite-Guardian](https://huggingface.co/ibm-granite/granite-guardian-3.1-8b)
[09/2024] Added support for Gemma-2-Family
[09/2024] Added support for CodeGemma-Family
[09/2024] Added support for Gemma-Family
[09/2024] Added support for Meta-Llama-3.1-8B
[09/2024] Added support for Meta-Llama-3.1-8B-Instruct
[09/2024] Added support for Meta-Llama-3.1-70B-Instruct
[09/2024] Added support for granite-20b-code-base
[09/2024] Added support for granite-20b-code-instruct-8k
[09/2024] Added support for Starcoder1-15B
[08/2024] Added support for inference optimization technique continuous batching
[08/2024] Added support for Jais-adapted-70b
[08/2024] Added support for Jais-adapted-13b-chat
[08/2024] Added support for Jais-adapted-7b
[06/2024] Added support for GPT-J-6B
[06/2024] Added support for Qwen2-1.5B-Instruct
[06/2024] Added support for StarCoder2-15B
[06/2024] Added support for Phi3-Mini-4K-Instruct
[06/2024] Added support for Codestral-22B-v0.1
[06/2024] Added support for Vicuna-v1.5
[05/2024] Added support for Mixtral-8x7B & Mistral-7B-Instruct-v0.1.
[04/2024] Initial release of efficient transformers for seamless inference on pre-trained LLMs.
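Several items above mention speculative decoding (SpD), in which a small draft model proposes a few tokens ahead and the target (base) model verifies them in a single pass. The real implementation runs on Cloud AI 100 hardware; the greedy verification step can be sketched in plain Python, with toy callables standing in for the draft and target models:

```python
def speculative_verify(draft_tokens, target_next_token):
    """Greedy speculative-decoding verification (toy sketch).

    draft_tokens: tokens proposed by the draft model for the next positions
    target_next_token: callable(prefix) -> the target model's greedy token
                       for the position following `prefix`

    Returns the accepted tokens: the longest prefix of draft_tokens the
    target model agrees with, plus one token produced by the target itself
    (either the correction at the first mismatch, or a bonus token when
    every draft token is accepted).
    """
    accepted = []
    prefix = []
    for t in draft_tokens:
        expected = target_next_token(prefix)
        if expected == t:
            accepted.append(t)
            prefix = prefix + [t]
        else:
            # First disagreement: take the target's token and stop.
            accepted.append(expected)
            return accepted
    # All draft tokens accepted; target contributes one bonus token.
    accepted.append(target_next_token(prefix))
    return accepted

# Toy target model: always continues the sequence 1, 2, 3, 4, ...
target = lambda prefix: len(prefix) + 1
print(speculative_verify([1, 2, 9], target))  # -> [1, 2, 3]: third draft token rejected
print(speculative_verify([1, 2, 3], target))  # -> [1, 2, 3, 4]: all accepted plus a bonus token
```

The key property is that every call yields at least one token, and up to `len(draft_tokens) + 1` tokens, per target-model pass.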
This library provides reimplemented blocks of LLMs that make the models functional and highly performant on Qualcomm Cloud AI 100. Several models can be transformed directly from their pre-trained original form into a deployment-ready optimized form. For other models, comprehensive documentation and How-To guides describe the changes needed.
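Continuous batching, announced above as an inference optimization, is one example of what such reimplementation enables: instead of waiting for an entire static batch to drain, the scheduler admits a new request the moment a batch slot frees up. A toy scheduler in plain Python (no Cloud AI 100 specifics) shows why short requests no longer stall behind long ones:

```python
from collections import deque

def continuous_batching(requests, batch_size):
    """Toy continuous-batching scheduler.

    requests: list of (request_id, num_decode_steps) pairs
    Returns one set of request ids per decode step, showing which
    requests were batched together in that step.
    """
    pending = deque(requests)
    active = {}  # request_id -> remaining decode steps
    steps = []
    while pending or active:
        # Admit new requests into free batch slots (continuous batching).
        while pending and len(active) < batch_size:
            rid, n = pending.popleft()
            active[rid] = n
        steps.append(set(active))
        # One decode step for every active request; retire finished ones.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
    return steps

# Two short requests share slots with one long request.
print(continuous_batching([("a", 1), ("b", 3), ("c", 1)], batch_size=2))
# -> [{'a', 'b'}, {'b', 'c'}, {'b'}]
```

With static batching the same workload would take four steps (three to drain the {a, b} batch, one more for c); admitting c into a's freed slot finishes everything in three.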
Typically for LLMs, the library provides model transformations and an exporter that produces an optimized ONNX graph. It is mandatory for each Pull Request to include tests that exercise the change.
# Create Python virtual env and activate it (Python 3.10 recommended).
sudo apt install python3.10-venv
python3.10 -m venv qeff_env
source qeff_env/bin/activate
pip install -U pip

# Clone and install the QEfficient repo.
pip install git+https://github.com/quic/efficient-transformers

# Or build the wheel package using the commands below.
pip install build wheel
python -m build --wheel --outdir dist
pip install dist/qefficient-0.0.1.dev0-py3-none-any.whl
For more details about using QEfficient via the Cloud AI 100 Apps SDK, visit the Linux Installation Guide.
Note: More details are here: https://quic.github.io/cloud-ai-sdk-pages/latest/Getting-Started/Model-Architecture-Support/Large-Language-Models/llm/
Thanks to:
If you run into any problems with the code, please file GitHub issues directly against this repo.
This project welcomes contributions and suggestions. Please check the License. Integration with a CLA Bot is underway.