< English | 中文 >
IPEX-LLM is an LLM acceleration library for Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max), NPU and CPU [1].
Note

- IPEX-LLM provides seamless integration with llama.cpp, Ollama, vLLM, HuggingFace transformers, LangChain, LlamaIndex, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc.
- Over 70 models have been optimized/verified on ipex-llm (e.g., Llama, Phi, Mistral, Mixtral, DeepSeek, Qwen, ChatGLM, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support; see the complete list here.
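As a quick illustration of the HuggingFace transformers integration and low-bit support mentioned above, here is a minimal sketch that loads a model with INT4 weights and generates on an Intel GPU; the model id, prompt and generation settings are just examples.

```python
# Minimal sketch: load a model with INT4 weights through ipex-llm's
# transformers-style API and generate on an Intel GPU ("xpu").
# The model id and prompt are illustrative only.
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"   # example model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_4bit=True quantizes weights to INT4 while loading; other low-bit
# formats (e.g., "fp8", "fp6", "sym_int8") can be selected via load_in_low_bit.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model = model.to("xpu")                      # or keep it on the CPU

inputs = tokenizer("What is Intel Arc?", return_tensors="pt").to("xpu")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```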
Latest updates:

- ipex-llm 2.2.0 is now available, which includes Ollama Portable Zip and llama.cpp Portable Zip.
- You can now run ipex-llm on Intel Arc B580 GPU.
- ipex-llm now supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here.
- You can now run ipex-llm inference, serving and finetuning using the Docker images.
- You can now install ipex-llm on Windows using just "one command".
- You can now run llama.cpp and ollama with ipex-llm; see the quickstart here.
- ipex-llm now supports Llama 3 on both Intel GPU and CPU.
- ipex-llm now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
- ipex-llm now supports directly loading models from ModelScope (魔搭); a loading sketch appears after this list.
- ipex-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB VRAM.
- You can now use ipex-llm through the Text-Generation-WebUI GUI.
- ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU, respectively.
- ipex-llm now supports a comprehensive set of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
- ipex-llm now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
- ipex-llm now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- ipex-llm now supports FP8 and FP4 inference on Intel GPU.
- ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
- ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
- ipex-llm now supports FastChat serving on both Intel CPU and GPU.
- ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and MAX).
- The ipex-llm tutorial is released.

See demos of running local LLMs on Intel Core Ultra iGPU, Intel Core Ultra NPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.
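Before the demos and benchmarks, here is the ModelScope loading sketch referenced in the update list above. The model_hub="modelscope" keyword and the Qwen model id are assumptions made for illustration; check the ipex-llm documentation for the exact argument names.

```python
# Hedged sketch: load a model directly from ModelScope (魔搭) rather than the
# HuggingFace Hub. The model_hub="modelscope" keyword and the model id below
# are assumptions for illustration; verify them against the ipex-llm docs.
from modelscope import AutoTokenizer          # tokenizer fetched via ModelScope
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "qwen/Qwen1.5-7B-Chat"             # hypothetical ModelScope model id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,                        # quantize to INT4 while loading
    trust_remote_code=True,
    model_hub="modelscope",                   # assumed switch to the ModelScope hub
)
model = model.to("xpu")
```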
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below [1] (and refer to [2][3][4] for more details). You may follow the Benchmarking Guide to run the ipex-llm performance benchmark yourself.
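The Benchmarking Guide covers the full methodology; as a rough, unofficial alternative, a simple timing loop around model.generate gives a ballpark tokens-per-second figure. The model id, prompt and token counts below are arbitrary choices.

```python
# Rough, unofficial timing sketch (not the ipex-llm benchmark scripts):
# measures end-to-end generation throughput for one prompt on an Intel GPU.
import time
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"    # arbitrary example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True).to("xpu")

inputs = tokenizer("Once upon a time", return_tensors="pt").to("xpu")

with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=32)        # warm-up run
    if hasattr(torch, "xpu"):
        torch.xpu.synchronize()                        # flush queued GPU work
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=128)
    if hasattr(torch, "xpu"):
        torch.xpu.synchronize()
    elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f} s -> {new_tokens / elapsed:.1f} tok/s")
```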
Please see the Perplexity results below (tested on the Wikitext dataset using the script here).

| Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
|---|---|---|---|---|---|---|
| Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
| Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
| Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
| Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
| Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
| gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
| Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
| Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
| Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
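The numbers above come from the linked script; purely as an illustration of the evaluation itself, a minimal sliding-window perplexity sketch over Wikitext-2 might look like the following (the window and stride sizes, model id and sym_int4 precision are arbitrary assumptions).

```python
# Minimal perplexity sketch (not the official ipex-llm script): computes
# sliding-window perplexity of a low-bit model on the Wikitext-2 test split.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"    # example model from the table
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_low_bit="sym_int4")
model = model.to("xpu").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

max_len, stride = 2048, 512                   # assumed window/stride sizes
nlls, prev_end = [], 0
for begin in range(0, ids.size(1), stride):
    end = min(begin + max_len, ids.size(1))
    trg_len = end - prev_end                  # only score tokens not seen before
    input_ids = ids[:, begin:end].to("xpu")
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100           # mask the overlapping context
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)
    prev_end = end
    if end == ids.size(1):
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```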
Quickstart guides are available for the following, among others:

- Running ipex-llm on Intel Arc B580 GPU for Ollama, llama.cpp, PyTorch, HuggingFace, etc.
- Running ipex-llm on Intel NPU using either the Python/C++ API or the llama.cpp API.
- Running ipex-llm on Intel GPU for Windows and Linux.
- Running ipex-llm in vLLM on both Intel GPU and CPU.
- Running ipex-llm in FastChat serving on both Intel GPU and CPU.
- Running ipex-llm serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI.
- Running ipex-llm in the oobabooga WebUI.
- Running ipex-llm in Axolotl for LLM finetuning.
- Running ipex-llm on Intel CPU and GPU.
- Running llama.cpp, ollama, etc., with ipex-llm on Intel GPU.
- Running transformers, LangChain, LlamaIndex, ModelScope, etc. with ipex-llm on Intel GPU.
- vLLM serving with ipex-llm on Intel GPU.
- vLLM serving with ipex-llm on Intel CPU.
- FastChat serving with ipex-llm on Intel GPU.
- Running ipex-llm applications in Python using VSCode on Intel GPU.
- Running GraphRAG using a local LLM with ipex-llm.
- Running RAGFlow (an open-source RAG engine) with ipex-llm.
- Running LangChain-Chatchat (Knowledge Base QA using a RAG pipeline) with ipex-llm.
- Running Continue (a coding copilot in VSCode) with ipex-llm.
- Running Open WebUI with ipex-llm.
- Using PrivateGPT to interact with documents with ipex-llm.
- Running ipex-llm in Dify (a production-ready LLM app development platform).
- Installing ipex-llm on Windows with Intel GPU.
- Installing ipex-llm on Linux with Intel GPU.
- Working with ipex-llm low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.); a save/load sketch follows this list.
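For the low-bit model examples above, here is a small sketch of saving an INT4-quantized model once and reloading the low-bit weights later; save_low_bit/load_low_bit follow ipex-llm's transformers-style API, but treat the exact calls as assumptions and verify them against the linked code examples.

```python
# Hedged sketch: quantize once, save the low-bit checkpoint, and later reload
# it directly (skipping re-quantization). save_low_bit/load_low_bit are used
# per ipex-llm's transformers-style API; verify the exact signatures against
# the linked code examples.
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"    # example model id
save_dir = "./llama2-7b-chat-int4"            # example output directory

# One-time step: quantize to INT4 while loading, then persist the result.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model.save_low_bit(save_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)

# Later runs: load the saved low-bit weights straight onto the Intel GPU.
model = AutoModelForCausalLM.load_low_bit(save_dir).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(save_dir)
```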
Over 70 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
[1] Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.