Showing content from https://github.com/intel-analytics/ipex-llm below:

intel/ipex-llm: Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

💫 Intel® LLM Library for PyTorch*

< English | 中文 >

IPEX-LLM is an LLM acceleration library for Intel GPU (e.g., a local PC with iGPU, or a discrete GPU such as Arc, Flex and Max), NPU and CPU [1].
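As a minimal illustration of the library's HuggingFace-`transformers`-style API (following the pattern in the ipex-llm GPU examples; the model id below is a placeholder, and running it requires an Intel GPU plus the `ipex-llm[xpu]` package):

```python
# Sketch of ipex-llm usage, assuming an Intel GPU ("xpu" device) is available.
import torch
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id

# load_in_4bit=True applies ipex-llm's low-bit (sym_int4) optimization on load
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
)
model = model.to("xpu")  # move the optimized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
with torch.inference_mode():
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The only change from plain `transformers` code is the import source and the `load_in_4bit` flag; the rest of the pipeline (tokenizer, `generate`) is standard.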


See demos of running local LLMs on Intel Core Ultra iGPU, Intel Core Ultra NPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.

See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below [1] (and refer to [2][3][4] for more details).

You may follow the Benchmarking Guide to run ipex-llm performance benchmark yourself.
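Benchmarks of this kind typically report first-token latency (dominated by prefill) separately from the steady-state decode rate. A small, library-independent sketch of that measurement, using a stand-in generator function in place of a real model step:

```python
import time

def measure_generation(generate_token, n_tokens):
    """Time first-token latency and steady-state decode throughput.

    `generate_token` is a stand-in for one decoding step of a real model.
    """
    t0 = time.perf_counter()
    generate_token()                      # prefill + first token
    first_token_latency = time.perf_counter() - t0

    t1 = time.perf_counter()
    for _ in range(n_tokens - 1):         # remaining decode steps
        generate_token()
    decode_time = time.perf_counter() - t1
    tokens_per_s = (n_tokens - 1) / decode_time
    return first_token_latency, tokens_per_s

# Example with a dummy 1 ms "model" step
first, tps = measure_generation(lambda: time.sleep(0.001), 32)
print(f"1st token: {first * 1000:.1f} ms, decode: {tps:.0f} tok/s")
```

Reporting the two numbers separately matters because a long prompt inflates the first step without affecting per-token decode speed.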

Please see the Perplexity result below (tested on Wikitext dataset using the script here).

| Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
|---|---|---|---|---|---|---|
| Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
| Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
| Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
| Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
| Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
| gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
| Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
| Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
| Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
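The low-bit formats in the table trade a small perplexity increase for large memory and bandwidth savings; `sym_int4` denotes symmetric 4-bit quantization. A self-contained toy sketch of the idea, assuming a single per-tensor scale (real low-bit kernels quantize per small block and keep activations in higher precision):

```python
# Toy symmetric int4 quantization: map floats to integers in [-7, 7]
# with one scale factor (real implementations use per-group scales).

def quantize_sym_int4(weights):
    scale = max(abs(w) for w in weights) / 7.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -1.4, 0.0, 0.77]
q, scale = quantize_sym_int4(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)                      # [1, -3, 5, -7, 0, 4]
print(max_err <= scale / 2)   # True: rounding error bounded by half a step
```

Each weight costs 4 bits plus a shared scale, roughly a 4x reduction versus fp16, which is why the perplexity gap to fp16 in the table stays small but nonzero.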

Over 70 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.

| Model | CPU Example | GPU Example | NPU Example |
|---|---|---|---|
| LLaMA | link1, link2 | link | |
| LLaMA 2 | link1, link2 | link | Python link, C++ link |
| LLaMA 3 | link | link | Python link, C++ link |
| LLaMA 3.1 | link | link | |
| LLaMA 3.2 | | link | Python link, C++ link |
| LLaMA 3.2-Vision | | link | |
| ChatGLM | link | | |
| ChatGLM2 | link | link | |
| ChatGLM3 | link | link | |
| GLM-4 | link | link | |
| GLM-4V | link | link | |
| GLM-Edge | | link | Python link |
| GLM-Edge-V | | link | |
| Mistral | link | link | |
| Mixtral | link | link | |
| Falcon | link | link | |
| MPT | link | link | |
| Dolly-v1 | link | link | |
| Dolly-v2 | link | link | |
| Replit Code | link | link | |
| RedPajama | link1, link2 | | |
| Phoenix | link1, link2 | | |
| StarCoder | link1, link2 | link | |
| Baichuan | link | link | |
| Baichuan2 | link | link | Python link |
| InternLM | link | link | |
| InternVL2 | | link | |
| Qwen | link | link | |
| Qwen1.5 | link | link | |
| Qwen2 | link | link | Python link, C++ link |
| Qwen2.5 | | link | Python link, C++ link |
| Qwen-VL | link | link | |
| Qwen2-VL | | link | |
| Qwen2-Audio | | link | |
| Aquila | link | link | |
| Aquila2 | link | link | |
| MOSS | link | | |
| Whisper | link | link | |
| Phi-1_5 | link | link | |
| Flan-t5 | link | link | |
| LLaVA | link | link | |
| CodeLlama | link | link | |
| Skywork | link | | |
| InternLM-XComposer | link | | |
| WizardCoder-Python | link | | |
| CodeShell | link | | |
| Fuyu | link | | |
| Distil-Whisper | link | link | |
| Yi | link | link | |
| BlueLM | link | link | |
| Mamba | link | link | |
| SOLAR | link | link | |
| Phixtral | link | link | |
| InternLM2 | link | link | |
| RWKV4 | | link | |
| RWKV5 | | link | |
| Bark | link | link | |
| SpeechT5 | | link | |
| DeepSeek-MoE | link | | |
| Ziya-Coding-34B-v1.0 | link | | |
| Phi-2 | link | link | |
| Phi-3 | link | link | |
| Phi-3-vision | link | link | |
| Yuan2 | link | link | |
| Gemma | link | link | |
| Gemma2 | | link | |
| DeciLM-7B | link | link | |
| Deepseek | link | link | |
| StableLM | link | link | |
| CodeGemma | link | link | |
| Command-R/cohere | link | link | |
| CodeGeeX2 | link | link | |
| MiniCPM | link | link | Python link, C++ link |
| MiniCPM3 | | link | |
| MiniCPM-V | | link | |
| MiniCPM-V-2 | link | link | |
| MiniCPM-Llama3-V-2_5 | | link | Python link |
| MiniCPM-V-2_6 | link | link | Python link |
| MiniCPM-o-2_6 | | link | |
| Janus-Pro | | link | |
| Moonlight | | link | |
| StableDiffusion | | link | |
| Bce-Embedding-Base-V1 | | | Python link |
| Speech_Paraformer-Large | | | Python link |
[1] Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.

