LLamaSharp is a cross-platform library for running 🦙LLaMA/LLaVA models (and others) on your local device. Based on llama.cpp, LLamaSharp runs inference efficiently on both CPU and GPU. With its higher-level APIs and RAG support, it's convenient to deploy LLMs (Large Language Models) in your application with LLamaSharp.
Please star the repo to show your support for this project!🤗
There are integrations for the following libraries, making it easier to develop your app. Integrations for semantic-kernel and kernel-memory are developed in the LLamaSharp repository, while others are developed in their own repositories.
The following examples show how to build apps with LLamaSharp.
To achieve high performance, LLamaSharp interacts with native libraries compiled from C++; these are called backends. We provide backend packages for Windows, Linux and Mac with CPU, CUDA, Metal and Vulkan support. You don't need to compile any C++ code, just install the backend packages.
If no published backend matches your device, please open an issue to let us know. If compiling C++ code is not difficult for you, you can also follow this guide to compile a backend yourself and run LLamaSharp with it.
Install the LLamaSharp package from NuGet:

```
PM> Install-Package LLamaSharp
```
Install one or more of these backends, or use a self-compiled backend.
- LLamaSharp.Backend.Cpu: Pure CPU for Windows, Linux & Mac. Metal (GPU) support for Mac.
- LLamaSharp.Backend.Cuda11: CUDA 11 for Windows & Linux.
- LLamaSharp.Backend.Cuda12: CUDA 12 for Windows & Linux.
- LLamaSharp.Backend.Vulkan: Vulkan for Windows & Linux.

(optional) For Microsoft semantic-kernel integration, install the LLamaSharp.semantic-kernel package.

(optional) To enable RAG support, install the LLamaSharp.kernel-memory package (this package currently only supports net6.0 or higher), which is based on the Microsoft kernel-memory integration.
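For example, a minimal CPU-only setup installs just the main package and the CPU backend listed above (swap in the CUDA or Vulkan backend package if your device supports it):

```
PM> Install-Package LLamaSharp
PM> Install-Package LLamaSharp.Backend.Cpu
```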
There are two popular formats of model files for LLMs: the PyTorch format (.pth) and the Huggingface format (.bin). LLamaSharp uses files in the GGUF format, which can be converted from those two formats. There are two ways to get a GGUF file:
1. Search the model name + 'gguf' on Huggingface; you will find lots of model files that have already been converted to GGUF format. Please take note of their publishing time, because some older files may only work with older versions of LLamaSharp.
2. Convert the PyTorch or Huggingface format to GGUF format yourself. Please follow the instructions in this part of the llama.cpp readme to convert them with the python scripts.
Generally, we recommend downloading quantized models rather than fp16 models, because quantization significantly reduces the required memory while only slightly impacting generation quality.
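If you already have an fp16 GGUF file, you can also quantize it locally with LLamaSharp. The sketch below is an assumption-based illustration: it assumes a `LLamaQuantizer.Quantize(source, destination, quantizationType)` overload and the "q4_0" type string, so please check the API of the LLamaSharp version you use.

```cs
using System;
using LLama;

// Sketch (assumed API): quantize an fp16 GGUF file to 4-bit.
string srcPath = "model-f16.gguf";  // hypothetical input path
string dstPath = "model-q4_0.gguf"; // hypothetical output path

bool ok = LLamaQuantizer.Quantize(srcPath, dstPath, "q4_0");
Console.WriteLine(ok ? "Quantization finished." : "Quantization failed.");
```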
## Example of LLaMA chat session

Here is a simple example of chatting with a bot based on an LLM in LLamaSharp. Please replace the model path with your own.
```cs
using LLama;
using LLama.Common;
using LLama.Sampling;

string modelPath = @"<Your Model Path>"; // change it to your own model path.

var parameters = new ModelParams(modelPath)
{
    ContextSize = 1024, // The longest length of chat as memory.
    GpuLayerCount = 5   // How many layers to offload to GPU. Please adjust it according to your GPU memory.
};
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Add chat histories as prompt to tell AI how to act.
var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.");
chatHistory.AddMessage(AuthorRole.User, "Hello, Bob.");
chatHistory.AddMessage(AuthorRole.Assistant, "Hello. How may I help you today?");

ChatSession session = new(executor, chatHistory);

InferenceParams inferenceParams = new InferenceParams()
{
    MaxTokens = 256, // No more than 256 tokens should appear in answer. Remove it if antiprompt is enough for control.
    AntiPrompts = new List<string> { "User:" }, // Stop generation once antiprompts appear.
    SamplingPipeline = new DefaultSamplingPipeline(),
};

Console.ForegroundColor = ConsoleColor.Yellow;
Console.Write("The chat session has started.\nUser: ");
Console.ForegroundColor = ConsoleColor.Green;
string userInput = Console.ReadLine() ?? "";

while (userInput != "exit")
{
    await foreach ( // Generate the response streamingly.
        var text
        in session.ChatAsync(
            new ChatHistory.Message(AuthorRole.User, userInput),
            inferenceParams))
    {
        Console.ForegroundColor = ConsoleColor.White;
        Console.Write(text);
    }
    Console.ForegroundColor = ConsoleColor.Green;
    userInput = Console.ReadLine() ?? "";
}
```
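Besides ChatSession, the executors can also be used directly for one-shot completions. Below is a minimal sketch using StatelessExecutor; it reuses the types from the example above, but treat it as an illustration to adapt to your LLamaSharp version rather than canonical API usage.

```cs
using LLama;
using LLama.Common;

string modelPath = @"<Your Model Path>"; // change it to your own model path.

var parameters = new ModelParams(modelPath) { ContextSize = 1024 };
using var model = LLamaWeights.LoadFromFile(parameters);

// StatelessExecutor keeps no chat history between calls.
var executor = new StatelessExecutor(model, parameters);
var inferenceParams = new InferenceParams { MaxTokens = 64 };

await foreach (var text in executor.InferAsync("Q: What is the capital of France? A:", inferenceParams))
{
    Console.Write(text);
}
```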
For more examples, please refer to LLamaSharp.Examples.
#### Why is my GPU not used when I have installed CUDA?

Please check that `GpuLayerCount > 0` when loading the model weights. You can also register a log callback to see which native library file is actually loaded:

```cs
NativeLibraryConfig.All.WithLogCallback(delegate (LLamaLogLevel level, string message)
{
    Console.Write($"{level}: {message}");
});
```

#### Why is the inference so slow?
Firstly, due to their large size, LLMs require more time to generate output than other kinds of models, especially when you are using models larger than 30B parameters.
To see whether it's a LLamaSharp performance issue, please make sure `GpuLayerCount` is set as large as possible when you use a GPU backend.

#### Why does the program crash before generating any output?

Generally, there are two possible cases for this problem: the backend (native library) you installed does not match your LLamaSharp version, or the model file is not compatible with the backend. Please check the version map below.
#### Why is my model generating output infinitely?

Please set an anti-prompt or a max token count when executing the inference, as in the sketch below.
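For example, reusing the parameter names from the chat example above:

```cs
using System.Collections.Generic;
using LLama.Common;

var inferenceParams = new InferenceParams
{
    MaxTokens = 256,                            // hard cap on the number of generated tokens
    AntiPrompts = new List<string> { "User:" }  // stop generation as soon as this string appears
};
```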
All contributions are welcome! There's a TODO list in the LLamaSharp Dev Project, and you can pick an interesting item to start with. Please read the contributing guide for more information.
You can also do one of the following to help us make LLamaSharp better:
- Join our chat on Discord (please contact Rinne to join the dev channel if you want to be a contributor).
- Join our QQ group.
## Map of LLamaSharp and llama.cpp versions

If you want to compile llama.cpp yourself, you must use the exact commit ID listed for each version.
| LLamaSharp | Verified Model Resources | llama.cpp commit id |
| --- | --- | --- |
| v0.2.0 | This version is not recommended to use. | - |
| v0.2.1 | WizardLM, Vicuna (filenames with "old") | - |
| v0.2.2, v0.2.3 | WizardLM, Vicuna (filenames without "old") | 63d2046 |
| v0.3.0, v0.4.0 | LLamaSharpSamples v0.3.0, WizardLM | 7e4ea5b |
| v0.4.1-preview | Open llama 3b, Open Buddy | aacdbd4 |
| v0.4.2-preview | Llama2 7B (GGML) | 3323112 |
| v0.5.1 | Llama2 7B (GGUF) | 6b73ef1 |
| v0.6.0 | | cb33f43 |
| v0.7.0, v0.8.0 | Thespis-13B, LLaMA2-7B | 207b519 |
| v0.8.1 | | e937066 |
| v0.9.0, v0.9.1 | Mixtral-8x7B | 9fb13f9 |
| v0.10.0 | Phi2 | d71ac90 |
| v0.11.1, v0.11.2 | LLaVA-v1.5, Phi2 | 3ab8b3a |
| v0.12.0 | LLama3 | a743d76 |
| v0.13.0 | | 1debe72 |
| v0.14.0 | Gemma2 | 36864569 |
| v0.15.0 | LLama3.1 | 345c8c0c |
| v0.16.0 | | 11b84eb4 |
| v0.17.0 | | c35e586e |
| v0.18.0 | | c35e586e |
| v0.19.0 | | 958367bf |
| v0.20.0 | | 0827b2c1 |
| v0.21.0 | DeepSeek R1 | 5783575c |
| v0.22.0, v0.23.0 | Gemma3 | be7c3034 |
| v0.24.0 | Qwen3 | ceda28ef |
| v0.25.0 | | 11dd5a44eb180e1d69fac24d3852b5222d66fb7f |
This project is licensed under the terms of the MIT license.