Local LLM-assisted text completion.
Features:
- Auto-suggest on cursor movement in `Insert` mode
- Toggle the suggestion manually by pressing `Ctrl+F`
- Accept a suggestion with `Tab`
- Accept the first line of a suggestion with `Shift+Tab`
vim-plug

```vim
Plug 'ggml-org/llama.vim'
```
Vundle

```sh
cd ~/.vim/bundle
git clone https://github.com/ggml-org/llama.vim
```

Then add `Plugin 'llama.vim'` to your .vimrc in the `vundle#begin()` section.
lazy.nvim

```lua
{
    'ggml-org/llama.vim',
}
```
You can customize `llama.vim` by setting the `g:llama_config` variable.
Examples:
Disable the inline info:

```vim
" put before llama.vim loads
let g:llama_config = { 'show_info': 0 }
```
Same thing, but setting the option directly:

```vim
let g:llama_config.show_info = v:false
```
Disable auto FIM (Fill-In-the-Middle) completion with lazy.nvim:

```lua
{
    'ggml-org/llama.vim',
    init = function()
        vim.g.llama_config = {
            auto_fim = false,
        }
    end,
}
```
Changing the keymap that accepts the full suggestion:

```vim
let g:llama_config.keymap_accept_full = "<C-S>"
```
Please refer to `:help llama_config` or the source for the full list of options.
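The individual settings from the examples above can also be combined into a single dictionary. A minimal sketch using only the options shown in this section (options left out should keep their defaults; check `:help llama_config` for the authoritative list):

```vim
" put before llama.vim loads
let g:llama_config = {
    \ 'show_info':          0,
    \ 'auto_fim':           v:false,
    \ 'keymap_accept_full': "<C-S>",
    \ }
```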
The plugin requires a llama.cpp server instance to be running at `g:llama_config.endpoint`.
Either build from source or use the latest binaries: https://github.com/ggml-org/llama.cpp/releases
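If the server does not run at the default address, point the plugin to it explicitly. A sketch, assuming a local `llama-server` listening on port 8012 (adjust the host, port and path to match your setup):

```vim
" must match the --host/--port that llama-server was started with
let g:llama_config.endpoint = 'http://127.0.0.1:8012/infill'
```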
Here are the recommended settings, depending on the amount of VRAM that you have:

- More than 16GB VRAM:

  ```sh
  llama-server --fim-qwen-7b-default
  ```

- Less than 16GB VRAM:

  ```sh
  llama-server --fim-qwen-3b-default
  ```

- Less than 8GB VRAM:

  ```sh
  llama-server --fim-qwen-1.5b-default
  ```
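On slower machines it can also help to bound how long a single completion may take. A sketch using the time-budget options (`t_max_prompt_ms`, `t_max_predict_ms`); consult `:help llama_config` for the exact option names and their defaults:

```vim
" upper bounds, in milliseconds, for prompt processing and for generating the suggestion
let g:llama_config.t_max_prompt_ms  = 500
let g:llama_config.t_max_predict_ms = 1000
```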
Use `:help llama` for more details.
The plugin requires FIM-compatible models: HF collection
Using `llama.vim` on M1 Pro (2021) with `Qwen2.5-Coder 1.5B Q8_0`:
The orange text is the generated suggestion. The green text contains performance stats for the FIM request: the currently used context is `15186` tokens and the maximum is `32768`. There are `30` chunks in the ring buffer with extra context (out of `64`). So far, `1` chunk has been evicted in the current session and there are `0` chunks in queue. The newly computed prompt tokens for this request were `260` and the generated tokens were `24`. It took `1245 ms` to generate this suggestion after entering the letter `c` on the current line.
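The size of that extra-context ring buffer is configurable. A sketch, assuming the `ring_n_chunks` and `ring_chunk_size` options listed in `:help llama_config` (the values here are illustrative):

```vim
" keep up to 64 chunks of extra context, each up to 64 lines long
let g:llama_config.ring_n_chunks   = 64
let g:llama_config.ring_chunk_size = 64
```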
Using `llama.vim` on M2 Ultra with `Qwen2.5-Coder 7B Q8_0`:

llama.vim-0-lq.mp4
Demonstrates that the global context is accumulated and maintained across different files and showcases the overall latency when working in a large codebase.
Another example on a small Swift codebase.

The plugin aims to be very simple and lightweight and, at the same time, to provide high-quality and performant local FIM completions, even on consumer-grade hardware. Read more on how this is achieved in the following links: