Local AI assistant in vim
Local AI assistant in vim with codellama
The Problem
I have been using Bryley’s neoai.vim in combination with OpenAI’s gpt API for a while now, but mainly for playing around and creating regexes, to be honest..
I do most of my coding for my employer, and they are understandably sceptical when it comes to pushing all of their sensitive code and content to a black-box cloud service. Which means that gpt, copilot et al. are forbidden for sensitive usecases. Which I think is understandable. Additionally, even for personal projects, I don’t like the idea of not being in control over what is sent where, when and what’s going to happen with what is being sent.
Possible solution
There are large language models that can be executed/operated locally! However, it’s not as easy as visiting a website or “buying” an API key. And glueing it all together always looked a bit challenging.
I wanted to play around with llama for a long while now, and today finally was the day!
Solution
I created an integrated configuration/solution which
- Uses nvim + a fork of neoai.vim
- Starts a containerized instance of llama.cpp with an OpenAI compatible API
- Uses the codellama language model
- Sets everything up automatically
Automation
I have been using a dotfiles repository for a long time now, making use of tuning for the setup automation. Everything is in the code and should bootstrap automagically.
Downloading the model
The model is downloaded by running tuning
, but using a little
script.
Running the API Service
The llama API service runs in a podman
container, orchestrated and managed by systemd
.
The quadlet configuration is part of the repo
With the configuration and model in place, the API can be brought up with a simple
$ systemctl --user start llama
After that, it listens at http://localhost:9741
However, the nvim integration starts it automatically, if needed.
nvim integration
There is a pretty great nvim plugin for the OpenAI API:
neoai.nvim. I created a fork, which
allows it to use an unauthenticated API on a configurable url, e.g.
http://localhost:9741/v1/chat/completions
. The first attempt was not nice and
just added a hard-coded bit of lua to another hardcoded bit of lua, but now
it’s configurable, the merge request is open.
Configuration and installation happens in the neoai.lua configuration file, with automated systemd start of the API server llama.
local utils = require("utils")
local function startLlama()
os.execute('systemctl --user start llama')
end
local keymap_opts = { expr = true }
return {
{
'gierdo/neoai.nvim',
branch = 'local-llama',
cmd = {
"NeoAI",
"NeoAIOpen",
"NeoAIClose",
"NeoAIToggle",
"NeoAIContext",
"NeoAIContextOpen",
"NeoAIContextClose",
"NeoAIInject",
"NeoAIInjectCode",
"NeoAIInjectContext",
"NeoAIInjectContextCode",
},
keys = {
{
mode = "n",
'<A-a>',
':NeoAI<CR>',
keymap_opts
},
{
mode = "n",
'<A-i>',
':NeoAIInject<Space>',
keymap_opts
},
{
mode = "v",
'<A-a>',
':NeoAIContext<CR>',
keymap_opts
},
{
mode = "v",
'<A-i>',
':NeoAIInjectContext<Space>',
keymap_opts
}
},
config = function()
startLlama()
require("neoai").setup({
ui = {
output_popup_text = "AI",
input_popup_text = "Prompt",
width = 45, -- As percentage eg. 45%
output_popup_height = 80, -- As percentage eg. 80%
submit = "<Enter>", -- Key binding to submit the prompt
},
models = {
{
name = "openai",
model = "codellama",
params = nil,
},
},
register_output = {
["a"] = function(output)
return output
end,
["c"] = require("neoai.utils").extract_code_snippets,
},
inject = {
cutoff_width = 75,
},
prompts = {
context_prompt = function(context)
return "I'd like to provide some context for future "
.. "messages. Here is what I want to refer "
.. "to in our upcoming conversations:\n\n"
.. context
end,
},
mappings = {
["select_up"] = "<C-k>",
["select_down"] = "<C-j>",
},
open_ai = {
url = "http://localhost:9741/v1/chat/completions",
display_name = "llama.cpp",
api_key = {
env = "OPENAI_API_KEY",
value = nil,
-- `get` is is a function that retrieves an API key, can be used to override the default method.
-- get = function() ... end
-- Here is some code for a function that retrieves an API key. You can use it with
-- the Linux 'pass' application.
-- get = function()
-- local key = vim.fn.system("pass show openai/mytestkey")
-- key = string.gsub(key, "\n", "")
-- return key
-- end,
},
},
shortcuts = {
-- {
-- name = "textify",
-- key = "<leader>as",
-- desc = "fix text with AI",
-- use_context = true,
-- prompt = [[
-- Please rewrite the text to make it more readable, clear,
-- concise, and fix any grammatical, punctuation, or spelling
-- errors
-- ]],
-- modes = { "v" },
-- strip_function = nil,
-- },
-- {
-- name = "gitcommit",
-- key = "<leader>ag",
-- desc = "generate git commit message",
-- use_context = false,
-- prompt = function()
-- return [[
-- Using the following git diff generate a consise and
-- clear git commit message, with a short title summary
-- that is 75 characters or less:
-- ]] .. vim.fn.system("git diff --cached")
-- end,
-- modes = { "n" },
-- strip_function = nil,
-- },
},
})
end,
dependencies = {
'MunifTanjim/nui.nvim',
}
},
'MunifTanjim/nui.nvim',
}
Limitations
The proposed solution only uses CPU vectorization and should (TM) be portable. I don’t have a machine with a GPU capable of doing shiny Cuda/hipBLAS/ROCm acceleration. This means that the performance is OK(ish) for one single parallel user. Which is exactly what the setup was intended for from the start. Be aware though that the running model needs 9 GiB of memory while idling and will make use of pretty much all of your CPU while working. That’s also why I set it up to only start when it’s really used.
If you’re done using it and you want to free up the resources again, shut it down with
$ systemctl --user stop llama
If your setup has a ROCm supported AMD graphics card or a CUDA supported Nvidia
card, you will probably want to spend some time in getting llama.cpp
to make
use of it for a significant performance increase, at the price of removed
portability and added maintenance.
Old versions of podman + systemd
If you have an older version of podman
, prior to a version with quadlet
support, the systemd unit for the llama service has to be created differently:
$ podman run -d --name llama -v ~/.local/lib/llama/models/:/models -e MODEL=/models/codellama-13b-instruct.Q4_K_M.gguf -p 9741:8000 ghcr.io/gierdo/dotfiles/llama-cpp-python-server:0.2.19
$ podman generate systemd llama > ~/.config/systemd/user/llama.service
Leave a comment