[Run Your Own] Running llama.cpp server on Guix

This is the start of what will (hopefully) become a series of posts about running different kinds of things on the Guix distro.

Idea

Despite all the ethical problems with LLMs (and especially with the companies providing them), the technology could be useful enough to at least give it a try.

Luckily, a couple of years ago I decided that I wanted to play some AAA games. I had a decent sum of money on hand, so I bought a beefy machine with an RTX 4090 GPU. I never actually played anything seriously, apart from a few hopeless crashes in Microsoft Flight Simulator to overcome my fear of flying, so most of the time the computer was gathering dust.

Until one not-so-sunny Berlin Saturday morning, when I finally decided to load it with local LLM inference.

How to run?

The question was answered pretty fast: Guix has llama.cpp packaged, and it has a server mode, which suits nicely. However, two problems appeared:

1. To use the RTX 4090 fully, I need to run the proprietary drivers.
2. There is no Guix system service for running llama.cpp as a server.

On drivers

Thanks to the Nonguix channel, running the proprietary Nvidia drivers is not a big problem. You just have to carefully follow the instructions in its README. These days I am running the nvda-595 driver and it is working fine (as fine as NVIDIA can on Linux).
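
For reference, enabling the channel usually comes down to a small channels.scm. The following is a minimal sketch, assuming the repository URL listed in the Nonguix README. The README also provides a channel introduction (a commit and an OpenPGP fingerprint) used for authentication. It is deliberately left out here and should be copied from the README itself.

```scheme
;; ~/.config/guix/channels.scm
;; Sketch only: add the channel introduction from the Nonguix README
;; to this channel record so that `guix pull` can authenticate it.
(cons* (channel
        (name 'nonguix)
        (url "https://gitlab.com/nonguix/nonguix"))
       %default-channels)
```

After a `guix pull`, the driver packages and services described in the README become available to the system configuration.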

On llama.cpp system service

This part is a little bit more complicated. We have to provide the correct Vulkan environment variables to make use of the proprietary Nvidia drivers in Guix. To keep things simple, I decided to pack all the complexity inside a Guix system service.

As this service uses proprietary software, there is no way it could be merged into Guix proper. But nobody can stop me from implementing it in my own channel.

So, here is llama-cpp-service-type. For now it only supports running GGUF models from HuggingFace via the Vulkan backend. The following configuration parameters are available:

(define-configuration/no-serialization llama-cpp-configuration
  (host
   (string "127.0.0.1")
   "Host to run server on.")
  (port
   (string "8080")
   "Port to run server on.")
  (huggingface-name
   (string)
   "Huggingface model to run.")
  (parameters
   (list '())
   "Model parameters.")
  (with-nvidia?
   (boolean #f)
   "Enable hack for running on Nvidia with Vulkan")
  (nvidia-driver-package
   (package nvidia-driver)
   "Nvidia driver package to use")
  (user
   (string "llama")
   "User to run as.")
  (group
   (string "llama")
   "Group to run as.")
  (requirements
   (list '())
   "List of additional service requirements."))
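
The record above only describes data. To be usable in an operating-system declaration it still has to be attached to a service type. The post does not show that part, so what follows is only a sketch of how such glue commonly looks in Guix, not the channel's actual code. The helper `llama-cpp-accounts`, the value of `%llama-home` and the description string are assumptions, and `llama-cpp-shepherd-service` is the procedure shown later in this post.

```scheme
(use-modules (gnu services)           ; service-type, service-extension
             (gnu services shepherd)  ; shepherd-root-service-type
             (gnu system shadow)      ; user-account, user-group, account-service-type
             (gnu packages admin)     ; the shadow package, for nologin
             (guix gexp)              ; file-append
             (guix records))          ; match-record

;; Assumption: the post uses %llama-home but never shows its value.
(define %llama-home "/var/lib/llama")

;; Hypothetical helper: create the system user and group named in the
;; configuration, with %llama-home as the home directory.
(define (llama-cpp-accounts config)
  (match-record config <llama-cpp-configuration> (user group)
    (list (user-group (name group) (system? #t))
          (user-account
            (name user)
            (group group)
            (system? #t)
            (comment "llama.cpp server")
            (home-directory %llama-home)
            (shell (file-append shadow "/sbin/nologin"))))))

;; The service type extends Shepherd with the server process and the
;; account service with the user and group above.
(define llama-cpp-service-type
  (service-type
   (name 'llama-cpp)
   (extensions
    (list (service-extension shepherd-root-service-type
                             llama-cpp-shepherd-service)
          (service-extension account-service-type
                             llama-cpp-accounts)))
   (description "Run llama-server from llama.cpp as a Shepherd service.")))
```

This fragment is meant to live in the same module as the configuration record, so it is not runnable on its own.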

So, the only caveat here is providing the correct Vulkan ICD path. Let's look at it in detail:

(define (llama-cpp-shepherd-service config)
  (match-record config <llama-cpp-configuration>
                (huggingface-name parameters with-nvidia?
                                  nvidia-driver-package user
                                  group requirements
                                  host port)
    (let* ((icd (if with-nvidia?
                    (mixed-text-file "nvidia_icd.x86_64.json" "{
    \"file_format_version\" : \"1.0.1\",
    \"ICD\": {
        \"library_path\": \"" nvidia-driver-package "/lib/libEGL_nvidia.so.0\",
        \"api_version\" : \"1.4.312\"
    }
}")
                    (file-append mesa "/share/vulkan/icd.d")))
           (device-mappings
            (append
             (list (file-system-mapping
                     (source "/dev/dri")
                     (target source)
                     (writable? #t)))
             (if with-nvidia?
                 (map (lambda (dev)
                        (file-system-mapping
                          (source dev)
                          (target source)
                          (writable? #t)))
                      '("/dev/nvidiactl"
                        "/dev/nvidia0"
                        "/dev/nvidia-modeset"
                        "/dev/nvidia-uvm"
                        "/dev/nvidia-uvm-tools"))
                 '())))
           (llama-server
            (least-authority-wrapper
             (file-append llama-cpp "/bin/llama-server")
             #:name "llama-server"
             #:user user
             #:group group
             #:preserved-environment-variables
             (append %default-preserved-environment-variables
                     '("SSL_CERT_FILE" "VK_ICD_FILENAMES" "HOME"))
             #:mappings
             (append
              (list (file-system-mapping
                      (source "/etc/ssl/certs/ca-certificates.crt")
                      (target source)
                      (writable? #f))
                    (file-system-mapping
                      (source %llama-home)
                      (target source)
                      (writable? #t))
                    (file-system-mapping
                      (source (if with-nvidia? nvidia-driver-package mesa))
                      (target source)
                      (writable? #f))
                    (file-system-mapping
                      (source icd)
                      (target source)
                      (writable? #f)))
              device-mappings)
             #:namespaces
             (fold delq %namespaces '(net user)))))
      (list (shepherd-service
              (documentation "Llama-cpp server")
              (provision '(llama-cpp-server))
              (requirement (append '(networking) requirements))
              (start #~(make-forkexec-constructor
                        (append (list #$llama-server
                                      "--host" #$host
                                      "--port" #$port
                                      "-hf" #$huggingface-name
                                      #$@parameters))
                        #:environment-variables
                        (list (string-append "HOME=" #$%llama-home)
                              "SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt"
                              (string-append "VK_ICD_FILENAMES=" #$icd))))
              (stop #~(make-kill-destructor)))))))

There are basically three parts:

First, determining the Vulkan ICD path, depending on the with-nvidia? flag and the Nvidia driver package:
```
(if with-nvidia?
    (mixed-text-file "nvidia_icd.x86_64.json" "{
    \"file_format_version\" : \"1.0.1\",
    \"ICD\": {
        \"library_path\": \"" nvidia-driver-package "/lib/libEGL_nvidia.so.0\",
        \"api_version\" : \"1.4.312\"
    }
}")
    (file-append mesa "/share/vulkan/icd.d"))
```
If the with-nvidia? flag is set, we create an ICD file in the Guix store that points to the library inside nvidia-driver-package. This is necessary because the ICD shipped in the driver package doesn't work in a headless setup (and my beefy PC is headless). If we run without the proprietary Nvidia driver, we fall back to the Mesa ICDs.

Second, since we use least-authority-wrapper for enhanced security, we have to specify all the devices the server needs access to:

(device-mappings
 (append
  (list (file-system-mapping
          (source "/dev/dri")
          (target source)
          (writable? #t)))
  (if with-nvidia?
      (map (lambda (dev)
             (file-system-mapping
               (source dev)
               (target source)
               (writable? #t)))
           '("/dev/nvidiactl"
             "/dev/nvidia0"
             "/dev/nvidia-modeset"
             "/dev/nvidia-uvm"
             "/dev/nvidia-uvm-tools"))
      '())))

We always map /dev/dri and, in the Nvidia case, add the Nvidia device nodes as well.

Finally, we create the least-authority-wrapper itself. It is pretty straightforward: we put all the file mappings together and list the environment variables we want to preserve for the running service:

(llama-server
 (least-authority-wrapper
  (file-append llama-cpp "/bin/llama-server")
  #:name "llama-server"
  #:user user
  #:group group
  #:preserved-environment-variables
  (append %default-preserved-environment-variables
          '("SSL_CERT_FILE" "VK_ICD_FILENAMES" "HOME"))
  #:mappings
  (append
   (list (file-system-mapping
           (source "/etc/ssl/certs/ca-certificates.crt")
           (target source)
           (writable? #f))
         (file-system-mapping
           (source %llama-home)
           (target source)
           (writable? #t))
         (file-system-mapping
           (source (if with-nvidia? nvidia-driver-package mesa))
           (target source)
           (writable? #f))
         (file-system-mapping
           (source icd)
           (target source)
           (writable? #f)))
   device-mappings)
  #:namespaces
  (fold delq %namespaces '(net user))))

SSL certificates are needed so that llama.cpp can access HuggingFace over HTTPS. The ICD path and the whole Mesa or Nvidia driver directory in the Guix store are mapped as well, so that all the Vulkan driver files can be loaded.
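
The `#:namespaces` argument of the wrapper deserves a short note: it uses `fold` with `delq` to drop `net` and `user` from a list of Linux namespaces. The sketch below shows what that expression computes. The starting list is an assumption, meant to mirror the full namespace list that Guix exports as `%namespaces` from `(gnu build linux-container)`.

```scheme
(use-modules (srfi srfi-1))   ; provides fold

;; Assumption: the full list of namespaces the wrapper could unshare.
(define all-namespaces '(cgroup mnt pid ipc uts user net))

;; fold calls (delq elem acc) for each element of '(net user), so every
;; step removes one symbol from the accumulated list.
(define unshared (fold delq all-namespaces '(net user)))

(display unshared)   ; prints (cgroup mnt pid ipc uts)
(newline)
```

Only the namespaces left in the list are isolated. The server therefore keeps the host network, which it needs to download models and to accept connections, and presumably stays in the host user namespace so that its real user and group apply to the device nodes.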

What to run?

Now that the service is in place, we can use it in the OS configuration. Here is an example:

(use-modules ...
             (rodion services llama))

(operating-system
  (services (list ...
                  (service llama-cpp-service-type
                           (llama-cpp-configuration
                            (huggingface-name "unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL")
                            (parameters (list "-c" "64000"
                                              "--temp" "0.6"
                                              "--top-k" "20"
                                              "--top-p" "0.95"
                                              "--min-p" "0.00"))
                            (host "0.0.0.0")
                            (with-nvidia? #t)
                            (nvidia-driver-package nvidia-driver-595))))))

In my case the machine cannot be accessed from the Internet (only via the VPN), so binding to 0.0.0.0 seems to be a rather safe choice.
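
If your machine is reachable from outside, or has no Nvidia card at all, a smaller declaration that leans on the defaults of llama-cpp-configuration may be a better starting point. This is a sketch based on the record definition shown earlier. The context size is a hypothetical value, not a recommendation.

```scheme
;; Relies on the record defaults: host "127.0.0.1", port "8080"
;; (a string, not a number) and with-nvidia? #f, i.e. the Mesa ICDs.
(service llama-cpp-service-type
         (llama-cpp-configuration
          ;; The only field without a default value.
          (huggingface-name "unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL")
          ;; Hypothetical smaller context for cards with less memory.
          (parameters (list "-c" "8192"))))
```

With this variant the server only listens on the loopback interface, so other machines need a reverse proxy or an SSH tunnel to reach it.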

Once the model has been downloaded, you can open the web UI at http://<your-ip>:8080/ to give your newly deployed model a try.

Of course, an OpenAI-compatible API is also available on the same host and port, and it can be used with a wide range of tools, e.g. Claude Code, gpt.el in Emacs or my favorite, the editor-agnostic ECA.

The model I am currently running is Qwen3.5-27B with a 64000-token context. It can be used for light coding and asking questions, as well as tool calling, at about 32 t/s on the RTX 4090.

So far I am delighted with the results. Stay tuned, and happy hacking!
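
To poke at the API without any extra tooling, plain Guile is enough. The sketch below assumes the server listens on 127.0.0.1:8080 and that it accepts a chat-completions request containing only a `messages` array. The names `%base-url`, `json-escape`, `chat-request-body` and `ask` are made up for this example.

```scheme
(use-modules (web client)        ; http-post
             (web response)      ; response-code
             (rnrs bytevectors)  ; utf8->string
             (ice-9 receive))    ; receive

;; Assumption: where the server from this post is reachable.
(define %base-url "http://127.0.0.1:8080")

(define (json-escape str)
  "Escape backslashes, double quotes and newlines in STR.  Other control
characters are not handled, so this is not a complete JSON encoder."
  (string-concatenate
   (map (lambda (c)
          (case c
            ((#\\) "\\\\")
            ((#\") "\\\"")
            ((#\newline) "\\n")
            (else (string c))))
        (string->list str))))

(define (chat-request-body prompt)
  "Return a minimal chat-completions request body asking PROMPT."
  (string-append "{\"messages\":[{\"role\":\"user\",\"content\":\""
                 (json-escape prompt)
                 "\"}]}"))

(define (ask prompt)
  "POST PROMPT to the server and return the raw JSON reply as a string."
  (receive (response body)
      (http-post (string-append %base-url "/v1/chat/completions")
                 #:headers '((content-type . (application/json)))
                 #:body (chat-request-body prompt))
    (if (= 200 (response-code response))
        ;; Guile returns non-text bodies such as JSON as bytevectors.
        (if (bytevector? body) (utf8->string body) body)
        (error "unexpected HTTP status" (response-code response)))))
```

Calling `(ask "Say hello")` in a REPL should return the JSON reply as a string. Extracting the answer text needs a JSON library such as guile-json, which is left out to keep the sketch dependency-free.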

P.S. If you have any questions about or patches for the little-guix-channel, or this llama-cpp-service-type in particular, don't hesitate to send an email to rodion [at] goritskov.com