Devlog: Setting up Open WebUI in K8s with GPU
Setting up a new node
Currently my juiciest GPU sits in my gaming PC. Since I do not want to take it out, I have to bring K8s to it instead.
I set up dual boot with Ubuntu Server on 200 GB of space that serendipitously was still unpartitioned on an HDD.
Setting up NFS
Since my single-node cluster uses local node storage, I need something to distribute it so that workloads are not bound to one node. Looking at several technologies, Rook and Longhorn looked promising, but NFS won in the end for (probably) being the simplest to implement.
First, setting up an NFS server on my K8s server, following Ubuntu's docs.
sudo apt install nfs-kernel-server
sudo systemctl start nfs-kernel-server.service
sudo echo "/data 192.168.1.203(rw,no_root_squash,no_subtree_check,sync)" >> /etc/exports
sudo exportfs -a
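To sanity-check the export before touching any client (optional, not part of the original steps):
sudo exportfs -v
showmount -e localhost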
Setting up the client on the gaming PC node
sudo apt install nfs-common
sudo mkdir /data
sudo echo "192.168.1.200:/data /data nfs rsize=8192,wsize=8192,timeo=14,intr" >> /etc/fstab
Problem: entries in /etc/fstab are mounted before the network is up. Another problem: /etc/network/if-up.d
is no longer used in Ubuntu 24.
Solution: use networkd-dispatcher. Create the following script as /etc/networkd-dispatcher/routable.d/fstab:
#!/bin/bash
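# Re-run all fstab mounts once the network becomes routable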
mount -a
Make the script executable and enable networkd-dispatcher
sudo chmod +x /etc/networkd-dispatcher/routable.d/fstab
sudo systemctl enable networkd-dispatcher.service
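With the share mounted on the node, pods can use it via hostPath volumes, but Kubernetes can also talk to the NFS server directly. A minimal sketch of an nfs-backed PersistentVolume (the nfs-data name and the 50Gi capacity are placeholders, not my actual manifest):
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-data
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.200
    path: /data
EOF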
Installing containerd
Using Docker's apt repository.
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install containerd.io
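A quick sanity check that the runtime actually came up (not part of the Docker docs, just a habit):
systemctl status containerd --no-pager
sudo ctr version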
Installing kubeadm
Following the official Kubernetes docs.
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
sudo systemctl enable --now kubelet
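Optionally, a quick check that everything installed and is pinned at matching versions:
kubeadm version
kubectl version --client
kubelet --version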
Setting up the node
On the control-plane node, generate the join command for the worker:
kubeadm token create --print-join-command --ttl 24h
Then run the output of the command above on the new node.
kubeadm join 192.168.1.200:6443 --token xxxxxx --discovery-token-ca-cert-hash sha256:xxxxxxx
Resulting in a lot of errors we have to fix first.
Turn off swap and comment out the swap entry in fstab so the change persists across reboots.
sudo swapoff -a
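One way to do the commenting-out non-interactively (hand-editing /etc/fstab works just as well):
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab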
Doing network stuff
sudo modprobe br_netfilter
sudo bash -c "echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables"
sudo bash -c "echo 1 > /proc/sys/net/ipv4/ip_forward"
Fix the containerd config
sudo bash -c "containerd config default > /etc/containerd/config.toml"
sudo service containerd restart
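Side note: with kubeadm's default systemd cgroup driver, the generated default config usually also wants SystemdCgroup = true under the runc options. It may or may not be needed here; the usual tweak is:
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd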
Running the join command again and success!
$ kubectl get nodes
NAME        STATUS   ROLES           AGE    VERSION
compute1    Ready    <none>          46s    v1.29.13
mainframe   Ready    control-plane   364d   v1.29.10
Get the GPU up and running
Following the instructions from NVIDIA's GPU Operator.
Create the namespace and make it privileged:
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
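The chart comes from NVIDIA's Helm repository, so add that first if it is not already there:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update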
Let's see if it works now. The first time around this worked until a reboot, which then killed the entire node.
helm install --wait gpu-operator \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
So far so good: all pods up after a few minutes, with all expected kernel modules installed.
$ lsmod | grep -i nvidia
nvidia_modeset 1355776 0
nvidia_uvm 4956160 4
nvidia 54386688 12 nvidia_uvm,nvidia_modeset
video 77824 2 asus_wmi,nvidia_modeset
Alas, the driver pod starts crash-looping, even though it has loaded the driver and the validator reports success.
Done, now waiting for signal
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/containerd/containerd.sock: connect: connection refused"
Subsequent pod restarts fail to unload the already loaded nvidia modules.
Trying with manually installed drivers and tools
A good guide for setting up the gpu-operator.
Install the GPU driver and the NVIDIA container toolkit:
sudo ubuntu-drivers install
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
Add nvidia config to containerd
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
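Before handing things back to the operator, a quick check that the driver and the containerd runtime entry look sane (reboot first if the driver was just installed):
nvidia-smi
grep -A 3 'runtimes.nvidia' /etc/containerd/config.toml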
Add the operator without the driver and container toolkit daemonsets:
helm install --wait gpu-operator \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set toolkit.enabled=false \
--set driver.enabled=false
Once again, it looks promising: all pods running or completed, no restarts after two minutes.
…but the GPU worker node has scheduling disabled and no GPU resources.
Trying to uncordon the node:
kubectl uncordon compute1
Still nothing… the vector-add test fails. Killing all pods in the gpu-operator namespace.
The node now has one GPU resource available. GPU-requesting pods still won't schedule due to "insufficient nvidia.com/gpu". Strange; according to NVIDIA docs, driver version 530 and up should support the newest version of CUDA.
Got the nvidia-smi container to work. Ollama still can’t find a compatible GPU.
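For reference, the nvidia-smi test is roughly a pod like this (pod name and CUDA image tag are just examples):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.1.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl logs gpu-test   # once the pod has completed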
Trying a self-managed driver with the operator-managed container toolkit:
helm upgrade -i --wait gpu-operator \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false
Success!
Further problems after reboot, fixed by persisting the "network stuff": the br_netfilter module, bridge filtering, and IP forwarding.
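Persisting them boils down to the standard kubeadm prerequisite files, something along these lines:
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system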