Karim

Posted on • Originally published at deep75.Medium
AIOps: AI-Driven Investigation in Kubernetes with HolmesGPT, Ollama and RunPod…

In the world of container orchestration, Kubernetes has become the standard for managing containerized workloads. However, managing and troubleshooting Kubernetes clusters can be complex and time-consuming. This article explores how artificial intelligence (AI) can be integrated into Kubernetes to improve incident investigation and management. I had already touched on the subject in a previous article:

AIOps: Debugging your Kubernetes cluster using generative AI via…

Here I will focus on HolmesGPT. HolmesGPT, developed by Robusta, is an open-source troubleshooting agent that uses AI to investigate incidents in Kubernetes clusters. Its main characteristics:

  • Integration with incident management tools: HolmesGPT connects to tools such as PagerDuty, OpsGenie and Prometheus to collect data and analyze alerts.
  • Automated investigation: using AI, HolmesGPT can identify and resolve issues such as expired SSL certificates, insufficient resources and node affinity problems, significantly reducing the time and effort needed for troubleshooting.
  • Customization: HolmesGPT lets you create custom runbooks to handle specific problems, using custom APIs and tools where needed.

GitHub - robusta-dev/holmesgpt: On-Call Assistant for Prometheus Alerts - Get a head start on fixing alerts with AI investigation

For this exercise, I'll once again start by launching an Ubuntu 24.04 LTS instance at the cloud provider DigitalOcean:

On it I'll install Incus, a fork of LXD, which will serve as the foundation for assembling a Kubernetes cluster out of several containers:

Linux Containers - Incus - Introduction

As with LXD, I'll create several profiles. But first, let's install Incus on the instance:

root@k0s-incus:~# curl -fsSL https://pkgs.zabbly.com/key.asc | gpg --show-keys --fingerprint
gpg: directory '/root/.gnupg' created
gpg: keybox '/root/.gnupg/pubring.kbx' created
pub   rsa3072 2023-08-23 [SC] [expires: 2025-08-22]
      4EFC 5906 96CB 15B8 7C73  A3AD 82CC 8797 C838 DCFD
uid                      Zabbly Kernel Builds <info@zabbly.com>
sub   rsa3072 2023-08-23 [E] [expires: 2025-08-22]

root@k0s-incus:~# mkdir -p /etc/apt/keyrings/
root@k0s-incus:~# curl -fsSL https://pkgs.zabbly.com/key.asc -o /etc/apt/keyrings/zabbly.asc
root@k0s-incus:~# sh -c 'cat <<EOF > /etc/apt/sources.list.d/zabbly-incus-stable.sources
Enabled: yes
Types: deb
URIs: https://pkgs.zabbly.com/incus/stable
Suites: $(. /etc/os-release && echo ${VERSION_CODENAME})
Components: main
Architectures: $(dpkg --print-architecture)
Signed-By: /etc/apt/keyrings/zabbly.asc
EOF'
root@k0s-incus:~# apt-get update
Hit:1 http://security.ubuntu.com/ubuntu noble-security InRelease
Hit:2 http://mirrors.digitalocean.com/ubuntu noble InRelease
Hit:3 https://repos-droplet.digitalocean.com/apt/droplet-agent main InRelease
Hit:4 http://mirrors.digitalocean.com/ubuntu noble-updates InRelease
Hit:5 http://mirrors.digitalocean.com/ubuntu noble-backports InRelease
Get:6 https://pkgs.zabbly.com/incus/stable noble InRelease [7358 B]
Get:7 https://pkgs.zabbly.com/incus/stable noble/main amd64 Packages [3542 B]
Fetched 10.9 kB in 1s (13.3 kB/s)
Reading package lists... Done
root@k0s-incus:~# apt-get install incus incus-client incus-ui-canonical -y
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  attr dconf-gsettings-backend dconf-service dns-root-data dnsmasq-base fontconfig genisoimage glib-networking
  glib-networking-common glib-networking-services gsettings-desktop-schemas gstreamer1.0-plugins-base
  gstreamer1.0-plugins-good gstreamer1.0-x incus-base iw libaa1 libasyncns0 libavc1394-0 libboost-iostreams1.83.0
  libboost-thread1.83.0 libbtrfs0t64 libcaca0 libcairo-gobject2 libcairo2 libcdparanoia0 libdatrie1 libdaxctl1
  libdconf1 libdv4t64 libflac12t64 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgraphite2-3
  libgstreamer-plugins-base1.0-0 libgstreamer-plugins-good1.0-0 libharfbuzz0b libiec61883-0 libmp3lame0 libmpg123-0t64
  libndctl6 libnet1 libogg0 libopus0 liborc-0.4-0t64 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0 libpixman-1-0
  libpmem1 libpmemobj1 libproxy1v5 libpulse0 librados2 libraw1394-11 librbd1 librdmacm1t64 libshout3 libsndfile1
  libsoup-3.0-0 libsoup-3.0-common libspeex1 libspice-server1 libtag1v5 libtag1v5-vanilla libthai-data libthai0
  libtheora0 libtwolame0 libusbredirparser1t64 libv4l-0t64 libv4lconvert0t64 libvisual-0.4-0 libvorbis0a libvorbisenc2
  libvpx9 libwavpack1 libx11-xcb1 libxcb-render0 libxcb-shm0 libxdamage1 libxfixes3 libxi6 libxrender1 libxtst6 libxv1
  session-migration sshfs wireless-regdb x11-common xdelta3
root@k0s-incus:~# incus
Description:
  Command line client for Incus

  All of Incus's features can be driven through the various commands below.
  For help with any of those, simply call them with --help.

  Custom commands can be defined through aliases, use "incus alias" to control those.

Usage:
  incus [command]

Available Commands:
  admin       Manage incus daemon
  cluster     Manage cluster members
  config      Manage instance and server configuration options
  console     Attach to instance consoles
  copy        Copy instances within or in between servers
  create      Create instances from images
  delete      Delete instances
  exec        Execute commands in instances
  export      Export instance backups
  file        Manage files in instances
  help        Help about any command
  image       Manage images
  import      Import instance backups
  info        Show instance or server information
  launch      Create and start instances from images
  list        List instances
  move        Move instances within or in between servers
  network     Manage and attach instances to networks
  pause       Pause instances
  profile     Manage profiles
  project     Manage projects
  publish     Publish instances as images
  rebuild     Rebuild instances
  remote      Manage the list of remote servers
  rename      Rename instances
  restart     Restart instances
  resume      Resume instances
  snapshot    Manage instance snapshots
  start       Start instances
  stop        Stop instances
  storage     Manage storage pools and volumes
  top         Display resource usage info per instance
  version     Show local and remote versions
  webui       Open the web interface

Flags:
      --all            Show less common commands
      --debug          Show all debug messages
      --force-local    Force using the local unix socket
  -h, --help           Print help
      --project        Override the source project
  -q, --quiet          Don't show progress information
      --sub-commands   Use with help or --help to view sub-commands
  -v, --verbose        Show all information messages
      --version        Print version number

Use "incus [command] --help" for more information about a command.

Initializing Incus with a minimal configuration:

root@k0s-incus:~# incus admin init
Would you like to use clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]:
Name of the new storage pool [default=default]:
Name of the storage backend to use (btrfs, dir, lvm) [default=btrfs]: dir
Where should this storage pool store its data? [default=/var/lib/incus/storage-pools/default]:
Would you like to create a new local network bridge? (yes/no) [default=yes]:
What should the new bridge be called? [default=incusbr0]:
What IPv4 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]:
What IPv6 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]:
Would you like the server to be available over the network? (yes/no) [default=no]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]:
Would you like a YAML "init" preseed to be printed? (yes/no) [default=no]:

root@k0s-incus:~# incus list
+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+

root@k0s-incus:~# incus profile list
+---------+-----------------------+---------+
|  NAME   |      DESCRIPTION      | USED BY |
+---------+-----------------------+---------+
| default | Default Incus profile | 0       |
+---------+-----------------------+---------+

root@k0s-incus:~# incus profile show default
config: {}
description: Default Incus profile
devices:
  eth0:
    name: eth0
    network: incusbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
used_by: []
project: default

root@k0s-incus:~# incus profile create k8s

Incus provides a control dashboard that can be started on demand with incus webui.


Starting it up:

root@k0s-incus:~# nohup incus webui &
[1] 4104
root@k0s-incus:~# nohup: ignoring input and appending output to 'nohup.out'

root@k0s-incus:~# cat nohup.out
Web server running at: http://127.0.0.1:34363/ui?auth_token=3c5f5d4b-f9ed-4bf9-a174-d5ea2366cfbf

Using pinggy.io to reach it from the outside (the -R0:127.0.0.1:34363 option asks the remote end to allocate a public address forwarded to the local web UI port):

Pinggy - Simple Localhost Tunnels


root@k0s-incus:~# ssh -p 443 -R0:127.0.0.1:34363 a.pinggy.io 

I reuse the same profile that MicroK8s employs under LXD:

MicroK8s - MicroK8s in LXD | MicroK8s

root@k0s-incus:~# wget https://raw.githubusercontent.com/ubuntu/microk8s/master/tests/lxc/microk8s.profile -O k8s.profile
--2025-01-14 20:58:42--  https://raw.githubusercontent.com/ubuntu/microk8s/master/tests/lxc/microk8s.profile
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 816 [text/plain]
Saving to: ‘k8s.profile’

k8s.profile    100%[===================>]     816  --.-KB/s    in 0s

2025-01-14 20:58:42 (33.4 MB/s) - ‘k8s.profile’ saved [816/816]

root@k0s-incus:~# cat k8s.profile | incus profile edit k8s
root@k0s-incus:~# rm k8s.profile
root@k0s-incus:~# incus profile show k8s
config:
  boot.autostart: "true"
  linux.kernel_modules: ip_vs,ip_vs_rr,ip_vs_wrr,ip_vs_sh,ip_tables,ip6_tables,netlink_diag,nf_nat,overlay,br_netfilter
  raw.lxc: |
    lxc.apparmor.profile=unconfined
    lxc.mount.auto=proc:rw sys:rw cgroup:rw
    lxc.cgroup.devices.allow=a
    lxc.cap.drop=
  security.nesting: "true"
  security.privileged: "true"
description: ""
devices:
  aadisable:
    path: /sys/module/nf_conntrack/parameters/hashsize
    source: /sys/module/nf_conntrack/parameters/hashsize
    type: disk
  aadisable2:
    path: /dev/kmsg
    source: /dev/kmsg
    type: unix-char
  aadisable3:
    path: /sys/fs/bpf
    source: /sys/fs/bpf
    type: disk
  aadisable4:
    path: /proc/sys/net/netfilter/nf_conntrack_max
    source: /proc/sys/net/netfilter/nf_conntrack_max
    type: disk
name: k8s
used_by: []
project: default

Since Incus can consume cloud-init, I create a new profile dedicated to that purpose (a sketch of how to build it follows the listing):

root@k0s-incus:~# incus profile show cloud
config:
  cloud-init.user-data: |
    #cloud-config
    package_update: true
    package_upgrade: true
    package_reboot_if_required: true
    packages:
      - vim
      - wget
      - git
      - curl
      - htop
      - openssh-server
    bootcmd:
      - systemctl enable ssh
      - systemctl start ssh
    ssh_authorized_keys:
      - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCpbsaaVUMa2TM9q8VkeBmbKvJpbreXTcqI5F5N3riGsoZ7Z/IIN7eR6J47UP2bj3IBTdgHmij1uOexm60QBO2PY4abIhsN+xnVS4a0LSyI8v6nYECWbEehL/gFn6uDmSLA4m0hZCF5BSpLxQYzKS28dHIdXsLC4CDd67nAXIhOiVpM0q/AUCuSy+mA0VwFa/JAkFCk8TpQBorgwJIq635imrgxYIpEUA2wHXOhw23mO3zTUlay13LSlA2a1xyTkP8hSDWdRYVxr2DEB/MtmTX2BdWlA5rDRmzXE7R2/csE245WAxG+XfSu4zNqhHzm8Df3zmZn3/UyKLcx4eJF//mVZyrM7RQHRteA/im8I4IavrReGyCUKY+OsSfygYVFyO87rYQ+IOauOnB4LxBohBjSBN3Skk4X7krYFIi8D9R1lmL+VvBfpvy0YMurOahY1VJFzD0dUeK2bDUdeWzfFkcX039d9/RRXRxieNpxwp1BLPi5/DXG8FihzgwVTf6h60J9/fkYzY+BO8CKG2kYTUsy1ykuXLzLY5sTCREiEoEKcJ9IGz8OimZ1AmkgJJCrQnI6mT/KiNDU6YCc75ONKTKX5HKVPhZWT255Aw4f5LBbBrj06cJX3GuunV0I30+BYyHwLbPBoqgd4GUk3YJlr8wS3qre/YUSc2iKNDTOzFCC8Q== root@k0s-incus
description: incus with cloud-init
devices: {}
name: cloud
used_by: []
project: default
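For reference, here is a minimal sketch of how such a profile can be built, reusing the same cat-into-edit pattern as for the k8s profile above (cloud-profile.yaml is a hypothetical local file containing the YAML shown in the listing):

root@k0s-incus:~# incus profile create cloud
root@k0s-incus:~# incus profile edit cloud < cloud-profile.yaml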

I'm now ready to create the three containers that will serve as the building blocks of the Kubernetes cluster:

root@k0s-incus:~# for i in {1..3}; do incus launch -p default -p k8s -p cloud images:ubuntu/24.04/cloud k0s-$i; done
Launching k0s-1
Launching k0s-2
Launching k0s-3

root@k0s-incus:~# incus list
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| NAME  |  STATE  |         IPV4          |                     IPV6                      |   TYPE    | SNAPSHOTS |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| k0s-1 | RUNNING | 10.224.160.99 (eth0)  | fd42:4641:b619:c782:216:3eff:fea4:53d3 (eth0) | CONTAINER | 0         |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| k0s-2 | RUNNING | 10.224.160.54 (eth0)  | fd42:4641:b619:c782:216:3eff:feee:7af8 (eth0) | CONTAINER | 0         |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| k0s-3 | RUNNING | 10.224.160.215 (eth0) | fd42:4641:b619:c782:216:3eff:fef3:709b (eth0) | CONTAINER | 0         |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+

root@k0s-incus:~# cat .ssh/config
Host *
    StrictHostKeyChecking no
    UserKnownHostsFile=/dev/null

root@k0s-incus:~# ssh ubuntu@10.224.160.99
Welcome to Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-51-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

ubuntu@k0s-1:~$

Fetching k0sctl to build a Kubernetes cluster with k0s:

Using k0sctl - Documentation

root@k0s-incus:~# wget -c https://github.com/k0sproject/k0sctl/releases/download/v0.21.0/k0sctl-linux-amd64 && chmod +x k0sctl-linux-amd64 && mv k0sctl-linux-amd64 /usr/local/bin/k0sctl
Saving to: ‘k0sctl-linux-amd64’

k0sctl-linux-amd64    100%[===================>]  18.21M  --.-KB/s    in 0.1s

2025-01-14 21:22:23 (122 MB/s) - ‘k0sctl-linux-amd64’ saved [19091608/19091608]

root@k0s-incus:~# k0sctl
NAME:
   k0sctl - k0s cluster management tool

USAGE:
   k0sctl [global options] command [command options]

COMMANDS:
   version     Output k0sctl version
   apply       Apply a k0sctl configuration
   kubeconfig  Output the admin kubeconfig of the cluster
   init        Create a configuration template
   reset       Remove traces of k0s from all of the hosts
   backup      Take backup of existing clusters state
   config      Configuration related sub-commands
   completion
   help, h     Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --debug, -d  Enable debug logging (default: false) [$DEBUG]
   --trace      Enable trace logging (default: false) [$TRACE]
   --no-redact  Do not hide sensitive information in the output (default: false)
   --help, -h   show help

root@k0s-incus:~# k0sctl init --k0s > k0sctl.yaml
root@k0s-incus:~# cat k0sctl.yaml
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s-cluster
  user: admin
spec:
  hosts:
  - ssh:
      address: 10.224.160.99
      user: ubuntu
      port: 22
      keyPath: /root/.ssh/id_rsa
    role: controller
  - ssh:
      address: 10.224.160.54
      user: ubuntu
      port: 22
      keyPath: /root/.ssh/id_rsa
    role: worker
  - ssh:
      address: 10.224.160.215
      user: ubuntu
      port: 22
      keyPath: /root/.ssh/id_rsa
    role: worker
  k0s:
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: Cluster
      metadata:
        name: k0s
      spec:
        api:
          k0sApiPort: 9443
          port: 6443
        installConfig:
          users:
            etcdUser: etcd
            kineUser: kube-apiserver
            konnectivityUser: konnectivity-server
            kubeAPIserverUser: kube-apiserver
            kubeSchedulerUser: kube-scheduler
        konnectivity:
          adminPort: 8133
          agentPort: 8132
        network:
          kubeProxy:
            disabled: false
            mode: iptables
          kuberouter:
            autoMTU: true
            mtu: 0
            peerRouterASNs: ""
            peerRouterIPs: ""
          podCIDR: 10.244.0.0/16
          provider: kuberouter
          serviceCIDR: 10.96.0.0/12
        podSecurityPolicy:
          defaultPolicy: 00-k0s-privileged
        storage:
          type: etcd
        telemetry:
          enabled: true

Launching the cluster build:

root@k0s-incus:~# k0sctl apply --config k0sctl.yaml

k0sctl v0.21.0 Copyright 2023, k0sctl authors.
By continuing to use k0sctl you agree to these terms:
https://k0sproject.io/licenses/eula

INFO ==> Running phase: Set k0s version
INFO Looking up latest stable k0s version
INFO Using k0s version v1.31.3+k0s.0
INFO ==> Running phase: Connect to hosts
INFO [ssh] 10.224.160.215:22: connected
INFO [ssh] 10.224.160.99:22: connected
INFO [ssh] 10.224.160.54:22: connected
INFO ==> Running phase: Detect host operating systems
INFO [ssh] 10.224.160.215:22: is running Ubuntu 24.04.1 LTS
INFO [ssh] 10.224.160.99:22: is running Ubuntu 24.04.1 LTS
INFO [ssh] 10.224.160.54:22: is running Ubuntu 24.04.1 LTS
INFO ==> Running phase: Acquire exclusive host lock
INFO ==> Running phase: Prepare hosts
INFO ==> Running phase: Gather host facts
INFO [ssh] 10.224.160.215:22: using k0s-3 as hostname
INFO [ssh] 10.224.160.54:22: using k0s-2 as hostname
INFO [ssh] 10.224.160.99:22: using k0s-1 as hostname
INFO [ssh] 10.224.160.215:22: discovered eth0 as private interface
INFO [ssh] 10.224.160.54:22: discovered eth0 as private interface
INFO [ssh] 10.224.160.99:22: discovered eth0 as private interface
INFO ==> Running phase: Validate hosts
INFO ==> Running phase: Validate facts
INFO ==> Running phase: Download k0s on hosts
INFO [ssh] 10.224.160.215:22: downloading k0s v1.31.3+k0s.0
INFO [ssh] 10.224.160.54:22: downloading k0s v1.31.3+k0s.0
INFO [ssh] 10.224.160.99:22: downloading k0s v1.31.3+k0s.0
INFO ==> Running phase: Install k0s binaries on hosts
INFO [ssh] 10.224.160.99:22: validating configuration
INFO ==> Running phase: Configure k0s
INFO [ssh] 10.224.160.99:22: installing new configuration
INFO ==> Running phase: Initialize the k0s cluster
INFO [ssh] 10.224.160.99:22: installing k0s controller
INFO [ssh] 10.224.160.99:22: waiting for the k0s service to start
INFO [ssh] 10.224.160.99:22: wait for kubernetes to reach ready state
INFO ==> Running phase: Install workers
INFO [ssh] 10.224.160.99:22: generating a join token for worker 1
INFO [ssh] 10.224.160.99:22: generating a join token for worker 2
INFO [ssh] 10.224.160.215:22: validating api connection to https://10.224.160.99:6443 using join token
INFO [ssh] 10.224.160.54:22: validating api connection to https://10.224.160.99:6443 using join token
INFO [ssh] 10.224.160.215:22: writing join token to /etc/k0s/k0stoken
INFO [ssh] 10.224.160.54:22: writing join token to /etc/k0s/k0stoken
INFO [ssh] 10.224.160.54:22: installing k0s worker
INFO [ssh] 10.224.160.215:22: installing k0s worker
INFO [ssh] 10.224.160.215:22: starting service
INFO [ssh] 10.224.160.215:22: waiting for node to become ready
INFO [ssh] 10.224.160.54:22: starting service
INFO [ssh] 10.224.160.54:22: waiting for node to become ready
INFO ==> Running phase: Release exclusive host lock
INFO ==> Running phase: Disconnect from hosts
INFO ==> Finished in 42s
INFO k0s cluster version v1.31.3+k0s.0 is now installed
INFO Tip: To access the cluster you can now fetch the admin kubeconfig using:
INFO      k0sctl kubeconfig

The cluster is up. Note that only the two workers show up in kubectl get nodes: by default, a k0s controller runs isolated from the worker plane and does not register as a node:


root@k0s-incus:~# curl -LO https://dl.k8s.io/release/v1.31.3/bin/linux/amd64/kubectl && chmod +x kubectl && mv kubectl /usr/local/bin/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   138  100   138    0     0    923      0 --:--:-- --:--:-- --:--:--   926
100 53.7M  100 53.7M    0     0   476k      0  0:01:55  0:01:55 --:--:-- 1023k
root@k0s-incus:~# mkdir .kube
root@k0s-incus:~# k0sctl kubeconfig --config k0sctl.yaml > .kube/config
root@k0s-incus:~# kubectl cluster-info
Kubernetes control plane is running at https://10.224.160.99:6443
CoreDNS is running at https://10.224.160.99:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
root@k0s-incus:~# kubectl get nodes -o wide
NAME    STATUS   ROLES    AGE    VERSION       INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k0s-2   Ready    <none>   5m1s   v1.31.3+k0s   10.224.160.54    <none>        Ubuntu 24.04.1 LTS   6.8.0-51-generic   containerd://1.7.24
k0s-3   Ready    <none>   5m1s   v1.31.3+k0s   10.224.160.215   <none>        Ubuntu 24.04.1 LTS   6.8.0-51-generic   containerd://1.7.24
root@k0s-incus:~# kubectl get po,svc -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-645c5d6f5b-kgnsf          1/1     Running   0          5m2s
kube-system   pod/coredns-645c5d6f5b-n2rbk          1/1     Running   0          5m2s
kube-system   pod/konnectivity-agent-2dg8l          1/1     Running   0          5m4s
kube-system   pod/konnectivity-agent-5l5dl          1/1     Running   0          5m4s
kube-system   pod/kube-proxy-cx47n                  1/1     Running   0          5m7s
kube-system   pod/kube-proxy-sp5fd                  1/1     Running   0          5m7s
kube-system   pod/kube-router-6l4qv                 1/1     Running   0          5m7s
kube-system   pod/kube-router-b9t89                 1/1     Running   0          5m7s
kube-system   pod/metrics-server-78c4ccbc7f-jxpzz   1/1     Running   0          5m1s

NAMESPACE     NAME                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes       ClusterIP   10.96.0.1      <none>        443/TCP                  5m17s
kube-system   service/kube-dns         ClusterIP   10.96.0.10     <none>        53/UDP,53/TCP,9153/TCP   5m7s
kube-system   service/metrics-server   ClusterIP   10.109.44.51   <none>        443/TCP                  5m1s

HolmesGPT can now be installed via pipx:

GitHub - robusta-dev/holmesgpt: On-Call Assistant for Prometheus Alerts - Get a head start on fixing alerts with AI investigation

root@k0s-incus:~# apt install pipx -y
root@k0s-incus:~# pipx ensurepath
Success! Added /root/.local/bin to the PATH environment variable.

Consider adding shell completions for pipx. Run 'pipx completions' for instructions.

You will need to open a new terminal or re-login for the PATH changes to take effect.

Otherwise pipx is ready to go! ✨ 🌟 ✨
root@k0s-incus:~# pipx install "https://github.com/robusta-dev/holmesgpt/archive/refs/heads/master.zip"
  installed package holmesgpt 0.1.0, installed using Python 3.12.3
  These apps are now globally available
    - holmes
done! ✨ 🌟 ✨
root@k0s-incus:~# holmes version
/root/.local/share/pipx/venvs/holmesgpt/lib/python3.12/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
  warnings.warn(message, UserWarning)
HEAD -> master-bfafbde3

Alongside it, I grab K9s, which provides a terminal UI for interacting with your Kubernetes clusters. The aim of this project is to make it easier to navigate, observe and manage your applications in the wild. K9s continually watches Kubernetes for changes and offers commands to interact with the observed resources.

GitHub - derailed/k9s: 🐶 Kubernetes CLI To Manage Your Clusters In Style!

root@k0s-incus:~# wget -c https://github.com/derailed/k9s/releases/download/v0.32.7/k9s_linux_amd64.deb
HTTP request sent, awaiting response... 200 OK
Length: 31832132 (30M) [application/octet-stream]
Saving to: ‘k9s_linux_amd64.deb’

k9s_linux_amd64.deb    100%[===================>]  30.36M  --.-KB/s    in 0.1s

2025-01-14 21:40:07 (291 MB/s) - ‘k9s_linux_amd64.deb’ saved [31832132/31832132]

root@k0s-incus:~# apt install -f ./k9s_linux_amd64.deb
root@k0s-incus:~# k9s --help
K9s is a CLI to view and manage your Kubernetes clusters.

Usage:
  k9s [flags]
  k9s [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  info        List K9s configurations info
  version     Print version/build info

Flags:
  -A, --all-namespaces                 Launch K9s in all namespaces
      --as string                      Username to impersonate for the operation
      --as-group stringArray           Group to impersonate for the operation
      --certificate-authority string   Path to a cert file for the certificate authority
      --client-certificate string      Path to a client certificate file for TLS
      --client-key string              Path to a client key file for TLS
      --cluster string                 The name of the kubeconfig cluster to use
  -c, --command string                 Overrides the default resource to load when the application launches
      --context string                 The name of the kubeconfig context to use
      --crumbsless                     Turn K9s crumbs off
      --headless                       Turn K9s header off
  -h, --help                           help for k9s
      --insecure-skip-tls-verify       If true, the server's caCertFile will not be checked for validity
      --kubeconfig string              Path to the kubeconfig file to use for CLI requests
      --logFile string                 Specify the log file (default "/root/.local/state/k9s/k9s.log")
  -l, --logLevel string                Specify a log level (error, warn, info, debug, trace) (default "info")
      --logoless                       Turn K9s logo off
  -n, --namespace string               If present, the namespace scope for this CLI request
      --readonly                       Sets readOnly mode by overriding readOnly configuration setting
  -r, --refresh int                    Specify the default refresh rate as an integer (sec) (default 2)
      --request-timeout string         The length of time to wait before giving up on a single server request
      --screen-dump-dir string         Sets a path to a dir for a screen dumps
      --token string                   Bearer token for authentication to the API server
      --user string                    The name of the kubeconfig user to use
      --write                          Sets write mode by overriding the readOnly configuration setting

Use "k9s [command] --help" for more information about a command.

Ollama, an alternative to ChatGPT, can be deployed to provide natural language processing capabilities directly inside your own environment, without depending on external cloud services.

Ollama

By integrating Ollama with your troubleshooting tools, you can generate answers and solutions based on the analysis of the logs and data from your Kubernetes cluster.
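To illustrate, once an Ollama server is running (as set up below), any troubleshooting tool can query it over its plain HTTP API. A minimal sketch against the native /api/generate endpoint, assuming a server on localhost and the model pulled later in this article (the prompt is only an example):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b-instruct-q4_K_S",
  "prompt": "Explain this Kubernetes event: Back-off restarting failed container",
  "stream": false
}'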

To run it, I turn to RunPod, a platform for executing natural language processing and other AI workloads. RunPod lets you create custom pod environments to run language models served by Ollama or other AI applications:

RunPod - The Cloud Built for AI

Creating a GPU Pod, which will let me run Ollama…

Set up Ollama on your GPU Pod | RunPod Documentation

I can connect to it over SSH:

Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 6.5.0-44-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

[RunPod banner]

For detailed documentation and guides, please visit:
https://docs.runpod.io/ and https://blog.runpod.io/

root@5ed8df208cf4:~# nvidia-smi
Tue Jan 14 22:03:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     On  |   00000000:81:00.0 Off |                  N/A |
|  0%   28C    P8             11W /  285W |       2MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Running Ollama (bound to 0.0.0.0:11434 so that it can be reached through the RunPod proxy):

root@5ed8df208cf4:~# apt update 2> /dev/null && apt install -qq lshw -y 2> /dev/null
root@5ed8df208cf4:~# export OLLAMA_HOST=0.0.0.0:11434
root@5ed8df208cf4:~# (curl -fsSL https://ollama.com/install.sh | sh && ollama serve > ollama.log 2>&1) &
[1] 950
root@5ed8df208cf4:~# >>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

root@5ed8df208cf4:~# netstat -tunlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:7861            0.0.0.0:*               LISTEN      52/nginx: master pr
tcp        0      0 0.0.0.0:8081            0.0.0.0:*               LISTEN      52/nginx: master pr
tcp        0      0 0.0.0.0:8001            0.0.0.0:*               LISTEN      52/nginx: master pr
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      70/sshd: /usr/sbin/
tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN      52/nginx: master pr
tcp        0      0 0.0.0.0:9091            0.0.0.0:*               LISTEN      52/nginx: master pr
tcp        0      0 127.0.0.11:39145        0.0.0.0:*               LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      70/sshd: /usr/sbin/
tcp6       0      0 :::11434                :::*                    LISTEN      1006/ollama
udp        0      0 127.0.0.11:33663        0.0.0.0:*                           -

Pulling an LLM, Llama 3.2:

root@5ed8df208cf4:~# ollama pull llama3.2:3b-instruct-q4_K_S
pulling manifest
pulling d5e517daeee4... 100% ▕████████████████▏ 1.9 GB
pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB
pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB
pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB
pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B
pulling 9c65e8607c0c... 100% ▕████████████████▏  561 B
verifying sha256 digest
writing manifest
success
root@5ed8df208cf4:~# ollama list
NAME                          ID              SIZE      MODIFIED
llama3.2:3b-instruct-q4_K_S   80f2089878c9    1.9 GB    31 seconds ago
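A quick smoke test of the freshly pulled model, directly from the pod (the prompt is an arbitrary example):

root@5ed8df208cf4:~# ollama run llama3.2:3b-instruct-q4_K_S "Say hello in one short sentence."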

Ollama's endpoint is now publicly reachable through the proxy that RunPod provides:
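A simple way to check that the proxied endpoint answers is to list the models it serves; /api/tags is Ollama's model-listing route, and the URL below reuses the pod's proxy address that is configured as OPENAI_API_BASE further down:

curl https://vsr6spvysc6jly-11434.proxy.runpod.net/api/tags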

Next, K9s is modified to wire in this endpoint and HolmesGPT as a plug-in:

root@k0s-incus:~# cat ~/.config/k9s/plugins.yaml plugins: holmesgpt: shortCut: Shift-H description: Ask HolmesGPT scopes: - all command: bash background: false confirm: false args: - -c - | holmes ask "why is $NAME of $RESOURCE_NAME in -n $NAMESPACE not working as expected" --model="openai/llama3.2:3b-instruct-q4_K_S" echo "Press 'q' to exit" while : ; do read -n 1 k <&1 if [[$k = q]] ; then break fi done root@k0s-incus:~# export OPENAI_API_BASE="https://vsr6spvysc6jly-11434.proxy.runpod.net/v1" root@k0s-incus:~# export OPENAI_API_KEY=123 
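The openai/ prefix in the model name tells HolmesGPT (which routes requests through LiteLLM) to speak the OpenAI-compatible protocol against OPENAI_API_BASE, which is exactly what Ollama exposes under /v1; the API key is a dummy value since Ollama does not check it. The plug-in can also be sanity-checked outside K9s by invoking holmes directly (the question is an example):

root@k0s-incus:~# holmes ask "what pods are unhealthy in this cluster and why?" --model="openai/llama3.2:3b-instruct-q4_K_S"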

Deploying a deliberately broken sample Pod into the Kubernetes cluster, using the demos provided by Robusta:

root@k0s-incus:~# kubectl apply -f https://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/crashpod/broken.yaml
deployment.apps/payment-processing-worker created
root@k0s-incus:~# kubectl get po
NAME                                         READY   STATUS             RESTARTS     AGE
payment-processing-worker-747ccfb9db-njgmw   0/1     CrashLoopBackOff   1 (4s ago)   9s

I can fire the HolmesGPT plug-in query with the Shift-H shortcut defined above and get this first answer:


The payment-processing-container container has crashed and is being restarted for the 6th time due to a CrashLoopBackOff. The last state indicates that the container was terminated with an exit code of 0, which suggests that the command executed successfully but did not complete as expected.

To investigate further, you can check the logs of the payment-processing-container container to see if there are any error messages or clues about what is causing the issue. You can also check the Kubernetes events for any other errors or warnings that may be related to this issue.

Additionally, you can try to debug the command executed by the payment-processing-container container to see if it's correct and working as expected. The command is:

if [[ -z "${DEPLOY_ENV}" ]]; then echo Environment variable DEPLOY_ENV is undefined ; else while true; do echo hello; sleep 10; done; fi

This command checks if the DEPLOY_ENV environment variable is set, and if it's not, it prints a message. If it is set, it enters an infinite loop that prints "hello" every 10 seconds.

If you're running this container in a Kubernetes pod, you can try to debug the issue by checking the pod's logs or using a tool like kubectl to inspect the container's state and logs.

Press 'q' to exit
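Given this diagnosis, a plausible fix is simply to define the missing DEPLOY_ENV variable on the deployment (the value here is an arbitrary example) and watch the pod recover:

root@k0s-incus:~# kubectl set env deployment/payment-processing-worker DEPLOY_ENV=production
root@k0s-incus:~# kubectl get po -w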

Tweaking the query yields another answer:

root@k0s-incus:~# cat ~/.config/k9s/plugins.yaml
plugins:
  holmesgpt:
    shortCut: Shift-H
    description: Ask HolmesGPT
    scopes:
      - all
    command: bash
    background: false
    confirm: false
    args:
      - -c
      - |
        holmes ask "why is $NAME of $RESOURCE_NAME in -n $NAMESPACE not working and why $NAME is crashed?" --model="openai/llama3.2:3b-instruct-q4_K_S"
        echo "Press 'q' to exit"
        while : ; do
          read -n 1 k <&1
          if [[ $k = q ]] ; then
            break
          fi
        done

The payment-processing-container container has crashed and is being restarted for the 6th time due to a CrashLoopBackOff. The last state indicates that the container was terminated with an exit code of 0, which suggests that the command executed successfully but did not produce any output.

To investigate further, you can check the logs of the payment-processing-container container to see if there are any error messages or clues about what is causing the crash:

kubectl logs payment-processing-worker-747ccfb9db-njgmw -c payment-processing-container

Additionally, you can check the configuration of the payment-processing-container container to ensure that it is running with the correct environment variables and settings.

kubectl describe pod payment-processing-worker-747ccfb9db-njgmw -c payment-processing-container

This will provide more detailed information about the container's configuration and any errors that may be occurring.

HolmesGPT can also integrate more broadly with the Robusta platform, installed into the Kubernetes cluster with Helm…

AI Analysis - Robusta documentation

To do so, a YAML values file of this kind is generated (a sketch of how to produce it follows the listing):

root@k0s-incus:~# cat generated_values.yaml
globalConfig:
  signing_key: 568927d5-6e65-4c13-b3fe-fdc50e616fde
  account_id: a4d7cea6-fba3-4ce6-ba3d-941b55ec83db
sinksConfig:
- robusta_sink:
    name: robusta_ui_sink
    token: <TOKEN>
enablePrometheusStack: true
kube-prometheus-stack:
  grafana:
    persistence:
      enabled: true
enablePlatformPlaybooks: true
runner:
  sendAdditionalTelemetry: true
enableHolmesGPT: true
holmes:
  additionalEnvVars:
  - name: ROBUSTA_AI
    value: "true"
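For the record, a values file like this is normally produced by the Robusta CLI's interactive wizard; a sketch of that flow, assuming the documented robusta-cli package (prompts and output may vary by version):

pip install -U robusta-cli --no-cache
robusta gen-config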

Using the commands and the YAML configuration file provided by the Robusta platform:

root@k0s-incus:~# helm repo add robusta https://robusta-charts.storage.googleapis.com && helm repo update
"robusta" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "robusta" chart repository
Update Complete. ⎈Happy Helming!⎈
root@k0s-incus:~# helm install robusta robusta/robusta -f ./generated_values.yaml --set clusterName="k0s-cluster" \
  --set isSmallCluster=true \
  --set holmes.resources.requests.memory=512Mi \
  --set kube-prometheus-stack.prometheus.prometheusSpec.retentionSize=9GB \
  --set kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
  --set kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory=512Mi
NAME: robusta
LAST DEPLOYED: Tue Jan 14 22:59:09 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Thank you for installing Robusta 0.20.0

As an open source project, we collect general usage statistics.
This data is extremely limited and contains only general metadata to help us understand usage patterns.
If you are willing to share additional data, please do so! It really help us improve Robusta.

You can set sendAdditionalTelemetry: true as a Helm value to send exception reports and additional data.
This is disabled by default.

To opt-out of telemetry entirely, set a ENABLE_TELEMETRY=false environment variable on the robusta-runner deployment.
Note that if the Robusta UI is enabled, telemetry cannot be disabled even if ENABLE_TELEMETRY=false is set.

Visit the web UI at: https://platform.robusta.dev/

root@k0s-incus:~# helm ls -A
NAME      NAMESPACE   REVISION   UPDATED                                   STATUS     CHART            APP VERSION
robusta   default     2          2025-01-14 23:10:13.935906491 +0000 UTC   deployed   robusta-0.20.0   0.20.0

root@k0s-incus:~# kubectl get po,svc
NAME                                                          READY   STATUS             RESTARTS         AGE
pod/alertmanager-robusta-kube-prometheus-st-alertmanager-0    0/2     Pending            0                2m
pod/payment-processing-worker-747ccfb9db-njgmw                0/1     CrashLoopBackOff   10 (2m33s ago)   28m
pod/prometheus-robusta-kube-prometheus-st-prometheus-0        0/2     Pending            0                2m
pod/robusta-forwarder-cd847ccc-wxc6d                          1/1     Running            0                2m5s
pod/robusta-grafana-8588b8fb85-fv5vj                          3/3     Running            0                2m5s
pod/robusta-holmes-55dd58ff6d-m4zth                           1/1     Running            0                2m5s
pod/robusta-kube-prometheus-st-operator-6885c8f675-szncg      1/1     Running            0                2m5s
pod/robusta-kube-state-metrics-8667fd9775-s49z4               1/1     Running            0                2m5s
pod/robusta-prometheus-node-exporter-c6jvb                    1/1     Running            0                2m5s
pod/robusta-prometheus-node-exporter-j6zp5                    1/1     Running            0                2m5s
pod/robusta-runner-5d667b7d9c-dm2z7                           1/1     Running            0                2m5s

NAME                                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-operated                      ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   2m1s
service/kubernetes                                 ClusterIP   10.96.0.1        <none>        443/TCP                      94m
service/prometheus-operated                        ClusterIP   None             <none>        9090/TCP                     2m1s
service/robusta-forwarder                          ClusterIP   10.102.7.41      <none>        80/TCP                       2m5s
service/robusta-grafana                            ClusterIP   10.106.69.72     <none>        80/TCP                       2m5s
service/robusta-holmes                             ClusterIP   10.110.124.241   <none>        80/TCP                       2m5s
service/robusta-kube-prometheus-st-alertmanager    ClusterIP   10.105.101.210   <none>        9093/TCP,8080/TCP            2m5s
service/robusta-kube-prometheus-st-operator        ClusterIP   10.103.213.208   <none>        443/TCP                      2m5s
service/robusta-kube-prometheus-st-prometheus      ClusterIP   10.107.13.104    <none>        9090/TCP,8080/TCP            2m5s
service/robusta-kube-state-metrics                 ClusterIP   10.103.53.30     <none>        8080/TCP                     2m5s
service/robusta-prometheus-node-exporter           ClusterIP   10.102.243.65    <none>        9104/TCP                     2m5s
service/robusta-runner                             ClusterIP   10.97.82.15      <none>        80/TCP                       2m5s

I can then complete the full installation with this command:

root@k0s-incus:~# helm upgrade robusta robusta/robusta -f ./generated_values.yaml --set clusterName="k0s-cluster"
Release "robusta" has been upgraded. Happy Helming!
NAME: robusta
LAST DEPLOYED: Tue Jan 14 23:14:02 2025
NAMESPACE: default
STATUS: deployed
REVISION: 5
NOTES:
Thank you for installing Robusta 0.20.0

As an open source project, we collect general usage statistics.
This data is extremely limited and contains only general metadata to help us understand usage patterns.
If you are willing to share additional data, please do so! It really help us improve Robusta.

You can set sendAdditionalTelemetry: true as a Helm value to send exception reports and additional data.
This is disabled by default.

To opt-out of telemetry entirely, set a ENABLE_TELEMETRY=false environment variable on the robusta-runner deployment.
Note that if the Robusta UI is enabled, telemetry cannot be disabled even if ENABLE_TELEMETRY=false is set.

Visit the web UI at: https://platform.robusta.dev/

The cluster shows up in Robusta:

And here too, via HolmesGPT, the platform can be queried about any issues encountered in the Kubernetes cluster:

All of this with a modest resource footprint in the cluster…

Using AI for troubleshooting and incident analysis reduces the time and human effort required, freeing teams to focus on more strategic work.

Tools like HolmesGPT and Ollama can be scaled with demand, which is particularly useful in production environments where the workload can vary significantly.

We can therefore conclude that integrating AI into Kubernetes clusters with tools like HolmesGPT and Ollama, backed by a GPU instance provider such as RunPod, brings significant benefits in terms of efficiency, scalability and fault tolerance.

These technologies help streamline the application lifecycle, simplify troubleshooting and improve resource management, making Kubernetes operations more robust and more performant…

To be continued!
