More playing around with AI development. I decided to see if I could build something useful. Since I work with Kubernetes, I thought I'd build some sort of pod troubleshooter.
What It Does
This script does the heavy lifting in Kubernetes pod debugging:
- Scans a namespace for failing pods
- Collects diagnostics: `describe`, `logs`, and `events`
- Sends the info to Ollama running `llama3`
- Displays a full AI-generated root cause analysis
Requirements
Install dependencies
```bash
pip install rich requests
```
Install and Start Ollama
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3
ollama run llama3
```
How It Works
First we pull in the imports and create a helper function that runs shell commands passed in as strings:
```python
import argparse
import subprocess
import json
import requests
from rich.console import Console
from rich.panel import Panel

# Initialize rich
console = Console()


def run_command(command):
    """
    Runs a shell command.

    If the command succeeds, it returns stdout. If it fails with a
    CalledProcessError (like a normal kubectl error), it returns stderr
    so the AI can analyze the error message. For other exceptions, it
    prints an error and returns an error string.
    """
    try:
        # Using shell=True for simplicity, but be cautious with untrusted
        # input. Not sanitizing command line input here.
        result = subprocess.run(
            command,
            shell=True,
            check=True,
            capture_output=True,
            text=True,
            timeout=30
        )
        return result.stdout
    except subprocess.CalledProcessError as e:
        # This is an expected failure (e.g., `logs --previous` on a new pod).
        # We return the error message for the AI to analyze.
        return e.stderr
    except Exception as e:
        # For any other kind of error, log it and return a failure message.
        # This prevents the whole script from crashing.
        error_message = f"An unexpected error occurred running command: {command}\nError: {e}"
        console.print(f"[bold red]{error_message}[/bold red]")
        return error_message
```
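To see the helper's behavior in isolation, here's a trimmed-down version without the `rich` dependency (the example commands are just illustrations; any POSIX shell command works):

```python
import subprocess


def run_command(command):
    """Trimmed helper: returns stdout on success, stderr on failure."""
    try:
        result = subprocess.run(
            command, shell=True, check=True,
            capture_output=True, text=True, timeout=30
        )
        return result.stdout
    except subprocess.CalledProcessError as e:
        # A failing command returns its stderr for later analysis.
        return e.stderr


print(run_command("echo hello"))              # hello
print(run_command("echo oops 1>&2; exit 1"))  # oops (captured from stderr)
```

The key design choice is that a non-zero exit code is not treated as fatal: the error text becomes data for the AI to analyze.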
Finding Failing Pods
Next, we create a function to find failing pods:
```python
def get_failing_pods(namespace):
    """Finds all pods in a namespace that are not in a 'Running' or 'Succeeded' state."""
    console.print(f"\n[bold cyan]🔍 Searching for failing pods in namespace '{namespace}'...[/bold cyan]")
    command = f"kubectl get pods -n {namespace} -o json"
    pods_json_str = run_command(command)

    if not pods_json_str:
        console.print(f"[bold red]Could not fetch pods in namespace '{namespace}'. No output received.[/bold red]")
        return []

    try:
        pods_data = json.loads(pods_json_str)
    except json.JSONDecodeError:
        console.print(f"[bold red]Failed to parse JSON output from kubectl for namespace '{namespace}'.[/bold red]")
        # The output might be an error message from kubectl
        console.print(f"Received: {pods_json_str}")
        return []

    failing_pods = []
    for pod in pods_data.get("items", []):
        pod_name = pod.get("metadata", {}).get("name")
        phase = pod.get("status", {}).get("phase")

        # Pod phase is not Running or Succeeded
        if phase not in ["Running", "Succeeded"]:
            failing_pods.append(pod_name)
            continue

        # Pod phase is Running, but its containers are not ready or have crashed.
        if phase == "Running":
            container_statuses = pod.get("status", {}).get("containerStatuses", [])
            if not container_statuses:
                # A running pod should have container statuses. If not, it's an issue.
                failing_pods.append(pod_name)
                continue
            for container in container_statuses:
                # If any container is not ready, the pod is considered failing.
                if not container.get("ready", False):
                    failing_pods.append(pod_name)
                    break  # Found a failing container, no need to check others in this pod.
                # Explicitly check for terminated state with non-zero exit code
                terminated_state = container.get("state", {}).get("terminated")
                if terminated_state and terminated_state.get("exitCode", 0) != 0:
                    failing_pods.append(pod_name)
                    break

    # Remove duplicates that might occur if a pod is both 'not ready' and has other issues.
    return sorted(list(set(failing_pods)))
```
The script runs:

```bash
kubectl get pods -n <namespace> -o json
```

It flags any pod whose phase is not `Running` or `Succeeded`, and for `Running` pods it also checks for:
- Missing or unhealthy container statuses
- Containers that are not `ready`
- Terminated containers with non-zero exit codes
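That filtering logic can be exercised without a cluster. Here's a sketch of the same checks as a standalone predicate, run against two hand-made pod dicts (the sample data is invented for illustration, shaped like `kubectl get pods -o json` output):

```python
def is_failing(pod):
    """Mirrors the script's checks on a single pod's JSON dict."""
    phase = pod.get("status", {}).get("phase")
    if phase not in ["Running", "Succeeded"]:
        return True  # Pending, Failed, Unknown, etc.
    if phase == "Running":
        statuses = pod.get("status", {}).get("containerStatuses", [])
        if not statuses:
            return True  # a Running pod with no container statuses is suspect
        for container in statuses:
            if not container.get("ready", False):
                return True
            terminated = container.get("state", {}).get("terminated")
            if terminated and terminated.get("exitCode", 0) != 0:
                return True
    return False


# Hypothetical sample pods for illustration
crashing = {"status": {"phase": "Running", "containerStatuses": [
    {"ready": False, "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}}
healthy = {"status": {"phase": "Running", "containerStatuses": [
    {"ready": True, "state": {"running": {}}}]}}

print(is_failing(crashing))  # True
print(is_failing(healthy))   # False
```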
Collect Diagnostics
Next, we collect diagnostics for each failing pod:
```python
def get_pod_diagnostics(pod_name, namespace="default"):
    """Gathers describe, logs, and events for a given pod."""
    console.print(f"[bold cyan]🔍 Gathering diagnostics for pod '{pod_name}' in namespace '{namespace}'...[/bold cyan]")
    diagnostics = {}

    # Get pod description
    diagnostics["describe"] = run_command(f"kubectl describe pod {pod_name} -n {namespace}")

    # Get pod logs (including previous container if it crashed)
    diagnostics["logs"] = run_command(f"kubectl logs {pod_name} -n {namespace} --all-containers=true")
    diagnostics["previous_logs"] = run_command(
        f"kubectl logs {pod_name} -n {namespace} --all-containers=true --previous")

    # Get relevant events in the namespace
    uid_command = f"kubectl get pod {pod_name} -n {namespace} -o jsonpath='{{.metadata.uid}}'"
    pod_uid = run_command(uid_command).strip()
    if pod_uid and 'error' not in pod_uid.lower():
        diagnostics["events"] = run_command(
            f"kubectl get events -n {namespace} --field-selector involvedObject.uid={pod_uid}")
    else:
        diagnostics["events"] = "Could not retrieve pod UID to filter events."

    return diagnostics
```
For each failing pod, it gathers:
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --all-containers
kubectl logs <pod-name> -n <namespace> --all-containers --previous
```
And also filters relevant events:
```bash
kubectl get events -n <namespace> --field-selector involvedObject.uid=<uid>
```
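Since the function is just string-formatting `kubectl` invocations, you can check the exact commands it would run without touching a cluster. A small sketch (the pod name, namespace, and UID here are made up for illustration):

```python
def build_diagnostic_commands(pod_name, namespace, pod_uid):
    """Returns the kubectl command strings the diagnostics step will run."""
    return {
        "describe": f"kubectl describe pod {pod_name} -n {namespace}",
        "logs": f"kubectl logs {pod_name} -n {namespace} --all-containers=true",
        "previous_logs": f"kubectl logs {pod_name} -n {namespace} --all-containers=true --previous",
        "events": f"kubectl get events -n {namespace} --field-selector involvedObject.uid={pod_uid}",
    }


# Hypothetical values for illustration
cmds = build_diagnostic_commands("web-7d4b9", "staging", "1234-abcd")
print(cmds["events"])
# kubectl get events -n staging --field-selector involvedObject.uid=1234-abcd
```

Filtering events by the pod's UID rather than its name avoids picking up events from an older pod that happened to reuse the same name.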
Analyze with Ollama
Next we create an AI prompt from the diagnostic data and send it to the locally running Ollama instance:
```python
def analyze_with_ollama(diagnostics, pod_name):
    """Sends diagnostics to Ollama for analysis and streams the response."""
    console.print(f"[bold cyan]🧠 Analyzing diagnostics for '{pod_name}' with Ollama...[/bold cyan]")

    prompt = f"""
You are an expert Kubernetes Site Reliability Engineer (SRE). Your task is to diagnose a failing pod based on the following `kubectl` outputs.

Analyze the provided data and perform the following steps:
1. **Identify the Root Cause:** State the most likely root cause of the problem in a single, concise sentence.
2. **Provide a Detailed Explanation:** Explain your reasoning. Reference specific lines from the logs, events, or pod description to support your conclusion.
3. **Suggest a Next Step:** Recommend a single, concrete `kubectl` command or action for the user to take next to fix or further investigate the issue.

Here is the diagnostic information:
---
### 1. KUBECTL DESCRIBE POD:
{diagnostics.get('describe', 'Not available')}
---
### 2. KUBECTL LOGS (Current Container):
{diagnostics.get('logs', 'Not available')}
---
### 3. KUBECTL LOGS (Previous Container):
{diagnostics.get('previous_logs', 'Not available')}
---
### 4. KUBECTL GET EVENTS:
{diagnostics.get('events', 'Not available')}
---
Provide your analysis in a clear, easy-to-read format.
"""

    # Use a Panel to visually separate the analysis for each pod
    panel = Panel(
        f"[italic]Waiting for AI response for [bold]{pod_name}[/bold]...[/italic]",
        title=f"[bold green]AI Diagnosis for {pod_name}[/bold green]",
        border_style="green"
    )
    console.print(panel)

    full_response = ""
    try:
        with requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3",
                "prompt": prompt,
                "stream": True
            },
            stream=True,
            timeout=300
        ) as response:
            response.raise_for_status()
            console.print(f"\n[bold green]AI Analysis for {pod_name}:[/bold green]")
            for line in response.iter_lines():
                if line:
                    chunk = json.loads(line)
                    content = chunk.get("response", "")
                    full_response += content
                    console.print(content, end="", style="white")
                    if chunk.get("done"):
                        console.print()
                        break
        return full_response
    except requests.exceptions.RequestException as e:
        console.print(f"\n[bold red]Error connecting to Ollama API: {e}[/bold red]")
        console.print("Please ensure Ollama is running and the 'llama3' model is available (`ollama pull llama3`).")
        return None
```
It streams this prompt to Ollama's local API running the `llama3` model and prints the response as it arrives.
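Ollama's streaming API returns one JSON object per line, which the script reassembles chunk by chunk. Here's a minimal sketch of that reassembly on canned data, so it runs without a server (the chunk contents are invented for illustration, but follow the shape `/api/generate` streams back):

```python
import json

# Canned lines in the shape Ollama's /api/generate streams back.
canned_stream = [
    b'{"response": "The pod is in ", "done": false}',
    b'{"response": "CrashLoopBackOff.", "done": false}',
    b'{"response": "", "done": true}',
]

full_response = ""
for line in canned_stream:
    chunk = json.loads(line)       # each line is a complete JSON object
    full_response += chunk.get("response", "")
    if chunk.get("done"):          # the final chunk signals completion
        break

print(full_response)  # The pod is in CrashLoopBackOff.
```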
Bring it all together
In the main function we:
- print a banner
- handle the namespace argument, prompting interactively if it was omitted
- call `get_failing_pods(namespace)` to find the failing pods
- call `get_pod_diagnostics(pod_name, namespace)` for each failing pod to gather the `kubectl` diagnostic information
- call `analyze_with_ollama(pod_diagnostics, pod_name)` to build the AI prompt from the diagnostic data and query the Ollama API endpoint for answers
```python
if __name__ == "__main__":
    console.print(f"[bold cyan]888 .d8888b. 888 888 [/bold cyan] ")
    console.print(f"[bold cyan]888 d88P Y88b 888 888 [/bold cyan]")
    console.print(f"[bold cyan]888 Y88b. d88P 888 888 [/bold cyan]")
    console.print(f"[bold cyan]888 888 \"Y88888\" .d8888b .d88888 .d88b. .d8888b888888 .d88b. 888d888\" [/bold cyan]")
    console.print(f"[bold cyan]888 .88P.d8P\"\"Y8b.88K d88\" 888d88\"\"88bd88P\" 888 d88\"\"88b888P\" [/bold cyan]")
    console.print(f"[bold cyan]888888K 888 888\"Y8888b.888888 888 888888 888888 888 888 888888 [/bold cyan]")
    console.print(f"[bold cyan]888 \"88bY88b d88P X88 Y88b 888Y88..88PY88b. Y88b. Y88..88P888 [/bold cyan]")
    console.print(f"[bold cyan]888 888 \"Y8888P\" 88888P' \"Y88888 \"Y88P\" \"Y8888P \"Y888 \"Y88P\" 888 [/bold cyan] ")
    console.print(" ")

    parser = argparse.ArgumentParser(
        description="AI-powered Kubernetes Pod Troubleshooter.",
        formatter_class=argparse.RawTextHelpFormatter
    )
    parser.add_argument("-n", "--namespace", help="The namespace to inspect.")
    args = parser.parse_args()

    namespace = args.namespace
    if not namespace:
        namespace = console.input("[bold yellow]Please enter the Kubernetes namespace: [/bold yellow]")
        if not namespace:
            console.print("[bold red]Namespace cannot be empty. Exiting.[/bold red]")
            exit(1)

    # Automatically find all failing pods
    failing_pods = get_failing_pods(namespace)

    if not failing_pods:
        console.print(f"[bold green]✅ No failing pods found in namespace '{namespace}'.[/bold green]")
        exit(0)

    console.print(f"[bold yellow]Found {len(failing_pods)} failing pod(s): {', '.join(failing_pods)}[/bold yellow]")

    # Loop through each failing pod and diagnose it
    for pod_name in failing_pods:
        pod_diagnostics = get_pod_diagnostics(pod_name, namespace)
        if pod_diagnostics:
            analyze_with_ollama(pod_diagnostics, pod_name)
        console.print("-" * 80)  # Separator for clarity
```
Running the script
```bash
python k8s-doctor.py -n my-namespace
```
If you omit `-n`, the script will prompt you to enter a namespace interactively.
Code lives here
View the full Python script on GitHub
To summarize: I'm using this script to automate log and event gathering for failing pods, build a prompt from that data, and return an LLM's analysis. Is this a wrapper for AI? Is this automation? A bit of both, I think.