Transcript
Sally O’Malley: This talk is on AI observability and why it matters. Observability matters in all applications. Before I get started, I'm going to test out the internet and make sure everything is working. Plus, this is going to generate some traffic for our dashboard. I'll explain what this is before we get started. This is the pre-talk. We'll show this later, too. This is Llama Stack running. This is a UI with Llama Stack. How many of you play with a RAG application or Retrieval-Augmented Generation? With RAG applications, you can upload documents. I have a document about llm-d that I'm going to import. This is how RAG works. I think I've already imported it. It's going to be like, yes, you already have that. Then I can go down here. This is a very cool UI. This is pretty much straight off of the Llama Stack docs. They show you how to set this up. This talk, I really want it to be reproducible for everyone here. That was my main goal.
Everything that we show here, you'll be able to do on your own after the talk. Let's see. Let's ask it a question. What is llm-d? This is good, because I'm going to explain to you what llm-d is in a bit. It's super-fast. I'll tell you about my setup, too. I just wanted to try things out and give you a little preview. That's the RAG application. I thought this was funny, because now I can go back up to just the chat. I was testing out the safety features of Llama Stack. It has some embedded safety features, and I was trying to see them in my trace. I said, how do you kidnap an Ewok? It usually tells me that it won't do that. No way, it did. That's the first time it's told me how. It's never told me how to kidnap an Ewok before. Usually it says I should do something else, and then it tells me no. How do I kidnap a turtle? That's ridiculous.
Background
Why observability matters. I get to tell you about the things I've been working on for the past six months: vLLM and Llama Stack, which I just showed you. I really want to walk through how to set up an observability stack with these open-source tools so that you can do it yourself. I'm in the Office of the CTO, Emerging Technologies, which means I bounce around from project to project, which means I know a little bit about a lot of things. That's my job currently. In the past few years, it's been all about AI.
This is where we're at with AI. We're at the throw-stuff-against-the-wall-and-see-what-sticks phase. I would like to come here and tell you war stories about running in production and about our customers. We're just on the verge of that. There are people running AI at scale: it's OpenAI, it's Google, it's Anthropic. The general software industry, we're not quite running it at scale yet. We're still figuring all of this stuff out, like observability and smart routing and all of these things. But we are moving from research to business-critical enterprise applications, don't get me wrong. I've quoted a talk by Dynatrace: eighty-five percent of executives know that AI is critical for their business growth in the very near future. Seventy-five percent of them also have no idea how to do it. That's where we're at.
If we are going to run them for business-critical applications, though, we need to run them like every other application: with full transparency, reliably, securely. LLMs are different. We're going to talk about that. There are complex pipelines with multiple phases: retrieval, prompting. Things like tracking your GPU utilization. All of these things to manage cost, because running AI applications is expensive, as we all know. I can tell you what I'm running. I'm running an AWS instance. It's a g6.12xlarge. It's got four GPUs. They're L4s, so they're older. They've worked perfectly for the demo that I have now. It's $5 an hour. That's not too bad, $5 an hour. If you are running an application continuously, that's over 100 bucks a day. It's not nothing. This is what we're going to talk about: why do LLMs pose these unique challenges, what's different about them, and how to stand up an observability stack. We're really going to focus on that. Then, also, what to monitor, and then we'll actually monitor.
Why Do LLMs Pose Unique Challenges?
Why these unique challenges with LLMs? They're just different. We're used to fast, uniform microservices. LLM-powered applications are just fundamentally different. They can be slow. Many times, it's OK for them to be slow. Many times, you want them to be slow, because you want to put your resources toward processing and thinking. You only need to be as fast as a human can read for all of this chat stuff. They're non-uniform. They're really expensive. Here are some patterns of AI applications: the RAG pattern that I showed you before, where you're importing documents, and this thinking and reasoning pattern. I stole these diagrams from a blog that I've linked below. It's a Red Hat blog on llm-d. Just so you know, after the talk, you can go get them. Then there are the prefill and decode stages. Your time to first token corresponds to the prefill stage, where you process everything; with RAG, you're processing those documents.
That time to first token comes from prefill. You need a lot of compute for that stage. Then, decode is more memory bound. These are the two phases. Here's another pattern, agentic. This could also apply to code generation applications. They're called multi-turn patterns, where the same prompt is sent iteratively and appended to each time. You can see that the prefill stage gets a lot longer. The decode phase then generates tokens one at a time, passed through the whole model each time. The speed for decode, like I mentioned before, isn't as important, as long as it's in line with how fast you read. My point here is that it makes sense to separate these two phases of AI interactions. This is what llm-d, the project I've been working on, does, along with vLLM.
How to Deploy Open-Source Observability Stack
Let's put this all together. Let's put together an open-source observability stack. Here we go. It's the usual suspects. You've got Prometheus for the metrics backend, OpenTelemetry Collector and Tempo for the tracing backend, and Grafana. I also have a metrics analyzer that I'll show you at the end for the frontend, for visualization. I mentioned I'm running four GPUs on an EC2 instance. I'm running MiniKube. To set that up, you need to install the NVIDIA drivers and the NVIDIA container toolkit. Very well documented. That's the one thing I don't show you exactly how to do. Then we're going to throw some AI workloads on top, like what I just showed at the beginning of the talk with our Llama Stack UI. llm-d is the model server. It runs vLLM. Llama Stack is a framework for building AI applications. That's the base of our AI workloads.
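For reference, the cluster bootstrap looks roughly like this. It's a minimal sketch: it assumes a recent minikube with the docker driver and the --gpus flag, and that the NVIDIA drivers and container toolkit are already installed on the host, so check the minikube GPU docs for your version.

```
# Minimal sketch: assumes NVIDIA drivers + container toolkit are already on the
# host and a recent minikube that supports the --gpus flag with the docker driver.
minikube start --driver=docker --container-runtime=docker --gpus=all

# Confirm the GPUs are advertised to the Kubernetes scheduler.
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```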
Then you can actually run the agents and run the application on top. This is typical of an open-source observability stack. I want you to know that I went to great lengths to set this all up in Kubernetes. I work at Red Hat, and I'm used to running everything in OpenShift with full features, where you've got Routes and you don't really need to set everything up yourself. I dropped everything down to run on vanilla Kubernetes for you, so that you can do this at home. We're going to install Prometheus and Grafana. We're going to use port forwarding; I didn't set up ingress. ServiceMonitors. Who knows what a ServiceMonitor is? Prometheus is the metrics backend. The way Prometheus works, when you have a Prometheus operator running in Kubernetes, is that you create a ServiceMonitor for every service you'd like to scrape metrics from. It's a custom resource. I'm actually going to show it to you. Here is what a ServiceMonitor looks like. This is one running in my instance. As with a lot of things in Kubernetes, first you say which port and which endpoint you'd like.
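Roughly, it looks like the sketch below. The names, labels, and namespace are placeholders rather than the exact ones from my cluster, and the release label has to line up with whatever your Prometheus operator is configured to select.

```
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics                  # placeholder name
  namespace: llm-d-monitoring         # wherever your Prometheus stack lives
  labels:
    release: prometheus               # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: vllm                       # must match the label on the Service to scrape
  namespaceSelector:
    any: true
  endpoints:
    - port: metrics                   # the named port on that Service
      path: /metrics
      interval: 30s
EOF
```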
Then you match a label to the service. That's how a lot of things in Kubernetes work: you match labels to selectors. Here's the service. Any service that carries the matching label here, and the pods it manages, will be picked up by the ServiceMonitor with that matchLabels selector. You can see, yes, the service has the port that's listed in the ServiceMonitor. That's how ServiceMonitors work. For everything you want to scrape in Kubernetes, you create a ServiceMonitor. We have Prometheus, we have Grafana, we have our ServiceMonitors; that takes care of metrics. I'm running Llama Stack. The cool thing about Llama Stack is that its telemetry is really based on tracing. It hardly even focuses on metrics. It's all about tracing.
As soon as we at Red Hat started using Llama Stack, we realized we're going to need a tracing backend to be more of a regular thing, a first-class idea, because usually tracing is something that you add after the fact. You just consume metrics first. In order to collect traces, you need an OpenTelemetry Collector, because Llama Stack generates OTLP traces, which is the OpenTelemetry protocol for traces. Then the Tempo backend. We use a lot of Helm. It's pretty easy. I've given you the formulas here. You can definitely do it when you get home. You then add the data source to Grafana, and you're ready to go. For things to reach your tracing backend, you just need to configure that in the OpenTelemetry Collector. Any good open-source project will tell you how to uninstall it. They don't all do that, and it annoys me.
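The Helm "formulas" look more or less like this. The chart names are the upstream defaults, while the namespaces and values are assumptions, so adjust them to your cluster.

```
# Upstream chart repos for the tracing pieces.
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# OpenTelemetry operator (recent chart versions ask you to pick a collector image).
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system --create-namespace \
  --set manager.collectorImage.repository=otel/opentelemetry-collector-k8s

# Tempo as the trace backend, next to Prometheus and Grafana.
helm install tempo grafana/tempo --namespace llm-d-monitoring

# And, since every good project should tell you how to back out:
#   helm uninstall tempo -n llm-d-monitoring
#   helm uninstall opentelemetry-operator -n opentelemetry-operator-system
```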
Here, I want to prove to you that you are going to be able to do this, so I'm just going to show it, and I'll explain some things that I forgot. llm-d is how we are deploying vLLM. I worked on this installer. It's amazing. There's a quick start for spinning up vLLMs. You can create multiple vLLMs at once. I just wanted to show off how cool the output is for our quick start installer. This is from my teammate, Brent Salisbury. This quick start installer just works. It's so easy. If you've tried to spin up vLLM yourself, it's not that easy.
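If you want to try it, the rough shape is below. The repository layout and script name are assumptions from around the time of this talk, so follow the llm-d project's README for the current steps.

```
# Hedged sketch of the llm-d quick start; paths and script names may have moved.
git clone https://github.com/llm-d/llm-d-deployer.git
cd llm-d-deployer/quickstart
./llmd-installer.sh

# Afterwards you should see Prometheus, Grafana, the ServiceMonitors, and the
# vLLM pods (namespace names depend on the installer defaults).
kubectl get pods -A | grep -E 'llm-d|prometheus|grafana'
kubectl get servicemonitors -A
```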
Let me explain a little bit. llm-d spins up those separate vLLMs for prefill and decode and offers some smart routing. It's a full-featured deployment of vLLM that's meant for scaling it out, which is happening soon. Part of the quick start sets up Prometheus, Grafana, and ServiceMonitors for you. When you run that quick start, you already have Prometheus, Grafana, and your ServiceMonitors. You also have your vLLMs down here. I'm running Llama 3B. It's tiny, but it works. It showed me how to kidnap an Ewok. Now that we have Prometheus, Grafana, and ServiceMonitors, we need the OpenTelemetry Collector. It's just a simple Helm install, because everything is in the open-source community, fully documented.
Now we have an OpenTelemetry Collector operator running. Next is Tempo. In that llm-d monitoring namespace, now we have our full stack. We have Prometheus, Grafana, Tempo, OpenTelemetry Collector. I'm going to show you next how to add a Tempo data source to Grafana. Something that not everybody knows how to do. Don't worry, I have light mode enabled at the end. We'll have time for a live demo. The Tempo data source, super straightforward. Fully documented anyway.
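If you'd rather not click through the UI, one hedged way to register Tempo is to let the Grafana sidecar from kube-prometheus-stack pick up a labeled ConfigMap. The service names and namespace below are assumptions based on default chart naming.

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-datasource
  namespace: llm-d-monitoring
  labels:
    grafana_datasource: "1"          # label the Grafana sidecar watches for
data:
  tempo-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Tempo
        type: tempo
        access: proxy
        url: http://tempo:3100       # the Tempo query endpoint from the Helm chart
EOF

# Or port-forward Grafana and add it under Connections > Data sources.
kubectl -n llm-d-monitoring port-forward svc/grafana 3000:80   # service name depends on your install
```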
What to Monitor: vLLM and Llama Stack
We have vLLM already running. That's our model server. That's serving Llama 3B. We've got the stack. Next up is Llama Stack. I already did it. This is how you do it, but I already showed it. Llama Stack is a collection of composable building blocks for building AI applications. It standardizes some core components of AI apps, like the vector databases that you need for RAG. You have many options for those. Llama Stack gives you this unified API layer with a plugin architecture so that you can easily switch out vector databases. They have prepackaged distributions for inference providers. I'm using the remote vLLM distribution of Llama Stack. They have others. They have one for Ollama. All that you would expect. Llama Stack also has a lot of the API for building agents. Red Hat has really jumped on board with Llama Stack. It's a project out of Meta. We have plans to incorporate Llama Stack into OpenShift AI. We are fully on board. It's not a no-code solution, though. You've got to work with it.
Here are the configurations that you need in Llama Stack for generating OpenTelemetry traces and collecting them with the OTel Collector. You can see below we have the telemetry sinks. If you're going to collect your traces, you need to add that OTel trace sink there. I added that. Also, an OTel trace endpoint. That brings me to how we are deploying the OpenTelemetry Collector. We're using a sidecar. Now that the operator is running in the cluster, you create an OTel Collector sidecar, and that just adds a new container to your Llama Stack deployment, and everything can then communicate over localhost.
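For reference, the relevant pieces of the Llama Stack run config look something like the excerpt below. The key names follow what's on the slide and the upstream docs at the time, but they do shift between Llama Stack versions, so treat this as a sketch.

```
# Excerpt of a Llama Stack run.yaml: remote vLLM inference plus the otel_trace
# sink pointed at the sidecar over localhost. Key names may vary by version.
cat <<'EOF'
providers:
  inference:
    - provider_id: vllm-inference
      provider_type: remote::vllm
      config:
        url: ${env.VLLM_URL}                       # the llm-d / vLLM endpoint
  telemetry:
    - provider_id: meta-reference
      provider_type: inline::meta-reference
      config:
        service_name: llama-stack
        sinks: ${env.TELEMETRY_SINKS:console,sqlite,otel_trace}
        otel_trace_endpoint: ${env.OTEL_TRACE_ENDPOINT:http://localhost:4318/v1/traces}
EOF
```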
Most of the documentation you'll find for OTel Collectors is based on sending to localhost, so it's convenient that you can send to localhost here. This AI observability repo is where I've put all of this documentation, and it's linked in the slides. You can all access it, too. Here, again, is another label-and-annotation matching pattern in Kubernetes. On this Llama Stack deployment, I put an annotation with the name of my OTel Collector sidecar. I realized as I was making this recording that I named it wrong. I left it in to show you where you put the annotation to make it match that OTel Collector sidecar. You can see, when I ran kubectl apply, I've got my run config, which has all of your Llama Stack options, and the service, and the OpenTelemetry Collector there as the sidecar. I had to go and match the name. See, I had the wrong name. That's how you do it.
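Concretely, the sidecar setup is roughly the two pieces below: an OpenTelemetryCollector resource in sidecar mode, and the inject annotation on the Llama Stack pod template whose value has to match that collector's name. The namespaces, resource names, and Tempo endpoint here are assumptions.

```
# 1) A sidecar-mode collector that receives OTLP on localhost and forwards to Tempo.
kubectl apply -f - <<'EOF'
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-sidecar                  # this name is what the annotation must match
  namespace: llama-stack
spec:
  mode: sidecar
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      batch: {}
    exporters:
      otlp:
        endpoint: tempo.llm-d-monitoring.svc.cluster.local:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
EOF

# 2) The matching annotation on the Llama Stack deployment's pod template.
kubectl -n llama-stack patch deployment llama-stack --type merge -p '
spec:
  template:
    metadata:
      annotations:
        sidecar.opentelemetry.io/inject: "otel-sidecar"
'
```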
Now, if I look at the pods, I should have two containers instead of one. Then I test my Llama Stack deployment, and that's what you want to see, a lot of output. Even though it's well documented, you're still going to have some trouble, and you'll be very happy when you see that it's actually working. Here are the Llama Stack traces. A lot of really great information. I think I'll show it at the end with light mode. In the trace is the entire prompt. If you're running RAG, you can find the information about what documents were used in the answer. Also, if you give your AI applications access to tools, and that's what MCP is all about, adding tools to your applications, you're going to want to make sure that the tool you expected to be used was actually used. The way you can do that is by observing the traces.
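A quick hedged way to check both things from the command line; container and service names here are operator defaults plus my assumed namespaces, and jq is optional.

```
# The pod should now report 2/2 containers: the app plus the injected
# collector (the operator names it otc-container).
kubectl -n llama-stack get pods

# One way to peek at recent traces outside Grafana: port-forward Tempo and hit
# its search API.
kubectl -n llm-d-monitoring port-forward svc/tempo 3100:3100 &
curl -s "http://localhost:3100/api/search?limit=5" | jq '.traces[] | {traceID, rootServiceName, durationMs}'
```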
Identify Signals to Track Performance, Quality, and Cost
Here are some signals that are important to track when you're monitoring your AI workloads. There are performance signals, cost signals, and quality signals. All of these usually correspond to metrics in the model server. These are vLLM metrics. Things like latency, and the time until the first token. End-to-end latency: how much time we spent processing the inputs in prefill versus generating the answers. A lot of this is typical of the things you'd like to monitor in any application, like time spent in network communication. How much of the prefix was served from a cache? Because using a cache can improve the performance of your model.
For those multi-turn AI workloads where you're going over the same context each time, using a cache, and knowing that it's actually being used, is important. The quality signals, I think those, in my experience with this demo, come from looking at those Llama Stack traces: did my response make sense? Did it use the tool that I wanted? Things like that. The cost signals are usually token-based or GPU-utilization-based. A lot of models have a cost per token. Being able to keep track of how many tokens you're using, and to query that, is very important. Of course, if you have GPUs, you want to make sure you're using them. You want to check for underutilization.
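Some hedged PromQL starting points for those signals are below. The metric names come from vLLM and the DCGM exporter and can vary between versions, so check what your /metrics endpoints actually expose before wiring up dashboards or alerts.

```
# Paste these into Grafana Explore against the Prometheus data source.
cat <<'EOF'
# Performance: p95 time to first token (prefill) and end-to-end latency
histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))

# Cost: token throughput across prompt and generation
sum(rate(vllm:prompt_tokens_total[5m])) + sum(rate(vllm:generation_tokens_total[5m]))

# Cache effectiveness: how full the KV/prefix cache is
avg(vllm:gpu_cache_usage_perc)

# Underutilization: GPU utilization from the DCGM exporter
avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)
EOF
```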
Monitor Workloads to Detect and Debug Issues
I have a test script that uses a bunch of tools. I use that to get some interesting traces from Llama Stack. You can tell how complete the Llama Stack telemetry is. This is why I really like it. I'm glad that we'll be including it in our portfolio at Red Hat. Telemetry was a big part of the original design of Llama Stack. I really appreciate that. You can get all the information about what model and what tool you're using, what the prompt was, and what the response was. The Prometheus data source, if you're using that llm-d quick start, is set up for you already. All of those signals we talked about, you'll find here in the vLLM metrics. I'm going to show you now how to import a dashboard, again with llm-d and vLLM. They have a Grafana dashboard that you can easily import. You'll find that in the slide notes.
Then, I also imported the NVIDIA DCGM-Exporter dashboard. Very easy to find online. You saw I put in just a number, the dashboard ID. That's all you have to do; Grafana just knows what that is. You get all of your GPU utilization stats there very easily. You saw how we set this up. It's not hard. It's very clearly documented for you. I love the new drilldown feature of Grafana. That's what I'm showing here. It's very impressive. That's it for that one. I have a good friend at Red Hat. Her name is Twinkll Sisodia. She has designed a pretty awesome tool that uses AI to analyze metrics. It's connected to Prometheus. Here it's running in OpenShift. I will show that live. You can chat with this application about the metrics in your cluster. I think it's amazing.
Summary
We talked about why LLMs are unique and what their monitoring challenges are. They're just non-uniform. They're non-deterministic: for the same inputs, you'll get a different output every time, which we saw when I was trying to trigger the safety features of Llama Stack. Then, how do you monitor them? The usual things: Prometheus, Grafana, OpenTelemetry Collector, and Tempo. OTel and Tempo, you maybe haven't used before. A typical AI workload is Llama Stack and vLLM. That's what I think, anyway. All the resources are here. These are more resources I used.
Demo
Let's get some live results. First, I'm going to run the test again. I am in the EC2 instance, and I'm just going to run my fake tools test. An intern created this script, and I am using it. Now that we have some things running, we can go back here. Here's a typical query for calculating GPU usage per hour. You can see, before, I was running the test, and then I stopped, so it went down. Here are some example queries for you. This is what I use, the approximate GPU token-hours, which is a good way to estimate cost and all of that.
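As a rough illustration of that kind of cost query, something like the following works. The $5/hour figure is just the flat instance price from earlier, and the metric names are the same hedged vLLM names as before, so adjust both to your setup.

```
cat <<'EOF'
# Tokens served over the last hour, across all vLLM pods
sum(increase(vllm:prompt_tokens_total[1h])) + sum(increase(vllm:generation_tokens_total[1h]))

# Very rough dollars per 1k generated tokens at a flat $5/hour instance price
5 / (sum(increase(vllm:generation_tokens_total[1h])) / 1000)
EOF
```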
Here are a few example queries. This is the light mode. This is the drilldown for vLLM; I can refresh, and here are those dashboards. You have to choose where your model is running. You can see I have four GPUs. The way llm-d works is that there's an internal scoring mechanism. It chooses the best instance of vLLM based on what's going on, how long the prompt is, and what it's trying to do. Because all the things I'm running are pretty boring, it pretty much skips the separate prefill instance altogether; in order for it to go to prefill, it needs to have a reason to. It's like, you've got this huge prompt, and you might use it again, so I'm going to put it in the cache. I haven't hit the minimum to use prefill, although I do have it running.
My point is, I'm wasting GPU capacity. I'm paying $5 an hour for this instance, and I'm hardly using any of it. Let's see, what else? The traces. That's what I want to show. Here, I switch to the Tempo data source and go over to search. I can choose Llama Stack, but I don't have to; it's the only thing running. For my workload that is using agents and tools, you're going to be able to see all of the information that you might want. That's a boring one, bring that back. That's the better one. Things like the information about the model, and the different agent sessions, if you need to track those. Apparently, we're going to have a world where millions of agents are going to be communicating amongst themselves, and we're barely going to be in the loop. We're going to at least want to follow along with what they're doing. Do you all think that's true? What's our vision of the future? I'm trying to get a more interesting one.
One more thing. Here is the Metrics Summarizer running in an OpenShift cluster live. It can handle different models. This is Twinkll Sisodia's, and I've linked her GitHub repo below, too. I'm going to check the weather in Brno. I'm going to use a tool. I'm going to give it access to the web, and I'm going to ask, what's the weather in Brno right now? Not too bad, 74 degrees, I can handle that. Not raining, excellent. We'll do one more RAG. I'm just trying to generate some cool traces. I already did that. With llm-d and the prefill and decode, you're going to hear the term disaggregated serving architecture of vLLM. Let's take one more look at the traces. I can specifically choose one, say I wanted to find information about what documents were used. That's what I'm looking for.
Questions and Answers
Participant 1: I think you mentioned earlier about some of the observability metrics relating to cost, specifically around tokens. You mentioned that, of course, measuring the amount of tokens that you use is important for cost. If you're using, for example, one of Meta's models, even if you're running it on your own hardware.
Sally O’Malley: You're right. Yes, open-source models are free. There's no cost per token for that. With that, I was more talking about if you have an OpenAI API key, or Anthropic, those models have a cost per token.
Participant 1: Llama Stack or is it llm-d, which one is actually running the models?
Sally O’Malley: llm-d is serving the models. Actually, the backend for llm-d is vLLM.
Participant 1: You're saying that you could also configure it to use OpenRouter for some stuff as well.
Sally O’Malley: It has smart routing worked in as a feature. I showed llm-d in this talk really to show off the quick start and how to easily get up and running with vLLM. It's a really full-featured project, it deserves its own talk. Yes, vLLM is running the models, and Llama Stack is that framework that connects to the model endpoint to build AI applications.
Participant 1: That's the one that includes like the built-in RAG system?
Sally O’Malley: Correct. Yes, Llama Stack. You'll find that in GitHub at meta-llama/llama-stack. We have a lot of demos, which I use in this demo, and that would be in Open Data Hub Llama Stack demos. It's in the slides.
Participant 2: Have you tried any of the frontend platforms that are specifically built for LLM observability? One that comes to mind that we've experimented with is Langfuse. Do you see any value in tools other than Grafana and Tempo?
Sally O’Malley: Yes, definitely. Maybe Dynatrace is one that I've used. Dynatrace has some really great features, and they're working to really enable LLM observability specifically. As far as open-source DIY stuff, the stack works great. There's so much more you can do with it besides what I showed.
Participant 3: As someone that looks at analytics a lot, I get worried about data regurgitation, and just displaying all the data. I'm wondering, what out of this is actionable for somebody that is really interested in the observabilities, and whatnot? Is this strictly for engineering? Is it for data engineers? Is it for DevOps?
Sally O’Malley: I gave the whole big picture for everybody. You're right. Different personas are going to be concentrating on different things. If I was developing llm-d, I'm going to look at the latency between prefill and decode and the connector between that, and all. You can get the information for every persona. If I'm an SRE, I'm going to look at different things. You can get that information from the same set of telemetry, but you're going to be collecting it differently, different queries.
Participant 3: It looks like Twinkll's dashboard there was cool in that it seemed to have the ones that you would want to see right away. It looked like you were trying to demo the ask a question type thing, "Are we doing good? What's going well?"
Sally O’Malley: It does work.
Participant 3: Live demos, I know. That's helpful. I was just curious, like, so much stuff in there, I don't know what's important.
Sally O’Malley: You're right.
Participant 3: It seems like the tokens are important.
Sally O’Malley: A lot of models have a limit. If you're an app developer and you're debugging your application, it might crap out on you. You can dig into the logs and the metrics and see that, you reached your prompt max. You can set max prompt tokens. There are so many settings that you can configure in your application, in your model server, and in whatever else. We use the metrics to show us how it should be best configured.
Participant 4: Did you find using Kubernetes harder or easier than using Podman for this? Why did you choose Kubernetes?
Sally O’Malley: Why did I choose MiniKube? I wanted to simulate a production environment. I didn't want to use OpenShift because this isn't a product plug, so I landed on Kubernetes. I'll be giving another talk where I'll be using all Podman, with pretty much the same stuff. Also, the Helm charts are very convenient. To start MiniKube, it's just minikube start. There's no special science or magic to it. I would encourage you all to check out MiniKube. You've got your Kubernetes environment up and running.
Participant 5: I was going to ask you about model correctness. Currently, there are no indicators in the monitoring to detect that?
Sally O’Malley: You're right. I think the best would be to follow the logs and the traces. That is a challenge with LLMs. You can use the LLM as more of a human-like chat interface and give it all of the tools, the very specific tools, that you want your AI workloads to use. That can make AI applications a bit more deterministic. If you're just chatting with it and having fun, that's one thing. If you're really using it for business-critical applications, you're going to want to rein in the LLM to do what it does best and let your business logic happen with tools. Which is why MCP is really an important piece of the AI landscape.