Technically Speaking | Taming AI agents with observability


Taming AI agents with observability ft. Bernd Greifeneder

Artificial intelligence
Automation and management

As modern IT systems grow too complex for humans to manage effectively, there's growing interest in turning to autonomous AI agents for operations. While powerful, these agents introduce new challenges around trust, reliability, and control. To explore how to address them, Red Hat CTO Chris Wright speaks with Bernd Greifeneder, Founder and CTO of Dynatrace, a company that has long focused on managing complexity with AI.

Transcript

00:00 - Chris Wright
Today, we're diving into a major shift happening in enterprise IT. It's no longer just about using AI to write code or analyze data; it's about building and managing systems that are themselves powered by autonomous AI agents. This brings up huge questions about complexity, trust, and control. And to help us navigate this new world, we're joined by Bernd Greifeneder. Bernd is the founder and Chief Technology Officer at Dynatrace, a company that has been thinking about AI and complexity for a long time. Welcome to Technically Speaking, where we explore how open source is shaping the future of technology. I'm your host, Chris Wright. Thank you for joining us, and welcome to the show, Bernd. I thought we could start off by recognizing that the world is changing rapidly. When you were getting started, it was about much simpler application architectures, two- and three-tier apps; today, we have these massive microservices architectures, and all of that complexity gets more and more challenging to manage. At some point, we need to offload the management of that complexity to computers. So enter AI, AIOps, some of these high-level concepts, and the potential for agents. Why don't you dig in from your point of view: where are we, and what are we doing to help tame that complexity?

01:33 - Bernd Greifeneder
So that's really a very interesting, challenging question, because in the early days, it was all about the ability to observe systems in order to understand what's going on. Complexity grew, virtualization came at some point, and suddenly systems became ephemeral; everything is moving all the time. Systems grew from a handful of servers to thousands of servers, and now we are thinking in terms of millions of servers. That is the new dimension we're dealing with. So originally, it was about making systems work. Now you actually need AI to, on one hand, manage these systems properly, because it is beyond humans to keep them working and delivering. But at the same time, you need AI to run, for instance, your production, and you need to observe what's going on in these environments. Especially as people not only run their standard digital services and standard cloud-native services, but cobble them together with additional AI services, the complexity increases even further. Because of AI's stochastic behavior, you never know what it's going to do or say, which makes it even more important to also observe what the AI is doing. And because of this complexity, you need AI to help you manage it. The term AI observability means observing the AI in AI-driven cloud and AI-native systems, versus the classic term, the legacy term actually, AIOps, which means using AI to operate your production environment. You combine both: AI to run the system, and observability of the AI that is in those systems, because you need that input of data in order to steer it. You can't manage what you don't measure.

04:12 - Chris Wright
I think the pithy concept that you can't improve what you're not measuring really applies in spades here. And when you add the complexity you're describing, the need for computer support, automation, AI, and agents to help humans manage it becomes, I think, more and more obvious. When did you shift your focus beyond just thinking about applications?

04:43 - Bernd Greifeneder
So honestly, it was already in the age when virtualization and vMotion kicked in that it became obvious to us that the world would become more complex and more dynamic. This is why, already 10 years ago, it was clear to us that we had to deal with a realtime ability to discover what the system is and how it is connected, on the vertical stack as well as the horizontal one; create a digital twin of the production environment; and use this as a directed graph for further analytics to really analyze true causation, not just correlation. We also figured that in order to do automatic analysis and build expertise into the product, we needed to bring all the data together and enrich it with additional context. And as we put those prerequisites in place, we realized that combining them and adding particular AI routines turned it into a causal AI engine that helps our customers with fully automatic root cause analysis based on fact. For instance, typically when a problem arises on one of the tiers, there's a ripple effect through all the other tiers. Then suddenly, bing, you get thousands of alerts, but you don't know which was the first one. And the first alert might not be the cause, because cause and effect can be different things. This is an issue we have clearly solved by having a true dependency graph that understands not only temporal ordering but actual cause and effect. This kind of causal AI is a very deterministic approach, and it combines super well now with the generative AI approaches that lead to agentic AI. I paint it this way: you get the best of two worlds. With large language models, you have the artistic side, the creative part, and with causal AI, you have the fact-based, scientific kind of knowledge. By combining those, we have a great path toward agentic AI, and toward delivering customers an agentic AI capability that helps them proactively and autonomously protect against emerging issues and security threats, optimize toward goals, and remediate automatically and autonomously.
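
To make the dependency-graph idea concrete, here is a minimal sketch, not Dynatrace's actual engine, of how a directed dependency graph can separate ripple-effect alerts from the likely root cause. The rule is deliberately simple: an unhealthy service whose own dependencies are all healthy is a candidate cause; everything alerting above it is ripple effect. The topology and health data are invented for illustration.

```python
# Minimal sketch: root-cause isolation on a directed dependency graph.
# Hypothetical topology and health data, for illustration only.

# Edges point from a service to the services it depends on.
DEPENDS_ON = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "database"],
    "catalog": ["database"],
    "payments": [],
    "database": [],
}

# Health signals as an AIOps engine might see them during an incident:
# the ripple effect makes every tier above the faulty one alert too.
UNHEALTHY = {"frontend", "checkout", "catalog", "database"}

def root_causes(depends_on, unhealthy):
    """A node is a candidate root cause if it is unhealthy while none of
    its direct dependencies are: everything alerting above it is ripple
    effect, not cause."""
    return {
        svc for svc in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(svc, []))
    }

print(root_causes(DEPENDS_ON, UNHEALTHY))  # -> {'database'}
```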

08:03 - Chris Wright
So much of the world is focused on generative AI, it's really taken all the oxygen out of the room, and we forget the power of predictive models. I think one of the interesting requirements for building such a predictive model would be the data collection. And you're describing this complex topology graph and the relationships between components in the system. How do you approach that part?

08:33 - Bernd Greifeneder
Yeah, as with almost everything, the answers you get depend on the input you provide to whatever analytics or AI you are running. The same is true here; this is why we care foremost about the quality of the data. But there are a couple of key aspects. One is that realtime learning is key, because if you have to make fact-based decisions about your production system, with all those ephemeral workloads that change every second or even more frequently, you can't use a model that you trained yesterday. The idea of learning from past outages is a bit limited, because no one wants to have a million outages in the past in order to predict the future. Also, the services in production are changing so much that it's pretty hard to rely on historical training. So the key is that this aspect needs to be realtime, and this then informs the realtime topological graph. The predictive portion is a mixture of leveraging that realtime detection with, depending on the predictive use case, more or less history. A data lakehouse is also handy here, especially one that doesn't require schemas and lengthy pre-setup of what needs to be indexed, because this allows the agentic AI at any time to ask, "Okay, now I need an answer to how this set of clusters is doing versus this business goal, and I need to predict certain orchestration parameters." Agentic AI with such a schemaless, indexless approach to a data lakehouse can get that understanding on the fly, and drive and direct accordingly.
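
A minimal sketch of the schema-on-read idea described above: events are stored as raw JSON with no upfront schema or indexing, and each question is expressed at query time rather than at ingest time. The event fields and the latency goal are hypothetical.

```python
# Minimal sketch of schema-on-read: events are ingested as raw JSON with
# no upfront schema or index, and each question is formed at query time.
# The event fields and the "business goal" check are hypothetical.

import json

raw_events = [
    '{"cluster": "eu-1", "kind": "pod", "latency_ms": 120, "orders": 7}',
    '{"cluster": "eu-1", "kind": "pod", "latency_ms": 480, "orders": 1}',
    '{"cluster": "us-2", "kind": "node", "cpu": 0.92}',
]

def query(events, predicate):
    """Parse on read and filter with an ad hoc predicate."""
    for line in events:
        event = json.loads(line)
        if predicate(event):
            yield event

# Question formed on the fly: which eu-1 events threaten a latency goal?
slow = list(query(raw_events,
                  lambda e: e.get("cluster") == "eu-1"
                  and e.get("latency_ms", 0) > 300))
print(slow)
```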

10:46 - Chris Wright
Yeah, a couple of things come to mind. Well, there's always the garbage in, garbage out problem, so there's some potential data cleansing, or at least ensuring that you're collecting the right data. Then there's that huge wealth of information that comes from all the logging across the entire system, which you can then query to understand current state. The reasoning piece I'm curious about: you have an agent that's trying to understand the state of the system, and using reasoning to produce the best worldview of the current state of the system. Anything worth highlighting in that context?

11:26 - Bernd Greifeneder
Yeah, the key here is that the reasoning builds on a whole set of different expertise. So it's not just one agent; I think of it as a team of agents that each know their domain of expertise. Let's say there's the expertise of Kubernetes, the expertise of databases, the expertise of driving toward a business goal. This agentic setup uses that expertise in a collective fashion, and uses as much deterministic data and as many deterministic results under the hood as possible to minimize the impact of hallucinations. We want to make it as scientific as possible to get reliable outcomes. But this is also where the ecosystem comes in, because, for instance, Red Hat knows Red Hat systems best, so incorporating Red Hat agentic knowledge into this entire collaborative process makes perfect sense to drive the optimal outcome for the customer.
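
A toy sketch of the "team of agents" pattern just described: a coordinator routes a question to domain experts, and each expert answers from deterministic observability data first, so a generative model would only phrase or plan around grounded facts. All names, routing keywords, and data are invented.

```python
# Toy sketch of a coordinator routing questions to domain-expert agents.
# Each expert answers from deterministic observability data, so any
# generative step downstream works from grounded facts.
# All names, metrics, and routing keywords are hypothetical.

from typing import Callable

def kubernetes_expert(question: str) -> str:
    restarts = {"checkout-pod-1": 14}          # stand-in for live data
    return f"Pod restart counts: {restarts}"

def database_expert(question: str) -> str:
    slow_queries = ["SELECT ... FROM orders"]  # stand-in for live data
    return f"Slow queries observed: {slow_queries}"

EXPERTS: dict[str, Callable[[str], str]] = {
    "pod": kubernetes_expert,
    "kubernetes": kubernetes_expert,
    "database": database_expert,
    "sql": database_expert,
}

def coordinate(question: str) -> list[str]:
    """Route the question to every expert whose domain it mentions."""
    hits = {fn for kw, fn in EXPERTS.items() if kw in question.lower()}
    return [expert(question) for expert in hits]

print(coordinate("Why is the database slow after the pod rollout?"))
```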

12:39 - Chris Wright
There's gotta be an element of trust that's really important here, 'cause we're offloading some forms of responsibility to AI and AI agents. How do you think about trust in this context?

12:49 - Bernd Greifeneder
Yeah, that's a big one, and we all know that AI is super capable. The problem, though, is that if the cool stuff only works five out of ten times, and the rest of the time it's hallucinating, doing something wrong, or never converging at all, it's not much of a help. So this is where reliability becomes important, and actually, this is the much bigger challenge that we all in the market have ahead of us. This is also why I strongly believe that in this whole agentic AI setup, it is imperative to have as many deterministic analytics routines and behaviors as possible in the mix, to complement the stochastic and probabilistic nature of the large language models that are typically used for these agents. Combining them in the right way to minimize hallucinations or wrong behavior is absolutely key to building proper trust. This is also the reason why, right now, most of this happens with developers, because there's still a developer overseeing the code. But do you still trust the developer if they accept all code changes and those go straight into production? Code is changing all the time. So I think it is absolutely imperative to combine realtime awareness of customers' digital systems, to know for a fact, in a deterministic way, what's going on in there, with the creativity of the large language models, and bring that together into an agentic AI system. This is, in my opinion, the best way to get a very reliable approach to solving problems autonomously, or even preventing problems from occurring at all.

15:17 - Chris Wright
I think the human-in-the-loop is really important in this context. You know, we've talked about the importance of trust and understanding what proposed changes are being made to the system, and we're gonna want to gain that trust before we give more and more responsibility and less and less oversight. What if we zoom out for a minute? You're creating a system that is itself quite complicated: a bunch of agents approximating the different types of workflows an SRE does, to better inform an SRE so they can respond quickly, and even extend that information all the way to developers. Is there an observability challenge with watching all the agents?

15:58 - Bernd Greifeneder
Oh yeah, absolutely. This is really important to understand: there's AI observability, which is about observing other AI. What used to be cloud native is going to be cloud and AI native, because there is basically no modern digital service you stand up that doesn't have some form of AI in it. What this means is that on top of the microservice containers you already have, you will add the inference servers that you leverage, and the agents that you build on top of those. And all of that comes together. Obviously, a Kubernetes deployment, as it grows to thousands or tens of thousands of containers, becomes a complex beast because everything is talking to everything else. But when you add in agentic AI, you no longer have as much control as you used to, because now these agents talk to each other, almost in a language of their own choosing. Is it doing what it's supposed to? Is it doing it within the cost and performance boundaries? But then there are other questions too. Is it compliant? Is it starting to hallucinate? Is it within the guardrails? There are so many questions that it becomes mandatory to observe this system, because otherwise you have no chance to guarantee any quality. This is why the debate is already a big one: we all know the capability of generative AI is fantastic, but the reliability is the key challenge. And I think this is where observing it, and proactively stepping in when needed, is a way to help customers with the reliability of their AI-enhanced digital services.
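
A minimal sketch of putting observability and guardrails around agent calls as described: wrap each invocation, record latency and token cost, and flag calls that cross a budget or policy boundary. The agent, budgets, and banned-action list are all hypothetical.

```python
# Minimal sketch of observing agent calls: wrap every invocation, record
# latency and token cost, and flag boundary violations. The toy agent,
# budgets, and banned-action list are hypothetical.

import time

MAX_TOKENS_PER_CALL = 2000
BANNED_ACTIONS = ("drop table", "delete database")

audit_log = []

def observed_call(agent_name, agent_fn, prompt):
    start = time.monotonic()
    reply, tokens_used = agent_fn(prompt)
    record = {
        "agent": agent_name,
        "latency_s": round(time.monotonic() - start, 4),
        "tokens": tokens_used,
        "violations": [],
    }
    if tokens_used > MAX_TOKENS_PER_CALL:
        record["violations"].append("token budget exceeded")
    if any(b in reply.lower() for b in BANNED_ACTIONS):
        record["violations"].append("guardrail: destructive action proposed")
    audit_log.append(record)
    return reply

def toy_agent(prompt):                 # stand-in for a real LLM agent
    return "Suggest: drop table orders and recreate the schema", 2500

observed_call("remediator", toy_agent, "Fix the checkout outage")
print(audit_log[-1]["violations"])
# -> ['token budget exceeded', 'guardrail: destructive action proposed']
```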

18:22 - Chris Wright
I think that's critically important. I mean, you can imagine a suggested remedy, a hallucination, that says, "The best way to remedy this issue is essentially to delete the database and start with a fresh schema." That would be catastrophic for all the obvious reasons. You mentioned the ability, or the flexibility, of including agents potentially from other systems. Technology-wise, are there key components you're using, agentic frameworks, you know, protocols, that are important to create that kind of integrated solution?

19:02 - Bernd Greifeneder
So if it's about interfacing with the ecosystem, then yeah, right now MCP is of course a key interface for integrating into IDEs, and remote MCP for integrating with partners, including yourselves. The agent-to-agent protocol is also emerging, but the market is moving so fast that in two months there might be another protocol. Basically, the whole point is that integration with the ecosystem is super important, because I think this is the whole power of agentic AI as well: it is not just an isolated tool, but a collaborative effort to get the best competence and expertise from different services and bring it together in an optimal way that delivers a reliable outcome.
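
To show the shape of the idea without pinning to any particular SDK, here is a sketch of an observability capability exposed as a named tool behind a JSON-RPC-style handler, the pattern MCP builds on, so any agent speaking the protocol can discover and call it. The tool name, arguments, and result payload are invented.

```python
# Shape-of-the-idea sketch (not an actual MCP SDK): an observability
# capability exposed as a named tool behind a JSON-RPC-style handler.
# Tool name, arguments, and result payload are invented.

import json

def get_problem_summary(service: str) -> dict:
    # Stand-in for a real observability query.
    return {"service": service, "open_problems": 2, "root_cause": "database"}

TOOLS = {"get_problem_summary": get_problem_summary}

def handle(request_json: str) -> str:
    req = json.loads(request_json)
    if req.get("method") == "tools/list":      # agent discovers tools
        result = {"tools": list(TOOLS)}
    elif req.get("method") == "tools/call":    # agent invokes one
        tool = TOOLS[req["params"]["name"]]
        result = tool(**req["params"]["arguments"])
    else:
        result = {"error": "unknown method"}
    return json.dumps({"jsonrpc": "2.0", "id": req.get("id"), "result": result})

print(handle('{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'))
print(handle(json.dumps({
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "get_problem_summary",
               "arguments": {"service": "checkout"}},
})))
```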

20:07 - Chris Wright
Yeah, I think the open standards are key here. And of course, they are evolving as fast as we can imagine, but building off of them is, I think, fundamentally important. The data collection from the system, that realtime view, sounds fundamental, critical to building the right knowledge graph. Are there any insights you have on how you do the instrumentation and observability to gather the right data to build this worldview?

20:39 - Bernd Greifeneder
Yeah. For instance, we hook up to the Kubernetes cluster with an operator that auto-discovers all the different containers running there, and instruments not only the infrastructure level but even down into the code level, so that the full depth is understood. This enables automatic tracing, automatic log analytics, and automatic metrics gathering from the infrastructure. Those traces are also attached to end user experience, because if there are front-end tiers in there, like mobile apps or web apps, those are instrumented as well, and the end user experience is actually the most important thing to tie to the business outcome. This is why I keep saying that watching CPUs is nice, but it's more important to understand the actual end user impact, because a CPU can go out of whack with no user impacted at all, because the system is able to handle it. And this is also why building up such a graph in realtime allows the impact and risk analysis to happen in realtime as well, informing the remediation or reasoning step that drives corrective actions, whether they are remediative, preventive, or for optimization.
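
Dynatrace's operator injects this kind of instrumentation automatically; as a generic, hand-instrumented illustration, here is how a trace span can carry technical, end-user, and business context together using the OpenTelemetry Python SDK. The span name and attribute values are invented for this sketch.

```python
# Generic, hand-instrumented illustration with OpenTelemetry (an agent
# like Dynatrace's would inject equivalent instrumentation on its own).
# Span name and attribute values are invented.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def place_order(user_id: str, cart_total: float) -> None:
    with tracer.start_as_current_span("place_order") as span:
        # Technical context: where the request ran.
        span.set_attribute("k8s.pod.name", "checkout-7d9f")
        # End-user and business context: who was affected and what the
        # request was worth, so an outage maps to real user impact.
        span.set_attribute("enduser.id", user_id)
        span.set_attribute("order.value", cart_total)

place_order("user-42", 99.90)  # span is printed to the console exporter
```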

22:26 - Chris Wright
Do you look at all at observing the operations of an "intelligent application"? So there's business logic encoded in one service, there's an LLM in another service, and there's the relationship between the application calling into the LLM. Are you thinking about instrumentation and visibility into the generative AI piece of an application itself?

22:51 - Bernd Greifeneder
Yeah, actually we look at all the different layers, from infrastructure, where it's the bare metal, CPU, and basic metrics, all the way up to the application layer: what those services are communicating and how they're talking to each other, think of logs, traces, and metrics. And then also the end user: what is the end user actually getting as a result from this? Or other machines, since it can of course be machine to machine. But also, what is the business outcome, and is the business goal achieved? Observing AI is almost more the technical aspect, but we also have a layer that we call business observability, because we turn the purely technical, high-volume data into a smaller set of what we call business events, and provide our customers an abstraction layer that is meaningful to business people. They don't want to know how much CPU that inference engine used; they want to know how many people actually purchased, or did whatever is meaningful to their business, with the help of this outcome. And this is exactly what we bring together by leveraging all that context-enriched data, bringing it to a business level, and allowing customers to understand the full pipeline and flow.
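
A toy sketch of that abstraction step: high-volume technical events reduced to a single business event a non-technical stakeholder can act on. The fields, actions, and numbers are hypothetical.

```python
# Toy sketch of business observability: reduce high-volume technical
# events to one business event per window. Fields are hypothetical.

technical_events = [
    {"trace": "t1", "action": "purchase", "ok": True,  "value": 40.0},
    {"trace": "t2", "action": "purchase", "ok": False, "value": 25.0},
    {"trace": "t3", "action": "purchase", "ok": True,  "value": 60.0},
]

def to_business_event(events):
    """Abstract trace-level detail into what the business cares about:
    completed purchases, revenue, and revenue lost to failures."""
    done = [e for e in events if e["action"] == "purchase" and e["ok"]]
    failed = [e for e in events if e["action"] == "purchase" and not e["ok"]]
    return {
        "purchases_completed": len(done),
        "revenue": sum(e["value"] for e in done),
        "revenue_at_risk": sum(e["value"] for e in failed),
    }

print(to_business_event(technical_events))
# -> {'purchases_completed': 2, 'revenue': 100.0, 'revenue_at_risk': 25.0}
```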

24:32 - Chris Wright
I think that's really important. There are a lot of questions in people's minds: what is the outcome I can produce by leveraging AI? If you've forgotten the question but you know the answer is AI, we're probably dangerously deep into hype and buzzword land. But if you're able to demonstrate the value, I think it really helps cement in people's minds the use cases and, you know, the efficiency that can be gained.

25:02 - Bernd Greifeneder
There's value, for sure. There's no way back from AI, so this is the way forward. But we all have a certain responsibility to use it in the proper way, and this is why it's important to start with smaller steps, make sure those work, and really do crawl, walk, run with AI, or bring in external products that have already done that and help customers with it, so that you can build on something that provides value and is not just a huge experiment. But I also hear very interesting use cases, some where I say, "Yeah, this is something humans never could do." This is where AI helps. For instance, one customer wants to transcode 20 million lines of mainframe code into documentation, and from documentation into clean Java code. And this makes sense, because you could never hire enough people to do it in a reasonable timeframe. This use case is still super ambitious, with no guarantee that it succeeds, but I think it has a higher likelihood with AI than trying to hire 2,000 developers to transcode that mainframe code. But this is the advice: don't look for the magic one-liner where stuff just happens. Make sure you have great structured input, and make something useful with it. This is why I also believe in this transcoding use case.

26:43 - Chris Wright
I'm with you. I give similar guidance internally to our engineers, and part of it is that iterative work. We've learned really well in open source how to do small iterations continuously and really improve a system, where every iterative step provides clear improvements in value. There's a lot of enthusiasm around the potential here, and you'll hear conversations, certainly from an executive point of view within the enterprise customer base, with expectations even for the amount of efficiency gains we can generate with AI. What are you seeing in terms of creating this agentic AI system, and observability to help manage and operate complex, large-scale infrastructure?

27:31 - Bernd Greifeneder
Yeah, I frequently hear from executives that their expectation is that within the next few years, 70% of their entire engineering workforce gets automated and they see huge productivity gains. I think that yes, there is huge potential and value in AI, but right now, we actually need more people to make that AI work and really deliver the value. And the other part is that every competitor in the market will also leverage AI. Basically, you just shift the workforce's type of work from doing the actual things to directing the AI, but you still need people to supervise what's going on; otherwise you can't stay competitive. So the outcomes might grow, and you might get faster along the way, but I think all the people who skill up in this entire game will have more work to do than ever before. Basically, you don't need fewer people; you need people to do different things in the future. And this is, in my opinion, the big change.

28:55 - Chris Wright
Well, thank you for indulging that question. I really appreciate, Bernd, your insights, your long history in systems observability, and extending that into agents and making it practical and useful in a meaningful way to help us advance into this exciting future. Thank you.

29:16 - Bernd Greifeneder
Thank you very much.

29:18 - Chris Wright
It's fascinating to think that as we add more automation and AI-driven agents, the need for deeper observability doesn't just increase, it becomes absolutely fundamental to trusting the system. Bernd's point that you can't have deterministic outcomes without high-quality contextual data really hits home. Before we can hand over the keys to the kingdom, we need a shared, trustworthy view of our entire environment. It shows that the future of autonomous operations isn't just about letting go of the controls, but about building a new foundation of trust so we know exactly what the system is doing and why. Thanks to our audience for joining Technically Speaking, where we explore how open source is shaping the future of technology. I'm Chris Wright, looking forward to our next conversation.

About the show

Technically Speaking

What’s next for enterprise IT? No one has all the answers, but CTO Chris Wright knows the tech experts and industry leaders who are working on them.