
AI Agents & LLMs: Scaling the Next Wave of Automation

Summary

The panelists demystify AI agents and LLMs. They define agentic AI, detail architectural components, and share real-world use cases and limitations. The panel explores how AI transforms the SDLC, addresses concerns about accuracy and bias, and discusses the Model Context Protocol (MCP) and future predictions for AI's impact.

Bio

Govind Kamtamneni is Technical Director, App Innovation Global Black Belt, at Microsoft. Hien Luu is Sr. Engineering Manager @Zoox and author of MLOps with Ray. Karthik Ramgopal is Distinguished Engineer and Tech Lead of the Product Engineering Team @LinkedIn. Moderated by: Srini Penchikala, InfoQ Editor.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Srini Penchikala: Welcome to today's InfoQ Live roundtable webinar, titled, "AI Agents and LLMs: Scaling the Next Wave of Automation". In this webinar, we will discuss the latest advancements in Large Language Models, or LLMs, and the recent trend of AI agents and agentic AI architectures. I am Srini Penchikala. I serve as the lead editor of the AI, ML, and Data Engineering Community at infoq.com.

For you today, we have an excellent panel of subject matter experts and practitioners from different specializations and organizations in the AI and ML space. Our panelists are Hien Luu from Zoox, Karthik Ramgopal from LinkedIn, and Govind Kamtamneni from Microsoft.

Would you like to introduce yourself and tell our audience what you've been working on?

Hien Luu: My name is Hien Luu. I currently work at Zoox. We are in the autonomy space. I lead the ML platform team there. I've been working on the ML infrastructure area for many years now. One of my hobbies is teaching. I've been teaching at an extension school in the areas of generative AI. It's been really fun. Right now is a very exciting time. Everybody knows we're in the midst of an AI revolution, so much innovation and advancement is going on at the moment. It's really fun.

Govind Kamtamneni: I'm part of our Global Black Belt team, basically an extension of engineering here. I'm a technical director, and I've worked with close to 200 customers since we launched the Azure OpenAI service. It's obviously been, as Hien noted, a revolution. Every experience and application is being evolved to be AI native, essentially to personalize or automate parts of your workflows or your experiences. There's a lot that we'll discuss here, things that are working for customers, but also things that are not.

Karthik Ramgopal: My name is Karthik Ramgopal. I'm a distinguished engineer at LinkedIn. I'm the tech lead of our product engineering team, which is responsible for all of our member and enterprise facing products. I'm also the overall tech lead for all of our generative AI initiatives, platform and product, as well as, increasingly, internal productivity use cases. Everyone said a huge revolution is going on right now. In my opinion, this is going to be the biggest transformation of society in general since the industrial revolution. As with anything else, there is a lot of advancement, but also a lot of hype and noise. What I'm hoping we will see through the panel today is how to cut through some of that and focus on real-world insights, and examples of what you can actually do and what you will be able to do in the future, potentially, with these technologies.

Srini Penchikala: As I always say, definitely, we want to highlight the AI innovations that our audience should be hyped about and not the hype of the AI innovations.

What's an AI Agent?

There are so many different definitions and claims about agentic AI solutions. To set a common base of understanding for this discussion and for this forum, can you define what an AI agent is, and how AI agent-based applications differ from traditional applications that have some type of job scheduler, workflow, and decision-making capability built in? How is this different from those other apps?

Hien Luu: It's good to establish some kind of baseline. When I first learned about AI agents, probably about a year ago when Andrew Ng introduced the idea, I got really confused. What does it mean? The term is hard to digest in terms of what it encompasses. For me, after reading a lot of blogs and learning, the thing that really helped was stepping back and understanding what the agentic part of AI means on its own. I looked it up, and according to the dictionary it means the ability to act independently and achieve outcomes through self-directed actions. When I read that and mapped it to all the discussion about what AI agent systems are, it started to make a lot of sense. Hopefully that makes it a little easier for other folks who are confused about what AI agents are. With that, we can map it to what AI agents are.

Essentially, these are applications or systems designed to solve certain complex tasks or to achieve certain specific goals. They're typically centered around LLMs like o1. LLMs now are extremely smart, so you can use them to help with reasoning and planning, and tool use is now part of our toolset for building LLM applications. These systems use tools, they can plan, they can reason, they can interact with environments, and they maintain control over how they accomplish tasks. That's where the agentic part comes in, using these smart LLMs that can reason and plan. That's my way of interpreting what this means.

I think there are three concepts I learned that help with understanding. The first is the autonomy aspect. This is one aspect that's different from your classical AI and ML systems, which are designed to solve specific tasks: the autonomy to figure out the process and the steps to solve the problem. The second is adaptability: these systems can reason and plan, and they can adapt their steps based on the tools they use and the environment they act in. The last one is goal orientation, another part of these AI agent systems. I'll end with one definition that I saw, which I really like in simple terms: an AI system using LLMs in a loop. You might have heard that before. I want to comment on one more thing, though. Andrew Ng said something like, instead of arguing over what the definition is, we can acknowledge that there are different degrees of being agentic in our systems.
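
To make the "LLM in a loop" framing concrete, here is a minimal sketch of an agent loop in Python. It is illustrative only: the `call_llm` function and the two tools are hypothetical placeholders, not any specific vendor API, and a real agent would add error handling, guardrails, and evaluation.

```python
# Minimal sketch of "an AI system using LLMs in a loop".
# call_llm() is a hypothetical placeholder for any chat-completion API.

def call_llm(messages: list[dict]) -> dict:
    """Pretend LLM call; a real system would call a model API here."""
    raise NotImplementedError

TOOLS = {
    # Hypothetical tools the agent may choose to invoke.
    "search_docs": lambda query: f"Top results for: {query}",
    "create_ticket": lambda summary: f"Ticket created: {summary}",
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system", "content": "Plan step by step. Reply with a tool call "
                                      "as {'tool': name, 'input': ...} or a final answer."},
        {"role": "user", "content": goal},
    ]
    for _ in range(max_steps):                 # the "loop" in LLM-in-a-loop
        reply = call_llm(messages)
        if reply.get("tool") in TOOLS:         # the model chose to act via a tool
            result = TOOLS[reply["tool"]](reply["input"])
            messages.append({"role": "tool", "content": result})
        else:                                  # the model decided it is done
            return reply["content"]
    return "Stopped: step budget exhausted."   # guardrail against endless loops
```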

Agentic AI vs. AI Agents: What's the Difference?

Srini Penchikala: AI agents versus agentic AI, these two terminologies, how do you define those?

Karthik Ramgopal: Agentic AI refers to the entire system. If you look at the system, it comprises AI agents, but it also comprises orchestrators to orchestrate across these AI agents. It comprises tools which you call in order to interact with the real world to do something, and systems for coordinating with these tools, like MCP, which we'll get into a bit later in this discussion. You also have registries of various kinds to announce things and to access things. That end-to-end system is what is called an agentic AI system. An agent is just one small component of this system.

Govind Kamtamneni: I think that's well put. In fact, Berkeley has been coining the term Compound AI system for the products we're going to ship eventually, because there will be aspects of a workflow that require agency. For example, let's say you have an email automation system. I'm actually working with a customer who's doing this at scale. Let's say someone sends in a tracking number and the system automatically replies; that's what he's doing there. Part of that, of course, requires accessing tools, but then when it drafts that email and sends it to the end user, it will obviously factor in the tool's response. This is where the LLM's stochasticity comes into play. It can personalize the response: based on the reading comprehension of the person who sent the request, it could adjust the response to be more terse, shorter or lengthier. I think you can also have workflows that are more predefined.

Obviously, we have those; we've been doing business process automation for the last 20 years. You can introduce agency steps within the workflow. It's a spectrum. There are some SWE agents, we'll probably get to this later, software engineering agents that can be more goal oriented and have a lot more agency in how they create that orchestration and create code paths autonomously. It is definitely a spectrum. The gist of it is how much the LLM drives the control flow: the more you use the LLM to drive the control flow, the more agentic it becomes, which also introduces a lot more uncertainty, let's put it that way. Or you can put it on rails and keep more control over the code path and the control flow, which makes it more of a workflow with some agency.

Karthik Ramgopal: I think one key difference with the earlier systems is what Govind was alluding to, which is the emerging ability of these systems to autonomously learn, and often to generate tools, to solve the problem. It's called metacognition, which is learning how to learn. This is still an emergent ability. We are seeing in some of these cases that these agents are able to author snippets of code or orchestrate across tools, for example, to unblock themselves to solve a particular task. Again, it sometimes goes off the rails. Sometimes these agents get stuck. These are still very nascent capabilities. That is a key difference from the systems of yore: the level of agency, the level of cognition, and hence the level of autonomy enabled is significantly higher.

Agentic AI Use Cases: Overkill or Fit-for-Purpose?

Srini Penchikala: Agentic AI architectures are new and still evolving. Like you mentioned, they will have to go through some growing pains before their value outweighs concerns like bias and hallucinations, which we'll talk about more later in the discussion. Just staying at the use case level, what are some use cases or applications where you are seeing AI agents as a really good fit? The other side of that question is, what are the use cases where AI agents and agentic AI solutions are either overkill or not recommended?

Karthik Ramgopal: I think any sort of knowledge work automation is where AI agents are excelling right now. I think it's only going to be a matter of time before somebody hooks up an AI model to a set of actuators to interact with the real world, so they can do physical tasks as well. At least I haven't heard of anything like that happening at scale so far. Right now, it's mostly limited to these forms of knowledge work. Anything which involves a computer, anything which involves software, anything which involves calling an API, and anything which involves orchestrating a bunch of different things to make it happen. That is what it's pretty good at. What are some of the impediments there? The first impediment is that it goes off the rails and quality suffers.

The second is that evaluating whether it is doing the job correctly or not is hard. The third is that, even with all the cost reductions, these models are still incredibly expensive, and GPU capacity is fairly constrained at scale on account of a variety of factors. There are some organic limitations in terms of cost, compute availability, and in some cases latency, which are also inhibiting crazy adoption. Where are they not good? Obviously, if you have a workflow where you want tight control and you do not want agency, where effectively you're following algorithmic business logic, AI agents are overkill. They're going to be expensive, and they're also unnecessary.

One thing you can still do is use AI to help you generate that logic or that code before you deploy it. It's like moving it further left in the chain: you do it at build time instead of at runtime, and you still have a human verify it before you go and deploy it.

How Reliable/Accurate Should AI Agents Be?

Srini Penchikala: How reliable does an agentic AI system need to be in order to be useful? For example, 90% accuracy sounds pretty good, but picking up the pieces after 10% wrong results sounds like a lot of work. Again, where are we right now, and where are we going with this?

Govind Kamtamneni: Obviously, the classical answer is, it depends. It depends on your use case and your risk appetite. I have some digital native customers who are pushing the frontier. Think of, for example, not GenAI, but a traditional ML system, since Hien is here: Tesla pushes the frontier and issues software updates for all kinds of FSD capabilities. I was actually experiencing FSD and it's getting better, but there are those 10% of cases where it does go off the rails. As a brand, Tesla is fine taking that risk on. I think it depends on the experience you want to provide to the end user. Obviously, we're all here to serve their needs at the end of the day. If you think your end user wants the latest and greatest, you can caveat that experience. Even when ChatGPT launched, and I think even now, at the bottom it says responses might be inaccurate, or something like that. Just keep in mind that it will have implications for your brand and for trust. You don't want to tolerate too much risk or push that risk onto the end user.

Hien Luu: I think, like most new technology, there's a certain amount of learning and a certain amount of risk. The technology will get better. It's basically a journey for all of us to be on and learn from. You want to be smart about applying it first in use cases where the consequence of a wrong decision is not too bad. We can learn from that experience, build on top of it, and apply it to more complex use cases in the future. That's the current state of where we are. There are definitely a lot of use cases where AI or agentic AI can be really tremendous, like we saw with deep research. That use case is tremendous in terms of the time it saves doing research.

Karthik Ramgopal: I think it's really hard to talk about these quality percentages as a whole, because in most of the use cases where AI is applied today, the product is fairly complex. If it weren't, why would you use AI? You'd just code it up. It's important to talk about components of the system and your level of risk tolerance in each of these components. There's also this concept of building confidence models and invoking human in the loop.

At least with the capabilities of AI systems right now, full autonomy is mostly a pipe dream, except for very few use cases. Human in the loop is needed, and a lot of AI system architectures have to be designed for it, even though you cannot put SLAs on humans. You can put SLAs on the systems, where you actively invoke the human to ask for clarification, approval, or input before you undertake a sensitive action. That is, again, a pattern you can fundamentally integrate. It's also important to note that you have to be very careful with the definition of quality and correctness if you're doing tasks in a loop or in a chain. What happens is that if you have 90% accuracy everywhere, the error rates build up: first it'll be 90%, then it'll be 81%, and so on. Progressively, it could result in a much worse error rate than what you anticipated at the beginning.
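
A quick back-of-the-envelope calculation illustrates how per-step accuracy compounds across a chain; the 90% figure and step counts below are just illustrative.

```python
# End-to-end accuracy of a chain of steps, assuming each step independently
# succeeds with the same probability (an illustrative simplification).
step_accuracy = 0.90

for steps in (1, 2, 5, 10):
    end_to_end = step_accuracy ** steps
    print(f"{steps:>2} chained steps -> {end_to_end:.1%} end-to-end accuracy")

# Output:
#  1 chained steps -> 90.0% end-to-end accuracy
#  2 chained steps -> 81.0% end-to-end accuracy
#  5 chained steps -> 59.0% end-to-end accuracy
# 10 chained steps -> 34.9% end-to-end accuracy
```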

Srini Penchikala: Also, on the other hand, we can use these solutions to iteratively train and learn. Like you said, they can learn how to learn. You can use that to your advantage.

Karthik Ramgopal: It's very important to mirror the real world in the way you design these applications. I can give you a classic example: coding agents. How do coding agents correct themselves? Just like humans do. You feed them the compiler error, they look at it, and then they know, here is where I messed up. Without the integration with the tool which provides them access to the compiler error, they are equally in the dark, just like you would be if you weren't seeing error logs show up in your CI/CD system.

Govind Kamtamneni: I agree one hundred percent. We do see a lot of systems, like Cognition's Devin for example, that do a good job. Even some of the coding agents, when they're misaligned, will ask the human to provide clarification. Sometimes they also reward hack. As a human, you have to validate the response that was generated. For example, some of these software engineering agents will accomplish your acceptance criteria, but they might have hardcoded parts of it. We see that as well. Obviously, humans are very much in the loop. We just have to work with these systems at a much higher level of abstraction. The end result is that output per human goes up. The human is very much in the loop.

The Architectures of Agentic AI (Technical Details)

Srini Penchikala: Let's talk about the technical details of these architectures. What do agentic AI application architectures comprise? What are the key components, and how do they interact with each other? In other words, what additional tools and infrastructure do we need to develop and deploy these applications?

Govind Kamtamneni: I actually have a huge blog post about this, with a decision tree and all that. You can probably search my name along with how to build AI agents fast. Essentially, it is very similar to building the cloud native microservices we built earlier. You need an orchestration layer or component, and then obviously the model is doing the reasoning, so you need round trips with the model. The most important thing with all AI agents, and with those round trips with the model, is the context.

Ultimately, you're going to spend most of your time organizing that information layer and the information retrieval aspect. Providing the right context is literally everything, so that layer is also very important. The way you organize information can be more semantic. There are a lot of vector databases, and pretty much every database now supports vector index capabilities, which allows you to improve the quality of your retrieval. Then you can, of course, stitch agents together, as Karthik was saying, but be careful: when you have too many agents interacting with each other, the error rate can compound. More important are the quality attributes of the system. At the end of the day, if you want reliable, highly available systems, you've got to make sure there's a gateway in front of your system that handles authentication and authorization for agents as well.
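
As an illustration of the semantic retrieval idea, here is a minimal sketch of vector-similarity lookup using plain NumPy; `embed` is a hypothetical placeholder for whatever embedding model you use, and a production system would use a real vector database or index instead of a brute-force scan.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical placeholder: call your embedding model of choice here."""
    raise NotImplementedError

def top_k_context(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Brute-force cosine similarity over document embeddings."""
    doc_vecs = np.stack([embed(d) for d in docs])
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:k]
    return [docs[i] for i in best]   # these snippets become the prompt context
```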

Entitlements are very important for agents. You'll see a lot of identity providers, including Microsoft with Entra ID, move into this space to let agents have their own identities, because we do envision a world where a lot of flows with agents accessing data or tools are on-behalf-of-user flows. We also envision a world where agents are more event-driven: they'll act independently and then ask the human or someone else for permission. In that case, entitlements are very important. That's more coming soon, but for now, just make sure that you have proper fine-grained authorization for the resources that the agents, or the orchestration layer, are going to access.

Then, even between the orchestration and the model layer, we usually recommend some L7 gateway that can handle failovers and things of that nature, because these models are ultimately doing a lot of matrix multiplications and they very much introduce latency to the experience. They are not super reliable, and there are a lot of capacity problems, so you want to handle that scenario at scale. Then, yes, packaging it up like you would package any microservice and deploying it to any container-based solution, that's all the same. I think the single most important thing here is evals. Evaluations are key.
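
Here is a minimal sketch of the failover idea at the client level (a real deployment would typically put this behind an L7 gateway); the endpoint URLs and the `call_endpoint` function are hypothetical placeholders.

```python
import time

PRIMARY = "https://primary-model.example.com/chat"       # hypothetical endpoints
FALLBACKS = ["https://secondary-model.example.com/chat"]

def call_endpoint(url: str, payload: dict) -> dict:
    """Hypothetical placeholder for an HTTP call to a model endpoint."""
    raise NotImplementedError

def chat_with_failover(payload: dict, retries: int = 2, backoff: float = 1.0) -> dict:
    """Try the primary endpoint with retries, then fail over to backups."""
    for url in [PRIMARY, *FALLBACKS]:
        for attempt in range(retries):
            try:
                return call_endpoint(url, payload)
            except Exception:
                time.sleep(backoff * (2 ** attempt))   # exponential backoff
    raise RuntimeError("All model endpoints failed")    # degrade gracefully upstream
```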

Obviously, the models are saturating a lot of benchmarks that showcase reasoning. Yes, they're very capable polymaths that can reason well, but what matters are the evals you care about: at the end of the day, whatever experience you're stitching together, create a benchmark for that and create evaluations that are idiosyncratic to your use case. There are a lot of libraries, we have one from Azure AI, and there are others out there for evals, such as Ragas. Use those eval libraries, do continuous monitoring, and make sure you collect user feedback. That data is also potentially very rich if you ever want to fine-tune, so make sure to store it. There's a lot there. The existing ways of doing cloud native, twelve-factor apps still apply, but then there are all these nuances with GenAI.
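
To make "create evaluations that are idiosyncratic to your use case" concrete, here is a minimal sketch of an offline eval harness; the test cases, the `run_agent` function, and the crude substring check are illustrative stand-ins for whatever your system and scoring criteria actually are.

```python
# Tiny offline eval harness: run the system on known cases and score the outputs.
# run_agent() is the (hypothetical) system under test; scoring here is a crude
# substring check -- real evals would use task-specific metrics or an LLM judge.

EVAL_CASES = [
    {"input": "Where is my order 12345?", "must_contain": "tracking"},
    {"input": "Cancel my subscription",   "must_contain": "confirm"},
]

def run_agent(user_input: str) -> str:
    raise NotImplementedError   # call your agent / orchestration layer here

def run_evals() -> float:
    passed = 0
    for case in EVAL_CASES:
        output = run_agent(case["input"])
        ok = case["must_contain"].lower() in output.lower()
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['input']!r}")
    score = passed / len(EVAL_CASES)
    print(f"eval pass rate: {score:.0%}")
    return score
```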

Hien Luu: That's very true, what you said there. A lot of people tend to focus on what's new about building these AI systems. It's similar to how we thought about building classical AI and ML systems: you may have seen that famous picture where the model is just a small piece of the overall bigger system. An agentic AI system is similar in that way. A lot of what we learned over the last 20 or 30 years of building microservices is still applicable, and the AI part is still just a small part of the overall system. There are some new parts, like Govind mentioned, that are very specific to the nature of these stochastic systems, with evaluations and all of that.

Karthik Ramgopal: The rules never change. Good systems design is still good systems design. You have to account for the increased level of non-determinism in these systems, and that shows up in a variety of ways. First is observability. A lot of the observability systems we have don't work here, because you do not have predefined paths. By definition, there is a lot of agency, and there are many possible paths which can be taken. You need to invest in an observability solution which will give you intelligible data about what's happening in production in the system. That's the first.

The second is the hardware resource capacity planning perspective, where this non-determinism also gets in. Have you provisioned enough resources, and do you have enough safeguards built into your system to throttle workloads, execute them asynchronously when capacity is available, and prioritize them appropriately? Again, these problems happen at a certain amount of scale. That's important. The third thing is that a lot of these systems, in addition to being slow, also fail, because you have so many moving components and so much non-determinism that you may not get the right output every time. How robust are your error handling, graceful degradation, and escalation to human in the loop? These are not just systems design concerns; they have to be reflected everywhere, from your UI to your AI, because some of them also manifest in the UI, in how you respond to users. That end-to-end picture is super important.

Srini Penchikala: I'm glad you all mentioned these additional topics that we need to be cognizant of: AI context, authentication, authorization, API gateway routing, circuit breakers, and observability, which Karthik mentioned. I was envisioning we would be talking about these more like next year, in a phase two of AI agents. I'm glad that we're all talking about them now, because these should not be afterthoughts. They should be built into the systems right from the beginning.

Leveraging AI Agents in the SDLC (Software Development Life Cycle)

Most of our audience are senior technical leaders in their organizations. They would like to know how they can benefit from AI agents in their day-to-day work, which is basically software development and so on. How can we leverage AI agents in different parts of the SDLC process? What is the new role of a software developer, with more and more development tasks being managed by AI programs? How can we stay relevant, and not necessarily fear AI agents, but embrace them and use them to better ourselves?

Karthik Ramgopal: I think what you said at the end is very important. You should not fear it. You should embrace it and see how you can use it best. I can talk a bit about how I use AI in my day-to-day development. The first thing is, I don't vibe code. There is so much hype about vibe coding, and I still don't believe in the hype. It may be good for prototyping small things, but for building anything serious, which is what I think software developers end up doing in professional environments, it's not so great yet. Maybe it will get there. What I do use is a bunch of these AI native IDEs. LinkedIn is part of the Microsoft family, so we use GitHub Copilot quite a bit. GitHub Copilot is getting better of late with agent mode and all these things, especially in VS Code.

Again, I use GitHub Copilot quite a bit to understand the code base, to ask it questions I'd normally have to research manually, as well as to ask it to make changes. It still makes mistakes, although it's getting better. The onus is still on me to understand, first, how do I use the tool? How do I prompt it best? How do I ask the follow-up questions in the right way? What model do I choose? Because you have the reasoning models and you have the regular, non-reasoning generative models, and for certain kinds of tasks, certain models are good. Again, you get to know some of these things as you interact with these tools more and more. The second is, how do I review the output it produces? One of the challenges I'm facing already is that AI is a great productivity accelerator, but it can produce reams and volumes of code way faster than a human can. I need to keep up with the ability to review it, because it can still make mistakes.

More importantly, the mistakes it makes are often very subtle in nature. They're not very obvious. You have to have an extra eye for detail. If anything, your conceptual knowledge, as well as your general ability to understand large pieces of information and synthesize results from them, needs to get better in order for all of us to keep up in this new world. That is one area, coding assistants and things like that. You can also use it for unit test generation of various kinds. You don't need to write pesky tests yourself, although they are still very important for quality. Then there are also aspects of using AI further right, where you can use it to understand anomalies in your traffic patterns, do incident root-causing, investigate deployment failures, and debug issues there. That is one entire category of things. You can also move further left, where even during the design process, you can use AI quite a bit.

For example, we have integrated Glean at LinkedIn. I end up using Glean quite a bit because it's connected to our entire corpus of documents, Office 365, Google Docs, our internal wikis, which contain a ton of information. If I'm doing some research, for example, to write a design doc, I will start with a Glean chat prompt, which essentially saves a bunch of grunt work for me from going, finding that information, putting references appropriately, and crafting the design. Again, that initial outline, which is produced, I will refine it later with my HI, human intelligence, in order to get it into the final shape.

Srini Penchikala: In what other parts of the SDLC are you seeing AI agents being used?

Hien Luu: I think testing is an area where these models can help quite a bit. Most software engineers probably don't like writing tests that much. I think that's an area where we can leverage these tools, and not just for writing tests, but also for asking insightful questions about edge cases and the other things that Karthik brought up. I would love to use that. Stepping back, I don't code as much anymore, but if I were an engineer and that were 100% of my daily tasks, I would probably give more thought to what kinds of questions to ask, because these LLMs are there to answer our questions.

If we can come up with smart, relevant questions that help with our software engineering tasks, then I would spend a lot more time thinking about what kinds of questions to ask, so that I can improve whatever task I'm doing, whether it's improving the robustness of a microservice, handling throttling, or whatever it is. That's a different mindset, and one that software engineers need to start building: how to think about what thoughtful questions would be useful for their tasks.

Govind Kamtamneni: Yes, a hundred percent. I just want to add that the answers are all there; it's the questions we ask, and how we ask them, that matter. What I'm doing a lot is actually reading a lot. GitHub Copilot obviously has this developer inner loop experience, but there's also the outer loop experience called Padawan. It's similar to Cognition's Devin, and there are others out there that do that. What I use it for, when I go to a new code base, is something like a deep research of the code base: have it generate Mermaid diagrams and so on, and really take a systems thinking approach. I'm using my Kindle a lot more now because I'm reading and understanding, which goes to what Hien and Karthik are saying, so that I can ask the better question and pass the context that is needed to solve the very focused problem or use case that needs to be implemented. We're also going to launch an SRE agent, and there are a bunch of these agentic experiences that will augment you.

At the same time, the onus is ultimately on the human; again, I think it all goes back to asking the right question, because the answers are all there. With these models, like o3 with high reasoning, if you think about just the benchmarks, you are interacting with a polymath that is one of the smartest coders out there. They're saturating all kinds of coding benchmarks, and it's only going to get better. Ultimately, it's up to you to know the user need that you're solving, map it throughout the SDLC process, and leverage these models throughout the process.

Karthik Ramgopal: I want to give a very crude example. When IDEs came about, we saw a transformation in the development process with respect to how people were coding inside the text editors. You had the ability to do structured find and replace. You had the ability to open files side by side. You had syntax highlighting. Developer productivity went up so much. You had debugging within the IDE. This is similar. It's just an additional tool, which gives you even more power over your code. Don't look only at code. Code is an important critical aspect of the SDLC, but there are a lot of other aspects also. There are AI tools which help you automate end-to-end, as Govind and Hien pointed out.

Govind Kamtamneni: I was going to add one quote by Sam Altman. I think it's really catchy. It's like, "Don't be a collector of facts, be a connector of dots".

Srini Penchikala: The facts can be relative. No, I agree with you also. I'm definitely more interested in the shift-left side of the SDLC process. Automatically writing tests for the code is important, but how much of that code is really relevant to your requirements, really relevant to your design? Again, we can use AI agents to solve problems the right way, and we can also use them to solve the right problems. I think it goes both ways. Govind, I'm looking forward to all those different products you mentioned. Padawan sounds like a Star Wars connection there, for the outer loop experience.

Accuracy, Hallucination, and Bias with AI Agents

One thing we've been concerned about with AI programs in general is the accuracy of their output and the hallucinations. Now we bring AI agents into the mix, automate as many tasks as possible, and try to let the agents do what humans have been doing. How do you see this whole space of accuracy, hallucination, and bias transforming with agents? How much worse will it get?

Hien Luu: The short answer is, it's a fact of life, so we have to deal with it. This uncertainty has been with us since day one of building AI systems. They are non-deterministic; they give you a probability for a prediction, whether something is spam or not spam. There are measures we've been taking, but the problem is more exaggerated with hallucination, because the natural language aspect is more challenging than a numeric value. It's definitely a challenge. A lot of enterprises are concerned about applying this to their real-world use cases, where the consequences are not just content generation: it could impact their users or cause damage. It's definitely a big concern for enterprises. There's study after study on why enterprises are way behind in adopting these technologies.

In general, I think these LLMs are getting smarter. The AI frontier labs spend a lot of effort on reducing hallucination, but nevertheless, it's still there. The question is, what can we do as we build these agentic systems? First, understand, for your particular use case or use cases, what the causes of the hallucinations might be. If you're building a very domain-specific agentic system and the models you're using were not trained on your domain, that may be an indication that you have to do something about the knowledge cutoff or that limitation. Understanding what the underlying causes might be for your use cases is the first thing to do.

In terms of what actions, methods, or strategies you can employ, I think there are sets of good practices out there now. Start with grounding; it's a very common technique now, grounding in terms of the context. It goes back to what Govind said: it's all about the context, grounding with the relevant information. That's why RAG has become very powerful, because of the ability to ground the prompt context with relevant content. This is something that a lot of people don't spend a whole lot of effort on.

At the end of the day, we're interacting with LLMs through prompting, and what we say in the prompt matters. Well-crafted prompt engineering is still extremely valuable and relevant for these kinds of challenges: be specific, along with all the other good prompt engineering practices. Another thing you can do beyond that is the guardrails people mentioned earlier. Human in the loop, too, is an aspect you can build into your system, involving the human at the proper time when a response seems suspicious and doesn't pass the smell test.
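
As a small illustration of that human-in-the-loop guardrail, here is a sketch of a confidence gate that routes low-confidence or policy-flagged responses to a reviewer; the threshold, the blocked-terms list, `score_confidence`, and `send_to_human_review` are hypothetical placeholders for whatever checks your system actually uses.

```python
CONFIDENCE_THRESHOLD = 0.8                            # illustrative value, tune per use case
BLOCKED_TERMS = ["guarantee", "refund approved"]      # toy policy list

def score_confidence(response: str) -> float:
    """Hypothetical: could be a critic model, a classifier, or heuristics."""
    raise NotImplementedError

def send_to_human_review(response: str) -> str:
    """Hypothetical: queue the draft for a person to approve or edit."""
    raise NotImplementedError

def deliver(response: str) -> str:
    flagged = any(term in response.lower() for term in BLOCKED_TERMS)
    if flagged or score_confidence(response) < CONFIDENCE_THRESHOLD:
        return send_to_human_review(response)   # escalate instead of auto-sending
    return response                             # safe to send automatically
```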

Then, evaluation, evaluation, evaluation. It's all out there now; whether people are actually doing it or not is a different question. Building evaluation requires a lot of upfront investment, and I think you want to do it iteratively and incrementally as well. It's not something where you can come up with the whole set and then you're done; it needs to be dealt with incrementally. At the end of the day, there are techniques to help with reducing hallucinations, but eliminating them, I think we're not there yet, as far as my understanding goes. I'd love to hear experiences from Govind, actually working with customers, and from Karthik, building agentic AI systems at LinkedIn.

Karthik Ramgopal: I can share some of the other techniques we use. The first is a technique called critique loops, where effectively you have another observer critique the response against the desired outcomes and see if it passes the smell test. Again, this is a technique you use in evals as well.

At runtime, though, you cannot use this very heavily. You cannot use a powerful model or very complex critique logic, because, again, it's going to consume compute cycles and add latency. There's always this tradeoff between quality, latency, and compute cost. In your offline evals, you can use a more powerful model to evaluate the responses of your system and understand where it failed. I think there was a question about, how do I do evaluation? Should I do random sampling? Random sampling is a good starter, but it first starts with the definition of accuracy. Do you have an objective definition of what being accurate means? That is actually quite hard to define comprehensively in these non-deterministic systems.
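
Here is a minimal sketch of the critique-loop pattern, with a generator revised based on feedback from a critic; both `generate` and `critique` are hypothetical placeholders for model calls, and, as noted above, a production system would keep the runtime critic cheap and bounded.

```python
def generate(prompt: str) -> str:
    """Hypothetical call to the main (cheaper) model."""
    raise NotImplementedError

def critique(prompt: str, draft: str) -> dict:
    """Hypothetical call to a critic model; returns {'ok': bool, 'feedback': str}."""
    raise NotImplementedError

def generate_with_critique(prompt: str, max_rounds: int = 2) -> str:
    draft = generate(prompt)
    for _ in range(max_rounds):        # keep this small at runtime for cost and latency
        review = critique(prompt, draft)
        if review["ok"]:               # critic says the draft passes the smell test
            return draft
        # feed the critic's feedback back in and regenerate
        draft = generate(
            f"{prompt}\n\nRevise this draft:\n{draft}\n\nFeedback: {review['feedback']}"
        )
    return draft                       # return best effort after the budget is spent
```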

Once you have that, you can decide how to pick and choose, because, ideally, you do not want to sample at random. You want a representative variation of responses and to ensure that you did well across them. Again, that requires some analysis of your data itself. Something a lot of folks do is capture traces. Of course, they anonymize these traces to ensure that personal information is not emitted in any way. After that, they feed these traces offline to their more powerful model, which runs the eval and tries to figure out what to do. The other interesting technique is that sometimes you really do not need AI, as I said before, because you do not want reasoning.

In those cases, don't use AI, or use a less powerful model. For example, for a bunch of classification tasks, you could use a much simpler model; you don't need an LLM. Of course, it's easier to do with an LLM, but is it the most efficient? Is it the most reliable? Probably not. It's not going to be the cheapest option either. In some cases, you can just fall back to business logic. Last but not least, and I say this very carefully, sometimes, when all else fails, you can also apply various kinds of fine-tuning techniques to create fine-tuned models. I say when all else fails because people prematurely jump to it without trying all they can with prompt engineering, RAG, and the right systems architecture, and because it's expensive, since the foundation models keep advancing.

As long as you have a fine-tuned model, you have to maintain it. You have to ensure that when the task specification changes, you haven't lost generalization, which results in worse performance. Pick your poison carefully, but it is also a technique which is useful in some cases.
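
As a small illustration of "you don't need an LLM for every classification task", here is a sketch of a lightweight text classifier using scikit-learn; the tiny training set is purely illustrative, and a real system would train on labeled production data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data: route support messages to "billing" or "shipping".
texts = [
    "I was charged twice on my invoice", "Update my credit card",
    "Where is my package?", "The tracking number does not work",
]
labels = ["billing", "billing", "shipping", "shipping"]

# A cheap, fast, deterministic classifier -- no LLM round trip required.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["Package tracking shows no update"]))  # likely ['shipping'] on this toy data
```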

Govind Kamtamneni: Prompt rewrite is another one that is pretty much baked in now, for example, in our search service. Things like that can help a lot.

Srini Penchikala: Yes, definitely. I agree with you all. Like you all mentioned, with AI, there's a lot more options available. As a developer, as an end user, I would like to have more options that I can pick from rather than fewer options. With more options comes more evaluation and more discipline. That's what it is.

Model Context Protocol (MCP) and the Evolution of AI Solutions

Regarding the next topic, we can jump into probably the biggest recent development in this space, the Model Context Protocol, MCP. I've been seeing several publications and articles on this on a daily basis. They claim it's an open protocol that standardizes how applications provide context to LLMs. They also say that it will help you build agents and complex workflows on top of LLMs. That's the definition on their website. Can you share your experience of how you see MCP and where it fits, overall, in the evolution of AI solutions? How can it help? What can we use it for? Where do you see this going? I think it's a good standard that we have now, but like any standard, it will probably have to evolve.

Govind Kamtamneni: Actually, let's think about a world without any standards or protocols, which is how things are developed right now. If you have an agentic system, let's pick user onboarding just for the sake of it. Say you're building a Compound AI system that does user onboarding and hopefully has some agency in some of the workflow tasks. When a new employee joins, the developer building the orchestration layer has to understand the SuccessFactors or ServiceNow APIs and deterministically code against those API specs, and also handle the day-two aspect of that.

If ServiceNow changes their interface, this developer has to change the orchestration layer. If you think about employee onboarding, maybe you first have to create a new employee record in SuccessFactors or something like that. Then you might have to issue a laptop; that could be a ticket in ServiceNow with all the details. Then maybe you notify the manager or get the manager's approval; that could be a Teams message. You need to know all these API specs and maintain them and all that. I think going forward, it's becoming clear that we're not just developing systems for humans, but actually for AI agents to consume. What if all these systems instead conformed to a standard protocol, and hopefully a standard transport layer? That's what MCP is basically proposing. Anthropic started it, and it's fair to say that everyone is embracing it. Let's see how it goes, because the governance layer is still shaky.

For now, at least, everyone is essentially building a simple facade proxy, an MCP server, on top of their existing SuccessFactors, ServiceNow, or GitHub. The MCP host, let's pick VS Code, for example, or GitHub Copilot, or this onboarding agentic system, can then work with these MCP servers, and they all follow the standard. The client developer now doesn't have to know SuccessFactors' idiosyncratic APIs, or implement and maintain them; it's the same MCP standard across all these systems. Ultimately, these models are great at reasoning, but for them to be economically useful, they have to work with a lot of systems in the real world. For example, for this onboarding workflow to be successful, it has to integrate with all these systems. I think going forward it's great that we finally have some standard, at least for now.
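
To make the facade idea concrete, here is a minimal sketch of an MCP server exposing one tool, written against the official MCP Python SDK's FastMCP interface as I understand it; the `create_onboarding_ticket` tool and its behavior are hypothetical, not a real ServiceNow integration.

```python
# Minimal MCP server sketch (assumes the official MCP Python SDK: `pip install mcp`).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("onboarding")   # server name shown to MCP hosts like VS Code

@mcp.tool()
def create_onboarding_ticket(employee_name: str, laptop_model: str) -> str:
    """Create a (pretend) IT ticket for a new employee's laptop."""
    # Hypothetical: a real server would call the ticketing system's API here.
    return f"Ticket created for {employee_name}: provision {laptop_model}"

if __name__ == "__main__":
    mcp.run()   # serves over stdio by default so an MCP host can connect to it
```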

Then, of course, there's also the Agent-to-Agent standard that is emerging. There's a new standard I was just reading about that is a superset of Agent-to-Agent, I think it's called NANDA. We'll see which one wins out. It's good that the industry is rallying. This is nothing new; in the past, obviously, we standardized on communicating over HTTP, a standard established by the Internet Foundation or something. There's the CNCF, and Kubernetes is pretty much the standard orchestration layer now. I think standards bodies are good. It was just a matter of time, especially for a platform shift this big; I think it's bigger than the internet. Obviously, MCP is a standard for agents to talk with resources and tools.

Primarily it's tools and resources, but there are also prompts and other things there. There's also, hopefully, a standard that'll emerge for agents to talk to other agents and have a registry; that's what A2A is trying to do. Then there's this other swarm-of-agents standard by MIT that's also out there. There's a lot there, but that's where we are right now. I do think it's an evolving space.

Karthik Ramgopal: I think MCP is still an evolving standard. It's a great way to connect externally. It still has some gaps, primarily around security, authentication, and authorization, which are pretty critical, which is why we haven't deployed it internally yet. I'm sure it's just a matter of time before the community solves these problems. I think the important thing to note here is that MCP is primarily a protocol for calling tools. It isn't a protocol for Agent-to-Agent communication, because Agent-to-Agent isn't a synchronous RPC, or even a streaming RPC. It's way more complicated. You have asynchronous handoffs, you have humans in the loop, you have multi-interaction patterns, which is why I think protocols like A2A, agency, and the other emerging ones will be better fits. Right now, amidst the hype, everyone is trying to force fit MCP everywhere. Be thoughtful about where you use it and where you don't.

Hien Luu: It makes a lot of sense. I think everybody agrees that it has its place for tool use. If that becomes a standard, then it opens up a whole slew of possibilities and makes tool use easy to integrate. An example that comes to mind from the past is REST, a convention for dealing with HTTP, but this is specifically for LLMs. Let's see where it's going. There's an explosion of MCP servers out there. Let's see where it goes. I think it makes a lot of sense.

The Future of AI Agents - Looking into the Crystal Ball

Srini Penchikala: To conclude this webinar, I would like to put you all on the spot. What's a prediction that you think will happen in the next 12 months? If we were to have a similar discussion 12 months from now, what do you think we should be excited about?

Karthik Ramgopal: I am feeling like an oracle today, so I will predict one thing. I think that the transformer architecture and LxMs in general will start getting increasingly applied to traditional relevance surfaces like search and recommendation systems, because right now the cost curve as well as the technological advancement is at a point where it is starting to become feasible. I think we will see an improvement in quality of these surfaces as well, apart from the agentic applications. That's my prediction.

Govind Kamtamneni: I have this AGI thing up there. I take Satya's approach, which is, if it can be economically useful, and I think his number is $100 billion of economic value, then it's AGI. We see a lot of benchmarks absolutely saturating, but at the end of the day, can it improve human well-being? That is GDP. If we can, whether it's through MCP or whatever, start actually giving actuators to these reasoners. Hopefully, again, the ultimate goal is output per human going up. We're starting to see that. Cursor was the first product, and I'll say that even as a competitor, to hit $100 million ARR, the fastest ever to $100 million in recurring revenue. Hopefully, we'll see it in other domains, not just software engineering, and human well-being actually improving with everything we're doing. Hopefully, that'll happen in the next two years.

Hien Luu: This is probably less of a prediction, but it's something I would love to see, especially in the enterprise: how AI agents are being applied in enterprise scenarios. I'd love to see more practical use cases out there and see how they really work out in the enterprise. The part that's exciting is Agent-to-Agent, multi-agent systems. There's a lot of discussion about that, and it seems pretty fascinating: you get these agents talking to each other and doing their own things. I would love to see how that manifests in real-world, useful use cases as well. Hopefully we'll see those in the next 12 months.

Srini Penchikala: I'm kind of the same way. I think these agents will help us have less thrashing at work and in our personal lives, so we can focus on more important things, whatever that means, enjoy our lives, and also help the community.

Govind Kamtamneni: I just wanted to add, shared prosperity. Wherever it happens, people should not be afraid. There should be a positive outcome for everyone.

Srini Penchikala: To quote Spock.

 


Recorded at:

Jul 09, 2025
