Achieving Precision in AI: Retrieving the Right Data Using AI Agents

Summary

Adi Polak explains the path from GenAI prototype to production by focusing on precision - the competitive edge. She details Agentic RAG architectures, emergent agent design patterns, and crucial feedback loops (LLM-as-a-judge) for refinement. Learn how to leverage data streaming (Kafka) to manage collaboration, memory, and scale microservices in real-time agent systems.

Bio

Adi Polak is an experienced Software Engineer and people manager. She has worked with data and machine learning for operations and analytics for over a decade. As a data practitioner, she developed algorithms to solve real-world problems using machine learning techniques. Adi builds high-performance teams focused on trust, excellence, and ownership.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Adi Polak: What is wrong in this slide? Can you tell me? If you are very detail-oriented and look into the text, you might see something like "AAI", or you might see "precision" spelled with a double I or a double S. You might see that the text is not exactly as it should be, and these are really the challenges that we're facing with generative AI. Achieving precision is one of the hardest things that we need to do in order to actually operationalize AI and go from zero to one, from an MVP or prototype to actual production, and see things that work. Here's another example of a slide. You can see again the double S in precision.

Probably the model really likes the double S. I don't know what the reason for it is, but it is what it is. How does precision work here? There are some errors and some mistakes that happened along the way. This is fine when it's just me creating presentations, but it's not so great when we're giving our customers bad information and our customers can later sue us. This is a real case that happened to Air Canada.

Essentially, they created a bot and each one of us can interact with that bot to check what happens with our flight. Is it going to be on time? Is it going to be delayed? Do we need to change seats? What happens to our luggage? The chatbot misled the customer. The customer lost some big business; they were traveling for business. They sued Air Canada for misleading them, and that cost the company a lot of money and a lot of reputation. As we move from generative AI just for the sake of playing with it and having fun, from zero to one, we actually want to be able to productionize it without these lawsuits coming at us.

This is a slide that actually worked well. It took me about 50 iterations to get to it. Let's talk about achieving precision in AI and fine-tuning agentic RAG solutions specifically. We know by now if we want to take our AI to production, which we do, precision is not a thing that we can give up on. It's not optional. It's actually going to be our differentiator. It's going to be what is going to make the decisions for our customers to use or not use our products at the end of the day. The question is, how do we even measure precision? Where do we start? How do we go about that? Later on, how do we improve it if we wish to?

If we look at traditional ML, the ML that existed for many years before generative AI took over, we had a specific equation. There are precision and recall, and precision, at the end of the day, is true positives divided by true positives plus false positives. True positives are the instances that the algorithm correctly classified as belonging to the positive class, and then we divide by everything the algorithm returned: how many relevant results are there among all the results that came back. Recall, by contrast, divides true positives by true positives plus false negatives. This is how precision was defined. The question with GenAI is, when we're generating a text, when we're creating an image, how can a model or an automated system, not necessarily a human being, know whether this is a good result or not? This is where we stand now, trying to understand what precision is in large language models, what precision is in diffusion models, and how we go about measuring it.
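
In formula form, with TP, FP, and FN denoting true positives, false positives, and false negatives, the standard definitions are:

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
```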

Background

A little bit about me. I wrote two books. One is "Scaling Machine Learning with Spark", published with O'Reilly. The second one is the second edition of "High Performance Spark", where we dive into the details and the nitty-gritty of how to optimize large machine learning workloads, but not only that. I'm a people manager. I work for a company named Confluent. Confluent builds the data streaming platform. Has anyone here heard about Apache Kafka? We are the main contributor to Apache Kafka and Apache Flink. We're also contributing to Apache Iceberg because we believe this is part of the future. Confluent built the data streaming platform on top of all these technologies to enable people to do real-time processing and real-time data streaming, and so on. My background: I started in the machine learning space about 15 years back. I moved to the big data space. I worked a lot with Spark, a lot with Hadoop. Today with Confluent I'm at the cutting edge of data streaming, and this is what we're going to talk about.

How do we solve for precision? What are the two options that we usually have? One is data-centric optimization. We're now realizing that everything we do with AI at the end of the day really relies on our data. The second one is inference optimization. We're going to talk about these two categories. For data-centric optimization, we usually hear about something called RAG. Has anyone heard about RAG? We're going to dive deeper into it. RAG is really relevant for us in the data streaming space because we want to give concrete information about the relevant things that happen in the world right now. We want to ground everything that the GenAI algorithms give us in current facts.

The second one is domain-specific fine-tuning. Domain-specific fine-tuning is me taking an existing algorithm, an existing model, and adding some documents to it to fine-tune it and make it an expert in one specific domain. Later on, I can take it and deploy it, and it acts as an expert in that specific domain. The difference between the two: RAG is often used in data streaming, and domain-specific fine-tuning is often done in batch inference or batch workloads. We fine-tune the model, and later on we can serve it on the streaming or application side of things. These are two different parts of data-centric optimization. We're going to dive into RAG because I think there are some very interesting patterns that emerge there, specifically with agents. Later on, we'll touch a little bit on inference optimization. Again, from the data point of view, from how we're building scalable systems, and less about the NVIDIA GPUs, CUDA, and so on, although that's an interesting space nonetheless.

RAG - Retrieval-Augmented Generation

RAG. What is RAG? Retrieval-Augmented Generation. Start with something very simple, like the LLM only: we have a user, the user sends a prompt, the LLM takes the prompt and generates an output. With RAG specifically, we're adding a couple of layers. We have the user query, then we retrieve the data, we look for some relevant information. My algorithm there is going to do some search, and we're going to dive into it. Later on, we augment that information. We're doing some embeddings and so on, but we're augmenting that information with the prompt of the user before sending it to the LLM to generate the final response. That means I'm doing some manipulation, some massaging of the user query itself before handing it over to the next step.

Retrieve is a very important action that we do here in RAG. It's actually going to define how good my RAG is. I need to make sure that my database there is accurate, up to date, and something that I can trust. Augment, like I mentioned, means taking the query, taking whatever came out of the retrieval, and pushing it to the LLM. Pretty straightforward. When we think about it from a data point of view, we're looking at indexing, a very important capability that I believe everyone who works in data has seen and heard about before, and we're looking at retrieval. The way I'm indexing my data is going to define how fast I can retrieve that specific data, because when we do search, we want to be able to retrieve the right data, right on time.
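
As a rough illustration of that retrieve-augment-generate flow, here is a minimal sketch in Python; `vector_store` and `llm` are hypothetical placeholders for whatever retrieval backend and model client you actually use:

```python
def answer_with_rag(user_query: str, vector_store, llm, top_k: int = 3) -> str:
    # Retrieve: find the most relevant documents for the query.
    documents = vector_store.search(user_query, top_k=top_k)

    # Augment: splice the retrieved context into the prompt.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n"
    )

    # Generate: let the LLM produce the final, grounded response.
    return llm.generate(prompt)
```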

If we're looking at different types of RAG and RAG capabilities, the first one is classic term search. Has anyone here heard about TF-IDF? Yes, the search world. If you ever worked with technologies like Elastic or OpenSearch, at the end of the day there's a document store, and they're doing some TF-IDF. It's an algorithm behind the scenes that helps them classify and build the indexes for these documents that are considered unstructured or semi-structured data. The reason it requires a specific algorithm is that there are no specific keys, no specific columns that we're looking to index the data over, but we do want to pull data very fast. TF-IDF is a dedicated algorithm from the information retrieval world, long used by search engines, built to do exactly these things.
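
To make the term-search idea concrete, here is a small sketch using scikit-learn's TfidfVectorizer on a toy corpus; the documents and query are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A toy corpus standing in for the unstructured documents we want to index.
documents = [
    "Flight AC101 is delayed by two hours due to weather.",
    "Baggage allowance is one carry-on and one checked bag.",
    "Seat changes can be made up to 24 hours before departure.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)   # build the TF-IDF index

query = "checked bag allowance"                    # term search needs the right terms
query_vector = vectorizer.transform([query])

# Rows are L2-normalized by default, so the dot product is cosine similarity.
scores = (doc_matrix @ query_vector.T).toarray().ravel()
best_index = scores.argmax()
print(f"best match (score={scores[best_index]:.2f}): {documents[best_index]}")
```

Note that the query only matches because it reuses the exact terms from the document, which is precisely the limitation discussed next.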

The challenge with term search is that you need to know the specific term you're looking for. Today with generative AI, we don't always have the specific term. It was good enough when we went to Google maybe a year ago and searched for specific things; if we knew the exact keyword, we would find that information. Today it's not enough and we need more. This is where similarity search comes into play.

Similarity search is very different from term search and TF-IDF: it's actually searching for something that is like what I'm looking for right now. I don't have to know the specific keyword anymore, but I'm looking for similarity, which makes it even more complex in terms of algorithms, and we'll get to it. The last one is graph search. Graph search is based on relationships between different entities. It's again something that we can build as part of our capabilities, and you will see that when building agents we will need all three of them for different cases of improving our agents' precision at the end of the day.

Let's dive a little bit into the similarity one. Similarity search is usually based on vector search. What happens is we take all the information and run it through some embedding model, which gives us embedding vectors. At query time, we take the user query, pass it through the embedding algorithm, and get a vector that we then want to compare for similarity against the rest of the vectors. This takes a long time, but it's critical in order to find what is similar out of the rest of the world. After we get that information back, we have the relevant embedding vectors, and we might get some weights attached to them as well.

Then we take it, we transform it back into the prompt text that we had before, and we can later pass it to the algorithm and get the response. This is how similarity search works. You can imagine that at the end of the day there are some matrices, some arrays behind the scenes, but this is it. There are multiple algorithms that implement it today, and everyone is trying to improve it. Elastic released some capabilities around that. MongoDB released some capabilities around that. All the big databases that ever gave a solution to term search, or any search, are now incorporating vector search and embedding models there as well. We moved from LLM-only to this RAG application, and now we know we need to pull in some relevant results.
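
Here is a minimal sketch of the vector-search step; the `embed` function is a stand-in for a real embedding model, so the scores only demonstrate the mechanics of ranking by cosine similarity, not real semantics:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vector = rng.normal(size=384)
    return vector / np.linalg.norm(vector)   # unit length, so dot product = cosine

# Index: embed every document once and stack the vectors into a matrix.
documents = ["refund policy", "baggage rules", "seat upgrades"]
index = np.vstack([embed(doc) for doc in documents])

# Query: embed the user text, then rank documents by cosine similarity.
query_vector = embed("how do I get my money back")
scores = index @ query_vector
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:+.3f}  {documents[i]}")
```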

Where are the challenges? This is a long list of challenges that we assembled at Confluent. Some of the results we bring back are irrelevant or outdated. My data is not being updated in real time. The queries are a little bit ambiguous. Sometimes it takes a very long time to get back the result: although all my data pipelines and all my application pipelines work at a latency of milliseconds in order to serve my customers and my application in milliseconds, sometimes things get stuck in the retrieval phase.

That brings scalability issues and latency issues that need to be solved. The second challenge is the augmentation phase. What is the level of integration? How can we augment the response back? Is there something that we can improve there? What are the token limitations that I have? Because sometimes when I'm working against a REST API, I have some token limitations. If you work against OpenAI or Claude or any of the other models, you would know that at some point they start to block you, they start to slow you down because of the token limitations, and you have to cut some of the information you're sending over and make a conscious decision about where to make the cut. The third thing is the generation phase: hallucinations, but we know RAG can help us there. The last thing is systemic challenges: latency bottlenecks, lack of transparency, outdated data, all the things we know how to solve in the data space.

How do we improve retrieval first? I mentioned there are a couple of types of RAG, and the first approach is hybrid search. For some of the information that comes in, I can leverage an agent that does classification for me and tells me whether I need to go through a term search or through a similarity search, and for which part. I can combine this hybrid search and leverage all three different capabilities at my disposal in order to do that. The second thing is re-ranking. When I'm doing TF-IDF or when I'm doing similarity search, I get back ranked information about which results were similar enough or which one is actually the relevant information to return. I can ask the algorithm to do something called re-ranking, or I can add some salt or temperature back into the algorithm to help me find information that is a little bit more relevant. I can do summarized text comparison. Essentially, I can take the information that the user gave me, which may be a very long text.

First of all, I will do some summarization before sending it to any embedding model. This is how I can cut down the length of the input, so I'm sending fewer tokens and the response I get back will be more accurate. Sometimes users will use many words to convey their story, and some of them are words that are no longer relevant, so there's easy string manipulation that we can do on top of the data. It also relates to contextual chunk retrieval. Sometimes we want to break the text into multiple chunks and send them to the embedding model for retrieval in order to improve that. Again, something that we've seen in software engineering and in the data space pretty often. Prompt refinement: taking the prompt and asking a specific model, something faster that doesn't go through RAG, just to improve that specific prompt from the user, to refine the original prompt. It's very straightforward.

It's something that we experimented with, and we actually got very good results with it, because an LLM as a refinement step works really well. Domain-specific tuning: if I have specific models that I want to use, I can do some classification of the prompt and then send it only to specific LLMs that were fine-tuned on that specific requirement. This is what I call the small brains. Sometimes you don't need the large model; sometimes you can use a small brain that you deploy somewhere and leverage for that specific domain. There are, of course, many other optimizations you can do, and it becomes very application-specific. As you can imagine, I already mentioned what I call the small brain, and this is where we're actually entering the agent world. Because in the agent world, the idea is to decrease the range of mistakes we can make by being very specific and very focused with the task we're asking the model to do.
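
A minimal sketch of the hybrid-search idea, assuming hypothetical `term_search` and `vector_search` callables that each return scored hits; the routing rule here is a toy heuristic standing in for a small classifier agent:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    score: float

def route_query(query: str) -> str:
    """Toy router: quoted phrases or ID-like tokens go to term search, the rest
    to similarity search. A production system might ask a small LLM instead."""
    if '"' in query or any(token.isupper() for token in query.split()):
        return "term"
    return "vector"

def hybrid_search(query: str, term_search, vector_search, top_k: int = 5) -> list[Hit]:
    route = route_query(query)
    hits = term_search(query) if route == "term" else vector_search(query)

    # Re-rank: here simply by score; a dedicated re-ranking model could reorder these.
    return sorted(hits, key=lambda hit: hit.score, reverse=True)[:top_k]
```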

Agentic RAG

We're entering the agentic world and we're entering agentic RAG. How did we come to that? What happened to us? We started many years back with purpose-built AI. Maybe you've seen it somewhere before: classification, prediction, very traditional. We moved into generative AI about two or three years ago, with what we called the brain, and we tried to throw all information at it. Lastly, today what we have is the agentic phase. We're being very focused. We're being very outcome-focused with specific agents. We're training them, we're fine-tuning them to give us specific results, and we give them tools. Now we're moving away from asking for an answer to everything, "build me the new e-commerce application". That is not going to work with agents. We're making it very specific.

Create this database connector, for example. We give it the tools to do that work, and we're going to look into how it can do that. Just to summarize: agents are taking us from what used to be programmatic, very specific sequential code that we could write, into a more autonomous, variable-flow world, where we have less control over the logic itself. To achieve precision there, we want to move into a space that gives us more feedback about what we do, so we can actually improve the model.

What is an agent? I have my small brain, the LLM, the agent core. I have some memory model that I need to take care of, and there we'll usually have different layers of memory, what we call in the data space caching. Caching is something we know how to solve for. We have tools. We give it the capability to go and generate some SQL. We give it the capability to leverage some template to create SQL: here are all the tables, here are the accesses you have, please generate this SQL in order to answer this question. We give it the capability to plan. Some of the agentic models that we have are actually planners, so they're going to help us do the work that we want to achieve. Here's a high-level example of a BI company that decided to leverage an LLM- and agent-driven application for understanding what the sales will be for the next three months. A user can put in a question. Let's assume we have some application, some agent in the middle, that is the controller.

The controller's job, when it initializes, is to understand what tools are available. We give it access to some folder: please figure out from this folder directory which tools you have in place. Maybe it's a template for some SQL. Maybe it's some other things. Then it starts and manages the flow. What my agentic app is going to do is start delegating to all the different apps. You can see there is a planner there. It's going to tell us if we have a solid plan. There's a SQL generator there. It's going to generate some SQL, maybe to answer: what were the previous sales like? Maybe we can learn about the next quarter from the previous quarter. Then we'll have an agent that will execute the SQL: take it, connect to some MongoDB, connect to some Postgres, execute this SQL for me.

The last thing, a very important part of every agent system, is the judge. LLM-as-a-judge has proved itself highly valuable as a feedback loop mechanism that goes beyond the human in the loop. After I get all the responses back to my controller, we want to go to the judge and ask it: did I get enough information here, and was my plan good or not? After we go back and query all the previous sales information, it could be that the judge comes back and tells us: it's great that you did a prediction based on previous sales.

Actually, if you want to know how much in sales you have now, you want to look at marketing campaigns. You want to look at how many people the marketing campaigns actually reached. Now you have to go back, generate a new SQL query, execute that SQL, and gain more information before sending the answer to the user. This is an iterative cycle that the controller, with the planner, with the judge, with all the tools that it has, is going to continue until they find a result that they're happy with or until we give them a stop. We can also say: stop after three or four retries of that experiment.
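
A minimal sketch of that controller loop, with hypothetical `planner`, `tools`, `judge`, and `llm` objects; the point is the iterate-until-the-judge-is-satisfied shape, bounded by a retry budget:

```python
MAX_RETRIES = 4

def run_controller(question: str, planner, tools, judge, llm) -> str:
    """Iterate plan -> act -> judge until the judge is satisfied or retries run out."""
    context: list[str] = []
    for attempt in range(MAX_RETRIES):
        plan = planner.make_plan(question, context)

        # Execute each step with whichever tool the plan calls for
        # (e.g. generate SQL, run SQL, call an API).
        for step in plan.steps:
            context.append(tools[step.tool].run(step.arguments))

        draft = llm.generate(question, context)
        verdict = judge.evaluate(question, draft, context)
        if verdict.good_enough:
            return draft
        # Feed the judge's critique back in so the next plan can improve.
        context.append(f"judge feedback: {verdict.critique}")

    return draft  # best effort after the retry budget is exhausted
```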

That brings me to types of agents. There are multiple types. We give them tools. We give them autonomy. What are we asking of them? We give them goals, and we tell them: go achieve that very specific goal. What are the patterns that have emerged from agents that we can actually start to leverage right now? We have perception agents: agents that have access to IoT devices, agents that have access to my log system, agents that have access to different SQL databases or anything that comes across. There are reasoning agents, agents that were specifically fine-tuned to do reasoning on top of my specific domain.

Lastly, there are executing agents, agents that are actually going to go and execute that SQL, bring back that information, make that decision. At a high level we can classify them into these three types of agents. The patterns become very interesting when we think about it, and more patterns emerge every day. At a high level, these are the four patterns that are most talked about and used in the industry. We're going to go over them one by one so you can understand what each one of them means. Let's talk about the orchestrator. There's a little bit of code. The orchestrator essentially acts as the central controller of everything that we do. It breaks down every task into smaller tasks and starts to delegate to workers, to agents that are relevant for that task. You can see here in the code we have the orchestrator agent and we have the worker agent.

Then when we run the execute, essentially the orchestrator is calling the worker and creating that request-response flow that comes in and out. The communication mechanism has to be very good. We all know that in these large-scale systems, the communication layer might become a bottleneck, so it's something to think about as we build it. This is one pattern that emerged and people are using more and more. This is just another sample of code, of how the worker agent handles the delegated work.
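
The slide's code isn't reproduced in the transcript, so here is a minimal stand-in sketch of the orchestrator-worker shape: the orchestrator decomposes a task (hard-coded here, where a real one would ask an LLM) and delegates each subtask to a specialized worker:

```python
class WorkerAgent:
    def __init__(self, name: str, skill: str):
        self.name = name
        self.skill = skill

    def execute(self, subtask: str) -> str:
        # A real worker would call an LLM or a tool; here we just echo the work done.
        return f"[{self.name}] completed '{subtask}' using {self.skill}"

class OrchestratorAgent:
    def __init__(self, workers: dict[str, WorkerAgent]):
        self.workers = workers

    def decompose(self, task: str) -> list[tuple[str, str]]:
        # Hard-coded decomposition for illustration; a real orchestrator would plan this.
        return [("research", f"gather data for: {task}"),
                ("sql", f"generate and run the query for: {task}"),
                ("report", f"summarize results for: {task}")]

    def execute(self, task: str) -> list[str]:
        return [self.workers[skill].execute(subtask)
                for skill, subtask in self.decompose(task)]

orchestrator = OrchestratorAgent({
    "research": WorkerAgent("researcher", "web search"),
    "sql": WorkerAgent("analyst", "text-to-SQL"),
    "report": WorkerAgent("writer", "summarization"),
})
print("\n".join(orchestrator.execute("forecast next quarter's sales")))
```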

The second pattern that emerged is hierarchical. Essentially, I have my top-level agent that speaks to a mid-level agent, which speaks to a low-level agent. If you think about something like a planner, a planner would be more of a top-level agent. Essentially, we want to break the work into layered structures, where multiple delegations happen in a hierarchical manner and each one of the agents was fine-tuned to do something very specific. We're entering a world where modularity becomes the next challenge we need to solve, but it's something we've actually solved pretty well before.

Here, the feedback loop again becomes very interesting, so we would always have some judge or some human in the loop to help us with that as well. This is just looking deeper into the model. We have the mid-level agents, let's say Zone Lead 1 and Zone Lead 2, the low-level agents Robot 1 and Robot 2, and we have the manager that helps us manage the specifics of restocking inventory. This is how they coordinate and communicate while the top agent starts to execute the inventory restock through the rest of the agents.
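
A minimal sketch of that hierarchy, using the warehouse example from the slide; the names and the even split of work are invented for illustration:

```python
class RobotAgent:
    def __init__(self, name: str):
        self.name = name

    def restock(self, item: str, quantity: int) -> str:
        return f"{self.name} restocked {quantity} x {item}"

class ZoneLeadAgent:
    def __init__(self, name: str, robots: list[RobotAgent]):
        self.name = name
        self.robots = robots

    def restock(self, item: str, quantity: int) -> list[str]:
        # Mid-level agent: split the order evenly across its robots.
        share = quantity // len(self.robots)
        return [robot.restock(item, share) for robot in self.robots]

class WarehouseManagerAgent:
    def __init__(self, zone_leads: list[ZoneLeadAgent]):
        self.zone_leads = zone_leads

    def restock_inventory(self, item: str, quantity: int) -> list[str]:
        # Top-level agent: delegate the order down the hierarchy.
        per_zone = quantity // len(self.zone_leads)
        return [line for lead in self.zone_leads
                for line in lead.restock(item, per_zone)]

manager = WarehouseManagerAgent([
    ZoneLeadAgent("zone-lead-1", [RobotAgent("robot-1"), RobotAgent("robot-2")]),
    ZoneLeadAgent("zone-lead-2", [RobotAgent("robot-3"), RobotAgent("robot-4")]),
])
print("\n".join(manager.restock_inventory("widgets", 400)))
```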

The next pattern is the blackboard design pattern. Blackboard is essentially a collaborative problem-solving approach where you have multiple agents, each with some knowledge source attached to it, maybe with retrieval capabilities over a specific source. They share the workspace, so they have a shared memory, the blackboard that they're sharing. The idea here is that they need to tackle some complex task; maybe they're doing deep research and each one of them has access to a different part of the research. Maybe one can do PDF analysis. Maybe one can search the web. Maybe one can go to a specific database that we own, and so on.

Here, the agents are specialized to do only one thing and one thing alone, and they make incremental progress, which means they can learn from previous mistakes. In the example, there's a medical diagnostic system, and there's a blackboard with the medical information that we have, maybe about a specific patient. Each one of the agents goes to different sources in order to contribute, and updates the blackboard with what it found in the information sources it has access to. The blackboard is initialized with some specific data, not all the data, only some specific data, for example, patient ID and patient scenario, and then each agent goes and continues to operate and do its search. You can imagine how it's an iterative refinement until they find an answer that's suitable for the problem they want to solve.
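
A minimal sketch of the blackboard shape, with toy specialist agents for the medical-diagnosis example; the `solve` callables are placeholders for real retrieval or model calls:

```python
class Blackboard:
    """Shared workspace that specialist agents read from and write to."""
    def __init__(self, **initial_facts):
        self.facts = dict(initial_facts)

    def post(self, key: str, value):
        self.facts[key] = value

class SpecialistAgent:
    def __init__(self, name: str, needs: list[str], produces: str, solve):
        self.name, self.needs, self.produces, self.solve = name, needs, produces, solve

    def try_contribute(self, board: Blackboard) -> bool:
        # Contribute only when the inputs this specialist needs are on the board
        # and its own result is not there yet.
        if self.produces in board.facts or not all(n in board.facts for n in self.needs):
            return False
        board.post(self.produces, self.solve(board.facts))
        return True

agents = [
    SpecialistAgent("lab-agent", ["patient_id"], "lab_results",
                    lambda facts: f"labs for {facts['patient_id']}"),
    SpecialistAgent("imaging-agent", ["patient_id"], "imaging",
                    lambda facts: f"scan for {facts['patient_id']}"),
    SpecialistAgent("diagnosis-agent", ["lab_results", "imaging"], "diagnosis",
                    lambda facts: "diagnosis based on labs + imaging"),
]

board = Blackboard(patient_id="patient-42", scenario="chest pain")
# Iterative refinement: loop until no agent can add anything new.
while any(agent.try_contribute(board) for agent in agents):
    pass
print(board.facts)
```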

Another pattern is the market-based design pattern. The market-based design pattern is essentially a decentralized approach, like most of our agent patterns, but set in a marketplace-like environment. Let's say we're bidding on who's going to take this shipment to the customer. Say I'm building an e-commerce business and I'm doing shipping, and I have three different ways of shipping a specific product: one is a drone, the second one is a truck, the third one is a robot. The question is, in an auction, how am I going to deliver that package? Each one of these agents needs to bid, needs to understand how much it costs them to actually do the shipping. Is there value in doing this shipping, and what is the bid they're willing to place to actually win it? You can imagine how this market-based design pattern translates to other domains beyond delivery, for example, advertising and marketing tech, where we need to bid for a specific advertisement space.

There are many other kinds of bidding that we do, but essentially what each one of the agents needs to do here is predict the duration, how long it will take them to ship it; calculate the bid based on that information; and update its learnings: ok, I didn't win that bid, my goal is to win the bid, I didn't win that specific bid because I didn't have the best bid, so what did I learn from that? They're capturing the history of what happened in the market, discretizing state, and calculating rewards. If I did win it, what was the reward? If I didn't win it, was there a potential reward for me to learn from? We're humanizing these agents through code, essentially. This is the market-based design pattern. This slide looks more specifically into the code. You can see how we run the auction, what the auction market is, what we add, how we go about it, and what the historical append is that we do there. Essentially, it's a known pattern that works, and people are already using it today, combining that small brain of agents and putting it into very specific implementations related to markets.
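
A minimal sketch of that auction, with invented costs; each bidder records the market history so that, in a fuller version, it could adjust its margin toward the winning price:

```python
from dataclasses import dataclass, field

@dataclass
class BidderAgent:
    name: str
    cost_per_km: float
    history: list = field(default_factory=list)

    def bid(self, distance_km: float) -> float:
        # Bid = own cost plus a fixed margin; a smarter agent would learn the margin.
        return round(distance_km * self.cost_per_km * 1.2, 2)

    def learn(self, won: bool, winning_bid: float):
        # Capture market history so future bids can move toward the winning price.
        self.history.append({"won": won, "winning_bid": winning_bid})

def run_auction(agents: list, distance_km: float) -> BidderAgent:
    bids = {agent: agent.bid(distance_km) for agent in agents}
    winner = min(bids, key=bids.get)          # lowest delivery bid wins
    for agent in agents:
        agent.learn(agent is winner, bids[winner])
    print({agent.name: amount for agent, amount in bids.items()},
          "-> winner:", winner.name)
    return winner

fleet = [BidderAgent("drone", 0.8), BidderAgent("truck", 0.5), BidderAgent("robot", 1.1)]
run_auction(fleet, distance_km=12.0)
```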

At a high level, just zooming out around agents, what are the components that we need? We need the application that can talk to an existing LLM. We can deploy the LLM close to where the application runs, because we're now talking about relatively small, fine-tuned models that we can deploy closer to us, so we can reduce the latency of having a packet of information travel outside of our cluster and come back. We need memory: short-term and long-term.

If I'm retaining some information about a customer, it could be long-term information or long-term learnings, and I need to be able to build that, so I might have some caching capabilities on one side and some blob storage that I want to access every once in a while to improve the learning. I need to give it the tools, what it can access: the warehouse, other applications, the database, IoT devices, everything that that specific agent can use when it starts. It needs a feedback loop, a feedback system. When we looked at the market pattern, the feedback was: someone else won the bid, give me back that information.

The main agent is going to send back the information of who won the bid so I can actually improve my actions next time and give a bid that is closer to the one that won, if it still gives the agent a reward. There's a shared memory because agents collaborate; there's some collaboration between them depending on the model that we're building, and it needs to work for both short-term and long-term memory. An agent can also be attached to something like plan-and-act, which you see here; this is called the ReAct pattern. Essentially, the tools that we're giving the agents let them act on our behalf and do the work.
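
A minimal sketch of a ReAct-style loop, under the assumption of a hypothetical `llm.next_step` call that returns either a tool action or a final answer:

```python
def react_loop(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    """Minimal ReAct-style loop: the model alternates reasoning with tool calls
    until it decides it can answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask the model what to do next; it replies with either an action or an answer.
        step = llm.next_step(transcript)   # e.g. {"action": "sql", "input": "..."}
        if step.get("answer"):
            return step["answer"]

        observation = tools[step["action"]].run(step["input"])
        transcript += (
            f"Action: {step['action']}({step['input']})\n"
            f"Observation: {observation}\n"
        )
    return "Stopped: step budget exhausted."
```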

Nailing Down Precision

The real question is, how do we nail down precision? We learned to take the problem and, instead of asking a model to solve the big problem, break it down and make it super focused. How do we know that we did a good job, and how do we do it at a large scale? We need to evaluate and monitor all the time. We need some precision-specific metrics. There are some metrics that exist in the industry, but they're not a complete solution for the feedback loop. The feedback loop will always be an engineering challenge, and each one of us is going to solve it specifically for our own use case. There are state-of-the-art models out there, like T5-XXL from Google, the Text-to-Text Transfer Transformer, with published results on the FEVER dataset that we can compare ourselves against. This is what the industry calls comparing against state of the art, or top-of-industry results. Entity precision: exact match of key entities.

Sometimes, for example, in the market case, we give a bid and learn that a better bid won. Maybe there's some small percentage difference between our bid and the bid that actually won. There is a feedback loop, there is evaluation and monitoring, and this is how we're solving it. Hallucination rate: SelfCheckGPT is a good one, and LLM-as-a-judge is a good one. Generally, fine-tuning an expert of our own is going to be helpful for judging some of these capabilities in-house. Then there's the continuous feedback loop for the human in the loop. If our application enables that, and it's not always the case, but if it does, we can do a thumbs up, thumbs down, for example, and we're going to go over it. This is a great indicator for knowing whether the results give us anything in terms of monitoring.
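
A minimal sketch of the entity-precision idea; the `entity_extractor` is a placeholder for an NER model or an LLM prompted to list key entities:

```python
def extract_entities(text: str, entity_extractor) -> set:
    """The extractor is a placeholder; in practice this could be an NER model
    or an LLM prompted to list the key entities in the text."""
    return set(entity_extractor(text))

def entity_precision(generated: str, reference: str, entity_extractor) -> float:
    """Fraction of entities in the generated answer that exactly match
    entities in the reference (trusted) answer."""
    generated_entities = extract_entities(generated, entity_extractor)
    reference_entities = extract_entities(reference, entity_extractor)
    if not generated_entities:
        return 0.0
    return len(generated_entities & reference_entities) / len(generated_entities)
```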

Feedback loop and refinement. This is very simple. When I have an application, when I have a chatbot, sometimes at the end of the chat experience we'll show a survey: did you enjoy that experience? Did you get value out of it? Yes or no. This is later pulled into a data pipeline that accumulates all the yeses and nos, compares how many yeses there are to how many nos, and later brings it back to the algorithm in order to do that refinement, giving it the specific examples. Again, building that pipeline for the feedback loop is what's going to help us with the continuous refinement of the model.

Then we're just feeding it back into the model: here's where you got it right, here's where you got it wrong. It's a simple prompt that we give it, and it remembers it through the memory that it has. This is one form of feedback loop and refinement. Then there's memory and reflection. What exists in my memory are the things that I can pull back from my database, things that worked really well for me, what the algorithm can learn from them, how we can adjust based on past actions, how we can improve the query. It's, again, a simple data pipeline that revolves around bringing back the information that I got and saving it. The next thing is reinforcement learning. If I have something related to rewards or penalties from the user, thumbs up, thumbs down, or if I can learn from other interactions that the user has on my platform, I can leverage that through actions and bring it back as information. There is a huge world of reinforcement learning that exists today.
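
A minimal sketch of that thumbs-up/thumbs-down pipeline: aggregate survey events into a satisfaction rate plus labeled examples that can be fed back for refinement (the event fields are invented for illustration):

```python
from collections import Counter

def aggregate_feedback(events: list) -> dict:
    """Accumulate thumbs-up / thumbs-down survey events into simple stats
    plus labeled examples we can feed back into refinement."""
    counts = Counter(event["rating"] for event in events)
    labeled_examples = [
        {"prompt": e["prompt"], "response": e["response"], "label": e["rating"]}
        for e in events
    ]
    total = sum(counts.values()) or 1
    return {
        "thumbs_up": counts.get("up", 0),
        "thumbs_down": counts.get("down", 0),
        "satisfaction": counts.get("up", 0) / total,
        "examples_for_refinement": labeled_examples,
    }

events = [
    {"prompt": "flight status?", "response": "on time", "rating": "up"},
    {"prompt": "refund policy?", "response": "unclear", "rating": "down"},
]
print(aggregate_feedback(events)["satisfaction"])
```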

Lastly, I just want to summarize all the capabilities that we have in order to improve. We have the prompt engineering and RAG that we talked about. We talked about agents and workflow automation. We talked about the feedback loop and memory. We also talked about reinforcement learning and learning from demonstrations. All of that is in order to continuously improve the model. When I get to meet very successful customers in the AI and agentic AI space, one of the things they always point to as their IP is how much feedback they were able to collect from their systems. Because the base models themselves are fairly static: they give us some capabilities, but they're not always great. The more interactions we have with the models, and the more we are able to feed the right information back to them, the better these models become at the end of the day. This is a very critical piece for us.

Scaling

We talked a lot about different data pipelines and the ideas behind the models, but how do we make it scale? We know there's a lot of request-response traffic running between all those pieces. Actually, if you think about it carefully, it lands in the world of microservices. Oftentimes, you use these models in real time. These agents will be applications that run in real time, and that brings us to the world of microservices. It's a space we know how to solve at a high level, but how do we solve it for AI as well? In a world of microservices where we have all these tightly coupled agent dependencies, it doesn't scale by itself. The request-response calls and the RPCs that happen there are going to make us lose messages and sometimes be very late with the response.

Essentially, the guarantee of a message actually reaching its destination doesn't always hold, which slows down everything that we do. In the microservices space, we already solved this with something called an event broker. With Kafka, there's a Kafka broker for it. We can actually start making sense of this agent space in a very engineering-driven approach: let me have my producers publishing events, let me have my consumers on the other side, and let me have this event broker in the middle that helps me make sense of all my systems, so I can start scaling up and retaining that information. This is how we can start to scale all these events coming into the system without losing any of them, because we have event guarantees. With Apache Kafka, we can enforce exactly-once.

Event guarantees: exactly once, at most once, or at least once. With agentic systems, we often need exactly-once, because if we're sending a request for some agent to operate, we need to know that the agent actually operated on that request and gave us back a response. We want to be very specific with exactly-once here. On top of that event broker, we can start building some governance capabilities and some processing capabilities. That's going to help us scale this system even further. If I'm looking at the broker at the bottom, I can start building on top of these real-time events. There are open-source solutions today that help me build these data streaming capabilities. One of them is Kafka Streams, for example. If I need to do some specific manipulation on the data before sending it to the model, I can leverage Kafka Streams, for example.
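
A minimal sketch of a transactional producer with the confluent-kafka Python client, assuming a local broker and a topic named "agent-requests" (both placeholders); for end-to-end exactly-once, the consumers on the other side would also read with isolation.level set to read_committed:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",            # placeholder broker address
    "transactional.id": "agent-request-producer-1",   # enables exactly-once semantics
    "enable.idempotence": True,
})

producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce(
        topic="agent-requests",
        key=b"order-1234",
        value=b'{"agent": "sql-executor", "task": "fetch last quarter sales"}',
    )
    producer.commit_transaction()   # the message becomes visible atomically
except Exception:
    producer.abort_transaction()    # nothing is exposed to consumers on failure
    raise
```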

If I need to aggregate, if I need some windowing capabilities, I can send it to Apache Flink, for example. This is how I keep my agentic system live and operating in real time, while actually making more sense of the architecture and how we're structuring it for the business, so I can make conscious decisions and continue to scale it as we go. Specifically in the data streaming world, there's also Kafka Connect, which makes it very easy to connect to an existing database and bring information back. I can connect it to some CRM.

Some of the Kafka connectors are open source, some of them are closed source, but you can also build your own if you want to. Yes, it can enable a lot of prompt manipulation and string manipulation. I can play with that. I can filter out some of the irrelevant messages; maybe a user sends the same message twice, and I can start filtering that. I can check whether this user is actually allowed to look into a specific database. This is where governance really plays an important part, especially with RAG, because with RAG, essentially, I'm giving it access: go use the database. Maybe that user is not allowed to access this specific database, and we're tapping into security concerns here. We want to be able to govern these requests, make sure we have some cataloging around them, and validate user access control. We want to make sure we have the right quality of data in there as well. The data streaming architecture approach really helps us do exactly that.

Real-World Use Cases

I want to talk about some real use cases that we built. One of them is SDR. SDR stands for Sales Development Representative. If you sometimes get these annoying messages on LinkedIn or in email, probably cold outreach messages, there could be an SDR behind the scenes. What we built in-house at Confluent: our SDR team does some research and analyzes a potential customer, then does some ranking around it, and then some scoring around it. If they find someone that's relevant to speak with, they will find a way to speak with that specific individual. In-house we're leveraging agentic systems in order to produce the right scoring for that specific representative.

Essentially, we're not just spamming everyone; we're being very deliberate, very specific about the people that we speak with. The second thing that we're now building is marketing ops. There's a lot of newsletters, a lot of content that goes out, and we want to make sure it meets a specific quality bar that we want everyone in the company to achieve. Confluent is a very big company. It's very hard to speak with everyone and try to educate everyone on what a high quality bar means. It's very hard to do that feedback loop, what I call the pull request and peer review loop, with each and every one. We are now building an automated system based on agents, where we give the agents specific requirements for qualifying whether content is good enough to go out or not. Again, it's something in the works; we're experimenting with it, and we're very excited to see what will come out of it.

I said that I get to meet a lot of customers as well, and there are some very interesting use cases. One is a retail digital assistant: essentially, behind the scenes, they want to give a better experience to their users online. They built the system. They have the online store, they have the retail store, and they have Confluent Cloud in the middle, using Kafka and Kafka Connect right there. For the AI part, there's the digital assistant application that they built from multiple applications, using JavaScript, so we see Express.js, and a REST API to communicate with OpenAI, with the traffic going through Kafka. To bring back the information, they have a connector to MongoDB Atlas, which also goes through Confluent Cloud and the Kafka applications.

All the way on the other side, there's D-ID, which is for creating avatars, video, voice, whatever you want, because at the end of the day they want to create these videos in order to show their users how to use the platform or evangelize some of the products they have. This is just one example of how we're solving some of the challenges for our customers. Another example is a customer I met about two weeks ago in the cybersecurity space; they have some fraud detection capabilities that they need. They have really good precision: they're accurate 95% of the time, which, if you live in this space, is very good. The CTO really wanted to take that 95% to 100%.

Essentially, they said: we have some regex there, it works well, there are some if-else rules, but we want to take that 95% to 100%, because the 5% of customers who always complain is always going to be a huge bottleneck for the engineering team, and also for the customer experience and the company's reputation. One of the things we're working on with them is incorporating LLM-as-a-judge in order to understand what these 5% are and how we can improve the outcome for exactly the 5% that the traditional regex didn't catch. We're making it very practical. I believe that the more we narrow down to the exact problems we want to solve, and don't ask an LLM to build the world for us, the better the results we're able to get in the end.

Summary

Precision is going to be our competitive edge. This slide came out ok, I think, and I also used AI to create it. We talked about RAG. We talked about agentic RAG. We talked about how to improve retrieval. We talked about the patterns in the wild, some of the patterns that are emerging now. We talked about the feedback loop, how important it is, and how specific it is to the application that you're building. We touched a little bit on the data streaming world and how all the microservices translate into a data streaming platform that we can build in order to provide these agentic capabilities. I shared a couple of practical use cases, some of the things that we're building at Confluent and some of the things that I'm seeing with customers.

Some bloopers, because we played with AI and I got some weird things. I don't know if you can spot what's wrong in this, but it's like the woman's head somewhere and some weird things that happened in here. Here I covered some things that didn't go well, so I needed to edit it a little bit. Here it created some weird logo on this woman's T-shirt. I don't know if it's NASA or an eye or a spaceship. It was odd, so I just skipped it.

Questions and Answers

Nardon: You mentioned that one of the tools that is very much used in agents is text-to-SQL to retrieve data, but this adds a level of imprecision, because getting the SQL that retrieves the right data is not easy. Thinking as a software engineer, I wonder if we should minimize using text-to-SQL and try to access data in a more precise way. What's your take on this?

Adi Polak: We actually solved the text-to-SQL one for the SDR case, and the way we went about it is by templatizing. Part of the tools that we give to the SQL-executing agent is a template: here are the tables that you're allowed to access with that specific user; here's a rough template of the queries that you can ask; just fill in the blanks with what you think should go there. That really helps narrow down what the output from the LLM is and actually improves the precision.

My experience was that it's never just send a request, get a response, and it's going to be perfect; you actually have to guide it, to the level of templating the request, to get a good response. With the SDR example, we got really good responses because we know the tables and we have the governance information. If a new table is created, it's automatically exposed to the streaming catalog that we have. We know the columns. We know what the user is allowed to access. We roughly know the questions that an SDR would want to ask. We really narrow it down. It still gives the model the ability to use some of its brain, but laser-focused.
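
A minimal sketch of that templating idea (the table names, columns, and template are invented for illustration): the LLM only fills in slots, and everything is validated against an allow-list before any SQL runs.

```python
ALLOWED_TABLES = {"accounts": ["account_id", "industry", "region", "arr"]}

QUERY_TEMPLATE = (
    "SELECT {columns} FROM {table} "
    "WHERE {filter_column} = %(filter_value)s LIMIT 100"
)

def build_query(table: str, columns: list, filter_column: str) -> str:
    """Only the blanks are filled by the LLM; everything is validated against
    the allow-list before any SQL is executed."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"table not allowed: {table}")
    allowed = set(ALLOWED_TABLES[table])
    if not set(columns) <= allowed or filter_column not in allowed:
        raise ValueError("column not allowed")
    return QUERY_TEMPLATE.format(
        columns=", ".join(columns), table=table, filter_column=filter_column
    )

# Suppose the LLM proposed these slot values for "which accounts are in EMEA?"
sql = build_query("accounts", ["account_id", "arr"], "region")
print(sql)  # the filter value itself stays a bound parameter, never raw text
```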

 


 

Recorded at:

Nov 07, 2025
