Retrieval Augmented Generation Guide

Explore top LinkedIn content from expert professionals.

  • View profile for Aishwarya Srinivasan
    Aishwarya Srinivasan is an Influencer
    584,895 followers

    If you’re an AI engineer trying to understand and build with GenAI, RAG (Retrieval-Augmented Generation) is one of the most essential components to master. It’s the backbone of any LLM system that needs fresh, accurate, and context-aware outputs. Let’s break down how RAG works, step by step, from an engineering lens, not a hype one:

    🧠 How RAG Works (Under the Hood)

    1. Embed your knowledge base
    → Start with unstructured sources - docs, PDFs, internal wikis, etc.
    → Convert them into semantic vector representations using embedding models (e.g., OpenAI, Cohere, or HuggingFace models)
    → Output: N-dimensional vectors that preserve meaning across contexts

    2. Store in a vector database
    → Use a vector store like Pinecone, Weaviate, or FAISS
    → Index embeddings to enable fast similarity search (cosine, dot-product, etc.)

    3. Query comes in - embed that too
    → The user prompt is embedded using the same embedding model
    → Perform a top-k nearest-neighbor search to fetch the most relevant document chunks

    4. Context injection
    → Combine retrieved chunks with the user query
    → Format this into a structured prompt for the generation model (e.g., Mistral, Claude, Llama)

    5. Generate the final output
    → The LLM uses both the query and the retrieved context to generate a grounded, context-rich response
    → Minimizes hallucinations and improves factuality at inference time

    📚 What changes with RAG?
    Without RAG: 🧠 “I don’t have data on that.”
    With RAG: 🤖 “Based on [retrieved source], here’s what’s currently known…”
    Same model, drastically improved quality.

    🔍 Why this matters
    You need RAG when:
    → Your data changes daily (support tickets, news, policies)
    → You can’t afford hallucinations (legal, finance, compliance)
    → You want your LLMs to access your private knowledge base without retraining
    It’s the most flexible, production-grade approach to bridge static models with dynamic information.

    🛠️ Arvind and I are kicking off a hands-on workshop on RAG
    This first session is designed for beginner to intermediate practitioners who want to move beyond theory and actually build. Here’s what you’ll learn:
    → How RAG enhances LLMs with real-time, contextual data
    → Core concepts: vector DBs, indexing, reranking, fusion
    → Build a working RAG pipeline using LangChain + Pinecone
    → Explore no-code/low-code setups and real-world use cases
    If you're serious about building with LLMs, this is where you start.
    📅 Save your seat and join us live: https://lnkd.in/gS_B7_7d
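A minimal sketch of the five steps above, using Sentence-Transformers embeddings and an in-memory FAISS index as stand-ins for the OpenAI/Cohere and Pinecone/Weaviate options the post mentions; the final LLM call is left as a placeholder.

```python
# Minimal RAG pipeline sketch: embed -> index -> retrieve -> inject -> generate.
# Assumes sentence-transformers and faiss-cpu are installed; the generation step
# is a placeholder for whichever LLM you use (Mistral, Claude, Llama, ...).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Embed the knowledge base (here: a toy list of chunks).
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support tickets are answered within 24 hours on business days.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 2. Store in a vector index (inner product == cosine on normalized vectors).
index = faiss.IndexFlatIP(chunk_vecs.shape[1])
index.add(np.asarray(chunk_vecs, dtype="float32"))

# 3. Embed the query and run a top-k nearest-neighbor search.
query = "How long do customers have to return an item?"
query_vec = embedder.encode([query], normalize_embeddings=True)
_, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)
retrieved = [chunks[i] for i in ids[0]]

# 4. Context injection: build a structured prompt from query + retrieved chunks.
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(f"- {c}" for c in retrieved) + f"\n\nQuestion: {query}"
)

# 5. Generate: call your LLM of choice with the grounded prompt.
# answer = llm.generate(prompt)  # hypothetical client; swap in your provider's SDK
print(prompt)
```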

  • View profile for Brij kishore Pandey
    Brij kishore Pandey is an Influencer

    AI Architect | Strategist | Generative AI | Agentic AI

    680,515 followers

    RAG has come a long way — and it’s no longer just one architecture.

    Today, Retrieval-Augmented Generation (RAG) is a design space. There are many emerging patterns, but in this post, I’m focusing on the top 8 architectures you should know in 2025. Why? Because the way we retrieve context directly shapes how intelligent, useful, and safe our AI systems are.

    Here are 8 RAG variants changing how we build with LLMs:
    • Simple RAG with memory — add past interactions to make responses more grounded
    • Branched RAG — pull from APIs, databases, and knowledge graphs at once
    • Agentic RAG — where an agent decides what to retrieve and when
    • HyDE — generate hypothetical documents to guide more targeted lookups
    • Self-RAG — the system rephrases, self-grades, and reflects before generating
    • Adaptive RAG — chooses the best data source dynamically
    • Corrective RAG (CRAG) — filters out noisy context using thresholds
    • Simple RAG — still a valid choice when your data is clean and static

    I created this visual guide to help AI engineers, architects, and teams rethink what retrieval can (and should) do. Because retrieval is no longer just “fetch and feed.” It’s evolving into an intelligent layer that brings reasoning, feedback, adaptability, and control into GenAI systems. Have I overlooked anything? Please share your thoughts—your insights are priceless to me.
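As one concrete example from the list above, here is a minimal sketch of the HyDE pattern: an LLM drafts a hypothetical answer, and that draft, rather than the raw query, is embedded and used for the vector search. The `draft_hypothetical_answer` helper is a placeholder for an LLM call, and the index is assumed to expose a FAISS-style `search()`.

```python
# HyDE sketch: retrieve with the embedding of a hypothetical answer, not the raw query.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def draft_hypothetical_answer(question: str) -> str:
    # Placeholder: in practice, ask your LLM to write a short plausible answer,
    # e.g. "Write a paragraph that answers: {question}".
    return f"A plausible answer to '{question}' would explain the key facts involved."

def hyde_retrieve(question: str, index, chunks, k: int = 4):
    hypothetical = draft_hypothetical_answer(question)
    vec = embedder.encode([hypothetical], normalize_embeddings=True)
    _, ids = index.search(vec.astype("float32"), k)  # FAISS-style top-k search
    return [chunks[i] for i in ids[0]]
```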

  • View profile for Goku Mohandas

    ML Lead at Anyscale

    25,723 followers

    Excited to share our production guide for building RAG-based LLM applications, where we bridge the gap between OSS and closed-source LLMs.
    - 💻 Develop a retrieval augmented generation (RAG) LLM app from scratch.
    - 🚀 Scale the major workloads (load, chunk, embed, index, serve, etc.) across multiple workers.
    - ✅ Evaluate different configurations of our application to optimize for both per-component (ex. retrieval_score) and overall performance (quality_score).
    - 🔀 Implement an LLM hybrid routing approach to bridge the gap between OSS and closed-source LLMs.
    - 📦 Serve the application in a highly scalable and available manner.
    - 💥 Share the 1st order and 2nd order impacts LLM applications have had on our products and org.
    🔗 Links:
    - Blog post (45 min. read): https://lnkd.in/g34a9Zwp
    - GitHub repo: https://lnkd.in/g3zHFD5z
    - Interactive notebook: https://lnkd.in/g8ghFWm9
    Philipp Moritz and I had a blast developing and productionizing this with the Anyscale team and we're excited to share Part II soon (more details in the blog post).
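The hybrid-routing idea mentioned above can be sketched as a small router that sends queries to an OSS model when a quality predictor is confident and falls back to a closed-source model otherwise. The `predict_quality` heuristic, the threshold, and the model callables are illustrative assumptions, not the guide's actual implementation.

```python
# Sketch of LLM hybrid routing: prefer the cheaper OSS model, fall back to a
# closed-source model when a learned quality predictor is not confident.

def predict_quality(query: str) -> float:
    """Estimated probability that the OSS model answers this query well."""
    # Placeholder heuristic; in practice this would be a classifier trained on eval data.
    return 0.9 if len(query.split()) < 30 else 0.5

def route(query: str, oss_llm, closed_llm, threshold: float = 0.7) -> str:
    if predict_quality(query) >= threshold:
        return oss_llm(query)      # e.g. a hosted Llama/Mistral endpoint
    return closed_llm(query)       # e.g. a closed-source API as fallback
```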

  • View profile for Damien Benveniste, PhD
    Damien Benveniste, PhD is an Influencer

    Founder @ TheAiEdge | Follow me to learn about Machine Learning Engineering, Machine Learning System Design, MLOps, and the latest techniques and news about the field.

    172,426 followers

    Most people do not look beyond the basic RAG pipeline, and it rarely works out as expected! RAG is known to lack robustness due to the LLM's weaknesses, but that doesn't mean we cannot build robust pipelines! Here is how we can improve them.

    The RAG pipeline, in its simplest form, is composed of a retriever and a generator. The user question is used to retrieve data from the database that can serve as context to answer the question better. The retrieved data is used as context in a prompt for an LLM to answer the question. Instead of using the original user question as the query to the database, it is typical to rewrite the question for optimized retrieval.

    Instead of blindly returning the answer to the user, we should assess the generated answer. That is the idea behind Self-RAG. We can check for hallucinations and for relevance to the question. If the model hallucinates, we retry the generation; if the answer doesn't address the question, we restart the retrieval by rewriting the query. If the answer passes validation, we return it to the user. It can also help to feed this validation feedback into the next retrieval and generation attempt so they are performed in a more educated manner. If we exceed a maximum number of iterations, we assume we have reached a state where the model should apologize for not being able to answer the question.

    When we retrieve documents, we are likely to retrieve irrelevant ones, so it is a good idea to filter for the relevant documents before providing them to the generator. Once the documents are filtered, much of the information they contain may still be irrelevant, so it also helps to extract only what could be useful for answering the question. This way, the generator only sees relevant information.

    The assumption in typical RAG is that the question will be about the data stored in the database, but this is a very rigid assumption. We can use the idea behind Adaptive-RAG, where we assess the question first and route it to a datastore RAG, a web search, or a simple LLM. It is also possible that none of the retrieved documents are actually relevant to the question, in which case we should reroute the question to web search. That is part of the idea behind Corrective RAG. If we reach the maximum number of web-search retries, we can give up and apologize to the user.

    Here is how I implemented this pipeline with LangGraph: https://lnkd.in/g8AAF7Fw
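A minimal sketch of the validate-and-retry loop described above. The `retrieve`, `generate`, `grade_hallucination`, `grade_relevance`, and `rewrite_query` callables are hypothetical placeholders for LLM-backed graders; this is an illustration of the control flow, not the author's LangGraph implementation.

```python
# Self-RAG-style loop: generate, grade, then either retry, rewrite the query
# and re-retrieve, or give up after max_iters attempts.

def answer_with_validation(question, retrieve, generate, grade_hallucination,
                           grade_relevance, rewrite_query, max_iters=3):
    query = question
    for _ in range(max_iters):
        docs = retrieve(query)
        answer = generate(question, docs)
        if grade_hallucination(answer, docs):
            continue                                   # hallucinated: try generation again
        if not grade_relevance(answer, question):
            query = rewrite_query(question, answer)    # off-topic: restart retrieval
            continue
        return answer                                  # passed both checks
    return "Sorry, I couldn't find a reliable answer to your question."
```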

  • View profile for Shubham Saboo

    AI Product Manager @ Google | Open Source Awesome LLM Apps Repo (#1 GitHub with 70k+ stars) | 3x AI Author | LinkedIn Top Voice | Views are my Own

    59,261 followers

    RAG isn't working as well as you think 🤯 Here's what most people miss:

    Traditional RAG just matches your query with documents. Like using Ctrl+F with extra steps. But what if your AI model could actually understand the whole context? That's where MemoRAG comes in. Instead of just searching for keywords, it builds a complete understanding of your data.

    Think of it like this: a regular RAG system is like a student cramming before an exam, looking up specific answers when needed. MemoRAG is like a student who actually understood the material, making connections and seeing the bigger picture.

    Here's what makes it different:
    1. Memory-First Approach
    ↳ It doesn't just search for answers
    ↳ It builds understanding, very similar to how a human would
    2. Smart Retrieval
    ↳ Generates specific clues from memory
    ↳ Finds evidence other systems might miss
    3. Speed Optimization
    ↳ 30x faster context pre-filling
    ↳ Reuses encoded contexts across queries

    But here's the practical part: you can run it on a single T4 GPU. The lightweight version makes it accessible for smaller teams.

    Working with RAG systems daily, I've seen their limitations:
    → Missing context
    → Shallow understanding
    → Repetitive processing

    MemoRAG solves these issues by thinking more like we do. It remembers. It connects. It understands. And the best part? It's 100% open source. Want to try it yourself? Link to the GitHub repo in the comments.

    P.S. I create AI tutorials and open-source them for free. Your 👍 like and ♻️ repost helps keep me going. Don't forget to follow me Shubham Saboo for daily tips and tutorials on LLMs, RAG and AI Agents.
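To make the "clues from memory" idea concrete, here is a small sketch of clue-guided retrieval: a model that has seen the whole corpus drafts clue sentences about the likely answer, and those clues drive the vector search instead of the raw question. This illustrates the pattern only; it does not use MemoRAG's actual API (see the linked repo for that), and `generate_clues`, `embed`, and `index` are placeholders.

```python
# Sketch of clue-guided retrieval in the spirit of MemoRAG: a memory model drafts
# clues about the likely answer, and the clues drive the vector search.

def generate_clues(question: str, memory_model) -> list[str]:
    # In practice: ask a model with global knowledge of the corpus to list
    # short clue sentences describing where/what the answer should be.
    return memory_model(f"List short clues for answering: {question}")

def clue_guided_retrieve(question, memory_model, embed, index, chunks, k=3):
    clues = generate_clues(question, memory_model)
    hits = []
    for clue in clues:
        _, ids = index.search(embed([clue]).astype("float32"), k)  # FAISS-style search
        hits.extend(chunks[i] for i in ids[0])
    return list(dict.fromkeys(hits))   # deduplicate while preserving order
```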

  • View profile for Ravit Jain
    Ravit Jain is an Influencer

    Founder & Host of "The Ravit Show" | Influencer & Creator | LinkedIn Top Voice | Startups Advisor | Gartner Ambassador | Data & AI Community Builder | Influencer Marketing B2B | Marketing & Media | (Mumbai/San Francisco)

    165,159 followers

    RAG just got smarter.

    If you’ve been working with Retrieval-Augmented Generation (RAG), you probably know the basic setup: an LLM retrieves documents based on a query and uses them to generate better, grounded responses. But as use cases get more complex, we need more advanced retrieval strategies—and that’s where these four techniques come in:

    Self-Query Retriever
    Instead of relying on static prompts, the model creates its own structured query based on metadata. Let’s say a user asks: “What are the reviews with a score greater than 7 that say bad things about the movie?” This technique breaks that down into query + filter logic, letting the model interact directly with structured data (like Chroma DB) using the right filters.

    Parent Document Retriever
    Here, retrieval happens in two stages:
    1. Identify the most relevant chunks
    2. Pull in their parent documents for full context
    This ensures you don’t lose meaning just because information was split across small segments.

    Contextual Compression Retriever (Reranker)
    Sometimes the top retrieved documents are… close, but not quite right. This reranker pulls the top K (say 4) documents, then uses a transformer + reranker (like Cohere) to compress and re-rank the results based on both query and context—keeping only the most relevant bits.

    Multi-Vector Retrieval Architecture
    Instead of matching a single vector per document, this method breaks both queries and documents into multiple token-level vectors using models like ColBERT. Retrieval happens across all vectors—giving you higher recall and more precise results for dense, knowledge-rich tasks.

    These aren’t just fancy tricks. They solve real-world problems like:
    • “My agent’s answer missed part of the doc.”
    • “Why is the model returning irrelevant data?”
    • “How can I ground this LLM more effectively in enterprise knowledge?”

    As RAG continues to scale, these kinds of techniques are becoming foundational. So if you’re building search-heavy or knowledge-aware AI systems, it’s time to level up beyond basic retrieval. Which of these approaches are you most excited to experiment with? #ai #agents #rag #theravitshow
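As a concrete example of the reranking step, here is a minimal sketch using a cross-encoder from sentence-transformers as the reranker (a stand-in for the Cohere reranker mentioned above): over-fetch candidates with the vector index, score each (query, document) pair, and keep only the best few. The `vector_search` callable is an assumed placeholder for your retriever.

```python
# Rerank sketch: over-fetch with the vector index, then re-score each
# (query, document) pair with a cross-encoder and keep the top results.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, vector_search, top_k: int = 4, fetch_k: int = 20):
    candidates = vector_search(query, fetch_k)             # e.g. top-20 chunks from FAISS/Chroma
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]              # keep only the most relevant bits
```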

  • View profile for Matt Wood
    Matt Wood is an Influencer

    CTIO, PwC

    74,578 followers

    AI field note: introducing Toolshed from PwC, a novel approach to scaling tool use with AI agents (and winner of best paper/poster at ICAART).

    LLMs are limited in the number of external tools agents can use at once, usually to about 128, which sounds like a lot but quickly becomes a limitation in a real-world enterprise. This creates a major bottleneck for real-world applications like database operations or collaborative AI systems that need access to hundreds or thousands of specialized functions.

    Enter Toolshed, a novel approach from PwC that reimagines tool retrieval and usage, enabling AI systems to effectively utilize thousands of tools without fine-tuning or retraining. Toolshed introduces two primary technical components that work together to enable scalable tool use beyond the typical 128-tool limit:

    📚 Toolshed Knowledge Bases: vector databases optimized for tool retrieval that store enhanced representations of each tool, including the tool name and description, the argument schema with parameter details, synthetically generated hypothetical questions, key topics and intents the tool addresses, and tool-specific metadata for execution.

    🧲 Advanced RAG-Tool Fusion: a comprehensive three-phase approach that creatively applies retrieval-augmented generation techniques to the tool selection problem, enhancing tool documents with rich metadata and contextual information, decomposing queries into independent sub-tasks, and reranking to ensure optimal tool selection.

    The paper demonstrates significant quantitative improvements over existing methods through rigorous benchmarking and systematic testing:
    ⚡️ 46-56% improvement in retrieval accuracy (on ToolE and Seal-Tools benchmarks vs. standard methods like BM25).
    ✨ Optimized top-k selection threshold to systematically balance retrieval accuracy with agent performance and token costs.
    💫 Scalability testing: proven effective when scaling to 4,000 tools.
    🎁 Zero fine-tuning required: works with out-of-the-box embeddings and LLMs.

    Not too shabby. Toolshed addresses challenges in enterprise AI deployment, offering practical solutions for complex production environments such as cross-domain versatility (we successfully tested across finance, healthcare, and database domains), secure database interactions, multi-agent orchestration, and cost optimization.

    Congratulations to Elias Lumer, Vamse Kumar Subbiah, and team for winning the best poster award at the International Conference on Agents and AI! For any organization building production AI systems, Toolshed offers a practical path to more capable, reliable tool usage at scale. Really impressive and encouraging work. Link in description.
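The core mechanic, indexing enriched tool descriptions in a vector store and exposing only the top-k relevant tools per query instead of all of them, can be sketched as follows. The enriched document format, tool entries, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of RAG-style tool retrieval: embed enriched tool documents once, then
# retrieve only the top-k relevant tools per query instead of sending all of them.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

tools = [
    {"name": "get_invoice",
     "description": "Fetch an invoice by ID from the billing database.",
     "hypothetical_questions": ["Show me invoice 1234", "What was customer X billed?"]},
    {"name": "book_meeting",
     "description": "Schedule a meeting on a user's calendar.",
     "hypothetical_questions": ["Set up a call with finance tomorrow at 3pm"]},
]

# Build one enriched document per tool (name + description + synthetic questions).
docs = [
    f"{t['name']}: {t['description']} Example questions: " + " | ".join(t["hypothetical_questions"])
    for t in tools
]
vecs = embedder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

def select_tools(query: str, k: int = 1):
    qv = embedder.encode([query], normalize_embeddings=True).astype("float32")
    _, ids = index.search(qv, k)
    return [tools[i] for i in ids[0]]   # only these tools are exposed to the agent

print(select_tools("Can you pull up the invoice for order 998?"))
```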

  • View profile for Santiago Valdarrama

    Computer scientist and writer. I teach hard-core Machine Learning at ml.school.

    119,496 followers

    This makes your RAG application 10x better.

    Most people I know split their documents and generate embeddings for those chunks. But generating good chunks is hard. There's no perfect solution, but there's a simple trick to make those chunks much better: augment each chunk with additional metadata.

    For example, say you're chunking research papers. Each chunk might be just a paragraph, but that paragraph by itself is often too vague. Instead of using the paragraph alone, I add the following information to each chunk:
    • The paper title
    • The page number
    • The section heading where the paragraph is
    • Any relevant keywords or tags in that paragraph
    • A one-sentence summary of the paragraph

    This extra context makes the embedding richer and way more useful at retrieval time. You can either infer this additional metadata or use an LLM to generate it.

    This is an extra step. Don't worry about it if you are just starting with your RAG implementation, but as soon as you have a working solution, spend the time building this. You'll never go back.
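A small sketch of what such an augmented chunk might look like before embedding; the field names and the `summarize` helper are illustrative, not a prescribed format.

```python
# Sketch: prepend chunk-level metadata to the paragraph text before embedding,
# so the vector carries document-level context as well as the paragraph itself.

def build_augmented_chunk(paragraph: str, title: str, page: int, section: str,
                          keywords: list[str], summarize) -> str:
    summary = summarize(paragraph)   # e.g. a one-sentence LLM-generated summary
    return (
        f"Title: {title}\n"
        f"Page: {page}\n"
        f"Section: {section}\n"
        f"Keywords: {', '.join(keywords)}\n"
        f"Summary: {summary}\n"
        f"Text: {paragraph}"
    )

# The returned string is what gets embedded and indexed; the raw paragraph can
# still be stored separately as the payload shown to the LLM at generation time.
```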

  • View profile for Greg Coquillo
    Greg Coquillo is an Influencer

    Product Leader @AWS | Startup Investor | 2X Linkedin Top Voice for AI, Data Science, Tech, and Innovation | Quantum Computing & Web 3.0 | I build software that scales AI/ML Network infrastructure

    212,984 followers

    Context-aware agents require deliberate architecture that combines retrieval-augmented generation, session memory, and adaptive reasoning.

    This 10-step framework begins with defining the agent’s domain, use cases, and output structure, followed by ingestion and chunking of trustworthy data aligned to safety and alignment principles. Embeddings are then generated using models like OpenAI or Cohere and stored in vector databases such as FAISS or Pinecone for efficient semantic retrieval. Retrieval logic leverages k-NN search to fetch relevant chunks based on similarity and metadata filters.

    Prompts are engineered dynamically using retrieved context, optionally enriched with few-shot examples, and sent to LLMs like GPT-4 or Claude with configurable parameters. Session memory can be integrated to track interaction history and enhance continuity. Continuous evaluation identifies hallucinations, prompt failures, and edge cases for iterative refinement. Deployment involves wrapping the agent in an API or interface with monitoring hooks, and expansion includes tool use, personalization, and self-corrective mechanisms.

    If you follow this framework, you’ll be building the pipeline that forms the backbone of production-grade AI agents that reason with context and respond with precision. Go build! #genai #aiagent #artificialintelligence
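A minimal sketch of the dynamic prompt-assembly step in this framework, combining retrieved chunks, optional few-shot examples, and session memory into a single prompt; the `retrieve` and `call_llm` callables and all names are illustrative placeholders.

```python
# Sketch: assemble a prompt from retrieved context, few-shot examples, and
# session memory, then call the LLM. retrieve() and call_llm() are placeholders.

def build_prompt(question, retrieved_chunks, history, few_shot_examples=()):
    parts = ["You are a domain assistant. Answer using the context provided."]
    if few_shot_examples:
        parts.append("Examples:\n" + "\n\n".join(few_shot_examples))
    parts.append("Context:\n" + "\n".join(f"- {c}" for c in retrieved_chunks))
    if history:
        parts.append("Conversation so far:\n" + "\n".join(history))
    parts.append(f"User: {question}\nAssistant:")
    return "\n\n".join(parts)

def run_turn(question, retrieve, call_llm, history):
    chunks = retrieve(question)                         # k-NN search + metadata filters
    answer = call_llm(build_prompt(question, chunks, history))
    history.extend([f"User: {question}", f"Assistant: {answer}"])   # session memory
    return answer
```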

  • View profile for Zain Hasan

    AI builder & teacher | AI/ML @ Together AI | ℕΨ Engineering @ UofT | Lecturer | ex-Vector DBs, Data Scientist, Health Tech Founder

    15,273 followers

    Fine-tuned larger language models and longer context lengths eliminate the need for retrieval from external knowledge/vector databases, right? ... Not quite!!

    NVIDIA asked the same question last month! They published a new paper (https://lnkd.in/gfn3Jubc) examining how very large finetuned LLMs with longer context lengths compare to shorter-context, RAG-supported LLMs.

    They explore two main questions:
    1. Retrieval-augmentation versus long context window: which one is better for downstream tasks?
    2. Can both methods be combined to get the best of both worlds?

    In short, they found:
    1. RAG outperforms long context alone
    2. Yes, they perform better together. RAG works better with longer context than with shorter context.

    The main finding presented in the paper was that "retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes".

    Some more details:
    1. RAG is more important than context windows: an LLM with a 4K context window using simple retrieval-augmentation at generation can achieve performance comparable to a finetuned LLM with a 16K context window
    2. RAG is also faster: augmenting generation with retrieval not only performs better, it requires significantly less computation and is much faster at generation
    3. RAG works even better as parameter count increases, because smaller 6-7B LLMs have relatively worse zero-shot capability to incorporate the retrieved chunked context: perhaps counterintuitively, the benefits of RAG on performance are more pronounced the larger the language model gets; experiments were done for LLMs with 43B and 70B params
    4. RAG works even better as context length increases: retrieval-augmented long-context LLMs (e.g., 16K and 32K) can obtain better results than a retrieval-augmented 4K-context LLM, even when fed the same top-5 chunks of evidence
    5. Retrieval-augmented LLaMA2-70B with a 32K context window outperforms GPT-3.5-turbo-16k, Davinci003, and the non-retrieval LLaMA2-70B-32k baseline for question answering and query-based summarization.
