Intro to Retrieval-Augmented Generation (RAG) and Application Demos
by Henry Heng LUO
Content
• RAG Summary
• Hands-on practices
1. PRACTICE of Basic RAG pipeline
2. PRACTICE of Sentence-window retrieval pipeline
3. PRACTICE of Auto-merging retrieval pipeline
RAG Summary
• Large Language Models have intrinsic flaws.
• They can produce misleading "hallucinations"
• They rely on potentially outdated information
• They are inefficient when dealing with specific knowledge
• They lack depth in specialized fields
• They fall short in reasoning abilities
• They lack controllability
• They cannot trace the knowledge source
• They cannot protect data privacy
• They are costly to train
• Retrieval-Augmented Generation (RAG) significantly improves the precision and pertinence of
generated content by first retrieving relevant information from an external database of documents
before the language model generates its answer.
RAG Summary
• Basic RAG
• The classic basic RAG process, also known as Naive RAG, mainly
includes three basic steps:
1. Indexing - Splitting the document corpus into shorter chunks and building a
vector index through an encoder.
2. Retrieval - Retrieving relevant document fragments based on the similarity
between the question and the chunks.
3. Generation - Generating an answer to the question conditioned on the
retrieved context.
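As a rough illustration, the three steps can be condensed into a few lines of Python. This is a minimal sketch only: `embed` stands in for any sentence encoder and `llm` for any chat-completion call; neither is specified here.

```python
import numpy as np

def basic_rag(corpus: str, question: str, embed, llm, chunk_size=200, top_k=3):
    # 1. Indexing: split the corpus into short chunks and encode each one.
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]
    index = np.array([embed(c) for c in chunks])  # one vector per chunk

    # 2. Retrieval: rank chunks by cosine similarity to the question.
    q = np.asarray(embed(question))
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    context = [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]

    # 3. Generation: answer the question conditioned on the retrieved context.
    prompt = "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {question}"
    return llm(prompt)
```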
RAG Summary
• Advanced RAG
• The Advanced RAG paradigm involves additional processing in Pre-Retrieval and Post-Retrieval.
1. Before retrieval, methods such as query rewriting, routing, and
expansion can be used to align the semantic differences between questions
and document chunks.
2. After retrieval, reranking the retrieved documents can avoid the "Lost in
the Middle" phenomenon, or the context can be filtered and compressed to
shorten the window length.
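As one concrete example of a post-retrieval step, a cross-encoder can rescore retrieved chunks against the query before they enter the prompt. A minimal sketch with sentence-transformers; the checkpoint name is just a commonly used reranker, not one prescribed here:

```python
from sentence_transformers import CrossEncoder

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    # A cross-encoder scores each (query, chunk) pair jointly, which is
    # usually more accurate than the bi-encoder similarity used for recall.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, c) for c in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:top_n]]
```

Placing the highest-scoring chunks first also mitigates "Lost in the Middle", where evidence buried mid-prompt tends to be ignored.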
RAG Summary
• Modular RAG
• Structurally, it is freer and more flexible, introducing more specific
functional modules, such as query search engines and the fusion of
multiple answers.
• Technologically, it integrates retrieval with fine-tuning, reinforcement
learning, and other techniques.
• In terms of process, the RAG modules are designed and orchestrated,
resulting in various RAG patterns.
RAG Summary
• To build a good RAG system, three critical questions need to be
considered:
• What to retrieve?
• When to retrieve?
• How to use the retrieved content?
RAG Summary
• Augmentation Sources. These include unstructured data such as text
paragraphs, phrases, or individual words. Structured data can also be
used, such as indexed documents, triple data, or subgraphs; retrieval can
even draw on content generated by LLMs themselves.
• Augmentation Stages. Augmentation can be performed during the
pre-training, fine-tuning, and inference stages.
• Augmentation Process. Retrieval was initially a one-off step, but
iterative retrieval, recursive retrieval, and adaptive retrieval
methods, where LLMs decide the timing of retrieval on their own, have
gradually emerged as RAG has developed; a minimal loop is sketched below.
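Here `retrieve` and `llm` are hypothetical helpers, and the stopping protocol (the model emits "SEARCH: ..." when it wants more evidence) is an assumption for illustration:

```python
def iterative_rag(question: str, retrieve, llm, max_rounds: int = 3) -> str:
    # The LLM itself decides whether to answer or to retrieve again.
    context, query = [], question
    for _ in range(max_rounds):
        context += retrieve(query)  # hypothetical: returns a list of passages
        prompt = ("Context:\n" + "\n".join(context) +
                  f"\n\nQuestion: {question}\n"
                  "If the context is sufficient, answer directly; otherwise "
                  "reply with 'SEARCH: <follow-up query>'.")
        reply = llm(prompt)
        if not reply.startswith("SEARCH:"):
            return reply
        query = reply[len("SEARCH:"):].strip()
    # Out of rounds: answer with whatever evidence was gathered.
    return llm("Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}")
```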
RAG Summary
• RAG is like giving the model a textbook for customized information
retrieval, which is very suitable for specific queries.
• Fine-tuning is like a student internalizing knowledge over time, better
suited for mimicking specific structures, styles, or formats.
• Depending on their reliance on external knowledge and requirements
for model adjustment, they each have suitable scenarios.
• Using RAG, fine-tuning, and prompt engineering together may yield the
best results.
RAG Summary
• The evaluation methods for RAG are diverse, mainly including three
quality scores: context relevance, answer fidelity, and answer
relevance.
• The evaluation involves four key capabilities: noise robustness, refusal
ability, information integration, and counterfactual robustness.
• In terms of evaluation frameworks, there are benchmarks such as
RGB and RECALL, as well as automated evaluation tools like RAGAS,
ARES, and TruLens, which help to comprehensively measure the
performance of RAG models.
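For instance, the ragas library computes scores along these same quality dimensions. A minimal sketch; metric names and the expected dataset columns vary across ragas versions, so treat the details as assumptions:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One toy evaluation record: question, retrieved contexts, generated answer.
data = Dataset.from_dict({
    "question": ["What is RAG?"],
    "contexts": [["RAG retrieves relevant documents before generation."]],
    "answer": ["RAG retrieves documents and conditions the answer on them."],
    "ground_truth": ["RAG augments LLMs with retrieved external documents."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # per-metric scores in [0, 1]
```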
RAG Summary
• To address the current challenges faced by RAG:
• Context length. What to do when the retrieved content is too long and exceeds the window limit? If the
context window of LLMs is no longer limited, how should RAG be improved?
• Robustness. How to deal with incorrect content retrieved? How to filter and validate the retrieved content?
How to enhance the model's resistance to poisoning and noise?
• Coordination with fine-tuning. How to leverage the effects of both RAG and FT simultaneously? How should
they be coordinated and organized: in series, in alternation, or end-to-end?
• Scaling Laws. Does the RAG model satisfy the Scaling Law? Under what scenarios might RAG experience
the phenomenon of an Inverse Scaling Law?
• The role of LLMs. LLMs can be used for retrieval (replacing search with LLMs' generation, or searching LLMs'
memory), for generation, and for evaluation. How to further explore the potential of LLMs in RAG?
• Production-ready. How to reduce the retrieval latency of ultra-large-scale corpora? How to ensure that the
retrieved content is not leaked by LLMs?
• Multimodal Expansion. How can the evolving technologies and concepts of RAG be extended to other
modalities of data such as images, audio, video, or code?
RAG Summary
• RAG can be applied to question-answering systems and beyond, such
as recommendation systems, information extraction, and report
generation.
• The RAG technology stack is booming. In addition to well-known tools
like Langchain and LlamaIndex, the market is seeing the emergence of
more targeted RAG tools, such as customized and simplified tools.
RAG Summary
"Retrieval-Augmented Generation for Large Language Models: A Survey"
PRACTICE of Basic RAG pipeline
• We want to infuse existing database information into the LLM.
• Each query is first used to retrieve related context information from
the existing database (a vector database can be used here); the context
information is then wrapped into the prompt and sent to the LLM.
• Split the documents into small chunks.
• Search for the semantically matching small chunks.
• Return the top-k small chunks.
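A minimal sketch of this basic pipeline with LlamaIndex; the import paths follow recent llama-index releases, and the "./data" directory and query are hypothetical:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Indexing: load the documents, split them into chunks, and embed the chunks.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieval + generation: fetch the top-k matching chunks, wrap them into the
# prompt, and let the LLM synthesize the answer.
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What are the key findings of the report?"))
```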
PRACTICE of Basic RAG pipeline
[Figure: the same text chunks are used for both embeddings and synthesis]
PRACTICE of Sentence-window retrieval pipeline
• This is suitable when plenty of context information is needed, rather
than only a small chunk.
• Split the documents at the sentence level.
• Search for the semantically matching sentence.
• Retrieve the matched sentence together with a window of the preceding
and following sentences to form the context chunk.
• Rerank the context chunks.
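A minimal sketch with LlamaIndex's sentence-window components; import paths, the reranker checkpoint, and the data directory are assumptions based on recent llama-index releases:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import (
    MetadataReplacementPostProcessor,
    SentenceTransformerRerank,
)

# Split at sentence level; each node keeps a +/-3 sentence window in metadata.
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3, window_metadata_key="window"
)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex(parser.get_nodes_from_documents(documents))

query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[
        # Swap each matched sentence for its surrounding window of sentences...
        MetadataReplacementPostProcessor(target_metadata_key="window"),
        # ...then rerank the expanded context chunks with a cross-encoder.
        SentenceTransformerRerank(top_n=2, model="BAAI/bge-reranker-base"),
    ],
)
print(query_engine.query("What are the concerns surrounding the AMOC?"))
```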
PRACTICE of Sentence-window retrieval pipeline
[Figure: example query "What are the concerns surrounding the AMOC?"]
PRACTICE of Auto-merging retrieval pipeline
• Small chunks are good for precise matching, but we also need plenty
of context information.
• Define a hierarchy of smaller chunks linked to parent chunks.
• If the set of smaller chunks linking to a parent chunk exceeds some
threshold, "merge" the smaller chunks into the bigger parent chunk.
• Rerank the final parent chunks.
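A minimal sketch with LlamaIndex's hierarchical parser and auto-merging retriever; chunk sizes, paths, and import locations are assumptions based on recent llama-index releases:

```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import AutoMergingRetriever

# Build a chunk hierarchy: 2048-token parents, 512 mid-level, 128-token leaves.
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
documents = SimpleDirectoryReader("./data").load_data()
nodes = parser.get_nodes_from_documents(documents)

# Store every level so retrieved leaves can be merged back into their parents;
# only the leaf chunks are embedded, keeping the matching precise.
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(get_leaf_nodes(nodes), storage_context=storage_context)

# When enough sibling leaves are retrieved, they "merge" into the parent chunk.
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=12), storage_context, verbose=True
)
print(RetrieverQueryEngine.from_args(retriever).query("What does the report conclude?"))
```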
PRACTICE of Auto-merging retrieval pipeline
[Figure: auto-merging returned chunk]