If you’re an AI engineer, understanding how LLMs are trained and aligned is essential for building high-performance, reliable AI systems. Most large language models follow a 3-step training procedure:

Step 1: Pretraining
→ Goal: learn general-purpose language representations.
→ Method: self-supervised learning on massive unlabeled text corpora (e.g., next-token prediction).
→ Output: a pretrained LLM, rich in linguistic and factual knowledge but not grounded in human preferences.
→ Cost: extremely high (billions of tokens, trillions of FLOPs).
→ Pretraining is still centralized within a few labs due to the scale required (e.g., Meta, Google DeepMind, OpenAI), but open-weight models like LLaMA 4, DeepSeek V3, and Qwen 3 are making it more accessible.

Step 2: Finetuning (Two Common Approaches)
→ 2a: Full-Parameter Finetuning
- Updates all weights of the pretrained model.
- Requires significant GPU memory and compute.
- Best when the model needs deep adaptation to a new domain or task.
- Used for: instruction-following, multilingual adaptation, industry-specific models.
- Cons: expensive, storage-heavy.
→ 2b: Parameter-Efficient Finetuning (PEFT)
- Only a small set of added parameters is updated (e.g., via LoRA, Adapters, or IA³); the base model stays frozen.
- Much cheaper, ideal for rapid iteration and deployment.
- Multi-LoRA architectures (e.g., used in Fireworks AI, Hugging Face PEFT) allow hosting multiple finetuned adapters on the same base model, drastically reducing cost and latency for serving (see the LoRA sketch after this post).

Step 3: Alignment (Usually via RLHF)
Pretrained and task-tuned models can still produce unsafe or incoherent outputs. Alignment ensures they follow human intent. Alignment via RLHF (Reinforcement Learning from Human Feedback) involves:
→ Step 1: Supervised Fine-Tuning (SFT)
- Human labelers craft ideal responses to prompts.
- The model is fine-tuned on this dataset to mimic helpful behavior.
- Limitation: costly and not scalable on its own.
→ Step 2: Reward Modeling (RM)
- Humans rank multiple model outputs per prompt.
- A reward model is trained to predict human preferences.
- This provides a scalable, learnable signal of what “good” looks like.
→ Step 3: Reinforcement Learning (e.g., PPO, DPO)
- The LLM is trained using the reward model’s feedback.
- Algorithms like Proximal Policy Optimization (PPO) or the newer Direct Preference Optimization (DPO) iteratively improve model behavior.
- DPO is gaining popularity over PPO because it is simpler and more stable: it optimizes directly on preference pairs, without a separate reward model or sampled trajectories.

Key Takeaways:
→ Pretraining = general knowledge (expensive)
→ Finetuning = domain or task adaptation (customize cheaply via PEFT)
→ Alignment = make it safe, helpful, and human-aligned (still labor-intensive but improving)

Save the visual reference, and follow me (Aishwarya Srinivasan) for more no-fluff AI insights ❤️
PS: Visual inspiration: Sebastian Raschka, PhD
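To make step 2b concrete, here is a minimal sketch of LoRA-style parameter-efficient finetuning with the Hugging Face peft library. The model name, rank, and target modules are illustrative assumptions, not a prescribed recipe.

```python
# Minimal LoRA (PEFT) sketch; model name, rank, and target modules are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B"                 # any small causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Freeze the base model and inject small, trainable low-rank adapters.
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections are a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()         # typically well under 1% of all weights

# Train with your usual loop or transformers.Trainer on an instruction dataset, then:
model.save_pretrained("my-lora-adapter")   # stores only the adapter weights (a few MB)
```

In a multi-LoRA serving setup, several such adapters can be loaded onto one frozen base model and switched per request, which is where the cost and latency savings come from.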
Training Evaluation Models
-
LLM applications are frustratingly difficult to test due to their probabilistic nature. However, testing is crucial for customer-facing applications to ensure the reliability of generated answers. So, how does one effectively test an LLM app? Enter Confident AI's DeepEval: a comprehensive open-source LLM evaluation framework with an excellent developer experience.

Key features of DeepEval:
- Ease of use: very similar to writing unit tests with pytest.
- Comprehensive suite of metrics: 14+ research-backed metrics for relevancy, hallucination, etc., including label-less standard metrics that can quantify your bot's performance even without labeled ground truth; all you need is the bot's input and output. See the list of metrics and required data in the image below!
- Custom metrics: tailor your evaluation process by defining the custom metrics your business requires.
- Synthetic data generator: create an evaluation dataset synthetically to bootstrap your tests.

My recommendations for LLM evaluation (a minimal pytest-style sketch follows this post):
- Metric model: use OpenAI GPT-4 as the judge model for the metrics whenever possible.
- Test dataset generation: use the DeepEval Synthesizer to generate a comprehensive set of realistic questions.
- Bulk evaluation: if you are running multiple metrics on multiple questions, generate the responses once, store them in a pandas DataFrame, and calculate all the metrics in bulk with parallelization.
- Quantify hallucination: I love the faithfulness metric, which indicates how much of the generated output is factually consistent with the context provided by the retriever in RAG!
- CI/CD: run these tests automatically in your CI/CD pipeline so that no code change or prompt change breaks anything.
- Guardrails: some high-speed tests can run on every API call in a post-processor before responding to the user; leave the slower tests for CI/CD.

🌟 DeepEval GitHub: https://lnkd.in/g9VzqPqZ
🔗 DeepEval Bulk evaluation: https://lnkd.in/g8DQ9JAh

Let me know in the comments if you have other ways to test LLM output systematically! Follow me for more tips on building successful ML and LLM products!
Medium: https://lnkd.in/g2jAJn5
X: https://lnkd.in/g_JbKEkM
#generativeai #llm #nlp #artificialintelligence #mlops #llmops
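To show how close this is to ordinary pytest, here is a minimal DeepEval-style test sketch. Exact class and argument names can vary between DeepEval versions, the metrics assume an OpenAI judge model configured via your API key, and the example texts are made up.

```python
# Minimal DeepEval test sketch (run with `pytest`); inputs and outputs are illustrative.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_return_policy_answer():
    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output="You can return items within 30 days of purchase.",
        retrieval_context=["Items may be returned within 30 days of purchase for a full refund."],
    )
    # Label-less metrics: they only need the input, output, and retrieved context.
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.7),
    ])
```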
-
Explaining the evaluation method LLM-as-a-Judge (LLMaaJ).

Token-based metrics like BLEU or ROUGE are still useful for structured tasks like translation or summarization. But for open-ended answers, RAG copilots, or complex enterprise prompts, they often miss the bigger picture. That’s where LLMaaJ changes the game.

What is it?
You use a powerful LLM as an evaluator, not a generator. It’s given:
- The original question
- The generated answer
- The retrieved context or gold answer

Then it assesses:
✅ Faithfulness to the source
✅ Factual accuracy
✅ Semantic alignment, even if phrased differently

Why this matters:
LLMaaJ captures what traditional metrics can’t. It understands paraphrasing. It flags hallucinations. It mirrors human judgment, which is critical when deploying GenAI systems in the enterprise.

Common LLMaaJ-based metrics:
- Answer correctness
- Answer faithfulness
- Coherence, tone, and even reasoning quality

📌 If you’re building enterprise-grade copilots or RAG workflows, LLMaaJ is how you scale QA beyond manual reviews. (A bare-bones judge-prompt sketch follows this post.)

To put LLMaaJ into practice, check out EvalAssist, a new tool from IBM Research. It offers a web-based UI to streamline LLM evaluations:
- Refine your criteria iteratively using Unitxt
- Generate structured evaluations
- Export as Jupyter notebooks to scale effortlessly

A powerful way to bring LLM-as-a-Judge into your QA stack.
- Get Started guide: https://lnkd.in/g4QP3-Ue
- Demo Site: https://lnkd.in/gUSrV65s
- Github Repo: https://lnkd.in/gPVEQRtv
- Whitepapers: https://lnkd.in/gnHi6SeW
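A bare-bones illustration of the LLMaaJ pattern, using the OpenAI Python client as the judge. The model name, rubric wording, and JSON schema are illustrative assumptions; production setups usually add structured-output enforcement and retries.

```python
# Minimal LLM-as-a-Judge sketch; model, rubric, and score scale are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Candidate answer: {answer}

Rate faithfulness to the context and factual correctness, each from 1 to 5.
Reply with JSON only: {{"faithfulness": int, "correctness": int, "rationale": str}}"""

def judge(question: str, context: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",   # any strong evaluator model
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, context=context, answer=answer)}],
    )
    # Naive parse; a real pipeline would validate the schema and retry on failure.
    return json.loads(resp.choices[0].message.content)

print(judge("Who wrote Hamlet?",
            "Hamlet is a tragedy written by William Shakespeare.",
            "It was written by Shakespeare."))
```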
-
Your RAG app is NOT going to be usable in production (especially at large enterprises) if you overlook these evaluation steps:

- Before anything else, FIRST create a comprehensive evaluation dataset by writing queries that match real production use cases.
- Evaluate retriever performance with non-rank metrics like Recall@k (how many relevant chunks are found in the top-k results) and Precision@k (what fraction of retrieved chunks are actually relevant). These show whether the right content is being found, regardless of order :)
- Assess retriever ranking quality with rank-based metrics including:
  1. MRR (position of the first relevant chunk)
  2. MAP (considers all relevant chunks and their ranks)
  3. NDCG (compares the actual ranking to the ideal ranking)
  These measure how well your relevant content is prioritized (a sketch of these retrieval metrics follows this post).
- Measure generator citation performance by designing prompts that request explicit citations like [1], [2] or source sections. Calculate citation Recall@k (relevant chunks that were actually cited) and citation Precision@k (cited chunks that are actually relevant).
- Evaluate response quality with quantitative metrics like token-level F1 score, by tokenising both the generated and ground-truth responses.
- Apply qualitative assessment across key dimensions including completeness (fully answers the query), relevancy (answer matches the question), harmfulness (potential for harm through errors), and consistency (aligns with the provided chunks).

Finally, with your learnings from the eval results, you can implement systematic optimisation in three sequential stages:
1. pre-processing (chunking, embeddings, query rewriting)
2. processing (retrieval algorithms, LLM selection, prompts)
3. post-processing (safety checks, formatting)

With the right evaluation strategies and metrics in place, you can drastically enhance the performance and reliability of RAG systems :)

Link to the brilliant article by Ankit Vyas from neptune.ai on how to implement these steps: https://lnkd.in/guDnkdMT

#RAG #AIAgents #GenAI
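Here is a small sketch of those retrieval metrics, assuming `retrieved` is a ranked list of chunk IDs from your retriever and `relevant` is the ground-truth set for one query (binary relevance).

```python
# Retrieval-metric sketch: Recall@k, Precision@k, MRR, and binary-relevance NDCG@k.
import math

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / max(len(relevant), 1)

def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / k

def mrr(retrieved, relevant):
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

retrieved = ["c7", "c2", "c9", "c4", "c1"]   # ranked retriever output (illustrative IDs)
relevant = {"c2", "c4"}                      # ground-truth relevant chunks for this query
print(recall_at_k(retrieved, relevant, 3),
      precision_at_k(retrieved, relevant, 3),
      mrr(retrieved, relevant),
      ndcg_at_k(retrieved, relevant, 5))
```

Averaging these per-query scores over the full evaluation dataset gives the system-level numbers you track across retriever changes.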
-
In the rapidly evolving world of conversational AI, Large Language Model (LLM) based chatbots have become indispensable across industries, powering everything from customer support to virtual assistants. However, evaluating their effectiveness is no simple task, as human language is inherently complex, ambiguous, and context-dependent. In a recent blog post, Microsoft's Data Science team outlined key performance metrics designed to assess chatbot performance comprehensively.

Chatbot evaluation can be broadly categorized into two key areas: search performance and LLM-specific metrics.

On the search front, one critical factor is retrieval stability, which ensures that slight variations in user input do not drastically change the chatbot's search results (a small stability-check sketch follows this post). Another vital aspect is search relevance, which can be measured through multiple approaches, such as comparing chatbot responses against a ground-truth dataset or running A/B tests to evaluate how well the retrieved information aligns with user intent.

Beyond search performance, chatbot evaluation must also account for LLM-specific metrics, which focus on how well the model generates responses. These include:
- Task Completion: measures the chatbot's ability to accurately interpret and fulfill user requests. A high-performing chatbot should successfully execute tasks, such as setting reminders or providing step-by-step instructions.
- Intelligence: assesses coherence, contextual awareness, and the depth of responses. A chatbot should go beyond surface-level answers and demonstrate reasoning and adaptability.
- Relevance: evaluates whether the chatbot’s responses are appropriate, clear, and aligned with user expectations in terms of tone, clarity, and courtesy.
- Hallucination: checks that the chatbot’s responses are factually accurate and grounded in reliable data, minimizing misinformation and misleading statements.

Effectively evaluating LLM-based chatbots requires a holistic, multi-dimensional approach that integrates search performance and LLM-generated response quality. By considering these diverse metrics, developers can refine chatbot behavior, enhance user interactions, and build AI-driven conversational systems that are not only intelligent but also reliable and trustworthy.

#DataScience #MachineLearning #LLM #Evaluation #Metrics #SnacksWeeklyonDataScience

Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain the concepts discussed in this and future posts in more detail:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- Youtube: https://lnkd.in/gcwPeBmR
https://lnkd.in/gAC8eXmy
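One way to make retrieval stability measurable: retrieve for a query and for several paraphrases of it, then compare the overlap of the returned document IDs. The `search()` function below is a hypothetical hook into your retriever, and the paraphrases are illustrative.

```python
# Retrieval-stability sketch; `search(query, k)` is a hypothetical function that
# returns a ranked list of document IDs from your retriever.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def retrieval_stability(search, query, paraphrases, k=5):
    base = search(query, k)
    overlaps = [jaccard(base, search(p, k)) for p in paraphrases]
    return sum(overlaps) / len(overlaps)   # 1.0 = identical results, 0.0 = no overlap

# Example usage, once `search` is wired to your index:
# score = retrieval_stability(search, "reset my password",
#                             ["how do I reset my password", "password reset steps"])
```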
-
Unlocking the Next Era of RAG System Evaluation: Insights from the Latest Comprehensive Survey

Retrieval-Augmented Generation (RAG) has become a cornerstone for enhancing large language models (LLMs), especially when accuracy, timeliness, and factual grounding are critical. However, as RAG systems grow in complexity, integrating dense retrieval, multi-source knowledge, and advanced reasoning, the challenge of evaluating their true effectiveness has intensified. A recent survey from leading academic and industrial research organizations delivers the most exhaustive analysis yet of RAG evaluation in the LLM era. Here are the key technical takeaways:

1. Multi-Scale Evaluation Frameworks
The survey dissects RAG evaluation into internal and external dimensions. Internal evaluation targets the core components, retrieval and generation, assessing not just their standalone performance but also their interactions. External evaluation addresses system-wide factors like safety, robustness, and efficiency, which are increasingly vital as RAG systems are deployed in real-world, high-stakes environments.

2. Technical Anatomy of RAG Systems
Under the hood, a typical RAG pipeline is split into two main sections:
- Retrieval: involves document chunking, embedding generation, and sophisticated retrieval strategies (sparse, dense, hybrid, or graph-based). Preprocessing such as corpus construction and intent recognition is essential for optimizing retrieval relevance and comprehensiveness.
- Generation: the LLM synthesizes retrieved knowledge, leveraging advanced prompt engineering and reasoning techniques to produce contextually faithful responses. Post-processing may include entity recognition or translation, depending on the use case.

3. Diverse and Evolving Evaluation Metrics
The survey catalogues a wide array of metrics:
- Traditional IR metrics: Precision@K, Recall@K, F1, MRR, NDCG, MAP for retrieval quality.
- NLG metrics: Exact Match, ROUGE, BLEU, METEOR, BertScore, and Coverage for generation accuracy and semantic fidelity (a tiny sketch of the token-level metrics follows this post).
- LLM-based metrics: recent trends show a rise in LLM-as-judge approaches (e.g., RAGAS, Databricks Eval), semantic perplexity, key point recall, FactScore, and representation-based methods like GPTScore and ARES. These enable nuanced, context-aware evaluation that better aligns with real-world user expectations.

4. Safety, Robustness, and Efficiency
The survey highlights specialized benchmarks and metrics for:
- Safety: evaluating robustness to adversarial attacks (e.g., knowledge poisoning, retrieval hijacking), factual consistency, privacy leakage, and fairness.
- Efficiency: measuring latency (time to first token, total response time), resource utilization, and cost-effectiveness, all crucial for scalable deployment.
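For the token-level NLG metrics in point 3, here is a tiny sketch of Exact Match and token-level F1; lowercasing and whitespace tokenization are simplifying assumptions.

```python
# Exact Match and token-level F1 sketch (SQuAD-style, simplified).
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)     # overlapping tokens, with multiplicity
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))          # 1.0
print(round(token_f1("The capital of France is Paris",
                     "Paris is the capital of France"), 2))
```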
-
This is by far the biggest challenge of LLMs! [Save this post for later]

Built a fancy application using LLMs? But how do you know it works well? The word is “Evaluation”.

The complexities of testing LLMs include some key challenges:
• Non-deterministic behavior
• Inconsistent outputs
• Hallucinations
• Arbitrary user inputs

With these challenges in mind, there's a pressing need for new metrics to assess accuracy effectively. Why is effective testing crucial? It leads to faster iterations and quicker decisions on models, prompts, tools, and retrieval strategies.

Three stages of effective testing:
1. Design: start with simple heuristics and add hard-coded assertions that are cheap and fast to run. This approach helps catch errors before they impact users. (A small sketch of such assertions follows this post.)
2. Pre-production: measure app performance for continuous improvement and define test scenarios. Begin with 10-50 quality examples and gradually expand your dataset to cover edge cases.
3. Post-production: insights from real-user scenarios are invaluable. Set up tracing and evaluate performance without grounded reference responses.

To navigate these complexities, consider using LLMs to grade retrieval relevance or assess hallucinations.

💡 Share this post with your network, and let’s discuss! What strategies have you found effective in LLM testing? Comment below!

I speak about AI and Agents at conferences, TV shows, podcasts, and events. Visit vidhichugh[dot]com to know more.
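A small sketch of the design-stage checks mentioned above: cheap, hard-coded assertions that run in milliseconds, before any LLM-based grading. The thresholds and banned phrases are illustrative assumptions.

```python
# Cheap heuristic checks to run on every LLM output before heavier evaluation.
BANNED_PHRASES = ["as an ai language model", "i cannot help with that"]

def basic_output_checks(output: str, max_chars: int = 2000) -> list[str]:
    failures = []
    if not output.strip():
        failures.append("empty response")
    if len(output) > max_chars:
        failures.append("response too long")
    if any(p in output.lower() for p in BANNED_PHRASES):
        failures.append("boilerplate refusal detected")
    if output.count("`" * 3) % 2 != 0:       # unpaired markdown code fences
        failures.append("unclosed code fence")
    return failures

assert basic_output_checks("Here is your summary: ...") == []
assert "empty response" in basic_output_checks("   ")
```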
-
Evaluating LLMs is a minefield! 💣

The evaluation of Large Language Models (#llms) is challenging and flawed in many ways. Recent research highlights how inadequate use and evaluation of LLMs can seed misconceptions and cause harm. 🚨

Most issues come down to 3 hard problems in LLM evaluation:

1️⃣ Prompt Sensitivity
Example: a paper claimed ChatGPT got worse over time. But subtle changes to the prompt recreated the original results and showed no degradation in capability. (A tiny prompt-sensitivity check is sketched after this post.)

2️⃣ Construct Validity
Example: a paper claimed ChatGPT has a liberal bias. But the evaluation used a trick prompt to force the LLM to provide an opinion. Without the trick prompt, the LLM refused to express an opinion in most cases.

3️⃣ Contamination
Example: GPT-4 passes the bar exam. But it likely relied heavily on stored information rather than reasoning, because the exam Q&As had been included in its training data.

These problems are further compounded by the proprietary / closed-source nature of most LLMs. 🔒

Key takeaways?
👉 We need better evaluation methods for LLMs.
👉 We need to push for LLMs to be open sourced.
👉 We need to be scientifically and socially responsible before making bold statements about LLMs.

What methods have you found effective for evaluating LLMs? Link to the research presentation in the first comment below 👇.

#ai #generativeai #responsibleai #machinelearning
Image: Petr Vaclav & Midjourney v5.2, “The Desolation of LLM Evaluation”, 2023
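One lightweight guard against the prompt-sensitivity problem: run the same evaluation set under several prompt templates and report the spread, not a single number. The sketch below assumes hypothetical `ask_model()` and `is_correct()` hooks into your model and grader; the templates are illustrative.

```python
# Prompt-sensitivity sketch; `ask_model(prompt)` and `is_correct(question, answer)`
# are hypothetical hooks into your model and your grading logic.
import statistics

TEMPLATES = [
    "Answer the question: {q}",
    "Q: {q}\nA:",
    "You are an expert. Answer concisely: {q}",
]

def accuracy_per_template(ask_model, is_correct, questions):
    scores = []
    for template in TEMPLATES:
        hits = [is_correct(q, ask_model(template.format(q=q))) for q in questions]
        scores.append(sum(hits) / len(hits))
    return scores

# Report both the mean and the spread across templates, e.g.:
# scores = accuracy_per_template(ask_model, is_correct, eval_questions)
# print(statistics.mean(scores), max(scores) - min(scores))
```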
-
💡 "What if the key to your success was hidden in a simple evaluation model?” In the competitive world of corporate training, ensuring the effectiveness of programs is crucial. 📈 But how do you measure success? This is where the Kirkpatrick Evaluation Model comes into play, and it became my lifeline during a challenging time. ✨ The Turning Point ✨ Our company invested heavily in a new leadership development program a few years ago. I was tasked with overseeing its success. Despite our best efforts, the initial feedback was mixed, and I felt the pressure mounting. 😟 Then, I discovered the Kirkpatrick Evaluation Model. This four-level framework was about to change everything: 🔹Level 1: Reaction - I began by gathering immediate participant feedback. Were they engaged? Did they find the training valuable? This was my first step in understanding the initial impact. 👍 🔹 Level 2: Learning - Next, I measured what participants learned. We used pre-and post-training assessments to gauge their acquired knowledge and skills. 🧠📚 🔹 Level 3: Behavior - The real test came when we looked at behavior changes. Did participants apply their new skills on the job? I conducted follow-up surveys and observed their performance over time. 👀💪 🔹 Level 4: Results - Finally, we analyzed the overall impact on the organization. Were we seeing improved performance and tangible business outcomes? This holistic view provided the evidence we needed. 📊🚀 🌈 The Transformation 🌈 Using the Kirkpatrick Model, we were able to pinpoint strengths and areas for improvement. By iterating on our program based on these insights, we turned things around. Participants were not only learning but applying their new skills effectively, leading to remarkable business results. This journey taught me the power of structured evaluation and the importance of continuous improvement. The Kirkpatrick Model didn't just help us survive; it helped us thrive. 🌟 Ready to transform your training initiatives? Let’s connect with a complimentary 15-minute call with me and discuss how you can leverage the Kirkpatrick Model to drive results. 🚀 https://lnkd.in/grUbB-Kw Share your experiences with training evaluations in the comments below! Let's learn and grow together. 🌱 #CorporateTraining #KirkpatrickModel #ProfessionalDevelopment #TrainingEffectiveness #ContinuousImprovement
-
The 3 bridges to connect the Kirkpatrick levels:

If unfamiliar, the Kirkpatrick levels are:
Level 1: Reaction
Level 2: Learning
Level 3: Behavior
Level 4: Results

Less than 5% of teams reach Level 4. There are many reasons for that. A main one is the assumption that these levels stack nicely on top of, and are dependent on, each other (they aren't). This leaves gaps between the levels. Here are 3 ways to bridge them:

1) Level 1 -> RITE -> Level 2
RITE stands for relevant, important, timely, and effective, which is what you should measure and improve if learning is to take place. (Read Thalheimer's book Performance-Focused Learner Surveys.)

2) Level 2 -> COMPETENCE -> Level 3
Specifically, Decision-Making Competence (can they apply knowledge to make the right decisions) and Task Competence (can they perform the right tasks). (Also from Thalheimer; see his LTEM framework.)
*Bonus: implement Tom Gilbert's Behavior Engineering Model to make sure the behaviors actually take place.

3) Level 3 -> OUTPUTS -> Level 4
Behaviors don't produce results on their own; the effects of behaviors do. What valuable work (outputs) must reps produce, and to what standard, to achieve the leading indicators that your desired results depend on?

Which bridge will you start using?

#salesenablement #salestraining #training