LinkedIn has published one of the best reports I've read on deploying LLM applications: what worked and what didn't.

1. Structured outputs
They chose YAML over JSON as the output format because YAML uses fewer output tokens. Initially, only 90% of the outputs were correctly formatted YAML. They used re-prompting (asking the model to fix its YAML responses), which increased the number of API calls significantly. They then analyzed the common formatting errors, added hints about them to the original prompt, and wrote an error-fixing script. This reduced their error rate to 0.01%.

2. Sacrificing throughput for latency
Originally they focused on TTFT (Time To First Token), but realized that TBT (Time Between Tokens) hurt them a lot more, especially with Chain-of-Thought queries where users don't see the intermediate outputs. They found that TTFT and TBT inversely correlate with TPS (Tokens per Second): to achieve good TTFT and TBT, they had to sacrifice TPS.

3. Automatic evaluation is hard
One core challenge of evaluation is coming up with a guideline for what a good response is. For example, for skill fit assessment, the response "You're not a good fit for this job" can be correct but not helpful. Originally, evaluation was ad hoc and everyone could chime in. That didn't work. They then had linguists build tooling and processes to standardize annotation, evaluating up to 500 daily conversations, and these manual annotations guide their iteration. Their next goal is automatic evaluation, but it's not easy.

4. Initial success with LLMs can be misleading
It took them 1 month to achieve 80% of the experience they wanted, and an additional 4 months to surpass 95%. The initial success made them underestimate how challenging it is to improve the product, especially when dealing with hallucinations. They found it discouraging how slow it was to achieve each subsequent 1% gain.

#aiengineering #llms #aiapplication
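A minimal sketch of the parse-then-repair pattern described in point 1, assuming a generic `call_llm` helper and a couple of example repair rules; this is illustrative, not LinkedIn's actual implementation.

```python
import yaml

def call_llm(prompt: str) -> str:
    """Placeholder for a provider call; assumed, not shown in the post."""
    raise NotImplementedError

def repair_yaml(text: str) -> str:
    # Fix a couple of common formatting errors seen in practice
    # (the post does not list LinkedIn's actual rules; these are examples).
    text = text.strip().removeprefix("```yaml").removesuffix("```")
    text = text.replace("\t", "  ")  # tabs are invalid indentation in YAML
    return text

def get_structured_output(prompt: str, max_retries: int = 1) -> dict:
    raw = call_llm(prompt)
    for attempt in range(max_retries + 1):
        try:
            return yaml.safe_load(repair_yaml(raw))
        except yaml.YAMLError as err:
            if attempt == max_retries:
                raise
            # Last resort: re-prompt the model to fix its own YAML. This costs
            # an extra API call, which is why prompt hints + local repair are
            # preferred over re-prompting everything.
            raw = call_llm(f"Fix this so it is valid YAML only:\n{raw}\nError: {err}")
```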
MLOps for AI Development
-
When working with multiple LLM providers, managing prompts, and handling complex data flows, structure isn't a luxury; it's a necessity. A well-organized architecture enables:
→ Collaboration between ML engineers and developers
→ Rapid experimentation with reproducibility
→ Consistent error handling, rate limiting, and logging
→ Clear separation of configuration (YAML) and logic (code)

𝗞𝗲𝘆 𝗖𝗼𝗺𝗽𝗼𝗻𝗲𝗻𝘁𝘀 𝗧𝗵𝗮𝘁 𝗗𝗿𝗶𝘃𝗲 𝗦𝘂𝗰𝗰𝗲𝘀𝘀
It's not just about folder layout; it's about how components interact and scale together:
→ Centralized configuration using YAML files
→ A dedicated prompt engineering module with templates and few-shot examples
→ Properly sandboxed model clients with standardized interfaces
→ Utilities for caching, observability, and structured logging
→ Modular handlers for managing API calls and workflows

This setup can save teams countless hours in debugging, onboarding, and scaling real-world GenAI systems, whether you're building RAG pipelines, fine-tuning models, or developing agent-based architectures.

→ What's your go-to project structure when working with LLMs or Generative AI systems? Let's share ideas and learn from each other.
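A minimal sketch of the configuration/logic split described above, assuming a hypothetical `config/models.yaml` file and a generic client interface; the names and layout are illustrative, not a prescribed structure.

```python
# config/models.yaml (example contents, illustrative only):
#   default_provider: openai
#   providers:
#     openai:
#       model: gpt-4o-mini
#       temperature: 0.2
#       max_retries: 3

from pathlib import Path
import yaml

class ModelClient:
    """Standardized interface so handlers never depend on a specific provider SDK."""

    def __init__(self, provider: str, settings: dict):
        self.provider = provider
        self.settings = settings

    def complete(self, prompt: str) -> str:
        # The provider-specific SDK call would go here (OpenAI, Anthropic, ...).
        raise NotImplementedError

def load_client(config_path: str = "config/models.yaml") -> ModelClient:
    # All tunable values live in YAML; code only reads and wires them together.
    config = yaml.safe_load(Path(config_path).read_text())
    provider = config["default_provider"]
    return ModelClient(provider, config["providers"][provider])
```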
-
There are 3 ingredients that pretty much guarantee the failure of any Machine Learning project: having the Data Scientists training models in notebooks, having the data teams siloed, and having no DevOps for the ML applications! Interestingly enough, that is where most companies trying out ML get stuck.

The level of investment in ML infrastructure is directly proportional to the level of impact a company expects ML to have on the business, and the level of impact is, in turn, proportional to the level of investment. It is a vicious circle!

Both Microsoft and Google established standards for MLOps maturity that capture the degree of automation of ML practices, and there is a lot to learn from those:
- Microsoft: https://lnkd.in/gtzDcNb9
- Google: https://lnkd.in/gA4bR77x

Level 0 is the stage without any automation. Typically, the Data Scientists (or ML engineers, depending on the company) are completely disconnected from the other data teams. That is the guaranteed-failure stage! It is possible for companies to pass through that stage to explore some ML opportunities, but if they stay stuck there, ML is never going to contribute to the company's revenue.

Level 1 is when there is a sense that ML applications are software applications. As a consequence, basic DevOps principles are applied to the software in production, but there is a failure to recognize the specificity of ML operations. In development, data pipelines are better established to streamline manual model development.

At level 2, things get interesting! ML becomes significant enough for the business that the company invests in reducing model development time and errors. Data teams work closer together as model development is automated and experiments are tracked and reproducible.

If ML becomes a large driver of revenue, level 3 is the minimum bar to strive for! That is where moving from development to deployment is a breeze. DevOps principles extend to ML pipelines, including testing the models and the data. Models are A/B tested in production, and monitoring is maturing. This allows the ML engineering team to iterate and scale quickly.

Level 4 is FAANG maturity level, a level that most companies shouldn't compare themselves to. Because of ads, Google owes ~70% of its revenue to ML and Meta ~95%, so a high level of maturity is required. Teams work together, retraining happens at least daily, and everything is fully monitored.

For any company to succeed in ML, teams should work closely together and aim for a high level of automation, removing the human element as a source of error.

#MachineLearning #DataScience #ArtificialIntelligence

--
👉 Register for the ML Fundamentals Bootcamp: https://lnkd.in/gasbhQSk
--
-
Most ML systems don't fail because of poor models. They fail at the systems level! You can have a world-class model architecture, but if you can't reproduce your training runs, automate deployments, or monitor model drift, you don't have a reliable system. You have a science project. That's where MLOps comes in.

🔹 𝗠𝗟𝗢𝗽𝘀 𝗟𝗲𝘃𝗲𝗹 𝟬 - 𝗠𝗮𝗻𝘂𝗮𝗹 & 𝗙𝗿𝗮𝗴𝗶𝗹𝗲
This is where many teams operate today.
→ Training runs are triggered manually (notebooks, scripts)
→ No CI/CD, no tracking of datasets or parameters
→ Model artifacts are not versioned
→ Deployments are inconsistent, sometimes even manual copy-paste to production
There's no real observability, no rollback strategy, no trust in reproducibility.

To move forward:
→ Start versioning datasets, models, and training scripts
→ Introduce structured experiment tracking (e.g., MLflow, Weights & Biases)
→ Add automated tests for data schema and training logic
This is the foundation. Without it, everything downstream is unstable.

🔹 𝗠𝗟𝗢𝗽𝘀 𝗟𝗲𝘃𝗲𝗹 𝟭 - 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 & 𝗥𝗲𝗽𝗲𝗮𝘁𝗮𝗯𝗹𝗲
Here, you start treating ML like software engineering.
→ Training pipelines are orchestrated (Kubeflow, Vertex Pipelines, Airflow)
→ Every commit triggers CI: code linting, schema checks, smoke training runs
→ Artifacts are logged and versioned, models are registered before deployment
→ Deployments are reproducible and traceable
This isn't about chasing tools, it's about building trust in your system. You know exactly which dataset and code version produced a given model. You can roll back. You can iterate safely.

To get here:
→ Automate your training pipeline
→ Use registries to track models and metadata
→ Add monitoring for drift, latency, and performance degradation in production

My 2 cents 🫰
→ Most ML projects don't die because the model didn't work.
→ They die because no one could explain what changed between the last good version and the one that broke.
→ MLOps isn't overhead. It's the only path to stable, scalable ML systems.
→ Start small, build systematically, treat your pipeline as a product.
If you're building for reliability, not just performance, you're already ahead.

Workflow inspired by: Google Cloud
----
If you found this post insightful, share it with your network ♻️
Follow me (Aishwarya Srinivasan) for more deep dive AI/ML insights!
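As a concrete starting point for the experiment-tracking step above, here is a minimal MLflow sketch; the dataset, parameters, tags, and model name are placeholders, and registering the model assumes a tracking server with a model registry backend.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data and parameters, for illustration only.
X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
params = {"n_estimators": 200, "max_depth": 6}

with mlflow.start_run(run_name="baseline-retrain"):
    mlflow.log_params(params)                # record what produced this model
    mlflow.set_tag("dataset_version", "v3")  # tie the run to a data version

    model = RandomForestClassifier(**params, random_state=0).fit(X_train, y_train)
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", val_auc)

    # Register the artifact so deployment can reference a named, versioned model
    # (requires a registry-backed tracking server; name is illustrative).
    mlflow.sklearn.log_model(model, "model", registered_model_name="demo-classifier")
```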
-
You've built your AI agent... but how do you know it's not failing silently in production?

Building AI agents is only the beginning. If you're thinking of shipping agents into production without a solid evaluation loop, you're setting yourself up for silent failures, wasted compute, and eventually broken trust. Here's how to make your AI agents production-ready with a clear, actionable evaluation framework:

𝟭. 𝗜𝗻𝘀𝘁𝗿𝘂𝗺𝗲𝗻𝘁 𝘁𝗵𝗲 𝗥𝗼𝘂𝘁𝗲𝗿
The router is your agent's control center. Make sure you're logging:
- Function Selection: Which skill or tool did it choose? Was it the right one for the input?
- Parameter Extraction: Did it extract the correct arguments? Were they formatted and passed correctly?
✅ Action: Add logs and traces to every routing decision. Measure correctness on real queries, not just happy paths.

𝟮. 𝗠𝗼𝗻𝗶𝘁𝗼𝗿 𝘁𝗵𝗲 𝗦𝗸𝗶𝗹𝗹𝘀
These are your execution blocks: API calls, RAG pipelines, code snippets, etc. You need to track:
- Task Execution: Did the function run successfully?
- Output Validity: Was the result accurate, complete, and usable?
✅ Action: Wrap skills with validation checks. Add fallback logic if a skill returns an invalid or incomplete response.

𝟯. 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝘁𝗵𝗲 𝗣𝗮𝘁𝗵
This is where most agents break down in production: taking too many steps or producing inconsistent outcomes. Track:
- Step Count: How many hops did it take to get to a result?
- Behavior Consistency: Does the agent respond the same way to similar inputs?
✅ Action: Set thresholds for max steps per query. Create dashboards to visualize behavior drift over time.

𝟰. 𝗗𝗲𝗳𝗶𝗻𝗲 𝗦𝘂𝗰𝗰𝗲𝘀𝘀 𝗠𝗲𝘁𝗿𝗶𝗰𝘀 𝗧𝗵𝗮𝘁 𝗠𝗮𝘁𝘁𝗲𝗿
Don't just measure token count or latency. Tie success to outcomes. Examples:
- Was the support ticket resolved?
- Did the agent generate correct code?
- Was the user satisfied?
✅ Action: Align evaluation metrics with real business KPIs. Share them with product and ops teams.

Make it measurable. Make it observable. Make it reliable. That's how enterprises scale AI agents. Easier said than done.
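A minimal sketch of points 1 and 2: log every routing decision and wrap each skill with an output-validity check plus a fallback. The logger fields and the `search_flights` skill are hypothetical, chosen only to illustrate the pattern.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.eval")

def instrumented_skill(validate):
    """Decorator: trace each skill call and apply an output-validity check."""
    def wrap(skill):
        @functools.wraps(skill)
        def inner(**params):
            start = time.perf_counter()
            log.info("router selected=%s params=%s", skill.__name__, params)
            try:
                result = skill(**params)
            except Exception:
                log.exception("skill failed: %s", skill.__name__)
                return {"status": "error", "fallback": True}
            latency = time.perf_counter() - start
            if not validate(result):
                log.warning("invalid output from %s (%.2fs)", skill.__name__, latency)
                return {"status": "invalid", "fallback": True}
            log.info("skill ok: %s (%.2fs)", skill.__name__, latency)
            return {"status": "ok", "result": result}
        return inner
    return wrap

# Hypothetical skill, shown only to demonstrate the wrapper.
@instrumented_skill(validate=lambda r: isinstance(r, list) and len(r) > 0)
def search_flights(origin: str, destination: str) -> list:
    return [{"origin": origin, "destination": destination, "price": 120}]

print(search_flights(origin="VIE", destination="LIS"))
```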
-
Check out this framework for building AI Agents that work in production. There are many recommendations out there, so I would like your feedback on this one.

This goes beyond picking a fancy model or plugging in an API. To build a reliable AI agent, you need a well-structured, end-to-end system with safety, memory, and reasoning at its core. Here's the breakdown:

1.🔸Define the Purpose & KPIs
Start with clarity. What tasks should the agent handle? Align goals with KPIs like accuracy, cost, and latency.

2.🔸Choose the Right Tech Stack
Pick your tools: language, LLM, frameworks, and databases. Secure secrets early and plan for production-readiness from day one.

3.🔸Project Setup & Dev Practices
Structure repos for modularity. Add version control, test cases, code linting, and cost-efficient development practices.

4.🔸Integrate Data Sources & APIs
Link your agent to whatever data it needs to act intelligently: PDFs, Notion, databases, or business tools.

5.🔸Build Memory & RAG
Index knowledge and implement semantic search. Let your agent recall facts, documents, and links with citation-first answers.

6.🔸Tools, Reasoning & Control Loops
Empower the agent with tools and decision-making logic. Include retries, validations, and feedback-based learning.

7.🔸Safety, Governance & Policies
Filter harmful outputs, monitor for sensitive data, and build an escalation path for edge cases and PII risks.

8.🔸Evaluate, Monitor & Improve
Use golden test sets and real user data to monitor performance, track regressions, and improve accuracy over time.

9.🔸Deploy, Scale & Operate
Containerize, canary-test, and track usage. Monitor cost, performance, and reliability as your agent scales in production.

Real AI agents are engineered step by step. Hope this guide gives you the needed blueprint to build with confidence.

#AIAgents
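To make step 8 concrete, here is a minimal golden-test-set sketch; the `run_agent` function, the test cases, and the exact-match scoring rule are assumptions for illustration, not part of the framework above.

```python
# A tiny "golden set": known inputs with expected outcomes, checked on every change.
GOLDEN_SET = [
    {"input": "Reset my password", "expected_tool": "send_reset_link"},
    {"input": "What's my order status?", "expected_tool": "lookup_order"},
]

def run_agent(user_input: str) -> dict:
    """Placeholder for the real agent; assumed to return the tool it chose."""
    return {"tool": "send_reset_link" if "password" in user_input else "lookup_order"}

def evaluate(golden_set) -> float:
    passed = 0
    for case in golden_set:
        result = run_agent(case["input"])
        ok = result["tool"] == case["expected_tool"]
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['input']!r} -> {result['tool']}")
    return passed / len(golden_set)

if __name__ == "__main__":
    accuracy = evaluate(GOLDEN_SET)
    # Fail CI if accuracy regresses below a chosen threshold (0.9 is arbitrary here).
    assert accuracy >= 0.9, f"Golden-set accuracy regressed: {accuracy:.0%}"
```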
-
LLM agents are too expensive and too unreliable. Unfortunately, building agentic workflows that work beyond a good demo is hard. I talk daily to people who are trying, and it's tough.

I have a paper and a few experiments to show you: a solution that cuts the costs of running an AI assistant by up to 77.8%. This is a game-changer for the future of agents!

Just so we are on the same page, here is the most popular approach to building agentic workflows: write a long system prompt that instructs the LLM on how to answer users' queries, telling the model how to react to different situations and what business logic it should follow to create its answers. This is simple to build, extremely flexible, and completely unreliable. One minute, it works like magic. The next, you get garbage results. No serious company will ever use this.

A paper published earlier this year proposes a much more structured strategy. Its focus is on finding a middle ground between the flexibility of an LLM and reliable responses.

The 10-second summary: instead of the one-prompt-to-rule-it-all approach, this new strategy separates the business logic execution from the LLM's conversation ability.
• This leads to cheaper agents (77.8% is a big deal)
• Much higher consistency in following rules
• More reliable responses

Here is a blog post that goes into much more detail about how this works, along with a few experiments: https://hubs.ly/Q02MQCQh0

Look at the attached image: that's the difference in cost and latency between the new approach and a more traditional agent. You'll find the link to the paper in the image ALT description.

To reproduce the experiments, check out this GitHub repository: https://lnkd.in/gydzjjcu
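The post doesn't spell out the paper's implementation, so the following is only a generic illustration of separating deterministic business logic from the LLM's conversational layer; the state machine, its states, and the `phrase_response` helper are assumptions.

```python
from enum import Enum, auto

class State(Enum):
    ASK_ORDER_ID = auto()
    CONFIRM_REFUND = auto()
    DONE = auto()

def next_state(state: State, user_input: str) -> tuple[State, str]:
    """Deterministic business logic: plain code decides what happens next."""
    if state is State.ASK_ORDER_ID:
        if user_input.strip().isdigit():
            return State.CONFIRM_REFUND, "order found, ask user to confirm the refund"
        return State.ASK_ORDER_ID, "ask again for a numeric order id"
    if state is State.CONFIRM_REFUND:
        if user_input.lower().startswith("y"):
            return State.DONE, "refund issued, say goodbye"
        return State.DONE, "refund cancelled, say goodbye"
    return State.DONE, "conversation already finished"

def phrase_response(instruction: str) -> str:
    """Only this thin layer would call the LLM, and only to word the reply."""
    return f"[LLM would phrase]: {instruction}"  # placeholder, no API call here

state = State.ASK_ORDER_ID
for user_input in ["my order is 1042", "1042", "yes"]:
    state, instruction = next_state(state, user_input)
    print(phrase_response(instruction))
```

Because the rules live in code, every run follows them exactly, and the LLM is invoked only for short phrasing calls, which is where the cost reduction in the paper's approach comes from.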
-
Customer-facing AI agents keep failing in production... 🤯 Because existing agent frameworks lack some fundamental features.

I've spent months building with every major AI agent framework and discovered why most customer-facing deployments crash and burn:
→ Flowchart builders (Botpress, LangFlow) create rigid paths that customers often break
→ System prompt frameworks (LangGraph, AutoGPT) excel in demos but fail due to AI's unpredictability

Parlant's open-source Conversation Modeling Engine solves this. Here's how and why it matters:

1. Contextual Guidelines vs. Rigid Paths
↳ Instead of mapping every possible conversation flow, define what your agent should do in specific situations.
↳ Each guideline has a condition and an action: when X happens, do Y.
↳ The engine matches only relevant guidelines to each customer message.

2. Guided Tool Use That Stays Reliable
↳ Tools are tied directly to specific guidelines.
↳ No more random API calls or hallucinated data.
↳ Your travel agent won't suddenly search flights when someone asks about baggage fees.

3. Priority Relationships for Natural Conversation
↳ Guidelines have relationships with each other.
↳ When multiple guidelines match, the engine selects based on priority.
↳ Creates step-by-step information gathering without rigid flowcharts.

4. The "Utterances" Feature for Regulated Industries
↳ Pre-approve specific responses for sensitive situations.
↳ The agent checks whether an appropriate Utterance exists before generating.
↳ Completely eliminates hallucinations in critical interactions.

It works with any major LLM provider - OpenAI, Anthropic, Google, Meta.

This approach handles what flowcharts and system prompts can't: the messy reality of actual customer conversations. Your IP isn't the LLM. It's the conversation model you create, the explicit encoding of how your AI agent should interact with customers.

For anyone building agents that need to stay reliable in production, this might be the framework you've been waiting for. Check it out: https://lnkd.in/dNPSDJ7P

P.S. I create AI Agent tutorials and open-source them for free. Your 👍 like and ♻️ repost helps keep me going. Don't forget to follow me, Shubham Saboo, for daily tips and tutorials on LLMs, RAG and AI Agents.
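This is not Parlant's actual API; it is only a rough sketch of the condition/action/priority guideline idea described above, with a keyword check standing in for the engine's contextual matching.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Guideline:
    condition: Callable[[str], bool]  # when does this guideline apply?
    action: str                       # what should the agent do?
    priority: int = 0                 # higher wins when several match

guidelines = [
    Guideline(lambda m: "refund" in m.lower(),
              "Explain the refund policy and offer to open a refund request.", 2),
    Guideline(lambda m: "baggage" in m.lower(),
              "Answer the baggage-fee question; do NOT call the flight-search tool.", 1),
    Guideline(lambda m: True,
              "Answer helpfully and ask a clarifying question if unsure.", 0),
]

def select_action(message: str) -> str:
    # Match only the relevant guidelines, then pick the highest-priority one.
    matched = [g for g in guidelines if g.condition(message)]
    return max(matched, key=lambda g: g.priority).action

print(select_action("How much are baggage fees on my flight?"))
```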
-
Machine learning models aren't a "build once and done" solution; they require ongoing management and quality improvements to thrive within a larger system. In this tech blog, Uber's engineering team shares how they developed a framework to address the challenges of maintaining and improving machine learning systems.

The business need centers on the fact that Uber has numerous machine learning use cases. While teams typically focus on performance metrics like AUC or RMSE, other crucial factors, such as the timeliness of training data, model reproducibility, and automated retraining, are often overlooked. To address these challenges at scale, a comprehensive platform approach is essential.

Uber's solution is the Model Excellence Scores framework, designed to measure, monitor, and enforce quality at every stage of the ML lifecycle. The framework is built around three core concepts derived from Service Level Objectives (SLOs): indicators, objectives, and agreements. Indicators are quantitative measures that reflect specific aspects of an ML system's quality. Objectives define target ranges for these indicators, while Agreements consolidate the indicators at the ML use-case level, determining the overall PASS/FAIL status based on indicator results. The framework integrates with other ML systems at Uber to provide insights, enable actions, and ensure accountability for the success of machine learning models.

It's one thing to achieve a one-time success with machine learning; sustaining that success is a far greater challenge. This tech blog provides an excellent reference for anyone building scalable and reliable ML platforms. Enjoy the read!

#machinelearning #datascience #monitoring #health #quality #SLO #SnacksWeeklyonDataScience

– – –
Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- Youtube: https://lnkd.in/gcwPeBmR

https://lnkd.in/g6DJm9pb
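The blog's exact scoring rules aren't reproduced here; the following is only a schematic of the indicator/objective/agreement idea, with made-up indicator names and thresholds.

```python
from dataclasses import dataclass

@dataclass
class Indicator:
    name: str
    value: float          # measured quality signal for one aspect of the ML system
    objective_max: float  # upper bound of the target range for this indicator

    def passes(self) -> bool:
        return self.value <= self.objective_max

def agreement_status(indicators: list[Indicator]) -> str:
    """Agreement = consolidated PASS/FAIL for one ML use case."""
    failing = [i.name for i in indicators if not i.passes()]
    return "PASS" if not failing else f"FAIL ({', '.join(failing)})"

# Hypothetical indicators for one model; names and thresholds are illustrative.
status = agreement_status([
    Indicator("training_data_staleness_days", value=2.0, objective_max=7.0),
    Indicator("feature_null_rate", value=0.12, objective_max=0.05),
    Indicator("days_since_retraining", value=30.0, objective_max=14.0),
])
print(status)  # -> FAIL (feature_null_rate, days_since_retraining)
```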
-
Scaling MLOps on AWS: Embracing Multi-Account Mastery 🚀

Move beyond the small team playground and build robust MLOps for your growing AI ambitions. This architecture unlocks scalability, efficiency, and rock-solid quality control, all while embracing the power of multi-account setups.

Ditch the bottlenecks, embrace agility:
🔗 Multi-account mastery: Separate development, staging, and production environments for enhanced control and security.
🔄 Automated model lifecycle: Seamless workflow from code versioning to production deployment, powered by SageMaker notebooks, Step Functions, and Model Registry.
🌟 Quality at every step: Deploy to staging first, rigorously test, and seamlessly transition to production, all guided by a multi-account strategy.
📊 Continuous monitoring and feedback: Capture inference data, compare against baselines, and trigger automated re-training if significant drift is detected.

Here's how it unfolds:
1️⃣ Development Sandbox: Data scientists experiment in dedicated accounts, leveraging familiar tools like SageMaker notebooks and Git-based version control.
2️⃣ Automated Retraining Pipeline: Step Functions orchestrate model training, verification, and artifact storage in S3, while the Model Registry keeps track of versions and facilitates approvals.
3️⃣ Multi-Account Deployment: Staging and production environments provide controlled testing grounds before unleashing your model on the world. SageMaker endpoints and Auto Scaling groups handle inference requests, powered by Lambda and API Gateway across different accounts.
4️⃣ Continuous Quality Control: Capture inference data from both staging and production environments in S3 buckets. Replicate it to the development account for analysis.
5️⃣ Baseline Comparison and Drift Detection: Use SageMaker Model Monitor to compare real-world data with established baselines, identifying potential model or data shifts.
6️⃣ Automated Remediation: Trigger re-training pipelines based on significant drift alerts, ensuring continuous improvement and top-notch model performance.

This is just the tip of the iceberg! Follow Shadab Hussain for deeper dives into each element of this robust MLOps architecture, explore advanced tools and practices, and empower your medium and large teams to conquer the AI frontier. 🚀

#MLOps #AI #Scalability #MultiAccount #QualityControl #ShadabHussain
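To make steps 5 and 6 more concrete, here is a hedged sketch of turning a drift signal into a retraining run via Step Functions. The state-machine ARN, the drift metric, and its threshold are placeholders; in the architecture above, SageMaker Model Monitor's managed reports would normally supply the drift signal instead of this hand-rolled mean-shift check.

```python
import json
import boto3

DRIFT_THRESHOLD = 0.25  # arbitrary example threshold
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:retrain"  # placeholder

def mean_shift(baseline: list[float], recent: list[float]) -> float:
    """Crude drift signal: relative shift of a feature mean vs. the baseline."""
    base = sum(baseline) / len(baseline)
    cur = sum(recent) / len(recent)
    return abs(cur - base) / (abs(base) or 1.0)

def maybe_trigger_retraining(baseline, recent) -> str:
    drift = mean_shift(baseline, recent)
    if drift <= DRIFT_THRESHOLD:
        return f"No action (drift={drift:.2f})"
    # Kick off the retraining pipeline defined as a Step Functions state machine
    # (requires AWS credentials and an existing state machine).
    sfn = boto3.client("stepfunctions")
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"reason": "data_drift", "drift": drift}),
    )
    return f"Retraining triggered (drift={drift:.2f})"

# Example values chosen so the sketch runs without calling AWS.
print(maybe_trigger_retraining(baseline=[0.9, 1.1, 1.0], recent=[1.0, 1.05, 0.95]))
```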