A Tiered Approach to AI: The New Playbook for Agents and Workflows
A Small Language Model (SLM) is a neural language model with a comparatively low parameter count, typically from the single-digit billions to the low tens of billions. These models trade broad, general-purpose capability for significant gains in efficiency, cost, and privacy, making them ideal for specialized tasks.
I’ve been testing SLMs cautiously, and their practical value is becoming clearer. Smaller, fine-tuned models, for example, are already highly effective for generating embeddings in RAG workflows. The rise of agentic systems makes an even stronger case: a recent Nvidia paper argues that most agent tasks are repetitive, narrowly scoped operations that don't need the power of a large model.
This suggests a more efficient future: using specialized SLMs for routine workflows and reserving heavyweight models for genuinely complex reasoning. With this in mind, here are the strongest reasons to consider SLMs — and where the trade-offs bite.
AI Everywhere: From Cloud to Pocket
SLMs unlock deployment scenarios that are simply impossible for their larger cousins, particularly in edge computing and offline environments. Models with fewer than 3 billion parameters can run effectively on smartphones, industrial sensors, and laptops in the field. This capability is critical for applications that require real-time processing without relying on a cloud connection. Think of a manufacturing firm embedding a tiny model in AR goggles to provide assembly instructions with less than 50ms of latency, or an agricultural drone analyzing crop health in a remote area with no cellular service.
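To make the offline scenario concrete, here is a minimal sketch of local, network-free inference with a quantized sub-3B model via the llama-cpp-python bindings; the GGUF filename and prompt are placeholders rather than a tested setup.

```python
# Minimal sketch of fully offline inference with a quantized sub-3B model using
# llama-cpp-python. The GGUF filename and prompt are placeholders.
from llama_cpp import Llama

# Loads entirely from local disk and runs on CPU -- no network access required.
llm = Llama(model_path="phi-3-mini-q4.gguf", n_ctx=2048, n_threads=4)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is step 3 of the gearbox assembly procedure?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```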
To me, this is the most compelling and durable reason to be excited about SLMs. As AI becomes more deeply integrated into every facet of our work and lives, we will increasingly demand access to models that run on all our devices, regardless of internet connectivity. The ability to function offline or with minimal resources is a fundamental advantage that massive, cloud-dependent models cannot easily replicate, positioning SLMs as essential components of a truly ubiquitous AI future.
The Specialist's Edge
It is a common assumption that more parameters equal better performance, but for domain-specific tasks, carefully fine-tuned SLMs often outperform their larger, general-purpose counterparts. By training a smaller model on a narrow dataset, you can create an expert that is more accurate and reliable for a specific function than a jack-of-all-trades LLM. We have seen this play out in benchmarks: the 3.8B parameter Phi-3 model nearly matched the 12B parameter Codex in a bug-fixing test, and a math-specific 1.5B model achieved performance on par with 7B generalist models on key benchmarks, demonstrating a four-to-five-fold advantage in performance-per-parameter.
Counterpoint: the trade-off for this performance is brittleness. A model that has been hyper-specialized for one task will excel within its training distribution but can fail catastrophically when presented with something outside of it. Furthermore, fine-tuning for one capability, like conversation, can degrade another, like coding performance. Adopting a specialized-model strategy therefore often means building, maintaining, and serving a portfolio of different models, which introduces operational complexity that teams must be prepared to manage.
The Need for Speed
By their nature, SLMs deliver substantially lower latency, making them suitable for real-time interactive applications. Achieving first-token latency under 100 milliseconds becomes possible, which is a critical threshold for voice assistants, gaming AI, and other systems where a half-second delay would render the application unusable. This speed, which stems from lower memory bandwidth and faster computations, directly translates into a more natural and responsive user experience.
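If you want to sanity-check this threshold on your own hardware, the sketch below times the first decoded token from a local model using the streaming API in Hugging Face transformers; the model name and prompt are illustrative, and real figures will vary with hardware, quantization, and serving stack.

```python
# Rough sketch for measuring time-to-first-token (TTFT) locally with the
# transformers streaming API. Model name and prompt are illustrative.
import time
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "microsoft/Phi-3-mini-4k-instruct"   # any small local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Summarize this support ticket in one line: ...", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

start = time.perf_counter()
Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=64)).start()

first_chunk = next(iter(streamer))                 # blocks until the first decoded token arrives
ttft_ms = (time.perf_counter() - start) * 1000
print(f"time to first token: {ttft_ms:.0f} ms ({first_chunk!r})")
```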
Counterpoint: while SLMs hold the theoretical advantage here, I wouldn't underestimate the engineering investments foundation model providers are making to accelerate their flagship models. Users already unknowingly interact with "flash versions" of capable models — larger than typical SLMs but optimized enough to deliver acceptable responsiveness for most use cases. The latency gap continues narrowing as both camps optimize aggressively.
Lowering Your AI Bill
While deploying even a moderately-sized LLM might require a cluster of more than 20 GPUs, an SLM can often run effectively on a single high-end workstation with consumer-grade hardware. The cost difference is stark, with studies showing 10 to 30 times lower costs for compute and energy when comparing a 7B model to a 70B alternative. For instance, a logistics company that replaced GPT-4o-mini with Mistral-7B for a specific task saw its per-query cost drop from $0.008 to $0.0006, saving around $70,000 per month. This efficiency allows teams to operate with more predictable budgets and deploy multiple specialized models for the price of a single large one.
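For context, a back-of-the-envelope calculation with those same figures (shown below purely as an illustration) implies the logistics firm is running on the order of nine to ten million queries a month.

```python
# Back-of-the-envelope check of the per-query figures cited above (illustrative only).
llm_cost_per_query = 0.008     # GPT-4o-mini pipeline, per the example
slm_cost_per_query = 0.0006    # Mistral-7B replacement

savings_per_query = llm_cost_per_query - slm_cost_per_query    # $0.0074
implied_monthly_queries = 70_000 / savings_per_query           # volume behind ~$70k/month in savings

print(f"savings per query: ${savings_per_query:.4f}")
print(f"implied monthly volume: ~{implied_monthly_queries:,.0f} queries")   # roughly 9.5 million
```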
Counterpoint: cost advantages shrink when tasks demand broad world knowledge or multi-step reasoning. Cloud providers also keep trimming API prices, and some large-model vendors benefit from scale. My experience echoes what many founders say: pricing is rarely the deal-breaker, especially with competitive open-weights models available via OpenRouter and Azure's enterprise-friendly OpenAI pricing. The main exception seems to be the Claude family of models, which are excellent but pricey enough that they must be used judiciously. For most teams I talk with, the focus is less on base cost and more on diligent monitoring and optimization of their existing LLM usage.
Keeping Your Data Yours
The ability to run models entirely within organizational boundaries fundamentally changes the security equation for regulated industries. Hospitals deploying Meerkat-8B for patient symptom analysis ensure protected health information never traverses external networks, while European banks running Gemma-2B within their OpenShift clusters satisfy stringent ECB audit requirements without compromising transaction data sovereignty. Defense contractors maintain completely air-gapped deployments for mission-critical systems, achieving immunity from supply chain disruptions that could cripple cloud-dependent alternatives. This local control extends beyond mere compliance checkboxes — it preserves intellectual property and maintains competitive advantages that would evaporate if sensitive data flowed through external APIs.
Counterpoint: the trade-off involves assuming the full burden of infrastructure management, security patching, and GPU orchestration that cloud providers typically handle. Proprietary LLM providers are also starting to close this gap. Google, for instance, has announced that its Gemini models will be available for local deployment through its Google Distributed Cloud platform. This solution offers a fully managed on-premise cloud that can even be run in a completely air-gapped configuration.
Most agent tasks are repetitive, narrowly scoped operations. They don't need the conversational breadth or the cost of a large model.
From Idea to Production, Faster
The smaller size of SLMs allows for iteration cycles that are orders of magnitude faster than with LLMs. Fine-tuning a model to adapt to new data, enforce a strict JSON output, or learn domain-specific terminology can be done in GPU-hours instead of weeks. Parameter-efficient methods (e.g., LoRA) and fine-tuning services put this within reach of small teams. This agility is crucial in production environments where system requirements are constantly evolving.
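A minimal sketch of what the parameter-efficient setup can look like, assuming the Hugging Face transformers and peft libraries; the base model and hyperparameters are illustrative placeholders rather than a tuned recipe.

```python
# LoRA setup sketch with transformers + peft. The base checkpoint and
# hyperparameters are placeholders; the training loop itself is omitted.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "microsoft/Phi-3-mini-4k-instruct"        # any small open-weights checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()               # typically well under 1% of total weights

# From here, train on a small domain dataset (e.g. JSONL prompt/response pairs that
# enforce your JSON schema) with your usual Trainer or SFT loop, then save or merge
# the adapter for serving.
```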
Counterpoint: the catch is that fine-tuning and post-training often aren't optional. Many small models fail completely on structured tasks without customization, which adds engineering overhead and requires high-quality fine-tuning datasets that may be scarce or expensive to create. What appears as flexibility often becomes mandatory complexity.
Building with AI Legos
SLMs fit neatly into service-oriented designs. Instead of a monolith, you compose a system from simple, reliable pieces: an entity extractor, a sentiment rater, a compliance checker, each fine-tuned for its niche and scaled independently. For example, a financial services firm could build a processing pipeline that combines separate, fine-tuned models for entity extraction, sentiment analysis, and compliance checking, with each component doing one thing exceptionally well.
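Here is a sketch of what that composition can look like, assuming each specialist is served behind its own OpenAI-compatible endpoint (for example via vLLM); the service URLs and model names are hypothetical.

```python
# Sketch of a pipeline composed of specialized SLMs behind OpenAI-compatible
# endpoints. Service URLs and model names are hypothetical placeholders.
from openai import OpenAI

SPECIALISTS = {
    "entities":   ("http://entity-extractor:8000/v1",   "entity-slm"),
    "sentiment":  ("http://sentiment-rater:8000/v1",    "sentiment-slm"),
    "compliance": ("http://compliance-checker:8000/v1", "compliance-slm"),
}

def call_specialist(task: str, text: str) -> str:
    base_url, model = SPECIALISTS[task]
    client = OpenAI(base_url=base_url, api_key="local")   # local servers ignore the key
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

def process_document(doc: str) -> dict:
    # Each stage is a small, independently scaled service that does one thing well.
    return {task: call_specialist(task, doc) for task in SPECIALISTS}
```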
Counterpoint: while elegant in theory, this approach introduces significant orchestration complexity in practice. Managing the routing logic, inter-model communication, and version dependencies across a fleet of specialized models is a non-trivial engineering challenge. For smaller teams, the overhead required to build and maintain such a distributed system can sometimes outweigh the efficiency gains it promises.
Beyond the Binary: A Tiered Approach to AI
The debate between large and small models is evolving beyond simple capability trade-offs. Jakub Zavrel of Zeta Alpha recently noted that we've reached an inflection point where frontier models are "good enough" for multi-agent systems. The new bottleneck is not raw model capability, but architecture and specialization — the ability to break down complex problems into modular, specialized components.
The bottleneck in AI is no longer model capability — it's system architecture. The new challenge is breaking down problems into modular, specialized components.
This shift makes a powerful case for an "SLM-first" architecture. Instead of relying on a single monolithic model, systems can be composed of a fleet of efficient SLMs, each an expert in its narrow domain. A more powerful and expensive LLM is reserved only for tasks requiring complex, open-domain reasoning.
For teams that are LLM-first today, the migration path is pragmatic: log your workflows, cluster recurring tasks, and fine-tune small specialists to handle them. Route tasks intelligently by policy and measure three key metrics — cost per action, latency to decision, and task reliability. Done right, your systems will become cheaper, faster, and more robust without sacrificing the option to escalate when the job truly demands it.
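A bare-bones sketch of that routing-and-measurement loop is below; the task list, per-call costs, and policy are assumptions for illustration, and the actual inference client is passed in as a parameter.

```python
# Policy-based router sketch that records the three metrics named above: cost per
# action, latency to decision, and task reliability. Tasks, costs, and the routing
# rule are illustrative assumptions.
import time

SLM_TASKS = {"extract_entities", "classify_ticket", "fill_template"}   # clustered routine work
COST_PER_CALL = {"slm": 0.0006, "llm": 0.008}                          # assumed prices

def route(task_type: str) -> str:
    # Policy: known, narrowly scoped tasks go to the specialist SLM;
    # anything open-ended escalates to the LLM.
    return "slm" if task_type in SLM_TASKS else "llm"

def run_task(task_type: str, payload: str, call_model) -> dict:
    tier = route(task_type)
    start = time.perf_counter()
    try:
        output, ok = call_model(tier, payload), True   # call_model is your inference client
    except Exception:
        output, ok = None, False
    return {
        "tier": tier,
        "cost_per_action": COST_PER_CALL[tier],
        "latency_to_decision_s": time.perf_counter() - start,
        "task_succeeded": ok,          # aggregate over time into task reliability
        "output": output,
    }
```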
Top 10 Open-Source Projects in the Large Model Ecosystem

This leaderboard ranks the ten most influential open-source projects in the AI development ecosystem using OpenRank, a metric that measures community collaboration rather than simple popularity indicators like stars. The list spans the entire technology stack, from foundational infrastructure such as PyTorch for training and Ray for distributed compute, to high-performance inference engines like vLLM, SGLang, and TensorRT-LLM. At the application level, it features agent platforms and development tools including Dify, n8n, and Gemini, which are predominantly built with TypeScript, in contrast to the Python-based infrastructure. The significant influence of academic research is also evident: three key projects, vLLM, Ray, and SGLang (SGL), originated from UC Berkeley’s Sky Computing and RISE Labs, demonstrating a direct path from academic innovation to production-ready tools.
Ben Lorica edits the Gradient Flow newsletter and hosts the Data Exchange podcast. He helps organize the AI Conference, the AI Agent Conference, and the Applied AI Summit, and serves as the Strategic Content Chair for AI at the Linux Foundation. You can follow him on LinkedIn, X, Mastodon, Reddit, Bluesky, YouTube, or TikTok. This newsletter is produced by Gradient Flow.