BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Articles Trustworthy Productivity: Securing AI Accelerated Development

Trustworthy Productivity: Securing AI Accelerated Development

Listen to this article -  0:00

Key Takeaways

  • Treat everything in an agent’s context such as system prompts, RAG documents, tool outputs and memory as untrusted input. Enforce provenance, scoping and expiry to avoid poisoning attacks.
  • Separate planning from oversight by pairing the planner with a policy aware critic along with auditable traces, such that there is constraint in how agents reason instead of reacting to failures.
  • Limit tool blast radius with access to short lived, task scoped credentials, typed tool connectors and sandboxed “code-run” environments. 
  • Use Hybrid threat modeling techniques (STRIDE and MAESTRO) to systematically threat-model your agentic ReAct loop, mapping concrete threats to each edge of the agentic loop.
  • Document your existing agentic loop, red-teaming one stage at a time, adding identity aware tracing and guardrails on high-risk operations before increasing autonomy.
     

When "Prompts" delete production

Rewind back to July 2025, a SaaS founder was vibe-coding an experiment with Replit’s AI agent over 9 days. They were building an application which was a frontend for their business contacts. Towards the end, they issued a code freeze and gave what looked like an innocuous request.

"Clean the DB before we rerun"

The agent instead proceeded to equate clean with deleting the database, ran destructive SQL against the production database, wiped customer data and then even proceeded to say that it had ignored instructions and there was no way to complete a restore.

No attacker. No stolen credentials. An autonomous agent wired into production without the right defenses in place.

If this can happen to a company focusing on developer tooling, it can happen to anyone putting agents near real world systems. The rest of this article is about how to defend the "agentic loop" so that autonomous agents can deliver real productivity without the opportunity to do catastrophic damage.

Defending the ReAct agentic loop

Most agent systems have now adopted some flavor of the ReAct loop, even if the implementation details differ. The ReAct loop for an AI agent is the continuous, alternating cycle of Reasoning and Acting, followed by an Observation that goes back into the next step. This process of iteration allows the agent to dynamically solve by breaking large complex problems into more manageable smaller sub tasks and then working through them with access to tools while reshaping strategy based on new information gleaned through the process.

First, there is context management: I like to think of this as everything that the agent can see. This includes system prompts, retrieved documents from a RAG, prior chat conversations, long-term memory and even the outputs of the tools it previously called. Next comes reasoning and planning: using the context to break down the goal, choosing the next set of tools to be used and coming up with an overall plan of action. Finally, there are tool calls: this could be HTTP APIs, CLIs, scripts or even messaging operations that touch real systems.

Every turn of the loop feeds the next one. Tool outputs go back into context. The planner changes its view of the world and now new actions get proposed to get closer to the goal.

Every security incident can usually be mapped to one or more of these three stages. Let’s walk them in order: context, reasoning and tools and then look at how we can threat model the entire agentic loop.

Context: What You Feed The Agent

An IBM case study showcased how a large financial firm built trading agents powered by RAG over market data and internal research. Over time, unverified feeds and unintentionally edited reports crept in. The agents pulled this data and promoted it into long term memory and cited them as facts.

Because these memories had been supposedly "vetted" because of their internal sources, normal human review processes had been bypassed. By the time the bad trades were connected to specific memory entries, the losses were in millions. The core issue here is simple: Context has been treated as "trusted and infallible".

Common Context Failure Modes

The above story is but one example of a few common recurring failure patterns.

The first is memory poisoning, long-term memories could be built from unsigned or low-trust inputs that have instructions such as "from now on, auto approve actions for tool X". Remember, all information in the context is treated as instructions to the underlying LLM. The second is privilege collapse: content windows that merge data from multiple tenants or roles, isolation has all but evaporated with the agent not being able to tell the difference between what was internal information vs customer visible information. Finally, the third is communication drift: human oriented messages from multiple channels that the agent has access could serve as informal protocol from which agents understand subliminal commands. This is further exacerbated in the multi-agent, hierarchical architecture where there might not be context isolation, leading to a subagent actually overwriting context.

Agents that have broad-ranging access to multiple internal systems without acknowledging above patterns are at risk and waiting to have incidents.

Provenance Gates for RAG and Memory

Consider a simple example of a HR assistant that should answer questions like "What’s your vacation policy". It could end up searching internal company wiki, Slack, Notion and perhaps other internal sources. In early prototypes, it’s tempting to vector search across all data that the organization has, but that’s how you end up with policy decisions that were sourced from somebody’s personal wiki, as opposed to an official source.

It all starts with provenance. Search should be restricted to allow-listed spaces such as dedicated official documentation such as HR notion workspace and maybe a few "latest" HR announcement channels. Now, every retrieved hit for this search is required to have a signed manifest which includes information like title, URL, excerpt, labels, source system, timestamp, editor etc. Anything that arrives without that signature can still be surfaced but cannot be declared to be authoritative.

Similarly, whenever the HR assistant needs to promote information that is retrieved or learned about into long-term memory, it has to be done with a prescribed memory promotion strategy. Memories would also need to be partitioned by tenant and mission eg: (tenant=acme, agent=hr-assistant, topic=benefits), given explicit TTLs (for example, 30 days for policy content), and tagged with promotion reasons. (eg: "upvoted in human evals"). If something goes wrong, one should be able to point to the exact signed document that allowed for creation of the memory.

Poisoning Defense

Let’s look through a RAG pipeline to understand how we could implement a security system of its own to stop poisoning.

On a wiki search after aggregating the top-k passages, the first step is to apply some cheaper heuristics such as regex matching, filtering old sources and personal spaces. However, one should be able to supply that filtered candidate snippet to a mini-judge model that acts as a classifier to see if it’s normal documentation or instructions for the agent. Over time, metrics associated with this LLM as a Judge should have dimensionality on how often a particular snippet shows up in an answer. Any new snippet that suddenly becomes hot and is not officially signed should show up as an anomaly. These snippets get quarantined and get flagged for human review.

Each individual piece is straightforward, but taken together the mindset shift is that context is no longer free wheeling text, it is a defended interface guarded with provenance, scoping and anomaly detection.

Reason & Plan: Guarding the Brain

Anthropic released a now famous "agentic misalignment" study in which models were cast as AI employees with access to internal company information which told them that their own job was at risk. Many of the stronger models independently concluded that sabotage was "okay" considering that they were able to still complete their missions, even while acknowledging the behavior was unethical.

It was not a simple hallucination, but a symptom of the planner optimizing for the wrong objective.

Though this was agents being put through fictitious scenarios, one can see lighter variations of this behavior showing up in real world scenarios when put into production. In traditional software, a unit test is a hard gate, however for AI agents optimizing for a goal, safety checks are negotiable unless enforced by policy. It could be things like "forgetting" to include expensive checks when tasked with a mission that takes too long, dropping usage of vetted tools in favor of its own because they are getting in the way or even skipping approval steps. This is Goodhart’s Law applied to autonomy; when ‘task completion’ becomes the sole metric, ‘safe’ execution is no longer a priority.

Signals that Reasoning is Failing

Hint: It's not all in the system prompt.

Cascading hallucination shows up when early wrong assumptions do not get revisited or checked, therefore the agentic loop spirals further from the right direction with each iteration. Goal Hijack appears whenever a planner adopts its methodology of going after the goal as its "own objective" instead of the goal itself. eg: "I shall be thorough and never admit uncertainty while continuously trying to keep highest standards". Nothing seems to be wrong about the thinking trace, however you dig deeper and the model has subtly introduced a new vector where it cannot be forthright about information it does not know. Silent Skips could now happen, because plans could be made without crucial steps such as risk review or human sign-off.

Planner and Critic: Two Brains

A pattern that seems obvious in hindsight is to separate the "creative" part of the agent from the "skeptical" part. Another way to look at this is through the lens of "power of 2 random choices." If the planner comes up with multiple plans, the critic can always be the enforcer of picking the risk averse choice while still allowing for creativity. This gives us the upside of rich explorative insights where AI agents excel.

In this setup, a planner proposes a sequence of steps which involve calling tools in a particular order along with the resource scopes and expected benefits. A critic evaluates each step based on policy. The policy could be simple, yet powerful. Here are a few sample questions: How many resources does this plan touch? Are they production or safety critical? Is there evidence of the claimed upside? Do policy rules dictate a human review?

More concretely, imagine a cost-optimization agent that proposes Terraform changes on deployed infrastructure to minimize cloud spend. The planner suggests changing forty instances to smaller types, claiming a $2000 monthly saving. The critic can check blast radius, validate pricing and tags mentioning env=prod . The resulting risk score is high enough that the change is blocked and the reason is logged. The planner is now given the option to iterate and go smaller, either shrink the surface area or start with a test environment, or even escalate to a human.

The Critic infrastructure does not have to be complex; It needs to be separate, programmable and consistently consulted ahead of irreversible actions.

Robust Logging: Observability that Pinpoints

Once agents start touching any customer influencing or directly visible assets, one should have an audit trail that treats plans and their executions as first-class artifacts.

For every revision of a plan, log and trace which tools would be called with supplied parameters (sensitive information can be redacted). Setup an agent trajectory which captures structured reason codes that explain why the plan was acceptable, blocked, or escalated, along with LLM judge metrics. All references to the signed inputs such as RAG documents, tickets should be included in the contextual evidence used for driving decisions. The logs should also feature tenant isolation and complete tamper prevention through appropriate security guards such as append-only additions being possible and RBAC (role based access control).

With the above in place, answering questions like "Why were these 7 orders cancelled on Tuesday" or "Did the agent refund bypassing a critic" become easy to answer without relying on guesswork.

Bounded Autonomy: Human-in-the-Loop

Autonomy should come with bounds. In fact, the most practical way to use agents is to define the envelope where the agent can move fast, but also demarcate precisely where human input is necessary.

A support automation flow is a good example, one might allow an agent to issue refunds up to $200 automatically depending on order state being clearly legible and risk of fraud being low. When evidence is unclear or the threshold has gone past an administrator defined-level, the agent routes a recommendation to the human and lets the human take the final call. The agent is still doing a lot of useful work, gathering evidence, summarizing context and even drafting actions. The cognitive load on the human should be low, otherwise frequent escalations will make the humans make decisions via exhaustion.

Tools & Actions: Automation meets Reality

Tools are where agents take action, all the thought and planning has culminated to this. That’s why tool design becomes paramount–well thought out plans can succumb to good old-fashioned bugs in traditional tools that promote determinism. (Arguably, that’s not fully true with the emergence of the Agents-as-a-tool paradigm).

CVE‑2025‑49596 should act as a cautionary tale in the AI agent world. MCP (Model Context Protocol) inspector is a developer tool for debugging MCP servers. In certain vulnerable setups, the inspector exposed a proxy on the developer’s machine on all interfaces without authentication. Now, a malicious website could send requests from the browser to that local port which would then be treated as genuine MCP commands. Remote Code Execution was now fully possible.

The developer only had to load a web page. No clicks or prompts. Agent tooling is not purely developer experience, it's also a security boundary.

Tool Capabilities

Whenever a tool is exposed to an agent, relentlessly question its scope. The good thing about this section is that traditional software engineering principles allow us to reason about (at least a portion of them, agents can invent their own tools in real-time, too) tools like REST apis.

Is this the official tool with verified provenance? What’s the maximal blast radius on making these tools calls, is it one region, one tenant or your entire application? What access permissions and what credentials does it need and for how long?

Better design decisions will be arrived at by answering more of these questions.

Ephemeral, Task‑Scoped Credentials

Instead of agents having access to long-lived personal credentials, employ a token-broker system that issues short-lived, narrowly scoped credentials tied to specific missions. Whenever a planner wants to open a pull request in a particular repository, it asks a credential service for a specific one-time token for its identity with requisite permissions that expire within a minute. If this token leaks, it's useless as it has long expired beyond the tiny time window. This is the same strategy that has worked in native cloud computing setups, now it's time to apply them to AI agents.

Structured Tool Outputs & Fewer Tools

AI agents perform worse with a large array of tools that perform different tasks at differing granularities. They need fewer tools with well defined tool outputs.

For a messaging system, instead of a generic "slack" tool, there should be a focused tool adapter with a singular post_message operation that takes a narrow set of channels, a "safe text" type and list of vetted attachment IDs. The tool implementation/adapter can enforce additional URL allow-listing, running PII detection and only accept attachments that have been uploaded via an approved flow with an attachmentId that can be verified. Any errors from "slack" using the post_message tool are turned into structured codes and typed results that the planner and critic can understand, instead of opaque gigantic json that floods context windows.

Make tool outputs that cover large surface areas such as APIs return small, typed contracts that are easier to test and reason about. Tokens are currency that mimic cognitive load for AI agents.

Sandboxed Tools

Finally, constraining all agent actions that generate artifacts which are runnable is paramount. "Run some python to convert this CSV to Parquet" is a simple prompt, but could mean the agent can create and run untrusted code.

A safer design treats agent-generated code as something that should be executable only within an isolated micro-VM or a heavily constrained container with no outbound network access. Setup a read-only base filesystem with an ephemeral /tmp volume that’s writable, employ strict syscall filtering, tight CPU and memory limits with a custom seccomp profile. This agent runtime also enforces hard-wall clock timeout with gating only specific "rooting" paths.

If the agent goes off the rails, it's okay to crash its own sandbox as it's fully isolated. However, that prevents any further damage.

Threat‑Modeling the Loop with STRIDE and MAESTRO

So far we’ve talked about chilling stories through the "agentic" loop and patterns that can help combat them. However, to apply them consistently, threats need to be modeled across the whole loop and even across multiple teams.

STRIDE is the classic security mnemonic: Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service and Elevation of Privilege. It clearly defines how threats show up.

MAESTRO from Cloud Security Alliance, is a seven layer reference model for agentic AI systems that tell us where across which layers a threat shows up.

The table below showcases how you could walk through the agentic loop with STRIDE.

Loop Stage STRIDE Threats Maestro lens Controls to Apply
Context Management Tampering, Spoofing Context corruption Provenance with RAG & memory; anomaly detection with an LLM judge (possibly a panel of them);
Reasoning and Planning Information Disclosure, Repudiation LLM Alignment posture Separate Planner and Critic; explicit plan; risk scoring; auditable trajectories
Tools and Actions DoS, Elevation of Privilege Tool Misuse & Replay Typed tool adapters; task‑scoped, short‑lived credentials; micro‑VM or seccomp‑hardened sandboxes for code; strict rate limits; structured error codes instead of raw logs.

Put together, STRIDE tells us how attacks manifest; MAESTRO tells us where in the "agentic" stack to pay attention.

Threat modeling the "agentic" loop starts with drawing the agents’ ReAct loop, mapping STRIDE threats to each stage and layer of the agentic stack. Listing the controls that are already present and the ones that are aren’t. The result of this exercise is not formal proof that all threat vectors are covered, but it's a lot better than asking the agent to be security conscious in its system prompt. Security is a mindset and constant red-teaming should not be an afterthought.

Bringing Trust Back to Autonomous Agents

Autonomous agents are like powertools. Without guardrails, it's inevitable that they will delete something important, leak sensitive information or even optimize goals that were never given to them. Real productivity gains can only be achieved when they’re moved out of the "toy" stage.

The path forward is to take the agentic loop as seriously as your cloud-native architecture. Starting small with actionable steps like adding full-blown agent trajectory tracing and human review before irreversible high risk actions takes one in the right direction. Testing the agentic loop from a skeptical mindset will also unveil what is treated like a black box today can be split into components that limit blast radius.

Trustworthy productivity is not blind faith that "AI will behave", it's the confidence that when it doesn't, we can catch it early, contain the damage and recover quickly.

About the Author

BT