<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Your Enterprise Architect]]></title><description><![CDATA[Your Enterprise Architect]]></description><link>https://yourenterprisearchitect.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 07 Apr 2026 19:54:23 GMT</lastBuildDate><atom:link href="https://yourenterprisearchitect.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Beyond the Context Window: Implementing CoALA for State-Aware Enterprise Agents]]></title><description><![CDATA[1. Summary
The Problem: Statelessness in Enterprise Workflows Large Language Models (LLMs) are powerful reasoning engines, but their utility in business environments is significantly constrained by their inherent statelessness. In a standard deployme...]]></description><link>https://yourenterprisearchitect.com/beyond-the-context-window-implementing-coala-for-state-aware-enterprise-agents</link><guid isPermaLink="true">https://yourenterprisearchitect.com/beyond-the-context-window-implementing-coala-for-state-aware-enterprise-agents</guid><category><![CDATA[Episodic Memories]]></category><category><![CDATA[#Cognitive-AI]]></category><category><![CDATA[agentic AI]]></category><dc:creator><![CDATA[Ruben Rotteveel]]></dc:creator><pubDate>Wed, 10 Dec 2025 16:22:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765383458600/b021daa2-40b6-40c5-8c4e-486193633aef.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-1-summary">1. Summary</h3>
<p><strong>The Problem: Statelessness in Enterprise Workflows</strong> Large Language Models (LLMs) are powerful reasoning engines, but their utility in business environments is significantly constrained by their inherent statelessness. In a standard deployment, an agent resets after every session, retaining no context of user preferences, specific project history, or previously corrected errors. This lack of persistent state forces users to redundantly provide context and correct the same mistakes across multiple interactions, capping the efficiency gains that agents can provide.</p>
<p><strong>The Goal</strong> My objective was to evolve the agent from a transient session-based tool into a persistent <strong>Institutional Asset</strong>—a system that retains operational context, learns from user feedback, and improves its baseline performance over time.</p>
<p><strong>The Solution</strong> To achieve this, we implemented an architecture based on the <a target="_blank" href="https://arxiv.org/pdf/2309.02427"><strong>CoALA</strong></a> <strong>(Cognitive Architectures for Language Agents)</strong> framework. By engineering a background memory processor (the "Hippocampus") and a dynamic context injection layer, we created a system that persists experience and democratizes institutional knowledge (a learning agent). This article details the technical implementation of that architecture.</p>
<hr />
<h3 id="heading-2-the-constraint-managing-the-context-window">2. The Constraint: Managing the Context Window</h3>
<p>Communication with an agent is governed by the <strong>Context Window</strong>, which effectively functions as the agent's working memory. This window is finite, measurable in tokens, and determines the scope of data available for immediate reasoning.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765382172167/8c01c309-2b6f-4d08-bf80-c20305e64e52.png" alt="working memory, the context window" class="image--center mx-auto" /></p>
<p>Fig. 1: The agent context window.</p>
<p>Standard implementations often utilize a "sliding window" approach to manage this limit, where the oldest messages are discarded as new ones arrive. In a complex workflow, this leads to <strong>Contextual Drift</strong>—the loss of critical instructions or project constraints established early in the session.</p>
<p><strong>The Fix: Context Engineering via Summary Injection</strong> To mitigate data loss, we rejected the sliding window in favor of a structured <strong>Context Stack</strong> that prioritizes information relevance over recency:</p>
<ul>
<li><p><strong>The System Layer:</strong> Contains immutable behavioral instructions and guardrails.</p>
</li>
<li><p><strong>The Summary Layer:</strong> Rather than discarding history, the system compresses prior turns into a high-density "Summary Message." This preserves the global context of the session without consuming the token budget of raw logs.</p>
</li>
<li><p><strong>The Active Thread:</strong> The most recent interactions are retained in high fidelity to facilitate immediate reasoning.</p>
</li>
</ul>
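<p>The layered stack can be sketched in a few lines of Python. This is an illustrative model only; the class name, the <code>summarize</code> callable, and the turn limit are assumptions, not the production implementation:</p>

```python
from dataclasses import dataclass, field

@dataclass
class ContextStack:
    """Assembles the prompt from three layers: system, summary, active thread."""
    system: str                                  # immutable instructions and guardrails
    summary: str = ""                            # compressed history of earlier turns
    active: list = field(default_factory=list)   # recent turns, kept verbatim
    max_active_turns: int = 6                    # fidelity window before compression

    def add_turn(self, role: str, text: str, summarize) -> None:
        self.active.append((role, text))
        # Instead of discarding the oldest turn (sliding window), fold it
        # into the summary layer so the global context survives.
        while len(self.active) > self.max_active_turns:
            old_role, old_text = self.active.pop(0)
            self.summary = summarize(self.summary, f"{old_role}: {old_text}")

    def render(self) -> list:
        msgs = [{"role": "system", "content": self.system}]
        if self.summary:
            msgs.append({"role": "system",
                         "content": f"Summary of earlier turns: {self.summary}"})
        msgs += [{"role": r, "content": t} for r, t in self.active]
        return msgs
```

<p>In production the <code>summarize</code> callable would be an LLM call that compresses a turn into the high-density summary message; here it can be any text compressor.</p>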
<p>This shift allowed us to move from a linear log of text to a cyclic learning system.</p>
<h3 id="heading-3-the-process-asynchronous-episode-extraction">3. The Process: Asynchronous Episode Extraction</h3>
<p>Preserving the <em>current</em> session is only the first step. The critical challenge is capturing lessons from <em>past</em> sessions. To achieve this, we developed an asynchronous background process, internally referred to as the <strong>Hippocampus</strong>, that executes when a conversation reaches a natural pause.</p>
<p>This process transforms unstructured chat logs into structured <strong>Episodes</strong>. However, a simple log dump is too noisy. To extract meaningful signal, we apply a strict segmentation logic based on our "Swarm" architecture.</p>
<p><strong>Step 1: Segmentation (Defining Boundaries)</strong> Our system utilizes a swarm of specialist agents, each equipped with specific tools. We use these natural architectural divisions to slice the conversation stream:</p>
<ul>
<li><p><strong>Conversation Boundaries:</strong> Defined by <strong>Agent Changes</strong>. When the routing layer switches from the "Coder Agent" to the "Billing Agent," the current Conversation object is closed and a new one begins.</p>
</li>
<li><p><strong>Topic Boundaries:</strong> Defined by <strong>Entity Changes</strong>. Within a conversation, if the user pivots from "Project Alpha" to "Project Beta," we detect this entity shift (via spaCy) and create a new Topic node.</p>
</li>
</ul>
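<p>A minimal sketch of this two-level segmentation, with entity extraction abstracted behind a callable (the article uses spaCy for entity detection; everything else here is illustrative):</p>

```python
def segment(messages, extract_entities):
    """Slice a message stream into Conversations (agent changes) and
    Topics (entity shifts), per the boundary rules above.

    messages: iterable of dicts with 'agent' and 'text' keys.
    extract_entities: callable text -> set of entity strings (e.g. spaCy NER).
    """
    conversations = []
    prev_agent = object()   # sentinel that never equals a real agent
    prev_entities = set()
    for msg in messages:
        entities = extract_entities(msg["text"])
        new_conversation = msg["agent"] != prev_agent
        # A topic boundary is an entity shift within a conversation.
        new_topic = new_conversation or (entities and entities != prev_entities)
        if new_conversation:
            conversations.append({"agent": msg["agent"], "topics": []})
        if new_topic:
            conversations[-1]["topics"].append({"entities": entities, "messages": []})
        conversations[-1]["topics"][-1]["messages"].append(msg["text"])
        prev_agent = msg["agent"]
        if entities:
            prev_entities = entities
    return conversations
```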
<p><strong>Step 2: Classification &amp; Validation</strong> Once boundaries are established, we extract granular metadata for every message, including <strong>Domain</strong>, <strong>Operation</strong>, and <strong>Speech Act</strong> (Question, Command, Comment).</p>
<p>Finally, we employ a reasoning model to validate the extracted segments against strict rules. The primary acceptance criterion for an Episode is that <strong>a set of actions must culminate in a tangible result.</strong> If a segment is just "chatter" without an outcome, it is discarded. If the validation fails (e.g., the result is ambiguous), the model receives feedback and retries the extraction.</p>
<p>This rigorous filtering ensures that our Vector Database is populated only with high-value operational patterns, rather than noise.</p>
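<p>The validate-and-retry loop can be sketched as follows. The <code>extract</code> and <code>validate</code> callables stand in for the LLM extraction and the reasoning-model check; all names and signatures are hypothetical:</p>

```python
def extract_episode(segment, extract, validate, max_retries=3):
    """Run extraction, validate the result, and feed failures back
    into the next attempt.

    extract:  (segment, feedback) -> candidate episode dict, or None
    validate: episode -> (ok: bool, feedback: str)
    Returns a validated episode, or None for result-less "chatter".
    """
    feedback = None
    for _ in range(max_retries):
        episode = extract(segment, feedback)
        if episode is None:            # no tangible result: discard as chatter
            return None
        ok, feedback = validate(episode)
        if ok:
            return episode             # accepted -> write to the vector DB
    return None                        # still ambiguous after retries: discard
```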
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765382797062/3593740f-c7a8-4ec3-ac18-4a2c57efcd16.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-4-the-feedback-loop-episodic-injection">4. The Feedback Loop: Episodic Injection</h3>
<p>The existence of stored memory is insufficient; the agent must be architected to retrieve and apply it contextually.</p>
<p>We evaluated several retrieval strategies. <strong>Chain of Thought (CoT)</strong> prompting often proved too rigid; if a retrieved memory did not perfectly align with the current scenario, the agent would hallucinate constraints or fail to adapt.</p>
<p><strong>The Solution: The "Memory Message"</strong> We implemented a pattern of <strong>Episodic Injection</strong>.</p>
<ol>
<li><p><strong>Retrieval:</strong> When a user initiates a prompt, the system queries the vector database for semantically similar past Episodes.</p>
</li>
<li><p><strong>Injection:</strong> Relevant Episodes are formatted into a dedicated <strong>Memory Message</strong> and injected into the Context Stack <em>prior</em> to the user's new prompt.</p>
</li>
</ol>
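<p>A sketch of the injection step, assuming a <code>search_episodes</code> vector-store query function (a hypothetical interface; the formatting of the Memory Message is also illustrative):</p>

```python
def build_prompt(system_msg, history, user_prompt, search_episodes, k=3):
    """Inject retrieved Episodes as a dedicated Memory Message, placed
    in the stack before the user's new prompt.

    search_episodes: callable query text -> list of episode strings,
    ordered by semantic similarity.
    """
    episodes = search_episodes(user_prompt)[:k]
    messages = [{"role": "system", "content": system_msg}, *history]
    if episodes:
        memory = "\n".join(f"- {e}" for e in episodes)
        messages.append({"role": "system",
                         "content": f"Relevant past episodes:\n{memory}"})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```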
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765383236283/cd4c2112-acca-474d-b493-2ad296a5e2d2.png" alt class="image--center mx-auto" /></p>
<p><strong>Operational Example:</strong></p>
<ul>
<li><p><strong>User Request:</strong> "Generate a SQL query for the User Analytics table."</p>
</li>
<li><p><strong>Memory Injection:</strong> "Observation: In a previous session regarding 'User Analytics', the user rejected a query for lacking index hints. Result: Negative."</p>
</li>
<li><p><strong>Agent Action:</strong> The agent preemptively includes index hints in the generated SQL, avoiding the previous error.</p>
</li>
</ul>
<h3 id="heading-5-the-system-architecture">5. The System Architecture</h3>
<p>By integrating these components, the architecture transitions from a linear input-output model to a cyclic cognitive system.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765382273655/8331a38a-bad6-410c-84b7-93bc38a7b69a.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><strong>The Context Window</strong> manages immediate execution.</p>
</li>
<li><p><strong>The Hippocampus</strong> manages consolidation (writing experience).</p>
</li>
<li><p><strong>The Injection Layer</strong> manages retrieval (reading experience).</p>
</li>
</ul>
<p>This creates a closed loop: <strong>Sense → Reason → Act → Learn.</strong></p>
<p>The business value of this architecture extends beyond efficiency. It effectively captures <strong>Tacit Knowledge</strong>, the unwritten, experiential knowledge of senior staff, and structures it. This allows junior team members to benefit from the accumulated experience of the organization automatically, as the agent retrieves "senior" strategies to guide "junior" requests.</p>
<h3 id="heading-6-future-development">6. Future Development</h3>
<p>This implementation establishes the foundation for a state-aware agent. Our roadmap focuses on three advanced capabilities:</p>
<ol>
<li><p><strong>Topic-Based Retrieval:</strong> Moving beyond semantic similarity in individual prompts to analyzing the broader <em>topic</em> or <em>domain</em> of a conversation, enabling the retrieval of strategic context rather than just tactical corrections.</p>
</li>
<li><p><strong>Innovation via Negation:</strong> Rather than prescribing a specific path, we aim to retrieve "Failure Modes" to define a boundary of negative constraints. This allows the agent to innovate within the solution space while strictly avoiding known pitfalls.</p>
</li>
<li><p><strong>Synthetic Best Practices:</strong> Pre-loading the vector database with "Synthetic Memories" derived from corporate documentation and policy. This would provide a newly deployed agent with a baseline of institutional competence immediately upon activation.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Real World Agentic Solutions: Turning Microservices into an AI Workforce.]]></title><description><![CDATA[I don’t remember the last time I had so much fun building something. The process felt less like writing software and more like assembling a cognitive entity — giving it personality, skills, memory, and the ability to learn and evolve.
At times it fel...]]></description><link>https://yourenterprisearchitect.com/real-world-agentic-solutions-turning-microservices-into-an-ai-workforce</link><guid isPermaLink="true">https://yourenterprisearchitect.com/real-world-agentic-solutions-turning-microservices-into-an-ai-workforce</guid><category><![CDATA[agentic AI]]></category><category><![CDATA[Multi-Agent Systems (MAS)]]></category><category><![CDATA[Swarm]]></category><dc:creator><![CDATA[Ruben Rotteveel]]></dc:creator><pubDate>Tue, 16 Sep 2025 16:08:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757978485083/09d6e340-6786-403f-9822-925816bcf5d0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I don’t remember the last time I had so much fun building something. The process felt less like writing software and more like <strong>assembling a cognitive entity</strong> — giving it personality, skills, memory, and the ability to learn and evolve.</p>
<p>At times it felt like <em>Frankenstein</em> — but instead of a monster, I was creating a team: the best Business Analysts, Product Owners, Project Managers, and Resource Managers, rolled into agents that never sleep, always learn, and are eager to help. It was magical, and honestly, humbling.</p>
<p>But then came the real challenge: <strong>how do you move beyond proofs of concept and demos, and integrate a multi-agent solution into a real production platform?</strong></p>
<p>This article is the <strong>overview</strong> of how I approached that challenge in building <strong>Mezzoic’s Agent helper</strong>. It introduces the architecture I used, the problems I ran into, and the principles I found essential.</p>
<p>In this series, I dive deeper into each of those problem areas:</p>
<ul>
<li><p><a target="_blank" href="https://yourenterprisearchitect.com/context-is-everything-managing-tokens-memory-and-prompts-for-multi-agent-systems"><em>Context is Everything</em></a> → how prompts, memory, and tokens define agent reliability.</p>
</li>
<li><p><a target="_blank" href="https://yourenterprisearchitect.com/from-apis-to-agent-tools-designing-for-multi-agent-systems"><em>From APIs to Agent Tools</em></a> → how MCP servers transform APIs into safe, intent-driven tools.</p>
</li>
<li><p><a target="_blank" href="https://yourenterprisearchitect.com/agent-architecture-security-and-trust"><em>Security &amp; Trust</em></a> → how to align agent security with your existing backend policies.</p>
</li>
</ul>
<p>Together, these pieces are about one thing: turning multi-agent systems from fragile experiments into <strong>a real domain workforce inside your architecture</strong>.</p>
<h2 id="heading-challenge-keep-the-architecture-simple">Challenge: Keep the Architecture Simple</h2>
<p>The magic of multi-agent systems is easy to capture in a demo. The hard part is making them work inside a real production system.</p>
<p>With <strong>Mezzoic</strong>, I started from a simple idea: if we already have a solid domain architecture with <strong>domain services</strong>, why not extend them into <strong>domain agents</strong>? Services represent bounded areas of expertise in the product: accounting, quoting, scheduling. Agents can be thought of as the experts that live inside those domains, capable of reasoning and acting just like the people on a project team.</p>
<p>That shift, from <strong>domain services → domain agents → domain workforce,</strong> became the foundation of the design.</p>
<p>But here’s the key: I didn’t want to rip up a working system just to experiment with agents. I wanted agents to act as <strong>extensions, not modifications</strong>, of the core. This is the idea of applying the <strong>Open/Closed Principle</strong> from SOLID design to your architecture:</p>
<ul>
<li><p>Keep the platform closed to modification (stable, proven, secure).</p>
</li>
<li><p>Keep it open to extension (agents can plug in on top).</p>
</li>
</ul>
<p>This is my golden rule of <strong>don’t break what’s already working</strong>.</p>
<p>The way to achieve that in practice is by <strong>wrapping existing APIs with MCP servers</strong>. MCP provides a protocol, much like REST, but tailored for agents. It turns APIs into <strong>discoverable, intent-driven tools</strong> that agents can actually use.</p>
<p>👉 I dive deeper into this in the article <a target="_blank" href="https://yourenterprisearchitect.com/from-apis-to-agent-tools-designing-for-multi-agent-systems"><em>From APIs to Agent Tools</em></a>, but at the overview level the principle is simple: <strong>Don’t redesign your backend for agents. Extend it with MCP.</strong></p>
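<p>To make the "wrap, don't redesign" idea concrete, here is a generic sketch of an existing API endpoint wrapped as a discoverable, intent-driven tool. This is not the MCP SDK itself; the class, the quote endpoint, and all names are invented for illustration:</p>

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentTool:
    """Wraps an existing API endpoint as a discoverable, intent-driven tool.
    (Generic sketch; a real MCP server would expose this over the protocol.)"""
    name: str
    description: str      # the intent the LLM reads when picking a tool
    input_schema: dict    # JSON-schema-style argument description
    handler: Callable     # the untouched backend API call

    def invoke(self, **kwargs):
        # Validation and authorization stay in the API layer;
        # the tool only adapts the shape for agent consumption.
        return self.handler(**kwargs)

def existing_quote_api(customer_id: str, amount: float) -> dict:
    # Stand-in for an existing domain-service endpoint (hypothetical).
    return {"customer": customer_id, "quote": round(amount * 1.21, 2)}

create_quote = AgentTool(
    name="create_quote",
    description="Create a quote for a customer, returning the total incl. VAT.",
    input_schema={"customer_id": "string", "amount": "number"},
    handler=existing_quote_api,
)
```

<p>The design choice this illustrates: the backend function is untouched, and the agent-facing layer only adds a name, an intent description, and a schema, keeping the platform closed to modification but open to extension.</p>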
<h3 id="heading-architecture-overview">Architecture Overview</h3>
<p>Mezzoic's multi-agent solution uses the <strong>agent swarm model</strong> for orchestration. Instead of a single supervisor agent delegating tasks, agents collaborate more fluidly, each equipped with the tools it needs to step in and act. This makes the system more resilient, but it also means tool design must be clear to avoid collisions. Each agent is a domain expert; in the Mezzoic context, that means there are dedicated Project Manager, Product Owner, Business Analyst, and People Manager agents that you interact with.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757528836887/63d126ed-4704-4276-bc40-981acd1e9a53.png" alt class="image--center mx-auto" /></p>
<p>The color coding differentiates what's new from what already exists: blue marks the new agent-focused components, purple marks the MCP server extensions to your backend, and yellow marks the existing systems you're leveraging.</p>
<ul>
<li><p>One key design choice: agent layer = <strong>extension, not modification</strong>.</p>
</li>
<li><p>Separation of concerns:</p>
<ul>
<li><p>Agents → reasoning + orchestration</p>
</li>
<li><p>APIs → business logic + enforcement</p>
</li>
<li><p>MCP → bridge between them</p>
</li>
</ul>
</li>
</ul>
<p>👉 You can read more about this in my in-depth article, <a target="_blank" href="https://yourenterprisearchitect.com/from-apis-to-agent-tools-designing-for-multi-agent-systems"><em>From APIs to Agent Tools</em></a>.</p>
<h2 id="heading-challenge-understanding-and-managing-context">Challenge: Understanding and Managing Context</h2>
<p><strong>Context is everything.</strong><br />The biggest challenge by far is context; it is the new skill we have to master to make agents and GenAI work. Context is both the power and the Achilles' heel of agents. Get it right and your users will love the agent; get it wrong and they'll get frustrated pretty quickly.</p>
<p>Context is everything the model receives: instructions, tools, history, and responses. It defines the agent's personality, skills, and reliability.</p>
<p>The challenges I faced were:</p>
<ul>
<li><p>Running up against the token limit: LLMs cap the context window, and when the data you pull is content-heavy, the context fills up very quickly.</p>
</li>
<li><p>Producing effective prompts and descriptions</p>
</li>
<li><p>Creating easy-to-use tools that don't require elaborate, hard-to-follow instructions.</p>
</li>
<li><p>Managing context so I give the agent <em>enough</em> context to be helpful and consistent.</p>
</li>
</ul>
<p>The solution was to treat context as a <strong>first-class architectural concern</strong>. It wasn’t an afterthought; it became the core design problem.</p>
<p>What worked:</p>
<ul>
<li><p><strong>Summarization &amp; focus</strong> → keep only what’s relevant, collapse old threads into summaries.</p>
</li>
<li><p><strong>Specialized agents</strong> → each agent carries only the prompts and tools it needs, minimizing clutter.</p>
</li>
<li><p><strong>Semantic &amp; episodic memory</strong> → facts (preferences, roles) and experiences (what worked last time) stored outside the prompt, retrieved when needed.</p>
</li>
<li><p><strong>Async/event-driven updates</strong> → keep the agent focused without drowning it in stale data.</p>
</li>
<li><p><strong>Iterating on prompts and tool descriptions</strong> until the LLM understood them and could successfully accomplish our use cases.</p>
</li>
</ul>
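<p>The summarization-and-focus tactic can be sketched as a token-budget check that folds the oldest messages into a summary. The 4-characters-per-token heuristic and all names here are assumptions, not the production code:</p>

```python
def fit_to_budget(messages, token_budget, summarize, count_tokens=None):
    """Collapse the oldest messages into a summary until the thread fits
    the model's context window (rough token heuristic; illustrative only).

    summarize: callable (current_summary, text) -> new summary string.
    """
    count = count_tokens or (lambda text: max(1, len(text) // 4))
    def total(msgs):
        return sum(count(m["content"]) for m in msgs)

    summary = ""
    msgs = list(messages)
    # Fold from the front: oldest turns lose fidelity first.
    while len(msgs) > 1 and total(msgs) > token_budget:
        oldest = msgs.pop(0)
        summary = summarize(summary, oldest["content"])
    if summary:
        msgs.insert(0, {"role": "system", "content": f"Summary: {summary}"})
    return msgs
```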
<p>👉 <strong>Read more in</strong> <a target="_blank" href="https://yourenterprisearchitect.com/context-is-everything-managing-tokens-memory-and-prompts-for-multi-agent-systems"><strong>Deep Dive #2: Context Is Everything</strong></a></p>
<h2 id="heading-challenge-security-amp-trust">Challenge: Security &amp; Trust</h2>
<blockquote>
<p><em>The concern with adding agents is that they might bypass existing controls — giving users unintended powers or introducing a shadow security model.</em></p>
</blockquote>
<p><strong>Solution:</strong> leverage the existing API security model.</p>
<p>In Mezzoic, agents are simply <strong>extensions of existing domain services</strong>. That meant I could lean on the security model I already trusted: the APIs already implement Mezzoic’s governance and security requirements, and agent tools just wrap those APIs and can’t circumvent them.</p>
<p>Agents don’t need special powers; they need to <strong>respect the same policies as the rest of the system</strong>.</p>
<p>What worked:</p>
<ul>
<li><p><strong>OAuth OBO</strong> → agents act strictly on behalf of users.</p>
</li>
<li><p><strong>Narrow scopes &amp; short-lived tokens</strong> → tools stay limited and time-boxed.</p>
</li>
<li><p><strong>Egress allow-lists</strong> → agents only reach approved MCP servers.</p>
</li>
<li><p><strong>Auditability</strong> → every call is logged: who did what, when, via which tool.</p>
</li>
</ul>
<p>👉 <strong>Read more in</strong> <a target="_blank" href="https://yourenterprisearchitect.com/agent-architecture-security-and-trust"><strong>Deep Dive #3: Security &amp; Trust</strong></a></p>
<hr />
<h2 id="heading-closing">Closing</h2>
<p>Extending Mezzoic with agents taught me three big lessons:</p>
<ol>
<li><p><strong>Keep the architecture simple (KISS)</strong> → agents should extend the system, not complicate or replace it.</p>
</li>
<li><p><strong>Context is everything</strong> → the reliability and personality of agents depend more on context management than on the model itself.</p>
</li>
<li><p><strong>Security &amp; trust must be inherited</strong> → agents don’t need a new security model; they need to respect the one you already trust.</p>
</li>
</ol>
<p>Those principles turned Mezzoic’s agents from fragile demos into a <strong>domain workforce</strong>: specialists that collaborate, stay within policy, and scale with the platform.</p>
<p>This article gave the overview. The real detail lives in the deep dives:</p>
<ul>
<li><p><a target="_blank" href="https://yourenterprisearchitect.com/context-is-everything-managing-tokens-memory-and-prompts-for-multi-agent-systems"><strong>Context is Everything</strong></a> → tokens, memory, prompts.</p>
</li>
<li><p><a target="_blank" href="https://yourenterprisearchitect.com/from-apis-to-agent-tools-designing-for-multi-agent-systems"><strong>From APIs to Agent Tools</strong></a> → MCP servers, marshal-by-reference, tool patterns.</p>
</li>
<li><p><a target="_blank" href="https://yourenterprisearchitect.com/agent-architecture-security-and-trust"><strong>Security &amp; Trust</strong></a> → aligning agents with existing governance.</p>
</li>
</ul>
<p>Together, these form a blueprint for moving beyond proofs of concept and into <strong>production-ready multi-agent systems</strong>.</p>
]]></content:encoded></item><item><title><![CDATA[Agent Architecture: Security & Trust]]></title><description><![CDATA[Everyone I’ve spoken with about agents asks the same thing: “What about security?”
The concern isn’t just technical, it’s governance. If an agent makes a mistake, who’s accountable? If it accesses data, which policies apply?
In this article, I share ...]]></description><link>https://yourenterprisearchitect.com/agent-architecture-security-and-trust</link><guid isPermaLink="true">https://yourenterprisearchitect.com/agent-architecture-security-and-trust</guid><category><![CDATA[agentic security]]></category><category><![CDATA[agent guardrails]]></category><category><![CDATA[agent trust]]></category><category><![CDATA[mcp OAuth2.1]]></category><category><![CDATA[agentic AI]]></category><dc:creator><![CDATA[Ruben Rotteveel]]></dc:creator><pubDate>Tue, 16 Sep 2025 13:14:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757971610066/dcf7a4db-51eb-43a3-9be9-b0fa87400cf3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Everyone I’ve spoken with about agents asks the same thing: <em>“What about security?”</em></p>
<p>The concern isn’t just technical, it’s governance. If an agent makes a mistake, who’s accountable? If it accesses data, which policies apply?</p>
<p>In this article, I share the principles I use to answer those questions and align agent security with enterprise trust models. The context is Mezzoic, the product and project management platform I built, but the lessons apply to any organization exploring multi-agent systems.</p>
<h2 id="heading-challenge-trust">Challenge: Trust</h2>
<p>The biggest question in Security &amp; Trust is <strong>accountability</strong>: who is responsible for the actions taken by (or on behalf of) the user?</p>
<p>My opinion is clear: <strong>accountability lies with the user and the application developers, not the agent.</strong> Agents are too new, too naive, to be trusted with unsupervised accountability. They lack mature governance models, and “guardrails” today are more marketing promise than operational reality. There are simply too many loopholes to plug with brittle, complex rules.</p>
<p>That’s why I avoid over-engineered guardrails altogether. Instead, I focus on three principles:</p>
<ol>
<li><p><strong>Propagate the user’s context</strong> all the way from the agent to the API layer (OAuth OBO).</p>
</li>
<li><p><strong>Require Human-in-the-Loop (HITL) confirmation</strong> for any risky action.</p>
</li>
<li><p>Use <strong>Rules engines and Workflows</strong> to determine if actions are allowed (could be process or risk based).</p>
</li>
</ol>
<h3 id="heading-guardrails-and-human-in-the-loop">Guardrails and Human-in-the-Loop</h3>
<p>Guardrails ≠ governance. They may provide some protection, but they cannot ensure safety. These guardrails aren’t hard constraints; they’re soft, because they’re interpreted by the LLM, and the LLM is unconstrained and unpredictable. Human review, on the other hand, guarantees accountability where it matters.</p>
<ul>
<li><p><strong>Transparency over hidden controls</strong> → The agent should <em>show its reasoning</em> (“I plan to delete these 5 records because…”) and ask for confirmation.</p>
</li>
<li><p><strong>Effort vs. impact</strong> → Guardrails are costly and brittle; human review is clearer, cheaper, and more adaptable.</p>
</li>
<li><p><strong>Fail gracefully</strong> → If the agent isn’t sure, it should escalate — not guess.</p>
</li>
</ul>
<p>👉 <em>Use agents to accelerate safe work, not bypass human judgment in high-risk areas.</em></p>
<h3 id="heading-risk-hierarchy-where-human-in-the-loop-is-required">Risk Hierarchy (where Human-in-the-Loop is required)</h3>
<ul>
<li><p><strong>Information loss</strong> → Any delete operation (soft or hard) must require explicit confirmation.</p>
</li>
<li><p><strong>Information confusion</strong> → Creating or updating critical entities should always prompt user review.</p>
</li>
<li><p><strong>Financial transactions</strong> → Any movement of funds requires human approval. (<em>Specialized automated trading systems are the exception, as they rely on dedicated risk controls.</em>)</p>
</li>
</ul>
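<p>This hierarchy maps naturally onto a small lookup the orchestration layer can consult before executing a tool call. The operation names below are illustrative, not an exhaustive policy:</p>

```python
# Illustrative mapping from operation type to the risk class that
# makes Human-in-the-Loop confirmation mandatory.
HITL_REQUIRED = {
    "delete": "information loss",
    "create_critical": "information confusion",
    "update_critical": "information confusion",
    "transfer_funds": "financial transaction",
}

def requires_confirmation(operation: str) -> tuple:
    """Return (must_pause, reason): whether the operation must stop
    and wait for explicit human confirmation before executing."""
    reason = HITL_REQUIRED.get(operation)
    return (reason is not None, reason)
```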
<h2 id="heading-challenge-authentication-and-authorization">Challenge: Authentication and Authorization</h2>
<p>Where does authorization belong? <strong>Not in the agent.</strong> It belongs in the MCP servers or your APIs, which already enforce your organization’s policies. That way, agents can’t “make up” permissions — they only operate within the boundaries you already trust.</p>
<p>The MCP <a target="_blank" href="https://modelcontextprotocol.io/specification/2025-03-26/basic/authorization">specification</a> recommends using <strong>OAuth 2.1</strong> for authentication and authorization, particularly when exposing MCP servers. In practice, support for OAuth 2.1 is still maturing across identity providers, so many teams continue to rely on <strong>OAuth 2.0 flows (e.g., Authorization Code with PKCE, On-Behalf-Of)</strong> to achieve the same goals.</p>
<p>In practice, agents should never have their own ‘superpowers.’ They should always inherit the same identity, rules, and auditability as the human user. In Mezzoic <strong>users authenticate via OAuth 2.0 + PKCE</strong>, and downstream calls use an <strong>On Behalf Of-style exchange</strong> so every tool invocation carries the user’s identity and scopes. This keeps authorization centralized in the APIs while staying consistent with the MCP spec’s direction of travel.</p>
<h3 id="heading-why-on-behalf-of-matters">Why On Behalf Of Matters</h3>
<p>In Mezzoic, agents never act as their own identity. Instead, they always act <strong>On Behalf Of (OBO)</strong> the user. In practice, that means:</p>
<ul>
<li><p><strong>Every API call is user-scoped.</strong> The downstream token carries the user’s identity and entitlements.</p>
</li>
<li><p><strong>Policies remain intact.</strong> Existing RBAC/ABAC rules apply just as if the user had called directly.</p>
</li>
<li><p><strong>Audit trails stay clear.</strong> Logs show which user did what, even if the agent initiated the call.</p>
</li>
</ul>
<p>This preserves accountability: if an agent triggers a workflow, it’s still the user’s token authorizing it.</p>
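<p>As a sketch, the downstream exchange can be expressed as an OAuth 2.0 Token Exchange (RFC 8693) request body, where the user's incoming token becomes the <code>subject_token</code>. Azure AD's OBO flow uses a slightly different grant type; parameter names here follow RFC 8693, and all values are placeholders:</p>

```python
from urllib.parse import urlencode

def obo_exchange_request(user_token: str, client_id: str,
                         client_secret: str, scope: str) -> dict:
    """Build an RFC 8693 token-exchange request body: trade the user's
    incoming token for a narrowly scoped downstream token, so every
    tool invocation still carries the user's identity and entitlements."""
    return {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": user_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,  # keep scopes narrow; tokens stay short-lived
    }
```

<p>The resulting dict would be form-encoded (e.g. with <code>urlencode</code>) and POSTed to the identity provider's token endpoint.</p>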
<h3 id="heading-why-not-other-approaches">Why not other approaches?</h3>
<ul>
<li><p><strong>Static API keys (even per-user):</strong> Technically possible, but long-lived keys are harder to rotate, scope, and audit.</p>
</li>
<li><p><strong>Agent-owned service accounts:</strong> Creates “shadow identities” with broad privileges, eroding accountability.</p>
</li>
<li><p><strong>User IDs in payloads:</strong> Easy to spoof and bypasses real authorization. Shouldn’t be relied on.</p>
</li>
</ul>
<p>👉 By contrast, <strong>OAuth with OBO</strong> reuses your IdP, MFA, and conditional access policies. Tokens are short-lived, scoped, and centrally governed.</p>
<hr />
<h2 id="heading-challenge-data-security-amp-memory">Challenge: Data Security &amp; Memory</h2>
<p>Memory creates new risks because by default it has <strong>no natural boundaries</strong>. An agent could potentially “remember” across tenants, users, or projects, but should it?</p>
<p>This is not a technical detail to leave implicit. <strong>Data isolation must be a deliberate product decision.</strong> Teams need to code explicit constraints so memory aligns with governance requirements:</p>
<ul>
<li><p><strong>User scope:</strong> What information is strictly personal to a single user?</p>
</li>
<li><p><strong>Team/Org scope:</strong> What knowledge can be shared safely across a team or department?</p>
</li>
<li><p><strong>Tenant scope:</strong> What data must <em>never</em> cross boundaries between customers?</p>
</li>
</ul>
<p>Without these rules, “helpful” memory can quickly become a <strong>security liability</strong>, an agent recalling sensitive details from the wrong context.</p>
<h3 id="heading-what-this-means-in-practice">What this means in practice</h3>
<ul>
<li><p><strong>Retention policies:</strong> Decide how long facts or episodes should persist, and when to expire or archive them.</p>
</li>
<li><p><strong>Indexing strategy:</strong> Memory should be tagged with clear ownership (user ID, team ID, tenant ID) so lookups respect boundaries.</p>
</li>
<li><p><strong>Privacy by default:</strong> If in doubt, agents should forget or re-request rather than risk leaking data across contexts.</p>
</li>
<li><p><strong>Auditability:</strong> Memory writes and reads should be logged just like API calls. You want to know <em>who stored what, and who retrieved it later</em>.</p>
</li>
</ul>
<p>👉 Memory can make agents feel smart, but without scoped design it can also make them dangerous. The safe approach is to <strong>treat memory like a database with permissions</strong>.</p>
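<p>A minimal sketch of that idea: every record carries ownership tags, and reads filter by scope. Retention and real audit logging are omitted, and all names are illustrative:</p>

```python
class ScopedMemory:
    """Memory treated like a database with permissions: every record is
    tagged with tenant/team/user ownership, and reads respect boundaries."""

    def __init__(self):
        self._records = []
        self._audit = []   # who stored what, who retrieved it later

    def write(self, tenant, user, text, team=None):
        self._records.append({"tenant": tenant, "team": team,
                              "user": user, "text": text})
        self._audit.append(("write", user, text))

    def read(self, tenant, user, team=None):
        self._audit.append(("read", user, None))
        return [r["text"] for r in self._records
                if r["tenant"] == tenant                    # hard tenant boundary
                and (r["user"] == user                      # personal scope
                     or (team is not None and r["team"] == team))]  # team scope
```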
<h2 id="heading-challenge-autonomous-agents">Challenge: Autonomous Agents</h2>
<p>Autonomous agents are powerful but risky when their actions have financial, legal, or safety implications. Treat agents like privileged humans: enforce least privilege, require explicit approvals, and maintain a full audit trail. Put a policy-decision point (rules engine) in front of every high-impact tool, and drive execution through a workflow engine that can require approvals, cap the blast radius, and record decisions. If the rules engine flags risk above a threshold, escalate to a human, to a second agent for consensus, or invoke a two-person rule, depending on risk appetite and use case.</p>
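<p>A sketch of such a policy decision point, routing a proposed action by risk score. The thresholds and outcome names are illustrative and should be tuned to your risk appetite:</p>

```python
def route_action(action: str, risk_score: float, *,
                 hitl_threshold=0.5, consensus_threshold=0.3) -> str:
    """Policy-decision-point sketch placed in front of high-impact tools.
    Returns one of: 'execute', 'second_agent_review', 'human_approval'."""
    if risk_score >= hitl_threshold:
        return "human_approval"        # e.g. fund transfers, deletes
    if risk_score >= consensus_threshold:
        return "second_agent_review"   # consensus before execution
    return "execute"                   # low impact: proceed, but audit it
```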
<h2 id="heading-conclusion-questions-teams-should-ask-before-deploying-agents">Conclusion: <strong>Questions Teams Should Ask Before Deploying Agents</strong></h2>
<ul>
<li><p><strong>Accountability:</strong> If an agent takes an action, whose identity and permissions does it use? Who is ultimately responsible for the outcome?</p>
</li>
<li><p><strong>Authorization:</strong> Are agents operating strictly within existing policies (RBAC/ABAC), or do they create shadow permissions?</p>
</li>
<li><p><strong>Human-in-the-Loop:</strong> What types of actions (e.g., deletes, fund transfers, sensitive updates) should always require human confirmation?</p>
</li>
<li><p><strong>Auditability:</strong> Can we trace every action back to a user, a token, and a timestamp? Are logs complete enough to satisfy compliance audits?</p>
</li>
<li><p><strong>Memory Boundaries:</strong> How do we prevent agents from “remembering” across tenants, teams, or projects where data should stay isolated?</p>
</li>
<li><p><strong>Token Management:</strong> Are tokens short-lived and scoped narrowly enough to minimize risk if leaked?</p>
</li>
<li><p><strong>Failure Modes:</strong> If the agent is uncertain, does it escalate gracefully, or does it guess and risk unintended consequences?</p>
</li>
<li><p><strong>Data Governance:</strong> What retention policies apply to agent memory and logs? Do they align with corporate or regulatory requirements (GDPR, HIPAA, etc.)?</p>
</li>
<li><p><strong>Third-Party Dependencies:</strong> If agents rely on external APIs/tools, how are those governed, and do they inherit your security model?</p>
</li>
<li><p><strong>End-User Awareness:</strong> Do users understand that agents act on their behalf — and do they know when confirmation will be required?</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Context Is Everything: Managing Tokens, Memory, and Prompts for Multi-Agent Systems]]></title><description><![CDATA[If I had to simplify the work of integrating agents into one phrase, it would be this:
Context is everything.
The models are powerful, the APIs are stable, but context determines whether your agent feels like a helpful collaborator or a confused chat...]]></description><link>https://yourenterprisearchitect.com/context-is-everything-managing-tokens-memory-and-prompts-for-multi-agent-systems</link><guid isPermaLink="true">https://yourenterprisearchitect.com/context-is-everything-managing-tokens-memory-and-prompts-for-multi-agent-systems</guid><category><![CDATA[context]]></category><category><![CDATA[context engineering]]></category><category><![CDATA[context-window]]></category><dc:creator><![CDATA[Ruben Rotteveel]]></dc:creator><pubDate>Mon, 15 Sep 2025 16:49:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757968233223/34691f0c-7f8a-4e85-9ca9-8e939a03b3bf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If I had to simplify the work of integrating agents into one phrase, it would be this:</p>
<p><strong>Context is everything.</strong></p>
<p>The models are powerful, the APIs are stable, but context determines whether your agent feels like a helpful collaborator or a confused chatbot.</p>
<hr />
<h2 id="heading-what-we-mean-by-context">What We Mean by “Context”</h2>
<p>For an LLM, <em>context</em> is everything it receives to generate a response:</p>
<ul>
<li><p>System messages (instructions)</p>
</li>
<li><p>Tool descriptions</p>
</li>
<li><p>Conversation history</p>
</li>
<li><p>Tool responses</p>
</li>
<li><p>Even its own “internal thoughts”</p>
</li>
</ul>
<p>How you manage this context defines:</p>
<ul>
<li><p>The <strong>personality</strong> of the agent</p>
</li>
<li><p>Its <strong>consistency and accuracy</strong></p>
</li>
<li><p>And ultimately, how well it <strong>delights your users</strong></p>
</li>
</ul>
<hr />
<h2 id="heading-why-context-matters">Why Context Matters</h2>
<p>The ability of an agent to deliver results and to feel trustworthy depends more on <strong>context</strong> than on the model itself.</p>
<p>Context is the foundation of the agent’s <strong>identity, personality, and responsibilities</strong>, and most importantly its ability to <strong>make the user happy:</strong></p>
<ul>
<li><p><strong>Personality &amp; Voice</strong> → Context defines who the agent <em>is</em>. Without consistent prompts and framing, tone drifts and the agent feels incoherent.</p>
</li>
<li><p><strong>Continuity of Thought</strong> → LLMs don’t think persistently; they simulate reasoning turn by turn. Context is the bridge that makes an agent appear continuous instead of starting over each time.</p>
</li>
<li><p><strong>Shared Understanding</strong> → Context carries the conversation state: which tools were used, what the goal is, what’s already been decided. Without it, the user has to re-explain, which kills trust.</p>
</li>
<li><p><strong>Boundaries of Expertise</strong> → By constraining instructions and tools, context defines what an agent <em>should</em> and <em>shouldn’t</em> attempt. That prevents overreach and hallucinations.</p>
</li>
<li><p><strong>User Experience Consistency</strong> → Users don’t judge the LLM; they judge whether the agent “gets them.” Context is what remembers preferences, adapts, and keeps interactions smooth.</p>
</li>
</ul>
<p>👉 Put simply: <strong>context isn’t just input. It’s the agent’s memory, personality, boundaries, and shared understanding with the user. Without it, the agent isn’t really an agent — it’s just a one-off prompt.</strong></p>
<hr />
<h2 id="heading-how-to-build-prompts-and-tool-descriptions">How to Build Prompts and Tool Descriptions</h2>
<p>There’s a lot of content in the wild on <strong>prompt engineering</strong> (here’s a good <a target="_blank" href="https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide">prompt guide</a>), but not nearly enough has been said about <strong>tool instructions and descriptions</strong>. In practice, I’ve found this to be one of the hardest areas to get right.</p>
<p>In the past, <strong>users read the manuals</strong>. They figured out workflows, experimented, and learned the system themselves.</p>
<p>Now, that job shifts to the <strong>LLM</strong>. The model must “read the manual” through your tool descriptions, understand what each tool does, and use them correctly. The level of detail required isn’t always obvious, which makes <strong>testing and iteration critical</strong>.</p>
<p><strong>Example:</strong> imagine a <strong>Quote Agent</strong> that helps create and edit customer quotes. The agent doesn’t learn workflows from a user guide — it depends entirely on tool descriptions like <code>quote.add_line_item</code> or <code>quote.set_delivery_date</code>. If those descriptions are vague, the agent stumbles.</p>
<h3 id="heading-what-this-means">What this means</h3>
<ul>
<li><p><strong>Documentation becomes prompts and tool descriptions.</strong></p>
</li>
<li><p>These descriptions should live in their <strong>own files or repositories</strong>, but remain <strong>tightly coupled</strong> to the tool they describe.</p>
</li>
<li><p>Treat them as <strong>living artifacts</strong> — versioned, reviewed, and tested like code.</p>
</li>
</ul>
<hr />
<h3 id="heading-common-challenges-with-tool-descriptions">Common Challenges with Tool Descriptions</h3>
<p><strong>1. Complex parameters</strong></p>
<ul>
<li><p>Problem: The LLM fails to generate valid inputs, forcing it to ask the user too many clarifying questions.</p>
</li>
<li><p>Solution: Keep tools and parameters simple. Create tools with <strong>clear, narrow responsibilities</strong>. (I dive deeper into this in the <a target="_blank" href="https://some.link.com">Agent Architecture article</a>.)</p>
</li>
</ul>
<p><strong>2. Overlapping responsibilities</strong></p>
<ul>
<li><p>Problem: Tools with similar or unclear scopes confuse the agent. It may pick the wrong one and fail to complete its task.</p>
</li>
<li><p>Solution: Define <strong>exclusive domains</strong> for tools. Be explicit about when and why each tool should be used.</p>
<p>  An <em>exclusive domain</em> means each tool has a <strong>clear, non-overlapping area of responsibility</strong>, so the agent doesn’t have to guess between multiple tools that “sort of” do the same thing.</p>
<p>  Think of it like <strong>team roles</strong>: if two employees both think they’re responsible for scheduling meetings, you’ll get duplication or confusion. Same with tools.</p>
<h3 id="heading-how-to-define-it"><strong>How to define it</strong></h3>
<p>  When designing tools, write down:</p>
</li>
<li><ul>
<li><p><strong>Purpose</strong> → What <em>specific outcome</em> this tool achieves.</p>
</li>
<li><p><strong>Scope</strong> → What <em>it doesn’t do</em>. If another tool covers that, this one must stay out.</p>
</li>
<li><p><strong>Trigger conditions</strong> → When the agent should call it (e.g., “use this tool when the user wants to create a new quote, not when editing an existing one”).</p>
</li>
</ul>
</li>
</ul>
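One way to make those three fields explicit and reviewable (the structure and field names here are my own convention, not a standard) is to attach them to every tool definition and render the description the LLM reads from them:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    purpose: str   # the specific outcome this tool achieves
    scope: str     # what it explicitly does NOT do
    trigger: str   # when the agent should call it

CREATE_QUOTE = ToolSpec(
    name="quote.create",
    purpose="Create a brand-new quote for a customer.",
    scope="Does not edit existing quotes; use the quote editing tools for that.",
    trigger="Use when the user wants a new quote, not when editing one.",
)

def render_description(spec: ToolSpec) -> str:
    """Flatten the spec into the description string the LLM actually reads."""
    return f"{spec.purpose} Scope: {spec.scope} When to use: {spec.trigger}"
```

Because the spec is data, it can be versioned, diffed in code review, and checked in tests alongside the tool it describes.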
<p><strong>3. Capability &amp; description drift</strong></p>
<ul>
<li><p>Problem: Descriptions and tool code often live in different places, owned by different people. When one changes without the other, the agent gets confused — outcomes don’t match the instructions.</p>
</li>
<li><p>Solution: Co-locate responsibility. <strong>Update code and descriptions together</strong>. Build processes that make drift visible (e.g., testing generated plans against descriptions).</p>
</li>
</ul>
<hr />
<p>👉 <strong>Pro tip:</strong> Treat prompts and tool descriptions as <strong>first-class code artifacts</strong> — with versioning, testing, and clear ownership. Don’t let them live as throwaway comments; they’re the interface between your system and your agent.</p>
<hr />
<h2 id="heading-the-challenge-of-token-limits">The Challenge of Token Limits</h2>
<p>Once prompts and tool descriptions are solid, the next constraint you’ll hit is the <strong>model’s token limit</strong>.</p>
<p>LLMs have hard context windows, so you can’t keep everything in context. Long prompts, long tool descriptions, the message history, data coming back from tools, files the agent has read, web search content it has found, and its internal thoughts are all part of the context. It fills up quickly.</p>
<p>Managing context is a significant challenge and has a big impact on the agent’s effectiveness.</p>
<h3 id="heading-strategies">Strategies</h3>
<ul>
<li><p><strong>FIFO truncation:</strong> Keep only the most recent messages. Simple, but loses long-term awareness.</p>
</li>
<li><p><strong>Topic-aware summarization:</strong> Summarize past exchanges by topic. When the user switches back, the summary brings the relevant history forward.</p>
</li>
<li><p><strong>Hybrid:</strong> Recent messages stay raw; older ones collapse into summaries.</p>
</li>
</ul>
<p>👉 The goal is <strong>focus</strong>: give the model exactly what it needs for the current task, nothing more.</p>
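A minimal sketch of the hybrid strategy (the summarizer is stubbed here; in practice it would be an LLM summarization call):

```python
def summarize(messages: list) -> str:
    # Stub: a real system would make an LLM call to summarize by topic.
    return f"[summary of {len(messages)} earlier messages]"

def build_context(history: list, keep_recent: int = 4) -> list:
    """Hybrid: recent messages stay raw; older ones collapse into one summary."""
    if len(history) <= keep_recent:
        return list(history)
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = [f"msg {i}" for i in range(10)]
context = build_context(history)
```

With ten messages and `keep_recent=4`, the model sees one summary line plus the four latest messages instead of all ten, and the trade-off (fidelity vs. token budget) is controlled by a single knob.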
<hr />
<h2 id="heading-semantic-and-episodic-memory">Semantic and Episodic Memory</h2>
<p>Context management doesn’t end at the prompt. Agents need <strong>memory</strong> to learn from experience and to feel reliable and personal.</p>
<p>⚠️ <strong>Note:</strong> Memory is not naturally limited to a single thread or session. It’s technically <strong>unbounded</strong> — you could design it to span users, teams, projects, or even tenants. That flexibility is powerful, but it also means you must <strong>enforce explicit boundaries and retention policies</strong> to stay within privacy, regulatory, and data sovereignty requirements.</p>
<h3 id="heading-semantic-memory-facts-amp-awareness">Semantic Memory (facts &amp; awareness)</h3>
<ul>
<li><p>Stores facts about users, teams, or projects.</p>
</li>
<li><p>Lets the agent adapt to preferences, roles, and organizational context.</p>
</li>
<li><p>Can be scoped at <strong>user, team, or org</strong> levels.</p>
</li>
<li><p>Stored in vector databases (e.g., Elastic, Weaviate, Pinecone).</p>
</li>
</ul>
<p>Example:</p>
<blockquote>
<p><em>“This user prefers concise answers.”</em><br /><em>“Team A is working on Project X; Bob owns feature Y.”</em></p>
</blockquote>
<h3 id="heading-episodic-memory-experience-amp-learning">Episodic Memory (experience &amp; learning)</h3>
<ul>
<li><p>Captures experiences as <strong>episodes</strong>: subject + actions + results.</p>
</li>
<li><p>Lets the agent learn from past outcomes and apply them in similar situations.</p>
</li>
<li><p>Can be leveraged to <strong>convert a new user request into a few-shot prompt or chain-of-thought (CoT)</strong> reusing the successful steps from a prior interaction to accomplish a similar goal.</p>
</li>
<li><p>Built from conversation parsing, sentiment analysis, and conclusion markers (“that worked”, “let’s move on”).</p>
</li>
<li><p>Stored in vector stores for retrieval in future sessions.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757611717252/87a00f79-72a6-4b82-b29a-6d55545c5f11.png" alt /></p>
<p><em>Example:</em></p>
<blockquote>
<p>“Last time the user asked for a resource plan, the timeline format was off — next time, use Gantt style.”</p>
</blockquote>
<p>👉 This way, episodic memory isn’t just “remembering experiences” — it becomes <strong>structured training data for the agent’s next decision</strong>.</p>
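A sketch of that conversion (the episode structure and fields are illustrative): retrieve a prior successful episode and splice its steps into the new request as a worked example.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    goal: str
    steps: list      # actions that led to the result
    outcome: str     # conclusion marker, e.g. "that worked"

def to_few_shot(episode: Episode, new_request: str) -> str:
    """Turn a prior successful episode into a few-shot block for the new task."""
    worked_example = "\n".join(f"  {i + 1}. {s}" for i, s in enumerate(episode.steps))
    return (
        f"A similar request ('{episode.goal}') previously succeeded with:\n"
        f"{worked_example}\n"
        f"Apply the same approach to: {new_request}"
    )

episode = Episode(
    goal="create a resource plan",
    steps=["gather team availability", "draft timeline as a Gantt chart", "review with user"],
    outcome="that worked",
)
prompt = to_few_shot(episode, "plan resources for Project X")
```

The retrieval step (finding the closest episode in the vector store) is omitted; the point is the final shape: yesterday’s successful steps become today’s few-shot scaffold.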
<hr />
<h2 id="heading-retrieval-augmented-generation-rag">Retrieval-Augmented Generation (RAG)</h2>
<p>RAG extends memory beyond what fits in the prompt:</p>
<ul>
<li><p>Index your knowledge base into a vector store.</p>
</li>
<li><p>Let the agent retrieve relevant chunks on demand.</p>
</li>
<li><p>Feed those chunks into the context window.</p>
</li>
</ul>
<p>This lets agents answer with <strong>organizational knowledge</strong> instead of hallucinations.</p>
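The three steps above can be sketched end to end. This toy retriever scores by term overlap purely for illustration; a real system would use embeddings and a vector store such as Elasticsearch.

```python
def score(query: str, doc: str) -> float:
    """Toy relevance score: term overlap. Real systems use vector similarity."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Return the k most relevant chunks for the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str, corpus: list) -> str:
    """Feed retrieved chunks into the context window ahead of the question."""
    chunks = retrieve(query, corpus)
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Quotes require an approved delivery date before submission.",
    "The cafeteria menu changes every Monday.",
    "Delivery dates are set by the logistics team.",
]
prompt = build_prompt("When can the delivery date be set on a quote?", corpus)
```

Only the relevant organizational knowledge reaches the model; the cafeteria chunk never enters the context window.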
<p>👉 At Mezzoic, ElasticSearch has proven to be the most stable and scalable option, but the space is evolving quickly.</p>
<hr />
<h2 id="heading-mezzoics-approach-to-context">Mezzoic’s Approach to Context</h2>
<p>In Mezzoic, context management is <strong>built into the workflow orchestrator and MCP client layer</strong>, not bolted on afterward.</p>
<ul>
<li><p><strong>Asynchronous, event-driven updates</strong> prevent the model from being overloaded with stale or irrelevant info.</p>
</li>
<li><p><strong>LangChain, LangGraph, and LangMem</strong> provide the tooling, but we customize them heavily. They’re powerful, but they come with performance costs.</p>
</li>
<li><p><strong>Prompt and tool descriptions</strong> are versioned and tested alongside code, ensuring consistency.</p>
</li>
</ul>
<p>Result:<br />Agents that stay <strong>focused, adaptive, and consistent</strong> without drowning in irrelevant context.</p>
<hr />
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ul>
<li><p><strong>Context is the main design problem.</strong> Treat it as first-class architecture.</p>
</li>
<li><p><strong>Documentation becomes prompts.</strong> Maintain them like code.</p>
</li>
<li><p><strong>Token limits force discipline.</strong> Use summarization, focus, and hybrid strategies.</p>
</li>
<li><p><strong>Memory makes agents human-like.</strong> Semantic = facts; episodic = experience.</p>
</li>
<li><p><strong>RAG extends knowledge.</strong> It’s the way to scale beyond what fits in context.</p>
</li>
<li><p><strong>Build it into the platform.</strong> Don’t duct-tape context management after the fact.</p>
</li>
</ul>
<hr />
<h2 id="heading-closing">Closing</h2>
<p>LLMs aren’t limited by their intelligence — they’re limited by their context.<br />Get context right, and everything else becomes easier.</p>
<p>👉 Next in this series: <strong>Security &amp; Trust</strong> — how to keep agents safe, scoped, and governed with the same policies as your web apps.</p>
]]></content:encoded></item><item><title><![CDATA[From APIs to Agent Tools: Designing for Multi-Agent Systems]]></title><description><![CDATA[Quick Overview
In this article, I explore key principles of agent and tool design — and how they integrate with your backend in ways that can make or break your project.
We’ll look at the problem areas that matter most:

Why multi-agent systems → whe...]]></description><link>https://yourenterprisearchitect.com/from-apis-to-agent-tools-designing-for-multi-agent-systems</link><guid isPermaLink="true">https://yourenterprisearchitect.com/from-apis-to-agent-tools-designing-for-multi-agent-systems</guid><category><![CDATA[agentic AI]]></category><category><![CDATA[#agent]]></category><category><![CDATA[agents]]></category><category><![CDATA[Swarm]]></category><category><![CDATA[Multi-Agent Systems (MAS)]]></category><category><![CDATA[multi-agent]]></category><dc:creator><![CDATA[Ruben Rotteveel]]></dc:creator><pubDate>Mon, 15 Sep 2025 16:26:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757968081008/6949f1d9-158f-4fb7-9077-514ad63529fd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-quick-overview">Quick Overview</h2>
<p>In this article, I explore key principles of <strong>agent and tool design</strong> — and how they integrate with your backend in ways that can make or break your project.</p>
<p>We’ll look at the problem areas that matter most:</p>
<ul>
<li><p><strong>Why multi-agent systems</strong> → when and why specialization helps.</p>
</li>
<li><p><strong>Organizing agent tools and responsibilities</strong> → minimizing context switches and enabling success.</p>
</li>
<li><p><strong>MCP design concepts</strong> → using Model Context Protocol to wrap APIs safely.</p>
</li>
<li><p><strong>Integration patterns</strong> → how to connect agents to your APIs without breaking what already works.</p>
</li>
</ul>
<p>The goal: a clear set of principles you can apply to design agents that are <strong>effective, safe, and production-ready</strong>.</p>
<h2 id="heading-1-why-multi-agent-systems">1. Why Multi-Agent Systems</h2>
<p>One agent with every tool sounds simple. In practice, it quickly collapses under complexity. Multi-agent systems solve this by:</p>
<ul>
<li><p><strong>Specialization</strong> → Agents are scoped to a domain (e.g., Business Analyst, Project Manager, Resource Manager). Narrow context = better focus.</p>
</li>
<li><p><strong>Reduced context switching</strong> → Each agent carries fewer tools and instructions. Smaller prompt = more consistent behavior.</p>
</li>
<li><p><strong>Collaboration</strong> → Agents hand off results or share context where needed.</p>
</li>
</ul>
<p>Think of it as <strong>organizational design for AI</strong>: teams succeed when roles are clear, responsibilities aligned, and overlap limited to backup coverage.</p>
<h3 id="heading-orchestration-supervisor-vs-swarm">Orchestration: Supervisor vs. Swarm</h3>
<p>Multi-agent systems don’t just need specialized roles — they need a way to decide who acts when. Two common orchestration models are:</p>
<ul>
<li><p><strong>Supervisor pattern</strong> → A central coordinator (a “manager agent”) delegates tasks to domain agents and integrates results. Great for hierarchical workflows and strict governance.</p>
</li>
<li><p><strong>Swarm pattern</strong> → Each agent is aware of some or all of the other agents and their expertise. When a task falls outside its bounds, it hands the task off to the agent whose expertise fits best. That specialist takes over the conversation, and you engage with it until it reaches its own limits, at which point it hands off to another specialist, and so on.</p>
</li>
</ul>
<p>Mezzoic uses a swarm approach. This means:</p>
<ul>
<li><p>Agents are given overlapping tool access to minimize context switching.</p>
</li>
<li><p>Context and workflow orchestration live in the platform, not in a single “boss agent.”</p>
</li>
</ul>
<p>👉 The trade-off: swarms require more careful tool design (clear descriptions, scoped responsibilities, overlapping coverage) so agents don’t collide or wander.</p>
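A toy sketch of swarm-style handoff (keyword routing stands in for the LLM’s own judgment; the agent names echo the roles above, the rest is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    expertise: set                       # topics this specialist owns
    peers: list = field(default_factory=list)

    def handle(self, task: str) -> str:
        topic = task.split(":", 1)[0]
        if topic in self.expertise:
            return f"{self.name} handled '{task}'"
        # Out of bounds: hand off to the peer whose expertise matches.
        for peer in self.peers:
            if topic in peer.expertise:
                return peer.handle(task)
        return f"{self.name} escalated '{task}'"

analyst = Agent("BusinessAnalyst", {"requirements"})
pm = Agent("ProjectManager", {"planning"})
analyst.peers, pm.peers = [pm], [analyst]

result = analyst.handle("planning: draft the Q3 roadmap")
```

Note that there is no central coordinator: the routing knowledge lives in each agent’s awareness of its peers, which is exactly why the tool and expertise descriptions have to be sharp.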
<h2 id="heading-2-organizing-agent-tools-and-responsibilities">2. Organizing Agent Tools and Responsibilities</h2>
<p>Tools are the bridge between your backend and your agents. Organizing them well is critical.</p>
<p><strong>Principles:</strong></p>
<ul>
<li><p><strong>Start from use cases</strong> → Map what the agent must accomplish end-to-end, then derive tools.</p>
</li>
<li><p><strong>Minimize context switches</strong> → Prefer one agent completing a flow over bouncing between agents.</p>
</li>
<li><p><strong>Specialize, but don’t over-partition</strong> → Give agents enough overlap to complete work instead of stalling.</p>
</li>
</ul>
<p><strong>Exclusive domains:</strong><br />Each tool should have a <strong>clear, non-overlapping responsibility</strong>. Ambiguity (two tools that can both edit a quote) leads to failure.</p>
<p>How to define:</p>
<ul>
<li><p><strong>Purpose</strong> → specific outcome it achieves.</p>
</li>
<li><p><strong>Scope</strong> → what it <em>doesn’t</em> do.</p>
</li>
<li><p><strong>Trigger conditions</strong> → when the agent should call it.</p>
</li>
</ul>
<p>👉 <em>Treat tools like team roles: overlap creates confusion, clarity enables success.</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757534752208/f147c01f-28ba-4d6a-8cbb-7bdcecc11c5a.png" alt class="image--center mx-auto" /></p>
<p>Agent to tool mapping diagram:</p>
<p>These agents have a primary set of tools that enable their roles, but they also share access to common utilities, the user’s details, security modules, and any tangentially related tools they may need.</p>
<p>These tools are thin wrappers around mcp_clients, which in turn are proxies to mcp_servers.</p>
<hr />
<h2 id="heading-3-mcp-design-concepts">3. MCP Design Concepts</h2>
<p>What is an MCP server? MCP servers are a standard way to provide tools to your agents, just as APIs are a standard way to expose your core business logic. In a microservices architecture, there’s typically another layer above your microservices: a specialized API wrapper that implements a set of use cases on top of the microservices but is dedicated to one application. These Backends for Frontends (BFFs) are the closest analogue to MCP servers: an MCP server is, in effect, a BFF for your agent, wrapping your microservices in a dedicated tool interface.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757682610384/eb68a2fb-3dc9-409a-ad42-95f0127b20b2.png" alt class="image--center mx-auto" /></p>
<p><strong>MCP servers:</strong></p>
<ul>
<li><p>Wrap existing APIs in a standard interface agents can query at runtime.</p>
</li>
<li><p>Expose available actions (<code>quote.add_line_item</code>, <code>quote.set_delivery_date</code>) instead of generic endpoints.</p>
</li>
<li><p>Provide consistent descriptions that become the agent’s “manual.”</p>
</li>
</ul>
<h3 id="heading-marshal-by-value-vs-marshal-by-reference">Marshal-by-Value vs Marshal-by-Reference</h3>
<p>Let’s briefly talk about marshal-by-value vs. marshal-by-reference. Agents have a hard time constructing complex objects: an LLM doesn’t edit a JSON object, it regenerates it on the fly, and more often than not gets it wrong. You can mitigate this with a rich set of instructions describing every field, but that’s a lot of work, adds to context, and the model will still get it wrong. Instead, we design tools with simple responsibilities and parameters that are easy to understand.</p>
<ul>
<li><p><strong>Traditional APIs (marshal-by-value):</strong></p>
<ul>
<li><p>Clients (like SPAs) hold a rich domain object (e.g., a Quote).</p>
</li>
<li><p>They apply business logic locally, then send the whole or partial object back (<code>PUT /quote {…}</code>).</p>
</li>
<li><p>Works great with deterministic clients — humans don’t forget required fields.</p>
</li>
</ul>
</li>
<li><p><strong>Agents (marshal-by-reference):</strong></p>
<ul>
<li><p>LLMs are bad at reconstructing complex objects. They often drop fields, mis-format, or overwrite.</p>
</li>
<li><p>Instead, agents do better when they reference an <strong>entity ID</strong> and apply <strong>small, intent-driven changes</strong>.</p>
</li>
<li><p>Example:</p>
<ul>
<li><p><code>quote.set_delivery_date(quote_id, date)</code></p>
</li>
<li><p><code>quote.add_line_item(quote_id, sku, qty, price)</code></p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>Why it matters:</p>
<ul>
<li><p>Smaller prompts → less chance of hallucination.</p>
</li>
<li><p>Narrow scope → fewer invalid states.</p>
</li>
<li><p>Server enforces business rules → no fragile prompt gymnastics.</p>
</li>
</ul>
<p>Yes, it’s <strong>two hops</strong> (fetch + patch), but in agent workflows the LLM’s “thinking time” dwarfs network latency. Reliability wins over raw performance.</p>
<h3 id="heading-tool-design-patterns-recap">Tool Design Patterns (recap)</h3>
<ul>
<li><p><strong>Intent verbs &gt; CRUD</strong> → tools reflect user goals, not table ops.</p>
</li>
<li><p><strong>Patch over PUT</strong> → safe partial updates instead of risky overwrites.</p>
</li>
<li><p><strong>Validation at the edge</strong> → business rules live in MCP, not the LLM.</p>
</li>
<li><p><strong>Idempotency</strong> → every tool call safe to retry.</p>
</li>
</ul>
<h3 id="heading-example">Example</h3>
<p>Instead of:</p>
<pre><code class="lang-json">PUT /quotes/<span class="hljs-number">123</span>
{
  <span class="hljs-attr">"id"</span>: <span class="hljs-string">"123"</span>,
  <span class="hljs-attr">"customer"</span>: { ... },
  <span class="hljs-attr">"lineItems"</span>: [ ... ],
  <span class="hljs-attr">"deliveryDate"</span>: <span class="hljs-string">"2025-09-12"</span>,
  ...
}
</code></pre>
<p>Expose tools like:</p>
<ul>
<li><p><code>quote.set_delivery_date(quote_id, date)</code></p>
</li>
<li><p><code>quote.add_line_item(quote_id, sku, qty, price)</code></p>
</li>
<li><p><code>quote.assign_owner(quote_id, user_id)</code></p>
</li>
</ul>
<p>Each one mutates a small part of the aggregate by reference. The MCP server handles the fetch → patch → persist cycle.</p>
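A minimal sketch of one such tool on the server side (the in-memory store and the validation rule are illustrative stand-ins for the real quote service): the agent passes only an ID and an intent, and the server owns the fetch → patch → persist cycle.

```python
from datetime import date

# Illustrative in-memory store standing in for the real quote service.
QUOTES = {"123": {"id": "123", "delivery_date": None, "line_items": []}}

def set_delivery_date(quote_id: str, delivery_date: str) -> dict:
    """Tool: quote.set_delivery_date -- the agent never sees the full object."""
    quote = QUOTES[quote_id]                      # fetch by reference
    parsed = date.fromisoformat(delivery_date)    # validation at the edge
    if parsed < date.today():
        raise ValueError("delivery date must be in the future")
    quote["delivery_date"] = delivery_date        # patch one field
    QUOTES[quote_id] = quote                      # persist
    return {"id": quote_id, "delivery_date": delivery_date}

result = set_delivery_date("123", "2030-09-12")
```

The call is also naturally idempotent: retrying it with the same arguments leaves the quote in the same state, which is exactly the property the recap above asks for.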
<h2 id="heading-4-integration-patterns-with-apis">4. Integration Patterns with APIs</h2>
<p>The second principle of SOLID — <strong>Open/Closed</strong> — matters here: extend, don’t modify. Keep your backend stable, add MCP Servers as extensions. Don’t break what works.</p>
<p><strong>Patterns:</strong></p>
<ul>
<li><p><strong>Backend-for-Frontend (BFF)</strong> → MCP server as the backend for your agent, wrapping API calls into intent-driven tools.</p>
</li>
<li><p><strong>Sidecar</strong> → MCP deployed alongside a service, tightly coupled to its API.</p>
</li>
<li><p><strong>Gateway</strong> → centralized MCP layer, routing requests to many services.</p>
</li>
</ul>
<p><strong>Security:</strong></p>
<ul>
<li><p>Agents act <em>on behalf of</em> users → OAuth OBO flow.</p>
</li>
<li><p>Tools inherit API-level RBAC/ABAC → no shadow permissions.</p>
</li>
<li><p>Narrow scopes → <code>tool:quote.read</code>, <code>tool:quote.create</code>.</p>
</li>
</ul>
<hr />
<h2 id="heading-5-trade-offs-and-reality-check">5. Trade-offs and Reality Check</h2>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Clearer prompts, fewer invalid states.</p>
</li>
<li><p>Agents behave more predictably.</p>
</li>
<li><p>Security and governance aligned with backend.</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>More tools to design and maintain.</p>
</li>
<li><p>Extra round trips (fetch + patch).</p>
</li>
<li><p>Need for concurrency control.</p>
</li>
</ul>
<p>But in practice: <strong>reliability &gt; raw RPS</strong>. Agents spend more time “thinking” than calling APIs. Safe, intent-driven tools are worth the overhead.</p>
<hr />
<h2 id="heading-6-practical-takeaways">6. Practical Takeaways</h2>
<ul>
<li><p>APIs don’t translate 1:1 into tools.</p>
</li>
<li><p>Wrap APIs into <strong>intent-driven MCP servers</strong>.</p>
</li>
<li><p>Organize tools with <strong>exclusive domains</strong> and clear responsibilities.</p>
</li>
<li><p>Treat prompts and tool descriptions as <strong>first-class code</strong> — versioned, tested, and owned.</p>
</li>
<li><p>Build agents as <strong>extensions</strong>, not modifications, of your backend.</p>
</li>
</ul>
<hr />
<h2 id="heading-closing">Closing</h2>
<p>APIs power your systems. But agents need something more: <strong>tools</strong> they can discover, understand, and use safely. MCP servers bridge that gap, transforming APIs into intent-driven capabilities that agents can wield reliably.</p>
<p>The takeaway: <strong>don’t let agents call APIs raw</strong>. Wrap them, describe them, test them — and you’ll move from fragile demos to production-ready multi-agent systems.</p>
]]></content:encoded></item></channel></rss>