Skip to content

AI · Agentic AI · Software Architecture · LLMs · Technology · Context Engineering · Prompt Engineering · Software Engineering

From Workflows to Agents: The Modern Architecture of Agentic AI

18 min readMichael Luo

The production agentic AI stack: tool calling, MCP, planner/executor patterns, multi-agent design, memory, sandboxing, approval gates, and eval loops.

For years, most people understood AI through the metaphor of a chatbot. You typed a question, the model replied, and the interaction ended.

That mental model is now outdated.

Modern AI is moving from passive text generation toward agentic systems: software architectures where large language models act as reasoning engines inside broader execution frameworks. These systems do not simply answer questions. They plan, call tools, inspect results, revise their approach, use memory, request approvals, and sometimes coordinate with other agents to complete complex work.

The most important shift is this:

The model is no longer the whole application. It is the reasoning layer inside a much larger operating system of tools, workflows, memory, protocols, sandboxes, and governance.

This is the real story of agentic AI. Not just smarter prompts. Not just better chat. A new software architecture is emerging.

1. AI Models Do Not Actually “Do” Things

Before we talk about agents, we need to correct a common misconception.

AI models do not directly execute actions.

They do not truly send emails, query databases, move money, deploy software, create Jira tickets, edit files, or click buttons by themselves. What they do is generate structured intent.

In modern AI systems, the model acts as the reasoning engine. The surrounding application, runtime, server, or sandbox acts as the execution environment. The model decides what should happen. The system around the model performs the action.

That is the foundation of the tool-use contract.

A user gives the system a goal. The model interprets the goal and emits a structured request, often in JSON. The execution layer receives that request, runs the relevant tool or function, captures the result, and returns it to the model. The model then decides whether to continue, correct course, ask for approval, or produce a final answer.

This repeated cycle is often called the agentic loop:

Perceive → Plan → Execute → Observe → Verify → Continue or Finish

This loop is what makes agentic AI feel different from traditional chat. The system is no longer producing one answer. It is operating through a sequence of decisions and actions.

A chatbot says:

“Here is how you could analyze those sales records.”

An agentic system can:

Read the files, write a script, run the analysis, identify anomalies, generate a chart, summarize the findings, and create a draft email for human approval.

That is not just a better prompt. That is a different architecture.

2. Agents vs Workflows

Not every AI system needs to be an agent. In fact, one of the biggest mistakes in modern AI engineering is using agents where a simple deterministic workflow would be safer, cheaper, and easier to maintain.

A workflow is a predefined sequence of steps. It is predictable. It follows rules. Given the same input, it should produce the same output. Workflows are excellent for regulated, repeatable, and safety-critical processes.

Examples include:

Validate form → Check policy → Route to approver → Send notification

or:

Receive payment request → Validate account → Run fraud check → Process payment

Workflows are reliable because they are constrained.

An agent, by contrast, is adaptive. It can interpret a goal, choose tools, reason through ambiguity, and decide the next step dynamically. Agents are useful when the path to the answer is not known in advance.

Examples include:

Investigate why a deployment failed

or:

Research competitors and produce a strategic briefing

or:

Refactor this codebase while preserving existing behavior

These tasks cannot always be reduced to a fixed path. The agent needs to inspect, decide, act, observe, and adjust.

The practical rule is simple:

Use workflows when the process is known. Use agents when the path must be discovered. Use both when the system needs adaptability inside controlled boundaries.

The best production systems are rarely pure agents. They are usually hybrid systems: deterministic workflows for structure and compliance, agents for reasoning and dynamic problem-solving.

3. Tool Calling and Function Calling

Tool calling is the mechanism that allows an AI system to interact with the outside world.

A tool could be anything the model is allowed to request:

Search the web
Read a file
Query a database
Create a calendar event
Run Python code
Open a GitHub issue
Send a Slack message
Call an internal API

The model does not execute the tool directly. It emits a structured request. The system executes it and returns the result.

Function calling is a specific implementation of this idea. In function calling, developers define available functions with names, parameters, and schemas. The model then returns structured arguments for one of those functions.

For example:

{
  "function": "search_customer_records",
  "arguments": {
    "customer_id": "CUST-12345",
    "include_recent_cases": true
  }
}

The application receives this request, validates it, runs the function, and returns the output to the model.

The distinction is useful:

Tool calling is the broader concept. Function calling is the structured API-level pattern. Agents are systems that repeatedly use these mechanisms to make progress toward a goal.

In a simple assistant, the model may call one function and stop. In an agentic system, the model may call many tools over multiple steps, using each result to decide the next action.

4. MCP: The Interoperability Layer for Agents

This is where MCP, or Model Context Protocol, becomes important.

As agents become more capable, they need access to more tools and more context: files, databases, APIs, code repositories, browsers, calendars, design tools, observability platforms, CRMs, cloud environments, and internal business systems.

Without a standard protocol, every AI application has to build custom integrations for every external system. That does not scale.

MCP solves this by creating a standard way for AI applications to connect to external tools and context sources. Instead of hardcoding every integration into the AI app, developers can expose capabilities through MCP servers. The AI client can then discover and use those capabilities through a consistent interface.

A simple way to frame it:

Function calling lets a model call a specific function. MCP standardizes how AI applications connect to many external tools, resources, and context providers.

For example, an MCP server might expose:

Read files from a local project
Search a codebase
Query a database
Access GitHub issues
Inspect logs
Fetch documentation
Interact with a design tool

This makes MCP especially relevant for agentic systems because agents need reliable access to the world around them. They need to know what tools exist, what inputs those tools accept, what resources are available, and what permissions apply.

MCP is not the agent itself. It is part of the connective tissue around the agent.

If the LLM is the reasoning engine, MCP is one way to standardize the agent’s access to tools and context.

5. Planner/Executor Patterns

As agentic systems become more complex, it becomes useful to separate planning from execution.

In a simple agent, the same model may decide the plan, call tools, inspect results, and produce the final answer. That can work for small tasks. But for larger tasks, it can become messy. The system may lose track of the goal, overuse tools, skip verification, or get distracted by irrelevant context.

The planner/executor pattern introduces clearer separation.

The planner decides the strategy:

What is the goal?
What steps are required?
Which tools may be needed?
What risks or constraints matter?
What should be verified?

The executor performs specific actions:

Call this API
Run this command
Read this file
Query this table
Apply this code change

A stronger architecture may also include a verifier:

Did the action succeed?
Did the output match the goal?
Did tests pass?
Was anything unsafe or unexpected?

This creates a more disciplined loop:

Planner → Executor → Verifier → Planner

The planner updates the strategy based on what the executor and verifier discover.

This pattern matters because production agents need control. We do not want a model blindly taking actions. We want a system that can reason, act, check, and revise.

In software engineering, this might look like:

Planner: identify the likely cause of a failing test
Executor: inspect files and run targeted commands
Verifier: confirm whether the test now passes
Planner: decide whether more changes are needed

In business operations, it might look like:

Planner: determine what information is needed for a customer escalation
Executor: gather CRM notes, support tickets, and account history
Verifier: check whether the evidence supports the recommendation
Planner: produce a response for human approval

The more serious the use case, the more valuable this separation becomes.

6. Single-Agent vs Multi-Agent Systems

A single-agent system uses one agent to interpret the goal, manage the task, call tools, and produce the result.

Single-agent systems are useful when the task is focused, the toolset is limited, and the relevant context can fit within one working memory. They are simpler to build, easier to debug, and often cheaper to run.

Examples include:

Summarize a document
Generate a report from one dataset
Fix a small bug
Answer a question using a few tools

But single-agent systems have limits. As the task grows, the context window can become overloaded. The model may see too much information, lose focus, select the wrong tool, or carry forward a bad assumption from an earlier step.

A multi-agent system breaks work across specialized agents.

For example:

Manager agent → coordinates the task
Research agent → gathers information
Coding agent → modifies implementation
Review agent → checks quality
Security agent → checks risk
Verifier agent → confirms outcome

This can improve focus because each agent receives a narrower context. A research agent does not need the full implementation history. A security reviewer does not need every brainstorming note. A coding agent does not need the full executive summary.

Multi-agent systems are useful when tasks are complex, parallel, or multi-domain.

But they are not free. They introduce coordination overhead. Agents may duplicate work, disagree, pass poor context to each other, or increase latency and cost.

So the design principle is:

Start with the simplest architecture that works. Move from workflow to single agent to multi-agent only when the task complexity justifies it.

Multi-agent architecture is powerful, but it is not automatically better.

7. Agent Memory and Context Engineering

A human worker does not operate only from the current sentence. They rely on memory: what they know, what they have seen before, what they are trying to achieve, what happened earlier, and what preferences or constraints matter.

Agents need memory too, but memory must be engineered carefully.

There are several useful types of agent memory.

Short-term memory is the immediate conversation or task context. It contains the current user request, recent tool results, and active instructions.

Working memory is the temporary scratchpad used during a task. This may include intermediate notes, partial calculations, task state, or a plan.

Long-term memory stores persistent information across sessions, such as user preferences, project facts, prior decisions, or known constraints.

Episodic memory stores previous task traces. For example, how a similar incident was resolved last time.

Semantic memory stores general knowledge or reusable concepts, often retrieved through RAG.

Procedural memory stores learned ways of doing things, such as preferred workflows, coding conventions, review checklists, or operational playbooks.

This is where context engineering becomes more important than simple prompt engineering.

Prompt engineering asks:

What should I tell the model?

Context engineering asks:

What should the model know right now?
What should be hidden?
What should be retrieved?
What should be summarized?
What should be stored outside the context window?
What should be isolated in another agent or sandbox?

This matters because the model’s context window is not infinite, and more context is not always better.

Too little context makes the agent ignorant. Too much context makes it distracted. Wrong context makes it dangerous. Stale context makes it misleading.

A production-grade agent needs deliberate context management: selecting relevant information, compressing long histories, writing durable state outside the prompt, and isolating heavy data from the model unless needed.

This is one of the biggest shifts in modern AI engineering:

Prompt engineering tells the model what to do. Context engineering designs the environment in which the model thinks.

8. Programmatic Tool Calling and Code Execution

One of the most important advances in agentic systems is programmatic tool calling, also known as code execution.

In a naive agent loop, the model calls tools one by one and pulls every intermediate result back into its context window. That quickly becomes inefficient.

Imagine an agent needs to analyze thousands of expense records. A weak design would retrieve all records, stuff them into the context window, and ask the model to reason through them. That is expensive, slow, and error-prone.

A better design lets the model write and run code.

The agent can generate a Python or TypeScript script that:

Loads the records
Filters irrelevant rows
Groups expenses by employee
Calculates totals
Detects policy breaches
Produces a final summary

Only the final result needs to return to the model.

This is powerful because code is better than language models at deterministic computation. The model should not manually reason over thousands of rows. It should write code, execute the code, and interpret the output.

This pattern makes agents more scalable because large data processing can happen outside the model’s context window.

The model provides judgment. The code provides computation. The sandbox provides safe execution.

9. Sandboxing: Safe Places for Agents to Work

If agents can write code, run commands, edit files, or interact with systems, they need safe execution environments.

A sandbox is an isolated workspace where an agent can operate without directly endangering production systems or the user’s machine.

A good sandbox may provide:

A temporary filesystem
A terminal
A package manager
A code editor
Test execution
Network restrictions
Secret isolation
Resource limits
Audit logs

This allows the agent to inspect code, run tests, make changes, and verify outcomes in a controlled environment.

Sandboxing matters because tool use increases risk. An agent with poorly constrained shell access could delete files, leak secrets, consume resources, or make unintended changes.

The goal is not to prevent agents from acting. The goal is to let them act within a bounded environment.

For software engineering agents, this is essential. The agent should be able to explore and test, but not freely mutate production infrastructure. It should work in a branch, a container, a microVM, or another isolated runtime before changes are reviewed and committed.

The principle is:

Give the agent enough freedom to solve the problem, but not enough freedom to create uncontrolled damage.

10. Human-in-the-Loop and Approval Gates

As agents become more autonomous, approvals become more important.

Not every action needs human approval. If an agent reformats a local draft or summarizes a document, automatic execution may be fine. But if the action is irreversible, regulated, expensive, public, or high-impact, the agent should not directly commit the change.

It should propose.

This is the principle of separating intent from execution.

The agent can generate a structured proposed action:

{
  "action": "send_customer_email",
  "recipient": "customer@example.com",
  "subject": "Update on your support case",
  "risk_level": "medium",
  "evidence": [
    "Case notes reviewed",
    "Refund policy checked",
    "Manager approval required"
  ]
}

Then the system pauses and routes the proposal to a human reviewer.

The reviewer can approve, reject, edit, or request more information.

This pattern is critical for:

Sending customer communications
Changing production systems
Approving financial transactions
Deleting or modifying important data
Granting access permissions
Making legal, compliance, or HR-sensitive decisions

Approval design can vary by risk level.

For low-risk actions, the system may use automatic approval. For medium-risk actions, it may require one human reviewer. For high-risk actions, it may require two-person approval. For mature workflows, it may use exception-only review. For regulated environments, it may require a full evidence pack and audit trail.

This is how organizations move from manual work to supervised autonomy.

The objective is not blind automation. It is controlled delegation.

11. Guardrails and Policy Enforcement

Human approval is one kind of guardrail, but production systems need more than that.

They also need automated policy enforcement.

A model may request an action that looks plausible but violates a rule. For example:

Access a file it should not read
Call an API outside its permission scope
Send data to an external service
Run a destructive shell command
Modify production configuration
Expose credentials in a response

Guardrails should sit around the model, not only inside the prompt.

That means the system should enforce:

Tool permissions
Role-based access control
Data boundaries
Rate limits
Allowed action types
Blocked commands
Approval thresholds
Logging and auditability

The model may reason, but the platform must govern.

A serious production agent should never rely only on “the model knows not to do that.” The system itself must prevent dangerous behavior.

This is especially important in enterprise environments, where AI agents may interact with customer data, financial systems, internal platforms, cloud infrastructure, and regulated workflows.

12. Idempotency and Verification

Agentic systems are still software systems. In fact, they are often distributed systems with probabilistic reasoning inside them.

That means traditional engineering discipline becomes more important, not less.

One essential concept is idempotency.

If an agent retries an action after a timeout, the system must not accidentally perform the same side effect twice. For example, it should not send the same customer email twice, create duplicate tickets, approve the same payment twice, or apply the same database update multiple times.

This is why approved actions should use idempotency keys.

Another essential concept is verification.

After an agent acts, the system should check whether the action actually worked.

Examples:

Did the file change correctly?
Did the tests pass?
Was the ticket created?
Did the database update affect the expected rows?
Was the email saved as a draft?
Did the deployment succeed?
Did monitoring show recovery?

A weak agent acts and assumes success.

A strong agent acts, observes, verifies, and corrects.

That is the difference between a demo and a production system.

13. Agent Evaluation

Evaluating agents is harder than evaluating simple chatbot responses.

A chatbot can be judged on the final answer. An agent must be judged on the full trajectory.

Did it understand the goal? Did it choose the right tools? Did it avoid unnecessary actions? Did it recover from errors? Did it verify the result? Did it follow safety rules? Did it stop at the right time? Did it ask for approval when needed?

This requires multiple types of evaluation.

Outcome evals check whether the final result was correct.

Trajectory evals check whether the agent followed a sensible path.

Safety evals check whether the agent avoided dangerous actions.

Regression evals check whether previously reliable capabilities still work after changes.

Capability evals test whether the agent can handle harder tasks.

Cost and latency evals check whether the agent solved the task efficiently.

A mature agent platform should capture real execution traces, review failures, convert failures into test cases, and use those tests to improve prompts, tools, policies, and orchestration logic.

This creates an agent improvement loop:

Run agent → Capture trace → Evaluate outcome and trajectory → Identify failure → Create regression test → Improve system → Run again

This is where agent development starts to look like modern software engineering.

You need observability. You need tests. You need review. You need versioning. You need feedback loops.

Agents do not become reliable by accident. They become reliable through systematic evaluation.

14. The Production Agent Stack

When you put all of this together, a production-grade agentic system is not just an LLM with a prompt.

It is a layered architecture.

User goal
↓
Agent orchestration layer
↓
Planner / executor / verifier pattern
↓
Tool calling and function calling
↓
MCP and integration layer
↓
Context engineering and memory
↓
Sandboxed execution environment
↓
Human approval and policy gates
↓
Deterministic workflows and system APIs
↓
Observability, evaluation, and audit trails

Each layer solves a different problem.

The LLM provides reasoning. Tools provide capability. Function calling provides structured action. MCP provides integration standardization. Memory provides continuity. Context engineering controls what the model sees. The planner/executor pattern structures action. Sandboxes reduce execution risk. Approvals preserve human control. Workflows provide reliability. Evals drive continuous improvement.

This is the real architecture of agentic AI.

15. The Future Is Not “Agents Everywhere”

The future of AI is not simply replacing every workflow with an autonomous agent.

That would be dangerous, expensive, and unnecessary.

The better future is agentic systems inside governed workflows.

In this future, deterministic systems still handle what must be predictable. Agents handle ambiguity, reasoning, investigation, synthesis, and dynamic decision-making. Humans remain involved where judgment, accountability, ethics, or high-impact approval is required.

The winning systems will not be the ones that give agents unlimited freedom.

They will be the ones that combine:

Autonomy with control
Reasoning with verification
Memory with context discipline
Tool use with permissions
Speed with approval gates
Adaptability with deterministic guardrails

This is why modern AI architecture is not just about choosing the best model.

It is about designing the system around the model.

The next generation of AI products will not feel like chatbots. They will feel like coordinated digital teams: able to reason, use tools, remember context, execute tasks, verify results, and collaborate with humans.

But the teams that succeed will be the ones that understand the deeper lesson:

Agentic AI is not a feature. It is an architecture.