Best practices for building and orchestrating Multi-Agent Systems?
Moving beyond chains: How to manage state, memory, and collaboration in agentic workflows with LangGraph and AutoGen.
Everyone's suddenly "going agentic." Instead of a single LLM call or a simple chain, you've got multiple stateful agents talking to each other, calling tools, planning, critiquing, and acting over time. Frameworks like LangGraph and AutoGen make it easier to wire this all up, but they don't tell you whether what you're building is smart engineering or an overcomplicated toy.
1. Before anything: Do you even need multiple agents?
Start with this uncomfortable question:
Do I need multiple agents, or do I just need one well-prompted model with tools?
Multi-agent systems add complexity: more model calls, more state to track, more failure modes.
You reach for multi-agent only when a single agent + tools starts to break down, for example:
- Clear specialization: Planner vs executor, generalist vs domain expert (code, legal-ish, data, etc.)
- Long-running workflows: Multi-step tasks that span minutes/hours, tasks that need retries, backtracking, and coordination
- Parallel work: Multiple subtasks across different tools or domains, aggregation and comparison of separate results
- Explicit structure and control: Graph/state machine instead of "let the LLM improvise a plan every time"
If your use case is just "answer questions over our docs," you don't need a multi-agent circus. You need solid RAG and maybe one agent.
2. Think in terms of orchestration patterns, not "vibes"
Most useful multi-agent systems fall into a few patterns. Name them and design deliberately.
2.1 Manager–Worker pattern
Pattern: One manager agent breaks a task into subtasks. One or more workers execute those subtasks (possibly specialized). Manager aggregates, checks, and returns a final result.
Use when: Tasks can be decomposed ("analyze, then implement, then summarize"). You want parallel workers (e.g., multiple retrieval strategies, multiple coding agents).
Gotchas: Manager can become a bottleneck. If the manager is dumb, you're just adding hops for no gain.
2.2 Router pattern
Pattern: A router agent decides which specialist to send the request to: "Docs Q&A" agent, "SQL / analytics" agent, "Code" agent, "Policy / compliance" agent.
Use when: Different capabilities are clearly separated. You want to route to the cheapest / smallest model or tool that can handle the task.
Gotchas: Router mistakes are expensive (wrong specialist = nonsense answer). Evaluate the router itself; don't assume it "just works."
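A minimal sketch of the router idea, framework-agnostic: the `call_llm` helper and the specialist handlers below are hypothetical placeholders, not any specific library's API.

```python
# Router pattern sketch. `call_llm` and the specialist handlers are
# hypothetical stand-ins for your own model client and agents.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def docs_qa(q: str) -> str: ...
def sql_analytics(q: str) -> str: ...
def code_agent(q: str) -> str: ...

SPECIALISTS = {"docs_qa": docs_qa, "sql_analytics": sql_analytics, "code": code_agent}

def route(user_request: str) -> str:
    # A small, cheap model picks exactly one specialist label.
    label = call_llm(
        f"Classify this request as one of {list(SPECIALISTS)}. "
        f"Answer with the label only.\nRequest: {user_request}"
    ).strip()
    # Router mistakes are expensive: fall back to the safest specialist
    # instead of sending the request to a random one.
    handler = SPECIALISTS.get(label, docs_qa)
    return handler(user_request)
```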
2.3 Critic / Reviewer pattern (a.k.a. Reflexion loop)
Pattern: A primary agent proposes an answer. A critic agent reviews it: checks for correctness, safety, formatting, hallucinations. Optionally sends it back for revision.
Use when: Quality matters more than latency/cost. Code generation, complex reasoning, compliance-sensitive outputs.
Gotchas: You've just doubled the number of model calls. If the critic is too similar to the primary model, you get correlated failures.
2.4 Tool Specialist pattern
Pattern: Agents specialized by tool + domain: "SQL agent" talks to the warehouse, "Docs agent" talks to RAG index, "Code agent" calls repos, CI, etc. A coordinator agent decides which tool agent to call next.
Use when: You have many tools / APIs and want sane separation. You need fine-grained control of how each tool is used.
Gotchas: Coordination loops can explode in length if you don't cap steps. Tool misuse gets harder to debug across agents.
3. State and memory: the real difference between "toy" and "system"
Multi-agent systems only become useful when they're stateful. You need to decide where state lives and who owns it.
3.1 Shared vs per-agent state
Per-agent state: Each agent maintains its own memory: recent messages, agent-specific notes / scratchpad, tool results relevant to its role. Good for clear separation of concerns and local reasoning.
Shared/global state: A central store (think "blackboard" or LangGraph shared state) holds: task metadata, intermediate results, global flags (status, errors, timeouts). Good for coordination and inspection, debugging and observability.
Real systems use both: agent-local memory for short-term reasoning, shared state for cross-agent context and orchestration.
3.2 Keep state structured, not just "more text"
Naive pattern: dump everything into a giant conversational history and hope.
Better pattern: Use structured state objects: task_id, status, subtasks[], artefacts[], errors[]. Agents write updates to this structured state, not just more prose.
Tools like LangGraph make this easier by: treating the system as a graph of nodes (agents, tools), passing around a typed state object, enforcing allowed transitions (like a state machine).
This gives you something that's debuggable and easier to reason about than endless chat logs.
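As a concrete sketch, the structured state could be a plain TypedDict (field names here are illustrative, not a framework requirement):

```python
from typing import TypedDict, Literal

class SubTask(TypedDict):
    id: str
    description: str
    status: Literal["pending", "in_progress", "done", "failed"]

class TaskState(TypedDict):
    task_id: str
    status: Literal["planning", "executing", "reviewing", "completed", "error"]
    subtasks: list[SubTask]
    artefacts: list[str]   # names/paths of intermediate outputs
    errors: list[str]

# Agents update specific fields instead of appending prose to a transcript,
# so every run can be inspected, diffed, and replayed.
```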
4. Use LangGraph, AutoGen, etc. for structure, not magic
Frameworks like LangGraph and AutoGen are useful, but only if you're clear what you're building.
4.1 LangGraph: graph + state machine mindset
LangGraph is good when you want:
- Agent orchestration as an explicit graph: nodes = agents/tools, edges = transitions
- Stateful workflows with loops, retries, timeouts
- Clear control over: max steps per run, which node can send control where, persistence / resuming of long tasks
Think: "agent orchestration patterns as code" instead of opaque magic.
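Here's a minimal LangGraph-style sketch of that mindset. API details vary by version, and the node functions are trivial placeholders for real agent calls:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    task: str
    result: str

def plan(state: State) -> dict:
    # Placeholder: a real node would call a model or tool here.
    return {"result": f"plan for: {state['task']}"}

def execute(state: State) -> dict:
    return {"result": state["result"] + " -> executed"}

graph = StateGraph(State)
graph.add_node("plan", plan)
graph.add_node("execute", execute)
graph.set_entry_point("plan")
graph.add_edge("plan", "execute")
graph.add_edge("execute", END)

app = graph.compile()
print(app.invoke({"task": "summarize churn", "result": ""}))
```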
4.2 AutoGen: multi-agent conversations
AutoGen is good when:
- You model your system as agents that talk to each other in structured dialogues
- You want patterns like: "user proxy" ↔ "assistant" ↔ "critic", multi-step cooperative problem solving
Just don't confuse "we wired up some agents in AutoGen" with "we have a robust system." You still need: state control, step limits, logging and metrics, guardrails.
Frameworks don't remove design work; they just reduce boilerplate.
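For contrast, a minimal AutoGen-style sketch of the user proxy ↔ assistant loop mentioned above, assuming the classic pyautogen API (adjust names and model config to your version):

```python
from autogen import AssistantAgent, UserProxyAgent

# Illustrative config; real projects typically pass a config_list with API keys.
llm_config = {"model": "gpt-4o-mini"}

assistant = AssistantAgent(name="assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",       # fully automated run
    code_execution_config=False,    # no local code execution
    max_consecutive_auto_reply=3,   # hard cap on back-and-forth
)

# A critic would be added as a third participant (e.g., via a group chat);
# this just shows the basic user-proxy <-> assistant exchange.
user_proxy.initiate_chat(assistant, message="Summarize our Q3 churn drivers.")
```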
5. Guardrails: cap the chaos before it hits production
Multi-agent = more ways to spin out of control. Put hard edges around it.
5.1 Step limits and timeouts
- Max steps per task: e.g., 10 graph hops, 5 message exchanges, 3 tool calls
- Global timeout: Hard cap on wall-clock duration
If the system hits a limit: return partial result + explanation, log it as a failure case for analysis.
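The enforcement logic can live in a thin wrapper around whatever executes a single hop. A sketch, where `run_one_step` is a hypothetical stand-in for one agent/graph step:

```python
import time

MAX_STEPS = 10       # e.g., graph hops
MAX_SECONDS = 60     # wall-clock budget

def run_one_step(state: dict) -> dict:
    """Hypothetical: advance the system by one agent/graph hop."""
    raise NotImplementedError

def run_with_limits(state: dict) -> dict:
    start = time.monotonic()
    for _ in range(MAX_STEPS):
        if time.monotonic() - start > MAX_SECONDS:
            state.setdefault("errors", []).append("global timeout hit")
            break
        state = run_one_step(state)
        if state.get("status") == "completed":
            return state
    # Limit hit: return a partial result and flag the run for analysis.
    state["status"] = "partial"
    return state
```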
5.2 Restricted transitions
Don't let every agent talk to every other agent arbitrarily. Use a graph or state machine: Manager → Worker, Worker → Critic, Critic → Manager. Explicitly disallow loops that don't make sense.
This is where LangGraph truly shines: you encode allowed paths instead of hoping the LLM behaves.
5.3 Tool and data access control
Each agent should have: a minimal tool set it can call, access only to the data it needs.
Don't give your general chat agent: direct SQL access to prod, full file system write access, permission to trigger sensitive flows.
Multi-agent safety starts with: who can call what, from where, and how often.
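One simple way to encode that is a per-agent allowlist checked before every tool call. A sketch with illustrative tool and agent names:

```python
# Illustrative tool registry and per-agent allowlist, checked on every call.
TOOL_REGISTRY = {
    "list_tables": lambda: ["customers", "subscriptions"],
    "run_readonly_sql": lambda sql: [],   # stub: would hit a read-only connection
}

AGENT_TOOLS = {
    "planner": {"list_tables"},
    "sql_agent": {"list_tables", "run_readonly_sql"},
    "explainer": set(),   # reads shared state only, no tools
}

def call_tool(agent: str, tool: str, **kwargs):
    if tool not in AGENT_TOOLS.get(agent, set()):
        raise PermissionError(f"{agent} may not call {tool}")
    return TOOL_REGISTRY[tool](**kwargs)
```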
6. Evaluation and observability: treat agents like microservices
If you don't measure multi-agent behavior, it will absolutely surprise you in production.
6.1 Log at the agent step level
For each step: agent_name, input_summary (or hashed), tools_called, tokens_in, tokens_out, latency_ms, state_diff (what changed in shared state), next_agent / next_node.
You should be able to replay: "For task XYZ, how did control flow across agents, and where did it go wrong?"
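A sketch of one step-level log record as a plain dataclass; the field names mirror the list above:

```python
from dataclasses import dataclass, field

@dataclass
class AgentStepLog:
    task_id: str
    agent_name: str
    input_summary: str          # or a hash, if inputs are sensitive
    tools_called: list[str] = field(default_factory=list)
    tokens_in: int = 0
    tokens_out: int = 0
    latency_ms: float = 0.0
    state_diff: dict = field(default_factory=dict)  # what changed in shared state
    next_agent: str | None = None

# Emit one record per hop (to stdout, a DB, or your tracing backend)
# so any task can be replayed step by step.
```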
6.2 Scenario-based evaluation, not just "did it work?"
Define scenarios: single-agent-equivalent tasks (baseline), multi-step tasks, tool-heavy tasks, edge cases (ambiguous instructions, contradicting goals, missing data).
For each scenario: run the multi-agent system end-to-end. Use LLM-as-a-judge or human eval to score: task completion, correctness, safety, unnecessary agent hops / tool calls.
Compare against: a simpler baseline (one agent + tools), variants of your agent orchestration pattern (different graphs, different roles).
If the multi-agent variant isn't clearly better or more robust, kill it.
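A minimal harness sketch: the `judge` scorer and the system callables are hypothetical; the point is running the same scenarios through both the multi-agent variant and a simpler baseline.

```python
SCENARIOS = [
    {"name": "baseline_qa", "input": "What is our refund policy?"},
    {"name": "multi_step", "input": "Compare churn by segment over four quarters."},
    {"name": "ambiguous", "input": "Make the numbers better."},
]

def judge(question: str, answer: str) -> float:
    """Hypothetical: LLM-as-a-judge or human score in [0, 1]."""
    raise NotImplementedError

def evaluate(system, scenarios=SCENARIOS) -> float:
    scores = []
    for s in scenarios:
        output = system(s["input"])               # end-to-end run
        scores.append(judge(s["input"], output))
    return sum(scores) / len(scores)

# If evaluate(multi_agent_system) isn't clearly above
# evaluate(single_agent_baseline), drop the extra complexity.
```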
Go deeper on monitoring & analytics
If you want a dedicated deep dive on how to monitor and analyze your agentic workflows in production, and how to use those traces to drive fine-tuning, read Multi-Agent Systems & Agentic AI: From Hype to Reliable Operations.
7. Cost and latency: agentic ≠ license to burn money
Agentic is cool until you realize each "hop" is another model call.
7.1 Hard budgets per request
For each endpoint: max total LLM calls, max total tokens, target P50/P95 latency.
If a task wants to exceed that: short-circuit with a partial answer or escalation. Don't let agents negotiate themselves into a 20-step loop.
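A sketch of per-request budget enforcement; counters would be incremented wherever you actually call a model:

```python
from dataclasses import dataclass

class BudgetExceeded(Exception):
    pass

@dataclass
class RequestBudget:
    max_llm_calls: int = 6
    max_total_tokens: int = 20_000
    llm_calls: int = 0
    total_tokens: int = 0

    def charge(self, tokens: int) -> None:
        # Call this after every model invocation in the workflow.
        self.llm_calls += 1
        self.total_tokens += tokens
        if self.llm_calls > self.max_llm_calls or self.total_tokens > self.max_total_tokens:
            raise BudgetExceeded("short-circuit: return a partial answer or escalate")
```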
7.2 Model routing inside the agent system
Combine earlier unit-economics tricks:
- Use cheaper models for: planning, routing, simple subtasks
- Use larger models only for: final user-facing answers, complex reasoning steps
Multi-agent systems are a perfect fit for model routing—just don't forget to actually use it.
8. Checklist: sane multi-agent system design
Use this as a sanity check before you ship your "agentic" thing:
- Clear reason to use multiple agents (not just "because hype")
- Explicit orchestration pattern (manager–worker, router, critic, etc.)
- Structured, shared state (not just growing chat history)
- Per-agent and global state separation
- Hard caps on steps, time, and tool calls
- Restricted transitions between agents (graph / state machine)
- Per-agent tool and data access scoped to role
- Step-level logging and replayable traces
- Scenario-based evaluation vs simpler baselines
- Cost / latency budgets enforced at the system level
Do all that, and "agentic" stops being marketing speak and becomes what it should be: a practical way to structure complex LLM systems so they're understandable, observable, and controllable.
9. Real Example: Data Analysis Assistant (Multi-Agent in Practice)
Let's build a concrete multi-agent system instead of hand-wavy "agents will coordinate" nonsense.
9.1 Use Case: Data Analysis Assistant
Goal: A user asks natural language questions about their data:
Compare this quarter's churn rate to the previous four quarters by customer segment, and explain what changed.
We want the system to: understand the question, plan the steps, generate correct SQL, run it safely, summarize the results in human language.
We'll use a multi-agent architecture for: better separation of responsibilities, better observability and debugging, clear guardrails (SQL execution, planning, explanation).
9.2 High-Level Multi-Agent Architecture
Three main agents:
- Planner Agent – Interprets user request, breaks it into steps/sub-queries, decides which tables/metrics to use, writes structured "analysis plan" into shared state
- SQL Agent (Data Agent) – Converts plan into SQL queries, executes them via a controlled SQL tool, stores results (tables, aggregates) in state
- Explainer Agent – Reads the plan + data results, produces a narrative explanation and, optionally, charts/table summaries for the user
Orchestration pattern: Manager–Worker + Tool Specialist pattern implemented as a graph. Planner = manager, SQL Agent = tool specialist, Explainer = finalizer.
9.3 The Shared State Object
Instead of just passing raw chat history around, we use a structured AnalysisState:
{
"task_id": "uuid-123",
"user_query": "Compare this quarter's churn rate...",
"status": "planning" | "running_sql" | "explaining" | "completed" | "error",
"plan": {
"steps": [
{"id": "step1", "description": "Identify quarters", "status": "done"},
{"id": "step2", "description": "Compute churn by segment", "status": "in_progress"}
],
"assumptions": ["Use subscriptions table", "Churn = inactive > 30 days"],
"tables_used": ["customers", "subscriptions"]
},
"sql_queries": [
{
"id": "q1",
"step_id": "step2",
"sql": "SELECT ...",
"status": "succeeded",
"result_table_name": "churn_by_segment_quarter"
}
],
"results": {
"tables": {
"churn_by_segment_quarter": {
"schema": {"segment": "string", "quarter": "string", "churn_rate": "float"},
"sample_rows": [
{"segment": "SMB", "quarter": "2024-Q1", "churn_rate": 0.08}
]
}
}
},
"final_answer": null,
"errors": []
}
Key points: single shared state every agent reads/writes, structured fields (not just "extra text"), easy to log/debug/replay/inspect, fits naturally with LangGraph's typed state idea.
9.4 The LangGraph-Style Graph
Nodes: PlannerNode, SQLNode, ExplainerNode, ErrorNode, DoneNode
Transitions:
[PlannerNode] → [SQLNode] → [ExplainerNode] → [DoneNode]
On errors at any stage → [ErrorNode]
In more realistic form: if plan status != complete → stay in PlannerNode or error; if any SQL queries failed and retries left → SQLNode again; once all queries succeed → ExplainerNode.
This is classic agent orchestration pattern as a graph — exactly what LangGraph is built for.
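As a sketch, that routing can be a single function over the shared state, wired in with LangGraph-style conditional edges (API details vary by version; the field names match the AnalysisState above):

```python
def route_after_sql(state: dict) -> str:
    # Decide the next node from structured state, not from free text.
    failed = [q for q in state["sql_queries"] if q["status"] == "failed"]
    if failed and state.get("sql_retries_left", 0) > 0:
        return "sql"        # retry with error context
    if failed:
        return "error"      # out of retries
    return "explainer"      # all queries succeeded

# graph.add_conditional_edges("sql", route_after_sql,
#                             {"sql": "sql", "error": "error", "explainer": "explainer"})
```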
9.5 Agent Responsibilities and Prompts
Planner Agent
Role: Understand user query, plan steps, decide what data/tables are needed, write to state.plan and update state.status = "running_sql"
Prompt sketch:
You are a planning agent for a data analysis assistant. Given the user query and available tables, produce: a list of numbered steps, any assumptions you must make, which tables and fields you will use. Do NOT write SQL. Only plan. Output in JSON with keys: steps, assumptions, tables_used.
Node logic: read state.user_query, read metadata about available tables, update state.plan, set state.status = "running_sql", hand off to SQLNode.
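A sketch of the planner node as a plain function over the shared state; the `call_llm` helper, `PLANNER_PROMPT`, and the table-metadata lookup are hypothetical:

```python
import json

def planner_node(state: dict) -> dict:
    tables = describe_available_tables()            # hypothetical schema lookup
    plan_json = call_llm(                           # hypothetical model call
        PLANNER_PROMPT.format(query=state["user_query"], tables=tables)
    )
    plan = json.loads(plan_json)                    # expects steps / assumptions / tables_used
    # Return a partial state update; the graph runner merges it into AnalysisState.
    return {"plan": plan, "status": "running_sql"}
```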
SQL Agent (Data Agent)
Role: Take the plan, for each step that needs data write safe SQL, run it through a controlled SQL tool, store results in state.results.tables
Prompt sketch:
You are a SQL generation agent. You are given: the user query, a high-level analysis plan, database schema. For each step that requires data: write a single SQL query, ensure queries are safe and read-only. Use ONLY the documented tables and columns. Return JSON with queries: [{step_id, sql}].
Execution loop: generate queries → validate (optional second LLM or rule-based check) → execute queries via tool (with hard guardrails: read-only, timeouts) → update state.sql_queries and state.results.tables. If any query fails → log error in state.errors and either retry once with error context or send to ErrorNode.
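A sketch of that loop; `generate_queries`, `validate_sql`, and `run_readonly` are hypothetical helpers, and the retry/guardrail shape is what matters:

```python
def sql_node(state: dict) -> dict:
    queries = generate_queries(state["user_query"], state["plan"])   # hypothetical LLM call
    results, errors = {}, list(state.get("errors", []))
    for q in queries:
        if not validate_sql(q["sql"]):              # rule-based or second-model check
            q["status"] = "failed"
            errors.append(f"{q['step_id']}: rejected as unsafe")
            continue
        try:
            # Hard guardrails live in the tool: read-only connection, timeout, row limit.
            results[q["step_id"]] = run_readonly(q["sql"], timeout_s=30, max_rows=10_000)
            q["status"] = "succeeded"
        except Exception as exc:
            q["status"] = "failed"
            errors.append(f"{q['step_id']}: {exc}")
    # Routing (see 9.4) decides whether to retry failed queries or go to ErrorNode.
    return {"sql_queries": queries, "results": {"tables": results}, "errors": errors}
```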
Explainer Agent
Role: Consume the plan + results, produce the final answer for the user: explanation, key comparisons, optional recommendations
Prompt sketch:
You are an analyst. You are given: the original user question, the analysis plan, query results in tables with sample rows and schemas. Your job: answer the question clearly, compare key metrics over time, highlight notable changes and possible reasons (label speculation as such). Format: 2–3 short paragraphs followed by a bullet list of key metrics.
Node logic: read user query + plan + state.results, generate final narrative, write state.final_answer, set state.status = "completed", go to DoneNode.
9.6 Orchestration Pattern in Practice
This setup gives you a manager–worker pattern with tool specialist and explicit lifecycle:
- Planner: decomposes user intent into a structured plan
- SQL Agent: specialized worker that only deals with data access
- Explainer: specialized communicator that focuses on clarity and narrative
Wrapped in a LangGraph-style stateful graph, we get: controlled transitions, explicit states (planning, running_sql, explaining, completed, error), ability to stop runaway loops (max steps or retries).
9.7 Guardrails and Budgets
Even this "simple" multi-agent system will happily burn tokens if you let it. Put guardrails around it:
- Hard caps: Max total steps (e.g., 8), max SQL retries per query (e.g., 2), max tokens (planner + SQL agent + explainer prompts/outputs)
- Permissions: SQL Agent (read-only, whitelisted schemas, timeouts and row limits); Planner/Explainer (no direct SQL execution, no system/infra tools)
- Fallbacks: If planner fails repeatedly → "I couldn't understand your request"; if SQL Agent keeps failing → "query looks unsupported/schema issue"; if explainer fails → return structured data plus safe generic comment
You don't let the system spin indefinitely. You fail predictably.
9.8 Observability: What You Log Per Task
For each task_id, store:
- user_query
- Final AnalysisState snapshot: plan (steps, assumptions, tables), sql_queries (SQL, status, errors), results.tables (schemas + small sample), final_answer
- Per-node metrics: which nodes ran and in what order, tokens in/out per node, latency per node, errors and retries
This gives you: traces you can replay, insight into which agent is the bottleneck, a way to compute cost and latency per workflow and optimize accordingly.
9.9 Implementation with LangGraph / AutoGen
LangGraph-style: Nodes (planner_node, sql_node, explainer_node, error_node, done_node), shared state object = AnalysisState, graph (initial node = planner_node, edges defined by state.status and error flags, max steps enforced at graph runner level). LangGraph gives you: state management, graph execution, persistence (optional).
AutoGen-style: Agents (planner_agent, sql_agent, explainer_agent), conversation protocol (user → planner → sql_agent → planner (optional) → explainer → user). But even with AutoGen, you still want a structured AnalysisState so you're not stuck in pure chat logs.
9.10 Why This Multi-Agent Setup Is Actually Worth It
This is the kind of multi-agent system that earns its complexity:
- Planner gives you interpretable plans and clearer debugging
- SQL Agent is locked to a narrow, auditable surface (SQL generation)
- Explainer focuses purely on communication quality
- Shared state gives you traceability, replayability, and evaluation hooks
- LangGraph-style graph gives you explicit control over flow, retries, and limits
That's the difference between "we played with agents" and "we built an agentic data assistant you can actually run in production and monitor."