
MCP Monitoring & Observability Guide

By Antoine van der Lee

The failure mode that catches teams off-guard: an agent enters a retry loop, calls an MCP tool dozens of times on a single task, and nobody notices until the billing statement arrives. Traditional APM never fires — no error rate spike, no latency anomaly, just silent overconsumption. MCP monitoring requires a different instrumentation model.

Every tool invocation crosses a protocol boundary — model to MCP server to upstream system — and each hop can fail silently, return unexpected data, or consume credits you never anticipated. Without purpose-built observability, production agentic workloads are ungoverned by default.

The Three Pillars, Applied to MCP Agents

Metrics, traces, and logs remain the right vocabulary. What changes is what you measure inside each pillar.

Metrics shift from infrastructure-level signals to agent-level ones. CPU and memory matter, but the numbers that tell you whether your deployment is healthy are:

  • Tool-call success rate per MCP server
  • Mean and p99 latency per tool
  • Tokens consumed per agent session (and per tool call)
  • Cost per completed task
  • Retry rate — how often the model re-issues a failed tool call
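As a rough illustration, the metrics above can be derived from a list of per-call records. This is a minimal sketch: the `ToolCall` shape, its field names, and the sample data are assumptions made for the example, and the retry rate is computed here as the share of calls that were retries.

```python
import math
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical record shape; the field names are illustrative only.
    server: str
    tool: str
    latency_ms: float
    ok: bool
    is_retry: bool
    tokens: int


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(calls):
    """Aggregate agent-level metrics for a non-empty list of calls."""
    latencies = [c.latency_ms for c in calls]
    return {
        "success_rate": sum(c.ok for c in calls) / len(calls),
        "mean_latency_ms": sum(latencies) / len(latencies),
        "p99_latency_ms": percentile(latencies, 99),
        "retry_rate": sum(c.is_retry for c in calls) / len(calls),
        "tokens": sum(c.tokens for c in calls),
    }


# Made-up sample data for one tool on one MCP server.
calls = [
    ToolCall("crm", "lookup", 120.0, True, False, 310),
    ToolCall("crm", "lookup", 480.0, False, False, 295),
    ToolCall("crm", "lookup", 150.0, True, True, 305),
    ToolCall("crm", "lookup", 130.0, True, False, 300),
]
summary = summarize(calls)
print(summary)
```

In practice the same aggregation would be grouped per MCP server and per tool rather than run over one flat list.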

Traces are where agent observability diverges most sharply from traditional APM. A single user request may fan out into a tree of tool calls, some sequential, some parallel, some triggered by the model's own reasoning. A useful trace captures the full decision tree: which tool the model chose, what arguments it passed, what the MCP server returned, and how long each leg took. Without end-to-end trace context propagating across the model boundary, you see only disconnected spans.

Logs for MCP traffic need structured records of every tool invocation: timestamp, session ID, tool name, input payload, response payload (or error), and the model's internal trace ID if the provider surfaces one. These records are your audit trail and your primary debugging surface — without them, post-incident analysis is guesswork.
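One possible shape for such a record is a single JSON line per invocation. The sketch below is illustrative only: the function name, the key names, and the sample values are assumptions, not a schema defined by MCP or by any particular gateway.

```python
import json
from datetime import datetime, timezone


def tool_call_record(session_id, tool, arguments, response=None, error=None,
                     model_trace_id=None):
    """Build one JSON log line for a tool invocation (hypothetical schema)."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "tool": tool,
        "input": arguments,
        "response": response,
        "error": error,
        # Stays None when the provider does not surface a trace ID.
        "model_trace_id": model_trace_id,
    }, sort_keys=True)


line = tool_call_record("sess-42", "search_orders", {"customer": "c-17"},
                        error="upstream timeout")
print(line)
```

Keeping every field present, even when empty, makes the records easier to query than lines whose keys vary by outcome.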

What "Good" Looks Like

Without baselines, anomaly detection is guesswork. Establish targets per tool and per agent type. Reasonable starting points for enterprise deployments:

Signal                  | Target           | Investigate if
------------------------|------------------|----------------
Tool-call latency (p50) | < 300 ms         | > 1 s sustained
Tool-call error rate    | < 1%             | > 3% over 5 min
Tool-call success rate  | > 99%            | < 97%
Cost per task           | Set per workflow | > 2× baseline
Retry rate              | < 5%             | > 15%
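The "investigate if" column can be encoded as a small threshold check. This is a sketch under assumptions: the signal keys and the `breaches` helper are hypothetical names, rates are expressed as fractions, cost is expressed as a multiple of the workflow baseline, and the "sustained" and "over 5 min" qualifiers are not modeled here.

```python
# "Investigate if" thresholds taken from the table above.
THRESHOLDS = {
    "p50_latency_ms": ("above", 1000),
    "error_rate": ("above", 0.03),
    "success_rate": ("below", 0.97),
    "cost_vs_baseline": ("above", 2.0),
    "retry_rate": ("above", 0.15),
}


def breaches(observed):
    """Return the signals whose observed value crosses its threshold."""
    flagged = []
    for signal, (direction, limit) in THRESHOLDS.items():
        value = observed.get(signal)
        if value is None:
            continue  # signal not reported for this tool or agent type
        if direction == "above" and value > limit:
            flagged.append(signal)
        elif direction == "below" and value < limit:
            flagged.append(signal)
    return flagged


observed = {
    "p50_latency_ms": 250,
    "error_rate": 0.04,
    "success_rate": 0.96,
    "retry_rate": 0.02,
}
print(breaches(observed))
```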

Cost deserves special attention. Agents can enter retry loops or issue redundant tool calls that are individually cheap but compound quickly at scale. Tracking cost per tool call, not just per session, surfaces these patterns before they reach your billing statement.
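To make the difference in grain concrete, the sketch below aggregates cost per (session, tool) pair. The tuple layout, the tool names, and the prices are invented for the example.

```python
from collections import defaultdict


def cost_by_tool(calls):
    """Sum cost and count calls per (session, tool) pair."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0})
    for session_id, tool, cost_usd in calls:
        entry = totals[(session_id, tool)]
        entry["calls"] += 1
        entry["cost_usd"] += cost_usd
    return dict(totals)


# Made-up data: forty cheap calls to one tool and a single pricier call.
calls = [("s1", "search", 0.002)] * 40 + [("s1", "fetch", 0.01)]
report = cost_by_tool(calls)
for (session_id, tool), entry in report.items():
    print(session_id, tool, entry["calls"], round(entry["cost_usd"], 4))
```

A session-level total would show only the combined spend. The per-tool view shows that one tool was called forty times in a single session, which is the pattern worth investigating.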

Traces Across the Model Boundary

The hardest part of agent observability is stitching spans that cross into and out of the model. Most LLM providers do not emit OpenTelemetry spans natively — so the instrumentation layer has to sit at the MCP gateway, the component that intercepts every tool call before it reaches an MCP server.

The gateway records an outbound span when the tool call leaves the model, attaches a trace ID, and correlates the inbound span when the MCP server responds. The result looks like a conventional distributed trace but represents a reasoning step, not a code path.
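The span lifecycle described above might look like the following in miniature. This is a standard-library sketch, not an OpenTelemetry integration: `Gateway`, `Span`, and the `add` tool are hypothetical stand-ins, and the tool runs in-process rather than on a real MCP server.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    trace_id: str
    tool: str
    arguments: dict
    started: float = field(default_factory=time.monotonic)
    duration_ms: Optional[float] = None
    result: Any = None
    error: Optional[str] = None


class Gateway:
    """Hypothetical in-process stand-in for an MCP gateway."""

    def __init__(self, tools):
        self.tools = tools  # tool name -> callable
        self.spans = []

    def call(self, trace_id, tool, arguments):
        # Open the span before the call leaves, recording the exact arguments.
        span = Span(trace_id, tool, arguments)
        try:
            span.result = self.tools[tool](**arguments)
            return span.result
        except Exception as exc:
            span.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            # Close the span whether the tool succeeded or failed.
            span.duration_ms = (time.monotonic() - span.started) * 1000
            self.spans.append(span)


gateway = Gateway({"add": lambda a, b: a + b})
trace_id = uuid.uuid4().hex
gateway.call(trace_id, "add", {"a": 2, "b": 3})
print(gateway.spans[0])
```

Because the span stores the arguments alongside the outcome, a failed call can later be checked for malformed input instead of being written off as a tool failure.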

This is also where you capture the context that makes debugging tractable: the exact arguments the model chose, not just the fact that a call was made. That detail separates "the tool failed" from "the model passed malformed arguments" — two root causes with entirely different remediation paths.

For more on the audit requirements this creates, see AI Agent Audit Logs.

Alerting for Agentic Workloads

Alert design for agents follows the same principles as any distributed system — alert on symptoms, not causes — but the symptoms specific to agentic workloads require different alert types.

Latency alerts should be per-tool, not per-service. A slow database tool and a slow file-read tool have different normal ranges and different blast radii.

Error-rate alerts need a burn-rate model. A 5% error rate for 30 seconds is noise; a 3% error rate sustained for 5 minutes in a batch workflow is a real incident.
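A simplified version of that idea is sketched below. It assumes 30-second buckets, a 1% error budget (matching the 99% success target), and an alert when the 5-minute window burns more than three times that budget. The function names and bucket layout are hypothetical.

```python
def burn_rate(errors, total, budget=0.01):
    """Observed error rate divided by the budgeted error rate."""
    if total == 0:
        return 0.0
    return (errors / total) / budget


def should_alert(buckets, window=10, threshold=3.0):
    """buckets holds (errors, total) per 30-second bucket, oldest first.

    Fires only when the last `window` buckets (5 minutes at 30 s each)
    burn the error budget faster than `threshold` times.
    """
    if len(buckets) < window:
        return False
    recent = buckets[-window:]
    errors = sum(e for e, _ in recent)
    total = sum(t for _, t in recent)
    return burn_rate(errors, total) > threshold


# A single 30-second bucket at 5% errors, surrounded by clean traffic.
spike = [(0, 100)] * 9 + [(5, 100)]
# Ten consecutive buckets at 4% errors.
sustained = [(4, 100)] * 10

print(should_alert(spike), should_alert(sustained))
```

Averaging over the full window is what keeps the short spike from paging anyone while the sustained elevation does.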

Cost anomaly alerts are unique to agent workloads. Set a rolling budget per agent type and alert when spend exceeds 150% of the trailing 7-day average. This catches runaway loops before they become billing surprises.
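The 150% rule reduces to a few lines. In this sketch, `cost_anomaly` is a hypothetical helper, and the daily spend figures are invented.

```python
def cost_anomaly(daily_spend, today, factor=1.5, days=7):
    """True when today's spend exceeds factor times the trailing average."""
    if len(daily_spend) < days:
        return False  # not enough history to form a baseline
    baseline = sum(daily_spend[-days:]) / days
    return today > factor * baseline


# Seven days of made-up spend for one agent type; the average is 11.0.
history = [10.0, 12.0, 11.0, 9.0, 10.0, 13.0, 12.0]

print(cost_anomaly(history, 16.0))  # below the 16.5 limit
print(cost_anomaly(history, 17.0))  # above the 16.5 limit
```

Evaluating the same check against partial-day spend, rather than only at the end of the day, is one way to catch a loop while it is still running.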

Stuck-agent alerts are the easiest to miss. An agent still "running" after 10× its normal task duration is likely in a loop or blocked on a hung tool call. A time-to-completion threshold catches this failure class — neither error rates nor latency histograms will.
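A time-to-completion check can be as simple as comparing elapsed time against a multiple of the typical duration for that agent type. The data shapes and names below are assumptions for illustration.

```python
def stuck_sessions(running, typical_seconds, now, multiplier=10):
    """Return session IDs that have run longer than multiplier x typical.

    running maps session_id -> (agent_type, started_at in epoch seconds).
    typical_seconds maps agent_type -> normal task duration in seconds.
    """
    stuck = []
    for session_id, (agent_type, started_at) in running.items():
        limit = multiplier * typical_seconds[agent_type]
        if now - started_at > limit:
            stuck.append(session_id)
    return stuck


running = {
    "s1": ("summarizer", 0),    # started 1000 s ago
    "s2": ("summarizer", 900),  # started 100 s ago
}
typical = {"summarizer": 60}

print(stuck_sessions(running, typical, now=1000))
```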

How This Differs from Traditional APM

Traditional APM instruments code you wrote: function calls, database queries, HTTP requests — all deterministic, all defined at deploy time. Instrumentation points are stable.

Agent observability has to handle non-determinism. The model decides which tools to call; the instrumentation layer cannot know in advance which paths will be exercised. This drives three architectural differences:

  1. Schema validation at the boundary. The gateway validates tool inputs and outputs against the MCP server's declared schema in real time, not in post-hoc log analysis. Schema drift is caught as it happens.

  2. Semantic context alongside technical metrics. A slow HTTP request in traditional APM is self-describing. A slow tool call needs the model's intent context — why did the agent call this tool, at this point? — for root-cause analysis to mean anything. That context comes from the conversation or task object attached to the trace.

  3. Cost as a first-class signal. Traditional APM does not track spend per function call. For LLM-powered agents, cost is a reliability signal: unexpected spend spikes indicate misbehavior just as surely as error-rate spikes do.
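The first item, schema validation at the boundary, can be sketched with a deliberately small validator. It assumes the tool declares a JSON-Schema-style input schema and checks only required fields and primitive types. A production gateway would more likely rely on a full JSON Schema validator, and the schema shown here is made up.

```python
TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate(arguments, schema):
    """Return a list of problems; an empty list means the call is valid."""
    problems = []
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required field: {name}")
    properties = schema.get("properties", {})
    for name, value in arguments.items():
        spec = properties.get(name)
        if spec is None:
            continue  # undeclared fields are ignored in this sketch
        kind = spec["type"]
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool_as_number = isinstance(value, bool) and kind in ("integer", "number")
        if is_bool_as_number or not isinstance(value, TYPES[kind]):
            problems.append(f"{name}: expected {kind}")
    return problems


# Hypothetical input schema for an order-lookup tool.
schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "limit": {"type": "integer"},
    },
    "required": ["order_id"],
}

print(validate({"order_id": "A-1", "limit": 5}, schema))
print(validate({"limit": "5"}, schema))
```

Attaching the returned problems to the span is what lets a later reader tell a malformed call from a failing tool.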

The gateway layer is also where observability data becomes actionable. See What Is an MCP Gateway? for how the gateway fits into the broader control-plane architecture.


MCP Beast: Protocol-Layer Observability

MCP Beast instruments the MCP protocol layer directly — no code changes to your agents or MCP servers. Every tool call produces a structured span with latency, token cost, input/output payloads, and schema validation results. Traces are stitched across the model boundary using session context propagated through the gateway.

Dashboards surface per-tool latency histograms, cost-per-task trends, and error rates grouped by MCP server. Alerting rules ship with defaults tuned to the baselines above; teams can set per-agent cost budgets that fire Slack or PagerDuty notifications before a runaway loop becomes a billing incident.

The audit log — immutable, append-only, exportable to your SIEM — satisfies the same compliance requirements your security team already applies to non-AI workloads.



Frequently Asked Questions

Can I use my existing APM tool (Datadog, New Relic, Dynatrace) for MCP monitoring?

These platforms accept OpenTelemetry data, so spans emitted by an MCP gateway can be forwarded to them. What they lack out of the box is agent-specific semantics: cost-per-call tracking, schema validation signals, and the tooling to make a reasoning trace readable alongside a service map. Many teams use their existing APM for infrastructure signals and a dedicated agent observability layer for the MCP-specific ones.

What trace sampling rate should I use for production agents?

Sampling at 100% (keeping every trace) is the right default for agents, at least initially. Agent tasks are often long-running, infrequent, and high-value compared with web request traffic, so dropping traces to manage volume is rarely worth the debugging cost. If volume becomes a constraint, tail-based sampling that retains all error traces and a representative sample of successful ones is a better tradeoff than lowering a head-based sampling rate.
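A tail-based decision of that kind can be sketched as follows. The `keep_trace` function and the 10% default are assumptions for the example. The hash-based bucket gives the same decision for a given trace on every collector.

```python
import zlib


def keep_trace(trace_id, has_error, success_rate=0.1):
    """Decide after a trace completes whether to retain it.

    Error traces are always kept. Successful traces are kept for a
    deterministic fraction of trace IDs.
    """
    if has_error:
        return True
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < success_rate * 10_000


print(keep_trace("trace-001", has_error=True))
print(keep_trace("trace-001", has_error=False, success_rate=1.0))
```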

How do I correlate a user complaint with a specific agent session?

Every agent session should carry a stable session ID that propagates into every tool-call span and log record. Surface this ID to the user in the product UI (or in support metadata) so that a complaint maps directly to a set of traces. Without this, you are searching logs by approximate timestamp, which is slow and unreliable in multi-tenant deployments.
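One way to make the session ID follow every record without threading it through each function call is a context variable. The helper names below are hypothetical. `contextvars` is part of the Python standard library.

```python
import contextvars
import uuid

# Holds the current session ID for whichever task or thread is running.
session_id_var = contextvars.ContextVar("session_id", default=None)


def start_session():
    """Create a session ID and bind it to the current context."""
    session_id = uuid.uuid4().hex
    session_id_var.set(session_id)
    return session_id


def log_record(event, **fields):
    """Build a log record that always carries the current session ID."""
    return {"session_id": session_id_var.get(), "event": event, **fields}


sid = start_session()
record = log_record("tool_call", tool="search_orders")
print(record)
```

The value returned by `start_session` is the ID to surface in the product UI or in support metadata.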