MCP Beast

How to Cut MCP Token Usage: Why Your Context Window Fills Up

By Antoine van der Lee

MCP token usage grows because every connected server injects its complete tool-definition list — names, descriptions, JSON schemas — into the model's context at session start, before the first user message. Cutting it comes down to three moves: connect fewer servers per client, filter which tools are exposed, or switch to on-demand tool discovery so the model only loads what each request needs.

This article explains the mechanics precisely, because the fix follows from the mechanism. The claims here are qualitative on purpose: actual token counts depend on your servers, schemas, and model, and the audit checklist at the end shows you how to measure your own numbers instead of trusting anyone else's.

The Mechanics: Where the Tokens Go

When an MCP client starts a session, it performs a handshake with each connected server and calls tools/list. Each server responds with its full tool catalog: a name, a natural-language description, and a JSON Schema for the input of every tool. The client then serializes all of it into the system context it hands the model — that is how the model knows the tools exist at all. This is the tool-calling loop described in LLM Tool Calling Explained, and the context injection is its fixed cost.
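To make the fixed cost concrete, here is a minimal sketch of what a client has to carry for each tool. The field names (name, description, inputSchema) follow the MCP tool shape; the two tools, the estimate_tokens helper, and its four-characters-per-token rule of thumb are illustrative assumptions, and real clients serialize the block in their own format.

```python
import json

# Hypothetical tools/list result. The field names follow the MCP tool
# shape; the two tools themselves are invented for illustration.
TOOLS_LIST_RESULT = {
    "tools": [
        {
            "name": "get_file_contents",
            "description": "Read a file from a repository at a given ref.",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"},
                    "ref": {"type": "string", "description": "Branch or commit"},
                },
                "required": ["path"],
            },
        },
        {
            "name": "create_issue",
            "description": "Open a new issue with a title and optional body.",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "body": {"type": "string"},
                },
                "required": ["title"],
            },
        },
    ]
}


def estimate_tokens(text: str) -> int:
    """Rough heuristic of about 4 characters per token; not a real tokenizer."""
    return max(1, len(text) // 4)


def tool_block_cost(result: dict) -> dict[str, int]:
    """Map each tool name to the estimated tokens of its serialized definition."""
    return {tool["name"]: estimate_tokens(json.dumps(tool)) for tool in result["tools"]}


costs = tool_block_cost(TOOLS_LIST_RESULT)
for name, tokens in costs.items():
    print(f"{name}: ~{tokens} tokens")
print(f"total: ~{sum(costs.values())} tokens before the first user message")
```

The heuristic is only good for ranking tools against each other. For real numbers, use your provider's token counter as described in the audit checklist.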

Three properties make this expensive in practice:

  • It is unconditional. The definitions are loaded whether or not the conversation will ever touch them. A GitHub server's dozens of tools ride along in a session about renaming a CSS file.
  • It multiplies. Tool definitions are verbose by design — good descriptions and complete schemas are what make tool selection work. A single nontrivial server commonly exposes dozens of tools; connect ten servers and the model is reading hundreds of tool definitions, thousands of tokens deep, before the first user message.
  • It recurs. Context is rebuilt per request to the model. The tool block is part of every inference call in the session, so the overhead is paid on every turn, not once. (Prompt caching can soften the cost per call; it does not give the context window space back.)
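The multiplication and the recurrence can be written out as arithmetic. The sketch below uses made-up inputs chosen only to make the calculation visible; they are not measurements, and the function names are hypothetical.

```python
def per_turn_overhead(servers: int, tools_per_server: int, tokens_per_tool: int) -> int:
    """Tokens of tool definitions sent with every inference call."""
    return servers * tools_per_server * tokens_per_tool


def session_overhead(per_turn: int, turns: int) -> int:
    """Total input tokens spent on tool definitions across a session."""
    return per_turn * turns


# Placeholder inputs, not measured values.
per_turn = per_turn_overhead(servers=10, tools_per_server=30, tokens_per_tool=150)
total = session_overhead(per_turn, turns=20)

print(per_turn)  # 45000
print(total)     # 900000
```

Prompt caching may reduce what you pay for the second figure, but the first figure is the context-window space occupied on every call.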

The Second Cost: Worse Tool Selection

If token spend were the only issue, this would be a billing problem. The deeper cost is quality: as the tool count grows, models get measurably worse at picking the right tool. Similarly named tools from different servers (search, query, find_issues) compete; long tool lists dilute the attention available for the actual task; and the model may call a plausible-but-wrong tool rather than the right one. Anyone who has watched an agent invoke the wrong server's create_issue has seen this failure mode.
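One cheap way to spot the most obvious collisions is to look for identical tool names exposed by more than one server. This is a minimal sketch with a hypothetical find_name_collisions helper and invented catalogs. It only catches exact duplicates, so near-synonyms such as search and query still need a human eye.

```python
from collections import defaultdict


def find_name_collisions(catalogs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return tool names that appear on more than one server, with their owners."""
    owners = defaultdict(list)
    for server, tool_names in catalogs.items():
        for name in tool_names:
            owners[name].append(server)
    return {name: sorted(servers) for name, servers in owners.items() if len(servers) > 1}


# Invented catalogs for illustration.
catalogs = {
    "github": ["create_issue", "search", "get_file_contents"],
    "jira": ["create_issue", "search", "find_issues"],
    "docs": ["query"],
}

collisions = find_name_collisions(catalogs)
print(collisions)
# {'create_issue': ['github', 'jira'], 'search': ['github', 'jira']}
```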

So the symptom is a fat context window, but the disease is degraded agent behavior. That is why mitigation is worth doing even when token cost alone would not justify it.

Mitigation 1: Connect Fewer Servers Per Client

The bluntest fix is hygiene. Most people accumulate MCP servers the way they accumulate browser extensions, and every client carries the full set everywhere.

  • Disconnect servers you have not used in weeks; reconnecting later is cheap.
  • Scope per client: the coding agent does not need the calendar server, and the writing assistant does not need the Kubernetes server.
  • Prefer one server that does the job over three overlapping ones.

This works and costs nothing, but it has a ceiling: it requires ongoing discipline, and it forces a static guess about which tools tomorrow's session will need.

Mitigation 2: Tool Filtering

One level finer: expose servers, but not all of their tools. Some clients and most proxies/gateways let you allow-list or deny-list at the tool level — keep get_file_contents, drop the twenty admin tools you will never call from this client. Filtering also doubles as a safety control: a tool that is not exposed cannot be invoked, which is the same least-privilege logic as AI agent access control.
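As a rough illustration of tool-level filtering, the sketch below applies an allow-list and a deny-list of glob patterns to a catalog. The filter_tools function, the pattern syntax, and the rule that deny wins over allow are assumptions made for this example; check your client's or gateway's own configuration format.

```python
from fnmatch import fnmatchcase


def filter_tools(tools: list[dict], allow=None, deny=()) -> list[dict]:
    """Keep tools matching an allow pattern (if given) and no deny pattern."""
    kept = []
    for tool in tools:
        name = tool["name"]
        if allow is not None and not any(fnmatchcase(name, p) for p in allow):
            continue  # not on the allow-list
        if any(fnmatchcase(name, p) for p in deny):
            continue  # explicitly denied; deny wins over allow
        kept.append(tool)
    return kept


# Invented catalog for illustration.
catalog = [
    {"name": "get_file_contents"},
    {"name": "get_issue"},
    {"name": "admin_delete_repo"},
    {"name": "admin_rotate_keys"},
    {"name": "create_issue"},
]

exposed = filter_tools(catalog, allow=["get_*", "create_issue"], deny=["admin_*"])
print([t["name"] for t in exposed])
# ['get_file_contents', 'get_issue', 'create_issue']
```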

The limitation is that filtering is still static. You are deciding at configuration time what every future request will need, and the merged list of "tools someone might need" tends to grow back over time.

Mitigation 3: On-Demand Tool Discovery (Semantic Routing)

The structural fix changes what the model sees. Instead of receiving every tool definition up front, the model receives a small, fixed set of meta-tools, and fetches real tool definitions only when a request needs them.

This is the approach MCP Beast takes. Clients connect to one gateway endpoint that exposes a compact dispatcher — three meta-tools — instead of the merged catalog:

  1. discover_tools — the agent describes what it is trying to do; the gateway runs a semantic search (combined keyword and embedding match) over every tool on every connected server and returns a ranked shortlist, without schemas.
  2. get_schema — the agent pulls the full input schema for just the tool it intends to call.
  3. invoke — the gateway routes the call to the right backend server and returns the result.

The context cost of being connected is now constant: three meta-tool definitions, regardless of whether the gateway fronts three servers or three hundred. Tool definitions enter context only for tools that are actually relevant to the request at hand — which also shrinks the candidate set the model must choose between, attacking the tool-selection problem directly. Adding a new server to the gateway no longer costs every client a bigger context block; it just makes one more catalog searchable. (This is the "router" layer in gateway vs proxy vs router terms.)
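The three-step flow can be sketched in a few lines. This is not MCP Beast's implementation: the Gateway and Tool classes are hypothetical, the backend call is replaced by a local handler function, and the semantic search is swapped for plain keyword overlap to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    server: str
    name: str
    description: str
    input_schema: dict
    handler: Callable[[dict], Any]  # stands in for the call to the backend server


class Gateway:
    """Toy dispatcher exposing only three meta-tools to the model."""

    def __init__(self, tools: list[Tool]):
        self._tools = {f"{t.server}.{t.name}": t for t in tools}

    def discover_tools(self, query: str, limit: int = 3) -> list[dict]:
        """Ranked shortlist without schemas (keyword overlap, not embeddings)."""
        words = set(query.lower().split())
        scored = []
        for key, tool in self._tools.items():
            text = f"{tool.name} {tool.description}".lower().replace("_", " ")
            score = len(words & set(text.split()))
            if score:
                scored.append((score, key, tool.description))
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [{"tool": key, "description": desc} for _, key, desc in scored[:limit]]

    def get_schema(self, tool: str) -> dict:
        """Full input schema for one tool only."""
        return self._tools[tool].input_schema

    def invoke(self, tool: str, arguments: dict) -> Any:
        """Route the call to the owning backend and return its result."""
        return self._tools[tool].handler(arguments)


# Invented tools for illustration.
gateway = Gateway([
    Tool("github", "create_issue", "Create an issue in a GitHub repository",
         {"type": "object", "properties": {"title": {"type": "string"}}, "required": ["title"]},
         lambda args: f"github issue: {args['title']}"),
    Tool("calendar", "create_event", "Create a calendar event",
         {"type": "object", "properties": {"summary": {"type": "string"}}, "required": ["summary"]},
         lambda args: f"calendar event: {args['summary']}"),
])

hits = gateway.discover_tools("create an issue for the login bug")
schema = gateway.get_schema(hits[0]["tool"])
result = gateway.invoke(hits[0]["tool"], {"title": "Login bug"})
print(hits[0]["tool"], schema["required"], result)
```

Only the definitions of the three methods would sit in the model's context. The schema for github.create_issue enters context when get_schema is called, and the calendar schema never does.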

The honest tradeoff: discovery adds a round trip, because the agent spends a tool call finding tools before calling one. For sessions that hammer a single tool repeatedly, that indirection buys little. For the common case (many servers connected, few tools used per task), it trades a small per-task lookup for removing the large always-on overhead, and the lookup itself is auditable like any other call.

The Audit Checklist: Measure Your Own Numbers

Do not take anyone's percentages on faith, including ours. Your overhead is measurable in an afternoon:

MCP TOKEN AUDIT

[ ] 1. List every MCP server connected to each client,
       and the tool count per server (the client's MCP/tools
       settings panel, or call tools/list yourself).
[ ] 2. Capture the baseline: with all servers connected, start
       a fresh session and record reported input tokens for a
       trivial one-line prompt (from the client's usage display,
       API usage dashboard, or provider token counter).
[ ] 3. Capture the floor: disconnect ALL MCP servers, repeat the
       same prompt, record input tokens. The difference is your
       per-turn MCP overhead.
[ ] 4. Reconnect servers one at a time and repeat to attribute
       overhead per server. Rank them: cost vs. how often you
       actually use their tools.
[ ] 5. Check usage reality: from session history or gateway audit
       logs, list which tools were invoked in the last two weeks.
       Tools never called but always loaded are pure overhead.
[ ] 6. Act: disconnect never-used servers, filter rarely-used
       tools, and route the rest through on-demand discovery.
[ ] 7. Re-measure after the change, and re-run the audit when
       you add a server — overhead grows back silently.
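Steps 2 through 4 reduce to subtraction once the readings are recorded. The sketch below assumes each per-server reading was taken with only that server connected, and every number in it is a placeholder to be replaced with your own measurements. The function names are hypothetical.

```python
def per_turn_overhead(baseline_tokens: int, floor_tokens: int) -> int:
    """Step 3: input tokens with all servers minus input tokens with none."""
    return baseline_tokens - floor_tokens


def rank_servers(floor_tokens: int, solo_readings: dict[str, int],
                 calls: dict[str, int]) -> list[tuple[str, int, int]]:
    """Step 4: (server, overhead, calls in the period), most expensive first."""
    rows = [
        (server, reading - floor_tokens, calls.get(server, 0))
        for server, reading in solo_readings.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder readings, not real measurements.
FLOOR = 1_200       # step 3: no MCP servers connected
BASELINE = 18_200   # step 2: all servers connected
SOLO = {"calendar": 3_700, "github": 9_700, "kubernetes": 7_200}
CALLS = {"github": 41, "calendar": 2}  # tool invocations per server

overhead = per_turn_overhead(BASELINE, FLOOR)
ranking = rank_servers(FLOOR, SOLO, CALLS)

print(overhead)
for server, cost, used in ranking:
    print(f"{server}: {cost} tokens per turn, {used} calls")
```

A server with high overhead and zero calls, like the invented kubernetes entry here, is the first candidate for step 6.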

Step 5 is where most teams are surprised: the loaded-versus-used gap is typically wide, and it is exactly the gap that filtering and discovery reclaim. A gateway with per-call audit logging makes that step a query instead of an archaeology project.
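The loaded-versus-used gap in step 5 is a set difference. This is a minimal sketch that assumes you can export the loaded tools per server and a list of (server, tool) invocations from session history or audit logs. The never_called helper and the data are hypothetical.

```python
def never_called(loaded: dict[str, set[str]],
                 invoked: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Return, per server, the tools that were loaded but never invoked."""
    used = set(invoked)
    gap = {}
    for server, tools in loaded.items():
        unused = {tool for tool in tools if (server, tool) not in used}
        if unused:
            gap[server] = unused
    return gap


# Invented data for illustration.
loaded = {
    "github": {"create_issue", "get_file_contents", "delete_repo"},
    "calendar": {"create_event", "list_events"},
}
invoked = [
    ("github", "get_file_contents"),
    ("github", "get_file_contents"),
    ("github", "create_issue"),
]

gap = never_called(loaded, invoked)
for server in sorted(gap):
    print(server, sorted(gap[server]))
# calendar ['create_event', 'list_events']
# github ['delete_repo']
```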

Frequently Asked Questions

Why does MCP use so many tokens before I even type anything?

Because once the handshake completes, the client calls tools/list on every connected server, receives each server's full tool catalog (names, descriptions, and JSON Schemas), and serializes all of it into the model's context so the model knows what it can call. That block is rebuilt into every request, so the cost recurs on each turn of the session.

Does prompt caching solve MCP token overhead?

It reduces the billing impact, since a stable tool block can be served from cache on subsequent calls, but it does not return the context-window space or fix tool selection: the model still reasons over the full tool list every turn. Caching also breaks whenever the tool block changes — connecting or removing a server invalidates it.
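To see why a changed tool block defeats the cache, it helps to fingerprint the serialized block. The sketch below assumes the provider's cache matches on an exact prefix of the request, which is how prompt caches commonly work but should be checked against your provider's documentation. The tool_block_fingerprint helper is hypothetical.

```python
import hashlib
import json


def tool_block_fingerprint(tools: list[dict]) -> str:
    """Hash the tool block exactly as serialized; any change alters the hash."""
    serialized = json.dumps(tools)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Invented tool lists for illustration.
before = [{"name": "get_file_contents"}, {"name": "create_issue"}]
after = before + [{"name": "create_event"}]  # one more server connected

unchanged = tool_block_fingerprint(before) == tool_block_fingerprint(list(before))
changed = tool_block_fingerprint(before) != tool_block_fingerprint(after)

print(unchanged)  # True: same block, cache can be reused
print(changed)    # True: new server, the cached prefix no longer matches
```

Logging this fingerprint per session is one way to notice when a configuration change has silently reset your cache hit rate.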

How many MCP servers is too many for one client?

There is no universal number — it depends on how many tools each server exposes and how verbose their schemas are. The practical test is to measure per-turn overhead with and without each server, and watch for the behavioral symptom (wrong-tool calls, sluggish selection) rather than a fixed count.

What does MCP Beast actually load into my context?

Three meta-tools — discover_tools, get_schema, and invoke — regardless of how many servers sit behind the gateway. Full tool definitions enter context only when an agent discovers and selects a specific tool for the current request, and every discovery and invocation is recorded in the audit log.

Stop Paying for Tools You Never Call

If your audit shows a wide gap between tools loaded and tools used, that gap is what MCP Beast removes: one endpoint, a three-tool dispatcher, and on-demand discovery across every server you connect — with keys in your Keychain and a per-call audit trail. Download the free Mac app, connect your noisiest servers, and re-run step 7 of the checklist.

