DSPy RLM with MCP Code Mode

DSPy’s RLM (Reinforcement Language Model) module lets LLMs write Python code instead of picking tools one at a time. Combined with mcp_use’s code_mode, this creates a powerful pattern: the LLM writes code that directly calls MCP tools as async Python functions.

The Traditional Approach

Most agent frameworks (including DSPy’s ReAct) follow this kind of pattern:

Thought: I need to search for users
Action: search_users
Action Input: {"name": "john"}
Observation: [results...]
Thought: Now I need to get details...

This works, but it’s limiting:

One tool per step: Can’t chain multiple calls efficiently
No data manipulation: Can’t filter, transform, or combine results
Rigid format: The LLM must conform to a strict thought/action/observation structure
Expensive: Each tool call requires a full LLM round-trip, and the output feeds back through the LLM just to be copied to the next call

The Origin of Code Mode

The code mode concept was pioneered by Cloudflare with a key insight: LLMs have seen a lot of code, but they haven’t seen a lot of “tool calls.”

Traditional tool-calling relies on synthetic training data and special tokens that are unfamiliar to language models. But LLMs have been trained on millions of open-source projects, they’re really good at writing code. So why not let them?

Cloudflare’s approach converts MCP tool schemas into a TypeScript API, then asks the LLM to write code that calls that API. The code runs in a secure V8 isolate sandbox where the only external access is through the MCP bindings. This gives you:

Better tool handling: Agents manage more tools when presented as APIs
Efficient chaining: No LLM round-trip between tool calls
Security: API keys stay hidden in the bindings

mcp-use’s Python Implementation

The mcp-use library brings code mode to Python. Enable it with a single flag:

client = MCPClient(config="config.json", code_mode=True)

When enabled, something clever happens: instead of loading all tool definitions into the agent’s context (which can be 150,000+ tokens), the agent sees only two tools: execute_code and search_tools. All MCP tools become accessible inside the code execution environment as async functions:

# Tools are namespaced by server
result = await github.get_pull_request(owner="anthropics", repo="claude")
data = await log_analytics.execute_kql(kql_query="SecurityAlert | take 10")

Tool names are automatically sanitized to valid Python identifiers (list-files becomes list_files).

The progressive discovery pattern is powerful: rather than pre-loading everything, agents use search_tools(query) to find relevant tools on-demand. This reduces context from 150k tokens to ~2k tokens, a 98.7% reduction.

Enter RLM: Code Generation Instead of Tool Selection

RLM flips the paradigm. Instead of asking the LLM to choose which tool to call, we ask it to write Python code that accomplishes the task:

# Instead of pick-one-tool-at-a-time, the LLM writes:
users = await server.search_users(name="john")
active_users = [u for u in users if u['status'] == 'active']
details = await server.get_user_details(user_id=active_users[0]['id'])
return SUBMIT(result=f"Found {len(active_users)} active users. First: {details['name']}")

This is more natural, more flexible, and often more efficient.

How RLM Works

RLM follows an iterative refinement process:

flowchart TD
    A[User Query] --> B[RLM Generates Python Code]
    B --> C[Interpreter Executes Code]
    C --> D{Success?}
    D -->|Error| E[LLM Sees Error + Context]
    E --> B
    D -->|Output| F{SUBMIT Called?}
    F -->|No| G[LLM Sees Output + Context]
    G --> B
    F -->|Yes| H[Return Final Result]

Key points:

Signature: Defines inputs/outputs, like task -> result
Interpreter: Executes the generated code (sandboxed)
Iteration: If code fails or doesn’t call SUBMIT(), RLM shows the output to the LLM and asks it to try again
SUBMIT(): A special function the LLM calls to signal “I’m done, here’s my answer”

Combining RLM with MCP Code Mode

Here’s where it comes together. RLM needs a custom interpreter to execute the code it generates. mcp-use’s code mode provides exactly that, a sandboxed environment where MCP tools are async functions.

The LLM can write code like:

# Query a database
incidents = await log_analytics.execute_kql(
    kql_query="SecurityIncident | where Severity == 'High' | take 10"
)
 
# Process locally
for incident in incidents['data']:
    print(f"Incident {incident['id']}: {incident['title']}")
 
# Submit findings
return SUBMIT(result=f"Found {len(incidents['data'])} high-severity incidents")

Building the MCPCodeInterpreter

Here’s the problem: DSPy’s RLM comes with built-in interpreters (Deno for TypeScript, Pyodide for Python), but they don’t know anything about MCP. They can’t call await github.get_pull_request() because that namespace-based syntax is specific to mcp-use’s code_mode sandbox.

mcp-use’s code_mode is what makes tools available as async functions via their server namespace:

# This syntax only works inside mcp-use's execute_code() sandbox
result = await log_analytics.execute_kql(kql_query="...")
data = await github.list_issues(repo="...")

So we need a custom interpreter that:

Implements the interface RLM expects (execute(), start(), shutdown(), tools, output_fields)
Delegates actual execution to mcp-use’s sandbox (where the namespace magic happens)
Injects a SUBMIT() function so the LLM can signal completion
Detects when SUBMIT() is called and returns a FinalOutput to stop iteration

Here’s the full implementation:

import asyncio
import json
from typing import Any, Callable
 
import nest_asyncio
from mcp_use import MCPClient
from dspy.primitives.code_interpreter import CodeInterpreterError, FinalOutput
 
# Allow nested event loops (needed for sync interpreter inside async context)
nest_asyncio.apply()
 
 
class MCPCodeInterpreter:
    """Custom interpreter that bridges DSPy RLM with mcp-use's code execution."""
 
    def __init__(self, mcp_client: MCPClient):
        self.client = mcp_client
        self._tools: dict[str, Callable[..., str]] = {}
        self._started = False
        self.output_fields: list[dict] = []  # Required by RLM
 
    @property
    def tools(self) -> dict[str, Callable[..., str]]:
        """RLM may set tools here, but we ignore them - tools come from MCP."""
        return self._tools
 
    @tools.setter
    def tools(self, value: dict[str, Callable[..., str]]):
        self._tools = value
 
    def _run_async(self, coro):
        """Run async code from sync context."""
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(coro)
 
    def start(self) -> None:
        """Initialize MCP sessions."""
        if not self._started:
            self._run_async(self.client.create_all_sessions())
            self._started = True
 
    def _inject_submit_function(self, code: str) -> str:
        """Inject SUBMIT() so the LLM can signal completion.
 
        Returns a special dict marker that we detect in the result.
        """
        submit_func = '''
def SUBMIT(**kwargs):
    """Call this when you have the final answer."""
    return {"__submit__": kwargs}
'''
        return submit_func + "\n" + code
 
    def _strip_markdown_fences(self, code: str) -> str:
        """Strip markdown code fences if the LLM wrapped the code in them."""
        import re
        # Match ```python or ``` at start, and ``` at end
        pattern = r'^```(?:python)?\s*\n?(.*?)\n?```$'
        match = re.match(pattern, code.strip(), re.DOTALL)
        return match.group(1) if match else code
 
    def execute(self, code: str, variables: dict[str, Any] | None = None) -> Any:
        """Execute code via mcp-use and return results for RLM.
 
        This is the main interface called by RLM on each iteration.
        """
        self.start()
 
        # Strip markdown code fences if present (LLMs often wrap code in ```)
        code = self._strip_markdown_fences(code)
 
        # Inject SUBMIT function
        code = self._inject_submit_function(code)
 
        # Inject any variables from previous iterations
        if variables:
            injections = [f"{k} = {json.dumps(v)}" for k, v in variables.items()]
            code = "\n".join(injections) + "\n" + code
 
        # Execute via mcp-use's sandbox
        try:
            result = self._run_async(self.client.execute_code(code, timeout=60.0))
        except Exception as e:
            raise CodeInterpreterError(f"MCP execution error: {e}")
 
        # Handle errors
        if result.get('error'):
            error_msg = result['error']
            if 'SyntaxError' in error_msg:
                raise SyntaxError(error_msg)
            raise CodeInterpreterError(error_msg)
 
        # Check if LLM called SUBMIT() - this signals completion
        if isinstance(result.get('result'), dict) and '__submit__' in result['result']:
            return FinalOutput(result['result']['__submit__'])
 
        # Otherwise, format output for next iteration
        output_parts = []
        if result.get('logs'):
            output_parts.extend(result['logs'])
        if result.get('result') is not None:
            if isinstance(result['result'], (dict, list)):
                output_parts.append(json.dumps(result['result'], indent=2))
            else:
                output_parts.append(str(result['result']))
 
        return "\n".join(output_parts) if output_parts else ""
 
    def shutdown(self) -> None:
        """Clean up MCP sessions."""
        if self._started:
            self._run_async(self.client.close_all_sessions())
            self._started = False

The key insight: when the LLM calls SUBMIT(result=...), it returns {"__submit__": {"result": ...}}. We detect this marker and wrap it in FinalOutput, which tells RLM to stop iterating and return the result.

Key Components Explained

1. The SUBMIT() Pattern

The SUBMIT() function is how the LLM signals “I have the answer”:

# LLM generates this code:
data = await server.fetch_data(query="...")
analysis = process_data(data)
return SUBMIT(result=f"Analysis complete: {analysis}")

Without SUBMIT(), the code output goes back to the LLM for another iteration. This lets the LLM explore, gather data across multiple iterations, and only commit when ready.

2. Variable Scoping

Important caveat: DSPy’s default RLM interpreters maintain state between iterations, but mcp-use’s execute_code() sandbox does not. Each call is a fresh environment, variables don’t persist sadly.

This means the LLM must do everything in a single code block:

# WRONG - variables don't persist between execute_code() calls:
# Block 1: data = await server.fetch()
# Block 2: return SUBMIT(result=data)  # Error: 'data' not defined!
 
# CORRECT - everything in one block:
data = await server.fetch()
return SUBMIT(result=data)

Make sure your signature instructions clearly tell the LLM to complete its work (including SUBMIT()) in one code block.

3. Error Handling

When code fails, RLM shows the error to the LLM:

Error: NameError: name 'foo' is not defined

The LLM can then fix its code and try again. This iterative refinement is powerful, the LLM learns from its mistakes within the same task.

Putting It All Together

We need one more piece: a signature that tells the LLM what tools are available. We fetch this from mcp-use and build a dynamic signature:

async def get_tools_description(client: MCPClient) -> tuple[list[str], str]:
    """Fetch available MCP tools and format them for the LLM."""
    tools_info = await client.search_tools("", detail_level="descriptions")
 
    # Filter out code_mode namespace (that's our execution environment)
    tools = [t for t in tools_info['results'] if t['server'] != 'code_mode']
    namespaces = [n for n in tools_info['meta']['namespaces'] if n != 'code_mode']
 
    tools_description = "\n".join([
        f"- {t['server']}.{t['name']}: {t.get('description', '')[:100]}"
        for t in tools
    ])
    return namespaces, tools_description
 
 
def build_signature(namespaces: list[str], tools_description: str) -> dspy.Signature:
    """Build a signature with tool information baked into the instructions."""
    instructions = f"""You write Python code to accomplish tasks using MCP tools.
 
## Available Servers: {', '.join(namespaces)}
 
## Tools:
{tools_description}
 
## How to Call Tools:
result = await server_name.tool_name(param=value)
print(result)  # See output
 
## CRITICAL - When Done:
You MUST use 'return' with SUBMIT:
return SUBMIT(result="Your answer here")
 
Do NOT just call SUBMIT() - you MUST return it!
"""
    return dspy.Signature(
        {"task": dspy.InputField(desc="The task to accomplish")},
        instructions
    ).append("result", dspy.OutputField(desc="The result"), type_=str)

Now the complete flow:

import asyncio
import dspy
from mcp_use import MCPClient
 
# Configure LLM
dspy.configure(lm=dspy.LM("openai/gpt-4o"))
 
async def main():
    # Initialize mcp-use with code_mode enabled
    client = MCPClient(config="config.json", code_mode=True)
    await client.create_all_sessions()
 
    try:
        # Get available tools
        namespaces, tools_desc = await get_tools_description(client)
 
        # Create our custom interpreter
        interpreter = MCPCodeInterpreter(client)
        interpreter._started = True  # Sessions already started
 
        # Build signature with tool info
        signature = build_signature(namespaces, tools_desc)
 
        # Create RLM with our interpreter
        rlm = dspy.RLM(
            signature=signature,
            interpreter=interpreter,
            max_iterations=10,
            verbose=True
        )
 
        # Run it
        result = await rlm.aforward(task="Find high-severity incidents")
        print(result.result)
 
    finally:
        await client.close_all_sessions()
 
asyncio.run(main())

RLM vs ReAct: When to Use Each

Aspect	ReAct	RLM + Code Mode
Approach	Pick one tool per step	Write code that calls tools
Context usage	All tool schemas loaded upfront	Progressive discovery (~98% reduction)
Multi-step	One tool at a time, LLM sees each result	Chain calls in single block
Data processing	Limited to tool outputs	Full Python capabilities
Observability	LLM reasons about every intermediate result	Results processed in code, LLM sees summary
Best for	Exploration, anomaly detection	Large toolsets, data aggregation

The Big Win for RLM + Code Mode

Context efficiency. With traditional ReAct, every tool schema gets loaded into the agent’s context window. If you have 50+ tools with detailed parameters, that’s tens of thousands of tokens before you even start. With code mode’s progressive discovery, the agent searches for relevant tools on-demand, keeping context lean.

When ReAct Still Wins

The LLM sees everything. In ReAct, the LLM observes each intermediate result and reasons about it before the next action. This matters when you’re exploring unknown data or looking for anomalies.

A deterministic Python script does exactly what it’s told. If you write incidents = await server.get_incidents(); high_sev = [i for i in incidents if i['severity'] == 'High'], it filters strictly by severity. But what if there’s an incident marked “Medium” that mentions “critical data exfiltration” in the description? The script misses it.

In ReAct, the LLM sees the full output and might notice: “Wait, this Medium severity incident looks serious based on the description. Let me investigate further.” That adaptive reasoning can catch things rigid code won’t.

Use ReAct when:

You’re exploring unfamiliar data and don’t know what to look for
Anomaly detection where unexpected patterns matter
You want explicit reasoning traces for auditing
The tool set is small enough that context isn’t an issue

Use RLM + Code Mode when:

Large tool catalogs that would flood the context window
Well-defined tasks with clear success criteria
Data aggregation across multiple sources
You need to filter/transform large result sets locally

Lessons Learned

After building several agents with RLM + MCP, here are the key takeaways:

Clear tool documentation matters: The LLM can only write good code if it knows what tools exist and their parameters.
SUBMIT() clarity is crucial: Make it very clear in the signature when and how to call SUBMIT(). Otherwise, the LLM might just print results.
Variable scope trips up LLMs: Repeatedly remind the LLM that variables don’t persist. It’s the most common error.
Iteration limits need tuning: Too few iterations and complex tasks fail. Too many and simple tasks waste tokens.
Print statements help: Encourage the LLM to use print() for intermediate results. This helps both debugging and gives context for the next iteration.

Conclusion

DSPy’s RLM module combined with mcp-use’s code mode is a powerful pattern. The insight from Cloudflare, that LLMs are better at writing code than making tool calls, plus mcp-use’s Python implementation of code mode, plus DSPy’s iterative refinement loop creates something greater than the sum of its parts.

The MCPCodeInterpreter bridges RLM and MCP: it takes generated code, executes it in mcp-use’s sandbox where tools are async functions, and returns results for the next iteration. The SUBMIT() pattern provides a clean way to signal completion.

For complex tasks involving multiple data sources, filtering, and aggregation, this approach significantly outperforms traditional ReAct-style agents. The tradeoff is less visibility into reasoning, but when you need power over interpretability, code mode delivers.

References

Cloudflare: Code Mode - The original code mode concept
mcp-use: Code Mode Documentation - Python implementation
DSPy RLM Documentation or RLM Source - DSPy’s Reinforcement Language Model

0xIvan

Explorer