We pointed Capframe at the Damn Vulnerable MCP Server (and it found a gap in itself)
Dogfooding the Find module against a purpose-built insecure MCP server. It flagged all 10 tools for unconstrained input — but rated an arbitrary-shell-execution tool identically to a username lookup. Here's the gap that surfaced, the rule we added, and what the scanner still misses.
The best way to find out whether a security scanner is any good is to point it at something that's actually insecure. So I pointed Capframe's Find module at the Damn Vulnerable MCP Server (DVMCP) — a deliberately-broken MCP server with 10 challenges, the MCP equivalent of DVWA. Built to be exploited, so scanning it is fair game.
The result was useful in a way I didn't expect: it found a real gap in Capframe's own rules. This post is the honest walk-through.
Setup
DVMCP's challenges are Python servers. I built a faithful mcp-recon.inventory.v1 inventory from the actual tool signatures across four challenges: prompt injection (challenge 1), excessive permissions (3), code execution (8), and remote access (9). Real tool names, real parameters: execute_shell_command(command: string), read_file(filename: string), port_scan(host, port), and so on. None declare side-effects, auth, or input constraints, which is faithful to a server that's meant to be wrong.
The inventory is committed at
examples/dvmcp.inventory.json,
so everything below reproduces with:
capframe find examples/dvmcp.inventory.json --out findings.json --pretty
First run: correct, and yet clearly wrong
total: 10 | by severity: { critical: 0, high: 0, medium: 10 }
by rule: { r1: 10 }
Every one of the ten tools got flagged for unconstrained string input (rule R1). That's correct — none of them bound their inputs, and an unconstrained string is a real injection/payload surface.
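To make the R1 signal concrete, here is a minimal sketch of an unconstrained-string check. It is not Capframe's actual implementation: the `parameters` layout and the list of constraint keys (borrowed from JSON Schema) are assumptions made for illustration.

```python
from typing import Any

# Assumption: parameters are described JSON-Schema style, and any of these
# keys counts as "bounding" a string. The real R1 may check a different set.
CONSTRAINT_KEYS = ("maxLength", "pattern", "enum", "format", "const")


def unconstrained_string_params(tool: dict[str, Any]) -> list[str]:
    """Return the names of string parameters that declare no constraint."""
    params = tool.get("parameters", {})
    return [
        name
        for name, schema in params.items()
        if schema.get("type") == "string"
        and not any(key in schema for key in CONSTRAINT_KEYS)
    ]


# Two tools named in the post, plus a hypothetical bounded variant.
shell_tool = {
    "name": "execute_shell_command",
    "parameters": {"command": {"type": "string"}},
}
lookup_tool = {
    "name": "get_user_info",
    "parameters": {"username": {"type": "string"}},
}
bounded_tool = {
    "name": "get_user_info",
    "parameters": {
        "username": {"type": "string", "maxLength": 64, "pattern": "^[a-z0-9_]+$"}
    },
}

for tool in (shell_tool, lookup_tool, bounded_tool):
    print(tool["name"], unconstrained_string_params(tool))
```

A check like this sees only the shape of the input, so the shell tool and the username lookup look identical to it. That is exactly the problem in the severity column.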
But look at the severity column: everything is medium. Which means the scanner rated this:
execute_shell_command(command: string) → medium
exactly the same as this:
get_user_info(username: string) → medium
One of those runs arbitrary shell commands on the host. The other looks up a username. Rating them identically is not a small miss — it's the scanner failing at the one job that matters most, which is telling you what to panic about first.
The gap
I went looking for why. Capframe's classifier had six rules (R1–R6): unconstrained input, missing auth on side-effecting tools, side-effect/name mismatch, unbounded money params, undeclared-money, and external-fetch surfaces. Reasonable coverage — except none of them fire on "this tool executes code."
Worse, the two rules that should have caught it both no-op'd here:
- R2 (missing auth on side-effecting tool) gates on declared side-effects. DVMCP declares none — so R2 had nothing to key on. A tool that doesn't admit it has side effects sails straight past the side-effect rule.
- R3 (name implies a side-effect not declared) keys on a list of mutation verbs (delete, send, refund, …). "execute" wasn't in the list. Neither was "exec" or "shell."
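The two no-ops above can be sketched in a few lines. This is a simplified illustration, not the real rule code: the `side_effects` and `auth` field names are hypothetical, and the verb list contains only the three verbs named in this post.

```python
# Assumption: only the verbs quoted in the post; the real list is longer.
MUTATION_VERBS = ("delete", "send", "refund")


def r2_missing_auth(tool: dict) -> bool:
    """Fire when a tool declares side-effects but no auth."""
    declared = tool.get("side_effects") or []
    if not declared:
        # Nothing declared, so the rule has nothing to gate on.
        return False
    return not tool.get("auth")


def r3_name_mismatch(tool: dict, verbs: tuple = MUTATION_VERBS) -> bool:
    """Fire when the name implies a mutation that is not declared."""
    declared = tool.get("side_effects") or []
    name = tool["name"].lower()
    return not declared and any(verb in name for verb in verbs)


# DVMCP-style tool: no side-effects, no auth, nothing declared at all.
shell_tool = {"name": "execute_shell_command"}

print(r2_missing_auth(shell_tool))   # R2 has nothing to key on
print(r3_name_mismatch(shell_tool))  # "execute" is not in the verb list
print(r3_name_mismatch(shell_tool, MUTATION_VERBS + ("execute",)))
```

Extending the verb list makes R3 fire in this sketch, but it would still classify the tool as a name mismatch rather than as code execution, which is why a dedicated rule is the cleaner fix.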
So the single most dangerous category in the whole space — arbitrary code execution — had no dedicated detector, and the adjacent rules were defeated by a server simply declining to describe itself honestly. Which, of course, is exactly what a real vulnerable server does.
The fix: R7
I added a seventh rule. R7 fires Critical when a tool's name or description
implies code, shell, or subprocess execution — execute_, shell command,
eval(, subprocess, arbitrary code, and friends. Crucially, it ignores
declared side-effects entirely. The whole lesson from DVMCP is that the
declaration can't be trusted, so R7 keys only on the strongest available signal:
the tool is telling you, in its own name and docstring, that it runs code.
The rule was added test-first — three unit tests (shell tool → critical, code execution by description, benign tool stays silent) plus an integration test that locks the DVMCP numbers so this post's figures can't silently drift.
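A rule of this shape might look roughly like the sketch below. The pattern list uses only the tokens quoted above, the record fields are hypothetical, and the real R7 almost certainly differs in detail. The three example tools mirror the three unit-test cases; `run_snippet` and the descriptions are invented for illustration.

```python
import re
from typing import Optional

# Assumption: only the tokens named in the post; "and friends" is omitted.
EXEC_SIGNALS = re.compile(
    r"execute_|shell command|eval\(|subprocess|arbitrary code",
    re.IGNORECASE,
)


def r7_code_execution(tool: dict) -> Optional[dict]:
    """Return a critical finding if the name or description implies code execution.

    Declared side-effects are deliberately never consulted.
    """
    text = f"{tool.get('name', '')} {tool.get('description', '')}"
    match = EXEC_SIGNALS.search(text)
    if match is None:
        return None
    return {
        "rule": "r7",
        "severity": "critical",
        "tool": tool["name"],
        "signal": match.group(0),
    }


shell_tool = {"name": "execute_shell_command", "description": "Run a command."}
by_description = {
    "name": "run_snippet",  # hypothetical tool name
    "description": "Runs arbitrary code supplied by the caller.",
}
benign = {"name": "get_user_info", "description": "Look up a user by username."}

for tool in (shell_tool, by_description, benign):
    print(tool["name"], r7_code_execution(tool))
```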
Second run
total: 12 | by severity: { critical: 2, high: 0, medium: 10 }
critical: [ execute_python_code, execute_shell_command ]
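Now the two RCE tools sit at the top of the report where they belong. The total rises from 10 to 12 because each of them now carries an R7 critical alongside its existing R1 medium; the other eight tools keep only their medium unconstrained-input finding. That's an honest severity picture: these two will get you owned; the rest are hygiene.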
What R7 still misses (because a security post without this is marketing)
Dogfooding doesn't stop at one rule. Two things the scanner still under-rates on DVMCP, both now tracked as work:
- Command injection through "safe-looking" parameters. Challenge 9's network_diagnostic(target, options) shells out with options interpolated in. The description says "run comprehensive network diagnostics", with no execute token, so R7 doesn't fire. Catching this needs taint-style reasoning about which string params reach a shell, which a name/description heuristic can't do alone.
- The "declares nothing" problem in general. R2 and R3 were both defeated by missing declarations. There's a case for a rule that flags any tool which declares no side-effects at all as "unclassified, review," rather than assuming absence means safe. That's a future R8.
Both are now good-first-issues if anyone wants them.
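For the first of these, here is a rough idea of what "which string params reach a shell" could mean when the server's source is available. It is a flow-insensitive sketch built on Python's standard ast module. The sink list is an assumption, and the sample function is a hypothetical stand-in, not DVMCP's actual code. It tracks only simple assignments, so it is nowhere near a real taint analysis.

```python
import ast

# Assumption: a small, non-exhaustive set of shell/subprocess sinks.
SHELL_SINKS = {
    ("os", "system"), ("os", "popen"),
    ("subprocess", "run"), ("subprocess", "call"),
    ("subprocess", "check_output"), ("subprocess", "Popen"),
}


def _names(node: ast.AST) -> set[str]:
    """All bare identifiers referenced anywhere inside a node."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}


def _is_sink(func: ast.AST) -> bool:
    return (
        isinstance(func, ast.Attribute)
        and isinstance(func.value, ast.Name)
        and (func.value.id, func.attr) in SHELL_SINKS
    )


def params_reaching_shell(source: str) -> dict[str, set[str]]:
    """Map each function name to the parameters that flow into a shell sink."""
    hits: dict[str, set[str]] = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # origins[var] = parameters that var was (transitively) built from.
        origins = {a.arg: {a.arg} for a in func.args.args}
        changed = True
        while changed:  # iterate simple assignments to a fixed point
            changed = False
            for node in ast.walk(func):
                if not isinstance(node, ast.Assign):
                    continue
                derived: set[str] = set()
                for name in _names(node.value):
                    derived |= origins.get(name, set())
                for target in node.targets:
                    if isinstance(target, ast.Name) and not derived <= origins.get(target.id, set()):
                        origins.setdefault(target.id, set()).update(derived)
                        changed = True
        reached: set[str] = set()
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and _is_sink(node.func):
                for arg in list(node.args) + [kw.value for kw in node.keywords]:
                    for name in _names(arg):
                        reached |= origins.get(name, set())
        if reached:
            hits[func.name] = reached
    return hits


# Hypothetical stand-in for a tool that interpolates params into a shell string.
SAMPLE = '''
import subprocess

def network_diagnostic(target, options):
    command = f"ping -c 1 {options} {target}"
    return subprocess.run(command, shell=True, capture_output=True, text=True)

def get_user_info(username):
    return {"username": username}
'''

print(params_reaching_shell(SAMPLE))
```

This only works when the scanner can read the server's code, which an inventory of tool signatures alone does not provide. That is part of why the gap is harder to close than R7 was.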
The actual point
A heuristic classifier is a floor, not a ceiling. It won't catch everything, and the moment you claim it does, a server like DVMCP makes you look silly. What makes it useful is two things:
It's extensible by anyone. R7 is ~40 lines + four tests. Each rule maps a detectable signal to a severity and a set of OWASP/NIST/MITRE IDs. The gap I found took an afternoon to close, and the next person's gap will too.
The output is a shared schema. Every finding R7 emits is a findings.v1 record: same envelope as R1, same envelope any other tool could emit. The scanner improving doesn't change the contract; it just fills it in better.
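Both points can be sketched together: a rule as a signal plus a severity and a set of framework IDs, and every rule emitting the same record shape. The field names, the reference placeholders, and the simplified signals below are all hypothetical. The real findings.v1 schema is not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    severity: str
    references: tuple[str, ...]     # framework IDs; placeholders in this sketch
    signal: Callable[[dict], bool]  # the detectable signal


@dataclass(frozen=True)
class Finding:
    schema: str
    rule: str
    severity: str
    tool: str
    references: tuple[str, ...]


def run_rules(tools: list[dict], rules: list[Rule]) -> list[Finding]:
    """Apply every rule to every tool; each hit becomes the same record shape."""
    return [
        Finding("findings.v1", rule.rule_id, rule.severity, tool["name"], rule.references)
        for tool in tools
        for rule in rules
        if rule.signal(tool)
    ]


# Deliberately simplified signals, for illustration only.
RULES = [
    Rule("r1", "medium", ("REF-PLACEHOLDER-1",),
         lambda t: any(p.get("type") == "string" for p in t.get("parameters", {}).values())),
    Rule("r7", "critical", ("REF-PLACEHOLDER-2",),
         lambda t: t["name"].startswith("execute_")),
]

TOOLS = [
    {"name": "execute_shell_command", "parameters": {"command": {"type": "string"}}},
    {"name": "get_user_info", "parameters": {"username": {"type": "string"}}},
]

findings = run_rules(TOOLS, RULES)
for finding in findings:
    print(finding)
print(Counter(f.severity for f in findings))
```

In this sketch the shell tool ends up with two records, one per rule that fired, and adding a rule means appending one entry to the list without touching the record shape.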
The DVMCP scan shipped in mcp-recon v0.0.5, example inventory and all. Run it yourself:
curl -fsSL capframe.ai/install | sh
capframe install find
capframe find <(curl -s https://raw.githubusercontent.com/euanmcrosson-dotcom/mcp-recon/master/examples/dvmcp.inventory.json) --pretty
If you find a tool surface that gets past all seven rules — and you will — that's not a gotcha, it's the next rule. Open an issue.