Multi-LLM AI Pentest: Claude + DeepSeek Routing in Pentestas

2026-05-02 · Pentestas Features

Pentestas is a multi-LLM platform by design. Here is how a single scan picks between Claude Haiku, Claude Sonnet, Claude Opus, and DeepSeek — and why it matters for both quality and cost.

Four model tiers, each tuned to a different workload.

Most AI-pentest products pick a single model and stop thinking. Some use GPT-5. Some use Claude Sonnet. A few use a fine-tuned 8B that runs on a graphics card. Whatever the choice, it's a choice for everything — reconnaissance, hypothesis generation, exploit planning, payload synthesis, finding narratives, attack-chain reasoning. One workload, one model.

An AI pentest is not one workload. It is at least seven different workloads sharing a name, and their requirements differ by an order of magnitude on every axis that matters: reasoning depth, context length, latency, dollar cost, and rate-limit budget. No single model is the right fit for all of them.

Pentestas runs every scan as a multi-LLM workflow. The default tier mix is Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.7 (1M-token context), and DeepSeek — with a router that picks per phase and re-picks on rate-limit fallback. This post is the inside view.

⚠️

The Problem

One model is wrong for at least one of your phases

If you only use a small fast model, your hypothesis generation is shallow and your chain reasoning is brittle. If you only use a large reasoning model, your finding narratives cost twenty cents each and your bill at the end of a real scan looks like a B2B SaaS subscription. If you only use the latest frontier model, you also inherit frontier-model rate limits, and one big scan can stall every other scan in your tenant for an hour.

The right move is to match the model to the workload. The wrong move is to make that decision once at the top of the codebase and ship it for everyone.

Pentestas treats LLM choice as a per-phase capability negotiation. Each phase declares what it needs — minimum reasoning level, minimum context length, whether it can stream, whether tool-use matters, how latency-sensitive it is, and what the rate-limit budget looks like. The router maps that to a model tier, with operator overrides at the tenant and per-scan level.

🧠

The Tiers

What each tier is for

Claude Haiku 4.5 — the narrative tier

Haiku gets the high-volume, low-reasoning tasks. Examples: turning a raw finding into a customer-readable description, classifying a log line into INFO / WARNING / ERROR / SECURITY, summarising a long crawl trace into a paragraph, picking which OWASP category a one-line vulnerability description maps to. There are thousands of these per scan. They don't need a million-token context window or step-by-step reasoning — they need to be fast, cheap, and consistent. Haiku 4.5 finishes them in under a second and costs less than a tenth of Sonnet on a per-call basis.

Claude Sonnet 4.6 — the workhorse

Sonnet is the default for everything in the middle: per-category vulnerability hypothesis generation (Injection, XSS, SSRF, Auth, Authz), exploit planning, attack-chain synthesis, tool-use during the exploitation phase. These tasks need real reasoning — "given this attack surface, propose 25 ranked SQLi hypotheses" is not a Haiku task — but they don't need megabytes of context. Sonnet's tool-use stability is what makes the agentic exploitation phase reliable, and it's cheap enough that we can fan out one specialist agent per vuln class without the bill running away.

Claude Opus 4.7 (1M-token context) — the codebase tier

Opus only fires when the customer supplies source code (white-box mode). It reads the entire repository — tree-of-code, configs, infrastructure-as-code, recent commits — in one shot, and produces an architecture deliverable that downstream phases use to filter findings. The 1M-token context is the thing that makes "reachability filtering on retire.js" work: Opus can answer "is this CVE actually reachable from a public route?" for every CVE in the dependency tree without losing the codebase context. Opus is expensive per call but only fires once per scan, so the total cost contribution is small.
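A minimal sketch of what reachability filtering could look like downstream, assuming the architecture deliverable exposes a map from dependency name to the public routes that reach it. The deliverable's shape, the field names, and the `filter_reachable` helper are all hypothetical; the post does not describe the real format.

```python
# Hypothetical shape of the architecture deliverable: package -> public routes
# that reach it. An empty list means "analysed, not reachable".
architecture = {
    "reachability": {
        "jquery": ["/login", "/search"],
        "lodash": [],
    }
}

# Placeholder dependency findings (IDs are examples, not real advisories).
findings = [
    {"id": "EXAMPLE-1", "package": "jquery"},
    {"id": "EXAMPLE-2", "package": "lodash"},
    {"id": "EXAMPLE-3", "package": "moment"},
]


def filter_reachable(findings, architecture):
    """Drop findings whose package was analysed and found unreachable."""
    reach = architecture["reachability"]
    kept = []
    for finding in findings:
        routes = reach.get(finding["package"])
        if routes is None:
            # Package missing from the deliverable: keep it for manual review.
            kept.append({**finding, "reachable_via": None})
        elif routes:
            kept.append({**finding, "reachable_via": routes})
    return kept


kept = filter_reachable(findings, architecture)
```

Keeping findings for packages the deliverable does not mention is a conservative choice made for this sketch: an unknown is treated as "possibly reachable" rather than silently dropped.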

DeepSeek — the cost-controlled batch lane

DeepSeek runs the workloads where you want bulk volume at a fraction of the cost and you don't need frontier-quality reasoning. Examples: scoring 5,000 candidate URLs for "is this a real endpoint or a CDN cache miss", generating boilerplate finding-description templates that Haiku could also handle but DeepSeek does for less, classifying request bodies into "parameter-rich vs. opaque blob". DeepSeek also matters because it's open-weights: tenants who want full data residency can host an inference endpoint inside their own VPC and route their AI workloads through it without the data ever leaving their network.

⚙️

The Router

How the router picks per phase

Every LLM call inside Pentestas goes through a single router that takes the call's capability profile and returns a model handle. The capability profile has six fields:

phase — one of narrative, hypothesis, tool_use, codebase, classify, summary.
min_reasoning — low | medium | high. Hypothesis generation requires medium; chain synthesis requires high.
min_context_tokens — how many tokens of input the call needs to fit. Codebase reasoning declares 200K+; narrative tasks declare 8K.
tool_use — whether the call invokes the platform's tool-use API (HTTP requester, file reader, OAST canary). Critical for exploitation; useless for narratives.
latency_budget_ms — soft target for round-trip time. Live activity-feed updates declare 800ms; offline batch jobs declare 30s.
cost_class — cheap | normal | premium. The tenant's plan tier sets a hard ceiling.
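The six fields above map naturally onto a small typed record. The sketch below is one way to model it, assuming a Python dataclass; the class name, the validation rules, and the example values are illustrative and not taken from the Pentestas codebase.

```python
from dataclasses import dataclass

PHASES = {"narrative", "hypothesis", "tool_use", "codebase", "classify", "summary"}
REASONING_LEVELS = ("low", "medium", "high")
COST_CLASSES = ("cheap", "normal", "premium")


@dataclass(frozen=True)
class CapabilityProfile:
    """What a single LLM call declares it needs (hypothetical model)."""

    phase: str
    min_reasoning: str
    min_context_tokens: int
    tool_use: bool
    latency_budget_ms: int
    cost_class: str

    def __post_init__(self):
        # Reject values outside the enumerations listed in the post.
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase!r}")
        if self.min_reasoning not in REASONING_LEVELS:
            raise ValueError(f"unknown reasoning level: {self.min_reasoning!r}")
        if self.cost_class not in COST_CLASSES:
            raise ValueError(f"unknown cost class: {self.cost_class!r}")
        if self.min_context_tokens <= 0 or self.latency_budget_ms <= 0:
            raise ValueError("token and latency budgets must be positive")


# Example values only: a narrative call that feeds the live activity feed.
narrative_profile = CapabilityProfile(
    phase="narrative",
    min_reasoning="low",
    min_context_tokens=8_000,
    tool_use=False,
    latency_budget_ms=800,
    cost_class="cheap",
)
```

Making the record frozen means a profile cannot be mutated between the phase declaring it and the router reading it.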

The router has a static cost-vs-capability scoreboard for each tier. Given the profile, it returns the cheapest tier whose capabilities meet or exceed every minimum, subject to the cost ceiling. So phase=narrative, min_reasoning=low, min_context_tokens=8K resolves to Haiku, while phase=codebase, min_context_tokens=400K, min_reasoning=high resolves to Opus 1M.
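The "cheapest tier that meets every minimum" rule can be sketched in a few lines. Everything numeric here is a placeholder: the scoreboard entries, context limits, and relative costs are assumptions chosen so the two examples from the text resolve as described, and DeepSeek is left out because the post does not say where it sits on the default scoreboard. The sketch also ignores the phase and latency fields for brevity.

```python
from dataclasses import dataclass

REASONING_RANK = {"low": 0, "medium": 1, "high": 2}
COST_RANK = {"cheap": 0, "normal": 1, "premium": 2}


@dataclass(frozen=True)
class Tier:
    name: str
    reasoning: str
    max_context_tokens: int
    tool_use: bool
    cost_class: str
    relative_cost: float  # arbitrary unit; only the ordering matters


# Placeholder scoreboard, not Pentestas's real numbers.
SCOREBOARD = [
    Tier("haiku", "low", 200_000, True, "cheap", 1.0),
    Tier("sonnet", "high", 200_000, True, "normal", 10.0),
    Tier("opus-1m", "high", 1_000_000, True, "premium", 50.0),
]


def select_tier(min_reasoning, min_context_tokens, tool_use, cost_ceiling):
    """Return the cheapest tier meeting every minimum under the cost ceiling."""
    candidates = [
        tier
        for tier in SCOREBOARD
        if REASONING_RANK[tier.reasoning] >= REASONING_RANK[min_reasoning]
        and tier.max_context_tokens >= min_context_tokens
        and (tier.tool_use or not tool_use)
        and COST_RANK[tier.cost_class] <= COST_RANK[cost_ceiling]
    ]
    if not candidates:
        raise LookupError("no tier satisfies the profile under the cost ceiling")
    return min(candidates, key=lambda tier: tier.relative_cost)


narrative = select_tier("low", 8_000, False, "premium")
codebase = select_tier("high", 400_000, False, "premium")
```

With a `normal` cost ceiling the codebase call has no valid tier in this sketch, so it raises instead of silently downgrading; how the real router handles that case is not stated in the post.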

Tenant overrides take precedence. A tenant on the BYOK plan can force every phase=classify call to DeepSeek to keep cost predictable; a regulated-industry tenant can pin everything to Anthropic so audit trails sit with one provider.
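One way to picture the two override styles mentioned here: a per-phase pin (the BYOK example) and a provider allow-list (the regulated-industry example). The `TenantPolicy` type, the `resolve` function, and the provider map are hypothetical names for this sketch; the real override mechanism is not documented in the post.

```python
from dataclasses import dataclass, field

# Assumed mapping of tier names to providers, for illustration.
PROVIDER_OF = {
    "haiku": "anthropic",
    "sonnet": "anthropic",
    "opus-1m": "anthropic",
    "deepseek": "deepseek",
}


@dataclass
class TenantPolicy:
    # phase -> tier name the tenant forces for that phase
    phase_pins: dict = field(default_factory=dict)
    # providers the tenant allows the router to use
    allowed_providers: frozenset = frozenset({"anthropic", "deepseek"})


def resolve(phase, ranked_candidates, policy):
    """Apply tenant overrides on top of the router's cheapest-first ranking."""
    pinned = policy.phase_pins.get(phase)
    if pinned is not None:
        return pinned  # the tenant's pin wins over the router's choice
    for tier in ranked_candidates:
        if PROVIDER_OF[tier] in policy.allowed_providers:
            return tier
    raise LookupError("no candidate tier from an allowed provider")


byok = TenantPolicy(phase_pins={"classify": "deepseek"})
regulated = TenantPolicy(allowed_providers=frozenset({"anthropic"}))
```

Here `resolve("classify", ["haiku", "sonnet"], byok)` returns the pinned DeepSeek tier, while the regulated policy skips any DeepSeek candidate and takes the first Anthropic one.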

📞

Resilience

Rate-limit aware fallback

Frontier models have rate limits. Anthropic's subscription plans publish a 5-hour rolling token bucket; the API plans have per-organisation per-minute caps. A single big scan against a complex SaaS can fire thousands of LLM calls and trip those caps mid-run.

When the router gets a 429 from the primary tier, it doesn't just retry — it falls back along a pre-declared chain. Sonnet → Haiku → DeepSeek is the default for hypothesis-class calls; Haiku → DeepSeek for narratives. Each fallback step gets a confidence-scaling note attached to the resulting outputs (a Haiku-produced hypothesis isn't presented as a Sonnet-produced one), so downstream filtering can apply different thresholds.

If the entire fallback chain trips, the scan pauses with a clear message in the live feed ("LLM rate-limit reached, waiting for token bucket refill") instead of failing silently. The operator can stop the scan and resume later, switch providers, or wait it out.
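The fallback behaviour described in this section can be sketched as a loop over a pre-declared chain. The two chains and the pause message come from the text; the exception classes, the `call_model` callback, and the shape of the returned record are assumptions made for illustration.

```python
class RateLimited(Exception):
    """Stand-in for an HTTP 429 from a provider."""


class ScanPaused(Exception):
    """Raised when every tier in the chain is rate-limited."""


# Default chains as described in the post.
FALLBACK_CHAINS = {
    "hypothesis": ["sonnet", "haiku", "deepseek"],
    "narrative": ["haiku", "deepseek"],
}


def call_with_fallback(call_class, prompt, call_model):
    """Try each tier in order; tag the output with the tier that produced it."""
    chain = FALLBACK_CHAINS[call_class]
    for step, tier in enumerate(chain):
        try:
            text = call_model(tier, prompt)
        except RateLimited:
            continue  # move to the next tier instead of retrying this one
        return {
            "text": text,
            "tier": tier,
            "primary_tier": chain[0],
            # Downstream filters can tighten thresholds when this is > 0.
            "fallback_steps": step,
        }
    raise ScanPaused("LLM rate-limit reached, waiting for token bucket refill")


def fake_model(tier, prompt):
    # Simulated provider: Sonnet is rate-limited, everything else answers.
    if tier == "sonnet":
        raise RateLimited()
    return f"{tier} answer to: {prompt}"


result = call_with_fallback("hypothesis", "rank SQLi hypotheses", fake_model)
```

In this run the record carries `tier="haiku"` and `fallback_steps=1`, which is the kind of note that lets a Haiku-produced hypothesis be treated differently from a Sonnet-produced one.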

💰

Cost

What this saves you

Pentestas's internal benchmark on a medium-complexity SaaS (37 endpoints, 4 user roles, 2,400-page crawl) shows the multi-LLM router cuts total LLM cost by approximately 62% versus a Sonnet-everywhere baseline, while improving the white-box-mode quality scores by ~14% (because the router can afford to use Opus 1M for codebase reasoning rather than chunking it into Sonnet-sized windows).

In dollar terms: a typical authenticated medium-complexity scan on the BYOK plan finishes for $4–$8 of LLM spend rather than $12–$22. Customers running Pentestas on bring-your-own-Anthropic-key configurations report single-digit-dollar AI bills per scan at this point.
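As a quick sanity check, the two figures line up if the 62% reduction is applied to the $12–$22 range. This assumes that range is the Sonnet-everywhere baseline and that the same reduction holds across it, which the benchmark description implies but does not state outright.

```python
def routed_cost(baseline_usd, reduction=0.62):
    """Cost after applying the reported reduction to a baseline figure."""
    return round(baseline_usd * (1 - reduction), 2)


low = routed_cost(12)   # roughly 4.56
high = routed_cost(22)  # roughly 8.36
```

Both ends land close to the quoted $4–$8 range.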

🚀

Pick Your Stack

Configuring it

The new-scan form has a single scan_mode = hybrid option that turns on multi-LLM routing. Per-scan AI overrides let you pin a single provider (ai_provider_override = anthropic | deepseek) or a specific model (ai_model_override = claude-sonnet-4-6) when you're testing how a particular brain handles a particular target. Tenants on BYOK plans bring their own keys; the keys are encrypted at rest with the per-tenant Fernet key and never traverse the Celery broker in plaintext.
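For illustration, the three form fields could be assembled and checked like this. The field names and the allowed provider values come from the paragraph above; the `build_scan_options` helper and the idea of passing them around as a dictionary are assumptions, since the post does not show how the form is submitted.

```python
ALLOWED_PROVIDERS = {"anthropic", "deepseek"}


def build_scan_options(scan_mode="hybrid",
                       ai_provider_override=None,
                       ai_model_override=None):
    """Assemble per-scan AI options, rejecting unknown providers."""
    if (ai_provider_override is not None
            and ai_provider_override not in ALLOWED_PROVIDERS):
        raise ValueError(f"unsupported provider: {ai_provider_override!r}")
    options = {"scan_mode": scan_mode}
    # Only include overrides that were actually set.
    if ai_provider_override is not None:
        options["ai_provider_override"] = ai_provider_override
    if ai_model_override is not None:
        options["ai_model_override"] = ai_model_override
    return options


pinned_model = build_scan_options(ai_model_override="claude-sonnet-4-6")
pinned_provider = build_scan_options(ai_provider_override="deepseek")
```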

Run a multi-LLM AI pentest

Free tier includes 10 scans/month on a verified domain. Bring your Anthropic + DeepSeek keys and pay the LLM provider directly.


Why this matters when buying pentesting-as-a-service

Pentestas is a pentesting-as-a-service offering: an AI penetration testing system that scans web apps, APIs, mobile binaries, cloud accounts, and internal networks under one platform. By default it uses Claude for triage and exploit-chain narration and switches to DeepSeek for cost-sensitive bulk passes. Both modes go through the same accuracy gate, the same destructive-payload guard, and the same reporting pipeline, so a B2B SaaS pentest you run today and one you run six months from now produce comparable, auditable results.

If you've previously bought one-off engagements and you're comparing them against penetration testing with AI, the trade-offs in this post are the ones to read against your last consulting report.


Why We Run Both Claude and DeepSeek — and How the Router Picks Which Brain Solves What


Alexander Sverdlov
