Behind Cloudflare? Pentestas Finds the Origin and Scans It

2026-05-02 · Pentestas Features

Cloudflare in front of your customer's app shouldn't mean you can't pentest it. Here is how the "Behind Cloudflare" mode discovers the real origin and pins every request to it.

The edge blocks. The origin doesn't.

Cloudflare sits in front of roughly 20% of the public web. For an authorised penetration test, that's a problem: the edge layer rewrites paths, blocks payloads, throws challenge pages at non-browser User-Agents, rate-limits aggressively, and entirely hides the response shape the scanner needs to detect vulnerabilities. A scan that runs through the edge isn't a scan of your customer's application — it's a scan of Cloudflare's defence in depth.

Pentestas ships a one-checkbox Behind Cloudflare mode that solves this with the same play any working pentester would use: discover the real origin IP, connect to it directly, and keep the customer's hostname in the Host header and TLS SNI so the cert validates and the right virtual host responds.

This post walks through what's behind that checkbox: the four origin-discovery channels, how candidates are verified, how we pin the HTTP client without breaking SNI, and what happens when no origin can be found.

⚠️

The Problem

Why scanning through the edge is a waste of money

Run a typical web-app scanner against a Cloudflare-fronted target and three things happen, all bad:

Payloads get filtered before they reach the app. Cloudflare's managed ruleset blocks the payload classes scanners care most about — SQLi, XSS, SSRF, path traversal — on the way in. The scan reports clean. The application is not clean. The result is a misleading PDF that auditors love and attackers ignore.

Bot management mangles the response shape. Even when a payload gets through, the edge can return a JavaScript challenge page (the interstitial shown in I'm Under Attack mode), a CAPTCHA, or a tarpit response with a misleading status code. Your detector sees a 403 or 503 carrying a cf-mitigated: challenge header instead of the application's response, and concludes nothing.

Rate-limit triggers cap your scan. A real authenticated scan fires hundreds of requests per minute against a target. Cloudflare's default rate-limiter and bot-score model cut you off long before you finish the OWASP Top 10. Half-finished scans produce half-true reports.

The right move — the one a human pentester does in their first hour — is to bypass the edge entirely by talking to the origin server directly. The customer authorised testing of their application. They didn't authorise a load test against Cloudflare's WAF.

🔍

The Breakdown

Four channels we use to find the real origin

There is no single "origin lookup" API. We probe four independent channels in parallel, each cheap to query and each prone to leaking the same secret. When two channels agree on a candidate, our confidence goes up; when one channel produces a verifiable hit, we're done. None of the channels needs the customer to install anything — this is all public-data origin discovery, the same surface attackers see.

1. Direct AAAA records on the apex

Operators often proxy the A record and leave the AAAA record unproxied (DNS-only), especially on older zones where IPv6 was added by hand and never revisited. The result is that A app.example.com resolves to 104.16.5.5 (Cloudflare edge), but AAAA app.example.com resolves directly to the origin's IPv6 address. We always check the AAAA record first: it's the cheapest and highest-yield channel.
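A minimal sketch of the AAAA check, using only the Python standard library. The range list is a small illustrative sample, not Cloudflare's full published list, and the function names are hypothetical rather than Pentestas internals.

```python
import ipaddress
import socket

# Partial, possibly stale sample of Cloudflare edge ranges (assumption for
# illustration). A real check should load the current list Cloudflare publishes.
CLOUDFLARE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("104.16.0.0/13", "172.64.0.0/13", "2606:4700::/32", "2400:cb00::/32")
]


def is_cloudflare_edge(ip: str) -> bool:
    """Return True if ip falls inside one of the sample edge ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in CLOUDFLARE_RANGES)


def resolve_aaaa(hostname: str) -> list[str]:
    """Resolve IPv6 addresses for hostname through the system resolver."""
    try:
        infos = socket.getaddrinfo(
            hostname, 443, family=socket.AF_INET6, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []  # no AAAA record, or resolution failed
    return sorted({info[4][0] for info in infos})


def aaaa_origin_candidates(hostname: str, resolver=resolve_aaaa) -> list[str]:
    """Keep only IPv6 answers that are not Cloudflare edge addresses."""
    return [ip for ip in resolver(hostname) if not is_cloudflare_edge(ip)]
```

Passing the resolver in as a parameter keeps the filtering logic testable without live DNS.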

2. Origin-leak subdomains

Operators put the customer-facing host behind Cloudflare and forget about the rest. mail.example.com, smtp.example.com, cpanel.example.com, ftp.example.com, webmail.example.com, staging., dev., admin., portal., backup., old., internal., and 20 other "originally we just typed it into DNS and never came back to it" subdomains routinely point straight at the real box. Pentestas resolves all of them, throws away anything in Cloudflare's published edge ranges, and keeps the rest as candidates.
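The subdomain sweep can be sketched roughly as follows. The prefix list is a subset of the names mentioned above, and `resolve` and `is_edge` are caller-supplied callables (hypothetical names), so the example makes no claim about how Pentestas performs DNS lookups or range checks internally.

```python
from collections import defaultdict
from typing import Callable, Iterable

# Illustrative subset of the prefixes named above; the full wordlist is not
# given in this post.
LEAKY_PREFIXES = ("mail", "smtp", "cpanel", "ftp", "webmail", "staging", "dev", "admin")


def subdomain_candidates(
    apex: str,
    resolve: Callable[[str], Iterable[str]],
    is_edge: Callable[[str], bool],
    prefixes: Iterable[str] = LEAKY_PREFIXES,
) -> dict[str, list[str]]:
    """Map each non-edge IP to the hostnames that resolved to it."""
    found: dict[str, list[str]] = defaultdict(list)
    for prefix in prefixes:
        host = f"{prefix}.{apex}"
        for ip in resolve(host):
            if not is_edge(ip):  # throw away Cloudflare edge answers
                found[ip].append(host)
    return dict(found)
```

Keeping the hostnames per IP makes it easy to see when several forgotten subdomains point at the same box, which is a useful hint before verification.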

3. SPF / TXT records on the apex

Email infrastructure is the great betrayer of origin secrecy. v=spf1 ip4:198.51.100.42 ip4:203.0.113.0/24 -all publishes every IP that's allowed to send mail on behalf of the domain — and on most small-to-mid-sized deployments, the mail server runs on the same VPS as the web origin, or at least in the same /24. We extract every ip4: and ip6: directive, drop wide ranges, keep the single-host /32 + /128 entries, and verify each.
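As a rough sketch of the SPF step, the function below pulls ip4: and ip6: mechanisms out of a record that has already been fetched and keeps only single-host entries. The function name is hypothetical, and fetching the TXT record itself is left out.

```python
import ipaddress


def spf_single_hosts(txt_record: str) -> list[str]:
    """Extract ip4:/ip6: mechanisms from an SPF record, keeping single hosts only."""
    hosts: list[str] = []
    if not txt_record.lower().startswith("v=spf1"):
        return hosts  # not an SPF record
    for term in txt_record.split()[1:]:
        mechanism = term.lstrip("+-~?")  # drop the optional qualifier
        name, _, value = mechanism.partition(":")
        if name.lower() not in ("ip4", "ip6") or not value:
            continue
        try:
            network = ipaddress.ip_network(value, strict=False)
        except ValueError:
            continue  # malformed directive, skip it
        if network.num_addresses == 1:  # /32 or /128, explicit or implied
            hosts.append(str(network.network_address))
    return hosts
```

For the record shown above, this keeps 198.51.100.42 and drops the /24.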

4. Certificate Transparency (crt.sh)

Virtually every publicly trusted TLS cert issued for example.com ends up in the public Certificate Transparency logs. The Subject Alternative Name list often reveals operator-stamped origin hostnames you would never guess from DNS enumeration alone: origin.internal.example.com, prod-direct-1.example.com, edge-bypass.example.com. We pull the JSON feed from crt.sh, dedupe SANs that are subdomains of the customer's apex, resolve them through normal DNS, and feed any non-Cloudflare answer into the candidate list.
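A sketch of the Certificate Transparency step, assuming crt.sh's JSON output in which each entry carries a newline-separated `name_value` field. The helper names are hypothetical, and wildcard handling here is one reasonable choice, not necessarily the one Pentestas makes.

```python
import json
import urllib.parse
import urllib.request


def fetch_crtsh(apex: str, timeout: float = 30.0) -> str:
    """Fetch crt.sh's JSON output for certificates matching %.apex."""
    query = urllib.parse.urlencode({"q": f"%.{apex}", "output": "json"})
    with urllib.request.urlopen(f"https://crt.sh/?{query}", timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def san_hostnames(crtsh_json: str, apex: str) -> list[str]:
    """Return deduplicated, non-wildcard names under apex from crt.sh JSON."""
    suffix = "." + apex.lower()
    names: set[str] = set()
    for entry in json.loads(crtsh_json):
        # name_value holds one or more names separated by newlines
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]  # a wildcard tells us the parent label exists
            if name.endswith(suffix):
                names.add(name)
    return sorted(names)
```

The returned hostnames would then go through ordinary DNS resolution and the same edge-range filter as the other channels.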

⚙️

How It Works

Verification — not every candidate is the right one

Discovery produces a list. Verification picks the winner. We don't trust DNS membership alone for two reasons: shared-IP hosting (the IP serves dozens of vhosts and we'd land on the wrong one), and IPs that used to be the origin but no longer route to the customer's app.

For each candidate IP we send a single HTTPS request with the customer's hostname forced into both the Host: header and the TLS SNI extension, then compare the response to the CDN-fronted baseline:

Status code parity: an origin that's serving the same app responds with a status in the same family as the CDN-fronted baseline (both 2xx, both 3xx, both 4xx, and so on). If the CDN gave a 200 and the candidate gives a 502, that's not the origin.
Body shape similarity: we reduce each response to a set of tokens (4-32 character lowercase tokens, HTML stripped) and compute the Jaccard similarity between the two sets. The customer's real app shares 60-90% of those tokens with itself fronted through Cloudflare; an unrelated app on the same IP shares less than 20%. We require ≥0.4 for a passing match and short-circuit on ≥0.7.
Cert SAN match: a bonus signal. The origin's TLS cert almost always lists the customer's hostname in its SAN. We don't require it (cloud-managed certs sometimes hide this), but it lifts confidence when present.
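The status and body checks can be sketched as below. The tokenisation is an assumption (tag stripping by regex, alphanumeric tokens only) chosen to match the description above; the 0.4 and 0.7 thresholds come from the text, and the function names are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"\b[a-z0-9]{4,32}\b")


def tokens(body: str) -> set[str]:
    """Strip HTML tags, lowercase, and keep 4-32 character alphanumeric tokens."""
    return set(TOKEN_RE.findall(TAG_RE.sub(" ", body).lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as no match."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def verify_candidate(
    baseline_status: int,
    baseline_body: str,
    candidate_status: int,
    candidate_body: str,
    pass_at: float = 0.4,
    strong_at: float = 0.7,
) -> str:
    """Return 'strong', 'pass', or 'reject' for one candidate response."""
    if baseline_status // 100 != candidate_status // 100:
        return "reject"  # different status family, not the same app
    score = jaccard(tokens(baseline_body), tokens(candidate_body))
    if score >= strong_at:
        return "strong"  # good enough to stop checking other candidates
    if score >= pass_at:
        return "pass"
    return "reject"
```

A "strong" result corresponds to the short-circuit case: once a candidate clears 0.7 there is little reason to keep probing the rest of the list.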

When at least one candidate passes, the winner becomes the pinned origin for the rest of the scan.

🔗

The Breakdown

Pinning the HTTP client without breaking SNI

Pinning the connection to a specific IP while keeping the hostname intact is the trick most casual scripts get wrong. The naive approach — rewrite the URL to https://204.13.0.42/ — fails three ways:

SNI breaks. TLS clients don't send an IP literal as the SNI hostname, so the handshake carries no usable server name, and the origin's cert is for app.example.com, not 204.13.0.42. Most servers respond with the default vhost, which is rarely the app.
Host header breaks. Even if the cert validates, the web server's vhost dispatch keys off the Host header, and rewriting the URL changes that too. You land on the wrong vhost.
Cookies break. Cookie scoping is per-domain. Once the URL points at an IP, the cookie jar stops replaying cookies set for app.example.com, which kills authenticated scans.

Pentestas pins via the underlying httpcore transport. The TCP connect goes to the discovered origin IP, but the request URL is left as https://app.example.com/..., the Host header is forced to app.example.com, and a per-request sni_hostname extension keeps the TLS SNI set to the customer's hostname. The result: the origin's cert validates against the right name, the right vhost responds, cookies scope correctly, and the entire rest of the scan engine is unaware that anything special is happening.

HttpClient(
    timeout=30, delay=0.4,
    pin_origin_ip="204.13.0.42",
    pin_origin_host="app.example.com",
)
# Every request: TCP --> 204.13.0.42, Host: app.example.com, SNI: app.example.com

Cookie persistence, the user-supplied custom headers, the rate-limiter, the OAST canary, the JWT replay engine — everything continues to work because they all operate above the transport layer.
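For readers who want to see the same idea outside Pentestas's client, here is a standalone sketch using only Python's socket and ssl modules. It is not the httpcore-based implementation described above, and the function names are hypothetical: the TCP connection goes to the IP, while both SNI and the Host header carry the hostname.

```python
import socket
import ssl


def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET with the customer's hostname in Host."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Accept: */*\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")


def fetch_pinned(origin_ip: str, host: str, path: str = "/", timeout: float = 30.0) -> bytes:
    """Connect to origin_ip, but present host as both TLS SNI and Host header."""
    context = ssl.create_default_context()  # verifies the cert against `host`
    with socket.create_connection((origin_ip, 443), timeout=timeout) as raw:
        # server_hostname sets the SNI value and the name used for verification
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_request(host, path))
            chunks = []
            while True:
                chunk = tls.recv(65536)
                if not chunk:
                    break
                chunks.append(chunk)
    return b"".join(chunks)  # raw response: status line, headers, body
```

Calling `fetch_pinned("204.13.0.42", "app.example.com")` would reproduce the behaviour in the comment above: TCP to the IP, Host and SNI set to the hostname.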

🛡️

Fallback Mode

When no origin is found: edge-evasion mode

Some targets are well locked down. The customer has moved the origin fully behind Cloudflare Tunnel (formerly Argo Tunnel), every subdomain is proxied, the SPF record is on a third-party mail provider, and CT logs only ever show Cloudflare-issued certs. In that case there's no origin to find, and we have to scan through the edge anyway.

When discovery returns empty, Pentestas drops the scan into edge-evasion mode. Three things change:

Realistic-browser fingerprint. The default User-Agent is replaced with a current Chrome on macOS, Sec-Fetch-* headers and Sec-Ch-Ua* client hints are stamped onto every request, Accept-Language is set to en-US,en;q=0.9, and Accept-Encoding includes zstd. Cloudflare's Bot Management scores requests on the completeness of this set; sending all of them moves the bot-score from "definitely a script" toward "possibly a real human in Chrome".
Slower request rate. The default 30+ req/s scan tempo gets capped to a level that doesn't trip the edge rate-limiter mid-OWASP-pass. The scan takes longer; it also finishes.
JA3 / JA4 TLS impersonation (where available). When the optional curl_cffi dependency is installed, every TLS handshake matches the cipher suite ordering, extension list, and ALPN preferences of a real Chrome client. Bot-management heuristics that key on TLS fingerprints fall through.
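To make the fingerprint item concrete, a header set along these lines would cover the fields listed above. The exact values, including the Chrome version and the brand list, are illustrative assumptions and not the strings Pentestas sends; `browser_like_headers` is a hypothetical helper.

```python
def browser_like_headers(chrome_major: int = 124) -> dict[str, str]:
    """Build a Chrome-on-macOS style header set for a top-level navigation.

    chrome_major is a placeholder; a real client should track a current release.
    """
    return {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            f"Chrome/{chrome_major}.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        # Only advertise encodings the HTTP client can actually decode.
        "Accept-Encoding": "gzip, deflate, br, zstd",
        "Sec-Ch-Ua": (
            f'"Chromium";v="{chrome_major}", '
            f'"Google Chrome";v="{chrome_major}", "Not-A.Brand";v="99"'
        ),
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"macOS"',
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-User": "?1",
        "Sec-Fetch-Dest": "document",
    }
```

Headers alone do not change the TLS fingerprint; that part depends on the TLS stack, which is why the curl_cffi path is listed separately.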

Edge-evasion mode is a fallback, not a goal. The detection signal you get from a true origin scan is always cleaner. We log clearly which mode the scan ran in so the operator knows whether to push the customer for an origin allowlist before the next engagement.

🔒

Authorisation

A note on what this is and isn't

Origin discovery uses public data. crt.sh is a public log. DNS records are public. SPF records are public by design. None of this requires breaking anything. Every technique here is documented in HackTricks, OWASP, and the Cloudflare community blog — the customer's adversaries already know how to do it.

What we don't do: connect to origin IPs that aren't authorised. Pentestas's tenant-isolation model only allows scans against domains that have been explicitly verified by the tenant; the discovered origin IP must answer for the customer's hostname, and the response must shape-match the customer's app. We treat origin discovery the same way we treat any other scan input: gated by the per-tenant allowlist and bounded by the engagement scope.

If you're a Cloudflare customer who would rather your origin not be discoverable: put the public hostname behind Cloudflare Tunnel (formerly Argo Tunnel), move all non-customer-facing subdomains into their own zone, route mail through a third-party provider with separate IPs, and rotate the origin IP on a schedule. Pentestas finding the origin is the same play your real adversaries will run; the right defence is to make discovery genuinely hard, not to hope nobody tries.

🚀

Get Started

Try it

Behind-Cloudflare mode is a one-click checkbox on the new-scan form. When you tick it, the scan worker runs a detection probe (cf-ray, Server header, edge-IP membership) before kicking off origin discovery. If your target isn't actually fronted by Cloudflare, the mode is a no-op and the scan runs normally; if it is, you get a clearly logged origin or a clearly logged fallback.
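The three detection signals named above could be collected like this. How Pentestas weighs or combines them is not stated in this post, so the sketch only reports which signals fired; the range list is a small illustrative sample and the function name is hypothetical.

```python
import ipaddress

# Partial sample of Cloudflare edge ranges (assumption for illustration only).
SAMPLE_EDGE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("104.16.0.0/13", "172.64.0.0/13", "2606:4700::/32")
]


def cloudflare_signals(headers: dict[str, str], resolved_ip: str) -> list[str]:
    """Return the names of the Cloudflare indicators present for one response."""
    lowered = {key.lower(): value for key, value in headers.items()}
    signals = []
    if "cf-ray" in lowered:
        signals.append("cf-ray")
    if lowered.get("server", "").strip().lower() == "cloudflare":
        signals.append("server")
    addr = ipaddress.ip_address(resolved_ip)
    if any(addr.version == net.version and addr in net for net in SAMPLE_EDGE_RANGES):
        signals.append("edge-ip")
    return signals
```

An empty list corresponds to the no-op case: the target does not look Cloudflare-fronted, so the scan would run normally.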

Run an authorised pentest behind Cloudflare

Free tier includes 10 scans/month on a verified domain. No credit card required.


Where Pentestas applies this in the engagement

The pattern above is part of the day-to-day machinery of Pentestas's pentesting-as-a-service workflow. As an AI penetration testing system, the platform feeds every detected primitive through verification, chain orchestration, and evidence-graph weighting before the result lands in the report. The flow is the same whether the engagement is a quick B2B SaaS pentest before a Series A diligence call, a quarterly compliance run, or a continuous monitoring subscription. The Claude-backed path powers the analyst-grade narrative; the DeepSeek-backed path powers the broad-spectrum coverage. Customers pick the routing per scan or per environment.

Teams looking at penetration testing with AI typically come to Pentestas after a manual engagement caught five issues and they want continuous coverage for the next four hundred regressions; the platform exists for exactly that gap.
