Reverse Engineering — Transport & Anti-Bot

How to get a response from a server that doesn’t want to talk to you.

This is Layer 1 of the reverse-engineering docs:

  • Layer 1: Transport (this file) — TLS fingerprinting, headers, WAF bypass, headless stealth
  • Layer 2: Discovery (2-discovery) — finding structured data in pages and bundles
  • Layer 3: Auth & Runtime (3-auth) — credentials, sessions, rotating config
  • Layer 4: Content (4-content) — extracting data from HTML when there is no API
  • Layer 5: Social Networks (5-social) — modeling people, relationships, and social graphs
  • Layer 6: Desktop Apps (6-desktop-apps) — macOS, Electron, local state, unofficial APIs
  • Layer 7: MCP Servers (7-mcp) — discovering, probing, and evaluating remote/stdio MCPs

HTTP Client — agentos.http Routes Through the Engine

The short answer

from agentos import http

# Default — works for most JSON APIs
resp = http.get(url, **http.headers(accept="json"))

# Behind CloudFront/Cloudflare — WAF headers + HTTP/2
resp = http.get(url, **http.headers(waf="cf", accept="json"))

# Full page navigation (Amazon, Goodreads)
with http.client(cookies=cookie_header) as c:
    resp = c.get(url, **http.headers(waf="cf", mode="navigate", accept="html"))

All HTTP goes through the Rust engine via agentos.http. The engine handles transport mechanics (HTTP/2, cookie jars, decompression, timeouts, logging). Headers are built in Python via http.headers() — the engine sets zero default headers.

Default rule: ALWAYS use http.headers(). Never construct header dicts manually. We are acting as a real browser (Brave/Chrome), so there is no reason not to send proper browser headers. Without http.headers(), you get no User-Agent, no sec-ch-* client hints, no Sec-Fetch-* metadata — and some APIs silently reject you with a 500 or 403. Pass service-specific headers (CSRF tokens, session IDs) via the extra= parameter.

# WRONG — no browser headers, will fail on strict endpoints
http.post(url, cookies=cookies, headers={"x-csrf-token": "x"}, json=body)

# RIGHT — browser-grade headers + service-specific extras
http.post(url, cookies=cookies, json=body,
          **http.headers(waf="cf", accept="json", extra={"x-csrf-token": "x"}))

TLS fingerprinting — why the engine uses wreq with BoringSSL

AWS WAF, Cloudflare, and other CDNs compute a JA3/JA4 fingerprint from every TLS ClientHello and compare it to the claimed User-Agent. If the UA says “Chrome 131” but the TLS fingerprint says “rustls” or “urllib3,” the request gets flagged as a bot. Sensitive pages (Amazon orders, Chase banking, account settings) have higher anomaly thresholds than product pages — so the homepage works but the orders page redirects to login.

The engine uses wreq (a reqwest fork) backed by BoringSSL — the same TLS library Chrome uses. With Emulation::Chrome131, every request produces an authentic Chrome JA4 fingerprint (t13d1516h2_8daaf6152771), including correct HTTP/2 SETTINGS frames, pseudo-header order, and WINDOW_UPDATE values. This is not string-matching — wreq constructs the same ClientHello Chrome would, using the same library, and the fingerprint falls out naturally.

Verified (2026-04-01): Same cookies from Brave Browser. reqwest (rustls) → Amazon redirects to signin. wreq (BoringSSL, Chrome 131) → Amazon returns 7 orders. The only difference was the TLS fingerprint.

Python clients (requests, httpx) have similar issues — requests/urllib3 has a blocklisted JA3 hash (8d9f7747675e24454cd9b7ed35c58707). Skills don’t hit this because all HTTP goes through the engine’s wreq client, not Python libraries directly.
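The JA3 mechanics are simple to sketch: the fingerprint is an MD5 over the ClientHello's advertised version, ciphers, extensions, curves, and point formats. A toy sketch with illustrative values (not Chrome's real lists):

```python
import hashlib

# GREASE values (RFC 8701) are excluded before hashing, since
# Chrome randomizes them on every connection.
GREASE = {0x0A0A, 0x1A1A, 0x2A2A, 0x3A3A, 0x4A4A, 0x5A5A, 0x6A6A, 0x7A7A,
          0x8A8A, 0x9A9A, 0xAAAA, 0xBABA, 0xCACA, 0xDADA, 0xEAEA, 0xFAFA}

def ja3(version, ciphers, extensions, curves, point_formats):
    """MD5 over 'version,ciphers,extensions,curves,point_formats',
    each list dash-joined with GREASE values removed."""
    def clean(values):
        return "-".join(str(v) for v in values if v not in GREASE)
    raw = ",".join([str(version), clean(ciphers), clean(extensions),
                    clean(curves), clean(point_formats)])
    return hashlib.md5(raw.encode()).hexdigest()

# Two clients offering the same ciphers in a different order hash
# differently — which is why rustls and Chrome are distinguishable
# even when every header is identical.
a = ja3(771, [4865, 4866], [0, 23, 65281], [29, 23], [0])
b = ja3(771, [4866, 4865], [0, 23, 65281], [29, 23], [0])
assert a != b
```

JA4 replaces the MD5 with a structured string (the t13d1516h2_8daaf6152771 above), but the principle is the same: the fingerprint is a pure function of the ClientHello, so only a client built on the same TLS stack can reproduce it.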

When to use http2=False (Vercel)

Vercel Security Checkpoint blocks HTTP/2 clients outright — every request returns 429 with a JS challenge page, regardless of cookies or headers. But HTTP/1.1 passes cleanly.

In http.headers(), this is handled by the waf= knob:

# waf="cf" → http2=True (CloudFront/Cloudflare need HTTP/2)
resp = http.get(url, **http.headers(waf="cf", accept="json"))

# waf="vercel" → http2=False (Vercel blocks HTTP/2)
resp = http.get(url, **http.headers(waf="vercel", accept="json"))

The WAF template automatically sets the right http2 value. No need to remember which WAF needs what.

Not every Vercel-hosted endpoint enables the checkpoint. During Exa testing, auth.exa.ai (Vercel, no checkpoint) accepted h2; dashboard.exa.ai (Vercel, checkpoint enabled) rejected it. The checkpoint is a per-project Vercel Firewall setting — you have to test each subdomain.

Tested against dashboard.exa.ai (Vercel + Cloudflare):

                         http2=True   http2=False
session + cf_clearance   429          200
session only             429          200
no cookies at all        429          200 (empty session)

Cookies and headers are irrelevant — the checkpoint triggers purely on the HTTP/2 TLS fingerprint.

Rule of thumb: use waf="cf" for CloudFront/Cloudflare, waf="vercel" for Vercel. If you get 429 from Vercel, it’s the HTTP/2 fingerprint. If you get 403 from CloudFront, you need HTTP/2 + client hints.

Diagnostic protocol: isolating the variable

When a request fails, don’t guess — isolate. Test each transport variable independently to find the one that matters:

Step 1: Default request — http.headers(accept=...), http2=True
  → Works?     Done.
  → 429/403?   Continue.

Step 2: Retry with http2=False (waf="vercel")
  → Works?     Vercel Security Checkpoint. Keep http2=False, done.
  → Still 403? Continue.

Step 3: Add full browser-like headers via waf="cf" (Sec-Fetch-*, Sec-CH-UA, etc.)
  → Works?     WAF header check. Keep the headers, done.
  → Still 403? Continue.

Step 4: Add valid session cookies
  → Works?     Auth required. Handle login first.
  → Still 403? It's TLS fingerprint-level.

Step 5: Verify Chrome TLS impersonation (the engine's wreq emits it by default)
  → Works?     Strict JA3/JA4 enforcement.
  → Still 403? Something non-standard (CAPTCHA, IP block).

The key insight from the Exa reverse-engineering session: test one variable at a time. We built a matrix of http2=True/False × cookies/no-cookies × headers/no-headers and found that only the http2 setting mattered — cookies and headers were completely irrelevant to the Vercel checkpoint. Isolating the variable kept unnecessary complexity out of the skill code.
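That matrix discipline generalizes. A sketch of a one-variable-at-a-time driver — the probe here is a stand-in endpoint, not real HTTP:

```python
def isolate(probe, variables):
    """Flip one variable at a time off a baseline where everything is
    enabled, and report which single flips change the outcome."""
    baseline = {name: True for name in variables}
    base_status = probe(baseline)
    impact = {}
    for name in variables:
        status = probe(dict(baseline, **{name: False}))
        if status != base_status:
            impact[name] = (base_status, status)
    return impact

# Stand-in for a Vercel-checkpoint-like endpoint: only http2 matters.
def fake_vercel(config):
    return 429 if config["http2"] else 200

print(isolate(fake_vercel, ["http2", "cookies", "headers"]))
# → {'http2': (429, 200)}
```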

You don’t need curl_cffi or httpx

The engine’s wreq client already emits Chrome’s exact TLS cipher suites, GREASE values, extension ordering, ALPN, and HTTP/2 SETTINGS frames. Skills should never use httpx, requests, or curl_cffi directly — agentos.http handles all of this automatically.

All I/O through SDK modules

Skills must use agentos.http for all HTTP — never urllib, requests, httpx, or subprocess directly. All I/O goes through SDK modules (http.get/post, shell.run, sql.query) so the engine can log, gate, and manage requests.


Browser-Like Headers — http.headers() Knobs

Headers are built in Python via http.headers(), which composes four independent concerns:

from agentos import http

# Four knobs, ordered by network layer:
conf = http.headers(
    waf="cf",            # WAF vendor — "cf", "vercel", or None
    ua="chrome-desktop", # User-Agent — preset name or raw string
    mode="fetch",        # Request type — "fetch" (XHR) or "navigate" (page load)
    accept="json",       # Content — "json", "html", or "any"
    extra={"X-Custom": "value"},  # Merge last, overrides anything
)
# Returns {"headers": {...}, "http2": True/False}
# Spread into http.get/post/client with **
resp = http.get(url, **conf)

What each knob controls

  • waf — client hints (Sec-CH-UA, etc.) + http2. Values: "cf" (CloudFront/Cloudflare, http2=True), "vercel" (http2=False), None
  • ua — User-Agent header. Values: "chrome-desktop", "chrome-mobile", "safari-desktop", or a raw string
  • mode — Sec-Fetch-* headers (only when waf is set). Values: "fetch" (XHR: dest=empty, mode=cors), "navigate" (page load: dest=document, mode=navigate + device hints)
  • accept — Accept header. Values: "json", "html", "any" (default: */*)
  • extra — custom headers merged last. Any dict: auth tokens, CSRF, Origin, Referer, etc.

Standard headers (always included)

Every http.headers() call sets User-Agent, Accept-Language, and Accept-Encoding. These are normal browser headers — not WAF-specific. Override via extra= if needed.

WAF headers — waf="cf" and mode="navigate"

When waf is set, http.headers() adds Sec-Fetch-* metadata. The mode knob controls what type of request you’re simulating:

mode="fetch" (default) — XHR/fetch() API call:

  • Sec-Fetch-Dest: empty, Sec-Fetch-Mode: cors, Sec-Fetch-Site: same-origin

mode="navigate" — Full page navigation (used by Amazon, Goodreads):

  • Sec-Fetch-Dest: document, Sec-Fetch-Mode: navigate, Sec-Fetch-User: ?1
  • Plus device hints: Device-Memory, Downlink, DPR, ECT, RTT, Viewport-Width
  • Plus Cache-Control: max-age=0, Upgrade-Insecure-Requests: 1

Amazon’s Lightsaber bot detection checks these device hints. Without them, auth pages redirect to login. The mode="navigate" knob handles all of this automatically.

Sec-Fetch-Site values

  • JS on app.example.com calling app.example.com/api → same-origin (default in mode="fetch")
  • Full page navigation (user typed the URL) → none (default in mode="navigate")
  • Cross-origin API call → cross-site (set via extra={"Sec-Fetch-Site": "cross-site"})
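Putting the knobs together, a minimal reimplementation might look like this. It is a sketch only — header values are illustrative and the real builder lives in sdk/agentos/http.py:

```python
def headers(waf=None, ua="chrome-desktop", mode="fetch",
            accept="any", extra=None):
    """Sketch of how the four knobs could compose into
    {'headers': {...}, 'http2': bool}. Not the real implementation."""
    _UA = {"chrome-desktop": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
    _ACCEPT = {"json": "application/json", "any": "*/*",
               "html": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}
    h = {
        "User-Agent": _UA.get(ua, ua),          # preset name or raw string
        "Accept": _ACCEPT.get(accept, accept),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br, zstd",
    }
    if waf:  # client hints + Sec-Fetch metadata only when a WAF is named
        h["Sec-CH-UA"] = '"Chromium";v="131", "Not_A Brand";v="24"'
        if mode == "navigate":
            h.update({"Sec-Fetch-Dest": "document", "Sec-Fetch-Mode": "navigate",
                      "Sec-Fetch-Site": "none", "Sec-Fetch-User": "?1",
                      "Upgrade-Insecure-Requests": "1"})
        else:  # "fetch" — XHR from same-origin page JS
            h.update({"Sec-Fetch-Dest": "empty", "Sec-Fetch-Mode": "cors",
                      "Sec-Fetch-Site": "same-origin"})
    h.update(extra or {})                        # extras always win
    return {"headers": h, "http2": waf != "vercel"}  # only Vercel gets HTTP/1.1

conf = headers(waf="vercel", accept="json")
assert conf["http2"] is False
```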

Common patterns

from agentos import http

# JSON API, no WAF (Gmail, Linear, Todoist — 15 skills)
resp = http.get(url, **http.headers(accept="json", extra={"Authorization": f"Bearer {token}"}))

# HTML scraping behind CloudFront (Amazon, Goodreads)
with http.client(cookies=cookie_header) as c:
    resp = c.get(url, **http.headers(waf="cf", mode="navigate", accept="html"))

# JSON API behind Cloudflare (Claude.ai)
# Claude needs custom Sec-CH-UA (Brave v146) and http2=False
conf = http.headers(waf="cf", accept="json", extra=CLAUDE_HEADERS)
conf["http2"] = False  # override WAF default
with http.client(cookies=cookie_header, **conf) as c:
    resp = c.get(url)

# Vercel checkpoint bypass (Exa)
resp = http.get(url, **http.headers(waf="vercel", accept="json"))

# Full control — skip helpers entirely
resp = http.get(url, headers={"Accept": "text/csv", "X-Custom": "value"})

# Debug — print what you're sending
print(http.headers(waf="cf", mode="navigate", accept="html"))

Version drift

The Chrome version in Sec-CH-UA is pinned in sdk/agentos/http.py (_UA and _WAF dicts). If you start getting unexpected 403s months later, the pinned version may be too old. Update the version strings in the SDK to match the current stable Chrome release.
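A drift check is easy to automate — parse the major version out of both pinned strings and assert they agree. A sketch (the real pins live in the SDK's _UA and _WAF dicts):

```python
import re

def ua_major(user_agent):
    """Chrome major version from a User-Agent string ('' if absent)."""
    m = re.search(r"Chrome/(\d+)\.", user_agent)
    return m.group(1) if m else ""

def ch_ua_majors(sec_ch_ua):
    """Every v=\"NNN\" major advertised in a Sec-CH-UA header."""
    return set(re.findall(r'v="(\d+)"', sec_ch_ua))

ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36")
ch = '"Chromium";v="131", "Not_A Brand";v="24", "Google Chrome";v="131"'
assert ua_major(ua) in ch_ua_majors(ch)  # UA and client hints must not drift
```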

How to discover the right headers

Use the Playwright skill’s capture_network or the fetch interceptor to see exactly what headers a real browser sends on the same request. Compare with http.headers() output and add any missing ones via extra=.


Cookie-Triggered Client-Side Features

Some sites inject JavaScript-driven features via cookies. When you're scraping without a JS engine, these features produce unusable output. The fix: strip the trigger cookies so the server falls back to plain HTML.

Amazon’s Siege Encryption

Amazon uses a system called SiegeClientSideDecryption to encrypt page content client-side. When the csd-key cookie is present, Amazon sends encrypted HTML blobs instead of readable content. The browser decrypts them with JavaScript; a plain HTTP client gets unreadable garbage.

Solution: strip the trigger cookies using skip_cookies= on http.client():

_SKIP_COOKIES = ["csd-key", "csm-hit", "aws-waf-token"]

with http.client(cookies=cookie_header, skip_cookies=_SKIP_COOKIES,
                 **http.headers(waf="cf", mode="navigate", accept="html")) as c:
    resp = c.get(url)

The engine filters these cookies out of the jar before sending. With csd-key stripped, Amazon serves plain, parseable HTML. The csm-hit and aws-waf-token cookies are also stripped — they’re telemetry/WAF cookies that can trigger additional client-side behavior.
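Outside the engine, the same filtering is a few lines over the Cookie header string — a sketch of what skip_cookies= does, not the engine's code:

```python
def strip_cookies(cookie_header, skip):
    """Drop named cookies from a 'name=value; name2=value2' header."""
    kept = [pair.strip() for pair in cookie_header.split(";")
            if pair.split("=", 1)[0].strip() not in skip]
    return "; ".join(kept)

hdr = "session-id=abc; csd-key=ENCRYPTED; csm-hit=tb:x"
print(strip_cookies(hdr, ["csd-key", "csm-hit", "aws-waf-token"]))
# → session-id=abc
```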

Diagnosing encryption

If your HTML responses contain garbled content, long base64 strings, or empty containers where data should be, check for client-side decryption:

  1. Compare the page source in the browser (View Source, not DevTools Elements) with the response your script receives
  2. Search for keywords like decrypt, Siege, clientSide in the page JS
  3. Try stripping cookies one at a time to find which one triggers encryption

Reference: skills/amazon/amazon.py SKIP_COOKIES.


Response Decompression — You Must Handle What You Advertise

When you send Accept-Encoding: gzip, deflate, br, zstd (as all browser-like profiles do), the server will compress its response. Your HTTP client must decompress it. If it doesn’t, you get raw binary garbage instead of HTML — and every parser returns zero results.

This is a silent failure. The HTTP status is 200, the headers look normal, and Content-Length is reasonable — but the body is garbled bytes. It looks like client-side encryption (see above), but the cause is much simpler: the response is compressed and your client isn't decompressing it.

How agentos.http handles it

The Rust HTTP engine uses wreq with gzip, brotli, deflate, and zstd feature flags enabled. Decompression is automatic and transparent — resp["body"] is always plaintext.

Why this matters

Brotli (RFC 7932) is a compression algorithm designed by Google for the web. It compresses 20-26% better than gzip on HTML/CSS/JS. Every modern browser supports it, and servers aggressively use it for large pages. Amazon’s order history page, for example, returns ~168KB of brotli-compressed HTML. Without decompression, you get 168KB of binary noise and zero order cards.

The trap: small pages (homepages, API endpoints) may not be compressed or may use gzip which some clients handle by default. Large pages (order history, dashboards, search results) almost always use brotli. So your skill works on simple endpoints and silently fails on the important ones.
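The fix is mechanical: dispatch on the Content-Encoding header. A stdlib-only sketch — gzip and deflate ship with Python, while br and zstd would need the third-party brotli/zstandard codecs:

```python
import gzip
import zlib

def decompress(body, content_encoding):
    """Decode a response body according to its Content-Encoding."""
    enc = content_encoding.strip().lower()
    if enc == "gzip":
        return gzip.decompress(body)
    if enc == "deflate":
        return zlib.decompress(body)
    if enc in ("br", "zstd"):
        raise NotImplementedError(f"install a codec for {enc}")
    return body  # identity or unknown encoding: pass through

# The silent-garbage scenario, reproduced and fixed:
compressed = gzip.compress(b"<html>orders</html>")
assert compressed[:2] == b"\x1f\x8b"  # reads as binary noise without decoding
assert decompress(compressed, "gzip") == b"<html>orders</html>"
```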

Diagnostic

If your response body contains non-UTF-8 bytes, starts with garbled characters, or contains no recognizable HTML despite a 200 status:

  1. Check the response Content-Encoding header — if it says br, gzip, or zstd, the body is compressed
  2. Verify your HTTP client has decompression enabled
  3. In agentOS: agentos.http handles this automatically. If you’re using raw urllib.request, it does NOT decompress brotli

Reference: Cargo.toml wreq features — gzip, brotli, deflate, zstd.


Session Warming

Some services track request patterns and flag direct deep-links from an unknown session as bot traffic. The fix: warm the session by visiting the homepage first, then navigate to the target page.

def _warm_session(client) -> None:
    """Visit homepage first to provision session cookies."""
    client.get("https://www.amazon.com/", headers={"Sec-Fetch-Site": "none"})

This establishes the session context (cookies, CSRF tokens, tracking state) before hitting authenticated pages. Without it, Amazon redirects order history and account pages to the login page even with valid session cookies.

When to warm:

  • Before any authenticated page fetch (order history, account settings)
  • When the first request to a deep URL returns a login redirect despite valid cookies
  • When you see WAF-level blocks only on direct navigation

When warming isn’t needed:

  • API endpoints (JSON responses) — they don’t use page-level session tracking
  • Public pages without authentication
  • Sites where direct deep-links work fine (test first)

Reference: skills/amazon/amazon.py _warm_session().


Headless Browser Stealth

Default Playwright/Chromium gets blocked by many sites (Goodreads returns 403, Cloudflare serves challenge pages). The fix is a set of anti-fingerprinting settings.

Minimum stealth settings

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/131.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1440, "height": 900},
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = context.new_page()
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', { get: () => false });
    """)

What each setting does

  • --disable-blink-features=AutomationControlled — removes the navigator.webdriver=true flag that Chromium sets in automation mode
  • Custom user_agent — the default headless UA contains HeadlessChrome, which is trivially blocked
  • viewport — the default headless viewport is 800x600, which no real user has
  • locale / timezone_id — some bot detectors check for mismatches between locale and timezone
  • navigator.webdriver = false — belt-and-suspenders override in case the flag leaks through other paths

Real example: Goodreads

Default Playwright against goodreads.com/book/show/4934 returns HTTP 403 with one network request. With stealth settings, the page loads fully with 1400+ requests including 4 AppSync GraphQL calls. See skills/goodreads/public_graph.py discover_via_browser() for the implementation.


CDP Detection Signals — Why Playwright Gets Caught

Even with the stealth settings above, Playwright is still detectable at the Chrome DevTools Protocol (CDP) layer. These signals are invisible in DevTools and unrelated to headers, cookies, or user-agent strings. They matter most during reverse engineering sessions — if a site behaves differently under Playwright than in your real browser, CDP leaks are likely the cause.

Runtime.Enable leak

Playwright calls Runtime.Enable on every CDP session to receive execution context events. Anti-bot systems (Cloudflare, DataDome) detect this with a few lines of in-page JavaScript that only fire when Runtime.Enable is active. This is the single most devastating detection vector — it works regardless of all other stealth measures.

sourceURL leak

Playwright appends //# sourceURL=__playwright_evaluation_script__ to every page.evaluate() call. Any page script can inspect error stack traces and see these telltale URLs. This means your __NEXT_DATA__ extraction, DOM inspection, or any other evaluate() call leaves a fingerprint.

Utility world name

Playwright creates an isolated world named __playwright_utility_world__ that is visible in Chrome’s internal state and potentially to detection scripts.

What to do about it

These leaks are baked into Playwright’s source code — no launch flag or init script fixes them. Two options:

  1. For most RE work: The stealth settings above (flags, UA, viewport, webdriver override) are enough. Most sites don’t check CDP-level signals. If a site seems to behave differently under Playwright, check for these leaks before adding complexity.

  2. For strict sites (Cloudflare Bot Management, DataDome): Use rebrowser-playwright as a drop-in replacement. It patches Playwright’s source to eliminate Runtime.Enable calls, randomize sourceURLs, and rename the utility world. Install: npm install rebrowser-playwright and change your import.

This doesn’t affect production skills. Our architecture uses Playwright only for discovery — production calls go through surf() / HTTPX, which has zero CDP surface. The CDP leaks only matter during reverse engineering sessions where you’re using the browser to investigate a protected site.


Cookie Domain Filtering — RFC 6265

When a cookie provider (brave-browser, firefox) extracts cookies for a domain like .uber.com, it returns cookies from ALL subdomains: .uber.com, .riders.uber.com, .auth.uber.com, .www.uber.com. If the skill's base_url is https://riders.uber.com, sending cookies from .auth.uber.com is wrong — the server picks the wrong csid and redirects to login.

The engine implements RFC 6265 domain matching: when resolving cookies, it extracts the host from connection.base_url and passes it to the cookie provider. The provider filters cookies so only matching ones are returned:

host = "riders.uber.com"

.uber.com          → riders.uber.com ends with .uber.com   → KEEP (parent domain)
.riders.uber.com   → riders.uber.com matches exactly        → KEEP (exact match)
.auth.uber.com     → riders.uber.com doesn't match          → DROP (sibling)
.www.uber.com      → riders.uber.com doesn't match          → DROP (sibling)

This is automatic — skills don’t need to do anything. The filtering happens in the cookie provider (brave-browser/get-cookie.py, firefox/firefox.py) based on the host parameter the engine passes from connection.base_url.

When it matters: Only when a domain has cookies on multiple subdomains with the same cookie name. Most skills are unaffected — Amazon, Goodreads, Chase all have cookies on a single domain. Uber is the first case where it matters.

The old workaround: Before RFC 6265 filtering, the Uber skill had a _filter_cookies() function that deduplicated by cookie name (last occurrence wins). This has been removed — the provider handles it correctly now.
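The matching rules reduce to a few lines — a sketch of RFC 6265 section 5.1.3 domain-matching; the real filter lives in the cookie providers:

```python
def domain_matches(host, cookie_domain):
    """True if `host` exactly matches the cookie domain or is a
    subdomain of it — the RFC 6265 domain-match rule."""
    domain = cookie_domain.lstrip(".")
    return host == domain or host.endswith("." + domain)

host = "riders.uber.com"
assert domain_matches(host, ".uber.com")           # parent domain — KEEP
assert domain_matches(host, ".riders.uber.com")    # exact match   — KEEP
assert not domain_matches(host, ".auth.uber.com")  # sibling       — DROP
assert not domain_matches(host, ".www.uber.com")   # sibling       — DROP
```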


Cookie Resolution — http.cookies()

Skills can resolve cookies for any domain without knowing which browser provides them:

from agentos import http

# Resolve cookies — provider discovery is automatic
cookie_header = http.cookies(domain=".uber.com")
resp = http.post(url, cookies=cookie_header, **http.headers(accept="json"))

# Specific account (multiple people logged in on different browsers)
cookie_header = http.cookies(domain=".uber.com", account="uber@contini.co")

http.cookies() uses the same auth resolver as connection-based auth: it tries all installed cookie providers (brave-browser, firefox, etc.), picks the best one, and returns a cookie header string. No hardcoded provider names in skill code.
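The resolution order can be pictured as first-provider-wins. A deliberately simplified sketch — the real resolver also scores providers and handles account selection, and the lambdas below are stand-ins for the brave-browser/firefox providers:

```python
def resolve_cookies(domain, providers):
    """Try each provider in order; return the first non-empty
    cookie header. Providers are callables: domain -> header or ''."""
    for provider in providers:
        header = provider(domain)
        if header:
            return header
    raise LookupError(f"no provider has cookies for {domain}")

# Stand-ins for installed cookie providers:
brave = lambda domain: "sid=abc" if domain == ".uber.com" else ""
firefox = lambda domain: ""

assert resolve_cookies(".uber.com", [brave, firefox]) == "sid=abc"
```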

Playwright integration

capture_network accepts a cookie_domain param that resolves cookies automatically:

# One step — no manual cookie extraction needed
run(skill="playwright", tool="capture_network", params={
    "url": "https://riders.uber.com/trips",
    "cookie_domain": ".uber.com",
    "pattern": "**graphql**",
})

This replaces the old 3-step flow (extract from provider → reformat → inject).


Debugging 400/403 Errors

  • 403 from CloudFront with a bot-detection HTML page — JA3/JA4 fingerprint blocked. Shouldn't happen with wreq; if it does, check that the engine is running the wreq build.
  • 400 from CloudFront, body is "Forbidden" or a short string — WAF rule triggered (header order, ALPN). Use waf="cf" and check mode=.
  • 400, body looks like "404" — API Gateway can't route the request, usually a missing tenant/auth header. Find and add the missing header via extra=.
  • 403 for a same-origin API (e.g. claude.ai) — missing Sec-Fetch-* headers. Use waf="cf", which sets Sec-Fetch-* automatically.
  • 403 from headless Playwright — default Chromium automation fingerprint. Add stealth settings (see Headless Browser Stealth above).
  • 429 with "Vercel Security Checkpoint" HTML — Vercel blocks the HTTP/2 fingerprint. Use waf="vercel" (sets http2=False).
  • Works in browser, fails in Python regardless — the Authorization value may not be a JWT. Look for short Authorization values in the bundle (namespace, env name, etc.).

Using Playwright to capture exact headers

When you’re stuck, use Playwright to intercept the actual XHR and log all headers (including those added by axios interceptors that aren’t visible in DevTools):

from playwright.sync_api import sync_playwright

def capture_request_headers(url_pattern: str, trigger_url: str) -> dict:
    """Navigate to trigger_url and capture headers from the first request matching url_pattern."""
    captured = {}
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("request", lambda req: captured.update(req.headers)
                if url_pattern in req.url else None)
        page.goto(trigger_url)
        page.wait_for_timeout(3000)
        browser.close()
    return captured

Skill File Layout

skills/<skill-name>/
  readme.md            <- agentOS skill descriptor (operations, adapters, etc.)
  requirements.md      <- reverse engineering notes, API docs, findings log
  <skill>.py           <- Python module with all API functions
  icon.svg             <- skill icon

Keep requirements.md as a living document — update it every time you discover a new endpoint, figure out a new header, or resolve a mystery.


Real-World Examples in This Repo

  • skills/amazon/ — Amazon (Lightsaber). Config: waf="cf", mode="navigate", accept="html". Full device hints required; skip_cookies= for Siege encryption; session warming. Chrome TLS fingerprint (wreq) required for the orders page — Amazon's WAF uses JA4, plus OpenID max_auth_age=0 per-feature auth gates.
  • skills/austin-boulder-project/ — Tilefive / approach.app. Config: accept="json" + auth header. CloudFront; Authorization is a namespace string.
  • skills/claude/ — claude.ai (Cloudflare). Config: waf="cf", accept="json", http2=False override. Custom Sec-CH-UA (Brave v146); Cloudflare bypass needs Sec-Fetch-*.
  • skills/exa/ — dashboard.exa.ai (Vercel). Config: waf="vercel", accept="json". The Vercel checkpoint is purely TLS — cookies and headers are irrelevant.
  • skills/goodreads/ — Goodreads (CloudFront). Config: waf="cf", accept="html". Public GraphQL via CloudFront; headless Playwright needs stealth settings.
  • skills/uber/ — Uber (CloudFront). Config: accept="json" + custom headers. RFC 6265 cookie domain filtering — the first skill where sibling-subdomain cookies caused bugs.