Reverse Engineering — Discovery & Data Extraction
Once you can talk to the server (see 1-transport), how do you find and extract structured data?
This is Layer 2 of the reverse-engineering docs:
- Layer 1: Transport — 1-transport
- Layer 2: Discovery (this file) — finding structured data in pages and bundles
- Layer 3: Auth & Runtime — 3-auth
- Layer 4: Content — 4-content — HTML scraping when there is no API
- Layer 5: Social Networks — 5-social — modeling people, relationships, and social graphs
- Layer 6: Desktop Apps — 6-desktop-apps — macOS, Electron, local state, unofficial APIs
Tool: browse capture (bin/browse-capture.py) is the primary discovery tool. It connects to your real browser (Brave/Chrome) via CDP and captures all network traffic with full headers and response bodies. For DOM inspection, use the browser’s own DevTools. See the overview for the full toolkit.
Why not Playwright? Playwright’s bundled Chromium has a detectable TLS fingerprint. Sites like Amazon and Cloudflare-protected services reject it. CDP to a real browser produces authentic fingerprints and uses existing sessions. See Transport.
Next.js + Apollo Cache Extraction
Many modern sites (Goodreads, Airbnb, etc.) use Next.js with Apollo Client. These pages ship a full serialized Apollo cache in the HTML — structured entity data that you can parse without scraping visible HTML.
Where to find it
<script id="__NEXT_DATA__" type="application/json">{ ... }</script>
Inside that JSON:
__NEXT_DATA__
.props.pageProps
.props.pageProps.apolloState <-- the gold
.props.pageProps.apolloState.ROOT_QUERY
How Apollo normalized cache works
Apollo stores GraphQL results as a flat dictionary keyed by entity type and ID.
Related entities are stored as {"__ref": "Book:kca://book/..."} pointers.
import json, re

def extract_next_data(html: str) -> dict:
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html, re.S,
    )
    if not match:
        raise RuntimeError("No __NEXT_DATA__ found")
    return json.loads(match.group(1))

def deref(apollo: dict, value):
    """Resolve Apollo __ref pointers to their actual objects."""
    if isinstance(value, dict) and "__ref" in value:
        return apollo.get(value["__ref"])
    return value
Extraction pattern
next_data = extract_next_data(html)
apollo = next_data["props"]["pageProps"]["apolloState"]
root_query = apollo["ROOT_QUERY"]
# Find the entity by its query key
book_ref = root_query['getBookByLegacyId({"legacyId":"4934"})']
book = apollo[book_ref["__ref"]]
# Dereference related entities
work = deref(apollo, book.get("work"))
primary_author = deref(apollo, book.get("primaryContributorEdge", {}).get("node"))
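The exact ROOT_QUERY key embeds the serialized arguments, so it changes per page. A minimal sketch, assuming keys follow the operationName({...}) shape shown above, that looks entries up by operation name instead. The helper name find_root_refs and the sample cache are hypothetical.

```python
def find_root_refs(apollo: dict, operation: str) -> dict:
    """Return {query_key: entity} for every ROOT_QUERY entry of one operation."""
    root_query = apollo.get("ROOT_QUERY", {})
    found = {}
    for key, value in root_query.items():
        # Keys look like 'operationName({"arg":"value"})'
        if not key.startswith(operation + "("):
            continue
        if isinstance(value, dict) and "__ref" in value:
            found[key] = apollo.get(value["__ref"])
    return found

# Hypothetical miniature cache, for illustration only
sample_cache = {
    "ROOT_QUERY": {'getBookByLegacyId({"legacyId":"4934"})': {"__ref": "Book:1"}},
    "Book:1": {"title": "Example"},
}
books = find_root_refs(sample_cache, "getBookByLegacyId")
```

This avoids hard-coding the argument string when the same extractor runs across many pages.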
What you typically find in the Apollo cache
| Entity type | Common fields |
|---|---|
| Books | title, description, imageUrl, webUrl, legacyId, details (isbn, pages, publisher) |
| Contributors | name, legacyId, webUrl, profileImageUrl |
| Works | stats (averageRating, ratingsCount), details (originalTitle, publicationTime) |
| Social signals | shelf counts (CURRENTLY_READING, TO_READ) |
| Genres | name, webUrl |
| Series | title, webUrl |
The Apollo cache often contains more data than the visible page renders. Always
dump and inspect apolloState before assuming you need to make additional API calls.
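One way to do that inspection is to count the entities in the cache by type before writing any extractor. The sketch below assumes each entity carries a __typename field or a "Type:id" style key. The function name summarize_cache and the sample data are hypothetical.

```python
from collections import Counter

def summarize_cache(apollo: dict) -> Counter:
    """Count cache entries per entity type, skipping the ROOT_QUERY index."""
    counts = Counter()
    for key, value in apollo.items():
        if key == "ROOT_QUERY":
            continue
        typename = value.get("__typename") if isinstance(value, dict) else None
        # Fall back to the key prefix before the first colon
        counts[typename or key.split(":", 1)[0]] += 1
    return counts

# Hypothetical cache contents, for illustration only
sample_cache = {
    "ROOT_QUERY": {},
    "Book:a": {"__typename": "Book", "title": "One"},
    "Book:b": {"title": "Two"},
    "Contributor:c": {"__typename": "Contributor", "name": "Someone"},
}
summary = summarize_cache(sample_cache)
```

Printing summary.most_common() gives a quick overview of which entity types are worth mapping.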
Real example: Goodreads
See skills/goodreads/public_graph.py functions load_book_page() and
map_book_payload() for a complete implementation that extracts 25+ fields from
the Apollo cache without any GraphQL calls.
JS Bundle Scanning
SPAs embed everything in their JavaScript bundles: config values, API keys, custom endpoints, and auth flow logic. Scanning bundles is one of the highest-value reverse-engineering techniques. It works without login, reveals hidden endpoints that network capture misses, and exposes the exact contracts the frontend uses.
Two levels of bundle scanning
Level 1: Config extraction — find API keys, endpoints, tenant IDs. Standard search for known patterns.
Level 2: Endpoint and flow discovery — find custom API endpoints that
aren’t in the standard framework (e.g. /api/verify-otp), understand what
parameters they accept, and how the frontend processes the response. This
is how you crack custom auth flows.
General pattern
import re
from urllib.parse import urljoin

import httpx

def scan_bundles(page_url: str, search_terms: list[str]) -> dict:
    """Fetch a page, extract all JS bundle URLs, scan each for search terms."""
    results = {}
    with httpx.Client(http2=False, follow_redirects=True, timeout=30) as client:
        html = client.get(page_url).text
        # Extract all JS chunk URLs (Next.js / Turbopack pattern)
        js_urls = set(re.findall(
            r'["\'](/_next/static/[^"\' >]+\.js[^"\' >]*)', html
        ))
        for url in js_urls:
            # Resolve the root-relative chunk path against the page origin
            js = client.get(urljoin(page_url, url)).text
            lowered = js.lower()
            for term in search_terms:
                # Record context around the first match in this chunk
                idx = lowered.find(term.lower())
                if idx == -1:
                    continue
                results.setdefault(term, []).append({
                    "chunk": url[-40:],
                    "size": len(js),
                    "context": js[max(0, idx - 100):idx + 200],
                })
    return results
Config patterns to search for
| What | Search terms |
|---|---|
| API keys | apiKey, api_key, X-Api-Key, widgetsApiKey |
| GraphQL endpoints | appsync-api, graphql |
| Tenant / namespace | host.split, subdomain |
| Cognito credentials | userPoolId, userPoolClientId |
| Auth endpoints | AuthFlow, InitiateAuth, cognito-idp |
Custom endpoint patterns to search for
| What | Search terms |
|---|---|
| Custom auth flows | verify-otp, verify-code, verify-token, confirm-code |
| Hidden API routes | fetch(, /api/ |
| Token construction | callback/email, hashedOtp, rawOtp, token= |
| Form submission handlers | submit, handleSubmit, onSubmit |
How we cracked Exa’s custom OTP flow
Exa’s login page uses a custom 6-digit OTP system built on top of NextAuth.
The standard NextAuth callback failed with error=Verification. Scanning
the JS bundles revealed the actual flow:
# Search terms that found the hidden endpoint
results = scan_bundles("https://auth.exa.ai", ["verify-otp", "verify-code", "callback/email"])
In a 573KB chunk, this surfaced:
fetch("/api/verify-otp", {method: "POST", headers: {"Content-Type": "application/json"},
body: JSON.stringify({email: e.toLowerCase(), otp: r})})
// → response: {email, hashedOtp, rawOtp}
// → constructs: token = hashedOtp + ":" + rawOtp
// → redirects to: /api/auth/callback/email?token=...&email=...
This revealed the entire auth flow — custom endpoint, request/response shape, and token construction — all from static JS analysis.
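Once the response shape is known, the redirect URL can be rebuilt outside the browser. A minimal sketch of that token construction, based only on the bundle excerpt above: the function name build_callback_url and the example.test host are hypothetical, and the real flow may need extra query parameters not shown here.

```python
from urllib.parse import urlencode

def build_callback_url(base_url: str, email: str, hashed_otp: str, raw_otp: str) -> str:
    """Assemble the email callback URL from a verify-otp style response."""
    # Token format taken from the bundle excerpt: hashedOtp + ":" + rawOtp
    token = f"{hashed_otp}:{raw_otp}"
    query = urlencode({"token": token, "email": email})
    return f"{base_url.rstrip('/')}/api/auth/callback/email?{query}"

# Placeholder values, for illustration only
callback_url = build_callback_url("https://auth.example.test", "a@b.co", "HASH", "123456")
```

urlencode percent-encodes the colon and the at sign, so the resulting URL can be passed directly to an HTTP client.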
Multi-environment configs
Many sites ship all environment configs in the same bundle. Goodreads ships four AppSync configurations with labeled environments:
{"graphql":{"apiKey":"da2-...","endpoint":"https://...appsync-api...amazonaws.com/graphql","region":"us-east-1"},"showAds":false,"shortName":"Dev"}
{"graphql":{"apiKey":"da2-...","endpoint":"https://...appsync-api...amazonaws.com/graphql","region":"us-east-1"},"showAds":false,"shortName":"Beta"}
{"graphql":{"apiKey":"da2-...","endpoint":"https://...appsync-api...amazonaws.com/graphql","region":"us-east-1"},"showAds":true,"shortName":"Preprod"}
{"graphql":{"apiKey":"da2-...","endpoint":"https://...appsync-api...amazonaws.com/graphql","region":"us-east-1"},"showAds":true,"shortName":"Prod"}
Pick the right one by looking for identifiers like shortName, showAds: true,
publishWebVitalMetrics: true, or simply taking the last entry (Prod is typically
last in webpack build output).
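Selecting by label is safer than relying on position. The sketch below assumes the configs appear in the bundle as strict JSON objects starting with {"graphql": as in the excerpt above; bundles that use unquoted JavaScript object literals would need a different parser. The function names and the sample string are hypothetical.

```python
import json

def find_env_configs(js: str, marker: str = '{"graphql":') -> list:
    """Decode every JSON object in the bundle text that starts with the marker."""
    decoder = json.JSONDecoder()
    configs = []
    pos = js.find(marker)
    while pos != -1:
        try:
            # raw_decode parses one JSON value and reports where it ended
            obj, end = decoder.raw_decode(js, pos)
        except json.JSONDecodeError:
            end = pos + 1
        else:
            if isinstance(obj, dict):
                configs.append(obj)
        pos = js.find(marker, end)
    return configs

def pick_config(configs: list, short_name: str = "Prod"):
    """Return the first config whose shortName matches, or None."""
    for cfg in configs:
        if cfg.get("shortName") == short_name:
            return cfg
    return None

# Hypothetical bundle fragment, for illustration only
sample_js = (
    'var envs=[{"graphql":{"apiKey":"k1"},"showAds":false,"shortName":"Dev"},'
    '{"graphql":{"apiKey":"k2"},"showAds":true,"shortName":"Prod"}];'
)
prod = pick_config(find_env_configs(sample_js))
```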
The “Authorization is the namespace” pattern
Some APIs use the Authorization header not for a JWT but for a tenant namespace
extracted from the subdomain at runtime:
Jl = () => host.split(".")[0] // -> "boulderingproject"
headers: { Authorization: Jl(), "X-Api-Key": widgetsApiKey }
If you see Authorization values that seem too short to be JWTs, look for the
function that generates them near the axios/fetch client factory in the bundle.
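Reproducing this pattern in a client is straightforward once the generator function is found. A minimal sketch, assuming the namespace is the first label of the hostname as in the snippet above: the function name namespace_headers and the example.com host are hypothetical.

```python
from urllib.parse import urlsplit

def namespace_headers(site_url: str, api_key: str) -> dict:
    """Build request headers where Authorization carries the tenant namespace."""
    host = urlsplit(site_url).hostname or ""
    # Mirrors host.split(".")[0] from the bundle
    namespace = host.split(".")[0]
    return {"Authorization": namespace, "X-Api-Key": api_key}

# Placeholder host and key, for illustration only
headers = namespace_headers("https://boulderingproject.example.com/schedule", "KEY")
```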
Real examples
- Goodreads: skills/goodreads/public_graph.py discover_from_bundle() extracts the Prod AppSync config from the _app chunk
- Austin Boulder Project: skills/austin-boulder-project/abp.py extracts the API key and namespace from the Tilefive bundle
Navigation API Interception
When JS bundle scanning reveals what endpoint gets called but not what happens with the result (e.g. a client-side token construction), you need to see the actual values the browser produces. The Navigation API interceptor is the key technique.
The problem
Client-side JS often does: fetch → process response → set window.location.href.
Once the navigation fires, the page is gone and you can’t inspect the URL. Network
capture only catches the fetch, not the outbound navigation. And the processing
logic is buried in minified closures you can’t easily call.
The solution
Modern Chrome exposes the Navigation API.
You can intercept navigation attempts, capture the destination URL, and prevent
the actual navigation — all with a single evaluate call:
evaluate { script: "navigation.addEventListener('navigate', (e) => { window.__intercepted_nav_url = e.destination.url; e.preventDefault(); }); 'interceptor installed'" }
Then trigger the action (click a button, submit a form), and read the captured URL:
click { selector: "button#submit" }
evaluate { script: "window.__intercepted_nav_url" }
The URL contains whatever the client-side JS constructed — tokens, hashes, callback parameters — fully assembled and ready to replay with HTTPX.
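Before replaying, it helps to split the captured URL into its parts so each parameter can be inspected separately. A minimal sketch using only the standard library: the function name parse_captured_url and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def parse_captured_url(url: str) -> dict:
    """Split an intercepted navigation URL into origin, path, and query params."""
    parts = urlsplit(url)
    # parse_qs returns lists; keep the first value of each parameter
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "origin": f"{parts.scheme}://{parts.netloc}",
        "path": parts.path,
        "params": params,
    }

# Placeholder URL, for illustration only
captured = parse_captured_url(
    "https://auth.example.test/api/auth/callback/email?token=abc%3A123456&email=a%40b.co"
)
```

Note that parse_qs decodes percent-escapes and treats "+" as a space, so compare the decoded values against the raw URL if a token looks truncated.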
When to use this
| Situation | Technique |
|---|---|
| Button click makes a fetch() call | Fetch interceptor (see 3-auth) |
| Button click causes a page navigation | Navigation API interceptor |
| Form does a native POST (page reloads) | Inspect the <form> action + inputs |
| JS constructs a URL and redirects | Navigation API interceptor |
Real example: Exa OTP verification
The Exa auth page’s “VERIFY CODE” button calls /api/verify-otp, gets back
{hashedOtp, rawOtp}, then does window.location.href = callback_url_with_token.
The Navigation API interceptor captured the full callback URL, revealing the
token format is {bcrypt_hash}:{raw_code}.
This technique turned a “Playwright required” flow into a fully HTTPX-replayable one. See NextAuth OTP flow.
Combining with fetch interception
For complete visibility, install both interceptors before triggering an action:
// Capture all fetch calls AND navigations
window.__cap = { fetches: [], navigations: [] };

// Fetch interceptor
const origFetch = window.fetch;
window.fetch = async (...args) => {
  const r = await origFetch(...args);
  const c = r.clone();
  window.__cap.fetches.push({
    url: typeof args[0] === 'string' ? args[0] : args[0]?.url,
    status: r.status,
    body: (await c.text()).substring(0, 3000),
  });
  return r;
};

// Navigation interceptor
navigation.addEventListener('navigate', (e) => {
  window.__cap.navigations.push(e.destination.url);
  e.preventDefault();
});
Read everything after: evaluate { script: "JSON.stringify(window.__cap)" }
Read the Source
When bundle scanning and interception give you the what but not the why, go read the library’s source code. This is especially valuable for well-known frameworks (NextAuth, Supabase, Clerk, Auth0) where the source is on GitHub.
Why this matters
Minified bundle code tells you what the client does. The library source tells you what the server expects. These are two halves of the same flow.
Example: NextAuth email callback
Bundle scanning revealed Exa calls /api/auth/callback/email?token=.... But
what does the server do with that token? Reading the
NextAuth callback source
revealed the critical line:
token: await createHash(`${paramToken}${secret}`)
The server SHA-256 hashes token + NEXTAUTH_SECRET and compares with the
database. This told us the token format must be stable and deterministic — it
can’t be a random value. Combined with the Navigation API interception that
showed token = hashedOtp:rawOtp, we had the complete picture.
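The server-side comparison can be mirrored locally to reason about what the database must hold. The sketch below assumes the hash is hex-encoded SHA-256 over the concatenated token and secret, which is how the line above reads; the function name and the placeholder values are hypothetical, and the real secret is never available to a client.

```python
import hashlib

def expected_stored_token(param_token: str, secret: str) -> str:
    """Hash token + secret the way the callback handler is described as doing."""
    # Assumption: hex digest of SHA-256 over the plain concatenation
    return hashlib.sha256(f"{param_token}{secret}".encode("utf-8")).hexdigest()

# Placeholder inputs, for illustration only
digest = expected_stored_token("HASH:123456", "not-the-real-secret")
```

Because the digest depends only on the token and the secret, the same token always maps to the same stored value, which is why a random client-side token could never match.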
When to read the source
| Signal | Action |
|---|---|
| Standard framework (NextAuth, Supabase, etc.) | Read the auth callback handler source |
| Custom error messages (e.g. error=Verification) | Search the library source for that error string |
| Token/hash format is unclear | Read the token verification logic |
| Framework does something “impossible” | The source always reveals how |
Where to find it
NextAuth: github.com/nextauthjs/next-auth/tree/main/packages/core/src
Supabase: github.com/supabase/auth
Clerk: github.com/clerk/javascript
Auth0: github.com/auth0/nextjs-auth0
Search the repo for the endpoint path (e.g. callback/email) or error message
(e.g. Verification) to find the relevant handler quickly.
GraphQL Schema Discovery via JS Bundles
Production GraphQL endpoints almost never allow introspection queries. But the frontend JS bundles contain every query and mutation the app uses.
Technique: scan all JS chunks for operation names
import re

import httpx

def discover_graphql_operations(html: str, base_url: str) -> set[str]:
    """Find all GraphQL operation names from the frontend JS bundles."""
    chunks = re.findall(r'(/_next/static/chunks/[a-zA-Z0-9/_%-]+\.js)', html)
    operations = set()
    for chunk in set(chunks):
        # Download each chunk once; base_url is the site origin without a trailing slash
        js = httpx.get(f"{base_url}{chunk}", follow_redirects=True, timeout=30).text
        # Find query/mutation declarations
        for m in re.finditer(r'(?:query|mutation)\s+([A-Za-z_]\w*)\s*[\(\{]', js):
            operations.add(m.group(1))
    return operations
What this finds
On Goodreads, scanning 18 JS chunks revealed 38 operations:
Queries (public reads): getReviews, getSimilarBooks, getSearchSuggestions,
getWorksByContributor, getWorksForSeries, getComments, getBookListsOfBook,
getSocialSignals, getWorkCommunityRatings, getWorkCommunitySignals, …
Queries (auth required): getUser, getViewer, getEditions,
getSocialReviews, getWorkSocialReviews, getWorkSocialShelvings, …
Mutations: RateBook, ShelveBook, UnshelveBook, TagBook, Like,
Unlike, CreateComment, DeleteComment
Extracting full query strings
Once you know the operation name, extract the full query with its variable shape:
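import re

def extract_query(js: str, operation_name: str) -> str | None:
    """Return the full query or mutation document for one operation name."""
    m = re.search(rf"(?:query|mutation)\s+{re.escape(operation_name)}\b", js)
    if m is None:
        return None
    # Assumes the document fits in 3000 characters; raise the limit for large queries
    snippet = js[m.start():m.start() + 3000]
    depth = 0
    for i, c in enumerate(snippet):
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                # Bundles store newlines as literal "\n" escapes; restore them
                return snippet[:i + 1].replace("\\n", "\n")
    return None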
This gives you copy-pasteable GraphQL documents you can replay directly via HTTP POST.
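A replay is an ordinary JSON POST. The sketch below only builds the request so it can be inspected before sending. It assumes an AppSync-style endpoint authenticated with an x-api-key header; the function name build_graphql_request, the endpoint, and the key are placeholders.

```python
import json

def build_graphql_request(endpoint: str, api_key: str, query: str,
                          variables: dict, operation_name: str = "") -> dict:
    """Assemble URL, headers, and body for a GraphQL POST without sending it."""
    payload = {"query": query, "variables": variables}
    if operation_name:
        payload["operationName"] = operation_name
    return {
        "url": endpoint,
        "headers": {"Content-Type": "application/json", "x-api-key": api_key},
        "content": json.dumps(payload),
    }

# Placeholder endpoint, key, and query, for illustration only
request = build_graphql_request(
    "https://example.test/graphql", "da2-PLACEHOLDER",
    "query getX($id: ID!) { x(id: $id) { name } }", {"id": "1"}, "getX",
)
# To send it: httpx.post(request["url"], headers=request["headers"], content=request["content"])
```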
Real example: Goodreads
See skills/goodreads/public_graph.py for the full set of proven GraphQL queries
including getReviews, getSimilarBooks, getSearchSuggestions,
getWorksForSeries, and getWorksByContributor.
Public vs Auth Boundary Mapping
After discovering operations, you need to determine which ones work anonymously (with just the public API key) and which require user session auth.
Technique: probe each operation and classify the error
Send each discovered operation to the public endpoint and classify the response:
| Response | Meaning |
|---|---|
| 200 with data | Public, works anonymously |
| 200 with errors: ["Not Authorized to access X on type Y"] | Partially public: the operation works but specific fields are viewer-scoped. Remove the blocked field and retry. |
| 200 with errors: ["MappingTemplate" / VTL error] | Requires auth: the AppSync resolver needs session context to even start |
| 403 or 401 | Requires auth at the transport level |
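The table above can be turned into a small classifier for probe results. This is a sketch under the assumption that error entries are either dicts with message and errorType keys or bare strings; the function names and the returned labels are hypothetical.

```python
def _message(err) -> str:
    """Errors may arrive as dicts with a "message" key or as bare strings."""
    return err.get("message", "") if isinstance(err, dict) else str(err)

def classify_response(status: int, body: dict) -> str:
    """Map an HTTP status and a GraphQL response body to an auth category."""
    if status in (401, 403):
        return "auth-transport"
    errors = body.get("errors") or []
    if not errors:
        return "public" if body.get("data") is not None else "unknown"
    for err in errors:
        is_vtl = isinstance(err, dict) and err.get("errorType") == "MappingTemplate"
        if is_vtl or "MappingTemplate" in _message(err):
            return "auth-resolver"
    if any("Not Authorized to access" in _message(err) for err in errors):
        return "partial"
    return "unknown"
```

Anything that lands in "unknown" deserves a manual look rather than a guess.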
AppSync VTL errors as a signal
AWS AppSync uses Velocity Template Language (VTL) resolvers. When a public request hits an auth-gated resolver, you get a distinctive error:
{
"errorType": "MappingTemplate",
"message": "Error invoking method 'get(java.lang.Integer)' in [Ljava.lang.String; at velocity[line 20, column 55]"
}
This means: “the resolver tried to read user context from the auth token and failed.” It reliably indicates the operation needs authentication.
Field-level authorization
GraphQL auth on AppSync is often field-level, not operation-level. A getReviews
query might work but including viewerHasLiked returns:
{ "message": "Not Authorized to access viewerHasLiked on type Review" }
The fix: remove the viewer-scoped field from your query. The rest works fine publicly.
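The blocked field names can be read straight out of the error messages. The sketch below parses the message format shown above and removes a field from the query text; the removal is deliberately naive and only safe for scalar fields without arguments or sub-selections. Both function names are hypothetical.

```python
import re

NOT_AUTHORIZED = re.compile(r"Not Authorized to access (\w+) on type (\w+)")

def blocked_fields(errors: list) -> set:
    """Collect (type, field) pairs from field-level authorization errors."""
    found = set()
    for err in errors:
        message = err.get("message", "") if isinstance(err, dict) else str(err)
        m = NOT_AUTHORIZED.search(message)
        if m:
            found.add((m.group(2), m.group(1)))
    return found

def drop_scalar_field(query: str, field: str) -> str:
    """Remove a bare scalar field name from a query string (naive text edit)."""
    return re.sub(rf"\b{re.escape(field)}\b\s*", "", query)

# Hypothetical error and query, for illustration only
errors = [{"message": "Not Authorized to access viewerHasLiked on type Review"}]
query = "query getReviews { reviews { text viewerHasLiked } }"
for _type, field in blocked_fields(errors):
    query = drop_scalar_field(query, field)
```

For fields with arguments or nested selections, editing the parsed query document is more reliable than a regex.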
Goodreads boundary scorecard
| Operation | Public? | Notes |
|---|---|---|
| getSearchSuggestions | Yes | Book search by title/author |
| getReviews | Yes | Except viewerHasLiked and viewerRelationshipStatus |
| getSimilarBooks | Yes | |
| getWorksForSeries | Yes | Series book listings |
| getWorksByContributor | Yes | Needs internal contributor ID (not legacy author ID) |
| getUser | No | VTL error, needs session |
| getEditions | No | VTL error, needs session |
| getViewer | No | Viewer-only by definition |
| getWorkSocialShelvings | Partial | May need session for full data |
Heterogeneous Page Stacks
Large sites migrating to modern frontends have mixed page types. You need to identify which pages use which stack and adjust your extraction strategy.
How to identify the stack
| Signal | Stack |
|---|---|
| <script id="__NEXT_DATA__"> in HTML | Next.js (server-rendered, may have Apollo cache) |
| GraphQL/AppSync XHR traffic after page load | Modern frontend with GraphQL backend |
| No __NEXT_DATA__, classic <div> structure, <meta> tags | Legacy server-rendered HTML |
| window.__INITIAL_STATE__ or similar | React SPA with custom state hydration |
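The static signals in this table can be checked automatically when routing pages to extractors. A minimal sketch: it looks only at the HTML, so the XHR-traffic signal is out of scope, and the function name detect_stack and the returned labels are hypothetical.

```python
def detect_stack(html: str) -> str:
    """Guess the page stack from markers present in the raw HTML."""
    if 'id="__NEXT_DATA__"' in html:
        return "nextjs"
    if "window.__INITIAL_STATE__" in html:
        return "react-spa"
    # No hydration payload found: treat as legacy server-rendered HTML
    return "legacy-html"
```

A dispatcher can then map each label to the matching extraction strategy and fall back to HTML scraping only for "legacy-html".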
Goodreads example
| Page type | Stack | Extraction strategy |
|---|---|---|
| Book pages (/book/show/) | Next.js + Apollo + AppSync | __NEXT_DATA__ for main data, GraphQL for reviews/similar |
| Author pages (/author/show/) | Legacy HTML | Regex scraping |
| Profile pages (/user/show/) | Legacy HTML | Regex scraping |
| Search pages (/search) | Legacy HTML | Regex scraping |
Strategy: use structured extraction where available, fall back to HTML only where the site hasn’t migrated yet. As the site migrates pages, move your extractors to match.
Legacy HTML Scraping
When a page has no structured data surface, regex scraping is the fallback.
Principles
- Prefer specific anchors (IDs, class names, itemprop attributes) over positional matching
- Use re.S (dotall) for multi-line HTML patterns
- Extract sections first, then parse within the section to reduce false matches
- Always strip and unescape HTML entities
Section extraction pattern
def section_between(html: str, start_marker: str, end_marker: str) -> str:
    start = html.find(start_marker)
    if start == -1:
        return ""
    end = html.find(end_marker, start)
    return html[start:end] if end != -1 else html[start:]
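Once a section is isolated, the remaining steps are anchoring on a specific attribute and cleaning the text. A minimal sketch of those two steps, assuming the section uses itemprop attributes on simple elements; the function names and the sample markup are hypothetical.

```python
import html as html_lib
import re

def clean_text(fragment: str) -> str:
    """Strip tags, unescape HTML entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", fragment)
    text = html_lib.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

def itemprop_value(section: str, prop: str) -> str:
    """Return the cleaned text of the first element carrying the given itemprop."""
    pattern = rf'<(\w+)[^>]*itemprop="{re.escape(prop)}"[^>]*>(.*?)</\1>'
    m = re.search(pattern, section, re.S)
    return clean_text(m.group(2)) if m else ""

# Hypothetical section markup, for illustration only
section = '<div class="bio">\n<span itemprop="name">Tom &amp;\n Jerry</span></div>'
name = itemprop_value(section, "name")
```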
When to stop scraping
If you find yourself writing regex patterns longer than 3 lines, consider:
- Is there a __NEXT_DATA__ payload you missed?
- Does the page make XHR calls you could replay directly?
- Can you use a headless browser to get the rendered DOM instead?
HTML scraping should be the strategy of last resort, not the first attempt.
Real-World Examples in This Repo
| Skill | Discovery technique | Reference |
|---|---|---|
| skills/exa/ | JS bundle scanning for custom /api/verify-otp endpoint + Navigation API interception for token format + reading NextAuth source for server-side verification logic | exa.py, nextauth.md |
| skills/goodreads/ | Next.js Apollo cache + AppSync GraphQL + JS bundle scanning | public_graph.py |
| skills/austin-boulder-project/ | JS bundle config extraction (API key + namespace) | abp.py |
| skills/claude/ | Session cookie capture via Playwright | claude-login.py |