Reverse Engineering — Content Extraction from HTML

Sometimes there is no API, no GraphQL, no Apollo cache — just server-rendered HTML behind a login wall. This doc covers the patterns for authenticated HTML scraping with agentos.http + lxml.

This is Layer 4 of the reverse-engineering docs:

  • Layer 1: Transport (1-transport) — getting a response at all
  • Layer 2: Discovery (2-discovery) — finding structured data in bundles
  • Layer 3: Auth & Runtime (3-auth) — credentials, sessions, rotating config
  • Layer 4: Content (this file) — extracting data from HTML when there is no API
  • Layer 5: Social Networks (5-social) — modeling people, relationships, and social graphs
  • Layer 6: Desktop Apps (6-desktop-apps) — macOS, Electron, local state, unofficial APIs

When You Need This Layer

Not every operation needs HTML scraping. The same site often has a mix:

| Data type | Approach | Example |
|---|---|---|
| Public catalog data | GraphQL / Apollo cache (Layer 2) | Goodreads book details, reviews, search |
| User-scoped data behind login | HTML scraping (this doc) | Goodreads friends, shelves, user’s books |
| Write operations | API calls with session tokens | Rating a book, adding to shelf |

Rule of thumb: Check for structured APIs first (Layer 2). Only fall back to HTML scraping when the data is exclusively server-rendered behind authentication.


Skill Architecture: Two Modules

When a skill needs both public API access and authenticated scraping, split into two Python modules:

skills/mysite/
  readme.md          # Skill descriptor — operations point to either module
  public_graph.py    # Public API / GraphQL / Apollo — no cookies needed
  web_scraper.py     # Authenticated HTML scraping — needs cookies

The readme declares separate connections for each:

connections:
  graphql:
    description: "Public API — key auto-discovered"
  web:
    description: "User cookies for authenticated data"
    auth:
      type: cookies
      domain: ".mysite.com"
    optional: true
    label: MySite Session

Operations reference the appropriate connection:

operations:
  search_books:        # public
    connection: graphql
    python:
      module: ./public_graph.py
      function: search_books
      args: { query: .params.query }

  list_friends:        # authenticated
    connection: web
    python:
      module: ./web_scraper.py
      function: run_list_friends
      params: true

The entire cookie lifecycle is handled by agentOS. The Python script never touches browser databases or knows which browser the cookies came from.

How it works

  1. Skill declares connection: web with cookies.domain: ".mysite.com"
  2. Executor finds an installed cookie provider (brave-browser, firefox, etc.)
  3. Provider extracts + decrypts cookies from the local browser database
  4. Executor injects them into params as params.auth.cookies (a Cookie: header string)
  5. Python reads them and passes to http.client()

Python side

def _cookie(ctx: dict) -> str | None:
    """Extract cookie header from agentOS-injected auth."""
    c = (ctx.get("auth") or {}).get("cookies") or ""
    return c if c else None

def _require_cookies(cookie_header, params, op_name):
    cookie_header = cookie_header or (params and _cookie(params))
    if not cookie_header:
        raise ValueError(f"{op_name} requires session cookies (connection: web)")
    return cookie_header

params: true context structure

When a Python executor uses params: true, the function receives the full wrapped context as a single params dict:

{
  "params": { "user_id": "123", "page": 1 },
  "auth": { "cookies": "session_id=abc; token=xyz" }
}

Use a helper to read user params from either nesting level:

def _p(d: dict, key: str, default=None):
    """Read from params sub-dict or top-level."""
    p = (d.get("params") or d) if isinstance(d, dict) else {}
    return p.get(key, default) if isinstance(p, dict) else default

HTTP Client: Shared Across Pages

Create one http.client() per operation and reuse it across paginated requests. This keeps the TCP/TLS connection alive and avoids per-request overhead.

from agentos import http

def _client(cookie_header: str | None) -> http.Client:
    headers = http.headers(waf="standard", accept="html")
    if cookie_header:
        headers["Cookie"] = cookie_header
    return http.client(headers=headers)

# Usage
with _client(cookie_header) as client:
    for page in range(1, max_pages + 1):
        status, html = _fetch(client, url.format(page=page))
        if not _has_next(html):
            break

Pagination

Default: fetch all pages

Make page=0 the default, meaning “fetch everything.” When the caller passes page=N, return only that page. This gives callers control without requiring them to implement their own pagination loop.

def list_friends(user_id, page=0, cookie_header=None, *, params=None):
    if page > 0:
        # Single page
        return _parse_one_page(url.format(page=page), cookie_header)

    # Auto-paginate
    all_items = []
    seen = set()
    with _client(cookie_header) as client:
        for p in range(1, MAX_PAGES + 1):
            status, html = _fetch(client, url.format(page=p))
            items = _parse_page(html)
            for item in items:
                if item["id"] not in seen:
                    seen.add(item["id"])
                    all_items.append(item)
            if not items or not _has_next(html):
                break
    return all_items

Detecting “next page”

Look for pagination controls rather than guessing based on result count:

def _has_next(html_text: str) -> bool:
    return bool(
        re.search(r'class="next_page"', html_text) or
        re.search(r'rel="next"', html_text)
    )

Safety limits

Always cap auto-pagination (MAX_PAGES = 20). A user with 5,000 books shouldn’t trigger 200 sequential requests in a single tool call.
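
A capped loop can also report that it hit the limit instead of silently truncating (a sketch — fetch_page is a hypothetical per-page callback, not an agentOS API):

```python
MAX_PAGES = 20  # hard cap on auto-pagination

def fetch_all(fetch_page, max_pages: int = MAX_PAGES) -> dict:
    """Paginate until a page comes back empty, or the cap is reached."""
    items = []
    truncated = True  # assume we hit the cap unless we stop naturally
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            truncated = False  # natural stop: an empty page
            break
        items.extend(batch)
    return {"items": items, "truncated": truncated}
```

Surfacing the truncated flag lets the caller decide whether to resume with explicit page=N calls.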

Deduplication

Sites often include the user’s own profile in friend lists, or repeat items across page boundaries. Always deduplicate by ID:

seen: set[str] = set()
for item in page_items:
    if item["id"] not in seen:
        seen.add(item["id"])
        all_items.append(item)

HTML Parsing Patterns

Use data attributes over visible text

Data attributes are more stable than CSS classes or visible text:

# Good: data-rating is the source of truth
stars = row.select_one(".stars[data-rating]")
rating = int(stars["data-rating"]) if stars else None

# Bad: fragile, depends on star rendering
rating_el = row.select_one(".staticStars")

Fallback selector chains

Large sites use different HTML structures across pages, A/B tests, and regions. Instead of matching a single selector, define a priority-ordered list and take the first match. This makes parsers resilient to markup changes.

ORDER_ID_SEL = [
    "[data-component='orderId']",
    ".order-date-invoice-item :is(bdi, span)[dir='ltr']",
    ".yohtmlc-order-id :is(bdi, span)[dir='ltr']",
    ":is(bdi, span)[dir='ltr']",
]

ITEM_PRICE_SEL = [
    ".a-price .a-offscreen",
    "[data-component='unitPrice'] .a-text-price :not(.a-offscreen)",
    ".yohtmlc-item .a-color-price",
]

def _select_one(tag, selectors: list[str]):
    for sel in selectors:
        result = tag.select_one(sel)
        if result:
            return result
    return None

def _select(tag, selectors: list[str]) -> list:
    for sel in selectors:
        result = tag.select(sel)
        if result:
            return result
    return []

Put the most specific, modern selector first (e.g. data-component attributes) and the broadest fallback last. This pattern works especially well for sites like Amazon that ship multiple front-end variants simultaneously.

Reference: skills/amazon/amazon.py — all order/item selectors use this pattern.

Structured table pages (Goodreads /review/list/)

Many sites render user data in HTML tables with class-coded columns. Each <td> has a field class you can target directly:

rows = soup.select("tr.bookalike")
for row in rows:
    book_id = row.get("data-resource-id")
    title = row.select_one("td.field.title a").get("title")
    author = row.select_one("td.field.author a").get_text(strip=True)
    rating = row.select_one(".stars[data-rating]")["data-rating"]
    date_added = row.select_one("td.field.date_added span[title]")["title"]

Extraction helpers

Write small focused helpers for each field type rather than inline parsing:

def _extract_date(row, field_class):
    td = row.select_one(f"td.field.{field_class}")
    if not td:
        return None
    span = td.select_one("span[title]")
    if span:
        return span.get("title") or span.get_text(strip=True)
    return None

def _extract_rating(row):
    stars = row.select_one(".stars[data-rating]")
    if stars:
        val = int(stars.get("data-rating", "0"))
        return val if val > 0 else None
    return None

Login detection and SESSION_EXPIRED

Check early and fail fast when cookies are invalid. Use the SESSION_EXPIRED: prefix convention so the engine can automatically retry with a different cookie provider (see connections.md):

def _is_login_redirect(resp, body: str) -> bool:
    if "ap/signin" in str(resp.url):
        return True
    if 'name="signIn"' in body[:5000]:  # Amazon's <form name="signIn">
        return True
    return "ap_email" in body[:3000] or "signIn" in body[:3000]

# In any authenticated operation:
if _is_login_redirect(resp, body):
    raise RuntimeError(
        "SESSION_EXPIRED: Amazon redirected to login — session cookies are expired or invalid."
    )

The SESSION_EXPIRED: prefix triggers the engine’s provider-exclusion retry: the engine marks the current cookie provider as stale, excludes it, and retries with the next-best provider. This handles the common case where one browser has stale cookies but another has a fresh session.

Convention: SESSION_EXPIRED: <human-readable reason> for stale auth. Any other exception message means a real failure — the engine won’t retry with different credentials.
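
The prefix is easy to keep honest with a single constant (a sketch of the classification rule — the engine's actual retry logic lives in the executor, not here):

```python
SESSION_EXPIRED = "SESSION_EXPIRED:"

def is_stale_auth(exc: Exception) -> bool:
    """True only for errors that should trigger provider-exclusion retry."""
    return str(exc).startswith(SESSION_EXPIRED)
```

Any error message without the prefix is treated as a real failure and surfaces to the caller unchanged.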


AJAX Endpoints for Dynamic Content

Not everything is in the HTML. Many sites load sections dynamically via internal AJAX endpoints that return HTML fragments or JSON. These are often easier to parse than the full page and more stable across redesigns.

Discovering AJAX endpoints

Use Playwright’s capture_network while interacting with the page:

capture_network { url: "https://www.amazon.com/auto-deliveries", pattern: "**/ajax/**", wait: 5000 }

Or inject a fetch interceptor and click the relevant UI element — the interceptor captures the endpoint, params, and response shape.

Example: Amazon Subscribe & Save

Amazon’s subscription management page loads content via an AJAX endpoint that returns a JSON payload with embedded HTML:

from lxml import html as lhtml

resp = client.get(
    f"{BASE}/auto-deliveries/ajax/subscriptionList",
    params={"pageNumber": 0},
    headers={
        "X-Requested-With": "XMLHttpRequest",
        "Referer": f"{BASE}/auto-deliveries",
    },
)
data = resp.json()
html_fragment = data.get("subscriptionListHtml", "")
doc = lhtml.fromstring(html_fragment)

Key headers for AJAX: Always include X-Requested-With: XMLHttpRequest and a valid Referer. Many servers check these to distinguish AJAX from direct navigation.

When to look for AJAX endpoints

| Signal | Likely AJAX? |
|---|---|
| Content appears after page load (spinner, lazy-load) | Yes |
| URL changes without full page reload | Yes — check for pushState + fetch |
| Tab/section switching within a page | Yes — each tab may have its own endpoint |
| Data differs between “View Source” and DevTools Elements | Yes — JS loaded it after |

Reference: skills/amazon/amazon.py subscriptions() — AJAX endpoint for Subscribe & Save management.


Adapter Null Safety

When a skill’s adapter maps nested collections (like shelves on an account), not every operation returns those nested fields. Use a jaq // [] fallback to prevent null-iteration errors:

adapters:
  account:
    id: .user_id
    name: .name
    shelves:
      shelf[]:
        _source: '.shelves // []'    # won't blow up when shelves is absent
        id: .shelf_id
        name: .name

Data Validation Checklist

After building a scraper, cross-reference against the live site:

| Check | How |
|---|---|
| Total count | Compare your result count to what the site header says (“Showing 1-30 of 69”) |
| Unique IDs | Deduplicate and compare — off-by-one usually means a deleted/deactivated account |
| Rating counts | Count items with non-null ratings vs. the site’s “X ratings” display |
| Review counts | Count items with actual review text vs. the site’s “X reviews” display |
| Field completeness | Spot-check dates, ratings, authors against individual entries on the site |
| Shelf math | Sum shelf counts and compare to “All (N)” — they may diverge (Goodreads shows 273 but serves 301) |
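
The total-count check is easy to automate (a sketch — the “Showing X-Y of N” regex is an assumption; adjust it to the site's actual header text):

```python
import re

def check_total(html_text: str, scraped_count: int):
    """Compare scraped count to the site's own result-count header."""
    m = re.search(r"Showing\s+\d+\s*-\s*\d+\s+of\s+([\d,]+)", html_text)
    if not m:
        return None  # header not found; nothing to validate against
    expected = int(m.group(1).replace(",", ""))
    return {"expected": expected, "scraped": scraped_count,
            "ok": expected == scraped_count}
```

Returning None (rather than raising) when the header is missing keeps the check usable as an optional assertion during development.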

Testing Methodology

1. Save cookies locally for development

Extract cookies once and save to a JSON file for local testing:

# From agentOS:
# run({ skill: "brave-browser", tool: "cookie_get", params: { domain: ".mysite.com" } })

# Or manually build the file:
# scripts/test_cookies.json
[
  {"name": "session_id", "value": "abc123", "domain": ".mysite.com"},
  ...
]

2. Test parsers against real pages

Hit the live site with agentos.http and verify parsing before wiring to agentOS:

with open("scripts/test_cookies.json") as f:
    cookies = json.load(f)
cookie_header = "; ".join(f'{c["name"]}={c["value"]}' for c in cookies)

friends = list_friends("12345", cookie_header=cookie_header)
print(f"Got {len(friends)} friends")

3. Test through agentOS MCP

Once local parsing works, test the full pipeline:

npm run mcp:call -- --skill mysite --tool list_friends \
  --params '{"user_id":"12345"}' --verbose

Operations that require live cookies should use test.mode: write so they’re skipped in automated smoke tests but can be run manually with --write:

test:
  mode: write
  fixtures:
    user_id: "12345"

Real-World Examples

| Skill | What’s scraped | Key patterns | Reference |
|---|---|---|---|
| skills/amazon/ | Orders, products, subscriptions, account identity — all from server-rendered HTML and AJAX | Fallback selector chains, Siege cookie stripping, session warming, AJAX endpoints, SESSION_EXPIRED convention | amazon.py |
| skills/goodreads/ | People (friends, following, followers), books, reviews, groups, quotes, rich profiles — all from HTML | Structured table parsing, data attributes, pagination, dedup | web_scraper.py |

For social-network-specific modeling patterns (person vs account, relationship types, cross-platform identity), see 5-social.