Modularise URL retrieval with Cloudflare Browser Rendering support #73

Open
paskal wants to merge 20 commits into master from modularise-retrieval

Conversation

@paskal (Member) commented Mar 29, 2026

Summary

Addresses review feedback on radio-t/super-bot#156: content extraction improvements belong in ukeeper-readability (the extraction layer), not in super-bot (the consumer).

Cloudflare Browser Rendering (CF BR) is powerful for JS-gated and SPA-style pages, but it's slow (~3–5s per fetch), has a narrow sweet spot (no help against DataDome/Turnstile-protected sites, same result as plain HTTP on regular content), and the free tier is 1 req/10s with a 10 min/day browser budget. To stay cost-effective the default path remains plain HTTP — CF is opt-in, either per-domain via rules or globally via a flag.

Retriever interface

  • New extractor.Retriever interface — Retrieve(ctx, url) (*RetrieveResult, error) returning raw bytes, final URL and headers; existing parsing pipeline is unchanged
  • HTTPRetriever — extracts the current fetch logic (Safari UA, redirect following, connection reuse) into a reusable type with a cached client
  • CloudflareRetriever — POSTs to /accounts/{id}/browser-rendering/content, accepts both {success,result} JSON and raw HTML responses
  • UReadability.Retriever stays as the default fetcher; normalizeLinks signature simplified from *http.Request to *url.URL
  • Backward compatible: UReadability{} without Retriever set falls back to a cached HTTPRetriever

Routing (cost-effective opt-in)

Default behaviour unchanged — every request goes through HTTPRetriever unless something explicitly routes it elsewhere. CF is off by default and opt-in at two levels:

  1. Per-rule: the Rule.UseCloudflare field (exposed as a checkbox in the rule editor UI). When a rule for the requested domain has this set, that request uses the CF retriever. Good for the handful of domains that actually need it (Reuters, X.com, etc.).
  2. Global: the --cf-route-all / CF_ROUTE_ALL flag, default false. When set, every request routes through CF. Intended for debugging and isolated deployments, not day-to-day use.

UReadability.pickRetriever(rule) decides per request:

CFRetriever == nil             → default HTTP
CFRouteAll == true             → CF
rule.UseCloudflare == true     → CF
otherwise                      → default HTTP
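The decision table translates into a small method. A sketch with stand-in types follows; the real Rule and UReadability structs carry more fields, and only the routing logic is shown:

```go
package main

import "fmt"

// stand-in types: the real Retriever, Rule and UReadability in the
// extractor package carry more fields; only the routing logic is shown.
type Retriever interface{ Name() string }

type named string

func (n named) Name() string { return string(n) }

type Rule struct{ UseCloudflare bool }

type UReadability struct {
	CFRetriever Retriever // nil when Cloudflare creds are not configured
	CFRouteAll  bool      // --cf-route-all / CF_ROUTE_ALL
	httpDefault Retriever // cached HTTPRetriever
}

// pickRetriever mirrors the decision table above: no CF retriever
// means HTTP; route-all means CF; per-rule opt-in means CF; otherwise HTTP.
func (u *UReadability) pickRetriever(rule *Rule) Retriever {
	if u.CFRetriever == nil {
		return u.httpDefault
	}
	if u.CFRouteAll {
		return u.CFRetriever
	}
	if rule != nil && rule.UseCloudflare {
		return u.CFRetriever
	}
	return u.httpDefault
}

func main() {
	u := &UReadability{CFRetriever: named("cf"), httpDefault: named("http")}
	fmt.Println(u.pickRetriever(&Rule{UseCloudflare: true}).Name()) // cf
	fmt.Println(u.pickRetriever(&Rule{}).Name())                    // http
}
```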

extractWithRules now looks up the rule by domain once up front (instead of twice — once for routing, once in getContent).

429 retries (holding the caller's connection)

CF free tier throttles at 1 req / 10s and returns HTTP 429 with {"success":false,"errors":[{"code":2001,"message":"Rate limit exceeded"}]}. The naive behaviour (fail the extraction on the first 429) made CF unusable in practice — during local testing the second sequential request almost always hit it.

CloudflareRetriever.Retrieve now wraps a single-attempt doRetrieve in a retry loop:

  • On 429 (and only 429): back off and retry, keeping the caller's HTTP connection open
  • Default MaxRetries = 2, default RetryDelay = 11s (CF free tier window is 10s; small headroom)
  • Exponential backoff: 11s → 22s, capped at 30s per step
  • Respects Retry-After header (delta-seconds or HTTP date) when present
  • Aborts the backoff immediately when the caller's ctx is canceled — so an upstream timeout terminates cleanly instead of hanging
  • MaxRetries = -1 disables retries entirely
  • Non-429 errors are returned immediately without retrying

Timeout budget

Worst case with the defaults: 30s (request) + 11s + 30s + 22s + 30s ≈ 123s. That's long, so:

  • Server-side: added WriteTimeout: 150s on the HTTP server in rest.Server.Run — was previously unset, which allowed handlers to run indefinitely. 150s caps runaway handlers with headroom for the worst-case CF path.
  • Caller-side: callers of /api/content/v1/parser should allow ~150s timeout when the service is running with CF enabled. Upstream reverse proxies (nginx proxy_read_timeout default 60s, AWS ALB default 60s) need to be bumped accordingly or the caller will see their own timeout before the retry finishes.
  • Without --cf-route-all and without any rule marked use_cloudflare, nothing changes — the service still responds in the same time it always did.

Other improvements in touched code

  • checkToken helper with subtle.ConstantTimeCompare — extracted from extractArticleEmulateReadability; the old code compared tokens with != which is not constant-time
  • Fixed the %b format verb in text.go logging (should have been %v); switched from stdlib log to lgr for consistency
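As a sketch of why the checkToken change matters: a != comparison on strings can return at the first mismatching byte, leaking how much of the token matched through response timing, while subtle.ConstantTimeCompare always examines every byte. The helper below is illustrative, not the project's exact code:

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// checkToken compares secrets in constant time: unlike !=, the result
// does not depend on how many leading bytes of the two tokens match.
func checkToken(got, want string) bool {
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	fmt.Println(checkToken("secret", "secret")) // true
	fmt.Println(checkToken("secret", "guess!")) // false
}
```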

Tested

End-to-end test matrix (plain HTTP on :8078 vs CF on :8079 against the same mongo, full results in the test summary comment):

| URL | HTTP | CF | winner |
|-----|------|----|--------|
| stevehanov.ca blog post | 9039 B | 9039 B | tie |
| news.ycombinator.com item | 862 B | 862 B | tie |
| blogs.windows.com post | 12716 B | 12716 B | tie |
| nesbitt.io post | 5332 B | 5332 B | tie |
| nytimes.com/live/... | JS wall | DataDome CAPTCHA | both fail |
| x.com status | error page | tweet in title | CF |
| reuters.com article | 43 B "enable JS" | 1529 B full article | CF |

Full go test -race ./... passes; golangci-lint run clean.

Config reference

| Flag | Env | Default | Description |
|------|-----|---------|-------------|
| --cf-account-id | CF_ACCOUNT_ID | | Cloudflare account ID for Browser Rendering API |
| --cf-api-token | CF_API_TOKEN | | Cloudflare API token with Browser Rendering Edit permission |
| --cf-route-all | CF_ROUTE_ALL | false | Route every request through Cloudflare |

paskal added 17 commits March 29, 2026 20:28
Extract URL fetching abstraction from the inline HTTP logic in extractWithRules.
Defines Retriever interface, RetrieveResult struct, and HTTPRetriever with
Safari user-agent, redirect following, and timeout support. Includes moq
generate directive and comprehensive tests.

Generate moq mock for Retriever interface as a test-only file
(retriever_mock_test.go) instead of mocks/ subpackage to avoid
import cycle (mocks/retriever.go would import extractor, cycling
with readability_test.go). Run gofmt on all modified files, zero
lint issues.

- fix err shadowing in deferred Body.Close() in both retrievers (use closeErr)
- handle Cloudflare API success=false response explicitly instead of treating JSON error as HTML
- truncate CF API error body to 512 bytes in error messages
- add comment documenting CF retriever URL limitation (no final URL after JS redirects)
- fix pre-existing %b format verb in text.go logging (should be %v)
- replace network-dependent TestCloudflareRetriever_DefaultBaseURL with local httptest
- add TestCloudflareRetriever_SuccessFalse for the new success=false handling
- add TestExtractWithCustomRetriever integration test using RetrieverMock
- remove duplicate plan file from docs/plans/ (already in completed/)
- update README.md with new CF CLI flags and feature description
- update CLAUDE.md CI bullet to reflect split docker.yml workflow

POST /api/extract never had token auth in the original code.
The checkToken refactoring should only apply to the legacy
/content/v1/parser endpoint which always had it.
@paskal (Member Author) commented Mar 29, 2026

This PR addresses the review feedback on radio-t/super-bot#156 — the content extraction improvement belongs in ukeeper-readability (the extraction layer), not in super-bot. With the Retriever interface and Cloudflare Browser Rendering support here, super-bot#156 can be closed.

@paskal (Member Author) commented Apr 12, 2026

tested end-to-end locally against both retrievers (two containers off the same compose, HTTPRetriever on :8078 vs CloudflareRetriever on :8079, same mongo). all tests pass, lint clean.

ran the same URLs through both to see where Browser Rendering actually helps:

| URL | HTTP | CF | winner |
|-----|------|----|--------|
| stevehanov.ca/blog/... | 9039 B | 9039 B | tie |
| news.ycombinator.com/item?id=47736555 | 862 B | 862 B | tie |
| blogs.windows.com/... | 12716 B | 12716 B | tie |
| nesbitt.io/2026/03/06/gitlocal.html | 5332 B | 5332 B | tie |
| nytimes.com/live/... | 43 B (JS wall) | 0 B (DataDome CAPTCHA) | both fail |
| x.com/bcherny/status/... | 161 B error page | tweet text in title via og tags | CF |
| reuters.com/business/... | 43 B "enable JS" | 1529 B full article | CF |

findings:

  1. CF delivers on JS-gated sites without headless-browser fingerprinting — Reuters is the clean example: HTTP hits the noscript wall, CF renders the full article. x.com partially works (real tweet ends up in title).
  2. DataDome / advanced anti-bot kills it — NYT serves a DataDome CAPTCHA iframe to the CF headless browser. CF gets past the noscript layer but runs into a different defence.
  3. regular content sites show no difference: go-readability handles them fine over plain HTTP, no reason to pay the ~4.5s CF round-trip.
  4. free-tier rate limit (1 req / 10s) is real — hit 429 immediately on the second sequential request. not a blocker for the PR but worth knowing for production use.
  5. nit on the PR description: says checkToken "now also protects POST /extract" but extractArticle at rest/server.go:154 doesn't call it. pre-existing behaviour, just the bullet is misleading.

the Retriever interface abstraction is clean, backward-compatible (nil → cached HTTPRetriever), tests cover both implementations with httptest mocks. normalizeLinks(*url.URL) cleanup is a nice bonus.

the CF path is narrow-use but genuinely useful for the sites it covers. maybe I should redo it to be used only on retrieval failures.

paskal added 2 commits April 12, 2026 11:43
HTTP retriever stays the default. Cloudflare is now opt-in at two levels:

- per-rule: new Rule.UseCloudflare field (checkbox in rule editor UI)
  routes requests for that domain through Cloudflare Browser Rendering
- global: --cf-route-all / CF_ROUTE_ALL flag (default false) routes every
  request through Cloudflare

UReadability.pickRetriever(rule) picks: CFRouteAll > rule.UseCloudflare >
default HTTP. extractWithRules now resolves the rule once upfront and
shares it between routing and getContent (was looked up twice).

CloudflareRetriever retries on HTTP 429 with exponential backoff (base 11s,
max 2 retries by default → worst-case 33s of backoff), honours Retry-After
header, and aborts immediately on caller context cancel. MaxRetries=-1
disables retries.

Added WriteTimeout=150s on the HTTP server — was previously unset, allowing
handlers to run forever. 150s covers the worst-case CF path (up to ~123s).
@umputun (Collaborator) left a comment


tests pass, lint clean, race detector clean. the Retriever interface follows the existing consumer-side pattern nicely and the opt-in routing makes sense for an expensive backend.

couple things to fix/consider:

  1. TestGetContentCustom (readability_test.go:346) - the RulesMock.GetFunc setup is dead code now since getContent no longer does rule lookups. the test silently runs the general parser while looking like it tests custom rule parsing. should either pass the rule directly to getContent or rewrite to go through extractWithRules

  2. MaxRetries zero-value trap (retriever.go:145) - MaxRetries: 0 silently gives 2 retries because of the default substitution. someone setting MaxRetries: 0 in struct init would expect no retries. consider using a pointer (*int) to distinguish "not set" from "explicitly zero", or just make 0 mean 0 and require explicit MaxRetries: 2 in production setup

  3. --cf-route-all / CF_ROUTE_ALL is missing from the README config table - should be listed alongside cf-account-id and cf-api-token

not blocking, just noting - the WriteTimeout: 150s applies to all handlers (static files, rule CRUD, health ping), not just the CF extraction path. might be worth a comment explaining why it's global, or wrapping just the extraction endpoints with http.TimeoutHandler
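The per-route alternative mentioned here could look like the sketch below; the paths and handler bodies are placeholders, not the project's actual routes:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func extract(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprint(w, "ok") // stand-in for the long-running extraction handler
}

// routes wraps only the slow extraction endpoint in http.TimeoutHandler,
// leaving fast handlers (static files, CRUD, ping) unwrapped so the
// server-wide WriteTimeout could stay tight.
func routes() http.Handler {
	mux := http.NewServeMux()
	mux.Handle("/api/content/v1/parser", http.TimeoutHandler(
		http.HandlerFunc(extract), 150*time.Second, "extraction timed out"))
	mux.HandleFunc("/ping", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "pong") // fast handlers stay unwrapped
	})
	return mux
}

func main() {
	fmt.Println(routes() != nil) // true
}
```

One caveat with http.TimeoutHandler: it buffers the response, so it only suits endpoints that write their body at the end, which the extraction handler does.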

- TestGetContentCustom: pass the rule directly to getContent so it actually
  exercises the custom-rule path; the RulesMock.GetFunc setup was dead code
  after getContent stopped looking up rules
- CloudflareRetriever.MaxRetries: remove default substitution for the zero
  value — 0 now means "no retries" as expected. Callers opt into retries by
  setting MaxRetries explicitly; main.go uses the exported CFDefaultMaxRetries
  constant (2)
- README: add cf-route-all to the config table and rewrite the Cloudflare
  section to reflect the opt-in routing model + 429 retry behaviour
- rest.Server.Run: expand the WriteTimeout comment to explain why the 150s
  ceiling is server-wide rather than per-route via http.TimeoutHandler
@paskal (Member Author) commented Apr 12, 2026

thanks for the review. all four points addressed in 6403f00:

  1. TestGetContentCustom dead mock — the test now passes the rule directly to getContent instead of relying on the stale RulesMock.GetFunc. exercises the custom-rule path visibly.

  2. MaxRetries: 0 zero-value trap — removed the default substitution. 0 now means "no retries" as one would expect; negative values are still clamped to 0 defensively. production setup explicitly wires extractor.CFDefaultMaxRetries (=2) from main.go. the disabled-retries test is now table-driven over both 0 and -1.

  3. README missing --cf-route-all — added to the config table and rewrote the "Cloudflare Browser Rendering (optional)" section to explain the opt-in routing model (per-rule vs route-all) and the 429 retry behaviour. the previous text still described the old "replace HTTP entirely when creds are set" behaviour.

  4. server-wide WriteTimeout scope — expanded the comment above it to explain why it's global (the other handlers are all sub-second, so a single server-wide ceiling is simpler than wrapping the extraction routes with http.TimeoutHandler). left a hint to switch to per-route if anything else becomes long-running.

tests + lint green.

paskal requested a review from umputun April 12, 2026 19:36