Modularise URL retrieval with Cloudflare Browser Rendering support #73

Open
paskal wants to merge 20 commits into master from modularise-retrieval

Conversation

@paskal (Member) commented Mar 29, 2026

Summary

Addresses review feedback on radio-t/super-bot#156: content extraction improvements belong in ukeeper-readability (the extraction layer), not in super-bot (the consumer).

Cloudflare Browser Rendering (CF BR) is powerful for JS-gated and SPA-style pages, but it's slow (~3–5s per fetch), has a narrow sweet spot (no help against DataDome/Turnstile-protected sites, same result as plain HTTP on regular content), and the free tier is 1 req/10s with a 10 min/day browser budget. To stay cost-effective the default path remains plain HTTP — CF is opt-in, either per-domain via rules or globally via a flag.

Retriever interface

  • New extractor.Retriever interface — Retrieve(ctx, url) (*RetrieveResult, error) returning raw bytes, final URL and headers; existing parsing pipeline is unchanged
  • HTTPRetriever — extracts the current fetch logic (Safari UA, redirect following, connection reuse) into a reusable type with a cached client
  • CloudflareRetriever — POSTs to /accounts/{id}/browser-rendering/content, accepts both {success,result} JSON and raw HTML responses
  • UReadability.Retriever stays as the default fetcher; normalizeLinks signature simplified from *http.Request to *url.URL
  • Backward compatible: UReadability{} without Retriever set falls back to a cached HTTPRetriever

Routing (cost-effective opt-in)

Default behaviour unchanged — every request goes through HTTPRetriever unless something explicitly routes it elsewhere. CF is off by default and opt-in at two levels:

  1. Per-rule: the Rule.UseCloudflare field (exposed as a checkbox in the rule editor UI). When a rule for the requested domain has this set, that request uses the CF retriever. Good for the handful of domains that actually need it (Reuters, X.com, etc.).
  2. Global: the --cf-route-all / CF_ROUTE_ALL flag, default false. When set, every request routes through CF. Intended for debugging and isolated deployments, not day-to-day use.

UReadability.pickRetriever(rule) decides per request:

CFRetriever == nil             → default HTTP
CFRouteAll == true             → CF
rule.UseCloudflare == true     → CF
otherwise                      → default HTTP
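The decision table translates into a small method. A sketch with stand-in types follows; the real Rule and UReadability structs carry more fields, and only the routing logic is shown:

```go
package main

import "fmt"

// stand-in types: the real Retriever, Rule and UReadability in the
// extractor package carry more fields; only the routing logic is shown.
type Retriever interface{ Name() string }

type named string

func (n named) Name() string { return string(n) }

type Rule struct{ UseCloudflare bool }

type UReadability struct {
	CFRetriever Retriever // nil when Cloudflare creds are not configured
	CFRouteAll  bool      // --cf-route-all / CF_ROUTE_ALL
	httpDefault Retriever // cached HTTPRetriever
}

// pickRetriever mirrors the decision table above: no CF retriever
// means HTTP; route-all means CF; per-rule opt-in means CF; otherwise HTTP.
func (u *UReadability) pickRetriever(rule *Rule) Retriever {
	if u.CFRetriever == nil {
		return u.httpDefault
	}
	if u.CFRouteAll {
		return u.CFRetriever
	}
	if rule != nil && rule.UseCloudflare {
		return u.CFRetriever
	}
	return u.httpDefault
}

func main() {
	u := &UReadability{CFRetriever: named("cf"), httpDefault: named("http")}
	fmt.Println(u.pickRetriever(&Rule{UseCloudflare: true}).Name()) // cf
	fmt.Println(u.pickRetriever(&Rule{}).Name())                    // http
}
```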

extractWithRules now looks up the rule by domain once up front (instead of twice — once for routing, once in getContent).

429 retries (holding the caller's connection)

CF free tier throttles at 1 req / 10s and returns HTTP 429 with {"success":false,"errors":[{"code":2001,"message":"Rate limit exceeded"}]}. The naive behaviour (fail the extraction on the first 429) made CF unusable in practice — during local testing the second sequential request almost always hit it.

CloudflareRetriever.Retrieve now wraps a single-attempt doRetrieve in a retry loop:

  • On 429 (and only 429): back off and retry, keeping the caller's HTTP connection open
  • Default MaxRetries = 2, default RetryDelay = 11s (CF free tier window is 10s; small headroom)
  • Exponential backoff: 11s → 22s, capped at 30s per step
  • Respects Retry-After header (delta-seconds or HTTP date) when present
  • Aborts the backoff immediately when the caller's ctx is canceled — so an upstream timeout terminates cleanly instead of hanging
  • MaxRetries = -1 disables retries entirely
  • Non-429 errors are returned immediately without retrying

Timeout budget

Worst case with the defaults: 30s (request) + 11s + 30s + 22s + 30s ≈ 123s. That's long, so:

  • Server-side: added WriteTimeout: 150s on the HTTP server in rest.Server.Run — was previously unset, which allowed handlers to run indefinitely. 150s caps runaway handlers with headroom for the worst-case CF path.
  • Caller-side: callers of /api/content/v1/parser should allow ~150s timeout when the service is running with CF enabled. Upstream reverse proxies (nginx proxy_read_timeout default 60s, AWS ALB default 60s) need to be bumped accordingly or the caller will see their own timeout before the retry finishes.
  • Without --cf-route-all and without any rule marked use_cloudflare, nothing changes — the service still responds in the same time it always did.

Other improvements in touched code

  • checkToken helper with subtle.ConstantTimeCompare — extracted from extractArticleEmulateReadability; the old code compared tokens with != which is not constant-time
  • Fixed the %b format verb in text.go logging (should have been %v); switched from stdlib log to lgr for consistency
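As a sketch of why the checkToken change matters: a != comparison on strings can return at the first mismatching byte, leaking how much of the token matched through response timing, while subtle.ConstantTimeCompare always examines every byte. The helper below is illustrative, not the project's exact code:

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// checkToken compares secrets in constant time: unlike !=, the result
// does not depend on how many leading bytes of the two tokens match.
func checkToken(got, want string) bool {
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	fmt.Println(checkToken("secret", "secret")) // true
	fmt.Println(checkToken("secret", "guess!")) // false
}
```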

Tested

End-to-end test matrix (plain HTTP on :8078 vs CF on :8079 against the same mongo, full results in the test summary comment):

| URL | HTTP | CF | winner |
|-----|------|----|--------|
| stevehanov.ca blog post | 9039 B | 9039 B | tie |
| news.ycombinator.com item | 862 B | 862 B | tie |
| blogs.windows.com post | 12716 B | 12716 B | tie |
| nesbitt.io post | 5332 B | 5332 B | tie |
| nytimes.com/live/... | JS wall | DataDome CAPTCHA | both fail |
| x.com status | error page | tweet in title | CF |
| reuters.com article | 43 B "enable JS" | 1529 B full article | CF |

Full go test -race ./... passes; golangci-lint run clean.

Config reference

| Flag | Env | Default | Description |
|------|-----|---------|-------------|
| --cf-account-id | CF_ACCOUNT_ID | | Cloudflare account ID for Browser Rendering API |
| --cf-api-token | CF_API_TOKEN | | Cloudflare API token with Browser Rendering Edit permission |
| --cf-route-all | CF_ROUTE_ALL | false | Route every request through Cloudflare |

paskal added 17 commits March 29, 2026 20:28
Extract URL fetching abstraction from the inline HTTP logic in extractWithRules.
Defines Retriever interface, RetrieveResult struct, and HTTPRetriever with
Safari user-agent, redirect following, and timeout support. Includes moq
generate directive and comprehensive tests.

Generate moq mock for Retriever interface as a test-only file
(retriever_mock_test.go) instead of mocks/ subpackage to avoid
import cycle (mocks/retriever.go would import extractor, cycling
with readability_test.go). Run gofmt on all modified files, zero
lint issues.

- fix err shadowing in deferred Body.Close() in both retrievers (use closeErr)
- handle Cloudflare API success=false response explicitly instead of treating JSON error as HTML
- truncate CF API error body to 512 bytes in error messages
- add comment documenting CF retriever URL limitation (no final URL after JS redirects)
- fix pre-existing %b format verb in text.go logging (should be %v)
- replace network-dependent TestCloudflareRetriever_DefaultBaseURL with local httptest
- add TestCloudflareRetriever_SuccessFalse for the new success=false handling
- add TestExtractWithCustomRetriever integration test using RetrieverMock
- remove duplicate plan file from docs/plans/ (already in completed/)
- update README.md with new CF CLI flags and feature description
- update CLAUDE.md CI bullet to reflect split docker.yml workflow

POST /api/extract never had token auth in the original code.
The checkToken refactoring should only apply to the legacy
/content/v1/parser endpoint which always had it.
@paskal (Member Author) commented Mar 29, 2026

This PR addresses the review feedback on radio-t/super-bot#156 — the content extraction improvement belongs in ukeeper-readability (the extraction layer), not in super-bot. With the Retriever interface and Cloudflare Browser Rendering support here, super-bot#156 can be closed.

@paskal (Member Author) commented Apr 12, 2026

tested end-to-end locally against both retrievers (two containers off the same compose, HTTPRetriever on :8078 vs CloudflareRetriever on :8079, same mongo). all tests pass, lint clean.

ran the same URLs through both to see where Browser Rendering actually helps:

| URL | HTTP | CF | winner |
|-----|------|----|--------|
| stevehanov.ca/blog/... | 9039 B | 9039 B | tie |
| news.ycombinator.com/item?id=47736555 | 862 B | 862 B | tie |
| blogs.windows.com/... | 12716 B | 12716 B | tie |
| nesbitt.io/2026/03/06/gitlocal.html | 5332 B | 5332 B | tie |
| nytimes.com/live/... | 43 B (JS wall) | 0 B (DataDome CAPTCHA) | both fail |
| x.com/bcherny/status/... | 161 B error page | tweet text in title via og tags | CF |
| reuters.com/business/... | 43 B "enable JS" | 1529 B full article | CF |

findings:

  1. CF delivers on JS-gated sites without headless-browser fingerprinting — Reuters is the clean example: HTTP hits the noscript wall, CF renders the full article. x.com partially works (real tweet ends up in title).
  2. DataDome / advanced anti-bot kills it — NYT serves a DataDome CAPTCHA iframe to the CF headless browser. CF gets past the noscript layer but runs into a different defence.
  3. regular content sites show no difference: go-readability handles them fine over plain HTTP, no reason to pay the ~4.5s CF round-trip.
  4. free-tier rate limit (1 req / 10s) is real — hit 429 immediately on the second sequential request. not a blocker for the PR but worth knowing for production use.
  5. nit on the PR description: says checkToken "now also protects POST /extract" but extractArticle at rest/server.go:154 doesn't call it. pre-existing behaviour, just the bullet is misleading.

the Retriever interface abstraction is clean, backward-compatible (nil → cached HTTPRetriever), tests cover both implementations with httptest mocks. normalizeLinks(*url.URL) cleanup is a nice bonus.

the CF path is narrow-use but genuinely useful for the sites it covers. maybe I should redo it to be used only on retrieval failures.

paskal added 2 commits April 12, 2026 11:43
HTTP retriever stays the default. Cloudflare is now opt-in at two levels:

- per-rule: new Rule.UseCloudflare field (checkbox in rule editor UI)
  routes requests for that domain through Cloudflare Browser Rendering
- global: --cf-route-all / CF_ROUTE_ALL flag (default false) routes every
  request through Cloudflare

UReadability.pickRetriever(rule) picks: CFRouteAll > rule.UseCloudflare >
default HTTP. extractWithRules now resolves the rule once upfront and
shares it between routing and getContent (was looked up twice).

CloudflareRetriever retries on HTTP 429 with exponential backoff (base 11s,
max 2 retries by default → worst-case 33s of backoff), honours Retry-After
header, and aborts immediately on caller context cancel. MaxRetries=-1
disables retries.

Added WriteTimeout=150s on the HTTP server — was previously unset, allowing
handlers to run forever. 150s covers the worst-case CF path (up to ~123s).
@umputun (Collaborator) left a comment


tests pass, lint clean, race detector clean. the Retriever interface follows the existing consumer-side pattern nicely and the opt-in routing makes sense for an expensive backend.

couple things to fix/consider:

  1. TestGetContentCustom (readability_test.go:346) - the RulesMock.GetFunc setup is dead code now since getContent no longer does rule lookups. the test silently runs the general parser while looking like it tests custom rule parsing. should either pass the rule directly to getContent or rewrite to go through extractWithRules

  2. MaxRetries zero-value trap (retriever.go:145) - MaxRetries: 0 silently gives 2 retries because of the default substitution. someone setting MaxRetries: 0 in struct init would expect no retries. consider using a pointer (*int) to distinguish "not set" from "explicitly zero", or just make 0 mean 0 and require explicit MaxRetries: 2 in production setup

  3. --cf-route-all / CF_ROUTE_ALL is missing from the README config table - should be listed alongside cf-account-id and cf-api-token

not blocking, just noting - the WriteTimeout: 150s applies to all handlers (static files, rule CRUD, health ping), not just the CF extraction path. might be worth a comment explaining why it's global, or wrapping just the extraction endpoints with http.TimeoutHandler
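The per-route alternative mentioned here could look like the sketch below; the paths and handler bodies are placeholders, not the project's actual routes:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func extract(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprint(w, "ok") // stand-in for the long-running extraction handler
}

// routes wraps only the slow extraction endpoint in http.TimeoutHandler,
// leaving fast handlers (static files, CRUD, ping) unwrapped so the
// server-wide WriteTimeout could stay tight.
func routes() http.Handler {
	mux := http.NewServeMux()
	mux.Handle("/api/content/v1/parser", http.TimeoutHandler(
		http.HandlerFunc(extract), 150*time.Second, "extraction timed out"))
	mux.HandleFunc("/ping", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "pong") // fast handlers stay unwrapped
	})
	return mux
}

func main() {
	fmt.Println(routes() != nil) // true
}
```

One caveat with http.TimeoutHandler: it buffers the response, so it only suits endpoints that write their body at the end, which the extraction handler does.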

- TestGetContentCustom: pass the rule directly to getContent so it actually
  exercises the custom-rule path; the RulesMock.GetFunc setup was dead code
  after getContent stopped looking up rules
- CloudflareRetriever.MaxRetries: remove default substitution for the zero
  value — 0 now means "no retries" as expected. Callers opt into retries by
  setting MaxRetries explicitly; main.go uses the exported CFDefaultMaxRetries
  constant (2)
- README: add cf-route-all to the config table and rewrite the Cloudflare
  section to reflect the opt-in routing model + 429 retry behaviour
- rest.Server.Run: expand the WriteTimeout comment to explain why the 150s
  ceiling is server-wide rather than per-route via http.TimeoutHandler
@paskal (Member Author) commented Apr 12, 2026

thanks for the review. all four points addressed in 6403f00:

  1. TestGetContentCustom dead mock — the test now passes the rule directly to getContent instead of relying on the stale RulesMock.GetFunc. exercises the custom-rule path visibly.

  2. MaxRetries: 0 zero-value trap — removed the default substitution. 0 now means "no retries" as one would expect; negative values are still clamped to 0 defensively. production setup explicitly wires extractor.CFDefaultMaxRetries (=2) from main.go. the disabled-retries test is now table-driven over both 0 and -1.

  3. README missing --cf-route-all — added to the config table and rewrote the "Cloudflare Browser Rendering (optional)" section to explain the opt-in routing model (per-rule vs route-all) and the 429 retry behaviour. the previous text still described the old "replace HTTP entirely when creds are set" behaviour.

  4. server-wide WriteTimeout scope — expanded the comment above it to explain why it's global (the other handlers are all sub-second, so a single server-wide ceiling is simpler than wrapping the extraction routes with http.TimeoutHandler). left a hint to switch to per-route if anything else becomes long-running.

tests + lint green.

paskal requested a review from umputun April 12, 2026 19:36