Modularise URL retrieval with Cloudflare Browser Rendering support #73
Conversation
Extract the URL-fetching abstraction from the inline HTTP logic in `extractWithRules`: define a `Retriever` interface, a `RetrieveResult` struct, and an `HTTPRetriever` with Safari user-agent, redirect following, and timeout support. Includes a moq generate directive and comprehensive tests.
Generate the moq mock for the `Retriever` interface as a test-only file (`retriever_mock_test.go`) instead of a `mocks/` subpackage to avoid an import cycle (`mocks/retriever.go` would import `extractor`, cycling with `readability_test.go`). Ran gofmt on all modified files; zero lint issues.
- fix err shadowing in deferred `Body.Close()` in both retrievers (use `closeErr`)
- handle Cloudflare API `success=false` response explicitly instead of treating the JSON error as HTML
- truncate CF API error body to 512 bytes in error messages
- add a comment documenting the CF retriever URL limitation (no final URL after JS redirects)
- fix pre-existing `%b` format verb in `text.go` logging (should be `%v`)
- replace network-dependent `TestCloudflareRetriever_DefaultBaseURL` with a local httptest server
- add `TestCloudflareRetriever_SuccessFalse` for the new `success=false` handling
- add `TestExtractWithCustomRetriever` integration test using `RetrieverMock`
- remove duplicate plan file from docs/plans/ (already in completed/)
- update README.md with the new CF CLI flags and feature description
- update CLAUDE.md CI bullet to reflect the split docker.yml workflow
POST /api/extract never had token auth in the original code. The checkToken refactoring should only apply to the legacy /content/v1/parser endpoint which always had it.
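For reference, a constant-time token check along the lines of the `checkToken` helper discussed here — the signature and the empty-token behaviour are assumptions for illustration:

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// checkToken compares the presented token against the configured secret in
// constant time; a plain != comparison can leak timing information about
// how much of the token matched.
func checkToken(got, want string) bool {
	if want == "" { // no token configured: auth disabled (assumed behaviour)
		return true
	}
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	fmt.Println(checkToken("secret", "secret")) // true
	fmt.Println(checkToken("wrong", "secret"))  // false
	fmt.Println(checkToken("anything", ""))     // true: auth disabled
}
```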
ff27ed8 to 1ef736b
This PR addresses the review feedback on radio-t/super-bot#156: the content extraction improvement belongs in ukeeper-readability (the extraction layer), not in super-bot. With the Retriever interface and Cloudflare Browser Rendering support here, super-bot#156 can be closed.
tested end-to-end locally against both retrievers (two containers off the same compose file, same URLs through both) to see where Browser Rendering actually helps.

findings: the CF path is narrow-use but genuinely useful for the sites it covers. maybe I should redo it to kick in only on retrieval failures.
HTTP retriever stays the default. Cloudflare is now opt-in at two levels:

- per-rule: new `Rule.UseCloudflare` field (checkbox in the rule editor UI) routes requests for that domain through Cloudflare Browser Rendering
- global: `--cf-route-all` / `CF_ROUTE_ALL` flag (default false) routes every request through Cloudflare

`UReadability.pickRetriever(rule)` picks: `CFRouteAll` > `rule.UseCloudflare` > default HTTP. `extractWithRules` now resolves the rule once upfront and shares it between routing and `getContent` (it was looked up twice before).

`CloudflareRetriever` retries on HTTP 429 with exponential backoff (base 11s, max 2 retries by default, worst-case 33s of backoff), honours the `Retry-After` header, and aborts immediately on caller context cancel. `MaxRetries=-1` disables retries.

Added `WriteTimeout=150s` on the HTTP server; it was previously unset, allowing handlers to run forever. 150s covers the worst-case CF path (up to ~123s).
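The precedence `CFRouteAll` > `rule.UseCloudflare` > default HTTP can be sketched like this — types and field names beyond `pickRetriever`, `CFRouteAll`, and `UseCloudflare` are illustrative assumptions:

```go
package main

import "fmt"

// Retriever is a stand-in for the fetching abstraction; Name identifies
// the backend for this demo.
type Retriever interface{ Name() string }

type httpRetriever struct{}

func (httpRetriever) Name() string { return "http" }

type cfRetriever struct{}

func (cfRetriever) Name() string { return "cloudflare" }

// Rule mirrors the per-domain rule with the new opt-in flag.
type Rule struct{ UseCloudflare bool }

// UReadability holds both retrievers plus the global routing flag.
type UReadability struct {
	CFRouteAll bool
	httpR      Retriever
	cfR        Retriever
}

// pickRetriever implements the precedence: global flag first, then the
// per-rule opt-in, then the plain HTTP default.
func (u *UReadability) pickRetriever(rule *Rule) Retriever {
	if u.CFRouteAll {
		return u.cfR
	}
	if rule != nil && rule.UseCloudflare {
		return u.cfR
	}
	return u.httpR
}

func main() {
	u := &UReadability{httpR: httpRetriever{}, cfR: cfRetriever{}}
	fmt.Println(u.pickRetriever(nil).Name())                        // http
	fmt.Println(u.pickRetriever(&Rule{UseCloudflare: true}).Name()) // cloudflare
	u.CFRouteAll = true
	fmt.Println(u.pickRetriever(&Rule{}).Name()) // cloudflare: global flag wins
}
```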
umputun left a comment
tests pass, lint clean, race detector clean. the Retriever interface follows the existing consumer-side pattern nicely and the opt-in routing makes sense for an expensive backend.
couple things to fix/consider:
- TestGetContentCustom (readability_test.go:346): the `RulesMock.GetFunc` setup is dead code now since `getContent` no longer does rule lookups. the test silently runs the general parser while looking like it tests custom rule parsing. should either pass the rule directly to `getContent` or rewrite to go through `extractWithRules`
- `MaxRetries` zero-value trap (retriever.go:145): `MaxRetries: 0` silently gives 2 retries because of the default substitution. someone setting `MaxRetries: 0` in struct init would expect no retries. consider using a pointer (`*int`) to distinguish "not set" from "explicitly zero", or just make 0 mean 0 and require an explicit `MaxRetries: 2` in the production setup
- `--cf-route-all` / `CF_ROUTE_ALL` is missing from the README config table; should be listed alongside `cf-account-id` and `cf-api-token`
not blocking, just noting - the WriteTimeout: 150s applies to all handlers (static files, rule CRUD, health ping), not just the CF extraction path. might be worth a comment explaining why it's global, or wrapping just the extraction endpoints with http.TimeoutHandler
- TestGetContentCustom: pass the rule directly to `getContent` so it actually exercises the custom-rule path; the `RulesMock.GetFunc` setup was dead code after `getContent` stopped looking up rules
- `CloudflareRetriever.MaxRetries`: remove the default substitution for the zero value; 0 now means "no retries" as expected. callers opt into retries by setting `MaxRetries` explicitly; main.go uses the exported `CFDefaultMaxRetries` constant (2)
- README: add `cf-route-all` to the config table and rewrite the Cloudflare section to reflect the opt-in routing model and 429 retry behaviour
- `rest.Server.Run`: expand the `WriteTimeout` comment to explain why the 150s ceiling is server-wide rather than per-route via `http.TimeoutHandler`
thanks for the review. all four points addressed in 6403f00:
tests + lint green.
Summary
Addresses review feedback on radio-t/super-bot#156: content extraction improvements belong in ukeeper-readability (the extraction layer), not in super-bot (the consumer).
Cloudflare Browser Rendering (CF BR) is powerful for JS-gated and SPA-style pages, but it's slow (~3–5s per fetch), has a narrow sweet spot (no help against DataDome/Turnstile-protected sites, same result as plain HTTP on regular content), and the free tier is 1 req/10s with a 10 min/day browser budget. To stay cost-effective the default path remains plain HTTP — CF is opt-in, either per-domain via rules or globally via a flag.
Retriever interface
- `extractor.Retriever` interface: `Retrieve(ctx, url) (*RetrieveResult, error)` returning raw bytes, final URL and headers; the existing parsing pipeline is unchanged
- `HTTPRetriever`: extracts the current fetch logic (Safari UA, redirect following, connection reuse) into a reusable type with a cached client
- `CloudflareRetriever`: POSTs to `/accounts/{id}/browser-rendering/content`, accepts both `{success,result}` JSON and raw HTML responses
- `UReadability.Retriever` stays as the default fetcher; `normalizeLinks` signature simplified from `*http.Request` to `*url.URL`
- `UReadability{}` without `Retriever` set falls back to a cached `HTTPRetriever`

Routing (cost-effective opt-in)
Default behaviour is unchanged: every request goes through `HTTPRetriever` unless something explicitly routes it elsewhere. CF is off by default and opt-in at two levels:

- `Rule.UseCloudflare` field (exposed as a checkbox in the rule editor UI). When a rule for the requested domain has this set, that request uses the CF retriever. Good for the handful of domains that actually need it (Reuters, X.com, etc.).
- `--cf-route-all` / `CF_ROUTE_ALL` flag, default false. When set, every request routes through CF. Intended for debugging / isolated deployments, not day-to-day use.

`UReadability.pickRetriever(rule)` decides per request. `extractWithRules` now looks up the rule by domain once up front (instead of twice: once for routing, once in `getContent`).

429 retries (holding the caller's connection)
CF free tier throttles at 1 req / 10s and returns HTTP 429 with `{"success":false,"errors":[{"code":2001,"message":"Rate limit exceeded"}]}`. The naive behaviour (fail the extraction on the first 429) made CF unusable in practice: during local testing the second sequential request almost always hit it. `CloudflareRetriever.Retrieve` now wraps a single-attempt `doRetrieve` in a retry loop:

- default `MaxRetries` = 2, default `RetryDelay` = 11s (the CF free-tier window is 10s; small headroom)
- honours the `Retry-After` header (delta-seconds or HTTP date) when present
- aborts immediately when `ctx` is canceled, so an upstream timeout terminates cleanly instead of hanging
- `MaxRetries = -1` disables retries entirely

Timeout budget
Worst case with the defaults: 30s (request) + 11s + 30s + 22s + 30s ≈ 123s. That's long, so:

- `WriteTimeout: 150s` on the HTTP server in `rest.Server.Run` (previously unset, which allowed handlers to run indefinitely); 150s caps runaway handlers with headroom for the worst-case CF path
- callers of `/api/content/v1/parser` should allow a ~150s timeout when the service is running with CF enabled; upstream reverse proxies (nginx `proxy_read_timeout` default 60s, AWS ALB default 60s) need to be bumped accordingly, or the caller will see their own timeout before the retry finishes
- without `--cf-route-all` and without any rule marked `use_cloudflare`, nothing changes: the service still responds in the same time it always did

Other improvements in touched code
- `checkToken` helper with `subtle.ConstantTimeCompare`, extracted from `extractArticleEmulateReadability`; the old code compared tokens with `!=`, which is not constant-time
- fixed a pre-existing `%b` → `%v` format-verb bug in `text.go`, switched from stdlib `log` to `lgr` for consistency

Tested
End-to-end test matrix (plain HTTP on :8078 vs CF on :8079 against the same mongo, full results in the test summary comment):
- stevehanov.ca blog post
- news.ycombinator.com item
- blogs.windows.com post
- nesbitt.io post
- nytimes.com/live/...
- x.com status
- reuters.com article

Full `go test -race ./...` passes; `golangci-lint run` clean.

Config reference
| flag | environment | default |
|------|-------------|---------|
| `--cf-account-id` | `CF_ACCOUNT_ID` | |
| `--cf-api-token` | `CF_API_TOKEN` | |
| `--cf-route-all` | `CF_ROUTE_ALL` | false |