ukeeper · umputun · Apr 12, 2026 · Mar 29, 2026 · Mar 29, 2026 · Mar 29, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -17,6 +17,10 @@ golangci-lint run --max-issues-per-linter=0 --max-same-issues=0  # lint from rep
 
 The `revision` variable in `main.go` is injected at build time: `-ldflags "-X main.revision=<version>"`.
 
+Optional Cloudflare Browser Rendering flags (when both are set, uses `CloudflareRetriever` instead of default HTTP):
+- `--cf-account-id` / `CF_ACCOUNT_ID` — Cloudflare account ID
+- `--cf-api-token` / `CF_API_TOKEN` — Cloudflare API token with Browser Rendering Edit permission
+
 `main_test.go` is gated behind `ENABLE_MONGO_TESTS=true` and needs MongoDB on localhost:27017. All other packages test independently — `datastore/` spins up MongoDB via testcontainers automatically.
 
 ## Architecture
@@ -32,11 +36,13 @@ web/           → Go HTML templates (HTMX v2), static assets
 
 **Dependency flow:** `main → datastore, extractor, rest`; `rest → datastore, extractor`; `extractor → datastore` (Rule type + Rules interface).
 
-**Key interface** — `extractor.Rules` (defined consumer-side in `extractor/readability.go`), implemented by `datastore.RulesDAO`. Mock generated with `//go:generate moq` in extractor package.
+**Key interfaces:**
+- `extractor.Rules` (defined consumer-side in `extractor/readability.go`), implemented by `datastore.RulesDAO`. Mock generated with `//go:generate moq` in extractor package.
+- `extractor.Retriever` (defined in `extractor/retriever.go`) — abstracts URL content fetching. Two implementations: `HTTPRetriever` (default, standard HTTP GET with Safari user-agent) and `CloudflareRetriever` (Cloudflare Browser Rendering API for JS-rendered pages). When `UReadability.Retriever` is nil, defaults to `HTTPRetriever`.
 
 ## Content Extraction Flow
 
-1. Fetch URL (30s timeout, Safari user-agent, follows redirects)
+1. Fetch URL via `Retriever` interface (default: HTTP GET with 30s timeout, Safari user-agent, follows redirects; optional: Cloudflare Browser Rendering for JS-heavy sites)
 2. Detect charset from Content-Type header and `<meta>` tags, convert to UTF-8
 3. Look up custom CSS selector rule from MongoDB by domain
 4. If rule found → extract via goquery CSS selector; if fails → fall back to general parser
@@ -47,7 +53,7 @@ web/           → Go HTML templates (HTMX v2), static assets
 
 - Rule upsert is keyed on `domain` — one rule per domain. Rules are disabled (`enabled: false`), never deleted.
 - `rest.Server.Readability` is `extractor.UReadability` by value (not pointer), with `Rules` interface field inside.
-- Protected API routes use custom `basicAuth` middleware with constant-time comparison.
+- `/api/content/v1/parser` requires the `token` query parameter when configured. Token comparison uses `subtle.ConstantTimeCompare`. Protected rule management routes use custom `basicAuth` middleware with constant-time comparison.
 - Web UI text is in Russian — tests assert on Russian strings, don't change them.
 - Middleware stack: Recoverer → RealIP → AppInfo+Ping → Throttle(50) → Logger.
-- CI runs tests and lint in the `build` job; Docker build only compiles (no tests inside Docker).
+- CI: `ci.yml` runs tests and lint in the `build` job (MongoDB via service container); `docker.yml` builds Docker images via `workflow_run` trigger after `build` succeeds (no tests inside Docker).
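
Reviewer sketch for the token check described in the CLAUDE.md hunk above. This is a minimal illustration and not the project's handler: `tokenAllowed` is a hypothetical helper, and treating an empty configured token as "check disabled" is an assumption drawn from the phrase "when configured".

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// tokenAllowed is a hypothetical helper (not taken from the repository).
// An empty configured token is assumed to mean the check is disabled.
func tokenAllowed(configured, provided string) bool {
	if configured == "" {
		return true
	}
	// ConstantTimeCompare returns 1 only when both slices have equal contents;
	// for equal-length inputs its running time does not depend on where they differ.
	return subtle.ConstantTimeCompare([]byte(configured), []byte(provided)) == 1
}

func main() {
	fmt.Println(tokenAllowed("", "anything"))
	fmt.Println(tokenAllowed("secret", "secret"))
	fmt.Println(tokenAllowed("secret", "guess"))
}
```

Note that `ConstantTimeCompare` returns early when the lengths differ, so the token length itself is not hidden by this approach.
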
diff --git a/README.md b/README.md
@@ -12,16 +12,28 @@
 | port         | UKEEPER_PORT    | `8080`         | web server port                                       |
 | mongo-uri    | MONGO_URI       | none           | MongoDB connection string, _required_                 |
 | frontend-dir | FRONTEND_DIR    | `/srv/web`     | directory with frontend files                         |
-| token        | TOKEN           | none           | token for /content/v1/parser endpoint auth            |
+| token        | UKEEPER_TOKEN   | none           | token for API endpoint auth                           |
 | mongo-delay  | MONGO_DELAY     | `0`            | mongo initial delay                                   |
 | mongo-db     | MONGO_DB        | `ureadability` | mongo database name                                   |
 | creds        | CREDS           | none           | credentials for protected calls (POST, DELETE /rules) |
+| cf-account-id| CF_ACCOUNT_ID   | none           | Cloudflare account ID for Browser Rendering API       |
+| cf-api-token | CF_API_TOKEN    | none           | Cloudflare API token with Browser Rendering Edit perm |
+| cf-route-all | CF_ROUTE_ALL    | `false`        | route every request through Cloudflare Browser Rendering |
 | dbg          | DEBUG           | `false`        | debug mode                                            |
 
+### Cloudflare Browser Rendering (optional)
+
+Cloudflare Browser Rendering is useful for JavaScript-heavy pages and sites behind a "please enable JS" wall, but it's slower than direct HTTP and the free tier throttles at 1 request per 10 seconds. To keep the service cost-effective, Cloudflare routing is **opt-in**.
+
+1. Set `--cf-account-id` and `--cf-api-token` to enable Cloudflare routing.
+2. Either flip the `use_cloudflare` checkbox on individual rules (per-domain, recommended) or set `--cf-route-all=true` to route every request through Cloudflare.
+
+When Cloudflare credentials are not set, the service uses a standard HTTP client for everything (default). On HTTP 429 (rate limit) the service automatically retries with exponential backoff and respects the `Retry-After` header.
+
 ### API
 
     GET /api/content/v1/parser?token=secret&url=http://aa.com/blah - extract content (emulate Readability API parse call)
-    POST /api/v1/extract {url: http://aa.com/blah}  - extract content
+    POST /api/extract {url: http://aa.com/blah}  - extract content
 
 ## Development
 

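Reviewer sketch for the HTTP 429 behaviour described in the README hunk above (exponential backoff that respects `Retry-After`). The function name `retryDelay`, its parameters, and the cap are assumptions for illustration; the actual retry code is not part of this diff. Only the delta-seconds form of `Retry-After` is handled here; the HTTP-date form falls through to the backoff path.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// retryDelay is a hypothetical helper: it picks how long to wait before the
// next attempt. attempt is zero-based; retryAfter is the raw Retry-After
// header value (may be empty).
func retryDelay(attempt int, retryAfter string, base, maxDelay time.Duration) time.Duration {
	// Prefer the server's hint when it is a positive number of seconds.
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs > 0 {
		d := time.Duration(secs) * time.Second
		if d > maxDelay {
			return maxDelay
		}
		return d
	}
	// Otherwise double the base delay once per previous attempt, up to the cap.
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	base, maxDelay := time.Second, 30*time.Second
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(attempt, retryDelay(attempt, "", base, maxDelay))
	}
	fmt.Println(retryDelay(0, "10", base, maxDelay))
}
```

Given the free-tier limit of 1 request per 10 seconds mentioned above, a cap in the tens of seconds seems a reasonable starting point, but the right value depends on the deployment.
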
diff --git a/datastore/rules.go b/datastore/rules.go
@@ -18,16 +18,17 @@ type RulesDAO struct {
 
 // Rule record, entry in mongo
 type Rule struct {
-	ID        bson.ObjectID `json:"id" bson:"_id,omitempty"`
-	Domain    string        `json:"domain"`
-	MatchURLs []string      `json:"match_url,omitempty" bson:"match_urls,omitempty"`
-	Content   string        `json:"content"`
-	Author    string        `json:"author,omitempty" bson:"author,omitempty"`
-	TS        string        `json:"ts,omitempty" bson:"ts,omitempty"` // ts of original article
-	Excludes  []string      `json:"excludes,omitempty" bson:"excludes,omitempty"`
-	TestURLs  []string      `json:"test_urls,omitempty" bson:"test_urls"`
-	User      string        `json:"user"`
-	Enabled   bool          `json:"enabled"`
+	ID            bson.ObjectID `json:"id" bson:"_id,omitempty"`
+	Domain        string        `json:"domain"`
+	MatchURLs     []string      `json:"match_url,omitempty" bson:"match_urls,omitempty"`
+	Content       string        `json:"content"`
+	Author        string        `json:"author,omitempty" bson:"author,omitempty"`
+	TS            string        `json:"ts,omitempty" bson:"ts,omitempty"` // ts of original article
+	Excludes      []string      `json:"excludes,omitempty" bson:"excludes,omitempty"`
+	TestURLs      []string      `json:"test_urls,omitempty" bson:"test_urls"`
+	User          string        `json:"user"`
+	Enabled       bool          `json:"enabled"`
+	UseCloudflare bool          `json:"use_cloudflare,omitempty" bson:"use_cloudflare,omitempty"` // route fetch via Cloudflare Browser Rendering
 }
 
 // Get rule by url. Checks if found in mongo, matching by domain

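Reviewer sketch of how the new `UseCloudflare` field could combine with the flags documented in the README hunk. The decision logic below is an assumption reconstructed from the README text (credentials enable the feature; the per-rule checkbox or `--cf-route-all` selects it); `pickRetriever` and the trimmed-down `rule` type are hypothetical and do not appear in this diff.

```go
package main

import "fmt"

// rule mirrors only the fields of datastore.Rule that matter for routing.
type rule struct {
	Domain        string
	UseCloudflare bool
}

// pickRetriever is a hypothetical helper returning which backend to use.
// cfConfigured means both the account ID and the API token are set.
func pickRetriever(cfConfigured, routeAll bool, r *rule) string {
	if !cfConfigured {
		return "http" // without credentials everything goes over plain HTTP
	}
	if routeAll || (r != nil && r.UseCloudflare) {
		return "cloudflare"
	}
	return "http"
}

func main() {
	jsHeavy := &rule{Domain: "example.com", UseCloudflare: true}
	fmt.Println(pickRetriever(false, false, jsHeavy))
	fmt.Println(pickRetriever(true, false, jsHeavy))
	fmt.Println(pickRetriever(true, false, nil))
	fmt.Println(pickRetriever(true, true, nil))
}
```
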
diff --git a/docs/plans/completed/20260329-modularise-retrieval.md b/docs/plans/completed/20260329-modularise-retrieval.md
@@ -0,0 +1,169 @@
+# Modularise URL Retrieval with Cloudflare Browser Rendering
+
+## Overview
+- Extract the hardcoded HTTP fetch logic from `extractWithRules` into a `Retriever` interface so different content retrieval backends can be swapped in
+- Add `CloudflareRetriever` using Cloudflare Browser Rendering `/content` endpoint — returns fully rendered HTML after JS execution, handling sites behind bot protection or requiring JS rendering
+- Addresses PR review feedback on radio-t/super-bot#156: the content extraction improvement belongs in ukeeper-readability (the extraction layer), not in super-bot (the consumer)
+
+## Context (from discovery)
+- **fetch logic**: embedded in `extractor/readability.go:extractWithRules` (lines 80-106) — creates `http.Client` inline, Safari user-agent, `io.ReadAll`, no abstraction
+- **only existing interface**: `extractor.Rules` for datastore access; no fetcher interface exists
+- **pipeline after fetch**: `toUtf8` → `getContent` (rules/readability) → title → `getText` → `normalizeLinks` → `getSnippet` → `extractPics`; every stage expects HTML input and stays unchanged
+- **`normalizeLinks`** takes `*http.Request` but only uses `.URL` — will simplify to `*url.URL`
+- **wiring**: `main.go` creates `extractor.UReadability` struct with `TimeOut`, `SnippetSize`, `Rules` fields; `rest.Server` holds it as a concrete struct
+- **CF `/content` endpoint**: `POST /accounts/{id}/browser-rendering/content` with `{"url": "..."}`, Bearer token auth, returns rendered HTML
+
+## Development Approach
+- **testing approach**: Regular (code first, then tests)
+- complete each task fully before moving to the next
+- make small, focused changes
+- **CRITICAL: every task MUST include new/updated tests** for code changes in that task
+- **CRITICAL: all tests must pass before starting next task**
+- **CRITICAL: update this plan file when scope changes during implementation**
+- run tests after each change
+- maintain backward compatibility — existing code creating `UReadability{}` without `Retriever` must continue to work (nil defaults to HTTP fetch)
+
+## Testing Strategy
+- **unit tests**: httptest mock servers for both retrievers, table-driven tests, testify assertions (matching existing patterns)
+- **mock generation**: moq-generated mock for `Retriever` interface (same pattern as `Rules` mock)
+- **integration**: existing `readability_test.go` tests must pass unchanged (they create `UReadability` without `Retriever`)
+
+## Progress Tracking
+- mark completed items with `[x]` immediately when done
+- add newly discovered tasks with + prefix
+- document issues/blockers with warning prefix
+- update plan if implementation deviates from original scope
+
+## Implementation Steps
+
+### Task 1: Create Retriever interface and HTTPRetriever
+
+**Files:**
+- Create: `extractor/retriever.go`
+- Create: `extractor/retriever_test.go`
+
+- [x] define `Retriever` interface with `Retrieve(ctx context.Context, url string) (*RetrieveResult, error)` method
+- [x] define `RetrieveResult` struct with `Body []byte`, `URL string`, `Header http.Header`
+- [x] implement `HTTPRetriever` struct extracting the current fetch logic from `extractWithRules` (HTTP client, Safari user-agent, redirect following, body reading)
+- [x] add `//go:generate moq` directive for `Retriever` interface
+- [x] write tests for `HTTPRetriever`: successful fetch, redirect following, user-agent header, error cases (bad URL, connection refused)
+- [x] run tests — must pass before next task
+
+### Task 2: Implement CloudflareRetriever
+
+**Files:**
+- Modify: `extractor/retriever.go`
+- Modify: `extractor/retriever_test.go`
+
+- [x] implement `CloudflareRetriever` struct with `AccountID`, `APIToken`, `BaseURL` (for test override), `Timeout` fields
+- [x] implement `Retrieve` method: POST to `/accounts/{id}/browser-rendering/content` with `{"url": "...", "gotoOptions": {"waitUntil": "networkidle0"}}`, Bearer token auth
+- [x] handle response: try JSON `{"success": true, "result": "<html>"}` first, fall back to raw body; set `Content-Type: text/html; charset=utf-8` header
+- [x] write tests: successful fetch (mock CF API), API error (non-200 status), JSON response format, raw HTML response format
+- [x] run tests — must pass before next task
+
+### Task 3: Wire Retriever into UReadability
+
+**Files:**
+- Modify: `extractor/readability.go`
+- Modify: `extractor/readability_test.go`
+
+- [x] add `Retriever Retriever` field to `UReadability` struct
+- [x] add `retriever()` helper method: returns `f.Retriever` if non-nil, otherwise `&HTTPRetriever{Timeout: f.TimeOut}`
+- [x] replace inline HTTP fetch in `extractWithRules` (lines 80-106) with `f.retriever().Retrieve(ctx, reqURL)` call
+- [x] use `result.URL`, `result.Body`, `result.Header` instead of `resp.Request.URL`, `io.ReadAll(resp.Body)`, `resp.Header`
+- [x] change `normalizeLinks` signature from `*http.Request` to `*url.URL` (only `.URL` field is used); update caller to pass parsed URL
+- [x] remove unused imports from `readability.go` (`io`, `net/http`)
+- [x] update `TestNormalizeLinks` and `TestNormalizeLinksIssue` to pass `*url.URL` instead of `&http.Request{URL: u}`
+- [x] verify all existing tests pass unchanged (tests create `UReadability` without `Retriever` — nil defaults to HTTPRetriever)
+- [x] run full test suite: `go test -timeout=60s -race ./...`
+
+### Task 4: Add CLI flags and wiring in main.go
+
+**Files:**
+- Modify: `main.go`
+
+- [x] add `CFAccountID string` and `CFAPIToken string` fields to opts struct with `long`/`env` tags
+- [x] in `main()`, create `CloudflareRetriever` when both flags are set; log which retriever is active
+- [x] pass retriever to `UReadability` struct
+- [x] run full test suite: `go test -timeout=60s -race ./...`
+
+### Task 5: Generate mock and run linter
+
+**Files:**
+- Create: `extractor/retriever_mock_test.go` (generated, test-only — placed in extractor package to avoid import cycle with mocks/)
+
+- [x] run `go generate ./extractor/...` to generate `Retriever` mock
+- [x] run `gofmt -w` on all modified files
+- [x] run `golangci-lint run --max-issues-per-linter=0 --max-same-issues=0`
+- [x] fix any lint issues
+
+### Task 6: Verify acceptance criteria
+
+- [x] verify `UReadability{}` without `Retriever` field works (backward compatible)
+- [x] verify `UReadability{Retriever: &CloudflareRetriever{...}}` works
+- [x] verify all existing tests pass: `go test -timeout=60s -race ./...`
+- [x] verify mock is generated and up to date
+
+### Task 7: [Final] Update documentation
+
+- [x] update CLAUDE.md architecture section to mention `Retriever` interface
+- [x] update CLAUDE.md build section with new CLI flags
+- [x] move this plan to `docs/plans/completed/`
+
+## Technical Details
+
+### Retriever interface
+
+```go
+type Retriever interface {
+    Retrieve(ctx context.Context, url string) (*RetrieveResult, error)
+}
+
+type RetrieveResult struct {
+    Body   []byte      // raw page content (HTML)
+    URL    string      // final URL after redirects
+    Header http.Header // response headers (for charset detection)
+}
+```
+
+### CloudflareRetriever request/response
+
+```
+POST https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/content
+Authorization: Bearer {api_token}
+Content-Type: application/json
+
+{"url": "https://example.com", "gotoOptions": {"waitUntil": "networkidle0"}}
+```
+
+Response: fully rendered HTML (may be JSON-wrapped `{"success": true, "result": "<html>"}` or raw HTML).
+
+### CLI flags
+
+| Flag | Env | Description |
+|------|-----|-------------|
+| `--cf-account-id` | `CF_ACCOUNT_ID` | Cloudflare account ID for Browser Rendering API |
+| `--cf-api-token` | `CF_API_TOKEN` | Cloudflare API token with Browser Rendering Edit permission |
+
+When both are set → `CloudflareRetriever`; otherwise → `HTTPRetriever` (default).
+
+### Pipeline flow (unchanged)
+
+```
+Retriever.Retrieve(url)  →  toUtf8  →  getContent (rules/readability)  →  title
+     ↑ NEW                     │              │
+     │                         ↓              ↓
+HTTPRetriever (default)    getText  →  normalizeLinks  →  getSnippet  →  extractPics
+CloudflareRetriever (opt)
+```
+
+## Post-Completion
+
+**External system updates:**
+- super-bot deployment: add `CF_ACCOUNT_ID` and `CF_API_TOKEN` env vars to ukeeper-readability deployment config when switching to Cloudflare retrieval
+- Cloudflare setup: create API token with "Browser Rendering - Edit" permission under the target account
+- radio-t/super-bot#156: can be closed once this is deployed — super-bot continues using the existing `uKeeperGetter` interface unchanged
+
+**Manual verification:**
+- test against real Cloudflare Browser Rendering API with known problematic URLs (sites returning "just a moment..." to direct HTTP)
+- verify free tier limits are acceptable (10 min/day browser time, 1 req/10 sec rate limit)
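
Reviewer sketch for the "CloudflareRetriever request/response" section of the plan: building the JSON payload and unwrapping the response. The type and function names (`contentRequest`, `contentRequestBody`, `renderedHTML`) are hypothetical, and the rule "use `result` only when `success` is true and `result` is non-empty, otherwise keep the raw body" is one possible reading of "try JSON first, fall back to raw body".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gotoOptions and contentRequest model the payload shown in the plan.
type gotoOptions struct {
	WaitUntil string `json:"waitUntil"`
}

type contentRequest struct {
	URL         string      `json:"url"`
	GotoOptions gotoOptions `json:"gotoOptions"`
}

// contentRequestBody builds the JSON body for the /content endpoint.
func contentRequestBody(pageURL string) ([]byte, error) {
	return json.Marshal(contentRequest{
		URL:         pageURL,
		GotoOptions: gotoOptions{WaitUntil: "networkidle0"},
	})
}

// cfEnvelope is the JSON wrapper the API may return around the HTML.
type cfEnvelope struct {
	Success bool   `json:"success"`
	Result  string `json:"result"`
}

// renderedHTML returns the wrapped HTML when the body is a successful JSON
// envelope, and the body unchanged otherwise (assumed to be raw HTML).
func renderedHTML(body []byte) []byte {
	var env cfEnvelope
	if err := json.Unmarshal(body, &env); err == nil && env.Success && env.Result != "" {
		return []byte(env.Result)
	}
	return body
}

func main() {
	payload, err := contentRequestBody("https://example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
	fmt.Println(string(renderedHTML([]byte(`{"success":true,"result":"<p>hi</p>"}`))))
	fmt.Println(string(renderedHTML([]byte("<html><body>raw</body></html>"))))
}
```

One consequence of this fallback is that a JSON error envelope (`"success": false`) would be passed downstream as if it were page content, so a real implementation would probably want to check the HTTP status first, as the plan's "API error (non-200 status)" test suggests.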