Merged
20 commits
23e2942
feat: create Retriever interface and HTTPRetriever implementation
paskal Mar 29, 2026
5bcd16d
feat: implement CloudflareRetriever for Browser Rendering API
paskal Mar 29, 2026
4b932a3
feat: wire Retriever interface into UReadability extraction pipeline
paskal Mar 29, 2026
6128691
feat: add CLI flags and wire Cloudflare retriever in main.go
paskal Mar 29, 2026
dd74c5a
feat: generate Retriever mock, run gofmt and linter
paskal Mar 29, 2026
8d560a0
feat: verify acceptance criteria for Retriever interface
paskal Mar 29, 2026
a7ff6d0
feat: update documentation for Retriever interface and CLI flags
paskal Mar 29, 2026
e2d96e5
fix: address code review findings
paskal Mar 29, 2026
7119f29
fix: address code review findings
paskal Mar 29, 2026
ef56b96
fix: address code review findings
paskal Mar 29, 2026
298fa03
fix: address codex review findings
paskal Mar 29, 2026
e0562ed
fix: address code review findings
paskal Mar 29, 2026
60f1a07
fix: address code review findings
paskal Mar 29, 2026
38255d5
fix: address code review findings
paskal Mar 29, 2026
cd8ab64
fix: cache default retriever, add defensive timeouts, extract constants
paskal Mar 29, 2026
4bdfd51
fix: revert token auth addition to POST /api/extract
paskal Mar 29, 2026
1ef736b
docs: add OpenAI auto-extraction improvement plan
paskal Mar 29, 2026
5ab4f4e
docs: remove unrelated OpenAI auto-extraction plan
paskal Apr 12, 2026
1078c74
feat: add per-rule and global Cloudflare routing, 429 retries
paskal Apr 12, 2026
6403f00
fix: address review feedback on Cloudflare routing PR
paskal Apr 12, 2026
14 changes: 10 additions & 4 deletions CLAUDE.md
@@ -17,6 +17,10 @@ golangci-lint run --max-issues-per-linter=0 --max-same-issues=0 # lint from rep

The `revision` variable in `main.go` is injected at build time: `-ldflags "-X main.revision=<version>"`.

Optional Cloudflare Browser Rendering flags (when both are set, uses `CloudflareRetriever` instead of default HTTP):
- `--cf-account-id` / `CF_ACCOUNT_ID` — Cloudflare account ID
- `--cf-api-token` / `CF_API_TOKEN` — Cloudflare API token with Browser Rendering Edit permission

`main_test.go` is gated behind `ENABLE_MONGO_TESTS=true` and needs MongoDB on localhost:27017. All other packages test independently — `datastore/` spins up MongoDB via testcontainers automatically.

## Architecture
@@ -32,11 +36,13 @@ web/ → Go HTML templates (HTMX v2), static assets

**Dependency flow:** `main → datastore, extractor, rest`; `rest → datastore, extractor`; `extractor → datastore` (Rule type + Rules interface).

-**Key interface** — `extractor.Rules` (defined consumer-side in `extractor/readability.go`), implemented by `datastore.RulesDAO`. Mock generated with `//go:generate moq` in extractor package.
+**Key interfaces:**
+- `extractor.Rules` (defined consumer-side in `extractor/readability.go`), implemented by `datastore.RulesDAO`. Mock generated with `//go:generate moq` in extractor package.
+- `extractor.Retriever` (defined in `extractor/retriever.go`) — abstracts URL content fetching. Two implementations: `HTTPRetriever` (default, standard HTTP GET with Safari user-agent) and `CloudflareRetriever` (Cloudflare Browser Rendering API for JS-rendered pages). When `UReadability.Retriever` is nil, defaults to `HTTPRetriever`.

## Content Extraction Flow

-1. Fetch URL (30s timeout, Safari user-agent, follows redirects)
+1. Fetch URL via `Retriever` interface (default: HTTP GET with 30s timeout, Safari user-agent, follows redirects; optional: Cloudflare Browser Rendering for JS-heavy sites)
2. Detect charset from Content-Type header and `<meta>` tags, convert to UTF-8
3. Look up custom CSS selector rule from MongoDB by domain
4. If rule found → extract via goquery CSS selector; if fails → fall back to general parser
@@ -47,7 +53,7 @@ web/ → Go HTML templates (HTMX v2), static assets

- Rule upsert is keyed on `domain` — one rule per domain. Rules are disabled (`enabled: false`), never deleted.
- `rest.Server.Readability` is `extractor.UReadability` by value (not pointer), with `Rules` interface field inside.
-- Protected API routes use custom `basicAuth` middleware with constant-time comparison.
+- `/api/content/v1/parser` requires the `token` query parameter when configured. Token comparison uses `subtle.ConstantTimeCompare`. Protected rule management routes use custom `basicAuth` middleware with constant-time comparison.
- Web UI text is in Russian — tests assert on Russian strings, don't change them.
- Middleware stack: Recoverer → RealIP → AppInfo+Ping → Throttle(50) → Logger.
-- CI runs tests and lint in the `build` job; Docker build only compiles (no tests inside Docker).
+- CI: `ci.yml` runs tests and lint in the `build` job (MongoDB via service container); `docker.yml` builds Docker images via `workflow_run` trigger after `build` succeeds (no tests inside Docker).
16 changes: 14 additions & 2 deletions README.md
@@ -12,16 +12,28 @@
| port | UKEEPER_PORT | `8080` | web server port |
| mongo-uri | MONGO_URI | none | MongoDB connection string, _required_ |
| frontend-dir | FRONTEND_DIR | `/srv/web` | directory with frontend files |
-| token | TOKEN | none | token for /content/v1/parser endpoint auth |
+| token | UKEEPER_TOKEN | none | token for API endpoint auth |
| mongo-delay | MONGO_DELAY | `0` | mongo initial delay |
| mongo-db | MONGO_DB | `ureadability` | mongo database name |
| creds | CREDS | none | credentials for protected calls (POST, DELETE /rules) |
| cf-account-id| CF_ACCOUNT_ID | none | Cloudflare account ID for Browser Rendering API |
| cf-api-token | CF_API_TOKEN | none | Cloudflare API token with Browser Rendering Edit perm |
| cf-route-all | CF_ROUTE_ALL | `false` | route every request through Cloudflare Browser Rendering |
| dbg | DEBUG | `false` | debug mode |

### Cloudflare Browser Rendering (optional)

Cloudflare Browser Rendering is useful for JavaScript-heavy pages and sites behind a "please enable JS" wall, but it's slower than direct HTTP and the free tier throttles at 1 request per 10 seconds. To keep the service cost-effective, Cloudflare routing is **opt-in**.

1. Set `--cf-account-id` and `--cf-api-token` to enable Cloudflare routing.
2. Either flip the `use_cloudflare` checkbox on individual rules (per-domain, recommended) or set `--cf-route-all=true` to route every request through Cloudflare.
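
As an illustration of the two steps above, the routing decision could be sketched in Go roughly as follows. The `useCloudflare` helper and the trimmed `Rule` struct are hypothetical and not taken from the service's code; only the `UseCloudflare` field name and the `--cf-route-all` behaviour come from this PR.

```go
package main

import "fmt"

// Rule is a trimmed, hypothetical stand-in for datastore.Rule;
// only the field relevant to routing is kept.
type Rule struct {
	Domain        string
	UseCloudflare bool
}

// useCloudflare reports whether a fetch should go through Cloudflare.
// cfConfigured: both --cf-account-id and --cf-api-token are set.
// routeAll: --cf-route-all=true. rule may be nil when no rule matches.
func useCloudflare(cfConfigured, routeAll bool, rule *Rule) bool {
	if !cfConfigured {
		return false // without credentials everything uses plain HTTP
	}
	if routeAll {
		return true // global switch overrides per-rule settings
	}
	return rule != nil && rule.UseCloudflare
}

func main() {
	rule := &Rule{Domain: "example.com", UseCloudflare: true}
	fmt.Println(useCloudflare(true, false, rule)) // true
	fmt.Println(useCloudflare(true, false, nil))  // false
	fmt.Println(useCloudflare(false, true, rule)) // false
}
```

Keeping the decision in one small pure function makes the opt-in behaviour easy to cover with table-driven tests.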

When Cloudflare credentials are not set, the service uses a standard HTTP client for everything (default). On HTTP 429 (rate limit) the service automatically retries with exponential backoff and respects the `Retry-After` header.
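
The retry behaviour described above can be sketched as follows. This is a minimal illustration, not the service's implementation: the helper names, the 500ms base delay, the 8s cap and the retry limit are all assumptions, and only the seconds form of `Retry-After` is handled.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// Assumed values for illustration; the service's real settings may differ.
const (
	baseDelay = 500 * time.Millisecond
	maxDelay  = 8 * time.Second
)

// retryDelay picks the wait before retry number attempt (0-based).
// A numeric Retry-After header wins; otherwise the delay doubles on
// every attempt until it reaches maxDelay.
func retryDelay(h http.Header, attempt int) time.Duration {
	if secs, err := strconv.Atoi(h.Get("Retry-After")); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	if attempt < 0 {
		attempt = 0
	}
	if attempt > 10 {
		return maxDelay // also keeps the shift below far from overflow
	}
	if d := baseDelay << attempt; d < maxDelay {
		return d
	}
	return maxDelay
}

// getWithRetry (hypothetical) repeats a GET while the server answers 429,
// up to maxRetries extra attempts. The caller closes the returned body.
func getWithRetry(client *http.Client, url string, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := client.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		delay := retryDelay(resp.Header, attempt)
		resp.Body.Close()
		time.Sleep(delay)
	}
}

// delayMillis is a small demo helper around retryDelay.
func delayMillis(retryAfter string, attempt int) int64 {
	h := http.Header{}
	if retryAfter != "" {
		h.Set("Retry-After", retryAfter)
	}
	return retryDelay(h, attempt).Milliseconds()
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, retryDelay(http.Header{}, attempt))
	}
	fmt.Println(delayMillis("3", 0)) // 3000
}
```

A production version would likely take a `context.Context` so that the sleep can be cancelled together with the request.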

### API

GET /api/content/v1/parser?token=secret&url=http://aa.com/blah - extract content (emulate Readability API parse call)
-POST /api/v1/extract {url: http://aa.com/blah} - extract content
+POST /api/extract {url: http://aa.com/blah} - extract content

## Development

21 changes: 11 additions & 10 deletions datastore/rules.go
@@ -18,16 +18,17 @@ type RulesDAO struct {

// Rule record, entry in mongo
type Rule struct {
-	ID        bson.ObjectID `json:"id" bson:"_id,omitempty"`
-	Domain    string        `json:"domain"`
-	MatchURLs []string      `json:"match_url,omitempty" bson:"match_urls,omitempty"`
-	Content   string        `json:"content"`
-	Author    string        `json:"author,omitempty" bson:"author,omitempty"`
-	TS        string        `json:"ts,omitempty" bson:"ts,omitempty"` // ts of original article
-	Excludes  []string      `json:"excludes,omitempty" bson:"excludes,omitempty"`
-	TestURLs  []string      `json:"test_urls,omitempty" bson:"test_urls"`
-	User      string        `json:"user"`
-	Enabled   bool          `json:"enabled"`
+	ID            bson.ObjectID `json:"id" bson:"_id,omitempty"`
+	Domain        string        `json:"domain"`
+	MatchURLs     []string      `json:"match_url,omitempty" bson:"match_urls,omitempty"`
+	Content       string        `json:"content"`
+	Author        string        `json:"author,omitempty" bson:"author,omitempty"`
+	TS            string        `json:"ts,omitempty" bson:"ts,omitempty"` // ts of original article
+	Excludes      []string      `json:"excludes,omitempty" bson:"excludes,omitempty"`
+	TestURLs      []string      `json:"test_urls,omitempty" bson:"test_urls"`
+	User          string        `json:"user"`
+	Enabled       bool          `json:"enabled"`
+	UseCloudflare bool          `json:"use_cloudflare,omitempty" bson:"use_cloudflare,omitempty"` // route fetch via Cloudflare Browser Rendering
}

// Get rule by url. Checks if found in mongo, matching by domain
169 changes: 169 additions & 0 deletions docs/plans/completed/20260329-modularise-retrieval.md
@@ -0,0 +1,169 @@
# Modularise URL Retrieval with Cloudflare Browser Rendering

## Overview
- Extract the hardcoded HTTP fetch logic from `extractWithRules` into a `Retriever` interface so different content retrieval backends can be swapped in
- Add `CloudflareRetriever` using Cloudflare Browser Rendering `/content` endpoint — returns fully rendered HTML after JS execution, handling sites behind bot protection or requiring JS rendering
- Addresses PR review feedback on radio-t/super-bot#156: the content extraction improvement belongs in ukeeper-readability (the extraction layer), not in super-bot (the consumer)

## Context (from discovery)
- **fetch logic**: embedded in `extractor/readability.go:extractWithRules` (lines 80-106) — creates `http.Client` inline, Safari user-agent, `io.ReadAll`, no abstraction
- **only existing interface**: `extractor.Rules` for datastore access; no fetcher interface exists
- **pipeline after fetch**: `toUtf8` → `getContent` (rules/readability) → title → `getText` → `normalizeLinks` → `getSnippet` → `extractPics`; every stage expects HTML input and stays unchanged
- **`normalizeLinks`** takes `*http.Request` but only uses `.URL` — will simplify to `*url.URL`
- **wiring**: `main.go` creates `extractor.UReadability` struct with `TimeOut`, `SnippetSize`, `Rules` fields; `rest.Server` holds it as a concrete struct
- **CF `/content` endpoint**: `POST /accounts/{id}/browser-rendering/content` with `{"url": "..."}`, Bearer token auth, returns rendered HTML
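
To illustrate why a `*url.URL` is sufficient for `normalizeLinks`, here is a sketch of resolving links against the page URL with the standard library. The `absolutize` and `mustParse` helpers are hypothetical, and the assumption that `normalizeLinks` resolves relative links this way is not confirmed by the plan.

```go
package main

import (
	"fmt"
	"net/url"
)

// absolutize resolves href against base, the final page URL.
// Links that cannot be parsed are returned untouched.
func absolutize(base *url.URL, href string) string {
	ref, err := url.Parse(href)
	if err != nil {
		return href
	}
	return base.ResolveReference(ref).String()
}

// mustParse is a demo helper that panics on an invalid URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	base := mustParse("https://example.com/blog/post.html")
	for _, href := range []string{"/img/a.png", "next.html", "https://other.org/x"} {
		fmt.Println(absolutize(base, href))
	}
}
```

Nothing here needs the request method, headers or body, which is consistent with narrowing the parameter from `*http.Request` to `*url.URL`.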

## Development Approach
- **testing approach**: Regular (code first, then tests)
- complete each task fully before moving to the next
- make small, focused changes
- **CRITICAL: every task MUST include new/updated tests** for code changes in that task
- **CRITICAL: all tests must pass before starting next task**
- **CRITICAL: update this plan file when scope changes during implementation**
- run tests after each change
- maintain backward compatibility — existing code creating `UReadability{}` without `Retriever` must continue to work (nil defaults to HTTP fetch)

## Testing Strategy
- **unit tests**: httptest mock servers for both retrievers, table-driven tests, testify assertions (matching existing patterns)
- **mock generation**: moq-generated mock for `Retriever` interface (same pattern as `Rules` mock)
- **integration**: existing `readability_test.go` tests must pass unchanged (they create `UReadability` without `Retriever`)

## Progress Tracking
- mark completed items with `[x]` immediately when done
- add newly discovered tasks with + prefix
- document issues/blockers with warning prefix
- update plan if implementation deviates from original scope

## Implementation Steps

### Task 1: Create Retriever interface and HTTPRetriever

**Files:**
- Create: `extractor/retriever.go`
- Create: `extractor/retriever_test.go`

- [x] define `Retriever` interface with `Retrieve(ctx context.Context, url string) (*RetrieveResult, error)` method
- [x] define `RetrieveResult` struct with `Body []byte`, `URL string`, `Header http.Header`
- [x] implement `HTTPRetriever` struct extracting the current fetch logic from `extractWithRules` (HTTP client, Safari user-agent, redirect following, body reading)
- [x] add `//go:generate moq` directive for `Retriever` interface
- [x] write tests for `HTTPRetriever`: successful fetch, redirect following, user-agent header, error cases (bad URL, connection refused)
- [x] run tests — must pass before next task
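
A minimal sketch of what Task 1 describes, assuming the interface and struct shapes listed above. The user-agent string, the error wording and the `demoFetch` helper are placeholders rather than the project's actual code.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// RetrieveResult and Retriever follow the shapes given in the plan.
type RetrieveResult struct {
	Body   []byte
	URL    string // final URL after redirects
	Header http.Header
}

type Retriever interface {
	Retrieve(ctx context.Context, url string) (*RetrieveResult, error)
}

// userAgent is an assumed Safari-like value, not the project's constant.
const userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15"

// HTTPRetriever fetches pages with a plain HTTP GET.
type HTTPRetriever struct {
	Timeout time.Duration
}

func (h *HTTPRetriever) Retrieve(ctx context.Context, url string) (*RetrieveResult, error) {
	client := &http.Client{Timeout: h.Timeout} // follows redirects by default
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("build request for %s: %w", url, err)
	}
	req.Header.Set("User-Agent", userAgent)
	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("fetch %s: %w", url, err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read body of %s: %w", url, err)
	}
	// resp.Request is the last request sent, so its URL reflects redirects.
	return &RetrieveResult{Body: body, URL: resp.Request.URL.String(), Header: resp.Header}, nil
}

// demoFetch runs the retriever against a local test server that redirects
// /start to /final and echoes the received user-agent.
func demoFetch() (body, finalPath string, err error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/start", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/final", http.StatusFound)
	})
	mux.HandleFunc("/final", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ua="+r.UserAgent())
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	var r Retriever = &HTTPRetriever{Timeout: 5 * time.Second}
	res, err := r.Retrieve(context.Background(), srv.URL+"/start")
	if err != nil {
		return "", "", err
	}
	return string(res.Body), strings.TrimPrefix(res.URL, srv.URL), nil
}

func main() {
	body, finalPath, err := demoFetch()
	fmt.Println(body, finalPath, err)
}
```

The `httptest` server mirrors the testing approach named in the plan and covers the redirect and user-agent cases without touching the network.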

### Task 2: Implement CloudflareRetriever

**Files:**
- Modify: `extractor/retriever.go`
- Modify: `extractor/retriever_test.go`

- [x] implement `CloudflareRetriever` struct with `AccountID`, `APIToken`, `BaseURL` (for test override), `Timeout` fields
- [x] implement `Retrieve` method: POST to `/accounts/{id}/browser-rendering/content` with `{"url": "...", "gotoOptions": {"waitUntil": "networkidle0"}}`, Bearer token auth
- [x] handle response: try JSON `{"success": true, "result": "<html>"}` first, fall back to raw body; set `Content-Type: text/html; charset=utf-8` header
- [x] write tests: successful fetch (mock CF API), API error (non-200 status), JSON response format, raw HTML response format
- [x] run tests — must pass before next task
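
The "JSON first, raw body as fallback" handling from Task 2 might look like the sketch below. The `cfEnvelope` type and `extractHTML` function are hypothetical names, and treating `"success": false` as an error is an assumption that the plan does not spell out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// cfEnvelope models the JSON wrapper mentioned in the plan:
// {"success": true, "result": "<html>"}.
type cfEnvelope struct {
	Success bool   `json:"success"`
	Result  string `json:"result"`
}

// extractHTML returns the rendered HTML from a response body.
// Bodies that are not a JSON object are treated as raw HTML.
func extractHTML(body []byte) ([]byte, error) {
	var env cfEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return body, nil // not the JSON wrapper, assume raw HTML
	}
	if !env.Success {
		// assumption: an unsuccessful wrapper is surfaced as an error
		return nil, errors.New("cloudflare response reported success=false")
	}
	return []byte(env.Result), nil
}

func main() {
	wrapped, err := extractHTML([]byte(`{"success": true, "result": "<p>hi</p>"}`))
	fmt.Println(string(wrapped), err)

	raw, err := extractHTML([]byte("<html><body>raw</body></html>"))
	fmt.Println(string(raw), err)
}
```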

### Task 3: Wire Retriever into UReadability

**Files:**
- Modify: `extractor/readability.go`
- Modify: `extractor/readability_test.go`

- [x] add `Retriever Retriever` field to `UReadability` struct
- [x] add `retriever()` helper method: returns `f.Retriever` if non-nil, otherwise `&HTTPRetriever{Timeout: f.TimeOut}`
- [x] replace inline HTTP fetch in `extractWithRules` (lines 80-106) with `f.retriever().Retrieve(ctx, reqURL)` call
- [x] use `result.URL`, `result.Body`, `result.Header` instead of `resp.Request.URL`, `io.ReadAll(resp.Body)`, `resp.Header`
- [x] change `normalizeLinks` signature from `*http.Request` to `*url.URL` (only `.URL` field is used); update caller to pass parsed URL
- [x] remove unused imports from `readability.go` (`io`, `net/http`)
- [x] update `TestNormalizeLinks` and `TestNormalizeLinksIssue` to pass `*url.URL` instead of `&http.Request{URL: u}`
- [x] verify all existing tests pass unchanged (tests create `UReadability` without `Retriever` — nil defaults to HTTPRetriever)
- [x] run full test suite: `go test -timeout=60s -race ./...`
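
The nil-default helper from this task can be sketched as below. The types are cut-down stand-ins: `HTTPRetriever.Retrieve` is stubbed out, and `stubRetriever` and `kind` exist only for the demonstration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type RetrieveResult struct {
	Body []byte
	URL  string
}

type Retriever interface {
	Retrieve(ctx context.Context, url string) (*RetrieveResult, error)
}

// HTTPRetriever is stubbed here; the real one performs an HTTP GET.
type HTTPRetriever struct{ Timeout time.Duration }

func (h *HTTPRetriever) Retrieve(_ context.Context, url string) (*RetrieveResult, error) {
	return &RetrieveResult{URL: url}, nil
}

// stubRetriever is a hypothetical custom backend used for the demo.
type stubRetriever struct{ body string }

func (s stubRetriever) Retrieve(_ context.Context, url string) (*RetrieveResult, error) {
	return &RetrieveResult{Body: []byte(s.body), URL: url}, nil
}

// UReadability keeps only the fields relevant to retriever selection.
type UReadability struct {
	TimeOut   time.Duration
	Retriever Retriever
}

// retriever returns the configured backend, or an HTTPRetriever when the
// field was left nil, so that a bare UReadability{} keeps working.
func (f UReadability) retriever() Retriever {
	if f.Retriever != nil {
		return f.Retriever
	}
	return &HTTPRetriever{Timeout: f.TimeOut}
}

// kind names the concrete retriever type (demo helper).
func kind(r Retriever) string {
	if _, ok := r.(*HTTPRetriever); ok {
		return "http"
	}
	return "custom"
}

func main() {
	fmt.Println(kind(UReadability{TimeOut: 30 * time.Second}.retriever()))
	fmt.Println(kind(UReadability{Retriever: stubRetriever{body: "<p>x</p>"}}.retriever()))
}
```

One caveat of this pattern: a typed nil pointer stored in the interface field (for example a nil `*HTTPRetriever`) is not equal to `nil`, so it would bypass the default.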

### Task 4: Add CLI flags and wiring in main.go

**Files:**
- Modify: `main.go`

- [x] add `CFAccountID string` and `CFAPIToken string` fields to opts struct with `long`/`env` tags
- [x] in `main()`, create `CloudflareRetriever` when both flags are set; log which retriever is active
- [x] pass retriever to `UReadability` struct
- [x] run full test suite: `go test -timeout=60s -race ./...`

### Task 5: Generate mock and run linter

**Files:**
- Create: `extractor/retriever_mock_test.go` (generated, test-only — placed in extractor package to avoid import cycle with mocks/)

- [x] run `go generate ./extractor/...` to generate `Retriever` mock
- [x] run `gofmt -w` on all modified files
- [x] run `golangci-lint run --max-issues-per-linter=0 --max-same-issues=0`
- [x] fix any lint issues

### Task 6: Verify acceptance criteria

- [x] verify `UReadability{}` without `Retriever` field works (backward compatible)
- [x] verify `UReadability{Retriever: &CloudflareRetriever{...}}` works
- [x] verify all existing tests pass: `go test -timeout=60s -race ./...`
- [x] verify mock is generated and up to date

### Task 7: [Final] Update documentation

- [x] update CLAUDE.md architecture section to mention `Retriever` interface
- [x] update CLAUDE.md build section with new CLI flags
- [x] move this plan to `docs/plans/completed/`

## Technical Details

### Retriever interface

```go
type Retriever interface {
Retrieve(ctx context.Context, url string) (*RetrieveResult, error)
}

type RetrieveResult struct {
Body []byte // raw page content (HTML)
URL string // final URL after redirects
Header http.Header // response headers (for charset detection)
}
```

### CloudflareRetriever request/response

```
POST https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/content
Authorization: Bearer {api_token}
Content-Type: application/json

{"url": "https://example.com", "gotoOptions": {"waitUntil": "networkidle0"}}
```

Response: fully rendered HTML (may be JSON-wrapped `{"success": true, "result": "<html>"}` or raw HTML).
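
The request shown above could be built in Go along these lines. This is a sketch under assumptions: `newContentRequest`, `describeRequest` and the payload struct names are hypothetical, while the endpoint path, headers and JSON fields are the ones listed in this section.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

const defaultBaseURL = "https://api.cloudflare.com/client/v4"

type gotoOptions struct {
	WaitUntil string `json:"waitUntil"`
}

type contentRequest struct {
	URL         string      `json:"url"`
	GotoOptions gotoOptions `json:"gotoOptions"`
}

// newContentRequest builds the POST request for the /content endpoint.
// baseURL is a parameter so tests can point it at a mock server.
func newContentRequest(ctx context.Context, baseURL, accountID, apiToken, pageURL string) (*http.Request, error) {
	payload, err := json.Marshal(contentRequest{
		URL:         pageURL,
		GotoOptions: gotoOptions{WaitUntil: "networkidle0"},
	})
	if err != nil {
		return nil, fmt.Errorf("encode payload: %w", err)
	}
	endpoint := fmt.Sprintf("%s/accounts/%s/browser-rendering/content", baseURL, accountID)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+apiToken)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// describeRequest is a demo helper that exposes the parts of the request
// a test would typically assert on.
func describeRequest(accountID, apiToken, pageURL string) (method, target, auth, body string, err error) {
	req, err := newContentRequest(context.Background(), defaultBaseURL, accountID, apiToken, pageURL)
	if err != nil {
		return "", "", "", "", err
	}
	raw, err := io.ReadAll(req.Body)
	if err != nil {
		return "", "", "", "", err
	}
	return req.Method, req.URL.String(), req.Header.Get("Authorization"), string(raw), nil
}

func main() {
	method, target, auth, body, err := describeRequest("acc123", "secret", "https://example.com")
	fmt.Println(method, target, auth, body, err)
}
```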

### CLI flags

| Flag | Env | Description |
|------|-----|-------------|
| `--cf-account-id` | `CF_ACCOUNT_ID` | Cloudflare account ID for Browser Rendering API |
| `--cf-api-token` | `CF_API_TOKEN` | Cloudflare API token with Browser Rendering Edit permission |

When both are set → `CloudflareRetriever`; otherwise → `HTTPRetriever` (default).
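
The selection rule above is small enough to sketch directly. The `options` struct and `retrieverName` function are hypothetical; the struct tags follow the `long`/`env` convention mentioned in Task 4, but no flag parser is invoked in this example.

```go
package main

import "fmt"

// options mirrors the two fields added to the opts struct in main.go.
type options struct {
	CFAccountID string `long:"cf-account-id" env:"CF_ACCOUNT_ID" description:"Cloudflare account ID"`
	CFAPIToken  string `long:"cf-api-token" env:"CF_API_TOKEN" description:"Cloudflare API token"`
}

// retrieverName names the retriever that would be wired in.
// A partial configuration (only one value set) falls back to HTTP.
func retrieverName(o options) string {
	if o.CFAccountID != "" && o.CFAPIToken != "" {
		return "CloudflareRetriever"
	}
	return "HTTPRetriever"
}

func main() {
	fmt.Println(retrieverName(options{}))
	fmt.Println(retrieverName(options{CFAccountID: "acc"}))
	fmt.Println(retrieverName(options{CFAccountID: "acc", CFAPIToken: "tok"}))
}
```

Logging the chosen name at startup, as Task 4 suggests, makes a half-configured deployment easier to spot.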

### Pipeline flow (unchanged)

```
Retriever.Retrieve(url) → toUtf8 → getContent (rules/readability) → title
↑ NEW │ │
│ ↓ ↓
HTTPRetriever (default) getText → normalizeLinks → getSnippet → extractPics
CloudflareRetriever (opt)
```

## Post-Completion

**External system updates:**
- super-bot deployment: add `CF_ACCOUNT_ID` and `CF_API_TOKEN` env vars to ukeeper-readability deployment config when switching to Cloudflare retrieval
- Cloudflare setup: create API token with "Browser Rendering - Edit" permission under the target account
- radio-t/super-bot#156: can be closed once this is deployed — super-bot continues using the existing `uKeeperGetter` interface unchanged

**Manual verification:**
- test against real Cloudflare Browser Rendering API with known problematic URLs (sites returning "just a moment..." to direct HTTP)
- verify free tier limits are acceptable (10 min/day browser time, 1 req/10 sec rate limit)