diff --git a/CLAUDE.md b/CLAUDE.md index 6dafc24f..55efbdf7 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -17,6 +17,10 @@ golangci-lint run --max-issues-per-linter=0 --max-same-issues=0 # lint from rep The `revision` variable in `main.go` is injected at build time: `-ldflags "-X main.revision="`. +Optional Cloudflare Browser Rendering flags (when both are set, uses `CloudflareRetriever` instead of default HTTP): +- `--cf-account-id` / `CF_ACCOUNT_ID` — Cloudflare account ID +- `--cf-api-token` / `CF_API_TOKEN` — Cloudflare API token with Browser Rendering Edit permission + `main_test.go` is gated behind `ENABLE_MONGO_TESTS=true` and needs MongoDB on localhost:27017. All other packages test independently — `datastore/` spins up MongoDB via testcontainers automatically. ## Architecture @@ -32,11 +36,13 @@ web/ → Go HTML templates (HTMX v2), static assets **Dependency flow:** `main → datastore, extractor, rest`; `rest → datastore, extractor`; `extractor → datastore` (Rule type + Rules interface). -**Key interface** — `extractor.Rules` (defined consumer-side in `extractor/readability.go`), implemented by `datastore.RulesDAO`. Mock generated with `//go:generate moq` in extractor package. +**Key interfaces:** +- `extractor.Rules` (defined consumer-side in `extractor/readability.go`), implemented by `datastore.RulesDAO`. Mock generated with `//go:generate moq` in extractor package. +- `extractor.Retriever` (defined in `extractor/retriever.go`) — abstracts URL content fetching. Two implementations: `HTTPRetriever` (default, standard HTTP GET with Safari user-agent) and `CloudflareRetriever` (Cloudflare Browser Rendering API for JS-rendered pages). When `UReadability.Retriever` is nil, defaults to `HTTPRetriever`. ## Content Extraction Flow -1. Fetch URL (30s timeout, Safari user-agent, follows redirects) +1. 
Fetch URL via `Retriever` interface (default: HTTP GET with 30s timeout, Safari user-agent, follows redirects; optional: Cloudflare Browser Rendering for JS-heavy sites) 2. Detect charset from Content-Type header and `` tags, convert to UTF-8 3. Look up custom CSS selector rule from MongoDB by domain 4. If rule found → extract via goquery CSS selector; if fails → fall back to general parser @@ -47,7 +53,7 @@ web/ → Go HTML templates (HTMX v2), static assets - Rule upsert is keyed on `domain` — one rule per domain. Rules are disabled (`enabled: false`), never deleted. - `rest.Server.Readability` is `extractor.UReadability` by value (not pointer), with `Rules` interface field inside. -- Protected API routes use custom `basicAuth` middleware with constant-time comparison. +- `/api/content/v1/parser` requires the `token` query parameter when configured. Token comparison uses `subtle.ConstantTimeCompare`. Protected rule management routes use custom `basicAuth` middleware with constant-time comparison. - Web UI text is in Russian — tests assert on Russian strings, don't change them. - Middleware stack: Recoverer → RealIP → AppInfo+Ping → Throttle(50) → Logger. -- CI runs tests and lint in the `build` job; Docker build only compiles (no tests inside Docker). +- CI: `ci.yml` runs tests and lint in the `build` job (MongoDB via service container); `docker.yml` builds Docker images via `workflow_run` trigger after `build` succeeds (no tests inside Docker). 
diff --git a/README.md b/README.md index 9a9639d2..e2648b7b 100644 --- a/README.md +++ b/README.md @@ -12,16 +12,28 @@ | port | UKEEPER_PORT | `8080` | web server port | | mongo-uri | MONGO_URI | none | MongoDB connection string, _required_ | | frontend-dir | FRONTEND_DIR | `/srv/web` | directory with frontend files | -| token | TOKEN | none | token for /content/v1/parser endpoint auth | +| token | UKEEPER_TOKEN | none | token for API endpoint auth | | mongo-delay | MONGO_DELAY | `0` | mongo initial delay | | mongo-db | MONGO_DB | `ureadability` | mongo database name | | creds | CREDS | none | credentials for protected calls (POST, DELETE /rules) | +| cf-account-id| CF_ACCOUNT_ID | none | Cloudflare account ID for Browser Rendering API | +| cf-api-token | CF_API_TOKEN | none | Cloudflare API token with Browser Rendering Edit perm | +| cf-route-all | CF_ROUTE_ALL | `false` | route every request through Cloudflare Browser Rendering | | dbg | DEBUG | `false` | debug mode | +### Cloudflare Browser Rendering (optional) + +Cloudflare Browser Rendering is useful for JavaScript-heavy pages and sites behind a "please enable JS" wall, but it's slower than direct HTTP and the free tier throttles at 1 request per 10 seconds. To keep the service cost-effective, Cloudflare routing is **opt-in**. + +1. Set `--cf-account-id` and `--cf-api-token` to enable Cloudflare routing. +2. Either flip the `use_cloudflare` checkbox on individual rules (per-domain, recommended) or set `--cf-route-all=true` to route every request through Cloudflare. + +When Cloudflare credentials are not set, the service uses a standard HTTP client for everything (default). On HTTP 429 (rate limit) the service automatically retries with exponential backoff and respects the `Retry-After` header. 
+ ### API GET /api/content/v1/parser?token=secret&url=http://aa.com/blah - extract content (emulate Readability API parse call) - POST /api/v1/extract {url: http://aa.com/blah} - extract content + POST /api/extract {url: http://aa.com/blah} - extract content ## Development diff --git a/datastore/rules.go b/datastore/rules.go index 8db12f82..ddc77041 100644 --- a/datastore/rules.go +++ b/datastore/rules.go @@ -18,16 +18,17 @@ type RulesDAO struct { // Rule record, entry in mongo type Rule struct { - ID bson.ObjectID `json:"id" bson:"_id,omitempty"` - Domain string `json:"domain"` - MatchURLs []string `json:"match_url,omitempty" bson:"match_urls,omitempty"` - Content string `json:"content"` - Author string `json:"author,omitempty" bson:"author,omitempty"` - TS string `json:"ts,omitempty" bson:"ts,omitempty"` // ts of original article - Excludes []string `json:"excludes,omitempty" bson:"excludes,omitempty"` - TestURLs []string `json:"test_urls,omitempty" bson:"test_urls"` - User string `json:"user"` - Enabled bool `json:"enabled"` + ID bson.ObjectID `json:"id" bson:"_id,omitempty"` + Domain string `json:"domain"` + MatchURLs []string `json:"match_url,omitempty" bson:"match_urls,omitempty"` + Content string `json:"content"` + Author string `json:"author,omitempty" bson:"author,omitempty"` + TS string `json:"ts,omitempty" bson:"ts,omitempty"` // ts of original article + Excludes []string `json:"excludes,omitempty" bson:"excludes,omitempty"` + TestURLs []string `json:"test_urls,omitempty" bson:"test_urls"` + User string `json:"user"` + Enabled bool `json:"enabled"` + UseCloudflare bool `json:"use_cloudflare,omitempty" bson:"use_cloudflare,omitempty"` // route fetch via Cloudflare Browser Rendering } // Get rule by url. 
Checks if found in mongo, matching by domain diff --git a/docs/plans/completed/20260329-modularise-retrieval.md b/docs/plans/completed/20260329-modularise-retrieval.md new file mode 100644 index 00000000..31df83d1 --- /dev/null +++ b/docs/plans/completed/20260329-modularise-retrieval.md @@ -0,0 +1,169 @@ +# Modularise URL Retrieval with Cloudflare Browser Rendering + +## Overview +- Extract the hardcoded HTTP fetch logic from `extractWithRules` into a `Retriever` interface so different content retrieval backends can be swapped in +- Add `CloudflareRetriever` using Cloudflare Browser Rendering `/content` endpoint — returns fully rendered HTML after JS execution, handling sites behind bot protection or requiring JS rendering +- Addresses PR review feedback on radio-t/super-bot#156: the content extraction improvement belongs in ukeeper-readability (the extraction layer), not in super-bot (the consumer) + +## Context (from discovery) +- **fetch logic**: embedded in `extractor/readability.go:extractWithRules` (lines 80-106) — creates `http.Client` inline, Safari user-agent, `io.ReadAll`, no abstraction +- **only existing interface**: `extractor.Rules` for datastore access; no fetcher interface exists +- **pipeline after fetch**: `toUtf8` → `getContent` (rules/readability) → title → `getText` → `normalizeLinks` → `getSnippet` → `extractPics` — all expects HTML input, stays unchanged +- **`normalizeLinks`** takes `*http.Request` but only uses `.URL` — will simplify to `*url.URL` +- **wiring**: `main.go` creates `extractor.UReadability` struct with `TimeOut`, `SnippetSize`, `Rules` fields; `rest.Server` holds it as a concrete struct +- **CF `/content` endpoint**: `POST /accounts/{id}/browser-rendering/content` with `{"url": "..."}`, Bearer token auth, returns rendered HTML + +## Development Approach +- **testing approach**: Regular (code first, then tests) +- complete each task fully before moving to the next +- make small, focused changes +- **CRITICAL: every task MUST 
include new/updated tests** for code changes in that task +- **CRITICAL: all tests must pass before starting next task** +- **CRITICAL: update this plan file when scope changes during implementation** +- run tests after each change +- maintain backward compatibility — existing code creating `UReadability{}` without `Retriever` must continue to work (nil defaults to HTTP fetch) + +## Testing Strategy +- **unit tests**: httptest mock servers for both retrievers, table-driven tests, testify assertions (matching existing patterns) +- **mock generation**: moq-generated mock for `Retriever` interface (same pattern as `Rules` mock) +- **integration**: existing `readability_test.go` tests must pass unchanged (they create `UReadability` without `Retriever`) + +## Progress Tracking +- mark completed items with `[x]` immediately when done +- add newly discovered tasks with + prefix +- document issues/blockers with warning prefix +- update plan if implementation deviates from original scope + +## Implementation Steps + +### Task 1: Create Retriever interface and HTTPRetriever + +**Files:** +- Create: `extractor/retriever.go` +- Create: `extractor/retriever_test.go` + +- [x] define `Retriever` interface with `Retrieve(ctx context.Context, url string) (*RetrieveResult, error)` method +- [x] define `RetrieveResult` struct with `Body []byte`, `URL string`, `Header http.Header` +- [x] implement `HTTPRetriever` struct extracting the current fetch logic from `extractWithRules` (HTTP client, Safari user-agent, redirect following, body reading) +- [x] add `//go:generate moq` directive for `Retriever` interface +- [x] write tests for `HTTPRetriever`: successful fetch, redirect following, user-agent header, error cases (bad URL, connection refused) +- [x] run tests — must pass before next task + +### Task 2: Implement CloudflareRetriever + +**Files:** +- Modify: `extractor/retriever.go` +- Modify: `extractor/retriever_test.go` + +- [x] implement `CloudflareRetriever` struct with 
`AccountID`, `APIToken`, `BaseURL` (for test override), `Timeout` fields +- [x] implement `Retrieve` method: POST to `/accounts/{id}/browser-rendering/content` with `{"url": "...", "gotoOptions": {"waitUntil": "networkidle0"}}`, Bearer token auth +- [x] handle response: try JSON `{"success": true, "result": ""}` first, fall back to raw body; set `Content-Type: text/html; charset=utf-8` header +- [x] write tests: successful fetch (mock CF API), API error (non-200 status), JSON response format, raw HTML response format +- [x] run tests — must pass before next task + +### Task 3: Wire Retriever into UReadability + +**Files:** +- Modify: `extractor/readability.go` +- Modify: `extractor/readability_test.go` + +- [x] add `Retriever Retriever` field to `UReadability` struct +- [x] add `retriever()` helper method: returns `f.Retriever` if non-nil, otherwise `&HTTPRetriever{Timeout: f.TimeOut}` +- [x] replace inline HTTP fetch in `extractWithRules` (lines 80-106) with `f.retriever().Retrieve(ctx, reqURL)` call +- [x] use `result.URL`, `result.Body`, `result.Header` instead of `resp.Request.URL`, `io.ReadAll(resp.Body)`, `resp.Header` +- [x] change `normalizeLinks` signature from `*http.Request` to `*url.URL` (only `.URL` field is used); update caller to pass parsed URL +- [x] remove unused imports from `readability.go` (`io`, `net/http`) +- [x] update `TestNormalizeLinks` and `TestNormalizeLinksIssue` to pass `*url.URL` instead of `&http.Request{URL: u}` +- [x] verify all existing tests pass unchanged (tests create `UReadability` without `Retriever` — nil defaults to HTTPRetriever) +- [x] run full test suite: `go test -timeout=60s -race ./...` + +### Task 4: Add CLI flags and wiring in main.go + +**Files:** +- Modify: `main.go` + +- [x] add `CFAccountID string` and `CFAPIToken string` fields to opts struct with `long`/`env` tags +- [x] in `main()`, create `CloudflareRetriever` when both flags are set; log which retriever is active +- [x] pass retriever to `UReadability` 
struct +- [x] run full test suite: `go test -timeout=60s -race ./...` + +### Task 5: Generate mock and run linter + +**Files:** +- Create: `extractor/retriever_mock_test.go` (generated, test-only — placed in extractor package to avoid import cycle with mocks/) + +- [x] run `go generate ./extractor/...` to generate `Retriever` mock +- [x] run `gofmt -w` on all modified files +- [x] run `golangci-lint run --max-issues-per-linter=0 --max-same-issues=0` +- [x] fix any lint issues + +### Task 6: Verify acceptance criteria + +- [x] verify `UReadability{}` without `Retriever` field works (backward compatible) +- [x] verify `UReadability{Retriever: &CloudflareRetriever{...}}` works +- [x] verify all existing tests pass: `go test -timeout=60s -race ./...` +- [x] verify mock is generated and up to date + +### Task 7: [Final] Update documentation + +- [x] update CLAUDE.md architecture section to mention `Retriever` interface +- [x] update CLAUDE.md build section with new CLI flags +- [x] move this plan to `docs/plans/completed/` + +## Technical Details + +### Retriever interface + +```go +type Retriever interface { + Retrieve(ctx context.Context, url string) (*RetrieveResult, error) +} + +type RetrieveResult struct { + Body []byte // raw page content (HTML) + URL string // final URL after redirects + Header http.Header // response headers (for charset detection) +} +``` + +### CloudflareRetriever request/response + +``` +POST https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/content +Authorization: Bearer {api_token} +Content-Type: application/json + +{"url": "https://example.com", "gotoOptions": {"waitUntil": "networkidle0"}} +``` + +Response: fully rendered HTML (may be JSON-wrapped `{"success": true, "result": ""}` or raw HTML). 
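The "try JSON envelope, fall back to raw HTML" handling from Task 2 can be sketched like this; `extractHTML` is a hypothetical helper name, and the field names follow the documented `{"success": ..., "result": ...}` shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// extractHTML tries the Cloudflare JSON envelope first; if the body is not
// that envelope, it is treated as raw rendered HTML.
func extractHTML(body []byte) string {
	var cf struct {
		Success bool   `json:"success"`
		Result  string `json:"result"`
	}
	if err := json.Unmarshal(body, &cf); err == nil && cf.Success && cf.Result != "" {
		return cf.Result // JSON-wrapped response
	}
	return string(body) // raw HTML response
}

func main() {
	wrapped := []byte(`{"success": true, "result": "<html><body>hi</body></html>"}`)
	fmt.Println(strings.Contains(extractHTML(wrapped), "<body>hi</body>")) // true

	raw := []byte("<html><body>raw</body></html>")
	fmt.Println(extractHTML(raw)) // <html><body>raw</body></html>
}
```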
+ +### CLI flags + +| Flag | Env | Description | +|------|-----|-------------| +| `--cf-account-id` | `CF_ACCOUNT_ID` | Cloudflare account ID for Browser Rendering API | +| `--cf-api-token` | `CF_API_TOKEN` | Cloudflare API token with Browser Rendering Edit permission | + +When both are set → `CloudflareRetriever`; otherwise → `HTTPRetriever` (default). + +### Pipeline flow (unchanged) + +``` +Retriever.Retrieve(url) → toUtf8 → getContent (rules/readability) → title + ↑ NEW │ │ + │ ↓ ↓ +HTTPRetriever (default) getText → normalizeLinks → getSnippet → extractPics +CloudflareRetriever (opt) +``` + +## Post-Completion + +**External system updates:** +- super-bot deployment: add `CF_ACCOUNT_ID` and `CF_API_TOKEN` env vars to ukeeper-readability deployment config when switching to Cloudflare retrieval +- Cloudflare setup: create API token with "Browser Rendering - Edit" permission under the target account +- radio-t/super-bot#156: can be closed once this is deployed — super-bot continues using the existing `uKeeperGetter` interface unchanged + +**Manual verification:** +- test against real Cloudflare Browser Rendering API with known problematic URLs (sites returning "just a moment..." 
to direct HTTP) +- verify free tier limits are acceptable (10 min/day browser time, 1 req/10 sec rate limit) diff --git a/extractor/readability.go b/extractor/readability.go index 7ffb8c28..bfa7f081 100644 --- a/extractor/readability.go +++ b/extractor/readability.go @@ -4,11 +4,10 @@ package extractor import ( "context" "fmt" - "io" - "net/http" "net/url" "regexp" "strings" + "sync" "time" "github.com/PuerkitoBio/goquery" @@ -35,6 +34,39 @@ type UReadability struct { TimeOut time.Duration SnippetSize int Rules Rules + Retriever Retriever // default retriever; when nil a cached HTTPRetriever is used + CFRetriever Retriever // optional Cloudflare Browser Rendering retriever; when set, enables routing + CFRouteAll bool // route every request through CFRetriever (requires CFRetriever != nil) + + defaultRetrieverOnce sync.Once + defaultRetriever Retriever +} + +// retriever returns the configured default Retriever, creating a cached HTTPRetriever if nil +func (f *UReadability) retriever() Retriever { + if f.Retriever != nil { + return f.Retriever + } + f.defaultRetrieverOnce.Do(func() { + f.defaultRetriever = &HTTPRetriever{Timeout: f.TimeOut} + }) + return f.defaultRetriever +} + +// pickRetriever decides which retriever should fetch the given URL based on routing config and an +// optional pre-resolved rule. Falls back to the default retriever unless CFRetriever is set AND +// either CFRouteAll is true or the rule explicitly asks for Cloudflare. 
+func (f *UReadability) pickRetriever(rule *datastore.Rule) Retriever { + if f.CFRetriever == nil { + return f.retriever() + } + if f.CFRouteAll { + return f.CFRetriever + } + if rule != nil && rule.UseCloudflare { + return f.CFRetriever + } + return f.retriever() } // Response from api calls @@ -77,36 +109,23 @@ func (f *UReadability) extractWithRules(ctx context.Context, reqURL string, rule log.Printf("[INFO] extract %s", reqURL) rb := &Response{} - httpClient := &http.Client{Timeout: f.TimeOut} - req, err := http.NewRequestWithContext(ctx, "GET", reqURL, http.NoBody) - if err != nil { - log.Printf("[WARN] failed to create request for %s, error=%v", reqURL, err) - return nil, err + // look up a rule by domain once up front (unless one was explicitly passed) so retriever + // selection and getContent share the same lookup instead of paying for two round-trips. + if rule == nil && f.Rules != nil { + if r, found := f.Rules.Get(ctx, reqURL); found { + rule = &r + } } - req.Close = true - req.Header.Set("User-Agent", userAgent) - resp, err := httpClient.Do(req) + result, err := f.pickRetriever(rule).Retrieve(ctx, reqURL) if err != nil { - log.Printf("[WARN] failed to get anything from %s, error=%v", reqURL, err) return nil, err } - defer func() { - if err = resp.Body.Close(); err != nil { - log.Printf("[WARN] failed to close response body, error=%v", err) - } - }() - rb.URL = resp.Request.URL.String() - dataBytes, e := io.ReadAll(resp.Body) - - if e != nil { - log.Printf("[WARN] failed to read data from %s, error=%v", reqURL, e) - return nil, e - } + rb.URL = result.URL var body string - rb.ContentType, rb.Charset, body = f.toUtf8(dataBytes, resp.Header) + rb.ContentType, rb.Charset, body = f.toUtf8(result.Body, result.Header) rb.Content, rb.Rich, err = f.getContent(ctx, body, reqURL, rule) if err != nil { log.Printf("[WARN] failed to parse %s, error=%v", reqURL, err) @@ -120,12 +139,14 @@ func (f *UReadability) extractWithRules(ctx context.Context, reqURL string, 
rule rb.Title = dbody.Find("title").First().Text() - if r, e := url.Parse(rb.URL); e == nil { - rb.Domain = r.Host + finalURL, err := url.Parse(rb.URL) + if err != nil { + return nil, fmt.Errorf("parse final URL %q: %w", rb.URL, err) } + rb.Domain = finalURL.Host rb.Content = f.getText(rb.Content, rb.Title) - rb.Rich, rb.AllLinks = f.normalizeLinks(rb.Rich, resp.Request) + rb.Rich, rb.AllLinks = f.normalizeLinks(rb.Rich, finalURL) rb.Excerpt = f.getSnippet(rb.Content) darticle, err := goquery.NewDocumentFromReader(strings.NewReader(rb.Rich)) if err != nil { @@ -141,10 +162,10 @@ func (f *UReadability) extractWithRules(ctx context.Context, reqURL string, rule return rb, nil } -// getContent retrieves content from raw body string, both content (text only) and rich (with html tags) -// if rule is provided, it uses custom rule, otherwise tries to retrieve one from the storage, -// and at last tries to use general readability parser -func (f *UReadability) getContent(ctx context.Context, body, reqURL string, rule *datastore.Rule) (content, rich string, err error) { +// getContent retrieves content from raw body string, both content (text only) and rich (with html tags). +// if rule is provided, it tries the custom rule first and falls back to the general parser on failure. +// rule lookup for a given URL is done upstream in extractWithRules. 
+func (f *UReadability) getContent(_ context.Context, body, reqURL string, rule *datastore.Rule) (content, rich string, err error) { // general parser genParser := func(body, _ string) (content, rich string, err error) { doc, err := readability.NewDocument(body) @@ -177,28 +198,19 @@ func (f *UReadability) getContent(ctx context.Context, body, reqURL string, rule if rule != nil { log.Printf("[DEBUG] custom rule provided for %s: %v", reqURL, rule) - return customParser(body, reqURL, *rule) - } - - if f.Rules != nil { - r := f.Rules - if rule, found := r.Get(ctx, reqURL); found { - if content, rich, err = customParser(body, reqURL, rule); err == nil { - return content, rich, nil - } - log.Printf("[WARN] custom extractor failed for %s, error=%v", reqURL, err) // back to general parser + if content, rich, err = customParser(body, reqURL, *rule); err == nil { + return content, rich, nil } - } else { - log.Print("[DEBUG] no rules defined!") + log.Printf("[WARN] custom extractor failed for %s, error=%v", reqURL, err) // back to general parser } return genParser(body, reqURL) } // makes all links absolute and returns all found links -func (f *UReadability) normalizeLinks(data string, reqContext *http.Request) (result string, links []string) { +func (f *UReadability) normalizeLinks(data string, baseURL *url.URL) (result string, links []string) { absoluteLink := func(link string) (absLink string, changed bool) { - if r, err := reqContext.URL.Parse(link); err == nil { + if r, err := baseURL.Parse(link); err == nil { return r.String(), r.String() != link } return "", false diff --git a/extractor/readability_test.go b/extractor/readability_test.go index 39f519b1..68a9ea6d 100644 --- a/extractor/readability_test.go +++ b/extractor/readability_test.go @@ -26,6 +26,10 @@ func TestExtractURL(t *testing.T) { } if r.URL.String() == "/2015/11/26/vsiem-mirom-dlia-obshchiei-polzy/" { fh, err := os.Open("testdata/vsiem-mirom-dlia-obshchiei-polzy.html") + if err != nil { + http.Error(w, 
err.Error(), http.StatusInternalServerError) + return + } w.Header().Set("Content-Type", "text/html; charset=windows-1251") // test non-standard charset decoding testHTML, err := io.ReadAll(fh) assert.NoError(t, err) @@ -105,6 +109,10 @@ func TestExtractGeneral(t *testing.T) { } if r.URL.String() == "/p/2015/11/22/podcast-369/" { fh, err := os.Open("testdata/podcast-369.html") + if err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + return + } testHTML, err := io.ReadAll(fh) assert.NoError(t, err) assert.NoError(t, fh.Close()) @@ -114,6 +122,10 @@ func TestExtractGeneral(t *testing.T) { } if r.URL.String() == "/2015/11/26/vsiem-mirom-dlia-obshchiei-polzy/" { fh, err := os.Open("testdata/vsiem-mirom-dlia-obshchiei-polzy.html") + if err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + return + } testHTML, err := io.ReadAll(fh) assert.NoError(t, err) assert.NoError(t, fh.Close()) @@ -124,13 +136,16 @@ func TestExtractGeneral(t *testing.T) { })) defer ts.Close() + tsURL, err := url.Parse(ts.URL) + require.NoError(t, err) + lr := UReadability{TimeOut: 30 * time.Second, SnippetSize: 200} a, err := lr.Extract(context.Background(), ts.URL+"/2015/11/26/vsiem-mirom-dlia-obshchiei-polzy/") require.NoError(t, err) assert.Equal(t, "Всем миром для общей пользы • Umputun тут был", a.Title) assert.Equal(t, ts.URL+"/2015/11/26/vsiem-mirom-dlia-obshchiei-polzy/", a.URL) assert.Equal(t, "Не первый раз я практикую идею “а давайте, ребята, сделаем для общего блага …”, и вот опять. В нашем подкасте радио-т есть незаменимый инструмент, позволяющий собирать новости, готовить их к выпуску, ...", a.Excerpt) - assert.Contains(t, ts.URL, a.Domain) + assert.Equal(t, tsURL.Host, a.Domain) a, err = lr.Extract(context.Background(), ts.URL+"/v48b6Q") require.NoError(t, err) @@ -138,7 +153,7 @@ func TestExtractGeneral(t *testing.T) { assert.Equal(t, ts.URL+"/p/2015/11/22/podcast-369/", a.URL) assert.Equal(t, "2015-11-22 Нагло ходил в гости. 
Табличка на двери сработала на 50%Никогда нас школа не хвалила. Девочка осваивает новый прибор. Мое неприятие их логики. И разошлись по будкам …Отбиваюсь от опасных ...", a.Excerpt) assert.Equal(t, "https://podcast.umputun.com/images/uwp/uwp369.jpg", a.Image) - assert.Contains(t, ts.URL, a.Domain) + assert.Equal(t, tsURL.Host, a.Domain) assert.Len(t, a.AllLinks, 13) assert.Contains(t, a.AllLinks, "https://podcast.umputun.com/media/ump_podcast369.mp3") assert.Contains(t, a.AllLinks, "https://podcast.umputun.com/images/uwp/uwp369.jpg") @@ -149,13 +164,13 @@ func TestNormalizeLinks(t *testing.T) { lr := UReadability{TimeOut: 30 * time.Second, SnippetSize: 200} inp := `blah sdfasd something blah33 xx` u, _ := url.Parse("http://ukeeper.com/blah") - out, links := lr.normalizeLinks(inp, &http.Request{URL: u}) + out, links := lr.normalizeLinks(inp, u) assert.Equal(t, `blah sdfasd something blah33 xx`, out) assert.Len(t, links, 3) inp = `<a href="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2016/01/page-source.jpg"> View Page Source</a>` - _, links = lr.normalizeLinks(inp, &http.Request{URL: u}) + _, links = lr.normalizeLinks(inp, u) assert.Len(t, links, 1) assert.Equal(t, "http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2016/01/page-source.jpg", links[0]) } @@ -170,6 +185,10 @@ func TestExtractByRule(t *testing.T) { ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { if r.URL.String() == "/2015/09/25/poiezdka-s-apple-maps/" { fh, err := os.Open("testdata/poiezdka-s-apple-maps.html") + if err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + return + } testHTML, err := io.ReadAll(fh) assert.NoError(t, err) assert.NoError(t, fh.Close()) @@ -207,17 +226,134 @@ func TestExtractByRule(t *testing.T) { }) } -func TestGetContentCustom(t *testing.T) { - rulesMock := &mocks.RulesMock{ - GetFunc: func(_ context.Context, _ string) (datastore.Rule, bool) { - return datastore.Rule{Content: "#content p, .post-title"}, true +func TestExtractWithCustomRetriever(t *testing.T) { + testHTML := `<html><head><title>Test Page</title></head><body><article><p>This is the article content from a custom retriever.</p></article></body></html>` + + mockRetriever := &RetrieverMock{ + RetrieveFunc: func(_ context.Context, reqURL string) (*RetrieveResult, error) { + header := make(http.Header) + header.Set("Content-Type", "text/html; charset=utf-8") + return &RetrieveResult{ + Body: []byte(testHTML), + URL: reqURL, + Header: header, + }, nil + }, } - lr := UReadability{TimeOut: 30 * time.Second, SnippetSize: 200, Rules: rulesMock} + + lr := UReadability{TimeOut: 30 * time.Second, SnippetSize: 200, Retriever: mockRetriever} + res, err := lr.Extract(context.Background(), "https://example.com/test-page") + require.NoError(t, err) + + assert.Equal(t, "Test Page", res.Title) + assert.Equal(t, "https://example.com/test-page", res.URL) + assert.Equal(t, "example.com", res.Domain) + assert.NotEmpty(t, res.Content) + assert.Contains(t, res.Content, "article content from a custom retriever") + + calls := mockRetriever.RetrieveCalls() + require.Len(t, calls, 1) + assert.Equal(t, "https://example.com/test-page", calls[0].URL) +} + +func TestPickRetriever(t *testing.T) { + mkRetriever := func(tag string) *RetrieverMock { + return &RetrieverMock{ + RetrieveFunc: func(_ context.Context, reqURL string) (*RetrieveResult, error) { + h := make(http.Header) + h.Set("Content-Type", "text/html; charset=utf-8") + return &RetrieveResult{ + Body: []byte("<html><head><title>" + tag + "</title></head><body><p>body-" + tag + "</p></body></html>"), + URL: reqURL, + Header: h, + }, nil + }, + } + } + + tests := []struct { + name string + cfRouteAll bool + useCloudflare bool + cfConfigured bool + wantTag string + }{ + {name: "no CF configured uses HTTP", cfConfigured: false, wantTag: "http"}, + {name: "CF configured, no flag uses HTTP", cfConfigured: true, wantTag: "http"}, + {name: "CF configured, rule asks for CF uses CF", cfConfigured: true, useCloudflare: true, wantTag: "cf"}, + {name: "CF configured, route-all uses CF", cfConfigured: true, cfRouteAll: true, wantTag: "cf"}, + {name: "route-all overrides rule flag", cfConfigured: true, cfRouteAll: true, useCloudflare: false, wantTag: "cf"}, + {name: "route-all without CF configured falls back to HTTP", cfConfigured: false, cfRouteAll: true, wantTag: "http"}, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + httpR := mkRetriever("http") + var cfR *RetrieverMock + lr := UReadability{ + TimeOut: time.Second, + SnippetSize: 200, + Retriever: httpR, + CFRouteAll: tt.cfRouteAll, + Rules: &mocks.RulesMock{ + GetFunc: func(_ context.Context, _ string) (datastore.Rule, bool) { + return datastore.Rule{Domain: "example.com", UseCloudflare: tt.useCloudflare}, true + }, + }, + } + if tt.cfConfigured { + cfR = mkRetriever("cf") + lr.CFRetriever = cfR + } + + _, err := lr.Extract(context.Background(), "https://example.com/page") + require.NoError(t, err) + + switch tt.wantTag { + case "http": + assert.Len(t, httpR.RetrieveCalls(), 1, "http retriever should have been called") + if cfR != nil { + assert.Empty(t, cfR.RetrieveCalls(), "cf retriever should not have been called") + } + case "cf": + require.NotNil(t, cfR) + assert.Len(t, cfR.RetrieveCalls(), 1, "cf retriever should have been called") + assert.Empty(t, httpR.RetrieveCalls(), "http retriever should not have been called") + } + }) + } +} + +func TestPickRetrieverNoRules(t *testing.T) { + httpR := &RetrieverMock{RetrieveFunc: func(_ context.Context, reqURL string) (*RetrieveResult, error)
{ + h := make(http.Header) + h.Set("Content-Type", "text/html; charset=utf-8") + return &RetrieveResult{Body: []byte("tx"), URL: reqURL, Header: h}, nil + }} + cfR := &RetrieverMock{RetrieveFunc: func(_ context.Context, reqURL string) (*RetrieveResult, error) { + h := make(http.Header) + h.Set("Content-Type", "text/html; charset=utf-8") + return &RetrieveResult{Body: []byte("tx"), URL: reqURL, Header: h}, nil + }} + lr := UReadability{TimeOut: time.Second, SnippetSize: 200, Retriever: httpR, CFRetriever: cfR} // no Rules + _, err := lr.Extract(context.Background(), "https://example.com/page") + require.NoError(t, err) + assert.Len(t, httpR.RetrieveCalls(), 1, "no rules → HTTP path") + assert.Empty(t, cfR.RetrieveCalls()) +} + +func TestGetContentCustom(t *testing.T) { + rule := &datastore.Rule{Content: "#content p, .post-title"} + lr := UReadability{TimeOut: 30 * time.Second, SnippetSize: 200} httpClient := &http.Client{Timeout: 30 * time.Second} ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { if r.URL.String() == "/2015/09/25/poiezdka-s-apple-maps/" { fh, err := os.Open("testdata/poiezdka-s-apple-maps.html") + if err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + return + } testHTML, err := io.ReadAll(fh) assert.NoError(t, err) assert.NoError(t, fh.Close()) @@ -234,7 +370,7 @@ func TestGetContentCustom(t *testing.T) { require.NoError(t, err) body := string(dataBytes) - content, rich, err := lr.getContent(context.Background(), body, ts.URL+"/2015/09/25/poiezdka-s-apple-maps/", nil) + content, rich, err := lr.getContent(context.Background(), body, ts.URL+"/2015/09/25/poiezdka-s-apple-maps/", rule) require.NoError(t, err) assert.Len(t, content, 6988) assert.Len(t, rich, 7169) diff --git a/extractor/retriever.go b/extractor/retriever.go new file mode 100644 index 00000000..600a467b --- /dev/null +++ b/extractor/retriever.go @@ -0,0 +1,282 @@ +package extractor + +import ( + "bytes" + "context" + 
"encoding/json" + "errors" + "fmt" + "io" + "net/http" + "strconv" + "sync" + "time" + + log "github.com/go-pkgz/lgr" +) + +//go:generate moq -out retriever_mock_test.go -skip-ensure -fmt goimports . Retriever + +// Retriever abstracts how page content is fetched from a URL +type Retriever interface { + Retrieve(ctx context.Context, url string) (*RetrieveResult, error) +} + +// RetrieveResult holds the raw page content and metadata from a retrieval +type RetrieveResult struct { + Body []byte // raw page content (HTML) + URL string // final URL after redirects + Header http.Header // response headers (for charset detection) +} + +// HTTPRetriever fetches pages using a standard HTTP client +type HTTPRetriever struct { + Timeout time.Duration + + once sync.Once + client *http.Client +} + +const httpDefaultTimeout = 30 * time.Second + +func (h *HTTPRetriever) httpClient() *http.Client { + h.once.Do(func() { + timeout := h.Timeout + if timeout == 0 { + timeout = httpDefaultTimeout + } + h.client = &http.Client{Timeout: timeout} + }) + return h.client +} + +// Retrieve fetches the URL using an HTTP GET with Safari user-agent, following redirects +func (h *HTTPRetriever) Retrieve(ctx context.Context, reqURL string) (*RetrieveResult, error) { + req, err := http.NewRequestWithContext(ctx, "GET", reqURL, http.NoBody) + if err != nil { + log.Printf("[WARN] failed to create request for %s, error=%v", reqURL, err) + return nil, err + } + req.Header.Set("User-Agent", userAgent) + resp, err := h.httpClient().Do(req) + if err != nil { + log.Printf("[WARN] failed to get anything from %s, error=%v", reqURL, err) + return nil, err + } + defer func() { + if closeErr := resp.Body.Close(); closeErr != nil { + log.Printf("[WARN] failed to close response body, error=%v", closeErr) + } + }() + + body, err := io.ReadAll(resp.Body) + if err != nil { + log.Printf("[WARN] failed to read data from %s, error=%v", reqURL, err) + return nil, err + } + + return &RetrieveResult{ + Body: body, + URL: 
resp.Request.URL.String(),
+		Header: resp.Header,
+	}, nil
+}
+
+const (
+	cfDefaultBaseURL    = "https://api.cloudflare.com/client/v4"
+	cfDefaultWaitUntil  = "networkidle0"
+	cfDefaultTimeout    = 60 * time.Second
+	cfDefaultRetryDelay = 11 * time.Second // free tier: 1 req / 10s — add a little headroom
+	cfMaxRetryDelay     = 30 * time.Second
+
+	// CFDefaultMaxRetries is the suggested MaxRetries value for production CloudflareRetriever setup.
+	// 2 retries keeps worst-case backoff at ~33s (11s + 22s) so the total handler time stays under
+	// common upstream timeouts (nginx proxy_read_timeout default is 60s). Callers must set MaxRetries
+	// explicitly — CloudflareRetriever does not substitute a default for the zero value.
+	CFDefaultMaxRetries = 2
+)
+
+// errCFRateLimited is returned by the single-attempt inner retrieve when the CF API signals rate limiting;
+// the outer Retrieve uses it to decide whether to back off and retry.
+var errCFRateLimited = errors.New("cloudflare rate limited")
+
+// CloudflareRetriever fetches pages using Cloudflare Browser Rendering API.
+// it sends a POST to the /content endpoint which returns fully rendered HTML after JS execution.
+// on HTTP 429 it retries with backoff (respecting Retry-After) up to MaxRetries times.
+type CloudflareRetriever struct {
+	AccountID  string
+	APIToken   string
+	BaseURL    string        // override for testing; defaults to Cloudflare API
+	Timeout    time.Duration // per-request HTTP client timeout; defaults to 60s
+	MaxRetries int           // number of retries on 429; no default is applied, 0 means none, or 
set to -1 to disable retries + RetryDelay time.Duration // base delay between 429 retries; defaults to 11s (CF free tier is 1 req/10s) + + once sync.Once + client *http.Client +} + +func (c *CloudflareRetriever) httpClient() *http.Client { + c.once.Do(func() { + timeout := c.Timeout + if timeout == 0 { + timeout = cfDefaultTimeout + } + c.client = &http.Client{Timeout: timeout} + }) + return c.client +} + +type cfRequest struct { + URL string `json:"url"` + GotoOptions cfGotoOptions `json:"gotoOptions"` +} + +type cfGotoOptions struct { + WaitUntil string `json:"waitUntil"` +} + +type cfResponse struct { + Success bool `json:"success"` + Result string `json:"result"` +} + +// Retrieve fetches the URL via Cloudflare Browser Rendering /content endpoint. +// on HTTP 429 it backs off and retries up to MaxRetries times, holding the caller's +// connection open in the meantime. aborts early if the caller's context is canceled. +// MaxRetries: 0 means no retries. RetryDelay: 0 falls back to the package default. +func (c *CloudflareRetriever) Retrieve(ctx context.Context, reqURL string) (*RetrieveResult, error) { + maxRetries := c.MaxRetries + if maxRetries < 0 { + maxRetries = 0 + } + baseDelay := c.RetryDelay + if baseDelay <= 0 { + baseDelay = cfDefaultRetryDelay + } + + var lastErr error + for attempt := 0; attempt <= maxRetries; attempt++ { + result, retryAfter, err := c.doRetrieve(ctx, reqURL) + if err == nil { + return result, nil + } + lastErr = err + if !errors.Is(err, errCFRateLimited) { + return nil, err + } + if attempt == maxRetries { + break + } + delay := retryAfter + if delay <= 0 { + delay = baseDelay << attempt // 11s, 22s, 44s, ... 
+ } + if delay > cfMaxRetryDelay { + delay = cfMaxRetryDelay + } + log.Printf("[INFO] cloudflare rate limited for %s, retry %d/%d after %s", reqURL, attempt+1, maxRetries, delay) + select { + case <-ctx.Done(): + return nil, ctx.Err() + case <-time.After(delay): + } + } + return nil, lastErr +} + +// doRetrieve performs a single Browser Rendering request. on 429 it returns errCFRateLimited +// (possibly wrapped) and the parsed Retry-After duration (0 if absent or unparseable). +func (c *CloudflareRetriever) doRetrieve(ctx context.Context, reqURL string) (*RetrieveResult, time.Duration, error) { + baseURL := c.BaseURL + if baseURL == "" { + baseURL = cfDefaultBaseURL + } + endpoint := fmt.Sprintf("%s/accounts/%s/browser-rendering/content", baseURL, c.AccountID) + + cfReq := cfRequest{ + URL: reqURL, + GotoOptions: cfGotoOptions{WaitUntil: cfDefaultWaitUntil}, + } + reqBody, err := json.Marshal(cfReq) + if err != nil { + return nil, 0, fmt.Errorf("marshal cf request: %w", err) + } + + httpReq, err := http.NewRequestWithContext(ctx, "POST", endpoint, bytes.NewReader(reqBody)) + if err != nil { + return nil, 0, fmt.Errorf("create cf request: %w", err) + } + httpReq.Header.Set("Authorization", "Bearer "+c.APIToken) + httpReq.Header.Set("Content-Type", "application/json") + + resp, err := c.httpClient().Do(httpReq) + if err != nil { + log.Printf("[WARN] cloudflare request failed for %s, error=%v", reqURL, err) + return nil, 0, err + } + defer func() { + if closeErr := resp.Body.Close(); closeErr != nil { + log.Printf("[WARN] failed to close cf response body, error=%v", closeErr) + } + }() + + body, err := io.ReadAll(resp.Body) + if err != nil { + log.Printf("[WARN] failed to read cf response for %s, error=%v", reqURL, err) + return nil, 0, err + } + + if resp.StatusCode == http.StatusTooManyRequests { + retryAfter := parseRetryAfter(resp.Header.Get("Retry-After")) + return nil, retryAfter, fmt.Errorf("%w: status 429", errCFRateLimited) + } + if resp.StatusCode != 
http.StatusOK {
+		bodySnippet := body
+		if len(bodySnippet) > 512 {
+			bodySnippet = bodySnippet[:512]
+		}
+		return nil, 0, fmt.Errorf("cloudflare API error: status %d, body: %s", resp.StatusCode, string(bodySnippet))
+	}
+
+	// try JSON response format first: {"success": true, "result": "<rendered html>"}
+	var cfResp cfResponse
+	if err = json.Unmarshal(body, &cfResp); err == nil {
+		switch {
+		case cfResp.Success && cfResp.Result != "":
+			body = []byte(cfResp.Result)
+		case cfResp.Success && cfResp.Result == "":
+			return nil, 0, fmt.Errorf("cloudflare returned empty content for %s", reqURL)
+		default: // !cfResp.Success
+			return nil, 0, fmt.Errorf("cloudflare API returned success=false for %s", reqURL)
+		}
+	}
+	// if unmarshal fails, use the raw body as-is (raw HTML response)
+
+	header := make(http.Header)
+	header.Set("Content-Type", "text/html; charset=utf-8")
+
+	// note: URL is the original request URL, not the final URL after any JS-driven redirects,
+	// because the Cloudflare Browser Rendering /content API does not expose the final navigated URL
+	return &RetrieveResult{
+		Body:   body,
+		URL:    reqURL,
+		Header: header,
+	}, 0, nil
+}
+
+// parseRetryAfter parses an HTTP Retry-After header value as either delta-seconds
+// or an HTTP date. returns 0 if empty or unparseable.
+func parseRetryAfter(value string) time.Duration {
+	if value == "" {
+		return 0
+	}
+	if secs, err := strconv.Atoi(value); err == nil && secs >= 0 {
+		return time.Duration(secs) * time.Second
+	}
+	if t, err := http.ParseTime(value); err == nil {
+		if delta := time.Until(t); delta > 0 {
+			return delta
+		}
+	}
+	return 0
+}
diff --git a/extractor/retriever_mock_test.go b/extractor/retriever_mock_test.go
new file mode 100644
index 00000000..c2276a74
--- /dev/null
+++ b/extractor/retriever_mock_test.go
@@ -0,0 +1,77 @@
+// Code generated by moq; DO NOT EDIT.
+// github.com/matryer/moq
+
+package extractor
+
+import (
+	"context"
+	"sync"
+)
+
+// RetrieverMock is a mock implementation of Retriever.
+// +// func TestSomethingThatUsesRetriever(t *testing.T) { +// +// // make and configure a mocked Retriever +// mockedRetriever := &RetrieverMock{ +// RetrieveFunc: func(ctx context.Context, url string) (*RetrieveResult, error) { +// panic("mock out the Retrieve method") +// }, +// } +// +// // use mockedRetriever in code that requires Retriever +// // and then make assertions. +// +// } +type RetrieverMock struct { + // RetrieveFunc mocks the Retrieve method. + RetrieveFunc func(ctx context.Context, url string) (*RetrieveResult, error) + + // calls tracks calls to the methods. + calls struct { + // Retrieve holds details about calls to the Retrieve method. + Retrieve []struct { + // Ctx is the ctx argument value. + Ctx context.Context + // URL is the url argument value. + URL string + } + } + lockRetrieve sync.RWMutex +} + +// Retrieve calls RetrieveFunc. +func (mock *RetrieverMock) Retrieve(ctx context.Context, url string) (*RetrieveResult, error) { + if mock.RetrieveFunc == nil { + panic("RetrieverMock.RetrieveFunc: method is nil but Retriever.Retrieve was just called") + } + callInfo := struct { + Ctx context.Context + URL string + }{ + Ctx: ctx, + URL: url, + } + mock.lockRetrieve.Lock() + mock.calls.Retrieve = append(mock.calls.Retrieve, callInfo) + mock.lockRetrieve.Unlock() + return mock.RetrieveFunc(ctx, url) +} + +// RetrieveCalls gets all the calls that were made to Retrieve. 
+// Check the length with: +// +// len(mockedRetriever.RetrieveCalls()) +func (mock *RetrieverMock) RetrieveCalls() []struct { + Ctx context.Context + URL string +} { + var calls []struct { + Ctx context.Context + URL string + } + mock.lockRetrieve.RLock() + calls = mock.calls.Retrieve + mock.lockRetrieve.RUnlock() + return calls +} diff --git a/extractor/retriever_test.go b/extractor/retriever_test.go new file mode 100644 index 00000000..8526723a --- /dev/null +++ b/extractor/retriever_test.go @@ -0,0 +1,423 @@ +package extractor + +import ( + "context" + "encoding/json" + "io" + "net/http" + "net/http/httptest" + "sync/atomic" + "testing" + "time" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +func TestHTTPRetriever_Retrieve(t *testing.T) { + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch r.URL.Path { + case "/ok": + assert.Equal(t, userAgent, r.Header.Get("User-Agent"), "should send Safari user-agent") + w.Header().Set("Content-Type", "text/html; charset=utf-8") + _, _ = w.Write([]byte("hello")) + case "/redirect-start": + http.Redirect(w, r, "/redirect-end", http.StatusFound) + case "/redirect-end": + w.Header().Set("Content-Type", "text/html") + _, _ = w.Write([]byte("redirected")) + default: + http.NotFound(w, r) + } + })) + defer ts.Close() + + tests := []struct { + name string + url string + wantBody string + wantURL string + wantErr bool + }{ + { + name: "successful fetch", + url: ts.URL + "/ok", + wantBody: "hello", + wantURL: ts.URL + "/ok", + }, + { + name: "redirect following", + url: ts.URL + "/redirect-start", + wantBody: "redirected", + wantURL: ts.URL + "/redirect-end", + }, + { + name: "bad url", + url: "http://\x00invalid", + wantErr: true, + }, + { + name: "connection refused", + url: "http://127.0.0.1:1", + wantErr: true, + }, + } + + retriever := &HTTPRetriever{Timeout: 5 * time.Second} + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { 
+ result, err := retriever.Retrieve(context.Background(), tt.url) + if tt.wantErr { + require.Error(t, err) + assert.Nil(t, result) + return + } + require.NoError(t, err) + assert.Equal(t, tt.wantBody, string(result.Body)) + assert.Equal(t, tt.wantURL, result.URL) + assert.NotNil(t, result.Header) + }) + } +} + +func TestHTTPRetriever_UserAgent(t *testing.T) { + var receivedUA string + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + receivedUA = r.Header.Get("User-Agent") + _, _ = w.Write([]byte("ok")) + })) + defer ts.Close() + + retriever := &HTTPRetriever{Timeout: 5 * time.Second} + _, err := retriever.Retrieve(context.Background(), ts.URL) + require.NoError(t, err) + assert.Equal(t, userAgent, receivedUA) +} + +func TestHTTPRetriever_ResponseHeaders(t *testing.T) { + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "text/html; charset=windows-1251") + w.Header().Set("X-Custom", "test-value") + _, _ = w.Write([]byte("ok")) + })) + defer ts.Close() + + retriever := &HTTPRetriever{Timeout: 5 * time.Second} + result, err := retriever.Retrieve(context.Background(), ts.URL) + require.NoError(t, err) + assert.Equal(t, "text/html; charset=windows-1251", result.Header.Get("Content-Type")) + assert.Equal(t, "test-value", result.Header.Get("X-Custom")) +} + +func TestCloudflareRetriever_Retrieve(t *testing.T) { + const testHTML = "rendered content" + + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + // verify request method and path + assert.Equal(t, "POST", r.Method) + assert.Equal(t, "/accounts/test-account/browser-rendering/content", r.URL.Path) + assert.Equal(t, "Bearer test-token", r.Header.Get("Authorization")) + assert.Equal(t, "application/json", r.Header.Get("Content-Type")) + + // verify request body + body, err := io.ReadAll(r.Body) + require.NoError(t, err) + var req cfRequest + require.NoError(t, 
json.Unmarshal(body, &req)) + assert.Equal(t, "networkidle0", req.GotoOptions.WaitUntil) + + switch req.URL { + case "https://example.com/json-response": + w.Header().Set("Content-Type", "application/json") + resp := cfResponse{Success: true, Result: testHTML} + _ = json.NewEncoder(w).Encode(resp) + case "https://example.com/raw-html": + w.Header().Set("Content-Type", "text/html") + _, _ = w.Write([]byte(testHTML)) + case "https://example.com/error": + w.WriteHeader(http.StatusForbidden) + _, _ = w.Write([]byte(`{"success":false,"errors":[{"message":"forbidden"}]}`)) + default: + http.NotFound(w, r) + } + })) + defer ts.Close() + + tests := []struct { + name string + url string + wantBody string + wantURL string + wantErr string + }{ + { + name: "successful fetch with JSON response", + url: "https://example.com/json-response", + wantBody: testHTML, + wantURL: "https://example.com/json-response", + }, + { + name: "successful fetch with raw HTML response", + url: "https://example.com/raw-html", + wantBody: testHTML, + wantURL: "https://example.com/raw-html", + }, + { + name: "API error returns error", + url: "https://example.com/error", + wantErr: "cloudflare API error: status 403", + }, + } + + retriever := &CloudflareRetriever{ + AccountID: "test-account", + APIToken: "test-token", + BaseURL: ts.URL, + Timeout: 5 * time.Second, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + result, err := retriever.Retrieve(context.Background(), tt.url) + if tt.wantErr != "" { + require.Error(t, err) + assert.Contains(t, err.Error(), tt.wantErr) + assert.Nil(t, result) + return + } + require.NoError(t, err) + assert.Equal(t, tt.wantBody, string(result.Body)) + assert.Equal(t, tt.wantURL, result.URL) + assert.Equal(t, "text/html; charset=utf-8", result.Header.Get("Content-Type")) + }) + } +} + +func TestCloudflareRetriever_URLPathConstruction(t *testing.T) { + // verify that the retriever constructs the correct Cloudflare API URL path from AccountID + var 
capturedPath string + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + capturedPath = r.URL.Path + w.WriteHeader(http.StatusForbidden) + _, _ = w.Write([]byte("forbidden")) + })) + defer ts.Close() + + retriever := &CloudflareRetriever{ + AccountID: "my-account", + APIToken: "my-token", + BaseURL: ts.URL, + Timeout: 5 * time.Second, + } + _, err := retriever.Retrieve(context.Background(), "https://example.com") + require.Error(t, err) + assert.Equal(t, "/accounts/my-account/browser-rendering/content", capturedPath) +} + +func TestCloudflareRetriever_SuccessEmptyResult(t *testing.T) { + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "application/json") + _, _ = w.Write([]byte(`{"success":true,"result":""}`)) + })) + defer ts.Close() + + retriever := &CloudflareRetriever{ + AccountID: "test-account", + APIToken: "test-token", + BaseURL: ts.URL, + Timeout: 5 * time.Second, + } + _, err := retriever.Retrieve(context.Background(), "https://example.com") + require.Error(t, err) + assert.Contains(t, err.Error(), "empty content") +} + +func TestCloudflareRetriever_SuccessFalse(t *testing.T) { + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "application/json") + _, _ = w.Write([]byte(`{"success":false,"errors":[{"message":"rate limited"}]}`)) + })) + defer ts.Close() + + retriever := &CloudflareRetriever{ + AccountID: "test-account", + APIToken: "test-token", + BaseURL: ts.URL, + Timeout: 5 * time.Second, + } + _, err := retriever.Retrieve(context.Background(), "https://example.com") + require.Error(t, err) + assert.Contains(t, err.Error(), "success=false") +} + +func TestCloudflareRetriever_RateLimitThenSuccess(t *testing.T) { + const testHTML = "recovered" + var calls atomic.Int32 + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + n := 
calls.Add(1) + if n < 3 { + w.WriteHeader(http.StatusTooManyRequests) + _, _ = w.Write([]byte(`{"success":false,"errors":[{"code":2001,"message":"Rate limit exceeded"}]}`)) + return + } + w.Header().Set("Content-Type", "application/json") + _ = json.NewEncoder(w).Encode(cfResponse{Success: true, Result: testHTML}) + })) + defer ts.Close() + + retriever := &CloudflareRetriever{ + AccountID: "test-account", + APIToken: "test-token", + BaseURL: ts.URL, + Timeout: 5 * time.Second, + MaxRetries: 3, + RetryDelay: 10 * time.Millisecond, // fast in tests + } + result, err := retriever.Retrieve(context.Background(), "https://example.com") + require.NoError(t, err) + assert.Equal(t, testHTML, string(result.Body)) + assert.Equal(t, int32(3), calls.Load(), "should have retried twice before succeeding") +} + +func TestCloudflareRetriever_RateLimitExhausted(t *testing.T) { + var calls atomic.Int32 + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + calls.Add(1) + w.WriteHeader(http.StatusTooManyRequests) + _, _ = w.Write([]byte(`{"success":false,"errors":[{"code":2001,"message":"Rate limit exceeded"}]}`)) + })) + defer ts.Close() + + retriever := &CloudflareRetriever{ + AccountID: "test-account", + APIToken: "test-token", + BaseURL: ts.URL, + Timeout: 5 * time.Second, + MaxRetries: 2, + RetryDelay: 10 * time.Millisecond, + } + _, err := retriever.Retrieve(context.Background(), "https://example.com") + require.Error(t, err) + assert.Contains(t, err.Error(), "429") + assert.Equal(t, int32(3), calls.Load(), "initial attempt + 2 retries") +} + +func TestCloudflareRetriever_RateLimitHonoursRetryAfter(t *testing.T) { + var calls atomic.Int32 + var timestamps [3]time.Time + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + n := calls.Add(1) + if int(n) <= len(timestamps) { + timestamps[n-1] = time.Now() + } + if n < 2 { + w.Header().Set("Retry-After", "1") + w.WriteHeader(http.StatusTooManyRequests) + _, 
_ = w.Write([]byte(`rate limited`)) + return + } + w.Header().Set("Content-Type", "application/json") + _ = json.NewEncoder(w).Encode(cfResponse{Success: true, Result: "ok"}) + })) + defer ts.Close() + + retriever := &CloudflareRetriever{ + AccountID: "test-account", + APIToken: "test-token", + BaseURL: ts.URL, + Timeout: 5 * time.Second, + MaxRetries: 2, + RetryDelay: 10 * time.Second, // would be ignored in favour of Retry-After + } + start := time.Now() + _, err := retriever.Retrieve(context.Background(), "https://example.com") + elapsed := time.Since(start) + require.NoError(t, err) + assert.GreaterOrEqual(t, elapsed, 900*time.Millisecond, "should have waited for Retry-After") + assert.Less(t, elapsed, 3*time.Second, "should have used 1s Retry-After, not 10s base delay") +} + +func TestCloudflareRetriever_RateLimitContextCancelled(t *testing.T) { + var calls atomic.Int32 + ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + calls.Add(1) + w.WriteHeader(http.StatusTooManyRequests) + _, _ = w.Write([]byte(`rate limited`)) + })) + defer ts.Close() + + retriever := &CloudflareRetriever{ + AccountID: "test-account", + APIToken: "test-token", + BaseURL: ts.URL, + Timeout: 5 * time.Second, + MaxRetries: 5, + RetryDelay: 10 * time.Second, // long enough that ctx cancel will fire first + } + ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond) + defer cancel() + _, err := retriever.Retrieve(ctx, "https://example.com") + require.Error(t, err) + assert.ErrorIs(t, err, context.DeadlineExceeded) + assert.Equal(t, int32(1), calls.Load(), "should abort in backoff before second call") +} + +func TestCloudflareRetriever_RateLimitRetriesDisabled(t *testing.T) { + tests := []struct { + name string + maxRetries int + }{ + {name: "zero means no retries", maxRetries: 0}, + {name: "negative clamped to zero", maxRetries: -1}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + var calls atomic.Int32 
+ ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + calls.Add(1) + w.WriteHeader(http.StatusTooManyRequests) + _, _ = w.Write([]byte(`rate limited`)) + })) + defer ts.Close() + + retriever := &CloudflareRetriever{ + AccountID: "test-account", + APIToken: "test-token", + BaseURL: ts.URL, + Timeout: 5 * time.Second, + MaxRetries: tt.maxRetries, + } + _, err := retriever.Retrieve(context.Background(), "https://example.com") + require.Error(t, err) + assert.Equal(t, int32(1), calls.Load()) + }) + } +} + +func TestParseRetryAfter(t *testing.T) { + tests := []struct { + name string + value string + want time.Duration + // allow a bit of slack for HTTP-date cases where time.Until is evaluated at assertion time + tolerance time.Duration + }{ + {name: "empty", value: "", want: 0}, + {name: "seconds", value: "5", want: 5 * time.Second}, + {name: "zero", value: "0", want: 0}, + {name: "negative", value: "-5", want: 0}, + {name: "garbage", value: "soon", want: 0}, + {name: "http date in the past", value: "Mon, 02 Jan 2006 15:04:05 GMT", want: 0}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got := parseRetryAfter(tt.value) + if tt.tolerance == 0 { + assert.Equal(t, tt.want, got) + return + } + assert.InDelta(t, tt.want, got, float64(tt.tolerance)) + }) + } +} diff --git a/extractor/text.go b/extractor/text.go index 2ae986b2..220b0140 100644 --- a/extractor/text.go +++ b/extractor/text.go @@ -2,10 +2,11 @@ package extractor import ( "io" - "log" "net/http" "strings" + log "github.com/go-pkgz/lgr" + "github.com/PuerkitoBio/goquery" "github.com/kennygrant/sanitize" "golang.org/x/net/html/charset" @@ -95,7 +96,7 @@ func (f *UReadability) toUtf8(content []byte, header http.Header) (contentType, } conv2utf8, err := io.ReadAll(rr) if err != nil { - log.Printf("[WARN] convert to utf-8 failed, %b", err) + log.Printf("[WARN] convert to utf-8 failed, %v", err) return contentType, origEncoding, result } result = 
string(conv2utf8) diff --git a/main.go b/main.go index 180d166d..3f2e4344 100644 --- a/main.go +++ b/main.go @@ -23,10 +23,13 @@ var opts struct { Port int `long:"port" env:"UKEEPER_PORT" default:"8080" description:"port"` FrontendDir string `long:"frontend-dir" env:"FRONTEND_DIR" default:"/srv/web" description:"directory with frontend templates and static/ directory for static assets"` Credentials map[string]string `long:"creds" env:"CREDS" description:"credentials for protected calls (POST, DELETE /rules)"` - Token string `long:"token" env:"UKEEPER_TOKEN" description:"token for /content/v1/parser endpoint auth"` + Token string `long:"token" env:"UKEEPER_TOKEN" description:"token for API endpoint auth"` MongoURI string `short:"m" long:"mongo-uri" env:"MONGO_URI" required:"true" description:"MongoDB connection string"` MongoDelay time.Duration `long:"mongo-delay" env:"MONGO_DELAY" default:"0" description:"mongo initial delay"` MongoDB string `long:"mongo-db" env:"MONGO_DB" default:"ureadability" description:"mongo database name"` + CFAccountID string `long:"cf-account-id" env:"CF_ACCOUNT_ID" description:"Cloudflare account ID for Browser Rendering API"` + CFAPIToken string `long:"cf-api-token" env:"CF_API_TOKEN" description:"Cloudflare API token with Browser Rendering Edit permission"` + CFRouteAll bool `long:"cf-route-all" env:"CF_ROUTE_ALL" description:"route every request through Cloudflare Browser Rendering (requires cf-account-id and cf-api-token)"` Debug bool `long:"dbg" env:"DEBUG" description:"debug mode"` } @@ -47,11 +50,41 @@ func main() { log.Fatalf("[ERROR] can't connect to mongo %v", err) } stores := db.GetStores() + + // default retriever is always HTTP; CF is optional and, when configured, acts as a + // second retriever available for per-rule routing or global route-all. 
+ httpRetriever := &extractor.HTTPRetriever{Timeout: 30 * time.Second} + var cfRetriever extractor.Retriever + if opts.CFAccountID != "" && opts.CFAPIToken != "" { + cfRetriever = &extractor.CloudflareRetriever{ + AccountID: opts.CFAccountID, + APIToken: opts.CFAPIToken, + Timeout: 30 * time.Second, + MaxRetries: extractor.CFDefaultMaxRetries, + } + if opts.CFRouteAll { + log.Printf("[INFO] Cloudflare Browser Rendering enabled, account=%s, mode=route-all", opts.CFAccountID) + } else { + log.Printf("[INFO] Cloudflare Browser Rendering enabled, account=%s, mode=per-rule", opts.CFAccountID) + } + } else { + if opts.CFAccountID != "" || opts.CFAPIToken != "" { + log.Print("[WARN] both --cf-account-id and --cf-api-token must be set for Cloudflare Browser Rendering; disabling Cloudflare routing") + } + if opts.CFRouteAll { + log.Print("[WARN] --cf-route-all is set but Cloudflare credentials are not configured; routing through default HTTP retriever") + } + log.Print("[INFO] using default HTTP retriever") + } + srv := rest.Server{ Readability: extractor.UReadability{ TimeOut: 30 * time.Second, SnippetSize: 300, Rules: stores.Rules, + Retriever: httpRetriever, + CFRetriever: cfRetriever, + CFRouteAll: opts.CFRouteAll, }, Token: opts.Token, Credentials: opts.Credentials, diff --git a/rest/server.go b/rest/server.go index 78c162e8..efd0c2f1 100644 --- a/rest/server.go +++ b/rest/server.go @@ -48,8 +48,14 @@ func (s *Server) Run(ctx context.Context, address string, port int, frontendDir Addr: fmt.Sprintf("%s:%d", address, port), Handler: s.routes(frontendDir), ReadHeaderTimeout: 5 * time.Second, - // WriteTimeout: 120 * time.Second, // TODO: such a long timeout needed for blocking export (backup) request - IdleTimeout: 30 * time.Second, + // WriteTimeout is server-wide rather than per-route because the extraction endpoints + // are the only potentially long-running handlers (other handlers — static files, rule + // CRUD, /ping — finish in milliseconds and are unaffected by 
this ceiling). 150s covers + // the worst-case Cloudflare path: 1 initial request + 2 retries with 11s/22s exponential + // backoff + up to 30s per CF request. If extraction ever moves off the server-wide + // timeout, wrap only those routes with http.TimeoutHandler instead. + WriteTimeout: 150 * time.Second, + IdleTimeout: 30 * time.Second, } go func() { // shutdown on context cancellation @@ -175,15 +181,7 @@ func (s *Server) extractArticle(w http.ResponseWriter, r *http.Request) { // extractArticleEmulateReadability emulates readability API parse - https://www.readability.com/api/content/v1/parser?token=%s&url=%s // if token is not set for application, it won't be checked func (s *Server) extractArticleEmulateReadability(w http.ResponseWriter, r *http.Request) { - token := r.URL.Query().Get("token") - - if s.Token != "" && token == "" { - rest.SendErrorJSON(w, r, log.Default(), http.StatusExpectationFailed, nil, "no token passed") - return - } - - if s.Token != "" && s.Token != token { - rest.SendErrorJSON(w, r, log.Default(), http.StatusUnauthorized, nil, "wrong token passed") + if !s.checkToken(w, r) { return } @@ -283,14 +281,15 @@ func (s *Server) saveRule(w http.ResponseWriter, r *http.Request) { return } rule := datastore.Rule{ - Enabled: true, - ID: getBid(r.FormValue("id")), - Domain: r.FormValue("domain"), - Author: r.FormValue("author"), - Content: r.FormValue("content"), - MatchURLs: strings.Split(r.FormValue("match_url"), "\n"), - Excludes: strings.Split(r.FormValue("excludes"), "\n"), - TestURLs: strings.Split(r.FormValue("test_urls"), "\n"), + Enabled: true, + ID: getBid(r.FormValue("id")), + Domain: r.FormValue("domain"), + Author: r.FormValue("author"), + Content: r.FormValue("content"), + MatchURLs: strings.Split(r.FormValue("match_url"), "\n"), + Excludes: strings.Split(r.FormValue("excludes"), "\n"), + TestURLs: strings.Split(r.FormValue("test_urls"), "\n"), + UseCloudflare: r.FormValue("use_cloudflare") == "true", } // return error in case 
domain is not set @@ -353,6 +352,21 @@ func getBid(id string) bson.ObjectID { return bid } +// checkToken validates the token query parameter if the server has a token configured. +// returns true if auth passed, false if the request was rejected. +func (s *Server) checkToken(w http.ResponseWriter, r *http.Request) bool { + token := r.URL.Query().Get("token") + if s.Token != "" && token == "" { + rest.SendErrorJSON(w, r, log.Default(), http.StatusExpectationFailed, nil, "no token passed") + return false + } + if s.Token != "" && subtle.ConstantTimeCompare([]byte(s.Token), []byte(token)) == 0 { + rest.SendErrorJSON(w, r, log.Default(), http.StatusUnauthorized, nil, "wrong token passed") + return false + } + return true +} + // basicAuth returns a piece of middleware that will allow access only // if the provided credentials match within the given service // otherwise, it will return a 401 and not call the next handler. diff --git a/rest/server_test.go b/rest/server_test.go index 57f40384..87dbf97d 100644 --- a/rest/server_test.go +++ b/rest/server_test.go @@ -114,6 +114,10 @@ func TestServer_Extract(t *testing.T) { tss := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { if r.URL.String() == "/2015/11/26/vsiem-mirom-dlia-obshchiei-polzy/" { fh, err := os.Open("../extractor/testdata/vsiem-mirom-dlia-obshchiei-polzy.html") + if err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + return + } testHTML, err := io.ReadAll(fh) assert.NoError(t, err) assert.NoError(t, fh.Close()) @@ -177,6 +181,10 @@ func TestServer_LegacyExtract(t *testing.T) { tss := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { if r.URL.String() == "/2015/11/26/vsiem-mirom-dlia-obshchiei-polzy/" { fh, err := os.Open("../extractor/testdata/vsiem-mirom-dlia-obshchiei-polzy.html") + if err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + return + } testHTML, err := io.ReadAll(fh) 
assert.NoError(t, err) assert.NoError(t, fh.Close()) @@ -533,6 +541,10 @@ func TestServer_Preview(t *testing.T) { tss := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { if r.URL.String() == "/2015/11/26/vsiem-mirom-dlia-obshchiei-polzy/" { fh, err := os.Open("../extractor/testdata/vsiem-mirom-dlia-obshchiei-polzy.html") + if err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + return + } testHTML, err := io.ReadAll(fh) assert.NoError(t, err) assert.NoError(t, fh.Close()) diff --git a/web/components/rule-form.gohtml b/web/components/rule-form.gohtml index 40fbe7b9..f42899e1 100644 --- a/web/components/rule-form.gohtml +++ b/web/components/rule-form.gohtml @@ -39,6 +39,14 @@
+    <!-- "use_cloudflare" checkbox (value "true", read by saveRule); original template markup not recovered -->