CLAUDE.md (9 additions, 0 deletions)

web/ → Go HTML templates (HTMX v2), static assets
**Key interfaces:**
- `extractor.Rules` (defined consumer-side in `extractor/readability.go`), implemented by `datastore.RulesDAO`. Mock generated with `//go:generate moq` in extractor package.
- `extractor.Retriever` (defined in `extractor/retriever.go`) — abstracts URL content fetching. Two implementations: `HTTPRetriever` (default, standard HTTP GET with Safari user-agent) and `CloudflareRetriever` (Cloudflare Browser Rendering API for JS-rendered pages). When `UReadability.Retriever` is nil, defaults to `HTTPRetriever`.
- `extractor.AIEvaluator` (defined in `extractor/evaluator.go`) — evaluates extraction quality via OpenAI. Implementation: `OpenAIEvaluator`. Mock generated with `//go:generate moq` as test-only mock (`evaluator_mock_test.go`).

## Content Extraction Flow

4. If rule found → extract via goquery CSS selector; if fails → fall back to general parser
5. If no rule → use `go-readability` general parser
6. Normalize relative links to absolute, extract images concurrently (pick largest as lead image)
7. If `AIEvaluator` is configured and no existing rule for domain (or force mode): evaluate extraction quality via OpenAI, iterate up to `MaxGPTIter` times with suggested CSS selectors, save the best as a new rule

`ExtractAndImprove()` is the force-mode entry point — ignores stored rules, re-extracts with general parser, then evaluates. Used by the `/api/content-parsed-wrong` protected endpoint.

Optional OpenAI flags (setting `--openai-api-key` enables auto-evaluation):
- `--openai-api-key` / `OPENAI_API_KEY` — OpenAI API key
- `--openai-model` / `OPENAI_MODEL` — model for evaluation (default: `gpt-5.4-mini`)
- `--openai-max-iter` / `OPENAI_MAX_ITER` — max evaluation iterations (default: `3`)

## Key Conventions

README.md (16 additions, 0 deletions)
| creds | CREDS | none | credentials for protected calls (POST, DELETE /rules) |
| cf-account-id| CF_ACCOUNT_ID | none | Cloudflare account ID for Browser Rendering API |
| cf-api-token | CF_API_TOKEN | none | Cloudflare API token with Browser Rendering Edit perm |
| openai-api-key | OPENAI_API_KEY | none | OpenAI API key; enables auto-evaluation when set |
| openai-model | OPENAI_MODEL | `gpt-5.4-mini` | OpenAI model for evaluation |
| openai-max-iter | OPENAI_MAX_ITER | `3` | max evaluation iterations per extraction |
| dbg | DEBUG | `false` | debug mode |
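For example, via environment variables (the binary name below is illustrative — adjust to your deployment):

```shell
# enable auto-evaluation; flag equivalents are listed in the table above
export OPENAI_API_KEY="sk-..."       # presence of the key turns evaluation on
export OPENAI_MODEL="gpt-5.4-mini"   # default shown
export OPENAI_MAX_ITER=3             # default shown
./app --dbg                          # hypothetical binary name
```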

### Cloudflare Browser Rendering (optional)
When both `--cf-account-id` and `--cf-api-token` are set, the service uses the Cloudflare Browser Rendering API to fetch JS-rendered pages.

When these flags are not set, the service uses a standard HTTP client (default).

### OpenAI Auto-Evaluation (optional)

When `--openai-api-key` is set, the service automatically evaluates extraction quality using OpenAI. If the extracted content looks poor (missing article body, too short, mostly boilerplate), GPT suggests a CSS selector targeting the main content. The service iterates up to `--openai-max-iter` times, saving the best selector as a rule for future use.

Evaluation only runs for domains without an existing extraction rule. For domains that already have rules, use the force-mode endpoint to re-evaluate:

POST /api/content-parsed-wrong?url=http://example.com/article

This protected endpoint (requires basicAuth credentials) ignores the stored rule, re-extracts with the general parser, and runs the evaluation loop to find a better selector.
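A sketch with `curl` (credentials, host, and port are placeholders):

```shell
# force re-extraction with AI evaluation; requires the credentials configured via --creds
curl -u "user:password" -X POST \
  "http://localhost:8080/api/content-parsed-wrong?url=http://example.com/article"
```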

When OpenAI is not configured, extraction works exactly as before — no GPT calls are made.

### API

GET /api/content/v1/parser?token=secret&url=http://aa.com/blah - extract content (emulate Readability API parse call)
POST /api/extract {url: http://aa.com/blah} - extract content
POST /api/content-parsed-wrong?url=http://aa.com/blah - force re-extraction with AI evaluation (requires basicAuth)

## Development

- Create: `extractor/evaluator_test.go`
- Create: `extractor/mocks/evaluator.go` (generated)

- [x] run `go get github.com/sashabaranov/go-openai@latest && go mod tidy && go mod vendor`
- [x] define `AIEvaluator` interface with `Evaluate(ctx, url, extractedText, htmlBody string) (*EvalResult, error)` method
- [x] define `EvalResult` struct: `Good bool`, `Selector string`
- [x] implement `OpenAIEvaluator` struct with `APIKey`, `Model` fields
- [x] implement `Evaluate` method: build prompt with URL + extracted text (first 2000 chars) + truncated HTML body (first 4000 chars), parse JSON response `{"good": true}` or `{"good": false, "selector": "..."}`
- [x] handle invalid JSON response: retry once, then return `EvalResult{Good: true}` (fail open)
- [x] add `//go:generate moq` directive for `AIEvaluator`, run `go generate` to create mock
- [x] write tests: successful good evaluation, successful bad evaluation with selector, invalid JSON response, OpenAI API error
- [x] run tests — must pass before next task

### Task 2: Wire AIEvaluator into UReadability and add evaluation loop

**Files:**
- Modify: `extractor/readability.go`
- Modify: `extractor/readability_test.go`

- [x] add `AIEvaluator AIEvaluator` and `MaxGPTIter int` fields to `UReadability` struct
- [x] change `extractWithRules` signature to `extractWithRules(ctx, reqURL string, rule *datastore.Rule, force bool)`
- [x] update callers: `Extract()` passes `force=false`, `ExtractByRule()` passes `force=false`
- [x] add `ExtractAndImprove(ctx, url)` public method — calls `extractWithRules(ctx, url, nil, true)`
- [x] add `evaluateAndImprove(ctx, reqURL, htmlBody string, result *Response) *Response` private method
- [x] implement evaluation loop: up to `MaxGPTIter` iterations (default 3); send URL + result.Content + htmlBody to evaluator; try suggested selector on htmlBody via goquery; feed new extraction back to GPT on next iteration; if GPT says good, break
- [x] in `extractWithRules`: after extraction, call `evaluateAndImprove` if: `AIEvaluator != nil` AND (`force` OR no existing rule for domain)
- [x] **force mode semantics**: when `force=true`, pass `nil` as rule to `getContent()` so initial extraction uses the general parser (not the stored rule), then let GPT suggest a new selector
- [x] if better selector found, save rule via `f.Rules.Save()` with domain and selector
- [x] all GPT/evaluation errors logged and swallowed — original result returned unchanged
- [x] write tests: extraction with evaluator (good on first try), extraction with evaluator (bad, improved on retry), extraction without evaluator (unchanged behaviour), GPT error (fail open), force mode ignores existing rules and extracts with general parser
- [x] run tests — must pass before next task

### Task 3: Add CLI flags and wiring in main.go

**Files:**
- Modify: `main.go`

- [x] add `OpenAIKey string` field (`--openai-api-key` / `OPENAI_API_KEY`)
- [x] add `OpenAIModel string` field (`--openai-model` / `OPENAI_MODEL` default `gpt-5.4-mini`)
- [x] add `MaxGPTIter int` field (`--openai-max-iter` / `OPENAI_MAX_ITER` default `3`)
- [x] when `OpenAIKey` is set, create `OpenAIEvaluator` and inject into `UReadability`
- [x] log which mode is active (with/without OpenAI evaluation)
- [x] run tests — must pass before next task

### Task 4: Add REST endpoint for force mode

**Files:**
- Modify: `rest/server.go`
- Modify: `rest/server_test.go`

- [x] add `GET /content-parsed-wrong` route in the protected group within `api.Mount("/api")` (full path: `/api/content-parsed-wrong`, requires basicAuth)
- [x] implement `contentParsedWrong` handler: validate `url` query param, check `AIEvaluator` is configured, call `s.Readability.ExtractAndImprove()`, return JSON result
- [x] write tests: successful call, missing url param, missing OpenAI config (AIEvaluator nil)
- [x] run tests — must pass before next task

### Task 5: Run linter and final checks

- [x] run `gofmt -w` on all modified files
- [x] run `go fix ./...`
- [x] run `golangci-lint run --max-issues-per-linter=0 --max-same-issues=0`
- [x] fix any lint issues
- [x] run tests — must pass before next task

### Task 6: Verify acceptance criteria

- [x] verify `Extract()` without OpenAI configured works exactly as before (existing tests pass)
- [x] verify `Extract()` with OpenAI configured evaluates and improves extraction (test with mock evaluator)
- [x] verify `Extract()` skips evaluation when domain already has a rule (test with mock rules returning a rule)
- [x] verify `ExtractAndImprove()` runs evaluation even when rule exists, using general parser for initial extraction
- [x] verify GPT errors don't break extraction (test with evaluator returning error)
- [x] verify rule is saved when better selector found (test with mock rules verifying Save call)
- [x] run full test suite: `go test -timeout=60s -race ./...`

### Task 7: [Final] Update documentation

- [x] update README.md with OpenAI configuration flags
- [x] update CLAUDE.md with new AIEvaluator interface and extraction flow
- [x] move this plan to `docs/plans/completed/`

## Technical Details

Expand Down
extractor/evaluator.go (150 additions, 0 deletions; new file)
package extractor

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"sync"
	"time"

	log "github.com/go-pkgz/lgr"
	openai "github.com/sashabaranov/go-openai"
)

//go:generate moq -out evaluator_mock_test.go -skip-ensure -fmt goimports . AIEvaluator

// AIEvaluator evaluates extraction quality and suggests CSS selectors for improvement
type AIEvaluator interface {
	Evaluate(ctx context.Context, url, extractedText, htmlBody, prevSelector string) (*EvalResult, error)
}

// EvalResult holds the evaluation outcome from the AI model
type EvalResult struct {
	Good     bool   // true if extraction looks fine
	Selector string // suggested CSS selector (only when Good=false)
}

const (
	maxExtractedTextLen = 2000
	maxHTMLBodyLen      = 4000
	openaiCallTimeout   = 60 * time.Second
)

var errInvalidJSON = errors.New("invalid JSON response from OpenAI")

const systemPrompt = `You are a web content extraction expert. You evaluate whether extracted article text is complete and correct, and suggest CSS selectors when extraction is poor.`

// OpenAIEvaluator uses OpenAI API to evaluate extraction quality
type OpenAIEvaluator struct {
	APIKey       string
	Model        string
	clientConfig *openai.ClientConfig // optional, for testing
	clientOnce   sync.Once
	client       *openai.Client
}

func (e *OpenAIEvaluator) getClient() *openai.Client {
	e.clientOnce.Do(func() {
		if e.clientConfig != nil {
			e.client = openai.NewClientWithConfig(*e.clientConfig)
		} else {
			e.client = openai.NewClient(e.APIKey)
		}
	})
	return e.client
}

// Evaluate sends the extracted text and HTML body to OpenAI for evaluation.
// Returns EvalResult indicating whether extraction is good, or suggests a CSS selector.
func (e *OpenAIEvaluator) Evaluate(ctx context.Context, reqURL, extractedText, htmlBody, prevSelector string) (*EvalResult, error) {
	client := e.getClient()
	userPrompt := buildUserPrompt(reqURL, extractedText, htmlBody, prevSelector)

	callCtx, cancel := context.WithTimeout(ctx, openaiCallTimeout)
	defer cancel()

	result, err := e.callAPI(callCtx, client, userPrompt)
	if err != nil {
		if !errors.Is(err, errInvalidJSON) {
			return nil, err
		}

		// retry once on invalid JSON with a fresh timeout
		log.Printf("[WARN] invalid JSON from OpenAI for %s, retrying once", reqURL)
		retryCtx, retryCancel := context.WithTimeout(ctx, openaiCallTimeout)
		defer retryCancel()
		result, err = e.callAPI(retryCtx, client, userPrompt)
		if err != nil {
			return nil, fmt.Errorf("openai retry for %s: %w", reqURL, err)
		}
	}

	return result, nil
}

// callAPI makes a single API call and parses the response JSON.
// returns errInvalidJSON if the response is not valid JSON.
func (e *OpenAIEvaluator) callAPI(ctx context.Context, client *openai.Client, userPrompt string) (*EvalResult, error) {
	resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
		Model: e.Model,
		Messages: []openai.ChatCompletionMessage{
			{Role: openai.ChatMessageRoleSystem, Content: systemPrompt},
			{Role: openai.ChatMessageRoleUser, Content: userPrompt},
		},
		Temperature: 0,
	})
	if err != nil {
		return nil, fmt.Errorf("openai API error: %w", err)
	}

	if len(resp.Choices) == 0 {
		return nil, errors.New("openai returned no choices")
	}

	content := strings.TrimSpace(resp.Choices[0].Message.Content)
	return parseEvalResponse(content)
}

// parseEvalResponse parses the JSON response from the model.
// Returns errInvalidJSON if JSON is invalid.
func parseEvalResponse(content string) (*EvalResult, error) {
	var raw struct {
		Good     bool   `json:"good"`
		Selector string `json:"selector"`
	}
	if err := json.Unmarshal([]byte(content), &raw); err != nil {
		return nil, errInvalidJSON
	}

	return &EvalResult{Good: raw.Good, Selector: raw.Selector}, nil
}

func buildUserPrompt(reqURL, extractedText, htmlBody, prevSelector string) string {
	if runes := []rune(extractedText); len(runes) > maxExtractedTextLen {
		extractedText = string(runes[:maxExtractedTextLen])
	}
	if runes := []rune(htmlBody); len(runes) > maxHTMLBodyLen {
		htmlBody = string(runes[:maxHTMLBodyLen])
	}

	var sb strings.Builder
	_, _ = fmt.Fprintf(&sb, "I extracted content from this URL: %s\n\n", reqURL)
	_, _ = fmt.Fprintf(&sb, "Extracted text (first 2000 chars):\n---\n%s\n---\n\n", extractedText)
	_, _ = fmt.Fprintf(&sb, "Page HTML structure (first 4000 chars):\n---\n%s\n---\n\n", htmlBody)
	_, _ = fmt.Fprint(&sb, `Is this a good extraction of the article content? Consider:
- Does it contain the main article body (not just navigation/ads/boilerplate)?
- Is it reasonably complete (not truncated or empty)?

Respond in JSON only, no other text:
{"good": true} if extraction is fine
{"good": false, "selector": "article.post-content"} if not, with a CSS selector that targets the main content on this page`)

	if prevSelector != "" {
		_, _ = fmt.Fprintf(&sb, "\n\nPrevious attempt with selector %q was tried but didn't improve. "+
			"Suggest a different selector based on the HTML structure above.", prevSelector)
	}

	return sb.String()
}