feat: hetushu.com adapter #34

Merged
anantham merged 1 commit into main from feat/codex-hetushu-source on Apr 4, 2026

@anantham anantham commented Apr 4, 2026

Summary

  • Adds HetushuAdapter for scraping Chinese novel chapters from hetushu.com
  • Strips fullwidth-obfuscated watermarks (e.g. hetushu.com written in fullwidth Unicode characters) via NFKC normalization before pattern-matching against known watermark hosts
  • Registers the site in SUPPORTED_WEBSITES_CONFIG and getExampleUrl()
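
To illustrate the NFKC step in the summary above, here is a minimal sketch of how a fullwidth-obfuscated host could be detected. It is not the adapter's actual code: the `WATERMARK_HOSTS` list and the `containsWatermark` helper are hypothetical names, and the lowercasing step is an added assumption.

```typescript
// Hypothetical list of watermark hosts; the real list in the adapter is not shown in this PR text.
const WATERMARK_HOSTS: readonly string[] = ["hetushu.com"];

/**
 * Returns true when the text contains a known watermark host after
 * NFKC normalization. NFKC folds fullwidth Latin letters and the
 * fullwidth full stop (U+FF0E) into their ASCII equivalents.
 */
function containsWatermark(text: string): boolean {
  // Lowercasing is an assumption added here so mixed-case variants also match.
  const normalized = text.normalize("NFKC").toLowerCase();
  return WATERMARK_HOSTS.some((host) => normalized.includes(host));
}

// Usage: the fullwidth and plain ASCII spellings are treated the same way.
console.log(containsWatermark("ｈｅｔｕｓｈｕ．ｃｏｍ"));
console.log(containsWatermark("hetushu.com"));
```

Note that NFKC does not map the ideographic full stop (U+3002) to an ASCII period, so a variant using that character would need separate handling.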

Implementation details

  • Title: #ctitle .title text content
  • Content: :scope > div children of #content, after removing .mask overlays, the duplicate <h2 class="h2">, and watermark-bearing <big>/<kbd>/<code>/<cite> elements
  • Navigation: #right .pre a[href] (prev) and #right a#next[href] (next), resolved to absolute URLs using the page origin
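
The navigation step above can be sketched with the standard `URL` constructor. This is an illustration under stated assumptions, not the adapter's code: `resolveNavUrl` is a hypothetical helper, and the chapter number `2051040` in the usage line is made up for the example.

```typescript
/**
 * Resolves an href taken from a prev/next link to an absolute URL,
 * using the origin of the page it was found on. Returns null when the
 * href is missing or either value cannot be parsed.
 */
function resolveNavUrl(href: string | null, pageUrl: string): string | null {
  if (!href) return null;
  try {
    // Only the origin (scheme + host + port) of the page is used as the base.
    const origin = new URL(pageUrl).origin;
    return new URL(href, origin).toString();
  } catch {
    // Invalid page URL or href: treat as "no navigation link".
    return null;
  }
}

// Usage with a root-relative href (hypothetical next-chapter path).
console.log(
  resolveNavUrl("/book/2991/2051040.html", "https://hetushu.com/book/2991/2051039.html")
);
```

Resolving against the origin works for root-relative hrefs such as `/book/...`. A document-relative href such as `2051040.html` would resolve to the site root instead of the current directory, so using the full page URL as the base would be the safer choice if the site ever emits those.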

Test plan

  • npx vitest run tests/services/adapters.hetushu.test.ts — 17/17 tests pass
  • npx tsc --noEmit — no new type errors
  • Manual smoke test: fetch https://hetushu.com/book/2991/2051039.html through the app and confirm title, clean paragraphs, and next/prev navigation

🤖 Generated with Claude Code

…ipping

MOTIVATION:
- hetushu.com is a major Chinese web novel host not yet supported by LexiconForge
- Site embeds watermarks using <big>/<kbd>/<code>/<cite> tags with fullwidth
  Unicode obfuscation (e.g. hetushu.com written in fullwidth characters) that
  must be stripped before content is stored

APPROACH:
- Added HetushuAdapter class following the existing BaseAdapter pattern
- Watermark detection normalises text via NFKC before pattern-matching, so
  fullwidth-obfuscated URLs are caught the same as plain ASCII ones
- Content extraction collects :scope > div children (the paragraph structure
  hetushu uses), removing .mask overlays, the duplicate h2.h2, and any
  watermark-bearing inline elements first
- Navigation resolves relative hrefs against the page origin using URL()

CHANGES:
- config/constants.ts: add hetushu.com to SUPPORTED_WEBSITES_CONFIG
- services/scraping/siteAdapters.ts: add HetushuAdapter; register in getAdapter()
- services/scraping/urlUtils.ts: add hetushu.com example URL
- tests/services/adapters.hetushu.test.ts: 17 tests covering title, content,
  watermark stripping (ASCII + fullwidth), navigation, and URL support

IMPACT:
- Users can now fetch chapters from hetushu.com URLs
- No breaking changes to existing adapters or factory interface

TESTING:
- npx vitest run tests/services/adapters.hetushu.test.ts → 17/17 pass
- npx tsc --noEmit → no new type errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vercel Bot commented Apr 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| lexicon-forge | Building | Preview, Comment | Apr 4, 2026 8:48pm |

@anantham anantham merged commit 782b95c into main Apr 4, 2026
2 of 3 checks passed
@anantham anantham deleted the feat/codex-hetushu-source branch April 5, 2026 10:21