feat: hetushu.com adapter by anantham · Pull Request #34 · anantham/LexiconForge

anantham · 2026-04-04T20:48:05Z

Summary

- Adds HetushuAdapter for scraping Chinese novel chapters from hetushu.com
- Strips fullwidth-obfuscated watermarks (e.g. ｈｅｔｕｓｈｕ.ｃｏｍ) by applying NFKC normalization before pattern-matching against known watermark hosts
- Registers the site in SUPPORTED_WEBSITES_CONFIG and getExampleUrl()
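
The watermark check in the summary can be sketched roughly as follows. This is a minimal illustration and not the adapter's actual code: the `WATERMARK_HOSTS` list and the `containsWatermarkHost` helper are hypothetical names introduced here, and only `String.prototype.normalize("NFKC")` is standard.

```typescript
// Hypothetical host list; the adapter's real list of known watermark hosts is not shown in this PR.
const WATERMARK_HOSTS: readonly string[] = ["hetushu.com"];

// Returns true when the text, once NFKC-normalized, mentions a known watermark host.
function containsWatermarkHost(text: string): boolean {
  // NFKC folds fullwidth forms (e.g. "ｈ", U+FF48) into their ASCII equivalents ("h").
  const normalized = text.normalize("NFKC").toLowerCase();
  return WATERMARK_HOSTS.some((host) => normalized.includes(host));
}

console.log(containsWatermarkHost("ｈｅｔｕｓｈｕ.ｃｏｍ")); // true
console.log(containsWatermarkHost("hetushu.com")); // true
console.log(containsWatermarkHost("第一章")); // false
```

Normalizing first means a single plain-ASCII pattern covers both the obfuscated and the unobfuscated spelling, so the host list does not need a fullwidth variant of each entry.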

Implementation details

- Title: text content of #ctitle .title
- Content: :scope > div children of #content, after removing .mask overlays, the duplicate <h2 class="h2">, and watermark-bearing <big>/<kbd>/<code>/<cite> elements
- Navigation: #right .pre a[href] (prev) and #right a#next[href] (next), resolved to absolute URLs using the page origin
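
The navigation step could look something like the sketch below. It assumes the two selectors listed above; the `QueryRoot` interface, `resolveHref`, and `extractNavigation` are hypothetical names used only for illustration, and the chapter path in the example call is made up. `QueryRoot` is a narrow structural type that a real DOM `Document` satisfies, which keeps the sketch runnable without a browser.

```typescript
// Minimal structural view of a DOM root; a real Document satisfies this shape.
interface QueryRoot {
  querySelector(selector: string): { getAttribute(name: string): string | null } | null;
}

// Resolves an href against the origin of the page URL; returns null when either is unusable.
function resolveHref(href: string | null, pageUrl: string): string | null {
  if (!href) return null;
  try {
    return new URL(href, new URL(pageUrl).origin).href;
  } catch {
    return null;
  }
}

// Looks up the prev/next links with the selectors from the PR description.
function extractNavigation(
  root: QueryRoot,
  pageUrl: string
): { prevUrl: string | null; nextUrl: string | null } {
  const prev = root.querySelector("#right .pre a[href]");
  const next = root.querySelector("#right a#next[href]");
  return {
    prevUrl: resolveHref(prev?.getAttribute("href") ?? null, pageUrl),
    nextUrl: resolveHref(next?.getAttribute("href") ?? null, pageUrl),
  };
}

// Hypothetical root-relative href, resolved against the smoke-test page URL.
console.log(resolveHref("/book/2991/2051040.html", "https://hetushu.com/book/2991/2051039.html"));
// https://hetushu.com/book/2991/2051040.html
```

Resolving against the origin rather than the full page URL works for root-relative hrefs like the one above; a path-relative href such as `2051040.html` would instead need the full page URL as the base.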

Test plan

- npx vitest run tests/services/adapters.hetushu.test.ts: 17/17 tests pass
- npx tsc --noEmit: no new type errors
- Manual smoke test: fetch https://hetushu.com/book/2991/2051039.html through the app and confirm the title, clean paragraphs, and next/prev navigation

🤖 Generated with Claude Code

…ipping

MOTIVATION:
- hetushu.com is a major Chinese web novel host not yet supported by LexiconForge
- Site embeds watermarks using <big>/<kbd>/<code>/<cite> tags with fullwidth Unicode obfuscation (e.g. ｈｅｔｕｓｈｕ.ｃｏｍ) that must be stripped before content is stored

APPROACH:
- Added HetushuAdapter class following the existing BaseAdapter pattern
- Watermark detection normalises text via NFKC before pattern-matching, so fullwidth-obfuscated URLs are caught the same as plain ASCII ones
- Content extraction collects :scope > div children (the paragraph structure hetushu uses), removing .mask overlays, the duplicate h2.h2, and any watermark-bearing inline elements first
- Navigation resolves relative hrefs against the page origin using URL()

CHANGES:
- config/constants.ts: add hetushu.com to SUPPORTED_WEBSITES_CONFIG
- services/scraping/siteAdapters.ts: add HetushuAdapter; register in getAdapter()
- services/scraping/urlUtils.ts: add hetushu.com example URL
- tests/services/adapters.hetushu.test.ts: 17 tests covering title, content, watermark stripping (ASCII + fullwidth), navigation, and URL support

IMPACT:
- Users can now fetch chapters from hetushu.com URLs
- No breaking changes to existing adapters or factory interface

TESTING:
- npx vitest run tests/services/adapters.hetushu.test.ts → 17/17 pass
- npx tsc --noEmit → no new type errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vercel · 2026-04-04T20:48:12Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
lexicon-forge	Building	Preview, Comment	Apr 4, 2026 8:48pm

anantham merged commit 782b95c into main Apr 4, 2026
2 of 3 checks passed

vercel Bot deployed to Preview April 4, 2026 20:48 View deployment

anantham deleted the feat/codex-hetushu-source branch April 5, 2026 10:21

anantham merged 1 commit into main from feat/codex-hetushu-source
