…ipping

MOTIVATION:
- hetushu.com is a major Chinese web novel host not yet supported by LexiconForge
- Site embeds watermarks using <big>/<kbd>/<code>/<cite> tags with fullwidth Unicode obfuscation (e.g. hetushu.com) that must be stripped before content is stored

APPROACH:
- Added HetushuAdapter class following the existing BaseAdapter pattern
- Watermark detection normalises text via NFKC before pattern-matching, so fullwidth-obfuscated URLs are caught the same as plain ASCII ones
- Content extraction collects :scope > div children (the paragraph structure hetushu uses), removing .mask overlays, the duplicate h2.h2, and any watermark-bearing inline elements first
- Navigation resolves relative hrefs against the page origin using URL()

CHANGES:
- config/constants.ts: add hetushu.com to SUPPORTED_WEBSITES_CONFIG
- services/scraping/siteAdapters.ts: add HetushuAdapter; register in getAdapter()
- services/scraping/urlUtils.ts: add hetushu.com example URL
- tests/services/adapters.hetushu.test.ts: 17 tests covering title, content, watermark stripping (ASCII + fullwidth), navigation, and URL support

IMPACT:
- Users can now fetch chapters from hetushu.com URLs
- No breaking changes to existing adapters or factory interface

TESTING:
- npx vitest run tests/services/adapters.hetushu.test.ts → 17/17 pass
- npx tsc --noEmit → no new type errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
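The NFKC step described under APPROACH can be illustrated with a minimal sketch. This is not the adapter's actual code: the function name `looksLikeWatermark` and the `WATERMARK_HOSTS` list are hypothetical, and the only host assumed is the one named in this PR. It relies on the fact that NFKC maps fullwidth Latin letters and the fullwidth full stop to their ASCII counterparts.

```typescript
// Hypothetical host list for illustration; the PR does not enumerate
// the watermark hosts it actually matches against.
const WATERMARK_HOSTS: readonly string[] = ["hetushu.com"];

// Returns true when the text, after NFKC folding, mentions a watermark host.
function looksLikeWatermark(text: string): boolean {
  // NFKC folds fullwidth forms (U+FF01..U+FF5E) into plain ASCII,
  // so an obfuscated host compares equal to its ASCII spelling.
  const normalized = text
    .normalize("NFKC")
    .toLowerCase()
    .replace(/\s+/g, ""); // also ignore whitespace padding inside the watermark
  return WATERMARK_HOSTS.some((host) => normalized.includes(host));
}

// "hetushu.com" written with fullwidth code points.
const fullwidthHost =
  "\uff48\uff45\uff54\uff55\uff53\uff48\uff55\uff0e\uff43\uff4f\uff4d";

console.log(looksLikeWatermark(fullwidthHost)); // true
console.log(looksLikeWatermark("hetushu.com")); // true
console.log(looksLikeWatermark("plain chapter text")); // false
```

Normalising once and matching against plain ASCII keeps the host list short, since each obfuscated spelling does not need its own pattern.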
Summary
- Adds `HetushuAdapter` for scraping Chinese novel chapters from hetushu.com
- Strips fullwidth-obfuscated watermarks (e.g. hetushu.com) via NFKC normalization before pattern-matching against known watermark hosts
- Registers hetushu.com in `SUPPORTED_WEBSITES_CONFIG` and `getExampleUrl()`

Implementation details
- Title: `#ctitle .title` text content
- Content: `:scope > div` children of `#content`, after removing `.mask` overlays, the duplicate `<h2 class="h2">`, and watermark-bearing `<big>`/`<kbd>`/`<code>`/`<cite>` elements
- Navigation: `#right .pre a[href]` (prev) and `#right a#next[href]` (next), resolved to absolute URLs using the page origin

Test plan
- `npx vitest run tests/services/adapters.hetushu.test.ts`: 17/17 tests pass
- `npx tsc --noEmit`: no new type errors
- Load `https://hetushu.com/book/2991/2051039.html` through the app and confirm title, clean paragraphs, and next/prev navigation

🤖 Generated with Claude Code
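The navigation step, which resolves prev/next hrefs to absolute URLs using the page origin, might look roughly like the sketch below. The helper name `resolveAgainstOrigin` is hypothetical, and the next-chapter path in the usage lines is invented for illustration; only the standard `URL` constructor is assumed.

```typescript
// Resolve a possibly relative href against the origin of the page URL.
// Returns null when the href is missing or either URL cannot be parsed.
function resolveAgainstOrigin(
  href: string | null,
  pageUrl: string
): string | null {
  if (!href) return null;
  try {
    const origin = new URL(pageUrl).origin; // scheme + host, no path
    return new URL(href, origin).href;
  } catch {
    return null;
  }
}

const page = "https://hetushu.com/book/2991/2051039.html";

// Hypothetical next-chapter href, used only to show the resolution.
console.log(resolveAgainstOrigin("/book/2991/2051040.html", page));
// https://hetushu.com/book/2991/2051040.html

console.log(resolveAgainstOrigin(null, page)); // null
```

One consequence of using the origin as the base rather than the full page URL: an href without a leading slash resolves from the site root, not from the current chapter's directory. That is fine if the site emits root-relative links, which this sketch assumes.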