Empirical visual similarity scoring for Unicode confusable characters. Renders character pairs across 230 system fonts, measures structural similarity (SSIM), and produces scored JSON artifacts that tell you exactly how confusable two characters are, in which fonts, and with what confidence.
Key results from 26.5 million SSIM comparisons across 22,000+ characters and 12 writing systems:
- 793 confusable pairs not in any standard. Characters that look like Latin letters on screen but are absent from Unicode's official confusables.txt. 74.5% are valid in package names and domain names today.
- World-first cross-script confusable dataset. 563 visually confusable pairs between non-Latin scripts (Cyrillic vs Greek, Hangul vs Han, Devanagari vs Thai) that no prior public dataset covers.
- Font-aware confidence scores replace binary lists. TR39 says a pair is confusable or it isn't. confusable-vision says how confusable, in which fonts, with measured SSIM. 96.5% of TR39 pairs score below 0.7; the dangerous 3.5% score above 0.95.
The output (confusable-weights.json) feeds directly into namespace-guard for runtime confusable detection in package names, domain names, and identifiers.
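A minimal sketch of that integration, assuming a flat edge list with `source`, `target`, `danger`, and `cost` fields — the edge shape here is an illustrative assumption, not namespace-guard's actual schema:

```typescript
// Sketch of consuming confusable-weights.json. The Edge shape below is an
// assumption for illustration, not the actual file format.
interface Edge {
  source: string;
  target: string;
  danger: number; // max SSIM observed across fonts
  cost: number;   // 1 - p95 SSIM
}

// Highest measured danger for a character pair, in either direction;
// 0 when no edge exists.
function confusableRisk(a: string, b: string, edges: Edge[]): number {
  let risk = 0;
  for (const e of edges) {
    const hit =
      (e.source === a && e.target === b) ||
      (e.source === b && e.target === a);
    if (hit) risk = Math.max(risk, e.danger);
  }
  return risk;
}
```

A caller might treat `risk >= 0.95` as the dangerous band, matching the 3.5% of TR39 pairs noted above.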
Unicode has over 149,000 characters. Many look identical to Latin letters: Cyrillic а (U+0430) is visually indistinguishable from Latin a in most fonts. Attackers exploit this for IDN homograph attacks, package name typosquatting, and credential phishing.
Unicode TR39 publishes a confusables.txt mapping, but it's a binary list: a pair is either confusable or it isn't. It doesn't account for fonts, doesn't score confidence, and misses hundreds of pairs. confusable-vision fills that gap with per-font, per-pair SSIM scores derived from actual rendered pixels.
26.5 million SSIM comparisons across 230 macOS system fonts, 12 ICANN-relevant scripts, and 22,000+ Unicode characters:
| Scan | Pairs | Comparisons | Source |
|---|---|---|---|
| TR39 validation | 1,418 | 235,625 | confusables.txt (single-codepoint, Latin targets) |
| Novel discovery | 793 | 2,904,376 | 23,317 identifier-safe codepoints vs Latin a-z/0-9 |
| Cross-script | 563 | 23,629,492 | 66 script pairs from 12 scripts (Latin, Cyrillic, Greek, Arabic, Han, Hangul, Katakana, Hiragana, Devanagari, Thai, Georgian, Armenian) |
1,397 weighted confusable edges in the final output, each with same-font/cross-font statistics, danger scores, and cost values.
96.5% of confusables.txt pairs score below 0.7 mean SSIM. The median pair scores 0.322. But 82 pairs are pixel-identical (SSIM 1.000) in at least one font, and 47 pairs score negative SSIM (less similar than random noise). The list conflates genuinely dangerous pairs with pairs no human would confuse.
These are characters that look like Latin letters on screen but do not appear in Unicode's official confusables.txt. Top find: U+A7FE LATIN EPIGRAPHIC LETTER I LONGA scores 0.998 against "l" in Geneva. Most are vertical stroke characters from obscure scripts (Pahawh Hmong, Nabataean, Duployan) that render as "l" or "i" lookalikes.
74.5% of these are valid in both JavaScript identifiers and domain names, meaning they can appear in package names and URLs today with no tooling flagging them.
Same-font comparisons average 0.536 SSIM; cross-font average 0.339. Font danger rates range from 0% (Zapfino) to 67.5% (Phosphate). Switching from Arial to Georgia drops confusable pair coverage from 438 to 103. The font a product ships matters for its attack surface.
Prior work on confusable detection focuses almost entirely on non-Latin vs Latin (Cyrillic а vs Latin a). No public dataset measures visual confusability between non-Latin scripts: Cyrillic vs Greek, Hangul vs Han, Devanagari vs Thai.
confusable-vision scored all 66 script pairs from 12 ICANN-relevant scripts (23.6M comparisons), finding 563 cross-script confusable pairs across 37 of them. Highest-yield: Cyrillic-Greek (126 pairs), Latin-Cyrillic (103), Latin-Greek (86).
Top discovery: Hangul jamo U+1175 vs CJK U+4E28 at SSIM 0.999. Also confirmed empirically: Katakana ロ vs CJK 口, Devanagari ० vs Thai ๐, Georgian Ⴝ vs Latin S. 29 of 66 script pairs produced zero matches, confirming that most distant scripts are visually distinct.
```sh
npm install
```

```sh
# TR39 confusable pair scoring
npx tsx scripts/build-index.ts       # Render index (~160s, 11,370 PNGs)
npx tsx scripts/score-all-pairs.ts   # Score all pairs (~65s, 235K comparisons)

# Novel confusable discovery
npx tsx scripts/build-candidates.ts           # Candidate set (~23K chars)
npx tsx scripts/build-index.ts --candidates   # Render candidates (~40min, 89K PNGs)
npx tsx scripts/score-candidates.ts           # Score against Latin targets (~15min, 2.9M comparisons)

# Extract high-scoring discoveries from both pipelines
npx tsx scripts/extract-discoveries.ts
```

Query which confusable pairs exist for a specific font. Useful for font designers shipping a new typeface, browser vendors evaluating a system font change, or anyone choosing a display font for security-sensitive contexts like IDN domains.

```sh
npx tsx scripts/query-font.ts --list-fonts                 # 218 fonts in discovery data
npx tsx scripts/query-font.ts "Arial"                      # All pairs for Arial (SSIM >= 0.7)
npx tsx scripts/query-font.ts "Arial" --threshold 0.8      # High-confidence only
npx tsx scripts/query-font.ts "Arial" --compare "Georgia"  # Diff two fonts by SSIM delta
npx tsx scripts/query-font.ts "Arial" --json               # JSON for downstream processing
```

Font name matching is case-insensitive substring, so "arial" matches Arial, Arial Black, and Arial Unicode MS. Compare mode sorts by the biggest SSIM differences first, surfacing exactly which pairs get better or worse when switching fonts.
Requires the discovery files from the scoring pipeline (gitignored, regenerate locally).
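The matching and compare behaviour described above can be sketched as follows — an assumed reimplementation for illustration, not the script's actual code:

```typescript
// Case-insensitive substring matching, so "arial" matches Arial,
// Arial Black, and Arial Unicode MS (assumed reimplementation).
function matchFonts(query: string, families: string[]): string[] {
  const q = query.toLowerCase();
  return families.filter((f) => f.toLowerCase().includes(q));
}

interface PairDelta {
  pair: string;
  a: number; // SSIM in font A
  b: number; // SSIM in font B
}

// Compare mode: biggest absolute SSIM differences first, surfacing the
// pairs that change most when switching fonts.
function sortByDelta(pairs: PairDelta[]): PairDelta[] {
  return [...pairs].sort(
    (p, q) => Math.abs(q.a - q.b) - Math.abs(p.a - p.b)
  );
}
```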
- `build-index` renders source and target characters as 48x48 greyscale PNGs, one per font that natively contains the character. Fontconfig is queried per character to skip fonts lacking coverage (a 97% reduction vs brute force).
- `score-all-pairs` / `score-candidates` compute SSIM for every valid source/target combination in two modes: same-font (both characters rendered in one font) and cross-font (source in a supplemental font, target in a standard font).
- `extract-discoveries` filters to high-scoring pairs (mean SSIM >= 0.7) and writes compact, licensed JSON files.
- `generate-weights` combines all discoveries into `confusable-weights.json` with per-pair same-font/cross-font statistics, danger (max SSIM), stableDanger (p95 SSIM), and cost (1 - stableDanger).
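The weight fields can be sketched from an array of per-font SSIM values for one pair; the nearest-rank p95 here is an assumption about how stableDanger is derived:

```typescript
// Sketch of the per-pair weight computation. Field names mirror the
// pipeline description; the nearest-rank p95 method is an assumption.
function edgeStats(ssims: number[]): {
  danger: number;
  stableDanger: number;
  cost: number;
} {
  const sorted = [...ssims].sort((a, b) => a - b);
  const danger = sorted[sorted.length - 1]; // max SSIM across fonts
  const p95Index = Math.min(
    sorted.length - 1,
    Math.floor(0.95 * sorted.length)
  );
  const stableDanger = sorted[p95Index]; // p95 SSIM (nearest-rank, assumed)
  return { danger, stableDanger, cost: 1 - stableDanger };
}
```

`cost` inverts `stableDanger`, so that pairs which stay near-identical across most fonts are cheap to traverse in a distance computation.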
- Greyscale rendering. Gupta et al. 2023 ("GlyphNet") found greyscale outperforms colour for glyph comparison.
- No image augmentation. Flipping/rotating characters creates unrealistic glyphs.
- SSIM over learned embeddings. Deterministic, reproducible, no training data or GPU required.
- Fontconfig-targeted rendering. Only render characters in fonts that actually contain them.
No GlyphNet code is incorporated (GPL licence ambiguity in their repository).
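As a reference point for the SSIM scores quoted throughout, a minimal global (single-window) SSIM over greyscale pixel arrays looks like this. The production pipeline uses WASM SSIM workers, and any windowing details are omitted here — this is an illustrative sketch, not the pipeline's implementation:

```typescript
// Global SSIM over two equal-length greyscale pixel arrays (0-255).
// Single-window form of Wang et al.'s SSIM; standard constants
// C1 = (0.01*255)^2 and C2 = (0.03*255)^2.
function ssim(x: number[], y: number[]): number {
  const n = x.length;
  const mean = (v: number[]) => v.reduce((s, p) => s + p, 0) / n;
  const mx = mean(x);
  const my = mean(y);
  let vx = 0, vy = 0, cov = 0;
  for (let i = 0; i < n; i++) {
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
    cov += (x[i] - mx) * (y[i] - my);
  }
  vx /= n; vy /= n; cov /= n;
  const C1 = (0.01 * 255) ** 2;
  const C2 = (0.03 * 255) ** 2;
  return (
    ((2 * mx * my + C1) * (2 * cov + C2)) /
    ((mx * mx + my * my + C1) * (vx + vy + C2))
  );
}
```

Identical renders score 1.0; an inverted render scores negative, which is how a pair can be "less similar than random noise" as noted in the TR39 validation results.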
Rather than a hardcoded font list, confusable-vision auto-discovers every system font with Latin a-z coverage:
```sh
fc-list ':charset=61-7A' --format='%{file}|%{family[0]}\n'
```

| Category | Count | Purpose |
|---|---|---|
| standard | 74 | Latin-primary fonts (Arial, Menlo, Georgia, Helvetica, etc.) |
| script | 49 | CJK, Indic, Thai fonts that also contain Latin glyphs |
| noto | 103 | Noto Sans variants for non-Latin scripts |
| math | 3 | STIX Two Math, STIX Two Text, STIXGeneral |
| symbol | 1 | Apple Symbols |
| Total | 230 | |
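Parsing that fc-list output into file/family records is straightforward; this sketch assumes the `%{file}|%{family[0]}` format shown above:

```typescript
// Parse fc-list output in the '%{file}|%{family[0]}\n' format into
// file/family records (illustrative sketch).
interface FontEntry {
  file: string;
  family: string;
}

function parseFcList(output: string): FontEntry[] {
  return output
    .split("\n")
    .filter((line) => line.includes("|")) // skip blank trailing lines
    .map((line) => {
      const sep = line.indexOf("|");
      return { file: line.slice(0, sep), family: line.slice(sep + 1) };
    });
}
```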
| File | Description |
|---|---|
| `data/output/confusable-discoveries.json` | 110 TR39 pairs with high SSIM (>= 0.7) or pixel-identical |
| `data/output/candidate-discoveries.json` | 793 novel pairs not in TR39, mean SSIM >= 0.7 |
| `data/output/confusable-weights.json` | 1,397 weighted edges for namespace-guard integration |
| File | Description |
|---|---|
| `data/output/render-index/` | 11,370 render PNGs + index.json |
| `data/output/candidate-index/` | 89,478 render PNGs + index.json |
| `data/output/confusable-scores.json` | Full scored results (63 MB) |
| `data/output/candidate-scores.json` | Full scored results (573 MB) |
| `data/output/report-stats.txt` | Detailed statistics for REPORT.md |
- TR39 validation (1,418 pairs, 230 fonts, technical report)
- Novel confusable discovery (793 high-scoring pairs from 23,317 candidates)
- CJK/Hangul verification (122,862 logographic characters, 69 high-scoring pairs found, confirms M2 exclusion was broadly correct)
- Glyph reuse detection, identifier property annotations, weighted edge computation, namespace-guard integration
- Cross-script confusable scanning (12 ICANN scripts, 23.6M pairs scored, 563 discoveries)
- Per-font querying and font comparison
- Score arbitrary fonts. Register a new font by path, render all source characters against it, and produce a confusable risk report. Enables "I'm switching from Arial to Inter for my banking app, which confusable pairs change?" without re-running the full pipeline.
- Multi-character confusables (`rn` vs `m`, `cl` vs `d`). Shelved: SSIM cannot weight categorical features like dots (`ni` scores 0.86 against `m` because the dot is a handful of pixels, while humans treat it as an instant disambiguator). Revisit with a perceptual metric that weights distinctive features.
- namespace-guard (v0.16.0+) consumes `confusable-weights.json` for measured visual risk scoring via `confusableDistance({ weights })`
- REPORT.md: full technical report (12 sections, per-font analysis, appendices)
Write-ups on paultendo.github.io covering the findings and methodology behind this project:
- I rendered 1,418 Unicode confusable pairs across 230 fonts. Most aren't confusable to the eye. — TR39 validation results and the case for measured confidence scores
- 793 Unicode characters look like Latin letters but aren't (yet) in confusables.txt — novel discovery pipeline and the highest-scoring finds
- 28 CJK and Hangul characters look like Latin letters — verifying the CJK/Hangul exclusion from the main scan
- 248 cross-script confusable pairs that no standard covers — cross-script scanning across 12 ICANN scripts
- 148x faster: rebuilding a Unicode scanning pipeline for cross-script scale — WASM SSIM workers, pure JS resize, and the optimisation path to 23.6M comparisons
- When shape similarity lies: size-ratio artifacts in confusable detection — why normalisation choices matter and how size-ratio filtering reduces false positives
- The new DDoS: Unicode confusables can't fool LLMs, but they can 5x your API bill — testing confusable attacks against GPT-4o, Claude, Gemini, and Llama
Posts covering the broader problem space that motivated this project:
- A threat model for Unicode identifier spoofing — attack taxonomy for package names, domains, and source code identifiers
- Making Unicode risk measurable — why binary confusable lists aren't enough and what a scored approach looks like
- Your LLM reads Unicode codepoints, not glyphs. That's an attack surface. — how confusables interact with tokenisation and code review by LLMs
- Who does confusable detection actually protect? — the anglocentric bias in TR39 and what it means for non-Latin users
- Unicode ships one confusable map. You need two. — the NFKC/TR39 divergence that started this project
- confusables.txt and NFKC disagree on 31 characters — the 31 composability vectors that confusable-vision was originally built to resolve
- Code (src/, scripts/): MIT
- Generated data (data/output/): CC-BY-4.0. Free to use, share, and adapt for any purpose including commercial, with attribution.
- Attribution: Paul Wood FRSA (@paultendo), confusable-vision