confusable-vision

Empirical visual similarity scoring for Unicode confusable characters. Renders character pairs across 230 system fonts, measures structural similarity (SSIM), and produces scored JSON artifacts that tell you exactly how confusable two characters are, in which fonts, and with what confidence.

Key results from 26.5 million SSIM comparisons across 22,000+ characters and 12 writing systems:

793 confusable pairs not in any standard. Characters that look like Latin letters on screen but are absent from Unicode's official confusables.txt. 74.5% are valid in package names and domain names today.
World-first cross-script confusable dataset. 563 visually confusable pairs between non-Latin scripts (Cyrillic vs Greek, Hangul vs Han, Devanagari vs Thai) that no prior public dataset covers.
Font-aware confidence scores replace binary lists. TR39 says a pair is confusable or it isn't. confusable-vision says how confusable, in which fonts, with measured SSIM. 96.5% of TR39 pairs score below 0.7; the dangerous 3.5% score above 0.95.

The output (confusable-weights.json) feeds directly into namespace-guard for runtime confusable detection in package names, domain names, and identifiers.

Why this matters

Unicode has over 149,000 characters. Many look identical to Latin letters: Cyrillic а (U+0430) is visually indistinguishable from Latin a in most fonts. Attackers exploit this for IDN homograph attacks, package name typosquatting, and credential phishing.

Unicode TR39 publishes a confusables.txt mapping, but it's a binary list: a pair is either confusable or it isn't. It doesn't account for fonts, doesn't score confidence, and misses hundreds of pairs. confusable-vision fills that gap with per-font, per-pair SSIM scores derived from actual rendered pixels.

What it found

26.5 million SSIM comparisons across 230 macOS system fonts, 12 ICANN-relevant scripts, and 22,000+ Unicode characters:

	Pairs	Comparisons	Source
TR39 validation	1,418	235,625	confusables.txt (single-codepoint, Latin targets)
Novel discovery	793	2,904,376	23,317 identifier-safe codepoints vs Latin a-z/0-9
Cross-script	563	23,629,492	12 scripts x 66 script pairs (Latin, Cyrillic, Greek, Arabic, Han, Hangul, Katakana, Hiragana, Devanagari, Thai, Georgian, Armenian)

1,397 weighted confusable edges in the final output, each with same-font/cross-font statistics, danger scores, and cost values.

TR39 is mostly noise, but the high end is severe

96.5% of confusables.txt scores below 0.7 mean SSIM. The median pair scores 0.322. But 82 pairs are pixel-identical (SSIM 1.000) in at least one font, and 47 pairs score negative SSIM (less similar than random noise). The list conflates genuinely dangerous pairs with pairs no human would confuse.

793 confusable pairs are missing from TR39

These are characters that look like Latin letters on screen but do not appear in Unicode's official confusables.txt. Top find: U+A7FE LATIN EPIGRAPHIC LETTER I LONGA scores 0.998 against "l" in Geneva. Most are vertical stroke characters from obscure scripts (Pahawh Hmong, Nabataean, Duployan) that render as "l" or "i" lookalikes.

74.5% of these are valid in both JavaScript identifiers and domain names, meaning they can appear in package names and URLs today with no tooling flagging them.

Font choice changes confusable risk dramatically

Same-font comparisons average 0.536 SSIM; cross-font average 0.339. Font danger rates range from 0% (Zapfino) to 67.5% (Phosphate). Switching from Arial to Georgia drops confusable pair coverage from 438 to 103. The font a product ships matters for its attack surface.

World-first cross-script confusable measurement

Prior work on confusable detection focuses almost entirely on non-Latin vs Latin (Cyrillic а vs Latin a). No public dataset measures visual confusability between non-Latin scripts: Cyrillic vs Greek, Hangul vs Han, Devanagari vs Thai.

confusable-vision scored all 66 script pairs from 12 ICANN-relevant scripts (23.6M comparisons), finding 563 cross-script confusable pairs across 37 of them. Highest-yield: Cyrillic-Greek (126 pairs), Latin-Cyrillic (103), Latin-Greek (86).

Top discovery: Hangul jamo U+1175 vs CJK U+4E28 at SSIM 0.999. Also confirmed empirically: Katakana ロ vs CJK 口, Devanagari ० vs Thai ๐, Georgian Ⴝ vs Latin S. 29 of 66 script pairs produced zero matches, confirming that most distant scripts are visually distinct.

Quick start

npm install

# TR39 confusable pair scoring
npx tsx scripts/build-index.ts          # Render index (~160s, 11,370 PNGs)
npx tsx scripts/score-all-pairs.ts      # Score all pairs (~65s, 235K comparisons)

# Novel confusable discovery
npx tsx scripts/build-candidates.ts          # Candidate set (~23K chars)
npx tsx scripts/build-index.ts --candidates  # Render candidates (~40min, 89K PNGs)
npx tsx scripts/score-candidates.ts          # Score against Latin targets (~15min, 2.9M comparisons)

# Extract high-scoring discoveries from both pipelines
npx tsx scripts/extract-discoveries.ts

Font querying

Query which confusable pairs exist for a specific font. Useful for font designers shipping a new typeface, browser vendors evaluating a system font change, or anyone choosing a display font for security-sensitive contexts like IDN domains.

npx tsx scripts/query-font.ts --list-fonts                    # 218 fonts in discovery data
npx tsx scripts/query-font.ts "Arial"                         # All pairs for Arial (SSIM >= 0.7)
npx tsx scripts/query-font.ts "Arial" --threshold 0.8         # High-confidence only
npx tsx scripts/query-font.ts "Arial" --compare "Georgia"     # Diff two fonts by SSIM delta
npx tsx scripts/query-font.ts "Arial" --json                  # JSON for downstream processing

Font name matching is case-insensitive substring, so "arial" matches Arial, Arial Black, and Arial Unicode MS. Compare mode sorts by the biggest SSIM differences first, surfacing exactly which pairs get better or worse when switching fonts.

Requires the discovery files from the scoring pipeline (gitignored, regenerate locally).

How it works

Rendering pipeline

build-index renders source and target characters as 48x48 greyscale PNGs, one per font that natively contains the character. Fontconfig is queried per-character to skip fonts lacking coverage (97% reduction vs brute-force).
score-all-pairs / score-candidates computes SSIM for every valid source/target combination in two modes: same-font (both characters in one font) and cross-font (source in supplemental font, target in standard font).
extract-discoveries filters to high-scoring pairs (mean SSIM >= 0.7) and writes compact, licenced JSON files.
generate-weights combines all discoveries into confusable-weights.json with per-pair same-font/cross-font statistics, danger (max SSIM), stableDanger (p95 SSIM), and cost (1 - stableDanger).

Design choices

Greyscale rendering. Gupta et al. 2023 ("GlyphNet") found greyscale outperforms colour for glyph comparison.
No image augmentation. Flipping/rotating characters creates unrealistic glyphs.
SSIM over learned embeddings. Deterministic, reproducible, no training data or GPU required.
Fontconfig-targeted rendering. Only render characters in fonts that actually contain them.

No GlyphNet code is incorporated (GPL licence ambiguity in their repository).

Font discovery

Rather than a hardcoded font list, confusable-vision auto-discovers every system font with Latin a-z coverage:

fc-list ':charset=61-7A' --format='%{file}|%{family[0]}\n'

Category	Count	Purpose
standard	74	Latin-primary fonts (Arial, Menlo, Georgia, Helvetica, etc.)
script	49	CJK, Indic, Thai fonts that also contain Latin glyphs
noto	103	Noto Sans variants for non-Latin scripts
math	3	STIX Two Math, STIX Two Text, STIXGeneral
symbol	1	Apple Symbols
Total	230

Output

Committed (CC-BY-4.0)

File	Description
`data/output/confusable-discoveries.json`	110 TR39 pairs with high SSIM (>= 0.7) or pixel-identical
`data/output/candidate-discoveries.json`	793 novel pairs not in TR39, mean SSIM >= 0.7
`data/output/confusable-weights.json`	1,397 weighted edges for namespace-guard integration

Generated (gitignored, run pipeline to regenerate)

File	Description
`data/output/render-index/`	11,370 render PNGs + index.json
`data/output/candidate-index/`	89,478 render PNGs + index.json
`data/output/confusable-scores.json`	Full scored results (63 MB)
`data/output/candidate-scores.json`	Full scored results (573 MB)
`data/output/report-stats.txt`	Detailed statistics for REPORT.md

Progress

TR39 validation (1,418 pairs, 230 fonts, technical report)
Novel confusable discovery (793 high-scoring pairs from 23,317 candidates)
CJK/Hangul verification (122,862 logographic characters, 69 high-scoring pairs found, confirms M2 exclusion was broadly correct)
Glyph reuse detection, identifier property annotations, weighted edge computation, namespace-guard integration
Cross-script confusable scanning (12 ICANN scripts, 23.6M pairs scored, 563 discoveries)
Per-font querying and font comparison
Score arbitrary fonts. Register a new font by path, render all source characters against it, and produce a confusable risk report. Enables "I'm switching from Arial to Inter for my banking app, which confusable pairs change?" without re-running the full pipeline.
Multi-character confusables (rn vs m, cl vs d). Shelved: SSIM cannot weight categorical features like dots (ni scores 0.86 against m because the dot is a handful of pixels, while humans treat it as an instant disambiguator). Revisit with a perceptual metric that weights distinctive features.

I rendered 1,418 Unicode confusable pairs across 230 fonts. Most aren't confusable to the eye. — TR39 validation results and the case for measured confidence scores
793 Unicode characters look like Latin letters but aren't (yet) in confusables.txt — novel discovery pipeline and the highest-scoring finds
28 CJK and Hangul characters look like Latin letters — verifying the CJK/Hangul exclusion from the main scan
248 cross-script confusable pairs that no standard covers — cross-script scanning across 12 ICANN scripts
148x faster: rebuilding a Unicode scanning pipeline for cross-script scale — WASM SSIM workers, pure JS resize, and the optimisation path to 23.6M comparisons
When shape similarity lies: size-ratio artifacts in confusable detection — why normalisation choices matter and how size-ratio filtering reduces false positives
The new DDoS: Unicode confusables can't fool LLMs, but they can 5x your API bill — testing confusable attacks against GPT-4o, Claude, Gemini, and Llama

Background

Posts covering the broader problem space that motivated this project:

A threat model for Unicode identifier spoofing — attack taxonomy for package names, domains, and source code identifiers
Making Unicode risk measurable — why binary confusable lists aren't enough and what a scored approach looks like
Your LLM reads Unicode codepoints, not glyphs. That's an attack surface. — how confusables interact with tokenisation and code review by LLMs
Who does confusable detection actually protect? — the anglocentric bias in TR39 and what it means for non-Latin users
Unicode ships one confusable map. You need two. — the NFKC/TR39 divergence that started this project
confusables.txt and NFKC disagree on 31 characters — the 31 composability vectors that confusable-vision was originally built to resolve

Licence

Code (src/, scripts/): MIT
Generated data (data/output/): CC-BY-4.0. Free to use, share, and adapt for any purpose including commercial, with attribution.
Attribution: Paul Wood FRSA (@paultendo), confusable-vision

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
attack-tests		attack-tests
data		data
docs		docs
scripts		scripts
src		src
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
REPORT.md		REPORT.md
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

confusable-vision

Why this matters

What it found

TR39 is mostly noise, but the high end is severe

793 confusable pairs are missing from TR39

Font choice changes confusable risk dramatically

World-first cross-script confusable measurement

Quick start

Font querying

How it works

Rendering pipeline

Design choices

Font discovery

Output

Committed (CC-BY-4.0)

Generated (gitignored, run pipeline to regenerate)

Progress

Related

Blog posts

Background

Licence

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

License

paultendo/confusable-vision

Folders and files

Latest commit

History

Repository files navigation

confusable-vision

Why this matters

What it found

TR39 is mostly noise, but the high end is severe

793 confusable pairs are missing from TR39

Font choice changes confusable risk dramatically

World-first cross-script confusable measurement

Quick start

Font querying

How it works

Rendering pipeline

Design choices

Font discovery

Output

Committed (CC-BY-4.0)

Generated (gitignored, run pipeline to regenerate)

Progress

Related

Blog posts

Background

Licence

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages