Support non-ASCII Unicode in grammar rule names #2196
Merged
The grammar currently supports only ASCII rule names. We want to support non-ASCII Unicode symbols such as `⊥` (bottom) since we plan to add that rule.

In this commit, we add `is_name_start` and `is_name_continue` predicates that centralize the decision of what can appear in a rule name. `is_name_start` accepts alphabetic characters, underscores, and non-ASCII characters; `is_name_continue` accepts alphanumeric characters, underscores, and non-ASCII characters.

We use `is_name_start` in the `parse_expr1` condition that routes to `parse_nonterminal`. The previous condition (`is_alphanumeric`) was slightly misaligned with what `parse_name` actually accepts: it included digits (which `parse_name` rejects) and excluded underscores (which `parse_name` accepts). Using `is_name_start` makes the dispatch condition match `parse_name` exactly.

The `NAMES_RE` regex in `mdbook-spec` encodes the same name-matching logic as a regex pattern, so let's add a comment tying it to the predicates.

cc @ehuss
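For reviewers who want to see the rule in one place, here is a minimal sketch of the two predicates as described above. It is not the code from this PR: the signatures (taking `char`), the `old_dispatch` stand-in for the previous condition, and the `is_valid_name` helper are assumptions made for illustration.

```rust
/// Sketch of the start predicate: ASCII letters, underscore, or any
/// non-ASCII character (for example `⊥`). Signature is an assumption.
fn is_name_start(ch: char) -> bool {
    ch.is_ascii_alphabetic() || ch == '_' || !ch.is_ascii()
}

/// Sketch of the continue predicate: the same set plus ASCII digits.
fn is_name_continue(ch: char) -> bool {
    ch.is_ascii_alphanumeric() || ch == '_' || !ch.is_ascii()
}

/// Hypothetical stand-in for the previous dispatch condition, which the
/// description says was based on `is_alphanumeric`.
fn old_dispatch(ch: char) -> bool {
    ch.is_alphanumeric()
}

/// Hypothetical helper: checks a whole name with the two predicates.
fn is_valid_name(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // The first character must satisfy the start predicate,
        // every remaining character the continue predicate.
        Some(first) if is_name_start(first) => chars.all(is_name_continue),
        // Empty input or a bad first character is not a name.
        _ => false,
    }
}
```

Under these assumptions, `old_dispatch('1')` is true while `is_name_start('1')` is false, and `old_dispatch('_')` is false while `is_name_start('_')` is true, which is the mismatch the description mentions.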
ehuss reviewed on Mar 4, 2026
ehuss approved these changes on Mar 4, 2026