feat: add StringTokenScannerSymbols for configurable multi-character delimiters (fixes #195) #1303

Open
jmoraleda wants to merge 2 commits into HubSpot:master from jmoraleda:master

Closes #195.

Python's Jinja2 allows full customization of the six delimiter strings via its Environment constructor (block_start_string, block_end_string, variable_start_string, variable_end_string, comment_start_string, comment_end_string), plus line_statement_prefix and line_comment_prefix. Jinjava has no equivalent for arbitrary delimiter strings, which makes it hard to use Jinja-style templating in contexts where {{, {%, or {# appear as literal content (e.g. LaTeX documents, some JSON schemas, or Kubernetes YAML with Helm-style markers).

What this PR adds:

A new StringTokenScannerSymbols class with a builder API that allows all six delimiter strings to be configured independently, with no constraint on length or shared prefix characters:

JinjavaConfig config = JinjavaConfig.newBuilder()
    .withTokenScannerSymbols(StringTokenScannerSymbols.builder()
        .withVariableStartString("\\VAR{")
        .withVariableEndString("}")
        .withBlockStartString("\\BLOCK{")
        .withBlockEndString("}")
        .withCommentStartString("\\#{")
        .withCommentEndString("}")
        .withLineStatementPrefix("%%")
        .withLineCommentPrefix("%#")
        .build())
    .build();

Changes:

  • StringTokenScannerSymbols (new) — builder-configured TokenScannerSymbols implementation. Uses Unicode Private Use Area sentinel characters as internal token-kind discriminators so Token.newToken() dispatches correctly without changes to Token.

  • TokenScanner — adds a string-matching scan path (getNextTokenStringBased()) activated when symbols.isStringBased() is true. The original char-based path is completely unchanged. Also supports lineStatementPrefix and lineCommentPrefix, matching Python Jinja2 semantics including indented prefixes.

  • TokenScannerSymbols — adds isStringBased() (default false), six delimiter-length accessors (getTagStartLength() etc.), and two optional line-prefix accessors (getLineStatementPrefix(), getLineCommentPrefix()). All default implementations preserve existing behaviour.

  • TagToken, ExpressionToken, NoteToken — replaced hardcoded delimiter offsets with calls to the new length accessors on symbols. This is a correctness fix that affects all TokenScannerSymbols implementations, not just StringTokenScannerSymbols: ExpressionToken.parse() was calling WhitespaceUtils.unwrap(image, "{{", "}}") with literal strings regardless of the configured symbols, meaning any custom char-based subclass (like the one in CustomTokenScannerSymbolsTest) would silently fail to strip its expression delimiters. The fix uses symbols.getExpressionStart() and symbols.getExpressionEnd() instead.
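For reviewers who want a picture of the string-matching idea before reading the diff, here is a minimal, self-contained sketch. It is not the code in this PR: the class name DelimiterScanSketch, the Tok type, the specific Private Use Area values, and the longest-match rule for start delimiters that share a prefix are all assumptions made for illustration. It also ignores quoted strings inside delimiters and the line-prefix features.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of string-based delimiter scanning; not Jinjava's TokenScanner. */
final class DelimiterScanSketch {
    // Private Use Area characters used as token-kind tags (values chosen arbitrarily for this sketch).
    static final char KIND_TEXT = '\uE000';
    static final char KIND_EXPR = '\uE001';
    static final char KIND_TAG = '\uE002';
    static final char KIND_NOTE = '\uE003';

    // Kind assigned to starts[0], starts[1], starts[2] respectively.
    private static final char[] KINDS = {KIND_EXPR, KIND_TAG, KIND_NOTE};

    static final class Tok {
        final char kind;
        final String image;

        Tok(char kind, String image) {
            this.kind = kind;
            this.image = image;
        }
    }

    /** starts/ends hold the expression, tag, and comment delimiters, in that order. */
    static List<Tok> scan(String src, String[] starts, String[] ends) {
        List<Tok> out = new ArrayList<>();
        int textStart = 0;
        int i = 0;
        while (i < src.length()) {
            int hit = -1;
            for (int k = 0; k < starts.length; k++) {
                // Assumption: when several start strings match here, the longest one wins.
                if (src.startsWith(starts[k], i)
                        && (hit < 0 || starts[k].length() > starts[hit].length())) {
                    hit = k;
                }
            }
            if (hit < 0) {
                i++;
                continue;
            }
            int close = src.indexOf(ends[hit], i + starts[hit].length());
            if (close < 0) {
                break; // unterminated delimiter: the remainder is emitted as plain text
            }
            if (i > textStart) {
                out.add(new Tok(KIND_TEXT, src.substring(textStart, i)));
            }
            int end = close + ends[hit].length();
            out.add(new Tok(KINDS[hit], src.substring(i, end)));
            i = end;
            textStart = end;
        }
        if (textStart < src.length()) {
            out.add(new Tok(KIND_TEXT, src.substring(textStart)));
        }
        return out;
    }
}
```

Tagging each token with a single sentinel character keeps downstream dispatch a plain comparison on one char, whatever the configured delimiter strings look like.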

Backward compatibility:

The char-based scan path and all existing TokenScannerSymbols subclasses are completely unaffected. Each new length accessor on TokenScannerSymbols defaults to the length of the corresponding delimiter string, which is always 2 for DefaultTokenScannerSymbols. The full test suite passes without modification.
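To make the delimiter-stripping fix and the default length accessors concrete, here is a small standalone sketch. Every name in it (UnwrapSketch, Symbols, expressionStartLength(), and so on) is hypothetical and chosen for illustration; the real accessors and WhitespaceUtils.unwrap in Jinjava may differ in detail. It contrasts stripping with literal "{{" and "}}" against stripping driven by the configured symbols.

```java
/** Hypothetical illustration of length-driven delimiter stripping; not Jinjava's API. */
final class UnwrapSketch {
    interface Symbols {
        String expressionStart();

        String expressionEnd();

        // Defaults derive lengths from the strings, so existing implementations need no changes.
        default int expressionStartLength() {
            return expressionStart().length();
        }

        default int expressionEndLength() {
            return expressionEnd().length();
        }
    }

    static Symbols symbols(String start, String end) {
        return new Symbols() {
            @Override
            public String expressionStart() {
                return start;
            }

            @Override
            public String expressionEnd() {
                return end;
            }
        };
    }

    /** Old behaviour in spirit: literal delimiters, so custom symbols are never stripped. */
    static String unwrapHardcoded(String image) {
        String s = image.trim();
        if (s.length() >= 4 && s.startsWith("{{") && s.endsWith("}}")) {
            s = s.substring(2, s.length() - 2);
        }
        return s.trim();
    }

    /** Strips whatever delimiters the given symbols define, using their lengths. */
    static String unwrap(String image, Symbols symbols) {
        String s = image.trim();
        int startLen = symbols.expressionStartLength();
        int endLen = symbols.expressionEndLength();
        if (s.length() >= startLen + endLen
                && s.startsWith(symbols.expressionStart())
                && s.endsWith(symbols.expressionEnd())) {
            s = s.substring(startLen, s.length() - endLen);
        }
        return s.trim();
    }
}
```

With the default-style symbols ("{{" and "}}") both helpers agree, which is the backward-compatibility claim in miniature; with "\\VAR{" and "}" only the symbol-driven version removes the delimiters.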

Development

Successfully merging this pull request may close these issues.

Allow configurable block/variable/comment starts and ends
