
feat: extract tokenizer into separate project with multi-tokenizer support#256

Merged
ModerRAS merged 6 commits into master from
feat/extract-tokenizer-v2
Apr 22, 2026
Conversation

@ModerRAS
Owner

@ModerRAS ModerRAS commented Mar 27, 2026

Summary

  • Create new TelegramSearchBot.Tokenizer project with abstraction layer for multi-tokenizer support
  • Refactor Search project to use ITokenizer interface instead of direct implementation
  • Add comprehensive unit tests (18 tests for tokenizer)

Changes

New Project: TelegramSearchBot.Tokenizer

  • Abstractions:

    • ITokenizer interface (Tokenize, SafeTokenize, TokenizeWithOffsets)
    • ITokenizerFactory for factory pattern
    • TokenizerMetadata for tokenizer info
    • TokenizerType enum for tokenizer types
    • TokenWithOffset record for offset information
  • Implementations:

    • SmartChineseTokenizer using Lucene SmartChineseAnalyzer
    • TokenizerFactory implementation
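Pieced together from the member names listed above, the abstraction layer presumably looks something like the following sketch (the TokenizerMetadata fields and exact signatures are assumptions, not copied from the PR):

```csharp
using System.Collections.Generic;

namespace TelegramSearchBot.Tokenizer.Abstractions
{
    // Token plus its character offsets in the source text.
    // End is assumed exclusive, following Lucene's IOffsetAttribute convention.
    public record TokenWithOffset(int Start, int End, string Term);

    // Descriptive info about a tokenizer (field names here are illustrative).
    public record TokenizerMetadata(string Name, TokenizerType Type, string Description);

    public enum TokenizerType
    {
        SmartChinese
    }

    public interface ITokenizer
    {
        // May throw on invalid input.
        IReadOnlyList<string> Tokenize(string text);

        // Never throws; falls back to a safe result on error.
        IReadOnlyList<string> SafeTokenize(string text);

        // Tokens in stream order, each with start/end offsets.
        IReadOnlyList<TokenWithOffset> TokenizeWithOffsets(string text);

        TokenizerMetadata Metadata { get; }
    }

    public interface ITokenizerFactory
    {
        ITokenizer Create(TokenizerType type);
    }
}
```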

New Test Project: TelegramSearchBot.Tokenizer.Tests

  • 18 unit tests covering ITokenizer, SmartChineseTokenizer, and TokenizerFactory

Updated: Search Project

  • LuceneManager, SimpleSearchService, SyntaxSearchService
  • PhraseQueryProcessor, ContentQueryBuilder, ExtQueryBuilder
  • SearchHelper now uses ITokenizer via SetTokenizer()

Removed

  • TelegramSearchBot.Search/Tokenizer/UnifiedTokenizer.cs (duplicate implementation)
  • Dead code in SimpleSearchService

Verification

  • Build: 0 errors
  • Tests: 398 Passed, 0 Failed

Future Extensibility

Adding new tokenizers (e.g., JiebaTokenizer) now only requires:

  1. Add new type to TokenizerType enum
  2. Implement ITokenizer interface
  3. Add case to TokenizerFactory.Create()
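As a concrete illustration of those three steps (hypothetical: JiebaTokenizer does not exist in this PR, the whitespace split below merely stands in for real Jieba segmentation, and the interface is simplified to its three methods):

```csharp
using System;
using System.Collections.Generic;

// 1. Add the new type to the TokenizerType enum.
public enum TokenizerType { SmartChinese, Jieba }

public record TokenWithOffset(int Start, int End, string Term);

public interface ITokenizer
{
    IReadOnlyList<string> Tokenize(string text);
    IReadOnlyList<string> SafeTokenize(string text);
    IReadOnlyList<TokenWithOffset> TokenizeWithOffsets(string text);
}

// 2. Implement the ITokenizer interface.
public class JiebaTokenizer : ITokenizer
{
    public IReadOnlyList<string> Tokenize(string text) =>
        text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

    public IReadOnlyList<string> SafeTokenize(string text)
    {
        try { return Tokenize(text); }
        catch { return new[] { text }; } // fall back to the raw text
    }

    public IReadOnlyList<TokenWithOffset> TokenizeWithOffsets(string text)
    {
        var result = new List<TokenWithOffset>();
        int pos = 0;
        foreach (var term in Tokenize(text))
        {
            int start = text.IndexOf(term, pos, StringComparison.Ordinal);
            result.Add(new TokenWithOffset(start, start + term.Length, term));
            pos = start + term.Length;
        }
        return result;
    }
}

// 3. Add a case to the factory's Create() switch.
public class TokenizerFactory
{
    public ITokenizer Create(TokenizerType type) => type switch
    {
        TokenizerType.Jieba => new JiebaTokenizer(),
        _ => throw new NotSupportedException($"No tokenizer registered for {type}")
    };
}
```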

Summary by CodeRabbit

  • Refactor
    • Introduced a pluggable tokenizer module and replaced concrete tokenizer usage with an abstract tokenizer API; search components now use the tokenizer abstraction.
  • Tests
    • Added unit tests for the tokenizer contract, factory, and concrete tokenizer behavior.
  • Documentation
    • Added an agent development guide covering build, test, formatting, and contribution conventions.
  • Chores
    • Added new tokenizer projects and test project to the solution.

…pport

- Create new TelegramSearchBot.Tokenizer project with abstraction layer:
  - ITokenizer interface (Tokenize, SafeTokenize, TokenizeWithOffsets)
  - ITokenizerFactory for factory pattern
  - TokenizerMetadata for tokenizer info
  - TokenizerType enum for tokenizer types

- Implement SmartChineseTokenizer using Lucene SmartChineseAnalyzer
- Add TokenizerFactory implementation
- Create Tokenizer.Tests project with 18 unit tests

- Update Search project to use ITokenizer interface:
  - LuceneManager, SimpleSearchService, SyntaxSearchService
  - PhraseQueryProcessor, ContentQueryBuilder, ExtQueryBuilder

- Refactor SearchHelper to use ITokenizer with SetTokenizer()
- Remove UnifiedTokenizer duplicate implementation
- Delete dead code in SimpleSearchService

All 398 tests pass (398 Passed, 0 Failed).
@coderabbitai

coderabbitai Bot commented Mar 27, 2026

Warning

Rate limit exceeded

@ModerRAS has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 18 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 39 minutes and 18 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a0a6a5c1-427c-461a-806d-bbe918f91ef2

📥 Commits

Reviewing files that changed from the base of the PR and between 1aee613 and 2de4c38.

📒 Files selected for processing (16)
  • TelegramSearchBot.Search.Test/PhraseQueryProcessorTests.cs
  • TelegramSearchBot.Search.Test/SearchHelperTests.cs
  • TelegramSearchBot.Search.Test/TelegramSearchBot.Search.Test.csproj
  • TelegramSearchBot.Search/Tool/LuceneManager.cs
  • TelegramSearchBot.Search/Tool/PhraseQueryProcessor.cs
  • TelegramSearchBot.Search/Tool/SearchHelper.cs
  • TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs
  • TelegramSearchBot.Tokenizer.Tests/TelegramSearchBot.Tokenizer.Tests.csproj
  • TelegramSearchBot.Tokenizer.Tests/TokenizerFactoryTests.cs
  • TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs
  • TelegramSearchBot.Tokenizer/Abstractions/ITokenizerFactory.cs
  • TelegramSearchBot.Tokenizer/Abstractions/TokenizerMetadata.cs
  • TelegramSearchBot.Tokenizer/Abstractions/TokenizerType.cs
  • TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs
  • TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs
  • TelegramSearchBot.sln
📝 Walkthrough

Walkthrough

Tokenization was extracted into a new TelegramSearchBot.Tokenizer project (abstractions, SmartChinese implementation, factory, and tests). Search code and tools were switched from the removed UnifiedTokenizer to the new ITokenizer abstraction; SearchHelper was refactored to use token offsets and a pluggable tokenizer.

Changes

Cohort / File(s) Summary
New Tokenizer Abstractions
TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs, TelegramSearchBot.Tokenizer/Abstractions/ITokenizerFactory.cs, TelegramSearchBot.Tokenizer/Abstractions/TokenizerMetadata.cs, TelegramSearchBot.Tokenizer/Abstractions/TokenizerType.cs
Added ITokenizer (and TokenWithOffset), ITokenizerFactory, TokenizerMetadata, and TokenizerType enum.
New Tokenizer Implementations
TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs, TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs
Added SmartChineseTokenizer implementing ITokenizer (Tokenize, SafeTokenize, TokenizeWithOffsets) and TokenizerFactory that creates tokenizer instances.
Search services switched to ITokenizer
TelegramSearchBot.Search/Service/SimpleSearchService.cs, TelegramSearchBot.Search/Service/SyntaxSearchService.cs
Constructor/field types changed from UnifiedTokenizer to ITokenizer; tokenization results are now materialized with .ToList().
Query & phrase builders → ITokenizer
TelegramSearchBot.Search/Tool/ContentQueryBuilder.cs, TelegramSearchBot.Search/Tool/ExtQueryBuilder.cs, TelegramSearchBot.Search/Tool/PhraseQueryProcessor.cs
Dependencies updated to ITokenizer; tokenizer outputs are explicitly materialized (.ToList()) where used.
Lucene manager updates
TelegramSearchBot.Search/Tool/LuceneManager.cs
Private tokenizer field changed to ITokenizer; now instantiates SmartChineseTokenizer and passes ITokenizer to downstream components.
SearchHelper refactor
TelegramSearchBot.Search/Tool/SearchHelper.cs
Removed Lucene token-stream helpers; added static pluggable ITokenizer (Set/Get); updated FindBestSnippet and matching helpers to use TokenWithOffset and new tokenizer APIs.
Removed legacy tokenizer
TelegramSearchBot.Search/Tokenizer/UnifiedTokenizer.cs
Deleted the previous UnifiedTokenizer implementation.
Tests added
TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs, TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs, TelegramSearchBot.Tokenizer.Tests/TokenizerFactoryTests.cs, TelegramSearchBot.Tokenizer.Tests/TelegramSearchBot.Tokenizer.Tests.csproj
New unit tests for ITokenizer contract, SmartChineseTokenizer behavior, and TokenizerFactory; new test project added.
Project & solution updates
TelegramSearchBot.Tokenizer/TelegramSearchBot.Tokenizer.csproj, TelegramSearchBot.Search/TelegramSearchBot.Search.csproj, TelegramSearchBot.sln
Added Tokenizer project with Lucene deps, Search project now references Tokenizer, and solution updated with new projects/configurations.
Misc small edits
AGENTS.md, tests formatting, minor whitespace changes in TelegramSearchBot.Test/*, TelegramSearchBot.Model/*, TelegramSearchBot.Service/*
Documentation added (AGENTS.md) and several non-functional formatting/whitespace/test tweaks.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant SearchSvc as SearchService
    participant LuceneMgr as LuceneManager
    participant Tokenizer as SmartChineseTokenizer
    participant Lucene as SmartChineseAnalyzer
    Client->>SearchSvc: submit query/text
    SearchSvc->>LuceneMgr: request query build / snippet
    LuceneMgr->>Tokenizer: Tokenize / TokenizeWithOffsets
    Tokenizer->>Lucene: create token stream
    Lucene-->>Tokenizer: tokens and offsets
    Tokenizer-->>LuceneMgr: IReadOnlyList(string) / IReadOnlyList(TokenWithOffset)
    LuceneMgr-->>SearchSvc: assembled query parts / snippet candidates
    SearchSvc-->>Client: results
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through code in moonlit light,

Tokens scattered, then tucked up tight.
An interface, a brand-new tune—
Lucene sings beneath the moon,
A rabbit claps for every byte! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 4.55%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The PR title clearly and accurately summarizes the main objective: extracting tokenizer functionality into a separate, reusable project with support for multiple tokenizer implementations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@github-actions
Contributor

github-actions Bot commented Mar 27, 2026

🔍 PR Check Report

📋 Check Overview

🧪 Test Results

Platform | Status | Details
Ubuntu | 🔴 Failed | Test results unavailable
Windows | 🔴 Failed | Test results unavailable

📊 Code Quality

  • ✅ Code formatting check
  • ✅ Security vulnerability scan
  • ✅ Dependency analysis
  • ✅ Code coverage collection

📁 Test Artifacts

  • Test result files uploaded as artifacts
  • Code coverage uploaded to Codecov

🔗 Related Links


This report was generated automatically by GitHub Actions


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
TelegramSearchBot.Search/Tool/PhraseQueryProcessor.cs (1)

38-51: ⚠️ Potential issue | 🟠 Major

Phrase queries will fail for repeated tokens due to deduplication.

The SafeTokenize() method (via SmartChineseTokenizer.Tokenize()) uses a HashSet<string> internally (see SmartChineseTokenizer.cs:29), which:

  1. Loses duplicate tokens — A phrase like "北京 北京" produces only ["北京"] instead of ["北京", "北京"]
  2. Does not guarantee token order, since HashSet ordering is undefined

This breaks BuildPhraseQueryForField() which relies on positional indices (lines 67-68). For phrase search, the tokenizer must preserve both order and duplicates.

Consider using TokenizeWithOffsets() which returns tokens in order (from Lucene's token stream), or create a separate tokenization method that preserves duplicates:

-                    var terms = _tokenizer.SafeTokenize(phraseText).ToList();
+                    // Use TokenizeWithOffsets to preserve order and duplicates
+                    var terms = _tokenizer.TokenizeWithOffsets(phraseText)
+                        .Select(t => t.Term)
+                        .ToList();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Search/Tool/PhraseQueryProcessor.cs` around lines 38 - 51,
SafeTokenize (via SmartChineseTokenizer.Tokenize) uses a HashSet and drops
duplicates and ordering, which breaks BuildPhraseQueryForField's positional
logic; replace the call to _tokenizer.SafeTokenize(phraseText) in the
PhraseRegex loop with an ordered tokenization that preserves duplicates and
positions (e.g., use TokenizeWithOffsets or add a new tokenizer method that
returns tokens in sequence including repeated tokens), then pass that ordered
list to BuildPhraseQueryForField and _extOptimizer.BuildOptimizedExtPhraseQuery
so phrase position indices remain correct.
🧹 Nitpick comments (8)
TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs (1)

11-11: Consider caching tokenizer instances to avoid repeated Lucene analyzer initialization.

The factory creates a new SmartChineseTokenizer instance on each Create() call. Since SmartChineseAnalyzer (from Lucene.Net) initialization is expensive—involving dictionary and resource loading—and SmartChineseTokenizer is thread-safe (all fields are readonly after construction), caching instances per TokenizerType would improve performance without introducing thread-safety concerns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs` at line 11,
The factory's Create() returns a new SmartChineseTokenizer each call which
causes repeated expensive Lucene analyzer initialization; modify
TokenizerFactory to cache tokenizer instances per TokenizerType (e.g., store a
readonly/static ConcurrentDictionary<TokenizerType, ITokenizer> or a
lazy-initialized field) and return the cached SmartChineseTokenizer for
TokenizerType.SmartChinese in Create(), ensuring the cache is populated once
(thread-safe) and reused thereafter; reference TokenizerFactory.Create(),
TokenizerType.SmartChinese, SmartChineseTokenizer and SmartChineseAnalyzer when
making the change.
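A cached variant along the lines of that prompt might look like this (a sketch with stub types standing in for the real classes; it assumes, as the comment notes, that tokenizer instances are thread-safe):

```csharp
using System;
using System.Collections.Concurrent;

public enum TokenizerType { SmartChinese }
public interface ITokenizer { }
// Stub standing in for the real, expensive-to-construct Lucene-backed tokenizer.
public class SmartChineseTokenizer : ITokenizer { }

public class CachingTokenizerFactory
{
    // One shared instance per tokenizer type, created on first use.
    private static readonly ConcurrentDictionary<TokenizerType, ITokenizer> Cache = new();

    public ITokenizer Create(TokenizerType type) =>
        Cache.GetOrAdd(type, t => t switch
        {
            TokenizerType.SmartChinese => new SmartChineseTokenizer(),
            _ => throw new NotSupportedException($"No tokenizer registered for {t}")
        });
}
```

One caveat: GetOrAdd's value factory can run more than once under contention (only one result is kept); wrapping values in Lazy<ITokenizer> closes that gap if construction must happen exactly once.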
TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs (2)

3-3: Document that End offset is exclusive.

Based on SmartChineseTokenizer.TokenizeWithOffsets() which uses Lucene's IOffsetAttribute.EndOffset (exclusive), the End property follows Lucene conventions. This should be documented to prevent off-by-one errors.

-public record TokenWithOffset(int Start, int End, string Term);
+/// <summary>
+/// Represents a token with its character offsets in the source text.
+/// </summary>
+/// <param name="Start">Inclusive start offset.</param>
+/// <param name="End">Exclusive end offset (Lucene convention).</param>
+/// <param name="Term">The extracted token text.</param>
+public record TokenWithOffset(int Start, int End, string Term);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs` at line 3, Add
documentation to the TokenWithOffset record clarifying that the End offset is
exclusive (follows Lucene's IOffsetAttribute.EndOffset semantics) to avoid
off-by-one mistakes; update the XML doc comment on the TokenWithOffset
declaration (and mention the End property) to state explicitly that Start is
inclusive and End is exclusive and that offsets come from
SmartChineseTokenizer.TokenizeWithOffsets()/Lucene conventions.

5-11: Consider adding XML documentation to clarify method contracts.

The interface design is clean, but the distinction between Tokenize() and SafeTokenize() is unclear from the interface alone. Adding documentation would help implementers and consumers understand the expected behavior.

📝 Proposed documentation
 public interface ITokenizer
 {
+    /// <summary>
+    /// Tokenizes the input text. May throw exceptions on invalid input.
+    /// </summary>
+    /// <param name="text">Text to tokenize.</param>
+    /// <returns>List of tokens extracted from the text.</returns>
     IReadOnlyList<string> Tokenize(string text);
+    
+    /// <summary>
+    /// Tokenizes the input text with exception handling.
+    /// Returns empty list instead of throwing on errors.
+    /// </summary>
+    /// <param name="text">Text to tokenize.</param>
+    /// <returns>List of tokens, or empty list if tokenization fails.</returns>
     IReadOnlyList<string> SafeTokenize(string text);
+    
+    /// <summary>
+    /// Tokenizes the input text and returns tokens with their character offsets.
+    /// </summary>
+    /// <param name="text">Text to tokenize.</param>
+    /// <returns>List of tokens with start/end offsets (end is exclusive).</returns>
     IReadOnlyList<TokenWithOffset> TokenizeWithOffsets(string text);
+    
+    /// <summary>
+    /// Metadata describing this tokenizer implementation.
+    /// </summary>
     TokenizerMetadata Metadata { get; }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs` around lines 5 - 11,
Add XML documentation to the ITokenizer interface to clarify the contract for
each member: document ITokenizer.Tokenize(string) describing expected
normalization, token boundaries, behavior on null/empty input and whether it may
throw; document ITokenizer.SafeTokenize(string) to state how it differs (e.g.,
never throws, returns empty list on invalid input, or falls back to a safe
tokenization strategy); document TokenizeWithOffsets(string) to specify that
returned TokenWithOffset values align with original string indices and how
offsets are computed; and document the Metadata property to explain what
tokenizer capabilities and limits it exposes. Use the method/property names
(Tokenize, SafeTokenize, TokenizeWithOffsets, Metadata, ITokenizer) so
implementers can follow the exact expected behaviors.
TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs (2)

35-47: Test name is misleading - SafeTokenize error handling isn't actually tested.

The test SafeTokenize_ReturnsNonEmptyList_EvenOnError doesn't actually test error handling behavior. The MockTokenizer.SafeTokenize simply delegates to Tokenize without any error simulation. Consider either:

  1. Renaming to SafeTokenize_ReturnsNonEmptyList_ForValidText
  2. Or adding a mock that throws in Tokenize to verify SafeTokenize handles it gracefully
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs` around lines 35 - 47,
The test name claims to validate SafeTokenize's error handling but
MockTokenizer.SafeTokenize just delegates to Tokenize and no error is simulated;
either rename the test to SafeTokenize_ReturnsNonEmptyList_ForValidText to
reflect the current behavior, or modify the test to validate error handling by
creating a mock/stub whose Tokenize method throws (or returns null) and then
call SafeTokenize to assert it still returns a non-empty, non-null result;
target MockTokenizer and its Tokenize/SafeTokenize methods when making the
change.

73-87: Missing test coverage for TokenizeWithOffsets.

The MockTokenizer implements TokenizeWithOffsets but no test exercises this method. Consider adding a test to validate the contract:

📝 Suggested test
[Fact]
public void TokenizeWithOffsets_ReturnsTokensWithCorrectOffsets()
{
    // Arrange
    var tokenizer = new MockTokenizer();
    
    // Act
    var result = tokenizer.TokenizeWithOffsets("hello world");
    
    // Assert
    Assert.NotNull(result);
    Assert.Equal(2, result.Count);
    Assert.Equal(0, result[0].Start);
    Assert.Equal(5, result[0].End);
    Assert.Equal("hello", result[0].Term);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs` around lines 73 - 87,
Add a unit test that exercises MockTokenizer.TokenizeWithOffsets to verify it
returns correct TokenWithOffset entries: create a test (e.g.,
TokenizeWithOffsets_ReturnsTokensWithCorrectOffsets) that instantiates
MockTokenizer, calls TokenizeWithOffsets("hello world"), asserts result is not
null, has Count == 2, and checks the first token's Start == 0, End == 5 and Term
== "hello" (and optionally checks the second token offsets/term); this ensures
the TokenWithOffset contract is covered.
TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs (2)

99-125: Log callback tests don't verify actual logging behavior.

The tests with log callbacks only verify the methods don't throw, but don't assert that the callback was actually invoked or verify log content. If the goal is to test logging integration, consider:

📝 Example enhancement for logging verification
 [Fact]
-public void Tokenize_WithLogAction_DoesNotThrow()
+public void Tokenize_WithLogAction_InvokesCallbackOnFailure()
 {
     // Arrange
     var logger = new List<string>();
-    var tokenizerWithLog = new SmartChineseTokenizer(msg => logger.Add(msg));
-    
-    // Act
-    var result = tokenizerWithLog.Tokenize("测试文本");
+    var tokenizerWithLog = new SmartChineseTokenizer(msg => logger.Add(msg));
     
-    // Assert
+    // Act - trigger an error scenario if possible, or verify callback mechanism
+    var result = tokenizerWithLog.Tokenize("测试文本");
+    
+    // Assert - at minimum verify the tokenizer works with callback attached
     Assert.NotNull(result);
+    // Note: logger may be empty for successful tokenization
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs` around lines
99 - 125, The tests Tokenize_WithLogAction_DoesNotThrow and
SafeTokenize_WithLogAction_DoesNotThrow only assert no exception; update them to
also verify the provided log callback was invoked by asserting on the logger
list (e.g., Assert.True(logger.Count > 0) and/or
Assert.Contains(expectedMessageSubstring, logger)) after calling
SmartChineseTokenizer.Tokenize and SafeTokenize respectively; reference the
SmartChineseTokenizer constructor (msg => logger.Add(msg)), the Tokenize and
SafeTokenize calls, and assert logger contents or count to confirm actual
logging behavior.

6-126: Missing test coverage for TokenizeWithOffsets.

The SmartChineseTokenizer.TokenizeWithOffsets method is not tested. This method is used for offset-based matching in search results and should have dedicated tests verifying correct offset calculation.

📝 Suggested tests
[Fact]
public void TokenizeWithOffsets_ReturnsTokensWithValidOffsets()
{
    // Act
    var result = _tokenizer.TokenizeWithOffsets("今天天气真好");
    
    // Assert
    Assert.NotNull(result);
    Assert.NotEmpty(result);
    Assert.All(result, token => 
    {
        Assert.True(token.Start >= 0);
        Assert.True(token.End > token.Start);
        Assert.False(string.IsNullOrEmpty(token.Term));
    });
}

[Fact]
public void TokenizeWithOffsets_ReturnsNonEmptyList_ForEmptyText()
{
    // Act
    var result = _tokenizer.TokenizeWithOffsets("");
    
    // Assert
    Assert.NotNull(result);
    Assert.Empty(result);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs` around lines
6 - 126, Add unit tests covering SmartChineseTokenizer.TokenizeWithOffsets:
create tests in SmartChineseTokenizerTests that call
_tokenizer.TokenizeWithOffsets with Chinese text and with an empty string, then
assert the results are non-null and non-empty for normal text and empty for
empty input; for the normal-text test, iterate returned tokens and assert each
token has Term non-empty, Start >= 0 and End > Start to verify offset
correctness. Ensure tests are named like
TokenizeWithOffsets_ReturnsTokensWithValidOffsets and
TokenizeWithOffsets_ReturnsEmptyList_ForEmptyText and use the existing
_tokenizer instance (or construct a tokenizer with a log action when needed).
TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs (1)

54-70: SafeTokenize catch block may be unreachable.

The outer catch block in SafeTokenize (lines 60-69) may never execute because Tokenize() already catches all exceptions internally and returns a fallback list. Unless Tokenize's catch block itself throws (unlikely), the SafeTokenize catch is defensive but unreachable.

This is fine as defensive programming, but consider if this is the intended behavior or if Tokenize should propagate exceptions while only SafeTokenize handles them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs` around
lines 54 - 70, SafeTokenize's catch is effectively unreachable because Tokenize
currently swallows exceptions internally; decide to centralize error handling by
letting Tokenize propagate errors and having SafeTokenize perform the fallback.
Change Tokenize (the method called by SafeTokenize) to remove or rethrow its
internal catch (or at least rethrow after logging) so exceptions reach
SafeTokenize, which will then log via _logAction and return the fallback token
list in SafeTokenize; alternatively if you prefer Tokenize to keep its own
fallback, remove the outer try/catch in SafeTokenize to avoid dead code—update
Tokenize or SafeTokenize accordingly so only one of them handles exceptions.
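The "centralize error handling" option from that prompt could be sketched like this (hypothetical: the injected delegate stands in for the real analyzer call; the point is that Tokenize propagates and only SafeTokenize catches):

```csharp
using System;
using System.Collections.Generic;

public class ThrowAwareTokenizer
{
    private readonly Func<string, IReadOnlyList<string>> _inner;
    private readonly Action<string> _logAction;

    public ThrowAwareTokenizer(Func<string, IReadOnlyList<string>> inner,
                               Action<string> logAction = null)
    {
        _inner = inner;
        _logAction = logAction;
    }

    // Propagates analyzer errors to the caller; no internal catch.
    public IReadOnlyList<string> Tokenize(string text) => _inner(text);

    // The single place that handles errors: log, then fall back to the raw text.
    public IReadOnlyList<string> SafeTokenize(string text)
    {
        try { return Tokenize(text); }
        catch (Exception ex)
        {
            _logAction?.Invoke($"Tokenize failed: {ex.Message}");
            return new[] { text };
        }
    }
}
```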
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@TelegramSearchBot.Search/Tool/SearchHelper.cs`:
- Around line 10-25: The _tokenizer static is mutated without synchronization
causing races in GetTokenizer and SetTokenizer and leaking state across tests;
make initialization thread-safe (e.g., replace the current nullable _tokenizer
with a thread-safe Lazy<ITokenizer> or add a private static readonly lock object
like _tokenizerLock and use double‑check locking inside GetTokenizer and lock in
SetTokenizer) so lazy creation of
TelegramSearchBot.Tokenizer.Implementations.SmartChineseTokenizer is safe, and
add an internal ResetTokenizer() helper that clears the tokenizer under the same
lock so SearchHelperTests.cs can reset state between tests.
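That suggestion might be sketched as follows (hypothetical aside from the SetTokenizer name; stub types stand in for the real ITokenizer and SmartChineseTokenizer):

```csharp
public interface ITokenizer { }
// Stub standing in for the real Lucene-backed tokenizer.
public class SmartChineseTokenizer : ITokenizer { }

public static class SearchHelperTokenizerState
{
    private static ITokenizer _tokenizer;
    private static readonly object TokenizerLock = new object();

    public static void SetTokenizer(ITokenizer tokenizer)
    {
        lock (TokenizerLock) { _tokenizer = tokenizer; }
    }

    public static ITokenizer GetTokenizer()
    {
        var current = _tokenizer;
        if (current != null) return current; // fast path, no lock
        lock (TokenizerLock)                 // double-checked lazy init
        {
            return _tokenizer ??= new SmartChineseTokenizer();
        }
    }

    // Test-only hook so tests can reset shared state between runs.
    public static void ResetTokenizer()
    {
        lock (TokenizerLock) { _tokenizer = null; }
    }
}
```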

In `@TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs`:
- Around line 75-103: TokenizeWithOffsets is missing a tokenStream.End() call
like Tokenize has; inside the try block, after the while
(tokenStream.IncrementToken()) loop and before disposing tokenStream, call
tokenStream.End() to properly finalize the token stream (and then you can read
final offsetAttribute values if needed) so the analyzer is correctly terminated
and the stream is ended before returning tokens.
- Around line 30-52: The TokenStream lifecycle in SmartChineseTokenizer.cs is
missing a call to tokenStream.End() after the token iteration; update the try
block so that after the while (tokenStream.IncrementToken()) loop completes you
call tokenStream.End() before disposing (still within the using scope) to ensure
end-of-stream finalization (offsets/positions) is applied; keep the existing
exception handling and logging around _logAction and retain adding the original
text to keywords on error.
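The full TokenStream lifecycle the two comments above describe (Reset, IncrementToken loop, End, Dispose) would look roughly like this in Lucene.NET; the field name "content" and the method shape are illustrative, not taken from the PR:

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

public static class TokenStreamLifecycleSketch
{
    public static IReadOnlyList<(int Start, int End, string Term)> TokenizeWithOffsets(
        Analyzer analyzer, string text)
    {
        var tokens = new List<(int, int, string)>();
        using var tokenStream = analyzer.GetTokenStream("content", text);
        var termAttr = tokenStream.AddAttribute<ICharTermAttribute>();
        var offsetAttr = tokenStream.AddAttribute<IOffsetAttribute>();

        tokenStream.Reset();
        while (tokenStream.IncrementToken())
        {
            tokens.Add((offsetAttr.StartOffset, offsetAttr.EndOffset, termAttr.ToString()));
        }
        tokenStream.End(); // finalize end-of-stream state before Dispose runs

        return tokens;
    }
}
```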

---

Outside diff comments:
In `@TelegramSearchBot.Search/Tool/PhraseQueryProcessor.cs`:
- Around line 38-51: SafeTokenize (via SmartChineseTokenizer.Tokenize) uses a
HashSet and drops duplicates and ordering, which breaks
BuildPhraseQueryForField's positional logic; replace the call to
_tokenizer.SafeTokenize(phraseText) in the PhraseRegex loop with an ordered
tokenization that preserves duplicates and positions (e.g., use
TokenizeWithOffsets or add a new tokenizer method that returns tokens in
sequence including repeated tokens), then pass that ordered list to
BuildPhraseQueryForField and _extOptimizer.BuildOptimizedExtPhraseQuery so
phrase position indices remain correct.

---

Nitpick comments:
In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs`:
- Around line 35-47: The test name claims to validate SafeTokenize's error
handling but MockTokenizer.SafeTokenize just delegates to Tokenize and no error
is simulated; either rename the test to
SafeTokenize_ReturnsNonEmptyList_ForValidText to reflect the current behavior,
or modify the test to validate error handling by creating a mock/stub whose
Tokenize method throws (or returns null) and then call SafeTokenize to assert it
still returns a non-empty, non-null result; target MockTokenizer and its
Tokenize/SafeTokenize methods when making the change.
- Around line 73-87: Add a unit test that exercises
MockTokenizer.TokenizeWithOffsets to verify it returns correct TokenWithOffset
entries: create a test (e.g.,
TokenizeWithOffsets_ReturnsTokensWithCorrectOffsets) that instantiates
MockTokenizer, calls TokenizeWithOffsets("hello world"), asserts result is not
null, has Count == 2, and checks the first token's Start == 0, End == 5 and Term
== "hello" (and optionally checks the second token offsets/term); this ensures
the TokenWithOffset contract is covered.

In `@TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs`:
- Around line 99-125: The tests Tokenize_WithLogAction_DoesNotThrow and
SafeTokenize_WithLogAction_DoesNotThrow only assert no exception; update them to
also verify the provided log callback was invoked by asserting on the logger
list (e.g., Assert.True(logger.Count > 0) and/or
Assert.Contains(expectedMessageSubstring, logger)) after calling
SmartChineseTokenizer.Tokenize and SafeTokenize respectively; reference the
SmartChineseTokenizer constructor (msg => logger.Add(msg)), the Tokenize and
SafeTokenize calls, and assert logger contents or count to confirm actual
logging behavior.
- Around line 6-126: Add unit tests covering
SmartChineseTokenizer.TokenizeWithOffsets: create tests in
SmartChineseTokenizerTests that call _tokenizer.TokenizeWithOffsets with Chinese
text and with an empty string, then assert the results are non-null and
non-empty for normal text and empty for empty input; for the normal-text test,
iterate returned tokens and assert each token has Term non-empty, Start >= 0 and
End > Start to verify offset correctness. Ensure tests are named like
TokenizeWithOffsets_ReturnsTokensWithValidOffsets and
TokenizeWithOffsets_ReturnsEmptyList_ForEmptyText and use the existing
_tokenizer instance (or construct a tokenizer with a log action when needed).

In `@TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs`:
- Line 3: Add documentation to the TokenWithOffset record clarifying that the
End offset is exclusive (follows Lucene's IOffsetAttribute.EndOffset semantics)
to avoid off-by-one mistakes; update the XML doc comment on the TokenWithOffset
declaration (and mention the End property) to state explicitly that Start is
inclusive and End is exclusive and that offsets come from
SmartChineseTokenizer.TokenizeWithOffsets()/Lucene conventions.
- Around line 5-11: Add XML documentation to the ITokenizer interface to clarify
the contract for each member: document ITokenizer.Tokenize(string) describing
expected normalization, token boundaries, behavior on null/empty input and
whether it may throw; document ITokenizer.SafeTokenize(string) to state how it
differs (e.g., never throws, returns empty list on invalid input, or falls back
to a safe tokenization strategy); document TokenizeWithOffsets(string) to
specify that returned TokenWithOffset values align with original string indices
and how offsets are computed; and document the Metadata property to explain what
tokenizer capabilities and limits it exposes. Use the method/property names
(Tokenize, SafeTokenize, TokenizeWithOffsets, Metadata, ITokenizer) so
implementers can follow the exact expected behaviors.
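Taken together, the two prompts above ask for XML docs on `TokenWithOffset` and on every `ITokenizer` member. A sketch of what that documented contract could look like (member signatures are inferred from the PR summary, so the exact return types are assumptions):

```csharp
using System.Collections.Generic;

/// <summary>
/// A token plus its character offsets into the original input. Offsets follow
/// Lucene's IOffsetAttribute semantics: Start is inclusive, End is exclusive,
/// so the token text equals input.Substring(Start, End - Start).
/// </summary>
public record TokenWithOffset(string Term, int Start, int End);

public interface ITokenizer {
    /// <summary>
    /// Splits text into terms. Returns an empty list for null or empty input;
    /// may throw if the underlying analyzer fails.
    /// </summary>
    List<string> Tokenize(string text);

    /// <summary>
    /// Like Tokenize, but never throws: on failure it logs and falls back to
    /// a safe strategy (for example, returning the raw input as one token).
    /// </summary>
    List<string> SafeTokenize(string text);

    /// <summary>
    /// Tokenizes and reports offsets indexing into the original string
    /// (Start inclusive, End exclusive, per the TokenWithOffset contract).
    /// </summary>
    List<TokenWithOffset> TokenizeWithOffsets(string text);

    /// <summary>Name, target language, and capabilities of this tokenizer.</summary>
    TokenizerMetadata Metadata { get; }
}
```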

In `@TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs`:
- Around line 54-70: SafeTokenize's catch is effectively unreachable because
Tokenize currently swallows exceptions internally; decide to centralize error
handling by letting Tokenize propagate errors and having SafeTokenize perform
the fallback. Change Tokenize (the method called by SafeTokenize) to remove or
rethrow its internal catch (or at least rethrow after logging) so exceptions
reach SafeTokenize, which will then log via _logAction and return the fallback
token list in SafeTokenize; alternatively if you prefer Tokenize to keep its own
fallback, remove the outer try/catch in SafeTokenize to avoid dead code—update
Tokenize or SafeTokenize accordingly so only one of them handles exceptions.
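The first option (Tokenize propagates, SafeTokenize owns the fallback) could look like this; `ReadTokens` is a hypothetical stand-in for the existing SmartChineseAnalyzer token-reading loop:

```csharp
public List<string> Tokenize(string text) {
    if (string.IsNullOrEmpty(text)) return new List<string>();
    return ReadTokens(text); // no try/catch here; let analyzer failures propagate
}

public List<string> SafeTokenize(string text) {
    try {
        return Tokenize(text);
    } catch (Exception ex) {
        _logAction?.Invoke($"Tokenize failed, using fallback: {ex.Message}");
        // Fallback: treat the whole input as a single token so callers
        // always get something searchable back.
        return string.IsNullOrEmpty(text)
            ? new List<string>()
            : new List<string> { text };
    }
}
```

With this shape, the catch in SafeTokenize is reachable again and there is exactly one error-handling point.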

In `@TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs`:
- Line 11: The factory's Create() returns a new SmartChineseTokenizer each call
which causes repeated expensive Lucene analyzer initialization; modify
TokenizerFactory to cache tokenizer instances per TokenizerType (e.g., store a
readonly/static ConcurrentDictionary<TokenizerType, ITokenizer> or a
lazy-initialized field) and return the cached SmartChineseTokenizer for
TokenizerType.SmartChinese in Create(), ensuring the cache is populated once
(thread-safe) and reused thereafter; reference TokenizerFactory.Create(),
TokenizerType.SmartChinese, SmartChineseTokenizer and SmartChineseAnalyzer when
making the change.
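A sketch of the cached factory under those assumptions (`ITokenizerFactory` is taken to expose a single `Create(TokenizerType)` method):

```csharp
using System;
using System.Collections.Concurrent;

public class TokenizerFactory : ITokenizerFactory {
    // One shared instance per type: GetOrAdd is thread-safe, so the expensive
    // SmartChineseAnalyzer setup effectively runs once and the same instance
    // is reused on every subsequent Create() call.
    private static readonly ConcurrentDictionary<TokenizerType, ITokenizer> _cache = new();

    public ITokenizer Create(TokenizerType type) =>
        _cache.GetOrAdd(type, t => t switch {
            TokenizerType.SmartChinese => new SmartChineseTokenizer(),
            _ => throw new ArgumentOutOfRangeException(nameof(type), t, "Unsupported tokenizer type")
        });
}
```

Note that a process-lifetime cache sits in tension with the IDisposable suggestion elsewhere in this review: cached instances are intended to live until process exit, so nothing ever disposes them.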


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dfc6637b-b22c-422e-ab4b-b3fb900f963d

📥 Commits

Reviewing files that changed from the base of the PR and between f30324f and cb9d4ba.

📒 Files selected for processing (21)
  • TelegramSearchBot.Search/Service/SimpleSearchService.cs
  • TelegramSearchBot.Search/Service/SyntaxSearchService.cs
  • TelegramSearchBot.Search/TelegramSearchBot.Search.csproj
  • TelegramSearchBot.Search/Tokenizer/UnifiedTokenizer.cs
  • TelegramSearchBot.Search/Tool/ContentQueryBuilder.cs
  • TelegramSearchBot.Search/Tool/ExtQueryBuilder.cs
  • TelegramSearchBot.Search/Tool/LuceneManager.cs
  • TelegramSearchBot.Search/Tool/PhraseQueryProcessor.cs
  • TelegramSearchBot.Search/Tool/SearchHelper.cs
  • TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs
  • TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs
  • TelegramSearchBot.Tokenizer.Tests/TelegramSearchBot.Tokenizer.Tests.csproj
  • TelegramSearchBot.Tokenizer.Tests/TokenizerFactoryTests.cs
  • TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs
  • TelegramSearchBot.Tokenizer/Abstractions/ITokenizerFactory.cs
  • TelegramSearchBot.Tokenizer/Abstractions/TokenizerMetadata.cs
  • TelegramSearchBot.Tokenizer/Abstractions/TokenizerType.cs
  • TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs
  • TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs
  • TelegramSearchBot.Tokenizer/TelegramSearchBot.Tokenizer.csproj
  • TelegramSearchBot.sln
💤 Files with no reviewable changes (1)
  • TelegramSearchBot.Search/Tokenizer/UnifiedTokenizer.cs

Comment thread TelegramSearchBot.Search/Tool/SearchHelper.cs Outdated
Comment thread TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs Outdated
Comment thread TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs Outdated
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs (1)

13-21: Consider implementing IDisposable to dispose the analyzer.

SmartChineseAnalyzer (via Analyzer) is disposable. The _analyzer field is created once and never disposed, which could lead to resource leaks if SmartChineseTokenizer instances are created and discarded frequently (e.g., in tests or dynamic factory scenarios).

♻️ Proposed fix
-public class SmartChineseTokenizer : ITokenizer {
+public class SmartChineseTokenizer : ITokenizer, IDisposable {
     private readonly Analyzer _analyzer;
     private readonly Action<string>? _logAction;
+    private bool _disposed;

     public SmartChineseTokenizer(Action<string>? logAction = null) {
         _analyzer = new SmartChineseAnalyzer(LuceneVersion.LUCENE_48);
         _logAction = logAction;
         Metadata = new TokenizerMetadata("SmartChinese", "Chinese", true);
     }
+
+    public void Dispose() {
+        if (!_disposed) {
+            _analyzer.Dispose();
+            _disposed = true;
+        }
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs` around
lines 13 - 21, SmartChineseTokenizer constructs a SmartChineseAnalyzer and never
disposes it, risking resource leaks; implement IDisposable on
SmartChineseTokenizer, store the analyzer in _analyzer (already present), add a
public void Dispose() (or Dispose(bool) pattern) that calls _analyzer.Dispose()
and suppresses finalization as appropriate, and ensure any consumers can call
Dispose (or use a finalizer/IDisposable pattern) so the SmartChineseAnalyzer
resources are released when SmartChineseTokenizer is no longer used.
TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs (2)

54-77: Missing test coverage for TokenizeWithOffsets.

The ITokenizer contract includes TokenizeWithOffsets, but there's no test validating this method. Consider adding a test to ensure implementers correctly produce offset information.

Additionally, MockTokenizer.Tokenize returns distinct tokens (via Distinct()), but TokenizeWithOffsets doesn't apply the same deduplication—this inconsistency could mask bugs in real implementations.

💡 Suggested test addition
[Fact]
public void TokenizeWithOffsets_ReturnsCorrectOffsets() {
    // Arrange
    var tokenizer = new MockTokenizer();
    
    // Act
    var result = tokenizer.TokenizeWithOffsets("hello world");
    
    // Assert
    Assert.Equal(2, result.Count);
    Assert.Equal(0, result[0].Start);
    Assert.Equal(5, result[0].End);
    Assert.Equal("hello", result[0].Term);
    Assert.Equal(6, result[1].Start);
    Assert.Equal(11, result[1].End);
    Assert.Equal("world", result[1].Term);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs` around lines 54 - 77,
Add a unit test covering MockTokenizer.TokenizeWithOffsets to verify it returns
correct start/end offsets and terms for input strings (e.g., "hello world"), and
either update TokenizeWithOffsets or the test to respect the same deduplication
behaviour as MockTokenizer.Tokenize (which uses Distinct()); specifically,
locate MockTokenizer (implements ITokenizer) and ensure TokenizeWithOffsets
produces offsets for the same set/order of tokens Tokenize returns or adjust
Tokenize to not dedupe, then add assertions that check count, Start, End, and
Term for each TokenWithOffset to match expected values.

32-43: Test name is misleading - no error scenario is tested.

The test name SafeTokenize_ReturnsNonEmptyList_EvenOnError implies error handling is being tested, but MockTokenizer.SafeTokenize simply delegates to Tokenize without simulating any error condition. This test only verifies that SafeTokenize returns a non-empty list for valid input, identical to the Tokenize test.

Consider either:

  1. Renaming to SafeTokenize_ReturnsNonEmptyList_ForValidText
  2. Creating a mock that throws in Tokenize to actually test error fallback behavior
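The second option could be sketched with a hypothetical stub whose Tokenize always throws, so the test finally exercises the fallback path that its name promises (the stub implements the fallback contract inline):

```csharp
// Hypothetical: a tokenizer whose happy path always fails.
private sealed class ThrowingTokenizer : ITokenizer {
    public TokenizerMetadata Metadata { get; } = new("Throwing", "None", false);

    public List<string> Tokenize(string text) =>
        throw new InvalidOperationException("simulated analyzer failure");

    public List<string> SafeTokenize(string text) {
        try { return Tokenize(text); }
        catch { return new List<string> { text }; } // fallback: raw input as one token
    }

    public List<TokenWithOffset> TokenizeWithOffsets(string text) => new();
}

[Fact]
public void SafeTokenize_ReturnsNonEmptyList_EvenOnError() {
    var tokenizer = new ThrowingTokenizer();

    var result = tokenizer.SafeTokenize("hello world");

    Assert.NotNull(result);
    Assert.NotEmpty(result); // fallback kicked in even though Tokenize threw
}
```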
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs` around lines 32 - 43,
The test name falsely claims error behavior; update the test to either rename it
to SafeTokenize_ReturnsNonEmptyList_ForValidText or modify MockTokenizer to
throw from Tokenize and assert SafeTokenize still returns a non-empty list;
specifically locate the test method SafeTokenize_ReturnsNonEmptyList_EvenOnError
and either change its name to reflect valid input behavior or implement a
MockTokenizer override of Tokenize that throws an exception and verify
SafeTokenize catches that and returns a non-empty collection.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs`:
- Around line 54-77: Add a unit test covering MockTokenizer.TokenizeWithOffsets
to verify it returns correct start/end offsets and terms for input strings
(e.g., "hello world"), and either update TokenizeWithOffsets or the test to
respect the same deduplication behaviour as MockTokenizer.Tokenize (which uses
Distinct()); specifically, locate MockTokenizer (implements ITokenizer) and
ensure TokenizeWithOffsets produces offsets for the same set/order of tokens
Tokenize returns or adjust Tokenize to not dedupe, then add assertions that
check count, Start, End, and Term for each TokenWithOffset to match expected
values.
- Around line 32-43: The test name falsely claims error behavior; update the
test to either rename it to SafeTokenize_ReturnsNonEmptyList_ForValidText or
modify MockTokenizer to throw from Tokenize and assert SafeTokenize still
returns a non-empty list; specifically locate the test method
SafeTokenize_ReturnsNonEmptyList_EvenOnError and either change its name to
reflect valid input behavior or implement a MockTokenizer override of Tokenize
that throws an exception and verify SafeTokenize catches that and returns a
non-empty collection.

In `@TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs`:
- Around line 13-21: SmartChineseTokenizer constructs a SmartChineseAnalyzer and
never disposes it, risking resource leaks; implement IDisposable on
SmartChineseTokenizer, store the analyzer in _analyzer (already present), add a
public void Dispose() (or Dispose(bool) pattern) that calls _analyzer.Dispose()
and suppresses finalization as appropriate, and ensure any consumers can call
Dispose (or use a finalizer/IDisposable pattern) so the SmartChineseAnalyzer
resources are released when SmartChineseTokenizer is no longer used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f86d06d-541c-40c1-a36a-226f65a2e25b

📥 Commits

Reviewing files that changed from the base of the PR and between cb9d4ba and 484da27.

📒 Files selected for processing (9)
  • TelegramSearchBot.Search/Tool/SearchHelper.cs
  • TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs
  • TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs
  • TelegramSearchBot.Tokenizer.Tests/TokenizerFactoryTests.cs
  • TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs
  • TelegramSearchBot.Tokenizer/Abstractions/ITokenizerFactory.cs
  • TelegramSearchBot.Tokenizer/Abstractions/TokenizerType.cs
  • TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs
  • TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs
✅ Files skipped from review due to trivial changes (3)
  • TelegramSearchBot.Tokenizer/Abstractions/TokenizerType.cs
  • TelegramSearchBot.Tokenizer/Abstractions/ITokenizerFactory.cs
  • TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs
🚧 Files skipped from review as they are similar to previous changes (3)
  • TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs
  • TelegramSearchBot.Tokenizer.Tests/TokenizerFactoryTests.cs
  • TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
AGENTS.md (1)

53-65: Consider adding a language specifier to the fenced code block.

The static analysis tool suggests adding a language specifier. For ASCII diagrams, you can use ```text to silence the linter.

📝 Proposed fix
-```
+```text
 TelegramSearchBot.sln
 ├── TelegramSearchBot/              # Main console app
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` around lines 53 - 65, Update the fenced code block containing the
ASCII project tree (starting with "TelegramSearchBot.sln" and the directory
entries like "TelegramSearchBot/", "TelegramSearchBot.Common/", etc.) to include
a language specifier (e.g., use ```text) so the linter/static analysis no longer
flags the block; simply change the opening fence to ```text and leave the ASCII
diagram content unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AGENTS.md`:
- Line 140: The doc line incorrectly references TypeScript/JavaScript constructs
("as any", "@ts-ignore", "@ts-expect-error"); replace that guidance with
C#-appropriate type-safety guidance by removing those JS/TS tokens and instead
recommending avoiding "dynamic", avoiding "unsafe" code blocks, and using proper
nullable reference type handling and explicit casts/guarding; update the
sentence around the existing phrase to read something like "Never use dynamic,
unsafe code, or ignore nullable reference warnings—handle nullability and
casting explicitly" so the guidance matches C# idioms.

---

Nitpick comments:
In `@AGENTS.md`:
- Around line 53-65: Update the fenced code block containing the ASCII project
tree (starting with "TelegramSearchBot.sln" and the directory entries like
"TelegramSearchBot/", "TelegramSearchBot.Common/", etc.) to include a language
specifier (e.g., use ```text) so the linter/static analysis no longer flags the
block; simply change the opening fence to ```text and leave the ASCII diagram
content unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a6b41995-c83e-46b0-9d09-c518ce2df147

📥 Commits

Reviewing files that changed from the base of the PR and between 484da27 and 1aee613.

📒 Files selected for processing (4)
  • AGENTS.md
  • TelegramSearchBot.Test/Manage/EditMcpConfServiceTests.cs
  • TelegramSearchBot/Model/Mcp/McpConfState.cs
  • TelegramSearchBot/Service/Manage/EditMcpConfService.cs
✅ Files skipped from review due to trivial changes (3)
  • TelegramSearchBot/Model/Mcp/McpConfState.cs
  • TelegramSearchBot/Service/Manage/EditMcpConfService.cs
  • TelegramSearchBot.Test/Manage/EditMcpConfServiceTests.cs

Comment thread AGENTS.md Outdated

### Type Safety
- **Never** use `as any`, `@ts-ignore`, `@ts-expect-error`

⚠️ Potential issue | 🔴 Critical

Fix incorrect TypeScript/JavaScript references in C# documentation.

Line 140 references TypeScript/JavaScript constructs (as any, @ts-ignore, @ts-expect-error) that don't exist in C#. This appears to be copied from a TypeScript/JavaScript style guide. For C# type safety, relevant guidance would cover avoiding dynamic, unsafe code, and proper nullable reference handling.

🔧 Proposed fix
-- **Never** use `as any`, `@ts-ignore`, `@ts-expect-error`
+- Avoid `dynamic` keyword unless absolutely necessary
+- Avoid `unsafe` code blocks
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` at line 140, The doc line incorrectly references
TypeScript/JavaScript constructs ("as any", "@ts-ignore", "@ts-expect-error");
replace that guidance with C#-appropriate type-safety guidance by removing those
JS/TS tokens and instead recommending avoiding "dynamic", avoiding "unsafe" code
blocks, and using proper nullable reference type handling and explicit
casts/guarding; update the sentence around the existing phrase to read something
like "Never use dynamic, unsafe code, or ignore nullable reference
warnings—handle nullability and casting explicitly" so the guidance matches C#
idioms.

ModerRAS and others added 2 commits April 22, 2026 10:08
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ModerRAS ModerRAS merged commit 05b47ad into master Apr 22, 2026
5 checks passed
@ModerRAS ModerRAS deleted the feat/extract-tokenizer-v2 branch April 22, 2026 02:18