Fix LLM agent stability: conditional execution, error handling, tool registration #268

Merged
ModerRAS merged 1 commit into master from fix/llm-agent-stability
Apr 17, 2026

@ModerRAS ModerRAS commented Apr 17, 2026

Summary

Fixes multiple stability issues in the LLM agent process separation implementation.

Changes

1. Conditional BackgroundService execution (prevents crash when agent mode disabled)

  • `TelegramTaskConsumer`, `ChunkPollingService`, and `AgentRegistryService` now early-return from `ExecuteAsync()` when `EnableLLMAgentProcess=false` (the default)
  • Previously these services would make Redis BRPOP/HGETALL/ListRange calls every loop iteration regardless of config, causing `RedisTimeoutException` → `StopHost` → app crash
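
The guard described above can be sketched roughly as follows. This is a dependency-free illustration, not the PR's code: `GuardedWorker`, its `enabled` flag, and the `pollOnce` callback are hypothetical stand-ins for a hosted service, `Env.EnableLLMAgentProcess`, and one Redis poll cycle.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a hosted background service. The real services
// derive from BackgroundService and read Env.EnableLLMAgentProcess.
public sealed class GuardedWorker
{
    private readonly bool _enabled;                            // stand-in for the config flag
    private readonly Func<CancellationToken, Task> _pollOnce;  // stand-in for one Redis poll cycle
    private readonly TimeSpan _interval;

    // Number of completed poll cycles, exposed only so the sketch is observable.
    public int Cycles { get; private set; }

    public GuardedWorker(bool enabled, Func<CancellationToken, Task> pollOnce, TimeSpan interval)
    {
        _enabled = enabled;
        _pollOnce = pollOnce;
        _interval = interval;
    }

    public async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Return before entering the loop so a disabled feature never polls.
        if (!_enabled)
        {
            return;
        }

        while (!stoppingToken.IsCancellationRequested)
        {
            await _pollOnce(stoppingToken);
            Cycles++;
            try
            {
                await Task.Delay(_interval, stoppingToken);
            }
            catch (OperationCanceledException)
            {
                break; // normal shutdown
            }
        }
    }
}
```

With `enabled` set to `false`, `ExecuteAsync` completes immediately and `pollOnce` is never invoked, which is the property that keeps a disabled deployment from issuing Redis calls at all.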

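2. Agent BRPOP timeout fix

  • BRPOP timeout in the agent process reduced from 5s to 2s to avoid a race with the SE.Redis async timeout (same fix as PR #267 for the main process)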

3. Agent error handling

  • `AgentLoopService` main loop now catches `RedisException` with 1s retry delay
  • Heartbeat loop catches `OperationCanceledException` and `RedisException`
  • Prevents single transient Redis failures from crashing the agent process
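
A minimal sketch of this catch-and-retry pattern, under stated assumptions: `TransientStoreException` is a local stand-in for `StackExchange.Redis.RedisException` so the example compiles without the package, and `ResilientLoop.RunAsync` is a hypothetical helper, not a method from the PR.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Local stand-in so the sketch compiles without StackExchange.Redis.
// The real code catches StackExchange.Redis.RedisException.
public sealed class TransientStoreException : Exception
{
    public TransientStoreException(string message) : base(message) { }
}

public static class ResilientLoop
{
    // Runs `step` until cancellation. A transient failure is reported and
    // retried after `retryDelay` instead of escaping and ending the process.
    // Returns the number of transient failures that were absorbed.
    public static async Task<int> RunAsync(
        Func<CancellationToken, Task> step,
        TimeSpan retryDelay,
        Action<Exception> onTransientError,
        CancellationToken token)
    {
        var failures = 0;
        while (!token.IsCancellationRequested)
        {
            try
            {
                await step(token);
            }
            catch (OperationCanceledException) when (token.IsCancellationRequested)
            {
                break; // shutdown requested: exit quietly
            }
            catch (TransientStoreException ex)
            {
                failures++;
                onTransientError(ex);
                try
                {
                    await Task.Delay(retryDelay, token);
                }
                catch (OperationCanceledException)
                {
                    break; // cancelled while waiting to retry
                }
            }
        }
        return failures;
    }
}
```

Passing `TimeSpan.FromSeconds(1)` as `retryDelay` would match the 1s delay mentioned above. Other exception types still propagate, so only failures known to be transient are retried.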

4. Agent tool registration

  • Changed `McpToolHelper.EnsureInitialized` to use two-assembly overload, registering both `AgentToolService` and LLM project tools
  • Added `FileToolService` (read/write/edit/search/list files) and `BashToolService` (shell execution) to agent DI
  • Agent now has 8+ useful tools instead of only 3 trivial ones (echo/calculator/send_message)
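
The internals of `McpToolHelper.EnsureInitialized` are not shown on this page, so the following is only a generic illustration of why scanning more than one assembly widens the tool set. Everything here is hypothetical: `SketchToolAttribute`, `SketchAgentTools`, `SketchFileTools`, and `ToolDiscovery.Discover` are invented for the example and are not the project's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical marker attribute; the project's real tool attribute is not shown here.
[AttributeUsage(AttributeTargets.Method)]
public sealed class SketchToolAttribute : Attribute
{
    public string Name { get; }
    public SketchToolAttribute(string name) { Name = name; }
}

// Hypothetical tool classes standing in for AgentToolService and FileToolService.
public sealed class SketchAgentTools
{
    [SketchTool("echo")]
    public string Echo(string text) => text;
}

public sealed class SketchFileTools
{
    [SketchTool("read_file")]
    public string ReadFile(string path) => System.IO.File.ReadAllText(path);
}

public static class ToolDiscovery
{
    // Collects tool names from every distinct assembly passed in.
    // A tool class is found only if its assembly is part of the scan.
    public static IReadOnlyList<string> Discover(params Assembly[] assemblies)
    {
        return assemblies
            .Distinct()
            .SelectMany(a => a.GetTypes())
            .SelectMany(t => t.GetMethods(
                BindingFlags.Public | BindingFlags.Instance |
                BindingFlags.Static | BindingFlags.DeclaredOnly))
            .Select(m => m.GetCustomAttribute<SketchToolAttribute>())
            .OfType<SketchToolAttribute>() // drops methods without the attribute
            .Select(a => a.Name)
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();
    }
}
```

In this sketch both tool classes happen to live in one assembly, so `Distinct()` collapses the two arguments. In the situation the PR describes, the tool classes sit in separate assemblies, and a scan that covers only one of them would miss the other's tools.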

Testing

  • All 413 tests pass (223 + 186 + 4)
  • Build: 0 errors

Summary by CodeRabbit

  • New Features

    • Added file and bash tool capabilities to the LLM agent.
  • Bug Fixes

    • Improved service resilience with graceful error recovery and exception handling.
    • Enhanced cancellation handling for more responsive background operations.
  • Improvements

    • Service execution now respects LLM agent process configuration settings.

…registration

- BackgroundServices (TelegramTaskConsumer, ChunkPollingService, AgentRegistryService)
  now early-return when EnableLLMAgentProcess is disabled, preventing unnecessary
  Redis calls and potential crashes from RedisTimeoutException

- Agent process BRPOP timeout reduced from 5s to 2s to avoid race with SE.Redis
  async timeout (same fix as PR #267 for the main process)

- AgentLoopService main loop now catches RedisException with retry delay,
  preventing a single transient Redis failure from crashing the agent process

- Heartbeat loop catches OperationCanceledException and RedisException,
  preventing unobserved exceptions during shutdown or transient failures

- Agent process now registers LLM project tools (FileToolService, BashToolService)
  via two-assembly McpToolHelper.EnsureInitialized, giving the agent access to
  file operations and shell execution instead of only echo/calculator/send_message

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai bot commented Apr 17, 2026

Walkthrough

Changes add MCP tool service integration (file and bash tools) to the agent startup, implement conditional execution guards based on Env.EnableLLMAgentProcess across multiple services, improve Redis exception handling with logging and graceful retry patterns in agent loops, and reduce BRPOP blocking timeout from 5 to 2 seconds.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Tool Service Integration: TelegramSearchBot.LLMAgent/LLMAgentProgram.cs | Initializes MCP/tools support by expanding assembly discovery to include FileToolService, registers new scoped bindings for IFileToolService and IBashToolService in dependency injection. |
| Redis Resilience & Timeouts: TelegramSearchBot.LLMAgent/Service/AgentLoopService.cs | Wraps Redis blocking pop and session save operations in try/catch blocks to handle RedisException gracefully with logging and retry logic; reduces BRPOP wait timeout from 5 to 2 seconds; adds OperationCanceledException handling for heartbeat shutdown. |
| Feature Flag Guards: TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs, TelegramSearchBot/Service/AI/LLM/ChunkPollingService.cs, TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs | Adds early exit checks for Env.EnableLLMAgentProcess in ExecuteAsync methods; wraps polling cycles in exception handling to tolerate RedisException while respecting OperationCanceledException for proper shutdown. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hop, hop! New tools await,
With redis guards that seal the gate,
Feature flags now spring so true,
Two-second waits, no five—we flew!
Errors caught like carrots sweet, 🥕
Making services complete!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and comprehensively summarizes the three main categories of changes: conditional execution guards, error handling improvements, and tool registration enhancements. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
TelegramSearchBot/Service/AI/LLM/ChunkPollingService.cs (2)

55-57: Consider logging the disabled-mode early return for parity with sibling services.

TelegramTaskConsumer and AgentRegistryService both emit a LogDebug when returning early; this service silently returns, making it harder to tell from logs whether the polling loop was intentionally skipped. ChunkPollingService currently has no ILogger injected — injecting one would also let you log the swallowed RedisException on line 64.

Suggested change
-        public ChunkPollingService(IConnectionMultiplexer redis) {
+        private readonly ILogger<ChunkPollingService> _logger;
+
+        public ChunkPollingService(IConnectionMultiplexer redis, ILogger<ChunkPollingService> logger) {
             _redis = redis;
+            _logger = logger;
         }
@@
-            if (!Env.EnableLLMAgentProcess) {
-                return;
-            }
+            if (!Env.EnableLLMAgentProcess) {
+                _logger.LogDebug("LLM agent process mode disabled – ChunkPollingService will not start");
+                return;
+            }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot/Service/AI/LLM/ChunkPollingService.cs` around lines 55 -
57, ChunkPollingService currently returns silently when
Env.EnableLLMAgentProcess is false and swallows RedisException; inject an
ILogger<ChunkPollingService> into the constructor (store as _logger), use
_logger.LogDebug(...) to emit a message when the early return occurs in the
Start/loop entry (the if (!Env.EnableLLMAgentProcess) block), and update the
catch that currently swallows RedisException (around the RedisException at line
~64) to call _logger.LogError(ex, "Redis error in ChunkPollingService") so the
exception is visible in logs while preserving current behavior.

64-66: Silently swallowing RedisException hides transient failures.

Without a logger, repeated Redis outages produce no signal and there's also no back-off delay before the next RunPollCycleAsync attempt — the loop only delays via Task.Delay on line 69, which still runs. That's probably fine given the poll interval, but at minimum a LogWarning would help diagnose flapping Redis issues. (Related to the logger-injection suggestion above.)
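
The optional incremental back-off mentioned here could look roughly like the helper below. It is a sketch under assumptions: `PollBackoff.NextDelay` is a hypothetical name, and the doubling schedule with a cap is one common choice, not something the review prescribes.

```csharp
using System;

public static class PollBackoff
{
    // Extra delay before the next attempt: baseDelay * 2^(failures - 1),
    // capped at maxDelay. Zero failures means the last cycle succeeded,
    // so no extra delay is added.
    public static TimeSpan NextDelay(int consecutiveFailures, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (consecutiveFailures <= 0)
        {
            return TimeSpan.Zero;
        }

        // Cap the exponent so Math.Pow stays in a sane range during long outages.
        var exponent = Math.Min(consecutiveFailures - 1, 30);
        var millis = baseDelay.TotalMilliseconds * Math.Pow(2, exponent);
        return TimeSpan.FromMilliseconds(Math.Min(millis, maxDelay.TotalMilliseconds));
    }
}
```

A caller would increment a failure counter in the `RedisException` catch block, log the exception at warning level, wait for `NextDelay(...)`, and reset the counter after a successful cycle. With a 500 ms base and a 5 s cap, the successive delays are 500 ms, 1 s, 2 s, 4 s, and then 5 s.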

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot/Service/AI/LLM/ChunkPollingService.cs` around lines 64 -
66, In ChunkPollingService's RunPollCycleAsync catch block that currently
swallows RedisException, log the exception (use the injected logger instance on
the class) with a warning message that includes the exception details and
context (e.g., "Transient Redis failure during RunPollCycleAsync"); optionally
add a short back-off delay before retrying (e.g., a small Task.Delay or
incremental back-off) so repeated transient Redis outages are visible in logs
and give Redis a moment before the next attempt.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b85fde7-5009-4c67-a150-7bd1755e5257

📥 Commits

Reviewing files that changed from the base of the PR and between cf98863 and 3c468eb.

📒 Files selected for processing (5)
  • TelegramSearchBot.LLMAgent/LLMAgentProgram.cs
  • TelegramSearchBot.LLMAgent/Service/AgentLoopService.cs
  • TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs
  • TelegramSearchBot/Service/AI/LLM/ChunkPollingService.cs
  • TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs

Comment on lines +186 to +189
if (!Env.EnableLLMAgentProcess) {
    _logger.LogDebug("LLM agent process mode disabled – AgentRegistryService will not start");
    return;
}
⚠️ Potential issue | 🟡 Minor

Disabled-mode shutdown branch in RunMaintenanceOnceAsync becomes unreachable from the background loop.

With this early return, ExecuteAsync never calls RunMaintenanceOnceAsync, so the disabled-mode branch at lines 149–153 (which iterates _knownSessions and calls RequestShutdownAsync("agent mode disabled")) no longer fires from the hosted service. In practice this is mostly fine because EnsureAgentAsync throws when the flag is false so _knownSessions shouldn't grow, but if the flag is toggled from true→false at runtime, previously-running agents will no longer be gracefully asked to shut down from here.

Consider either:

  • Documenting that toggling requires a restart, or
  • Running one final RunMaintenanceOnceAsync pass (or just the shutdown-known-sessions block) before returning, so in-flight sessions get a graceful shutdown request.
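
The second option can be sketched as a small helper that runs the shutdown pass before the early return. This is an assumption-laden illustration: `DisabledModeShutdown.RunAsync` is hypothetical, and the dictionary of callbacks stands in for `_knownSessions` and `RequestShutdownAsync`, whose real types are not shown on this page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DisabledModeShutdown
{
    // Returns true when the caller should return early because the feature
    // is disabled. Before returning true, every known session is asked to
    // shut down. `knownSessions` maps a session id to a callback that takes
    // the shutdown reason (a stand-in for RequestShutdownAsync).
    public static async Task<bool> RunAsync(
        bool enabled,
        IReadOnlyDictionary<string, Func<string, Task>> knownSessions)
    {
        if (enabled)
        {
            return false; // caller continues into its normal maintenance loop
        }

        // One final pass so in-flight sessions receive a graceful shutdown request.
        var requests = knownSessions.Values.Select(request => request("agent mode disabled"));
        await Task.WhenAll(requests);
        return true;
    }
}
```

A caller would invoke this at the top of `ExecuteAsync` and return when the result is `true`. If the flag is only read at startup and sessions cannot exist at that point, documenting that toggling requires a restart is the simpler of the two options.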
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs` around lines 186 -
189, The early return when Env.EnableLLMAgentProcess is false prevents
ExecuteAsync from ever calling RunMaintenanceOnceAsync and leaves
previously-known agents running; update the AgentRegistryService start path so
that before returning when Env.EnableLLMAgentProcess is false you perform one
final shutdown pass: either call await
RunMaintenanceOnceAsync(CancellationToken.None) or iterate over _knownSessions
and call RequestShutdownAsync("agent mode disabled") for each (awaiting the
tasks) so existing sessions are asked to terminate gracefully; keep references
to Env.EnableLLMAgentProcess, ExecuteAsync, RunMaintenanceOnceAsync,
_knownSessions and RequestShutdownAsync when making the change.

@github-actions

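🔍 PR Check Report

📋 Check Overview

🧪 Test Results

| Platform | Status | Details |
|---|---|---|
| Ubuntu | 🔴 Failed | Test results unavailable |
| Windows | 🔴 Failed | Test results unavailable |

📊 Code Quality

  • ✅ Code formatting check
  • ✅ Security vulnerability scan
  • ✅ Dependency analysis
  • ✅ Code coverage collection

📁 Test Artifacts

  • Test result files have been uploaded as artifacts
  • Code coverage has been uploaded to Codecov

🔗 Related Links


This report was generated automatically by GitHub Actions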

@ModerRAS ModerRAS merged commit 2680c95 into master Apr 17, 2026
8 checks passed
@ModerRAS ModerRAS deleted the fix/llm-agent-stability branch April 17, 2026 22:16