Fix Redis BRPOP timeout crashing background services #267
Conversation
- TelegramTaskConsumer: reduce BRPOP block time from 5s to 2s so the command completes well within SE.Redis's 5000ms async timeout
- TelegramTaskConsumer.ExecuteAsync: catch RedisException and retry after 1s instead of letting it propagate and stop the host
- AgentRegistryService.ExecuteAsync: catch RedisException and log a warning, then continue the maintenance loop after the usual delay
- OperationCanceledException is re-thrown in both loops to allow clean shutdown when the host is stopping

Fixes: BackgroundService crashing with RedisTimeoutException after ~286s uptime, which triggered StopHost and shut down the process.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
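The loop shape the commit describes can be sketched as follows. This is a minimal illustration, not the actual TelegramTaskConsumer code: the queue key, handler body, and the `_db`/`_logger` fields are hypothetical stand-ins.

```csharp
// Sketch only (assumptions: a StackExchange.Redis IDatabase `_db`, an
// ILogger `_logger`, and a hypothetical queue key "telegram:tasks").
protected override async Task ExecuteAsync(CancellationToken stoppingToken) {
    while (!stoppingToken.IsCancellationRequested) {
        try {
            // Block server-side for at most 2 s so the BRPOP reply arrives
            // well within the client's 5000 ms async timeout.
            var result = await _db.ExecuteAsync("BRPOP", "telegram:tasks", 2);
            if (!result.IsNull) {
                // ... process the popped task ...
            }
        } catch (OperationCanceledException) {
            break; // host is stopping: exit the loop cleanly
        } catch (RedisException ex) {
            // A transient Redis failure must not escape ExecuteAsync,
            // or BackgroundServiceExceptionBehavior.StopHost kills the process.
            _logger.LogWarning(ex, "Redis error in task consumer, retrying in 1 s");
            await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
        }
    }
}
```

The key design point is that every exception path either breaks out of the loop (cancellation) or is swallowed with a backoff (Redis errors), so nothing propagates to the host.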
📝 Walkthrough

Two background services receive improved error handling and retry logic. The agent registry service separates exception handling for maintenance execution and delays, with explicit breaking on cancellation. The task consumer refactors its Redis blocking-operation loop with a reduced timeout and granular exception handling for cancellation and Redis failures.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs (1)
86-91: Handle cancellation during the 1 s retry delay explicitly.
`Task.Delay(..., stoppingToken)` inside `catch (RedisException)` can throw `OperationCanceledException` during shutdown, which escapes `ExecuteAsync` (the sibling `catch (OperationCanceledException)` on line 86 does not catch exceptions thrown from within another catch block). While `BackgroundService` tolerates OCE during shutdown, the companion service `AgentRegistryService.ExecuteAsync` (lines 197-201) already uses a separate `try/catch` around its delay to exit via a clean `break`. Mirroring that shape keeps the two services' shutdown semantics identical and avoids surprising stack traces if an error handler is ever added.

Separately, this service retries after 1 s while `AgentRegistryService` retries after 5 s (see `AgentRegistryService.cs` lines 192-198). Consider aligning on a single configurable value (or at least documenting why the BRPOP consumer should back off faster than the maintenance loop).

♻️ Proposed refactor mirroring AgentRegistryService
```diff
-            } catch (OperationCanceledException) {
-                break;
-            } catch (RedisException ex) {
-                _logger.LogWarning(ex, "Redis error in TelegramTaskConsumer, retrying in 1 s");
-                await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
-            }
+            } catch (OperationCanceledException) {
+                break;
+            } catch (RedisException ex) {
+                _logger.LogWarning(ex, "Redis error in TelegramTaskConsumer, retrying in 1 s");
+                try {
+                    await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
+                } catch (OperationCanceledException) {
+                    break;
+                }
+            }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs` around lines 86 - 91, The Redis retry path in TelegramTaskConsumer.ExecuteAsync currently awaits Task.Delay(..., stoppingToken) inside the RedisException catch which can throw OperationCanceledException out of that catch; change the catch (RedisException ex) handling to perform the delay inside its own try/catch that catches OperationCanceledException and breaks out of the main loop (mirroring AgentRegistryService.ExecuteAsync) so shutdown exits cleanly; also consider replacing the hardcoded 1s backoff with a shared configurable backoff value (or align it to the 5s used by AgentRegistryService) so both services use the same retry/backoff policy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs`:
- Around line 192-198: The warning message in AgentRegistryService currently
hardcodes "retrying in 5 s" while the actual delay uses Math.Max(5,
Env.AgentHeartbeatIntervalSeconds); update the maintenance loop to compute the
delay (e.g., var retryDelay = TimeSpan.FromSeconds(Math.Max(5,
Env.AgentHeartbeatIntervalSeconds))) and use that variable in both the await
Task.Delay(...) and in the _logger.LogWarning call so the log shows the real
retryDelay value (reference: AgentRegistryService,
Env.AgentHeartbeatIntervalSeconds, _logger.LogWarning).
---
Nitpick comments:
In `@TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs`:
- Around line 86-91: The Redis retry path in TelegramTaskConsumer.ExecuteAsync
currently awaits Task.Delay(..., stoppingToken) inside the RedisException catch
which can throw OperationCanceledException out of that catch; change the catch
(RedisException ex) handling to perform the delay inside its own try/catch that
catches OperationCanceledException and breaks out of the main loop (mirroring
AgentRegistryService.ExecuteAsync) so shutdown exits cleanly; also consider
replacing the hardcoded 1s backoff with a shared configurable backoff value (or
align it to the 5s used by AgentRegistryService) so both services use the same
retry/backoff policy.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bf0b5111-1305-4f4d-bfd2-9e6e4787c3a4
📒 Files selected for processing (2)
- TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs
- TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs
```csharp
                _logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in 5 s");
            } catch (Exception ex) {
                _logger.LogError(ex, "Unexpected error in AgentRegistryService maintenance");
            }

            try {
                await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);
```
Log message understates the actual retry delay.
The warning hard-codes "retrying in 5 s", but the subsequent delay is `Math.Max(5, Env.AgentHeartbeatIntervalSeconds)`. If `AgentHeartbeatIntervalSeconds` is configured above 5, the log will misreport the retry interval and make field diagnostics harder.
💡 Proposed tweak
- } catch (RedisException ex) {
- _logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in 5 s");
+ } catch (RedisException ex) {
+ var retrySeconds = Math.Max(5, Env.AgentHeartbeatIntervalSeconds);
+ _logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in {RetrySeconds} s", retrySeconds);
} catch (Exception ex) {
_logger.LogError(ex, "Unexpected error in AgentRegistryService maintenance");
}
try {
- await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);
+ await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```csharp
                _logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in 5 s");
            } catch (Exception ex) {
                _logger.LogError(ex, "Unexpected error in AgentRegistryService maintenance");
            }
            try {
                await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);
            } catch (RedisException ex) {
                var retrySeconds = Math.Max(5, Env.AgentHeartbeatIntervalSeconds);
                _logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in {RetrySeconds} s", retrySeconds);
            } catch (Exception ex) {
                _logger.LogError(ex, "Unexpected error in AgentRegistryService maintenance");
            }
            try {
                await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs` around lines 192 -
198, The warning message in AgentRegistryService currently hardcodes "retrying
in 5 s" while the actual delay uses Math.Max(5,
Env.AgentHeartbeatIntervalSeconds); update the maintenance loop to compute the
delay (e.g., var retryDelay = TimeSpan.FromSeconds(Math.Max(5,
Env.AgentHeartbeatIntervalSeconds))) and use that variable in both the await
Task.Delay(...) and in the _logger.LogWarning call so the log shows the real
retryDelay value (reference: AgentRegistryService,
Env.AgentHeartbeatIntervalSeconds, _logger.LogWarning).
🔍 PR Check Report

📋 Check Overview
🧪 Test Results
📊 Code Quality
📁 Test Artifacts
🔗 Related Links

This report was generated automatically by GitHub Actions.
…registration (#268)

- BackgroundServices (TelegramTaskConsumer, ChunkPollingService, AgentRegistryService) now early-return when EnableLLMAgentProcess is disabled, preventing unnecessary Redis calls and potential crashes from RedisTimeoutException
- Agent process BRPOP timeout reduced from 5s to 2s to avoid the race with SE.Redis's async timeout (same fix as PR #267 for the main process)
- AgentLoopService main loop now catches RedisException with a retry delay, preventing a single transient Redis failure from crashing the agent process
- Heartbeat loop catches OperationCanceledException and RedisException, preventing unobserved exceptions during shutdown or transient failures
- Agent process now registers LLM project tools (FileToolService, BashToolService) via two-assembly McpToolHelper.EnsureInitialized, giving the agent access to file operations and shell execution instead of only echo/calculator/send_message

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
After ~286s of uptime, TelegramTaskConsumer and AgentRegistryService crashed with:

```
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response ... command=BRPOP ... 5016ms elapsed, timeout is 5000ms
```

Both exceptions propagated out of `ExecuteAsync`, triggering `BackgroundServiceExceptionBehavior.StopHost` and killing the process.
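The race is inherent to the old configuration: a BRPOP allowed to block server-side for 5 s can legitimately reply at ~5.0 s, which competes head-on with SE.Redis's 5000 ms async timeout, so a few milliseconds of network or scheduler jitter (16 ms in the log above) pushes the await past the deadline. The fix is to keep the server-side block strictly below the client timeout. A hedged sketch of the relationship (the connection string and variable names are illustrative, not the project's actual configuration):

```csharp
// Illustrative only: the server-side BRPOP block must stay comfortably
// below the client's AsyncTimeout (5000 ms by default in SE.Redis).
var options = ConfigurationOptions.Parse("localhost:6379");
options.AsyncTimeout = 5000;   // client-side deadline for the awaited reply

var blockSeconds = 2;          // BRPOP block time (reduced from 5)
// 2 s block + jitter << 5000 ms timeout, so the reply (or the empty
// timeout reply) always arrives before the client gives up.
var result = await db.ExecuteAsync("BRPOP", "telegram:tasks", blockSeconds);
```

An alternative would have been raising `AsyncTimeout` above the block time, but shrinking the block keeps the default client configuration and only costs a slightly tighter polling cadence.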
Root Cause
Fix
Testing
Summary by CodeRabbit