Skip to content

Fix Redis BRPOP timeout crashing background services#267

Merged
ModerRAS merged 1 commit intomasterfrom
fix/redis-brpop-timeout
Apr 17, 2026
Merged

Fix Redis BRPOP timeout crashing background services#267
ModerRAS merged 1 commit intomasterfrom
fix/redis-brpop-timeout

Conversation

@ModerRAS
Copy link
Copy Markdown
Owner

@ModerRAS ModerRAS commented Apr 17, 2026

Problem

After ~286s uptime, TelegramTaskConsumer and AgentRegistryService crashed with:
\
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response ... command=BRPOP ... 5016ms elapsed, timeout is 5000ms
\
Both exceptions propagated out of \ExecuteAsync, triggering \BackgroundServiceExceptionBehavior.StopHost\ and killing the process.

Root Cause

  • \TelegramTaskConsumer\ called \BRPOP key 5\ (5-second block time) while SE.Redis \�syncTimeout\ is also 5000ms — a race where the client timeout and BRPOP timeout fire nearly simultaneously
  • Neither \ExecuteAsync\ caught \RedisException, so any transient Redis hiccup would crash the host

Fix

  • Reduce BRPOP block time to 2 seconds (well within 5000ms client timeout)
  • Wrap each service's loop body with \catch (RedisException)\ → log warning + retry
  • Let \OperationCanceledException\ propagate normally for clean shutdown

Testing

  • All 6 agent/registry integration tests pass
  • Build clean (0 errors)

Summary by CodeRabbit

  • Bug Fixes
    • Improved system resilience with enhanced error handling and automatic retry logic for transient connection failures.
    • Enhanced responsiveness to shutdown and cancellation requests.
    • Optimized request timeouts for better operational performance.

- TelegramTaskConsumer: reduce BRPOP block time from 5s to 2s so the
  command completes well within SE.Redis's 5000ms async timeout
- TelegramTaskConsumer.ExecuteAsync: catch RedisException and retry
  after 1s instead of letting it propagate and stop the host
- AgentRegistryService.ExecuteAsync: catch RedisException and log a
  warning, then continue the maintenance loop after the usual delay
- OperationCanceledException is re-thrown in both loops to allow
  clean shutdown when the host is stopping

Fixes: BackgroundService crashing with RedisTimeoutException after
~286s uptime, which triggered StopHost and shut down the process.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 17, 2026

📝 Walkthrough

Walkthrough

Two background services receive improved error handling and retry logic. The agent registry service separates exception handling for maintenance execution and delays with explicit cancellation breaking. The task consumer refactors its Redis blocking operation loop with a reduced timeout and granular exception handling for cancellation and Redis failures.

Changes

Cohort / File(s) Summary
Background Service Error Handling
TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs, TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs
Both services enhance error resilience: AgentRegistryService wraps maintenance and delay in separate try/catch blocks with explicit OperationCanceledException breaking; TelegramTaskConsumer refactors outer BRPOP loop with reduced 2-second timeout, immediate loop breaks on cancellation, and 1-second retry delays on Redis failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 Loops grow stronger with graceful falls,
Redis whispers through error's halls,
Five becomes two, delays find grace,
Cancellations now know their place! 🎭✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main fix: reducing BRPOP timeout and adding Redis error handling to prevent service crashes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/redis-brpop-timeout

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs (1)

86-91: Handle cancellation during the 1 s retry delay explicitly.

Task.Delay(..., stoppingToken) inside catch (RedisException) can throw OperationCanceledException during shutdown, which escapes ExecuteAsync (the sibling catch (OperationCanceledException) on line 86 does not catch exceptions thrown from within another catch block). While BackgroundService tolerates OCE during shutdown, the companion service AgentRegistryService.ExecuteAsync (lines 197-201) already uses a separate try/catch around its delay to exit via a clean break. Mirroring that shape keeps the two services' shutdown semantics identical and avoids surprising stack traces if an error handler is ever added.

Separately, this service retries after 1 s while AgentRegistryService retries after 5 s (see AgentRegistryService.cs lines 192-198). Consider aligning on a single configurable value (or at least documenting why BRPOP consumer should back off faster than the maintenance loop).

♻️ Proposed refactor mirroring AgentRegistryService
-                } catch (OperationCanceledException) {
-                    break;
-                } catch (RedisException ex) {
-                    _logger.LogWarning(ex, "Redis error in TelegramTaskConsumer, retrying in 1 s");
-                    await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
-                }
+                } catch (OperationCanceledException) {
+                    break;
+                } catch (RedisException ex) {
+                    _logger.LogWarning(ex, "Redis error in TelegramTaskConsumer, retrying in 1 s");
+                    try {
+                        await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
+                    } catch (OperationCanceledException) {
+                        break;
+                    }
+                }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs` around lines 86 -
91, The Redis retry path in TelegramTaskConsumer.ExecuteAsync currently awaits
Task.Delay(..., stoppingToken) inside the RedisException catch which can throw
OperationCanceledException out of that catch; change the catch (RedisException
ex) handling to perform the delay inside its own try/catch that catches
OperationCanceledException and breaks out of the main loop (mirroring
AgentRegistryService.ExecuteAsync) so shutdown exits cleanly; also consider
replacing the hardcoded 1s backoff with a shared configurable backoff value (or
align it to the 5s used by AgentRegistryService) so both services use the same
retry/backoff policy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs`:
- Around line 192-198: The warning message in AgentRegistryService currently
hardcodes "retrying in 5 s" while the actual delay uses Math.Max(5,
Env.AgentHeartbeatIntervalSeconds); update the maintenance loop to compute the
delay (e.g., var retryDelay = TimeSpan.FromSeconds(Math.Max(5,
Env.AgentHeartbeatIntervalSeconds))) and use that variable in both the await
Task.Delay(...) and in the _logger.LogWarning call so the log shows the real
retryDelay value (reference: AgentRegistryService,
Env.AgentHeartbeatIntervalSeconds, _logger.LogWarning).

---

Nitpick comments:
In `@TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs`:
- Around line 86-91: The Redis retry path in TelegramTaskConsumer.ExecuteAsync
currently awaits Task.Delay(..., stoppingToken) inside the RedisException catch
which can throw OperationCanceledException out of that catch; change the catch
(RedisException ex) handling to perform the delay inside its own try/catch that
catches OperationCanceledException and breaks out of the main loop (mirroring
AgentRegistryService.ExecuteAsync) so shutdown exits cleanly; also consider
replacing the hardcoded 1s backoff with a shared configurable backoff value (or
align it to the 5s used by AgentRegistryService) so both services use the same
retry/backoff policy.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf0b5111-1305-4f4d-bfd2-9e6e4787c3a4

📥 Commits

Reviewing files that changed from the base of the PR and between f605f3b and bcd25c2.

📒 Files selected for processing (2)
  • TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs
  • TelegramSearchBot/Service/AI/LLM/TelegramTaskConsumer.cs

Comment on lines +192 to +198
_logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in 5 s");
} catch (Exception ex) {
_logger.LogError(ex, "Unexpected error in AgentRegistryService maintenance");
}

try {
await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Log message understates the actual retry delay.

The warning hard-codes "retrying in 5 s", but the subsequent delay is Math.Max(5, Env.AgentHeartbeatIntervalSeconds). If AgentHeartbeatIntervalSeconds is configured above 5, the log will misreport the retry interval and make field diagnostics harder.

💡 Proposed tweak
-                } catch (RedisException ex) {
-                    _logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in 5 s");
+                } catch (RedisException ex) {
+                    var retrySeconds = Math.Max(5, Env.AgentHeartbeatIntervalSeconds);
+                    _logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in {RetrySeconds} s", retrySeconds);
                 } catch (Exception ex) {
                     _logger.LogError(ex, "Unexpected error in AgentRegistryService maintenance");
                 }
 
                 try {
-                    await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);
+                    await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in 5 s");
} catch (Exception ex) {
_logger.LogError(ex, "Unexpected error in AgentRegistryService maintenance");
}
try {
await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);
} catch (RedisException ex) {
var retrySeconds = Math.Max(5, Env.AgentHeartbeatIntervalSeconds);
_logger.LogWarning(ex, "Redis error in AgentRegistryService maintenance, retrying in {RetrySeconds} s", retrySeconds);
} catch (Exception ex) {
_logger.LogError(ex, "Unexpected error in AgentRegistryService maintenance");
}
try {
await Task.Delay(TimeSpan.FromSeconds(Math.Max(5, Env.AgentHeartbeatIntervalSeconds)), stoppingToken);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TelegramSearchBot/Service/AI/LLM/AgentRegistryService.cs` around lines 192 -
198, The warning message in AgentRegistryService currently hardcodes "retrying
in 5 s" while the actual delay uses Math.Max(5,
Env.AgentHeartbeatIntervalSeconds); update the maintenance loop to compute the
delay (e.g., var retryDelay = TimeSpan.FromSeconds(Math.Max(5,
Env.AgentHeartbeatIntervalSeconds))) and use that variable in both the await
Task.Delay(...) and in the _logger.LogWarning call so the log shows the real
retryDelay value (reference: AgentRegistryService,
Env.AgentHeartbeatIntervalSeconds, _logger.LogWarning).

@github-actions
Copy link
Copy Markdown
Contributor

🔍 PR检查报告

📋 检查概览

🧪 测试结果

平台 状态 详情
Ubuntu 🔴 失败 测试结果不可用
Windows 🔴 失败 测试结果不可用

📊 代码质量

  • ✅ 代码格式化检查
  • ✅ 安全漏洞扫描
  • ✅ 依赖包分析
  • ✅ 代码覆盖率收集

📁 测试产物

  • 测试结果文件已上传为artifacts
  • 代码覆盖率已上传到Codecov

🔗 相关链接


此报告由GitHub Actions自动生成

@ModerRAS ModerRAS merged commit cf98863 into master Apr 17, 2026
8 checks passed
@ModerRAS ModerRAS deleted the fix/redis-brpop-timeout branch April 17, 2026 05:02
ModerRAS added a commit that referenced this pull request Apr 17, 2026
…registration

- BackgroundServices (TelegramTaskConsumer, ChunkPollingService, AgentRegistryService)
  now early-return when EnableLLMAgentProcess is disabled, preventing unnecessary
  Redis calls and potential crashes from RedisTimeoutException

- Agent process BRPOP timeout reduced from 5s to 2s to avoid race with SE.Redis
  async timeout (same fix as PR #267 for the main process)

- AgentLoopService main loop now catches RedisException with retry delay,
  preventing a single transient Redis failure from crashing the agent process

- Heartbeat loop catches OperationCanceledException and RedisException,
  preventing unobserved exceptions during shutdown or transient failures

- Agent process now registers LLM project tools (FileToolService, BashToolService)
  via two-assembly McpToolHelper.EnsureInitialized, giving the agent access to
  file operations and shell execution instead of only echo/calculator/send_message

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ModerRAS added a commit that referenced this pull request Apr 17, 2026
…registration (#268)

- BackgroundServices (TelegramTaskConsumer, ChunkPollingService, AgentRegistryService)
  now early-return when EnableLLMAgentProcess is disabled, preventing unnecessary
  Redis calls and potential crashes from RedisTimeoutException

- Agent process BRPOP timeout reduced from 5s to 2s to avoid race with SE.Redis
  async timeout (same fix as PR #267 for the main process)

- AgentLoopService main loop now catches RedisException with retry delay,
  preventing a single transient Redis failure from crashing the agent process

- Heartbeat loop catches OperationCanceledException and RedisException,
  preventing unobserved exceptions during shutdown or transient failures

- Agent process now registers LLM project tools (FileToolService, BashToolService)
  via two-assembly McpToolHelper.EnsureInitialized, giving the agent access to
  file operations and shell execution instead of only echo/calculator/send_message

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant