feat(taskbroker): Mark Activations as Processing Immediately in Push Mode #619
george-sentry wants to merge 4 commits into main
Conversation
```diff
 );
 if let Ok(tasks) = store
-    .claim_activations(None, Some(&demoted_namespaces), None, None, false)
+    .claim_activations(None, Some(&demoted_namespaces), None, None)
```
Bug: Tasks for demoted namespaces are incorrectly marked as failed and discarded if the initial Kafka publish fails, because their status is not reverted to Pending.
Severity: HIGH
Suggested Fix
In the upkeep forwarding path for demoted namespaces, if a Kafka publish fails, the corresponding task should be reverted from Processing back to Pending. This prevents the processing_attempts counter from being incremented and the task from being prematurely marked as failed due to a transient publish error.
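The suggested fix can be sketched with stand-in types (the `Status`, `Task`, `publish`, and `forward` names below are hypothetical, not the real taskbroker API): mark the task `Processing` for forwarding, and revert it to `Pending` if the publish fails so no attempt is consumed.

```rust
// Hypothetical sketch of the suggested fix: revert a task to Pending when
// the Kafka publish fails, so processing_attempts is never burned on a
// transient publish error. All names are stand-ins for the real types.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Processing,
}

struct Task {
    status: Status,
    processing_attempts: u32,
}

// Stand-in for the real Kafka producer; Err models a transient outage.
fn publish(_task: &Task, broker_up: bool) -> Result<(), &'static str> {
    if broker_up { Ok(()) } else { Err("kafka unavailable") }
}

fn forward(task: &mut Task, broker_up: bool) {
    task.status = Status::Processing; // claimed for forwarding
    if publish(task, broker_up).is_err() {
        // Revert instead of leaving the task stuck in Processing, where
        // handle_processing_deadline would later burn an attempt.
        task.status = Status::Pending;
    }
}

fn main() {
    let mut task = Task { status: Status::Pending, processing_attempts: 0 };
    forward(&mut task, false); // publish fails
    assert_eq!(task.status, Status::Pending);
    assert_eq!(task.processing_attempts, 0);
}
```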
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/upkeep.rs#L322
Potential issue: For demoted namespace upkeep, tasks are marked as `Processing`. If the
subsequent Kafka publish fails, the task remains in this state. The
`handle_processing_deadline` function will then repeatedly increment the
`processing_attempts` counter on each upkeep cycle. Once `processing_attempts` reaches
`max_processing_attempts`, the task is incorrectly marked as `Failure` and discarded or
sent to a dead-letter queue, even though no worker ever attempted to process it. This
can lead to permanent task loss due to transient Kafka publish failures.
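The escalation described above can be modeled in a few lines (illustrative only; field and function names are stand-ins for the real taskbroker implementation): a task stuck in `Processing` has its attempt counter bumped each upkeep cycle until it is marked `Failure`, although no worker ever ran it.

```rust
// Illustrative-only model of the reported failure mode: a stuck
// Processing task accrues processing_attempts each upkeep cycle and is
// eventually marked Failure (discarded or dead-lettered).

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Processing,
    Failure,
}

struct Task {
    status: Status,
    processing_attempts: u32,
}

// Stand-in for the real deadline handler described in the comment above.
fn handle_processing_deadline(task: &mut Task, max_processing_attempts: u32) {
    if task.status == Status::Processing {
        task.processing_attempts += 1;
        if task.processing_attempts >= max_processing_attempts {
            task.status = Status::Failure; // task is lost
        }
    }
}

fn main() {
    let mut task = Task { status: Status::Processing, processing_attempts: 0 };
    for _ in 0..3 {
        handle_processing_deadline(&mut task, 3);
    }
    assert_eq!(task.status, Status::Failure);
}
```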
This is how it worked before I started the push mode business, so it's fine.
```diff
 );
 if let Ok(tasks) = store
-    .claim_activations(None, Some(&demoted_namespaces), None, None, false)
+    .claim_activations(None, Some(&demoted_namespaces), None, None)
```
Upkeep forwarding now incorrectly uses Processing state
High Severity
The upkeep demoted-namespace forwarding path previously called claim_activations with mark_processing: false, placing tasks in Claimed state. Now it always marks them as Processing. Tasks that fail to forward to Kafka remain in Processing state with no undo_claim_activation call. When handle_processing_deadline eventually catches them, it increments processing_attempts (which handle_claim_expiration did not), and for at_most_once tasks it moves them directly to Failure — causing silent task loss for a transient Kafka outage.
Reviewed by Cursor Bugbot for commit 0a3be17.
This is how it worked before I started the push mode business, so it's fine.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 284e487.
```rust
        .record(received_to_push_latency as f64);
    }
}
```
Missing undo on continue leaves tasks stuck Processing
High Severity
When a task's application has no worker pool mapping, both the main loop and the drain loop continue without calling undo_claim_activation. Since claim_activations_for_push now marks tasks as Processing (instead of the old Claimed status), these skipped tasks remain stuck in Processing. When handle_processing_deadline eventually fires, at-most-once tasks are moved to Failure (data loss — they were never actually sent), and regular tasks have processing_attempts incremented, burning retry budget for a configuration error.
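A minimal sketch of the fix this comment implies, with assumed shapes (the `push_loop`, `undo_claim_activation`, and pool-mapping names are stand-ins): revert the claim before `continue` so a missing worker pool mapping does not leak a `Processing` task.

```rust
// Hypothetical sketch: when no worker pool is mapped for a task's
// application, undo the claim before `continue` instead of leaving the
// task stuck in Processing. Names are stand-ins for the real code.
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Processing,
}

struct Task {
    app: &'static str,
    status: Status,
}

// Stand-in for the store call that reverts a claimed activation.
fn undo_claim_activation(task: &mut Task) {
    task.status = Status::Pending;
}

fn push_loop(tasks: &mut [Task], pools: &HashMap<&str, &str>) {
    for task in tasks.iter_mut() {
        task.status = Status::Processing; // claimed for push
        let Some(_pool) = pools.get(task.app) else {
            // Configuration error: no pool for this app. Revert the claim
            // so retry budget is not burned by the processing deadline.
            undo_claim_activation(task);
            continue;
        };
        // ... push to the pool's Kafka topic here ...
    }
}

fn main() {
    let mut tasks = vec![Task { app: "unmapped-app", status: Status::Pending }];
    let pools: HashMap<&str, &str> = HashMap::new();
    push_loop(&mut tasks, &pools);
    assert_eq!(tasks[0].status, Status::Pending);
}
```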
Additional Locations (1)
```diff
 max_processing_attempts: config.max_processing_attempts,
 vacuum_page_count: config.vacuum_page_count,
-processing_deadline_grace_sec: config.processing_deadline_grace_sec,
+processing_deadline_grace_sec,
```
Dead field claim_lease_ms no longer read anywhere
Low Severity
The claim_lease_ms field in both PostgresActivationStoreConfig and InflightActivationStoreConfig is initialized but never read. Its only consumer was the removed Claimed branch inside claim_activations. The field, its doc comment, and its initialization are now dead code.
Additional Locations (1)


Using an intermediate Claimed state and marking activations as Processing one at a time imposed a severe bottleneck on throughput. Instead of going that route, we can manually revert tasks that failed to send back to Pending inside the push loop. Send failures are rare, so this limits the impact on throughput during regular operation.
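The approach above can be sketched as follows (assumed shapes, not the real taskbroker API): claim the batch straight into `Processing`, attempt each push, and revert only failed sends back to `Pending`, keeping the hot path to a single batch update.

```rust
// Sketch of the described approach with stand-in types: mark Processing
// up front (no intermediate Claimed state), push, and revert only the
// rare failures back to Pending.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Processing,
}

struct Task {
    id: u64,
    status: Status,
}

// Stand-in for the Kafka send; fails for one task to model a transient error.
fn try_push(task: &Task) -> Result<(), &'static str> {
    if task.id == 2 { Err("send failed") } else { Ok(()) }
}

// Returns the ids that were reverted so they can be retried later.
fn push_batch(tasks: &mut [Task]) -> Vec<u64> {
    let mut reverted = Vec::new();
    for task in tasks.iter_mut() {
        task.status = Status::Processing; // marked immediately, no Claimed step
        if try_push(task).is_err() {
            // Rare path: revert to Pending so the task is retried without
            // consuming processing_attempts.
            task.status = Status::Pending;
            reverted.push(task.id);
        }
    }
    reverted
}

fn main() {
    let mut tasks = vec![
        Task { id: 1, status: Status::Pending },
        Task { id: 2, status: Status::Pending },
    ];
    let reverted = push_batch(&mut tasks);
    assert_eq!(reverted, vec![2]);
    assert_eq!(tasks[0].status, Status::Processing);
    assert_eq!(tasks[1].status, Status::Pending);
}
```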
Before (5.5K TPS)

After (8.5K TPS)
