
feat(taskbroker): Mark Activations as Processing Immediately in Push Mode #619

Open
george-sentry wants to merge 4 commits into main from
george/push-taskbroker/revert-on-send-fail

Conversation

@george-sentry
Member

Using an intermediate claimed state and marking activations as processing one at a time imposes a severe bottleneck on our throughput. Instead of going that route, we can just manually revert tasks that failed to send back to pending in the push loop. This will happen rarely and therefore limit impact on throughput during regular operation.

Before (5.5K TPS): [throughput graph]

After (8.5K TPS): [throughput graph]
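A minimal sketch of the push-loop behavior described above, using hypothetical names rather than the real taskbroker API: mark each claimed activation as Processing up front, attempt the send, and manually revert failed sends back to Pending.

```rust
// Hypothetical model of "revert on send fail": `Status`, `Activation`,
// `push_batch`, and `send` are illustrative stand-ins, not the real API.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Processing,
}

pub struct Activation {
    pub id: u32,
    pub status: Status,
}

/// `send` stands in for the real Kafka/worker publish call.
/// Returns the number of activations successfully sent.
pub fn push_batch(batch: &mut [Activation], send: impl Fn(u32) -> bool) -> usize {
    let mut sent = 0;
    for act in batch.iter_mut() {
        // Mark Processing immediately, with no intermediate claimed state.
        act.status = Status::Processing;
        if send(act.id) {
            sent += 1;
        } else {
            // Send failed: manually revert to Pending. This path is rare,
            // so it has little impact on steady-state throughput.
            act.status = Status::Pending;
        }
    }
    sent
}
```

The revert happens inline in the push loop, so the common case (successful send) pays no extra round trip for an intermediate state transition.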

@george-sentry george-sentry requested a review from a team as a code owner May 1, 2026 20:02
Comment thread src/store/adapters/postgres.rs Outdated
Comment thread src/push/mod.rs
Comment thread src/store/adapters/postgres.rs Outdated
Comment thread src/store/adapters/sqlite.rs
Comment thread src/upkeep.rs
);
if let Ok(tasks) = store
-    .claim_activations(None, Some(&demoted_namespaces), None, None, false)
+    .claim_activations(None, Some(&demoted_namespaces), None, None)
Bug: Tasks for demoted namespaces are incorrectly marked as failed and discarded if the initial Kafka publish fails, because their status is not reverted to Pending.
Severity: HIGH

Suggested Fix

In the upkeep forwarding path for demoted namespaces, if a Kafka publish fails, the corresponding task should be reverted from Processing back to Pending. This prevents the processing_attempts counter from being incremented and the task from being prematurely marked as failed due to a transient publish error.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/upkeep.rs#L322

Potential issue: For demoted namespace upkeep, tasks are marked as `Processing`. If the
subsequent Kafka publish fails, the task remains in this state. The
`handle_processing_deadline` function will then repeatedly increment the
`processing_attempts` counter on each upkeep cycle. Once `processing_attempts` reaches
`max_processing_attempts`, the task is incorrectly marked as `Failure` and discarded or
sent to a dead-letter queue, even though no worker ever attempted to process it. This
can lead to permanent task loss due to transient Kafka publish failures.
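One shape the suggested fix could take, sketched with hypothetical names (the real store and producer APIs differ): attempt each publish in the demoted-namespace forwarding loop, and revert any activation whose publish fails from Processing back to Pending.

```rust
// Illustrative sketch only: `forward_demoted` and `publish` are stand-ins
// for the real upkeep forwarding path and Kafka producer.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Processing,
}

/// Forward claimed (id, status) pairs; revert failed publishes to Pending
/// and return their ids so the caller can log or retry them.
pub fn forward_demoted(
    claimed: &mut Vec<(u32, Status)>,
    publish: impl Fn(u32) -> Result<(), ()>,
) -> Vec<u32> {
    let mut failed = Vec::new();
    for (id, status) in claimed.iter_mut() {
        debug_assert_eq!(*status, Status::Processing); // claimed as Processing
        if publish(*id).is_err() {
            // Revert so processing_attempts is never incremented for a
            // transient publish error; the task stays eligible for retry.
            *status = Status::Pending;
            failed.push(*id);
        }
    }
    failed
}
```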

Member Author
This is how it worked before I started the push mode business so it's fine.

Comment thread src/push/mod.rs
Comment thread src/upkeep.rs
);
if let Ok(tasks) = store
-    .claim_activations(None, Some(&demoted_namespaces), None, None, false)
+    .claim_activations(None, Some(&demoted_namespaces), None, None)

Upkeep forwarding now incorrectly uses Processing state

High Severity

The upkeep demoted-namespace forwarding path previously called claim_activations with mark_processing: false, placing tasks in Claimed state. Now it always marks them as Processing. Tasks that fail to forward to Kafka remain in Processing state with no undo_claim_activation call. When handle_processing_deadline eventually catches them, it increments processing_attempts (which handle_claim_expiration did not), and for at_most_once tasks it moves them directly to Failure — causing silent task loss for a transient Kafka outage.
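The failure mode described here can be modeled in a few lines. This is a hypothetical simplification of `handle_processing_deadline`, not the real implementation: each sweep of a task stuck in Processing burns a retry attempt, and an at-most-once task fails on the first sweep even though no worker ever ran it.

```rust
// Minimal model of the reported bug; all names are illustrative.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Processing,
    Failure,
}

pub struct Task {
    pub status: Status,
    pub processing_attempts: u32,
    pub at_most_once: bool,
}

/// Stand-in for handle_processing_deadline: a task past its deadline either
/// fails immediately (at-most-once) or consumes one retry attempt.
pub fn sweep_deadline(task: &mut Task, max_processing_attempts: u32) {
    if task.status != Status::Processing {
        return;
    }
    task.processing_attempts += 1;
    if task.at_most_once || task.processing_attempts >= max_processing_attempts {
        task.status = Status::Failure; // silent loss for a transient outage
    } else {
        task.status = Status::Pending;
    }
}
```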


Reviewed by Cursor Bugbot for commit 0a3be17.

Member Author

This is how it worked before I started the push mode business so it's fine.

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 284e487.

Comment thread src/push/mod.rs
.record(received_to_push_latency as f64);
}
}


Missing undo on continue leaves tasks stuck Processing

High Severity

When a task's application has no worker pool mapping, both the main loop and the drain loop continue without calling undo_claim_activation. Since claim_activations_for_push now marks tasks as Processing (instead of the old Claimed status), these skipped tasks remain stuck in Processing. When handle_processing_deadline eventually fires, at-most-once tasks are moved to Failure (data loss — they were never actually sent), and regular tasks have processing_attempts incremented, burning retry budget for a configuration error.
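A sketch of what the fix could look like, with illustrative names (the real loop, store API, and pool mapping differ): when an application has no worker-pool mapping, revert the claim before `continue` so the task goes back to Pending instead of idling in Processing.

```rust
// Hypothetical routing loop; `route_tasks` and `Task` are stand-ins.
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Processing,
}

pub struct Task {
    pub app: &'static str,
    pub status: Status,
}

/// Route claimed tasks to worker pools; returns the number routed.
pub fn route_tasks(tasks: &mut [Task], pools: &HashMap<&str, &str>) -> usize {
    let mut routed = 0;
    for task in tasks.iter_mut() {
        if !pools.contains_key(task.app) {
            // No pool mapping: revert the claim (undo_claim_activation in
            // the real code) instead of skipping silently, so the
            // processing-deadline sweep never sees this task as stuck.
            task.status = Status::Pending;
            continue;
        }
        task.status = Status::Processing;
        routed += 1;
    }
    routed
}
```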


Reviewed by Cursor Bugbot for commit 284e487.

max_processing_attempts: config.max_processing_attempts,
vacuum_page_count: config.vacuum_page_count,
-processing_deadline_grace_sec: config.processing_deadline_grace_sec,
+processing_deadline_grace_sec,

Dead field claim_lease_ms no longer read anywhere

Low Severity

The claim_lease_ms field in both PostgresActivationStoreConfig and InflightActivationStoreConfig is initialized but never read. Its only consumer was the removed Claimed branch inside claim_activations. The field, its doc comment, and its initialization are now dead code.


Reviewed by Cursor Bugbot for commit 284e487.
