Better replacement latency #908

Merged
ZanCorDX merged 4 commits into develop from dx/better-cancel-latency
Mar 27, 2026
@ZanCorDX
Collaborator

📝 Summary

Increased ORDERS_CONSUMED_PER_BATCH from 1024 to 4096.
Implemented cancellation of orders queued for simulation.

✅ I have completed the following steps:

  • Run make lint
  • Run make test
  • Added tests (if applicable)

@ZanCorDX ZanCorDX requested a review from dvush as a code owner March 26, 2026 12:53
@claude
Contributor

claude Bot commented Mar 26, 2026

Claude finished @ZanCorDX's task in 1m 39s —— View job


PR Review: Better replacement latency

  • Read repo guidelines
  • Review changed files
  • Post review feedback

Summary

This PR increases ORDERS_CONSUMED_PER_BATCH from 1024 to 4096 and adds cancellation support for orders queued in the simulation pipeline via a shared AtomicBool flag (CancellableSimulationRequest). The approach is sound — using Arc<AtomicBool> to propagate cancellations to already-queued simulation requests avoids wasting sim worker time on stale orders.
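
A minimal sketch of the idea described above, assuming a wrapper that pairs a queued request with a shared flag. Only the names `CancellableSimulationRequest` and `into_request` appear in the PR; the `SimulationRequest` payload, the `new` constructor, and the field names are hypothetical and may differ from the real code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the real simulation request payload.
#[derive(Debug, PartialEq)]
pub struct SimulationRequest {
    pub order_id: u64,
}

/// A queued request paired with a shared "is cancelled" flag.
pub struct CancellableSimulationRequest {
    request: SimulationRequest,
    cancelled: Arc<AtomicBool>,
}

impl CancellableSimulationRequest {
    /// Wraps a request and returns the handle the producer keeps to cancel it later.
    /// (Assumed constructor; not taken from the PR.)
    pub fn new(request: SimulationRequest) -> (Self, Arc<AtomicBool>) {
        let cancelled = Arc::new(AtomicBool::new(false));
        let handle = Arc::clone(&cancelled);
        (Self { request, cancelled }, handle)
    }

    /// Returns the request unless it was cancelled while waiting in the queue.
    pub fn into_request(self) -> Option<SimulationRequest> {
        if self.cancelled.load(Ordering::Relaxed) {
            None
        } else {
            Some(self.request)
        }
    }
}
```

In this sketch the producer stores `true` into the returned handle to cancel, and a worker that later calls `into_request()` gets `None` and skips the simulation.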

Issues found

Metrics skew on cancelled tasks (sim_worker.rs:61-62) — When a cancelled task is dequeued, last_sim_finished is not updated, so the next real task's sim_thread_wait_time will include time spent dequeuing cancelled tasks, inflating utilization metrics.

Stale comment (simulation_job.rs:274-276) — The comment in process_order_cancellation says "if we removed from in_flight_orders it was never sent" but it's in the else branch (order was not found), so the logic described is inverted.

Silent drop of cancelled orders (simulation_job.rs:165) — Orders popped from SimTree that aren't in in_flight_orders are silently dropped. A trace! log here would help debuggability.

Typo (simulation_job.rs:59) — "asociated" → "associated"

See inline comments for details and suggestions.

Comment thread crates/rbuilder/src/live_builder/simulation/sim_worker.rs
Comment thread crates/rbuilder/src/live_builder/simulation/simulation_job.rs
Comment thread crates/rbuilder/src/live_builder/simulation/simulation_job.rs Outdated
error!("Sim channel is closed, dropping order");
false
// @Metric
if let Some(cancel_handle) = self.in_flight_orders.get(&order_id) {

When a popped order is not in in_flight_orders (already cancelled), it's silently dropped here. This is correct behavior, but it would benefit from a trace! log for debuggability — cancelled orders vanishing without a trace could make production issues harder to diagnose.

ZanCorDX and others added 2 commits March 26, 2026 10:06
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
@claude
Contributor

claude Bot commented Mar 26, 2026

Claude finished @ZanCorDX's task in 1m 36s —— View job


Re-review after update

  • Read repo guidelines
  • Review changed files
  • Check if previous review issues were addressed
  • Post review feedback

Summary

The latest commits addressed 2 of the 4 issues from the previous review:

Previous issue | Status
Stale comment in process_order_cancellation | Fixed
Typo "asociated" → "associated" | Fixed
Metrics skew on cancelled tasks (sim_worker.rs) | Still open (see inline comment)
Silent drop of cancelled orders (simulation_job.rs:165) | Still open (see inline comment)

Remaining issues

  1. Metrics skew on cancelled tasks (sim_worker.rs:60-63) — When a cancelled task is dequeued, last_sim_finished is not updated. The next real task's sim_thread_wait_time will include time spent dequeuing cancelled tasks, inflating utilization metrics. Under high cancellation rates this could significantly distort the dashboard.

  2. Silent drop of cancelled orders (simulation_job.rs:165) — Orders popped from SimTree that aren't in in_flight_orders are silently dropped. A trace! log would help debuggability.

No new issues found in the update — the core cancellation mechanism using Arc<AtomicBool> is sound, and Relaxed ordering is appropriate here since there's no data dependency (a missed cancellation just means one extra simulation, which is handled by the in_flight_orders check on the result path).

Comment on lines +60 to +63
while let Ok(cancellable_task) = current_sim_context.requests.recv() {
if let Some(task) = cancellable_task.into_request() {
let sim_thread_wait_time = last_sim_finished.elapsed();
let sim_start = Instant::now();

Still an issue from previous review: when cancellable_task.into_request() returns None (cancelled), last_sim_finished is not updated. The next real task's sim_thread_wait_time will include the time spent dequeuing cancelled tasks, inflating utilization metrics.

Suggested change
while let Ok(cancellable_task) = current_sim_context.requests.recv() {
if let Some(task) = cancellable_task.into_request() {
let sim_thread_wait_time = last_sim_finished.elapsed();
let sim_start = Instant::now();
while let Ok(cancellable_task) = current_sim_context.requests.recv() {
if let Some(task) = cancellable_task.into_request() {
let sim_thread_wait_time = last_sim_finished.elapsed();
let sim_start = Instant::now();

Consider updating last_sim_finished in an else branch when skipping cancelled tasks, or moving the last_sim_finished = Instant::now() to just before into_request() is called.
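
One way the else-branch variant could look, as a sketch only. It uses `std::sync::mpsc` and models a cancelled request as `None`; the function name `run_worker` and the return type are hypothetical, and the real worker records metrics rather than returning them.

```rust
use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

/// Hypothetical worker loop. `None` models a request whose `into_request()`
/// returned `None` because it was cancelled while queued.
/// Returns each processed task together with the wait time measured for it.
pub fn run_worker(requests: Receiver<Option<u64>>) -> Vec<(u64, Duration)> {
    let mut processed = Vec::new();
    let mut last_sim_finished = Instant::now();
    while let Ok(maybe_task) = requests.recv() {
        match maybe_task {
            Some(task) => {
                // Wait time is measured since the last simulation or skip.
                let sim_thread_wait_time = last_sim_finished.elapsed();
                // ... the real worker would run the simulation for `task` here ...
                processed.push((task, sim_thread_wait_time));
                last_sim_finished = Instant::now();
            }
            None => {
                // Cancelled request: reset the clock so the time spent
                // dequeuing it is not attributed to the next real task.
                last_sim_finished = Instant::now();
            }
        }
    }
    processed
}
```

Whether skipped requests should count as waiting or as work is a metrics-definition choice, so this shows only one of the two options mentioned above.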

Comment thread crates/rbuilder/src/live_builder/simulation/simulation_job.rs

@dvush dvush left a comment

The problem this solves is that our cancellations are handled after simulation, which is non-optimal. This makes sure that an order is not simulated if it's cancelled. We clear all cancelled orders in this "pre-flight" state.

My main questions were how we handle these scenarios:

  1. When replacements are really fast, do we simulate anything at all?
  • we send O_1, O_2, O_3, ... that replace each other very fast
  • will the simulation thread take the latest O_k that is available right now when it's draining new orders and simulate it?
  • it seems this is handled as follows: sim threads drain the channel and skip every cancelled order, so they should eventually reach an O_k that is not cancelled and process it.
  2. Same as above, but how do we cancel things that were already sent to the builder?
  • we send O_1, O_2, O_3 very fast, but O_1 was simulated and sent to the builder
  • O_2 and O_3 are all cancelled in flight
  • Are we able to cancel O_1 in time?
  • it seems that we can: O_1 will not be in the "in_flight_orders" map, and in that case we send the cancellation to the builder thread directly
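
The two cancellation paths in scenario 2 can be sketched roughly as follows. This is a simplified model, not the PR's code: `SimulationJobSketch`, `CancelOutcome`, `track_order`, `on_simulation_result`, and the `Vec` standing in for the channel to the builder are all hypothetical; only `in_flight_orders`, `process_order_cancellation`, and the `cancel_handle.store(true, ...)` call come from the PR discussion.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Where a cancellation ended up (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum CancelOutcome {
    /// The order was still in flight; its shared flag was set.
    CancelledInFlight,
    /// The order was not in flight, so the builder is told directly.
    ForwardedToBuilder,
}

/// Simplified stand-in for the bookkeeping in the simulation job.
#[derive(Default)]
pub struct SimulationJobSketch {
    /// In-flight orders and their shared "is cancelled" flags.
    pub in_flight_orders: HashMap<u64, Arc<AtomicBool>>,
    /// Stand-in for the channel to the builder thread.
    pub builder_cancellations: Vec<u64>,
}

impl SimulationJobSketch {
    /// Registers a new order and returns the flag that travels with its request.
    pub fn track_order(&mut self, order_id: u64) -> Arc<AtomicBool> {
        let flag = Arc::new(AtomicBool::new(false));
        self.in_flight_orders.insert(order_id, Arc::clone(&flag));
        flag
    }

    pub fn process_order_cancellation(&mut self, order_id: u64) -> CancelOutcome {
        if let Some(cancel_handle) = self.in_flight_orders.remove(&order_id) {
            // Still queued or being simulated: mark it so workers skip it.
            cancel_handle.store(true, Ordering::Relaxed);
            CancelOutcome::CancelledInFlight
        } else {
            // Not in flight (e.g. already simulated and sent): notify the builder.
            self.builder_cancellations.push(order_id);
            CancelOutcome::ForwardedToBuilder
        }
    }

    /// Returns true if the result belongs to an order that is still in flight
    /// and should therefore be forwarded to the builder.
    pub fn on_simulation_result(&mut self, order_id: u64) -> bool {
        self.in_flight_orders.remove(&order_id).is_some()
    }
}
```

In this model a result that arrives for an order cancelled mid-simulation is dropped by `on_simulation_result`, which corresponds to the check on the result path mentioned in the review.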

@ZanCorDX
Collaborator Author

The problem this solves is that our cancellations are handled after simulation, which is non-optimal. This makes sure that an order is not simulated if it's cancelled. We clear all cancelled orders in this "pre-flight" state.

My main questions were how we handle these scenarios:

  1. When replacements are really fast, do we simulate anything at all?
  • we send O_1, O_2, O_3, ... that replace each other very fast
  • will the simulation thread take the latest O_k that is available right now when it's draining new orders and simulate it?
  • it seems this is handled as follows: sim threads drain the channel and skip every cancelled order, so they should eventually reach an O_k that is not cancelled and process it.
  2. Same as above, but how do we cancel things that were already sent to the builder?
  • we send O_1, O_2, O_3 very fast, but O_1 was simulated and sent to the builder
  • O_2 and O_3 are all cancelled in flight
  • Are we able to cancel O_1 in time?
  • it seems that we can: O_1 will not be in the "in_flight_orders" map, and in that case we send the cancellation to the builder thread directly

1 - O_1, O_2, O_3 turn into 5 messages: O_1, C_1 (cancel), O_2, C_2, O_3.
If the Cs arrive before the Os start to get simulated, we are fine and only O_3 will be simulated.
If an O has already started simulation it's not possible to cancel it, since there is no cancellation in the EVM :(.

2 - The generation of the Os and Cs is sequential, so C_1 will always come between O_1 and O_2.
If O_1 was sent to the builder, then as soon as O_2 arrives C_1 is generated and reaches the builder ASAP.
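
The first case can be illustrated with a small model, assuming every message is handled before any worker starts dequeuing. The `Message` enum and `simulated_orders` function are hypothetical names for this sketch, and a plain `std::sync::mpsc` channel stands in for the real simulation queue.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;

/// Hypothetical message stream: O_k is `Order(k)`, C_k is `Cancel(k)`.
pub enum Message {
    Order(u64),
    Cancel(u64),
}

/// Queues every order with a shared flag, applies cancellations in arrival
/// order, then drains the queue the way a worker would and returns the ids
/// that would actually be simulated.
pub fn simulated_orders(messages: &[Message]) -> Vec<u64> {
    let (tx, rx) = mpsc::channel::<(u64, Arc<AtomicBool>)>();
    let mut in_flight: HashMap<u64, Arc<AtomicBool>> = HashMap::new();
    for message in messages {
        match message {
            Message::Order(id) => {
                let flag = Arc::new(AtomicBool::new(false));
                in_flight.insert(*id, Arc::clone(&flag));
                tx.send((*id, flag)).expect("receiver is alive");
            }
            Message::Cancel(id) => {
                // Flip the flag of the queued order, if it is still in flight.
                if let Some(flag) = in_flight.remove(id) {
                    flag.store(true, Ordering::Relaxed);
                }
            }
        }
    }
    drop(tx); // Close the channel so the drain below terminates.
    rx.into_iter()
        .filter(|(_, cancelled)| !cancelled.load(Ordering::Relaxed))
        .map(|(id, _)| id)
        .collect()
}
```

Feeding it O_1, C_1, O_2, C_2, O_3 leaves only O_3 to simulate. If a worker had already picked up O_1 before C_1 arrived, O_1 would still be simulated, as noted above.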

@claude
Contributor

claude Bot commented Mar 27, 2026

Claude finished @ZanCorDX's task in 1m 53s —— View job


PR Review: Better replacement latency

  • Read repo guidelines
  • Review changed files
  • Post review feedback

Summary

Re-reviewed after the latest push (4 commits total). The core cancellation mechanism using Arc<AtomicBool> shared between SimulationJob and sim workers is sound. Relaxed ordering is appropriate since a missed cancellation just means one extra simulation, which is already handled by the in_flight_orders check on the result path.

Issues found

  1. Incorrect comment (simulation_job.rs:59) — Comment says "set the associated bool to false" but the code stores true to indicate cancellation. See inline comment.

  2. PR description mismatch — Description says ORDERS_CONSUMED_PER_BATCH was increased to 4096, but the code sets it to 8192. See inline comment.

No other new issues. The two items flagged in the previous review (metrics skew on cancelled tasks, silent drop logging) were discussed and dismissed by the author — no objection to those decisions.

/// Orders we got via new_order_sub and are still being processed (they could be inside the SimTree or in the sim queue)
/// and were not cancelled.
in_flight_orders: HashSet<OrderId>,
/// When we remove the order we set the associated bool to false (which is checked before simulating)

Incorrect comment: the AtomicBool represents "is cancelled" and is set to true when cancelling (line 273: cancel_handle.store(true, ...)). The comment says "set the associated bool to false" which is the opposite of what happens.

Suggested change
/// When we remove the order we set the associated bool to false (which is checked before simulating)
/// When we remove the order we set the associated bool to true (cancelled) which is checked before simulating.

/// Orders that blocking_consume_next_commands will consume.
/// A slow algorithm would check approx every 200ms, to fill this batch size it would take
/// 8192/.2 = 40960 order/sec which is even more than what we see in the whole slot for a busy block.
const ORDERS_CONSUMED_PER_BATCH: usize = 8192;

PR description says this was increased to 4096 but the actual value is 8192. Worth updating the PR description to match.

@ZanCorDX ZanCorDX merged commit 55bbd32 into develop Mar 27, 2026
13 checks passed
@ZanCorDX ZanCorDX deleted the dx/better-cancel-latency branch March 27, 2026 14:59