
feat: export serialized HDR Histogram to enable statistical significance testing of results #452

@SamBarker

Description

Motivation

OMB is commonly used to compare two configurations — different broker settings, different message sizes, or middleware introduced into the messaging path. The result JSON provides useful summary statistics, but these make it difficult to answer a critical question: is the measured difference statistically significant, or within the noise floor of the measurement?

A result showing "configuration B p99 = 14.6ms vs configuration A p99 = 14.1ms" looks like a meaningful difference, but may be indistinguishable from run-to-run variability without a significance test.

Related

Issue #247 correctly identified and addressed coordinated omission in load generation by recording publishDelayLatency. That work ensures the latency data captured is honest. This issue is about making that data statistically comparable across runs.

Current state

The per-window latency arrays (e.g. endToEndLatency99pct[]) are a tempting input for a Mann-Whitney U test (MWU) — a non-parametric significance test well suited to distributions with long tails. However, each entry in those arrays is already an aggregated statistic (the p99 over one reporting window), not a raw latency observation. MWU on these values tests whether the distribution of per-window p99s differs between runs, which is a meaningful but weaker claim than testing individual message latencies. The number of windows is also modest — determined by test duration and window size — leaving limited statistical power.
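For concreteness, the per-window workaround looks like the sketch below. The `run_a_p99` / `run_b_p99` arrays are hypothetical values standing in for `endToEndLatency99pct[]` extracted from two result JSON files, and the Mann-Whitney U statistic with a normal-approximation p-value is implemented inline only to keep the sketch dependency-free (in practice `scipy.stats.mannwhitneyu` would be the natural choice, and its exact method matters at small sample sizes). Note that the sample size here is the number of reporting windows, not the number of messages, which is exactly the power limitation described above:

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    A dependency-free sketch only; an exact test is preferable at the
    small sample sizes that per-window aggregates produce.
    """
    n1, n2 = len(a), len(b)
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    # Assign ranks, averaging over runs of tied values.
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < n1 + n2:
        j = i
        while j < n1 + n2 and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2.0  # average of 1-based ranks i+1 .. j
        i = j
    rank_sum_a = sum(r for r, (_, src) in zip(ranks, pooled) if src == 0)
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return u, p

# Hypothetical per-window p99s from two runs: only as many samples as
# there were reporting windows, regardless of how many messages each
# run actually measured.
run_a_p99 = [14.1, 13.9, 14.3, 14.0, 14.2, 13.8, 14.1, 14.0]
run_b_p99 = [14.6, 14.8, 14.5, 14.9, 14.4, 14.7, 14.6, 14.5]
u, p = mann_whitney_u(run_a_p99, run_b_p99)
```

Whatever passes this test is a statement about per-window p99s, not about message latencies, which is why the serialized histograms below are the better input.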

Proposed change

OMB already maintains HDR Histograms internally throughout each run and discards them after extracting the summary scalars. Serializing them into the result JSON would be a small change with significant analytical value:

"publishLatencyHistogram": "<base64-encoded HdrHistogram>",
"endToEndLatencyHistogram": "<base64-encoded HdrHistogram>"

HdrHistogram provides encodeIntoCompressedByteBuffer() for this. The encoded form is compact (typically a few kilobytes per histogram).

What this enables

With the full histogram from two runs, consumers can perform:

  • Mann-Whitney U test (MWU) — a non-parametric significance test on the full distributions without the aggregation problem of the per-window workaround
  • Kolmogorov-Smirnov test — compares the full CDFs directly, particularly sensitive to tail differences
  • Bootstrapped confidence intervals — resample from the histogram to produce honest confidence intervals on any percentile including p99
  • Effect size — fraction of requests where one configuration outperforms the other
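As a concrete example of the bootstrap item above, the sketch below computes a confidence interval for p99 from a histogram exposed as parallel value/count arrays (HdrHistogram's recorded-value iteration yields exactly this shape; the base64 decoding step is omitted here). This is a hedged, stdlib-only illustration of what a consumer could do with the serialized data, not a proposed OMB API:

```python
import bisect
import random

def percentile(values, counts, q):
    """q-th percentile (0-100) of a distribution given as sorted
    (value, count) bucket pairs."""
    target = q / 100.0 * sum(counts)
    running = 0
    for v, c in zip(values, counts):
        running += c
        if running >= target:
            return v
    return values[-1]

def bootstrap_p99_ci(values, counts, iters=1000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for p99.

    Resampling bucket counts from the histogram is equivalent to
    resampling the raw observations it summarizes, up to bucket
    resolution.
    """
    rng = random.Random(seed)
    total = sum(counts)
    cumulative = []
    running = 0
    for c in counts:
        running += c
        cumulative.append(running)
    estimates = []
    for _ in range(iters):
        resampled = [0] * len(values)
        for _ in range(total):
            # Draw one observation: pick a bucket with probability
            # proportional to its count.
            resampled[bisect.bisect_right(cumulative, rng.randrange(total))] += 1
        estimates.append(percentile(values, resampled, 99.0))
    estimates.sort()
    return (estimates[int(alpha / 2 * iters)],
            estimates[int((1 - alpha / 2) * iters) - 1])
```

The same value/count view also supports the effect-size item directly: walk both histograms and estimate the probability that a draw from one configuration is lower than a draw from the other.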

Backwards compatibility

Adding optional fields to the result JSON is non-breaking. Existing consumers that do not understand the new fields will ignore them.

Context

We are using OMB as part of benchmarking work for Kroxylicious, a Kafka protocol proxy. Comparing proxy and baseline runs is exactly the kind of analysis this would support, but the need is general to any comparative OMB usage.

I am happy to work on implementing this if the contribution would be welcome.
