
feat: export serialized HDR Histogram to enable statistical significance testing of results #452

@SamBarker

Description

Motivation

OMB is commonly used to compare two configurations — different broker settings, different message sizes, or middleware introduced into the messaging path. The result JSON provides useful summary statistics, but these make it difficult to answer a critical question: is the measured difference statistically significant, or within the noise floor of the measurement?

A result showing "configuration B p99 = 14.6ms vs configuration A p99 = 14.1ms" looks like a meaningful difference, but may be indistinguishable from run-to-run variability without a significance test.

Related

Issue #247 correctly identified and addressed coordinated omission in load generation by recording publishDelayLatency. That work ensures the latency data captured is honest. This issue is about making that data statistically comparable across runs.

Current state

The per-window latency arrays (e.g. endToEndLatency99pct[]) are a tempting input for a Mann-Whitney U test (MWU) — a non-parametric significance test well suited to distributions with long tails. However, each entry in those arrays is already an aggregated statistic (the p99 over one reporting window), not a raw latency observation. MWU on these values tests whether the distribution of per-window p99s differs between runs, which is a meaningful but weaker claim than testing individual message latencies. The number of windows is also modest — determined by test duration and window size — leaving limited statistical power.
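For concreteness, the per-window workaround looks like the sketch below. The `run_a_p99` / `run_b_p99` arrays are hypothetical values standing in for `endToEndLatency99pct[]` extracted from two result JSON files, and the Mann-Whitney U statistic with a normal-approximation p-value is implemented inline only to keep the sketch dependency-free (in practice `scipy.stats.mannwhitneyu` would be the natural choice, and its exact method matters at small sample sizes). Note that the sample size here is the number of reporting windows, not the number of messages, which is exactly the power limitation described above:

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    A dependency-free sketch only; an exact test is preferable at the
    small sample sizes that per-window aggregates produce.
    """
    n1, n2 = len(a), len(b)
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    # Assign ranks, averaging over runs of tied values.
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < n1 + n2:
        j = i
        while j < n1 + n2 and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2.0  # average of 1-based ranks i+1 .. j
        i = j
    rank_sum_a = sum(r for r, (_, src) in zip(ranks, pooled) if src == 0)
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return u, p

# Hypothetical per-window p99s from two runs: only as many samples as
# there were reporting windows, regardless of how many messages each
# run actually measured.
run_a_p99 = [14.1, 13.9, 14.3, 14.0, 14.2, 13.8, 14.1, 14.0]
run_b_p99 = [14.6, 14.8, 14.5, 14.9, 14.4, 14.7, 14.6, 14.5]
u, p = mann_whitney_u(run_a_p99, run_b_p99)
```

Whatever passes this test is a statement about per-window p99s, not about message latencies, which is why the serialized histograms below are the better input.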

Proposed change

OMB already maintains HDR Histograms internally throughout each run and discards them after extracting the summary scalars. Serializing them into the result JSON would be a small change with significant analytical value:

"publishLatencyHistogram": "<base64-encoded HdrHistogram>",
"endToEndLatencyHistogram": "<base64-encoded HdrHistogram>"

HdrHistogram provides encodeIntoCompressedByteBuffer() for this. The encoded form is compact (typically a few kilobytes per histogram).

What this enables

With the full histogram from two runs, consumers can perform:

  • Mann-Whitney U test (MWU) — a non-parametric significance test on the full distributions without the aggregation problem of the per-window workaround
  • Kolmogorov-Smirnov test — compares the full CDFs directly, particularly sensitive to tail differences
  • Bootstrapped confidence intervals — resample from the histogram to produce honest confidence intervals on any percentile including p99
  • Effect size — fraction of requests where one configuration outperforms the other
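As a concrete example of the bootstrap item above, the sketch below computes a confidence interval for p99 from a histogram exposed as parallel value/count arrays (HdrHistogram's recorded-value iteration yields exactly this shape; the base64 decoding step is omitted here). This is a hedged, stdlib-only illustration of what a consumer could do with the serialized data, not a proposed OMB API:

```python
import bisect
import random

def percentile(values, counts, q):
    """q-th percentile (0-100) of a distribution given as sorted
    (value, count) bucket pairs."""
    target = q / 100.0 * sum(counts)
    running = 0
    for v, c in zip(values, counts):
        running += c
        if running >= target:
            return v
    return values[-1]

def bootstrap_p99_ci(values, counts, iters=1000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for p99.

    Resampling bucket counts from the histogram is equivalent to
    resampling the raw observations it summarizes, up to bucket
    resolution.
    """
    rng = random.Random(seed)
    total = sum(counts)
    cumulative = []
    running = 0
    for c in counts:
        running += c
        cumulative.append(running)
    estimates = []
    for _ in range(iters):
        resampled = [0] * len(values)
        for _ in range(total):
            # Draw one observation: pick a bucket with probability
            # proportional to its count.
            resampled[bisect.bisect_right(cumulative, rng.randrange(total))] += 1
        estimates.append(percentile(values, resampled, 99.0))
    estimates.sort()
    return (estimates[int(alpha / 2 * iters)],
            estimates[int((1 - alpha / 2) * iters) - 1])
```

The same value/count view also supports the effect-size item directly: walk both histograms and estimate the probability that a draw from one configuration is lower than a draw from the other.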

Backwards compatibility

Adding optional fields to the result JSON is non-breaking. Existing consumers that do not understand the new fields will ignore them.

Context

We are using OMB as part of benchmarking work for Kroxylicious, a Kafka protocol proxy. Comparing proxy and baseline runs is exactly the kind of analysis this would support, but the need is general to any comparative OMB usage.

I am happy to work on implementing this if the contribution would be welcome.
