[DRAFT] FEAT: Benchmark Scenario #1662
Conversation
    adversarial_models: list[PromptChatTarget] | None = None,
) -> tuple[type[ScenarioStrategy], dict[str, str], list[AttackTechniqueSpec]]:
    """
    Build the Benchmark strategy class dynamically from SCENARIO_TECHNIQUES.
I think we can replace these at the factory level and simplify things a bunch. I'm going to take a stab at it.
There might be ways to simplify so we don't need to override _get_atomic_attacks_async either, but for now I think something like this would be good.
The fundamental architectural difference: this PR treats models as a strategy dimension (permuting them into enum members), requiring two different strategy classes and a _prepare_strategies override to reconcile them.
#1664 treats models as a runtime parameter (looping at create-time), keeping the strategy axis purely about technique selection — which is what it was designed for.
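For concreteness, a rough sketch of the create-time loop I mean. AttackTechniqueSpec and PromptChatTarget are the types from this PR, but the helper name, its signature, and the assumption that specs are dataclasses with an adversarial_target field are just illustrative:

```python
from dataclasses import replace


def build_attack_specs(
    technique_specs: list["AttackTechniqueSpec"],
    adversarial_models: list["PromptChatTarget"] | None = None,
) -> list["AttackTechniqueSpec"]:
    # Create-time loop: the strategy axis stays purely about technique
    # selection, and each technique spec is duplicated once per adversarial
    # model instead of models being permuted into new enum members.
    if not adversarial_models:
        return list(technique_specs)
    return [
        replace(spec, adversarial_target=model)  # assumes a dataclass spec with this field
        for spec in technique_specs
        for model in adversarial_models
    ]
```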
if adversarial_models:
    permuted_specs = []
    for model in adversarial_models:
Are model names definitely unique? Just thinking that if we have two models with the same name, we currently have a slight issue: e.g. with two "gpt-4o" model names we end up with two identical technique names, so the second model gets overwritten without any warning or error. Maybe we add a suffix to ensure unique names, or check for model label collisions early and raise a warning so it isn't silent?
(Oh, Rich's suggestion might remove this issue.)
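If we keep the list-of-targets shape and the issue survives that change, a rough sketch of the early check, assuming we already have one label per target:

```python
from collections import Counter


def _validate_model_labels(labels: list[str]) -> None:
    # Fail loudly on duplicate labels instead of letting the second
    # "gpt-4o" silently overwrite the first one's technique entry.
    duplicates = [label for label, count in Counter(labels).items() if count > 1]
    if duplicates:
        raise ValueError(
            f"Adversarial model labels must be unique; duplicates: {duplicates}. "
            "Rename the targets or pass explicit labels."
        )
```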
Description
Adds a benchmarking scenario to PyRIT to compare performance across adversarial targets. This is currently a draft PR, and there are several design conflicts to resolve before opening it for review.
The largest design tension is that get_strategy_class doesn't work with the factory pattern for scenario strategy generation, because for benchmarks the scenario instance itself changes the scenario strategy. The working solution is to intercept the lifecycle at several points in the scenario (_build_benchmark_strategy => _prepare_strategies => _get_atomic_attacks). This works but is very brittle: callers like registries see a "blank" version of the strategy, while at runtime the strategy is fully populated with live adversarial targets.
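For readers skimming, the interception chain is roughly the skeleton below; the method names are the ones from this PR, but the base class, signatures, and bodies are elided:

```python
class BenchmarkScenario:  # placeholder; the real class derives from the scenario base class
    def _build_benchmark_strategy(self, adversarial_models=None):
        # Builds the "blank" strategy class plus defaults and technique specs;
        # this is all that registries (and a future CLI) ever get to see.
        ...

    def _prepare_strategies(self, requested_strategies):
        # At runtime, swaps the blank strategy members for ones bound to the
        # live adversarial targets.
        ...

    async def _get_atomic_attacks_async(self):
        # Materializes one atomic attack per (technique, adversarial model) pair.
        ...
```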
We explicitly filter out non-adversarial attack strategies using a list of attack names in _build_benchmark_strategy, but this is also brittle. We have options for richer tagging. A cheap intervention could be to check whether adversarial_target is an attribute of the attack type. Another could be to use TargetCapabilities and add an is_adversarial tag, which the attack could pass through to the caller in the scenario. As-is, we're just keeping a literal list of attacks we know have adversarial targets.
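A rough sketch of that cheap attribute check, assuming we can inspect the attack class's constructor (the predicate name is made up):

```python
import inspect


def _supports_adversarial_target(attack_cls: type) -> bool:
    # Cheap check: instead of a hard-coded name list, ask whether the attack's
    # constructor takes an adversarial target at all. The parameter name is the
    # one from the description above; real attack classes may spell it differently.
    return "adversarial_target" in inspect.signature(attack_cls.__init__).parameters
```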
The original requirements asked for a list[PromptChatTarget] in the constructor. The issue is that targets don't know they're adversarial, so we need to label them with a human-readable name. model_name isn't guaranteed, and similar fields don't exist on the target, so we fall back on the identifier. Not a great design in my opinion; inferring the model name from a private attribute is also a yellow flag. We could change the constructor to accept dict[str, PromptChatTarget], where the key is a human-readable name, but that's less ergonomic.
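A sketch of how the dict-shaped constructor could normalize its input, assuming the target exposes an identifier via get_identifier(); the helper names and the model_name fallback are illustrative:

```python
def _normalize_adversarial_models(
    adversarial_models: dict[str, "PromptChatTarget"] | list["PromptChatTarget"] | None,
) -> dict[str, "PromptChatTarget"]:
    # Accept both shapes: an explicit {label: target} mapping, or a bare list
    # where we fall back on whatever the target itself exposes.
    if not adversarial_models:
        return {}
    if isinstance(adversarial_models, dict):
        return dict(adversarial_models)
    return {_fallback_label(target): target for target in adversarial_models}


def _fallback_label(target: "PromptChatTarget") -> str:
    # Prefer a human-readable model_name if the target happens to have one;
    # otherwise fall back on its identifier, as described above.
    model_name = getattr(target, "model_name", None)
    return model_name if model_name else str(target.get_identifier())
```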
There's explicitly no CLI support, and there can't be because of the get_strategy_class issue. This will have downstream implications for the GUI that I'd like to fix.
Scenarios are designed to be plug-and-play. Do we need a list of default adversarial targets?
_build_benchmark_strategy is a huge function that returns a 3-tuple and should be refactored. It does too much, but I'm not sure how to refactor it while keeping it similar to rapid response.
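One small step that doesn't depend on the rest of the design: name the 3-tuple so callers stop unpacking by position. A sketch, not intended as the final shape:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkStrategyBundle:
    # Named container for what _build_benchmark_strategy currently returns as a
    # bare 3-tuple; splitting the function body itself would be a separate step.
    strategy_cls: type["ScenarioStrategy"]
    defaults: dict[str, str]
    technique_specs: list["AttackTechniqueSpec"]
```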
TBD whether this should get an integration test in this PR.
Tests and Documentation
Added tests/unit/scenario/test_benchmark.py.