18 changes: 17 additions & 1 deletion BENCHMARKS.md
Original file line number Diff line number Diff line change
@@ -79,7 +79,23 @@ CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses

---

## 6. [Name of Next Benchmark]
## 6. MMLU (Massive Multitask Language Understanding) (Beta)

MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks.

> **Beta:** This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

> **Implemented:** A ready-to-use implementation is available via `DefaultMMLUBenchmark` with HuggingFace model support. Install with `pip install maseval[mmlu]`. See the [MMLU documentation](docs/benchmark/mmlu.md) for usage details.

### Source and License

- **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021)
- **DISCO Paper:** [DISCO: Diversifying Sample Condensation for Efficient Model Evaluation](https://arxiv.org/abs/2510.07959) (Rubinstein et al., ICLR 2026)
- **Dataset:** [arubique/flattened-MMLU](https://huggingface.co/datasets/arubique/flattened-MMLU)

---

## 7. [Name of Next Benchmark]

(Description for the next benchmark...)

12 changes: 10 additions & 2 deletions CHANGELOG.md
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Benchmarks**

- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)

- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)

@@ -35,21 +35,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
**Examples**

- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
- MMLU benchmark documentation at `docs/benchmark/mmlu.md` with installation, quick start, and API reference. (PR: #34)
- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

**Core**

- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34)
- Added `get_with_assert()` utility in `maseval.core.exceptions` for strict dictionary access that raises `KeyError` instead of silently returning a default. Supports nested key lookups. (PR: #34)
- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #34)
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFacePipelineModelAdapter` pass seeds to underlying APIs (PR: #24)
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
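The strict-access behaviour described for `get_with_assert()` can be sketched in plain Python. The signature and error message below are assumptions for illustration; the actual maseval helper may differ:

```python
# Sketch of the strict-access semantics described for get_with_assert():
# unlike dict.get(), a missing key raises KeyError instead of silently
# returning a default. Illustrative only, not the maseval implementation.

def get_with_assert(data, *keys):
    """Walk nested dicts, raising KeyError with the full path on a miss."""
    current = data
    for i, key in enumerate(keys):
        if not isinstance(current, dict) or key not in current:
            path = " -> ".join(map(str, keys[: i + 1]))
            raise KeyError(f"Missing required key: {path}")
        current = current[key]
    return current

task = {"gold": "B", "meta": {"model_id": "my-model"}}
print(get_with_assert(task, "meta", "model_id"))  # my-model
```

The point of the strict variant is that a typo in a required field (e.g. `gold` or `model_id`) fails loudly at lookup time rather than propagating a `None` into evaluation.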

**Interface**

- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34)
- Renamed `HuggingFaceModelAdapter` → `HuggingFacePipelineModelAdapter` to distinguish it from the new scorer. The old name remains as a backwards-compatible alias. (PR: #34)

- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
@@ -86,6 +93,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Benchmarks**

- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). `DefaultMMLUBenchmark` now delegates log-likelihood computation to `HuggingFaceModelScorer` and uses a scorer-backed adapter instead of the MMLU-specific `MMLUModelAgent`/`MMLUAgentAdapter` (removed). (PR: #34)
- `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
- `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
- `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`
7 changes: 7 additions & 0 deletions README.md
@@ -109,6 +109,13 @@ pip install "maseval[langgraph]"
pip install "maseval[llamaindex]"
```

Or install benchmark-specific dependencies:

```bash
# MMLU (HuggingFace models)
pip install "maseval[mmlu]"
```

## Example

Examples are available in the [Documentation](https://maseval.readthedocs.io/en/stable/).
144 changes: 144 additions & 0 deletions docs/benchmark/mmlu.md
@@ -0,0 +1,144 @@
# MMLU: Massive Multitask Language Understanding (Beta)

!!! warning "Beta"
This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2510.07959) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks.

## Overview

[MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features:

- **Log-likelihood MCQ evaluation** matching [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) methodology
- **Anchor-point task selection** via `DISCOQueue` for DISCO-style subset evaluation
- **HuggingFace integration** with batched log-probability computation
- **lm-eval compatibility** mode for exact numerical reproduction

Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information, including licenses.

## Installation

Install MMLU with all dependencies needed to run the HuggingFace benchmark and example script:

```bash
pip install "maseval[mmlu]"
```

Or with uv:

```bash
uv sync --extra mmlu
```

This installs `transformers`, `torch`, `numpy`, and `huggingface_hub` (the latter two via `transformers`). You can then run the example:

```bash
python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full
```

For DISCO prediction support:

```bash
pip install "maseval[disco]"
```

For exact lm-evaluation-harness reproduction:

```bash
pip install "maseval[lm-eval]"
```

## Quick Start

```python
from maseval.benchmark.mmlu import (
DefaultMMLUBenchmark,
load_tasks,
compute_benchmark_metrics,
)

# Load tasks (from a local JSON file or a HuggingFace dataset repo)
tasks = load_tasks(data_path="/path/to/mmlu_prompts_examples.json")

# Create benchmark with HuggingFace model
benchmark = DefaultMMLUBenchmark(
model_id="meta-llama/Llama-2-7b-hf",
device="cuda:0",
)

# Run evaluation
results = benchmark.run(
tasks=tasks,
agent_data={"model_id": "meta-llama/Llama-2-7b-hf"},
)

# Compute metrics
metrics = compute_benchmark_metrics(results)
print(f"Accuracy: {metrics['acc']:.4f}")
```

### With Anchor Points (DISCO)

```python
from maseval.benchmark.mmlu import load_tasks

# Load tasks filtered to anchor points
tasks = load_tasks(
data_path="/path/to/mmlu_prompts_examples.json",
anchor_points_path="/path/to/anchor_points.json",
)

# tasks is a DISCOQueue — only anchor tasks are evaluated
print(f"Evaluating {len(tasks)} anchor tasks")
```

## Custom Benchmark Subclass

`MMLUBenchmark` is a framework-agnostic base class. To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`:

```python
from maseval import AgentAdapter
from maseval.core.history import MessageHistory
from maseval.benchmark.mmlu import MMLUBenchmark

class MyAgentAdapter(AgentAdapter):
def __init__(self, model, name):
super().__init__(model, name)
self._messages = []

def _run_agent(self, query):
self._messages.append({"role": "user", "content": query})
response = self.agent.generate(query)
self._messages.append({"role": "assistant", "content": response})
return response

def get_messages(self):
return MessageHistory(self._messages)

class MyMMLUBenchmark(MMLUBenchmark):
def setup_agents(self, agent_data, environment, task, user, seed_generator):
model = self.get_model_adapter(agent_data["model_id"])
adapter = MyAgentAdapter(model, name="mmlu_agent")
return [adapter], {"mmlu_agent": adapter}

    def get_model_adapter(self, model_id, **kwargs):
        # MyModelAdapter stands in for your own ModelAdapter subclass
        adapter = MyModelAdapter(model_id)
register_name = kwargs.get("register_name")
if register_name:
self.register("models", register_name, adapter)
return adapter
```

## API Reference

::: maseval.benchmark.mmlu.MMLUBenchmark

::: maseval.benchmark.mmlu.DefaultMMLUBenchmark

::: maseval.benchmark.mmlu.MMLUEnvironment

::: maseval.benchmark.mmlu.MMLUEvaluator

::: maseval.benchmark.mmlu.load_tasks

::: maseval.benchmark.mmlu.compute_benchmark_metrics
17 changes: 14 additions & 3 deletions docs/interface/inference/huggingface.md
@@ -1,7 +1,18 @@
# HuggingFace Inference Adapter
# HuggingFace Inference Adapters

This page documents the HuggingFace model adapter for MASEval.
This page documents the HuggingFace model adapters for MASEval.

## Pipeline Model Adapter (Text Generation)

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface.py){ .md-source-file }

::: maseval.interface.inference.huggingface.HuggingFaceModelAdapter
::: maseval.interface.inference.huggingface.HuggingFacePipelineModelAdapter

!!! note
`HuggingFaceModelAdapter` is a backwards-compatible alias for `HuggingFacePipelineModelAdapter`.

## Model Scorer (Log-Likelihood)

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface_scorer.py){ .md-source-file }

::: maseval.interface.inference.huggingface_scorer.HuggingFaceModelScorer
10 changes: 4 additions & 6 deletions docs/reference/environment.md
@@ -8,10 +8,8 @@ Environments define the execution context for agents, including available tools,

## Tools and agent-provided helpers

Some agent adapters expose helper tools or user-simulation tools that can be used by the Environment. For example:
Some agent adapters expose helper tools or user-simulation tools that can be used by the Environment. See the framework-specific interface pages for details:

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/smolagents.py){ .md-source-file }

::: maseval.interface.agents.smolagents.SmolAgentAdapter

::: maseval.interface.agents.smolagents.SmolAgentLLMUser
- [SmolAgents](../interface/agents/smolagents.md) — `SmolAgentAdapter`, `SmolAgentLLMUser`
- [LangGraph](../interface/agents/langgraph.md) — `LangGraphAgentAdapter`
- [LlamaIndex](../interface/agents/llamaindex.md) — `LlamaIndexAgentAdapter`
4 changes: 4 additions & 0 deletions docs/reference/exceptions.md
@@ -38,6 +38,10 @@ SimulatorError (base for simulators)

::: maseval.core.simulator.UserSimulatorError

## Data Access Helpers

::: maseval.core.exceptions.get_with_assert

## Validation Helpers

These functions simplify input validation and raise `AgentError` with helpful suggestions:
2 changes: 1 addition & 1 deletion docs/reference/model.md
@@ -20,7 +20,7 @@ The following adapter classes implement the ModelAdapter interface for specific

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface.py){ .md-source-file }

::: maseval.interface.inference.huggingface.HuggingFaceModelAdapter
::: maseval.interface.inference.huggingface.HuggingFacePipelineModelAdapter

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/google_genai.py){ .md-source-file }

19 changes: 19 additions & 0 deletions docs/reference/scorer.md
@@ -0,0 +1,19 @@
# Model Scorers

Model Scorers provide a uniform interface for log-likelihood computation across model providers. Unlike `ModelAdapter` (which handles text generation and chat), scorers evaluate how likely a model considers a given continuation given some context.

!!! note

`ModelScorer` is the scoring counterpart to `ModelAdapter`. Use it when you need log-likelihood evaluation (e.g., multiple-choice benchmarks) rather than text generation.
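As a rough sketch of how a scorer-style interface supports multiple-choice evaluation, the toy classes below follow the method names listed in the changelog (`loglikelihood`, `loglikelihood_choices`); they are illustrative assumptions, not the actual `ModelScorer` or HuggingFace implementation:

```python
from abc import ABC, abstractmethod

class ModelScorer(ABC):
    """Minimal sketch of a log-likelihood scoring interface."""

    @abstractmethod
    def loglikelihood(self, context: str, continuation: str) -> float:
        """Return log P(continuation | context) under the model."""

    def loglikelihood_choices(self, context: str, choices: list[str]) -> int:
        # MCQ evaluation: pick the choice the model finds most likely.
        scores = [self.loglikelihood(context, c) for c in choices]
        return max(range(len(scores)), key=scores.__getitem__)

class ToyScorer(ModelScorer):
    """Scores shorter continuations higher (for demonstration only)."""

    def loglikelihood(self, context: str, continuation: str) -> float:
        return -float(len(continuation))

scorer = ToyScorer()
idx = scorer.loglikelihood_choices("Q: 2+2=?", [" 22", " 4", " five"])
print(idx)  # 1: " 4" is the shortest, hence highest-scoring, continuation
```

A real scorer would compute the continuation's token log-probabilities under a language model; the selection logic in `loglikelihood_choices` stays the same.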

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/scorer.py){ .md-source-file }

::: maseval.core.scorer.ModelScorer

## Interfaces

The following scorer classes implement the ModelScorer interface for specific providers.

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface_scorer.py){ .md-source-file }

::: maseval.interface.inference.huggingface_scorer.HuggingFaceModelScorer
22 changes: 15 additions & 7 deletions docs/reference/task.md
@@ -2,34 +2,42 @@

Tasks define individual benchmark scenarios including inputs, expected outputs, and metadata for evaluation. Task queues control execution order and scheduling strategy.

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L55){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L56){ .md-source-file }

::: maseval.core.task.Task

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L27){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L28){ .md-source-file }

::: maseval.core.task.TaskProtocol

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L18){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L19){ .md-source-file }

::: maseval.core.task.TimeoutAction

## Task Queues

Task queues determine the order in which tasks are executed. Pass a queue to `Benchmark.run(queue=...)` to customize scheduling.
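The anchor-point idea behind the subset queues can be illustrated with a minimal queue that yields only the tasks at pre-selected indices. This is a simplified sketch under assumed semantics; the real `InformativeSubsetQueue`/`DISCOQueue` classes have a richer API:

```python
class SubsetQueue:
    """Yield only the tasks at pre-selected (anchor) indices, in order."""

    def __init__(self, tasks, anchor_indices):
        # Deduplicate and sort so each anchor task runs exactly once.
        self._tasks = [tasks[i] for i in sorted(set(anchor_indices))]

    def __len__(self):
        return len(self._tasks)

    def __iter__(self):
        return iter(self._tasks)

all_tasks = [f"task_{i}" for i in range(100)]
queue = SubsetQueue(all_tasks, anchor_indices=[3, 41, 7, 41])
print(len(queue))   # 3
print(list(queue))  # ['task_3', 'task_7', 'task_41']
```

Evaluating only such anchor tasks is what enables DISCO-style estimation of full-benchmark performance from a small subset.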

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L86){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L87){ .md-source-file }

::: maseval.core.task.BaseTaskQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L256){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L257){ .md-source-file }

::: maseval.core.task.SequentialTaskQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L276){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L277){ .md-source-file }

::: maseval.core.task.InformativeSubsetQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L325){ .md-source-file }

::: maseval.core.task.DISCOQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L413){ .md-source-file }

::: maseval.core.task.PriorityTaskQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L322){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L459){ .md-source-file }

::: maseval.core.task.AdaptiveTaskQueue