18 changes: 17 additions & 1 deletion BENCHMARKS.md
Original file line number Diff line number Diff line change
@@ -79,7 +79,23 @@ CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses

---

## 6. [Name of Next Benchmark]
## 6. MMLU (Massive Multitask Language Understanding) (Beta)

MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks.

> **Beta:** This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

> **Implemented:** A ready-to-use implementation is available via `DefaultMMLUBenchmark` with HuggingFace model support. Install with `pip install maseval[mmlu]`. See the [MMLU documentation](docs/benchmark/mmlu.md) for usage details.

### Source and License

- **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021)
- **DISCO Paper:** [DISCO: Diversifying Sample Condensation for Efficient Model Evaluation](https://arxiv.org/abs/2510.07959) (Rubinstein et al., ICLR 2026)
- **Dataset:** [arubique/flattened-MMLU](https://huggingface.co/datasets/arubique/flattened-MMLU)

---

## 7. [Name of Next Benchmark]

(Description for the next benchmark...)

12 changes: 10 additions & 2 deletions CHANGELOG.md
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Benchmarks**

- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)

- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)

@@ -35,21 +35,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
**Examples**

- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
- MMLU benchmark documentation at `docs/benchmark/mmlu.md` with installation, quick start, and API reference. (PR: #34)
- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

**Core**

- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34)
- Added `get_with_assert()` utility in `maseval.core.exceptions` for strict dictionary access that raises `KeyError` instead of silently returning a default. Supports nested key lookups. (PR: #34)
- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #34)
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFacePipelineModelAdapter` pass seeds to underlying APIs (PR: #24)
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
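The strict-access behaviour described for `get_with_assert()` can be sketched in plain Python. The signature and error message below are assumptions for illustration; the actual maseval helper may differ:

```python
# Sketch of the strict-access semantics described for get_with_assert():
# unlike dict.get(), a missing key raises KeyError instead of silently
# returning a default. Illustrative only, not the maseval implementation.

def get_with_assert(data, *keys):
    """Walk nested dicts, raising KeyError with the full path on a miss."""
    current = data
    for i, key in enumerate(keys):
        if not isinstance(current, dict) or key not in current:
            path = " -> ".join(map(str, keys[: i + 1]))
            raise KeyError(f"Missing required key: {path}")
        current = current[key]
    return current

task = {"gold": "B", "meta": {"model_id": "my-model"}}
print(get_with_assert(task, "meta", "model_id"))  # my-model
```

The point of the strict variant is that a typo in a required field (e.g. `gold` or `model_id`) fails loudly at lookup time rather than propagating a `None` into evaluation.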

**Interface**

- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34)
- Renamed `HuggingFaceModelAdapter` → `HuggingFacePipelineModelAdapter` to distinguish it from the new scorer. The old name remains as a backwards-compatible alias. (PR: #34)

- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
@@ -86,6 +93,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Benchmarks**

- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). `DefaultMMLUBenchmark` now delegates log-likelihood computation to `HuggingFaceModelScorer` and uses a scorer-backed adapter instead of the MMLU-specific `MMLUModelAgent`/`MMLUAgentAdapter` (removed). (PR: #34)
- `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
- `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
- `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`
7 changes: 7 additions & 0 deletions README.md
@@ -109,6 +109,13 @@ pip install "maseval[langgraph]"
pip install "maseval[llamaindex]"
```

Or install benchmark-specific dependencies:

```bash
# MMLU (HuggingFace models)
pip install "maseval[mmlu]"
```

## Example

Examples are available in the [Documentation](https://maseval.readthedocs.io/en/stable/).
144 changes: 144 additions & 0 deletions docs/benchmark/mmlu.md
@@ -0,0 +1,144 @@
# MMLU: Massive Multitask Language Understanding (Beta)

!!! warning "Beta"
This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2510.07959) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks.

## Overview

[MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features:

- **Log-likelihood MCQ evaluation** matching [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) methodology
- **Anchor-point task selection** via `DISCOQueue` for DISCO-style subset evaluation
- **HuggingFace integration** with batched log-probability computation
- **lm-eval compatibility** mode for exact numerical reproduction

Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information, including licenses.

## Installation

Install MMLU with all dependencies needed to run the HuggingFace benchmark and example script:

```bash
pip install "maseval[mmlu]"
```

Or with uv:

```bash
uv sync --extra mmlu
```

This installs `transformers`, `torch`, `numpy`, and `huggingface_hub` (the latter two via `transformers`). You can then run the example:

```bash
python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full
```

For DISCO prediction support:

```bash
pip install "maseval[disco]"
```

For exact lm-evaluation-harness reproduction:

```bash
pip install "maseval[lm-eval]"
```

## Quick Start

```python
from maseval.benchmark.mmlu import (
DefaultMMLUBenchmark,
load_tasks,
compute_benchmark_metrics,
)

# Load tasks (from a local JSON file or a HuggingFace dataset repo)
tasks = load_tasks(data_path="/path/to/mmlu_prompts_examples.json")

# Create benchmark with HuggingFace model
benchmark = DefaultMMLUBenchmark(
model_id="meta-llama/Llama-2-7b-hf",
device="cuda:0",
)

# Run evaluation
results = benchmark.run(
tasks=tasks,
agent_data={"model_id": "meta-llama/Llama-2-7b-hf"},
)

# Compute metrics
metrics = compute_benchmark_metrics(results)
print(f"Accuracy: {metrics['acc']:.4f}")
```

### With Anchor Points (DISCO)

```python
from maseval.benchmark.mmlu import load_tasks

# Load tasks filtered to anchor points
tasks = load_tasks(
data_path="/path/to/mmlu_prompts_examples.json",
anchor_points_path="/path/to/anchor_points.json",
)

# tasks is a DISCOQueue — only anchor tasks are evaluated
print(f"Evaluating {len(tasks)} anchor tasks")
```

## Custom Benchmark Subclass

`MMLUBenchmark` is a framework-agnostic base class. To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`:

```python
from maseval import AgentAdapter
from maseval.core.history import MessageHistory
from maseval.benchmark.mmlu import MMLUBenchmark

class MyAgentAdapter(AgentAdapter):
def __init__(self, model, name):
super().__init__(model, name)
self._messages = []

def _run_agent(self, query):
self._messages.append({"role": "user", "content": query})
response = self.agent.generate(query)
self._messages.append({"role": "assistant", "content": response})
return response

def get_messages(self):
return MessageHistory(self._messages)

class MyMMLUBenchmark(MMLUBenchmark):
def setup_agents(self, agent_data, environment, task, user, seed_generator):
model = self.get_model_adapter(agent_data["model_id"])
adapter = MyAgentAdapter(model, name="mmlu_agent")
return [adapter], {"mmlu_agent": adapter}

    def get_model_adapter(self, model_id, **kwargs):
        # MyModelAdapter stands in for your own ModelAdapter subclass
        adapter = MyModelAdapter(model_id)
register_name = kwargs.get("register_name")
if register_name:
self.register("models", register_name, adapter)
return adapter
```

## API Reference

::: maseval.benchmark.mmlu.MMLUBenchmark

::: maseval.benchmark.mmlu.DefaultMMLUBenchmark

::: maseval.benchmark.mmlu.MMLUEnvironment

::: maseval.benchmark.mmlu.MMLUEvaluator

::: maseval.benchmark.mmlu.load_tasks

::: maseval.benchmark.mmlu.compute_benchmark_metrics
17 changes: 14 additions & 3 deletions docs/interface/inference/huggingface.md
@@ -1,7 +1,18 @@
# HuggingFace Inference Adapter
# HuggingFace Inference Adapters

This page documents the HuggingFace model adapter for MASEval.
This page documents the HuggingFace model adapters for MASEval.

## Pipeline Model Adapter (Text Generation)

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface.py){ .md-source-file }

::: maseval.interface.inference.huggingface.HuggingFaceModelAdapter
::: maseval.interface.inference.huggingface.HuggingFacePipelineModelAdapter

!!! note
`HuggingFaceModelAdapter` is a backwards-compatible alias for `HuggingFacePipelineModelAdapter`.

## Model Scorer (Log-Likelihood)

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface_scorer.py){ .md-source-file }

::: maseval.interface.inference.huggingface_scorer.HuggingFaceModelScorer
10 changes: 4 additions & 6 deletions docs/reference/environment.md
@@ -8,10 +8,8 @@ Environments define the execution context for agents, including available tools,

## Tools and agent-provided helpers

Some agent adapters expose helper tools or user-simulation tools that can be used by the Environment. For example:
Some agent adapters expose helper tools or user-simulation tools that can be used by the Environment. See the framework-specific interface pages for details:

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/smolagents.py){ .md-source-file }

::: maseval.interface.agents.smolagents.SmolAgentAdapter

::: maseval.interface.agents.smolagents.SmolAgentLLMUser
- [SmolAgents](../interface/agents/smolagents.md) — `SmolAgentAdapter`, `SmolAgentLLMUser`
- [LangGraph](../interface/agents/langgraph.md) — `LangGraphAgentAdapter`
- [LlamaIndex](../interface/agents/llamaindex.md) — `LlamaIndexAgentAdapter`
4 changes: 4 additions & 0 deletions docs/reference/exceptions.md
@@ -38,6 +38,10 @@ SimulatorError (base for simulators)

::: maseval.core.simulator.UserSimulatorError

## Data Access Helpers

::: maseval.core.exceptions.get_with_assert

## Validation Helpers

These functions simplify input validation and raise `AgentError` with helpful suggestions:
2 changes: 1 addition & 1 deletion docs/reference/model.md
@@ -20,7 +20,7 @@ The following adapter classes implement the ModelAdapter interface for specific

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface.py){ .md-source-file }

::: maseval.interface.inference.huggingface.HuggingFaceModelAdapter
::: maseval.interface.inference.huggingface.HuggingFacePipelineModelAdapter

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/google_genai.py){ .md-source-file }

19 changes: 19 additions & 0 deletions docs/reference/scorer.md
@@ -0,0 +1,19 @@
# Model Scorers

Model Scorers provide a uniform interface for log-likelihood computation across model providers. Unlike `ModelAdapter` (which handles text generation and chat), scorers evaluate how likely a model considers a given continuation given some context.

!!! note

`ModelScorer` is the scoring counterpart to `ModelAdapter`. Use it when you need log-likelihood evaluation (e.g., multiple-choice benchmarks) rather than text generation.
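As a rough sketch of how a scorer-style interface supports multiple-choice evaluation, the toy classes below follow the method names listed in the changelog (`loglikelihood`, `loglikelihood_choices`); they are illustrative assumptions, not the actual `ModelScorer` or HuggingFace implementation:

```python
from abc import ABC, abstractmethod

class ModelScorer(ABC):
    """Minimal sketch of a log-likelihood scoring interface."""

    @abstractmethod
    def loglikelihood(self, context: str, continuation: str) -> float:
        """Return log P(continuation | context) under the model."""

    def loglikelihood_choices(self, context: str, choices: list[str]) -> int:
        # MCQ evaluation: pick the choice the model finds most likely.
        scores = [self.loglikelihood(context, c) for c in choices]
        return max(range(len(scores)), key=scores.__getitem__)

class ToyScorer(ModelScorer):
    """Scores shorter continuations higher (for demonstration only)."""

    def loglikelihood(self, context: str, continuation: str) -> float:
        return -float(len(continuation))

scorer = ToyScorer()
idx = scorer.loglikelihood_choices("Q: 2+2=?", [" 22", " 4", " five"])
print(idx)  # 1: " 4" is the shortest, hence highest-scoring, continuation
```

A real scorer would compute the continuation's token log-probabilities under a language model; the selection logic in `loglikelihood_choices` stays the same.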

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/scorer.py){ .md-source-file }

::: maseval.core.scorer.ModelScorer

## Interfaces

The following scorer classes implement the ModelScorer interface for specific providers.

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface_scorer.py){ .md-source-file }

::: maseval.interface.inference.huggingface_scorer.HuggingFaceModelScorer
22 changes: 15 additions & 7 deletions docs/reference/task.md
@@ -2,34 +2,42 @@

Tasks define individual benchmark scenarios including inputs, expected outputs, and metadata for evaluation. Task queues control execution order and scheduling strategy.

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L55){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L56){ .md-source-file }

::: maseval.core.task.Task

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L27){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L28){ .md-source-file }

::: maseval.core.task.TaskProtocol

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L18){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L19){ .md-source-file }

::: maseval.core.task.TimeoutAction

## Task Queues

Task queues determine the order in which tasks are executed. Pass a queue to `Benchmark.run(queue=...)` to customize scheduling.
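The anchor-point idea behind the subset queues can be illustrated with a minimal queue that yields only the tasks at pre-selected indices. This is a simplified sketch under assumed semantics; the real `InformativeSubsetQueue`/`DISCOQueue` classes have a richer API:

```python
class SubsetQueue:
    """Yield only the tasks at pre-selected (anchor) indices, in order."""

    def __init__(self, tasks, anchor_indices):
        # Deduplicate and sort so each anchor task runs exactly once.
        self._tasks = [tasks[i] for i in sorted(set(anchor_indices))]

    def __len__(self):
        return len(self._tasks)

    def __iter__(self):
        return iter(self._tasks)

all_tasks = [f"task_{i}" for i in range(100)]
queue = SubsetQueue(all_tasks, anchor_indices=[3, 41, 7, 41])
print(len(queue))   # 3
print(list(queue))  # ['task_3', 'task_7', 'task_41']
```

Evaluating only such anchor tasks is what enables DISCO-style estimation of full-benchmark performance from a small subset.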

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L86){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L87){ .md-source-file }

::: maseval.core.task.BaseTaskQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L256){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L257){ .md-source-file }

::: maseval.core.task.SequentialTaskQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L276){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L277){ .md-source-file }

::: maseval.core.task.InformativeSubsetQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L325){ .md-source-file }

::: maseval.core.task.DISCOQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L413){ .md-source-file }

::: maseval.core.task.PriorityTaskQueue

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L322){ .md-source-file }
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L459){ .md-source-file }

::: maseval.core.task.AdaptiveTaskQueue