From 304e54c0b4ae02f677b970a64c76003e31b67759 Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Sun, 8 Mar 2026 12:40:45 +0100 Subject: [PATCH 01/17] [Move DISCO queue to core]: - AnchorPointsTaskQueue moved to core (maseval/core/task.py) - MMLUBenchmark no longer implements agents - Remove silent .get() fallbacks for required fields - Add mmlu = [] extra to pyproject.toml - Add MMLU entry to BENCHMARKS.md - Update documentation with MMLU - Update CHANGELOG.md --- BENCHMARKS.md | 16 +- CHANGELOG.md | 5 +- docs/benchmark/mmlu.md | 127 ++++++++++++++ maseval/__init__.py | 2 + maseval/benchmark/mmlu/__init__.py | 17 +- maseval/benchmark/mmlu/mmlu.py | 267 +++++++++-------------------- maseval/core/task.py | 45 +++++ mkdocs.yml | 3 +- pyproject.toml | 1 + 9 files changed, 280 insertions(+), 203 deletions(-) create mode 100644 docs/benchmark/mmlu.md diff --git a/BENCHMARKS.md b/BENCHMARKS.md index fcbde7d3..0916ef69 100644 --- a/BENCHMARKS.md +++ b/BENCHMARKS.md @@ -79,7 +79,21 @@ CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses --- -## 6. [Name of Next Benchmark] +## 6. MMLU (Massive Multitask Language Understanding) (Beta) + +MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks. + +> **Beta:** This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome! 
+ +### Source and License + +- **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) +- **DISCO Paper:** [DISCO: DISCOvering key features for accurate prediction of LLM abilities on benchmarks](https://arxiv.org/abs/2407.12890) (Rubinstein et al., 2025) +- **Dataset:** [arubique/flattened-MMLU](https://huggingface.co/datasets/arubique/flattened-MMLU) + +--- + +## 7. [Name of Next Benchmark] (Description for the next benchmark...) diff --git a/CHANGELOG.md b/CHANGELOG.md index c3f11572..c1427ccd 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** -- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34) +- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). 
(PR: #34) - CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28) @@ -35,11 +35,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Examples** - MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34) +- MMLU benchmark documentation at `docs/benchmark/mmlu.md` with installation, quick start, and API reference. (PR: #34) - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28) - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26) **Core** +- Added `AnchorPointsTaskQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import AnchorPointsTaskQueue`. 
(PR: #34) - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24) - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24) - Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24) @@ -86,6 +88,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** +- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `HuggingFaceMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `AnchorPointsTaskQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34) - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26) - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge` - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr` diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md new file mode 100644 index 00000000..623cf63d --- /dev/null +++ b/docs/benchmark/mmlu.md @@ -0,0 +1,127 @@ +# MMLU: Massive Multitask Language Understanding (Beta) + +!!! warning "Beta" + This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. 
Contributions and compute donations welcome! + +The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2407.12890) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks. + +## Overview + +[MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features: + +- **Log-likelihood MCQ evaluation** matching lm-evaluation-harness methodology +- **Anchor-point task selection** via `AnchorPointsTaskQueue` for DISCO-style subset evaluation +- **HuggingFace integration** with batched log-probability computation +- **lm-eval compatibility** mode for exact numerical reproduction + +Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses. 
+ +## Installation + +MMLU has an optional dependency extra (currently empty, as core MMLU requires no additional packages): + +```bash +pip install maseval[mmlu] +``` + +For the HuggingFace implementation, also install transformers: + +```bash +pip install maseval[mmlu,transformers] +``` + +For DISCO prediction support: + +```bash +pip install maseval[disco] +``` + +For exact lm-evaluation-harness reproduction: + +```bash +pip install maseval[lm-eval] +``` + +## Quick Start + +```python +from maseval.benchmark.mmlu import ( + HuggingFaceMMLUBenchmark, + load_tasks, + compute_benchmark_metrics, +) + +# Load tasks (downloads from HuggingFace automatically) +tasks = load_tasks(data_path="/path/to/mmlu_prompts_examples.json") + +# Create benchmark with HuggingFace model +benchmark = HuggingFaceMMLUBenchmark( + model_id="meta-llama/Llama-2-7b-hf", + device="cuda:0", +) + +# Run evaluation +results = benchmark.run( + tasks=tasks, + agent_data={"model_id": "meta-llama/Llama-2-7b-hf"}, +) + +# Compute metrics +metrics = compute_benchmark_metrics(results) +print(f"Accuracy: {metrics['acc']:.4f}") +``` + +### With Anchor Points (DISCO) + +```python +from maseval.benchmark.mmlu import load_tasks + +# Load tasks filtered to anchor points +tasks = load_tasks( + data_path="/path/to/mmlu_prompts_examples.json", + anchor_points_path="/path/to/anchor_points.json", +) + +# tasks is an AnchorPointsTaskQueue — only anchor tasks are evaluated +print(f"Evaluating {len(tasks)} anchor tasks") +``` + +## Custom Benchmark Subclass + +`MMLUBenchmark` is a framework-agnostic base class. 
To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`: + +```python +from maseval.benchmark.mmlu import MMLUBenchmark, MMLUModelAgent, MMLUAgentAdapter + +class MyMMLUBenchmark(MMLUBenchmark): + def setup_agents(self, agent_data, environment, task, user, seed_generator): + model = self.get_model_adapter(agent_data["model_id"]) + agent = MMLUModelAgent(model, name="mmlu_agent") + adapter = MMLUAgentAdapter(agent, "mmlu_agent") + return [adapter], {"mmlu_agent": adapter} + + def get_model_adapter(self, model_id, **kwargs): + adapter = MyModelAdapter(model_id) + register_name = kwargs.get("register_name") + if register_name: + self.register("models", register_name, adapter) + return adapter +``` + +## API Reference + +::: maseval.benchmark.mmlu.MMLUBenchmark + +::: maseval.benchmark.mmlu.HuggingFaceMMLUBenchmark + +::: maseval.benchmark.mmlu.MMLUEnvironment + +::: maseval.benchmark.mmlu.MMLUEvaluator + +::: maseval.benchmark.mmlu.MMLUModelAgent + +::: maseval.benchmark.mmlu.MMLUAgentAdapter + +::: maseval.benchmark.mmlu.load_tasks + +::: maseval.benchmark.mmlu.compute_benchmark_metrics diff --git a/maseval/__init__.py b/maseval/__init__.py index 90d52cfa..387a3345 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -16,6 +16,7 @@ BaseTaskQueue, TaskQueue, SequentialTaskQueue, + AnchorPointsTaskQueue, PriorityTaskQueue, AdaptiveTaskQueue, ) @@ -93,6 +94,7 @@ "BaseTaskQueue", "TaskQueue", "SequentialTaskQueue", + "AnchorPointsTaskQueue", "PriorityTaskQueue", "AdaptiveTaskQueue", # Model adapters diff --git a/maseval/benchmark/mmlu/__init__.py b/maseval/benchmark/mmlu/__init__.py index 19e8fd32..ac5ac154 100644 --- a/maseval/benchmark/mmlu/__init__.py +++ b/maseval/benchmark/mmlu/__init__.py @@ -4,12 +4,10 @@ Usage: from maseval.benchmark.mmlu import ( - MMLUBenchmark, - MMLUEnvironment, - MMLUEvaluator, + HuggingFaceMMLUBenchmark, load_tasks, - AnchorPointsTaskQueue, ) + from maseval import AnchorPointsTaskQueue 
# Load tasks and anchor points tasks = load_tasks( @@ -17,18 +15,19 @@ anchor_points_path="path/to/anchor_points.pkl", # Optional ) - # Create benchmark - benchmark = MMLUBenchmark() - results = benchmark.run(tasks=tasks, agent_data={"model_id": "gpt-4"}) + # Run benchmark + benchmark = HuggingFaceMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf") + results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"}) """ +from maseval import AnchorPointsTaskQueue + from .mmlu import ( DEFAULT_AGENT_NAME, DEFAULT_BATCH_SIZE, DEFAULT_CHOICES, DEFAULT_DEVICE, DEFAULT_MODEL_REGISTER_NAME, - FALLBACK_MODEL_ID, MMLU_TASK_NAME, STATUS_SUCCESS, TARGET_DELIMITER, @@ -39,7 +38,6 @@ MMLUEvaluator, MMLUModelAgent, MMLUAgentAdapter, - AnchorPointsTaskQueue, load_tasks, compute_benchmark_metrics, ) @@ -50,7 +48,6 @@ "DEFAULT_CHOICES", "DEFAULT_DEVICE", "DEFAULT_MODEL_REGISTER_NAME", - "FALLBACK_MODEL_ID", "MMLU_TASK_NAME", "STATUS_SUCCESS", "TARGET_DELIMITER", diff --git a/maseval/benchmark/mmlu/mmlu.py b/maseval/benchmark/mmlu/mmlu.py index 6506402c..0b6de68a 100644 --- a/maseval/benchmark/mmlu/mmlu.py +++ b/maseval/benchmark/mmlu/mmlu.py @@ -8,32 +8,25 @@ Usage: from maseval.benchmark.mmlu import ( - MMLUBenchmark, load_tasks, AnchorPointsTaskQueue + HuggingFaceMMLUBenchmark, load_tasks, ) + from maseval import AnchorPointsTaskQueue - # Load tasks filtered to anchor points + # Load tasks (optionally filtered to anchor points) tasks = load_tasks( data_path="/path/to/mmlu_prompts_examples.json", anchor_points_path="/path/to/anchor_points.pkl", ) - # Create benchmark with HuggingFace model - class MyMMLUBenchmark(MMLUBenchmark): - def get_model_adapter(self, model_id, **kwargs): - from transformers import pipeline - from maseval.interface.inference import HuggingFaceModelAdapter - pipe = pipeline("text-generation", model=model_id) - return HuggingFaceModelAdapter(model=pipe, model_id=model_id) - - benchmark = MyMMLUBenchmark() - results = 
benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b"}) + # Run with the HuggingFace concrete implementation + benchmark = HuggingFaceMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf") + results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"}) """ import json import pickle -from abc import abstractmethod from pathlib import Path -from typing import Any, Dict, Iterator, List, Optional, Sequence, Tuple, Union, cast +from typing import Any, Dict, List, Optional, Sequence, Tuple, Union, cast # numpy is optional - only needed for anchor points processing try: @@ -46,6 +39,7 @@ def get_model_adapter(self, model_id, **kwargs): from maseval import ( AgentAdapter, + AnchorPointsTaskQueue, Benchmark, Environment, Evaluator, @@ -55,7 +49,7 @@ def get_model_adapter(self, model_id, **kwargs): User, SeedGenerator, ) -from maseval.core.task import AdaptiveTaskQueue, SequentialTaskQueue +from maseval.core.task import SequentialTaskQueue from maseval.core.tracing import TraceableMixin from maseval.core.config import ConfigurableMixin @@ -72,109 +66,9 @@ def get_model_adapter(self, model_id, **kwargs): TARGET_DELIMITER = " " # lm-eval convention for MCQ MMLU_TASK_NAME = "mmlu_prompts" TASK_TYPE_MMLU = "mmlu" -FALLBACK_MODEL_ID = "unknown" STATUS_SUCCESS = "success" -# ============================================================================= -# Task Queue -# ============================================================================= - - -class AnchorPointsTaskQueue(AdaptiveTaskQueue): - """Task queue that iterates through tasks in anchor points order. - - This queue is used for DISCO-based evaluation where we only evaluate - on a subset of anchor tasks and predict performance on the full dataset. - - The queue iterates through tasks in the order specified by anchor_points, - and stops when all anchor tasks have been processed. 
- """ - - def __init__(self, tasks: List[Task], anchor_points: Optional[List[int]] = None): - """Initialize anchor points task queue. - - Args: - tasks: Full list of tasks (ordered by doc_id). - anchor_points: Optional list of task indices (doc_ids) to evaluate. - If None, evaluates all tasks in order. - """ - # If anchor_points provided, filter tasks to only include anchor tasks - # This dramatically improves performance by avoiding O(n²) iteration - if anchor_points is not None: - # Build index mapping for quick lookup - task_by_doc_id: Dict[int, Task] = {} - for i, task in enumerate(tasks): - doc_id = task.metadata.get("doc_id", i) - task_by_doc_id[doc_id] = task - - # Filter to only anchor tasks, preserving anchor order - anchor_tasks = [] - for doc_id in anchor_points: - task = task_by_doc_id.get(doc_id) - if task is not None: - anchor_tasks.append(task) - - # Store original for reference - self._all_tasks = tasks - self._task_by_doc_id = task_by_doc_id - tasks = anchor_tasks - - super().__init__(tasks) - self._anchor_points = anchor_points - self._anchor_idx = 0 - - # Initialize state immediately (since __iter__ is overridden and skips initial_state()) - self._state = self.initial_state() - - def __iter__(self) -> Iterator[Task]: - """Yield tasks in anchor point order. - - Since tasks are pre-filtered during __init__, we simply iterate - over the stored tasks in order. This avoids the infinite loop - issue in AdaptiveTaskQueue.__iter__ which relies on on_task_repeat_end - to remove tasks from _remaining. - """ - return iter(self._tasks) - - def initial_state(self) -> Dict[str, Any]: - """Initialize state for anchor point iteration.""" - return { - "anchor_idx": 0, - "completed_anchors": [], - } - - def select_next_task(self, remaining: Sequence[Task], state: Dict[str, Any]) -> Optional[Task]: - """Select the next anchor task to execute. - - Args: - remaining: Tasks not yet executed. - state: Current state with anchor_idx. 
- - Returns: - Next anchor task, or None if all anchors processed. - """ - # Simply return the first remaining task since we pre-filtered to anchor tasks only - return remaining[0] if remaining else None - - def update_state(self, task: Task, report: Dict[str, Any], state: Dict[str, Any]) -> Dict[str, Any]: - """Update state after task completion. - - Args: - task: Completed task. - report: Execution report. - state: Current state. - - Returns: - Updated state. - """ - doc_id = task.metadata.get("doc_id") - state["completed_anchors"].append(doc_id) - state["anchor_idx"] += 1 - - return state - - # ============================================================================= # Environment # ============================================================================= @@ -188,12 +82,18 @@ class MMLUEnvironment(Environment): """ def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]: - """Initialize state from task data.""" + """Initialize state from task data. + + Args: + task_data: Must contain ``"query"`` (str) and ``"environment_data"`` + (dict with optional ``"choices"``, ``"full_prompt"``, ``"use_full_prompt"``). + """ + env_data = task_data["environment_data"] return { - "query": task_data.get("query", ""), - "choices": task_data.get("environment_data", {}).get("choices", []), - "full_prompt": task_data.get("environment_data", {}).get("full_prompt", ""), - "use_full_prompt": task_data.get("environment_data", {}).get("use_full_prompt", False), + "query": task_data["query"], + "choices": env_data.get("choices", DEFAULT_CHOICES), + "full_prompt": env_data.get("full_prompt", ""), + "use_full_prompt": env_data.get("use_full_prompt", False), } def create_tools(self) -> Dict[str, Any]: @@ -203,11 +103,11 @@ def create_tools(self) -> Dict[str, Any]: def get_prompt(self) -> str: """Get the prompt to send to the model. - Returns full_prompt if use_full_prompt is True, otherwise query. 
+ Returns ``full_prompt`` if ``use_full_prompt`` is True, otherwise ``query``. """ - if self.state.get("use_full_prompt", False): - return self.state.get("full_prompt", self.state.get("query", "")) - return self.state.get("query", "") + if self.state["use_full_prompt"]: + return self.state["full_prompt"] + return self.state["query"] # ============================================================================= @@ -231,13 +131,14 @@ def __init__( """Initialize MMLU evaluator. Args: - task: Task being evaluated (contains gold answer). + task: Task being evaluated. Must have ``evaluation_data["gold"]`` (int) + with the correct answer index. environment: Environment (provides choices). user: Unused for MMLU. """ self.task = task self.environment = environment - self.gold = task.evaluation_data.get("gold", 0) + self.gold = task.evaluation_data["gold"] self.choices = task.environment_data.get("choices", DEFAULT_CHOICES) def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]: @@ -436,19 +337,12 @@ class MMLUBenchmark(Benchmark): Evaluates language models on MMLU multiple choice questions. Supports anchor point-based evaluation for DISCO prediction. - Users must subclass and implement: - - get_model_adapter() to provide model adapters + Subclasses must implement: - Usage: - class MyMMLUBenchmark(MMLUBenchmark): - def get_model_adapter(self, model_id, **kwargs): - from transformers import pipeline - from maseval.interface.inference import HuggingFaceModelAdapter - pipe = pipeline("text-generation", model=model_id) - return HuggingFaceModelAdapter(model=pipe, model_id=model_id) + - ``setup_agents()`` - create agents for MCQ evaluation + - ``get_model_adapter()`` - provide model adapters - benchmark = MyMMLUBenchmark() - results = benchmark.run(tasks=tasks, agent_data={"model_id": "llama-7b"}) + For a ready-to-use implementation, see ``HuggingFaceMMLUBenchmark``. 
""" def __init__( @@ -480,7 +374,7 @@ def setup_environment( "query": task.query, "environment_data": { **task.environment_data, - "use_full_prompt": self.use_full_prompt or agent_data.get("use_full_prompt", False), + "use_full_prompt": self.use_full_prompt, }, } return MMLUEnvironment(task_data) @@ -495,33 +389,6 @@ def setup_user( """MMLU doesn't use a user simulator.""" return None - def setup_agents( - self, - agent_data: Dict[str, Any], - environment: Environment, - task: Task, - user: Optional[User], - seed_generator: SeedGenerator, - ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]: - """Create model agent for MCQ evaluation. - - Args: - agent_data: Agent config with model_id. - environment: MMLU environment. - task: Current task. - user: Unused. - - Returns: - Tuple of (agents_to_run, agents_dict). - """ - model_id = agent_data.get("model_id", FALLBACK_MODEL_ID) - model = self.get_model_adapter(model_id, register_name=DEFAULT_MODEL_REGISTER_NAME) - - agent = MMLUModelAgent(model, name=DEFAULT_AGENT_NAME) - adapter = MMLUAgentAdapter(agent, DEFAULT_AGENT_NAME) - - return [adapter], {DEFAULT_AGENT_NAME: adapter} - def setup_evaluators( self, environment: Environment, @@ -548,21 +415,6 @@ def run_agents( agent = agents[0] return agent.run(prompt) - @abstractmethod - def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter: - """Provide a ModelAdapter for the model. - - Must be implemented by subclass. - - Args: - model_id: Model identifier. - **kwargs: Additional arguments (e.g., register_name for tracing). - - Returns: - ModelAdapter instance. - """ - pass - def evaluate( self, evaluators: Sequence[Evaluator], @@ -598,7 +450,7 @@ def __init__( trust_remote_code: bool = True, use_full_prompt: bool = True, batch_size: int = DEFAULT_BATCH_SIZE, - **kwargs, + **kwargs: Any, ): """Initialize HuggingFace MMLU benchmark. 
@@ -618,6 +470,34 @@ def __init__( self._model = None self._tokenizer = None + def setup_agents( + self, + agent_data: Dict[str, Any], + environment: Environment, + task: Task, + user: Optional[User], + seed_generator: SeedGenerator, + ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]: + """Create model agent for MCQ evaluation. + + Args: + agent_data: Agent config. Must contain ``"model_id"`` (str). + environment: MMLU environment. + task: Current task. + user: Unused. + seed_generator: Seed generator (unused for MMLU). + + Returns: + Tuple of (agents_to_run, agents_dict). + """ + model_id = agent_data["model_id"] + model = self.get_model_adapter(model_id, register_name=DEFAULT_MODEL_REGISTER_NAME) + + agent = MMLUModelAgent(model, name=DEFAULT_AGENT_NAME) + adapter = MMLUAgentAdapter(agent, DEFAULT_AGENT_NAME) + + return [adapter], {DEFAULT_AGENT_NAME: adapter} + def _load_model(self): """Lazy load the model and tokenizer for log-likelihood computation.""" if self._model is None: @@ -795,7 +675,7 @@ def _compute_logprobs_batched(self, prompts: list, choices_list: list) -> list: return all_logprobs - def precompute_all_logprobs_lmeval(self, tasks) -> dict: + def precompute_all_logprobs_lmeval(self, tasks: Sequence[Task]) -> Dict[Any, List[float]]: """Precompute log-likelihoods for ALL tasks using lm-eval's batching. CRITICAL: lm-evaluation-harness batches ALL requests together and uses @@ -931,11 +811,11 @@ def _compute_logprobs_multi_token(self, prompt: str, choices: list) -> list: def run_agents( self, - agents, - task, - environment, + agents: Sequence[AgentAdapter], + task: Task, + environment: Environment, query: str = "", - ): + ) -> Any: """Execute log-likelihood based MCQ evaluation. 
Uses precomputed logprobs if available (for exact lm-eval match), @@ -1017,7 +897,7 @@ def run_agents( return answer - def get_model_adapter(self, model_id: str, **kwargs): + def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter: """Provide a HuggingFace ModelAdapter. Note: For logprobs-based evaluation, we don't actually use the adapter @@ -1028,7 +908,7 @@ def get_model_adapter(self, model_id: str, **kwargs): **kwargs: Additional arguments (e.g., register_name). Returns: - HuggingFaceModelAdapter instance. + ``HuggingFaceModelAdapter`` instance. """ from maseval.interface.inference import HuggingFaceModelAdapter @@ -1112,8 +992,15 @@ def load_tasks( # Convert to Tasks tasks = [] for i, item in enumerate(data): + query = item.get("query") or item.get("example") + if query is None: + raise ValueError(f"MMLU task at index {i} has neither 'query' nor 'example' field") + + if "gold" not in item: + raise ValueError(f"MMLU task at index {i} missing required 'gold' field (correct answer index)") + task = Task( - query=item.get("query", item.get("example", "")), + query=query, id=f"mmlu_{i}", environment_data={ "choices": item.get("choices", DEFAULT_CHOICES), @@ -1121,7 +1008,7 @@ def load_tasks( "example": item.get("example", ""), }, evaluation_data={ - "gold": item.get("gold", 0), + "gold": item["gold"], }, metadata={ "doc_id": i, diff --git a/maseval/core/task.py b/maseval/core/task.py index ed617943..081bba6b 100644 --- a/maseval/core/task.py +++ b/maseval/core/task.py @@ -273,6 +273,51 @@ def __iter__(self) -> Iterator[Task]: return iter(self._tasks) +class AnchorPointsTaskQueue(SequentialTaskQueue): + """Task queue that evaluates a specified subset of tasks in a given order. + + Used for anchor-point-based evaluation where performance on a full dataset + is predicted from results on a carefully selected subset. Anchor points are + integer indices into the original task list. 
Only tasks at those indices are + yielded, in the order specified by ``anchor_points``. + + When ``anchor_points`` is ``None``, all tasks are yielded in their original order + (equivalent to ``SequentialTaskQueue``). + + Attributes: + _all_tasks: The complete, unfiltered task list. + _anchor_points: The anchor-point indices, or ``None``. + + Example: + ```python + # Evaluate only tasks at indices 0, 5, 12 + queue = AnchorPointsTaskQueue(tasks, anchor_points=[0, 5, 12]) + + for task in queue: + result = execute(task) # Only 3 tasks + ``` + """ + + def __init__(self, tasks: Iterable[Task], anchor_points: Optional[List[int]] = None) -> None: + """Initialize anchor-points task queue. + + Args: + tasks: Full list of tasks (ordered by index). + anchor_points: Indices into ``tasks`` selecting which tasks to evaluate + and in what order. If ``None``, evaluates all tasks in order. + """ + all_tasks = list(tasks) + self._all_tasks: List[Task] = all_tasks + self._anchor_points: Optional[List[int]] = anchor_points + + if anchor_points is not None: + task_by_index: Dict[int, Task] = {i: task for i, task in enumerate(all_tasks)} + filtered = [task_by_index[idx] for idx in anchor_points if idx in task_by_index] + super().__init__(filtered) + else: + super().__init__(all_tasks) + + class PriorityTaskQueue(BaseTaskQueue): """Execute tasks ordered by priority. 
diff --git a/mkdocs.yml b/mkdocs.yml index 4b489f50..153215e9 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -129,7 +129,8 @@ nav: - OpenAI: interface/inference/openai.md - Benchmarks: - ConVerse: benchmark/converse.md + - GAIA2: benchmark/gaia2.md - MACS: benchmark/macs.md + - MMLU: benchmark/mmlu.md - MultiAgentBench: benchmark/multiagentbench.md - Tau2: benchmark/tau2.md - - GAIA2: benchmark/gaia2.md diff --git a/pyproject.toml b/pyproject.toml index 51227d46..dc644b10 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -82,6 +82,7 @@ multiagentbench = [ ] tau2 = ["docstring-parser>=0.16", "addict>=2.4.0"] converse = [] +mmlu = [] # LM Evaluation Harness (for HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval) lm-eval = ["lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main"] From c0f81b9f71c882dbbc8b019ebbc8cd1c485afe5c Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Mon, 9 Mar 2026 12:03:37 +0100 Subject: [PATCH 02/17] [Move DISCO queue to core]: - Update dependencies --- docs/benchmark/mmlu.md | 12 +++++++++--- pyproject.toml | 16 +++++++++++++--- 2 files changed, 22 insertions(+), 6 deletions(-) diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md index 623cf63d..bcb54c11 100644 --- a/docs/benchmark/mmlu.md +++ b/docs/benchmark/mmlu.md @@ -18,16 +18,22 @@ Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/ ## Installation -MMLU has an optional dependency extra (currently empty, as core MMLU requires no additional packages): +Install MMLU with all dependencies needed to run the HuggingFace benchmark and example script: ```bash pip install maseval[mmlu] ``` -For the HuggingFace implementation, also install transformers: +Or with uv: ```bash -pip install maseval[mmlu,transformers] +uv sync --extra mmlu +``` + +This installs `transformers`, `torch`, `numpy`, and `huggingface_hub` (the latter two via `transformers`). 
You can then run the example: + +```bash +python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full ``` For DISCO prediction support: diff --git a/pyproject.toml b/pyproject.toml index dc644b10..c252adeb 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -82,10 +82,20 @@ multiagentbench = [ ] tau2 = ["docstring-parser>=0.16", "addict>=2.4.0"] converse = [] -mmlu = [] +# HuggingFace model + tokenizer, default dataset download; numpy for example script and anchor-point loading; +# lm-eval for --use_lmeval_batching (exact lm-evaluation-harness reproduction); aiohttp required by lm_eval.models.api_models +mmlu = [ + "transformers>=4.37.0", + "numpy>=1.20.0", + "aiohttp>=3.9.0", + "lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main", +] -# LM Evaluation Harness (for HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval) -lm-eval = ["lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main"] +# LM Evaluation Harness (same as in mmlu; aiohttp required by lm_eval.models.api_models) +lm-eval = [ + "aiohttp>=3.9.0", + "lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main", +] # DISCO prediction (for MMLU benchmark example) disco = [ From 6ad80a8d7bde522da002a2aa74bfa7bbb6d0ef7e Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Mon, 9 Mar 2026 12:36:44 +0100 Subject: [PATCH 03/17] [Move DISCO queue to core]: - Add InformativeSubsetQueue - Rename AnchorPointsTaskQueue to DISCOQueue - Make DISCOQueue a subclass of InformativeSubsetQueue --- CHANGELOG.md | 4 +- docs/benchmark/mmlu.md | 4 +- maseval/__init__.py | 6 ++- maseval/benchmark/mmlu/__init__.py | 7 +-- maseval/benchmark/mmlu/mmlu.py | 10 ++-- maseval/core/task.py | 73 +++++++++++++++++++++++------- 6 files changed, 74 insertions(+), 30 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index c1427ccd..aec9785b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -41,7 +41,7 @@ and this project adheres to [Semantic 
Versioning](https://semver.org/spec/v2.0.0 **Core** -- Added `AnchorPointsTaskQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import AnchorPointsTaskQueue`. (PR: #34) +- Added `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import DISCOQueue`. (PR: #34) - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24) - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24) - Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24) @@ -88,7 +88,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** -- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `HuggingFaceMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `AnchorPointsTaskQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34) +- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `HuggingFaceMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. 
Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34) - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26) - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge` - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr` diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md index bcb54c11..d3c8c88d 100644 --- a/docs/benchmark/mmlu.md +++ b/docs/benchmark/mmlu.md @@ -10,7 +10,7 @@ The **MMLU Benchmark** evaluates language models on multiple-choice questions sp [MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features: - **Log-likelihood MCQ evaluation** matching lm-evaluation-harness methodology -- **Anchor-point task selection** via `AnchorPointsTaskQueue` for DISCO-style subset evaluation +- **Anchor-point task selection** via `DISCOQueue` for DISCO-style subset evaluation - **HuggingFace integration** with batched log-probability computation - **lm-eval compatibility** mode for exact numerical reproduction @@ -88,7 +88,7 @@ tasks = load_tasks( anchor_points_path="/path/to/anchor_points.json", ) -# tasks is an AnchorPointsTaskQueue — only anchor tasks are evaluated +# tasks is a DISCOQueue — only anchor tasks are evaluated print(f"Evaluating {len(tasks)} anchor tasks") ``` diff --git a/maseval/__init__.py b/maseval/__init__.py index 387a3345..957fee9b 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -16,7 +16,8 @@ BaseTaskQueue, TaskQueue, SequentialTaskQueue, - AnchorPointsTaskQueue, + InformativeSubsetQueue, + DISCOQueue, PriorityTaskQueue, AdaptiveTaskQueue, ) @@ -94,7 +95,8 @@ "BaseTaskQueue", "TaskQueue", "SequentialTaskQueue", -
"AnchorPointsTaskQueue", + "InformativeSubsetQueue", + "DISCOQueue", "PriorityTaskQueue", "AdaptiveTaskQueue", # Model adapters diff --git a/maseval/benchmark/mmlu/__init__.py b/maseval/benchmark/mmlu/__init__.py index ac5ac154..bc7b4360 100644 --- a/maseval/benchmark/mmlu/__init__.py +++ b/maseval/benchmark/mmlu/__init__.py @@ -7,7 +7,7 @@ HuggingFaceMMLUBenchmark, load_tasks, ) - from maseval import AnchorPointsTaskQueue + from maseval import DISCOQueue, InformativeSubsetQueue # Load tasks and anchor points tasks = load_tasks( @@ -20,7 +20,7 @@ results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"}) """ -from maseval import AnchorPointsTaskQueue +from maseval import DISCOQueue from .mmlu import ( DEFAULT_AGENT_NAME, @@ -58,7 +58,8 @@ "MMLUEvaluator", "MMLUModelAgent", "MMLUAgentAdapter", - "AnchorPointsTaskQueue", + "InformativeSubsetQueue", + "DISCOQueue", "load_tasks", "compute_benchmark_metrics", ] diff --git a/maseval/benchmark/mmlu/mmlu.py b/maseval/benchmark/mmlu/mmlu.py index 0b6de68a..11159ebb 100644 --- a/maseval/benchmark/mmlu/mmlu.py +++ b/maseval/benchmark/mmlu/mmlu.py @@ -10,7 +10,7 @@ from maseval.benchmark.mmlu import ( HuggingFaceMMLUBenchmark, load_tasks, ) - from maseval import AnchorPointsTaskQueue + from maseval import DISCOQueue # Load tasks (optionally filtered to anchor points) tasks = load_tasks( @@ -39,7 +39,7 @@ from maseval import ( AgentAdapter, - AnchorPointsTaskQueue, + DISCOQueue, Benchmark, Environment, Evaluator, @@ -963,13 +963,13 @@ def load_tasks( data_path: Union[str, Path], anchor_points_path: Optional[Union[str, Path]] = None, limit: Optional[int] = None, -) -> Union[AnchorPointsTaskQueue, SequentialTaskQueue]: +) -> Union[DISCOQueue, SequentialTaskQueue]: """Load MMLU tasks from JSON file. Args: data_path: Path to MMLU prompts JSON file (mmlu_prompts_examples.json format). anchor_points_path: Optional path to anchor points pickle file. 
- If provided, returns an AnchorPointsTaskQueue that evaluates + If provided, returns a DISCOQueue that evaluates only the anchor tasks in order. limit: Optional limit on number of tasks to load. @@ -1024,7 +1024,7 @@ def load_tasks( # Create appropriate queue if anchor_points is not None: - return AnchorPointsTaskQueue(tasks, anchor_points) + return DISCOQueue(tasks, anchor_points) else: return SequentialTaskQueue(tasks) diff --git a/maseval/core/task.py b/maseval/core/task.py index 081bba6b..22ec5e0f 100644 --- a/maseval/core/task.py +++ b/maseval/core/task.py @@ -273,51 +273,92 @@ def __iter__(self) -> Iterator[Task]: return iter(self._tasks) -class AnchorPointsTaskQueue(SequentialTaskQueue): - """Task queue that evaluates a specified subset of tasks in a given order. +class InformativeSubsetQueue(SequentialTaskQueue): + """Evaluates an informative subset of tasks in a specified order. - Used for anchor-point-based evaluation where performance on a full dataset - is predicted from results on a carefully selected subset. Anchor points are - integer indices into the original task list. Only tasks at those indices are - yielded, in the order specified by ``anchor_points``. + Used for efficient evaluation where a carefully selected subset of tasks + can predict performance on the full dataset. The subset is defined by + ``indices`` — integer positions into the original task list. Only tasks + at those positions are yielded, in the order given by ``indices``. - When ``anchor_points`` is ``None``, all tasks are yielded in their original order - (equivalent to ``SequentialTaskQueue``). + The informativeness criterion (how the indices were chosen) is determined + by the caller or by a subclass. This base class is criterion-agnostic. + + When ``indices`` is ``None``, all tasks are yielded in their original + order (equivalent to ``SequentialTaskQueue``). Attributes: _all_tasks: The complete, unfiltered task list. - _anchor_points: The anchor-point indices, or ``None``.
+ _indices: The subset indices, or ``None``. Example: ```python # Evaluate only tasks at indices 0, 5, 12 - queue = AnchorPointsTaskQueue(tasks, anchor_points=[0, 5, 12]) + queue = InformativeSubsetQueue(tasks, indices=[0, 5, 12]) for task in queue: result = execute(task) # Only 3 tasks ``` """ - def __init__(self, tasks: Iterable[Task], anchor_points: Optional[List[int]] = None) -> None: - """Initialize anchor-points task queue. + def __init__(self, tasks: Iterable[Task], indices: Optional[List[int]] = None) -> None: + """Initialize informative-subset task queue. Args: tasks: Full list of tasks (ordered by index). - anchor_points: Indices into ``tasks`` selecting which tasks to evaluate + indices: Positions into ``tasks`` selecting which tasks to evaluate and in what order. If ``None``, evaluates all tasks in order. """ all_tasks = list(tasks) self._all_tasks: List[Task] = all_tasks - self._anchor_points: Optional[List[int]] = anchor_points + self._indices: Optional[List[int]] = indices - if anchor_points is not None: + if indices is not None: task_by_index: Dict[int, Task] = {i: task for i, task in enumerate(all_tasks)} - filtered = [task_by_index[idx] for idx in anchor_points if idx in task_by_index] + filtered = [task_by_index[idx] for idx in indices if idx in task_by_index] super().__init__(filtered) else: super().__init__(all_tasks) +class DISCOQueue(InformativeSubsetQueue): + """Diversity-based informative subset using DISCO anchor points. + + Selects a diverse subset of tasks (anchor points) for evaluation. Full + benchmark performance is then predicted from results on this subset using + DISCO (DISCOvering key features for accurate prediction of LLM abilities + on benchmarks). + + The informativeness criterion is **diversity**: anchor points are chosen + to maximise disagreement across models, so that a small evaluation set + captures the discriminative structure of the full benchmark. 
+ + Reference: `DISCO: DISCOvering key features for accurate prediction of + LLM abilities on benchmarks <https://arxiv.org/abs/2407.12890>`_ + + Example: + ```python + queue = DISCOQueue(tasks, anchor_points=[0, 5, 12]) + + for task in queue: + result = execute(task) # Only 3 tasks + ``` + """ + + def __init__(self, tasks: Iterable[Task], anchor_points: Optional[List[int]] = None) -> None: + """Initialize DISCO task queue. + + Args: + tasks: Full list of tasks (ordered by index). + anchor_points: Diversity-selected indices into ``tasks``. + Typically loaded from a DISCO anchor-points file or + downloaded from a HuggingFace DISCO model repo. + If ``None``, evaluates all tasks in order. + """ + self._anchor_points: Optional[List[int]] = anchor_points + super().__init__(tasks, indices=anchor_points) + + class PriorityTaskQueue(BaseTaskQueue): """Execute tasks ordered by priority. From b498ce7c08da89156a785218483e2e6ad6ee413b Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Wed, 11 Mar 2026 17:28:24 +0100 Subject: [PATCH 04/17] [Move DISCO queue to core]: - Rename HuggingFaceMMLUBenchmark to DefaultMMLUBenchmark for consistency with other benchmarks --- CHANGELOG.md | 4 ++-- docs/benchmark/mmlu.md | 6 +++--- examples/mmlu_benchmark/mmlu_benchmark.py | 4 ++-- maseval/benchmark/mmlu/__init__.py | 8 ++++---- maseval/benchmark/mmlu/mmlu.py | 8 ++++---- 5 files changed, 15 insertions(+), 15 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index aec9785b..40d441bb 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** -- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology.
Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34) +- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34) - CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28) @@ -88,7 +88,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** -- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `HuggingFaceMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. 
Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34) +- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34) - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26) - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge` - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr` diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md index d3c8c88d..965d1a5f 100644 --- a/docs/benchmark/mmlu.md +++ b/docs/benchmark/mmlu.md @@ -52,7 +52,7 @@ pip install maseval[lm-eval] ```python from maseval.benchmark.mmlu import ( - HuggingFaceMMLUBenchmark, + DefaultMMLUBenchmark, load_tasks, compute_benchmark_metrics, ) @@ -61,7 +61,7 @@ from maseval.benchmark.mmlu import ( tasks = load_tasks(data_path="/path/to/mmlu_prompts_examples.json") # Create benchmark with HuggingFace model -benchmark = HuggingFaceMMLUBenchmark( +benchmark = DefaultMMLUBenchmark( model_id="meta-llama/Llama-2-7b-hf", device="cuda:0", ) @@ -118,7 +118,7 @@ class MyMMLUBenchmark(MMLUBenchmark): ::: maseval.benchmark.mmlu.MMLUBenchmark -::: maseval.benchmark.mmlu.HuggingFaceMMLUBenchmark +::: maseval.benchmark.mmlu.DefaultMMLUBenchmark ::: maseval.benchmark.mmlu.MMLUEnvironment diff --git a/examples/mmlu_benchmark/mmlu_benchmark.py 
b/examples/mmlu_benchmark/mmlu_benchmark.py index 023915bd..101aeeba 100644 --- a/examples/mmlu_benchmark/mmlu_benchmark.py +++ b/examples/mmlu_benchmark/mmlu_benchmark.py @@ -52,7 +52,7 @@ # MMLU benchmark imports from maseval.benchmark.mmlu import ( DEFAULT_DEVICE, - HuggingFaceMMLUBenchmark, + DefaultMMLUBenchmark, load_tasks, compute_benchmark_metrics, ) @@ -691,7 +691,7 @@ def main(): ) # Create benchmark - benchmark = HuggingFaceMMLUBenchmark( + benchmark = DefaultMMLUBenchmark( model_id=args.model_id, device=args.device, trust_remote_code=True, diff --git a/maseval/benchmark/mmlu/__init__.py b/maseval/benchmark/mmlu/__init__.py index bc7b4360..dd9fd3dc 100644 --- a/maseval/benchmark/mmlu/__init__.py +++ b/maseval/benchmark/mmlu/__init__.py @@ -4,7 +4,7 @@ Usage: from maseval.benchmark.mmlu import ( - HuggingFaceMMLUBenchmark, + DefaultMMLUBenchmark, load_tasks, ) from maseval import DISCOQueue, InformativeSubsetQueue @@ -16,7 +16,7 @@ ) # Run benchmark - benchmark = HuggingFaceMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf") + benchmark = DefaultMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf") results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"}) """ @@ -33,7 +33,7 @@ TARGET_DELIMITER, TASK_TYPE_MMLU, MMLUBenchmark, - HuggingFaceMMLUBenchmark, + DefaultMMLUBenchmark, MMLUEnvironment, MMLUEvaluator, MMLUModelAgent, @@ -53,7 +53,7 @@ "TARGET_DELIMITER", "TASK_TYPE_MMLU", "MMLUBenchmark", - "HuggingFaceMMLUBenchmark", + "DefaultMMLUBenchmark", "MMLUEnvironment", "MMLUEvaluator", "MMLUModelAgent", diff --git a/maseval/benchmark/mmlu/mmlu.py b/maseval/benchmark/mmlu/mmlu.py index 11159ebb..870781ff 100644 --- a/maseval/benchmark/mmlu/mmlu.py +++ b/maseval/benchmark/mmlu/mmlu.py @@ -8,7 +8,7 @@ Usage: from maseval.benchmark.mmlu import ( - HuggingFaceMMLUBenchmark, load_tasks, + DefaultMMLUBenchmark, load_tasks, ) from maseval import DISCOQueue @@ -19,7 +19,7 @@ ) # Run with the HuggingFace concrete implementation - 
benchmark = HuggingFaceMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf") + benchmark = DefaultMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf") results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"}) """ @@ -342,7 +342,7 @@ class MMLUBenchmark(Benchmark): - ``setup_agents()`` - create agents for MCQ evaluation - ``get_model_adapter()`` - provide model adapters - For a ready-to-use implementation, see ``HuggingFaceMMLUBenchmark``. + For a ready-to-use implementation, see ``DefaultMMLUBenchmark``. """ def __init__( @@ -431,7 +431,7 @@ def evaluate( return results -class HuggingFaceMMLUBenchmark(MMLUBenchmark): +class DefaultMMLUBenchmark(MMLUBenchmark): """MMLU Benchmark using HuggingFace transformers models. This concrete implementation uses log-likelihood based MCQ evaluation From 14bcb3f7d6d4c56d67d739de039b3d7fa8edd3ee Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Wed, 11 Mar 2026 17:51:22 +0100 Subject: [PATCH 05/17] [Move DISCO queue to core] Add ModelScorer, ModelAgentAdapter, and rename HuggingFaceModelAdapter Introduce two new core abstractions and refactor the HuggingFace inference layer: - ModelScorer (maseval.core.scorer): ABC for log-likelihood scoring, parallel to ModelAdapter for generation. Methods: loglikelihood(), loglikelihood_batch(), loglikelihood_choices(). - ModelAgentAdapter (maseval.core.agent): generic adapter wrapping any ModelAdapter as an AgentAdapter, replacing benchmark-specific wrappers like MMLUModelAgent/MMLUAgentAdapter. - HuggingFaceModelAdapter renamed to HuggingFacePipelineModelAdapter (old name kept as backwards-compatible alias). - HuggingFaceModelScorer (maseval.interface.inference): concrete ModelScorer backed by AutoModelForCausalLM, with single-token optimisation for MCQ evaluation. Extracted from DefaultMMLUBenchmark. - DefaultMMLUBenchmark refactored to delegate scoring to HuggingFaceModelScorer and use ModelAgentAdapter. 
--- CHANGELOG.md | 11 +- docs/benchmark/mmlu.md | 10 +- maseval/__init__.py | 7 +- maseval/benchmark/mmlu/__init__.py | 4 - maseval/benchmark/mmlu/mmlu.py | 448 ++---------------- maseval/core/agent.py | 76 ++- maseval/core/model.py | 2 +- maseval/core/scorer.py | 276 +++++++++++ maseval/interface/inference/__init__.py | 39 +- maseval/interface/inference/huggingface.py | 35 +- .../interface/inference/huggingface_scorer.py | 264 +++++++++++ 11 files changed, 722 insertions(+), 450 deletions(-) create mode 100644 maseval/core/scorer.py create mode 100644 maseval/interface/inference/huggingface_scorer.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 40d441bb..c6508428 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** -- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34) +- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). 
(PR: #34) - CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28) @@ -42,16 +42,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Core** - Added `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import DISCOQueue`. (PR: #34) +- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #PR_NUMBER_PLACEHOLDER) +- Added `ModelAgentAdapter` in `maseval.core.agent` — a generic adapter that wraps any `ModelAdapter` as an `AgentAdapter` for direct model evaluation (replaces benchmark-specific agent wrappers). 
(PR: #PR_NUMBER_PLACEHOLDER) - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24) - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24) - Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24) - Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24) - Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24) -- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24) +- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFacePipelineModelAdapter` pass seeds to underlying APIs (PR: #24) - Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39) **Interface** +- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #PR_NUMBER_PLACEHOLDER) +- Renamed `HuggingFaceModelAdapter` → `HuggingFacePipelineModelAdapter` to distinguish it from the new scorer. The old name remains as a backwards-compatible alias. 
(PR: #PR_NUMBER_PLACEHOLDER) + - CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22) - Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22) - Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22) @@ -88,7 +93,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** -- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34) +- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). `DefaultMMLUBenchmark` now delegates log-likelihood computation to `HuggingFaceModelScorer` and uses `ModelAgentAdapter` instead of the MMLU-specific `MMLUModelAgent`/`MMLUAgentAdapter` (removed). (PR: #34) - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. 
(PR: #26) - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge` - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr` diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md index 965d1a5f..7514ad18 100644 --- a/docs/benchmark/mmlu.md +++ b/docs/benchmark/mmlu.md @@ -97,13 +97,13 @@ print(f"Evaluating {len(tasks)} anchor tasks") `MMLUBenchmark` is a framework-agnostic base class. To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`: ```python -from maseval.benchmark.mmlu import MMLUBenchmark, MMLUModelAgent, MMLUAgentAdapter +from maseval import ModelAgentAdapter +from maseval.benchmark.mmlu import MMLUBenchmark class MyMMLUBenchmark(MMLUBenchmark): def setup_agents(self, agent_data, environment, task, user, seed_generator): model = self.get_model_adapter(agent_data["model_id"]) - agent = MMLUModelAgent(model, name="mmlu_agent") - adapter = MMLUAgentAdapter(agent, "mmlu_agent") + adapter = ModelAgentAdapter(model, name="mmlu_agent") return [adapter], {"mmlu_agent": adapter} def get_model_adapter(self, model_id, **kwargs): @@ -124,10 +124,6 @@ class MyMMLUBenchmark(MMLUBenchmark): ::: maseval.benchmark.mmlu.MMLUEvaluator -::: maseval.benchmark.mmlu.MMLUModelAgent - -::: maseval.benchmark.mmlu.MMLUAgentAdapter - ::: maseval.benchmark.mmlu.load_tasks ::: maseval.benchmark.mmlu.compute_benchmark_metrics diff --git a/maseval/__init__.py b/maseval/__init__.py index 957fee9b..2aa5b927 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -22,7 +22,7 @@ AdaptiveTaskQueue, ) from .core.environment import Environment -from .core.agent import AgentAdapter +from .core.agent import AgentAdapter, ModelAgentAdapter from .core.benchmark import Benchmark, TaskExecutionStatus from .core.callback_handler import CallbackHandler from .core.callback import BenchmarkCallback, EnvironmentCallback, AgentCallback @@ -35,6 +35,7 @@ 
UserSimulatorError, ) from .core.model import ModelAdapter, ChatResponse +from .core.scorer import ModelScorer from .core.user import User, LLMUser, AgenticLLMUser, TerminationReason from .core.evaluator import Evaluator from .core.history import MessageHistory, ToolInvocationHistory @@ -63,6 +64,7 @@ # Core abstractions "Environment", "AgentAdapter", + "ModelAgentAdapter", "Benchmark", "TaskExecutionStatus", # Callbacks @@ -99,9 +101,10 @@ "DISCOQueue", "PriorityTaskQueue", "AdaptiveTaskQueue", - # Model adapters + # Model adapters and scorers "ModelAdapter", "ChatResponse", + "ModelScorer", # Exceptions and validation "MASEvalError", "AgentError", diff --git a/maseval/benchmark/mmlu/__init__.py b/maseval/benchmark/mmlu/__init__.py index dd9fd3dc..6c6f751c 100644 --- a/maseval/benchmark/mmlu/__init__.py +++ b/maseval/benchmark/mmlu/__init__.py @@ -36,8 +36,6 @@ DefaultMMLUBenchmark, MMLUEnvironment, MMLUEvaluator, - MMLUModelAgent, - MMLUAgentAdapter, load_tasks, compute_benchmark_metrics, ) @@ -56,8 +54,6 @@ "DefaultMMLUBenchmark", "MMLUEnvironment", "MMLUEvaluator", - "MMLUModelAgent", - "MMLUAgentAdapter", "InformativeSubsetQueue", "DISCOQueue", "load_tasks", diff --git a/maseval/benchmark/mmlu/mmlu.py b/maseval/benchmark/mmlu/mmlu.py index 870781ff..ef895e65 100644 --- a/maseval/benchmark/mmlu/mmlu.py +++ b/maseval/benchmark/mmlu/mmlu.py @@ -43,15 +43,13 @@ Benchmark, Environment, Evaluator, - MessageHistory, ModelAdapter, + ModelAgentAdapter, Task, User, SeedGenerator, ) from maseval.core.task import SequentialTaskQueue -from maseval.core.tracing import TraceableMixin -from maseval.core.config import ConfigurableMixin # ============================================================================= @@ -229,103 +227,6 @@ def _parse_answer(self, response: str) -> int: return -1 -# ============================================================================= -# Model Adapter Wrapper for MCQ -# 
============================================================================= - - -class MMLUModelAgent(TraceableMixin, ConfigurableMixin): - """Simple agent wrapper that passes prompts to a model for MCQ evaluation. - - This is a minimal agent that just forwards prompts to the model - and returns the response. It supports tracing for MASEval integration. - """ - - def __init__(self, model: ModelAdapter, name: str = DEFAULT_AGENT_NAME): - """Initialize MMLU model agent. - - Args: - model: ModelAdapter to use for generation. - name: Agent name for tracing. - """ - super().__init__() - self.model = model - self.name = name - self._messages: List[Dict[str, Any]] = [] - - def run(self, prompt: str) -> str: - """Run the model on a prompt. - - Args: - prompt: The prompt to send to the model. - - Returns: - Model's response string. - """ - # Record input message - self._messages.append({"role": "user", "content": prompt}) - - # Generate response - response = self.model.generate(prompt) - - # Record output message - self._messages.append({"role": "assistant", "content": response}) - - return response - - def gather_traces(self) -> Dict[str, Any]: - """Gather traces for this agent.""" - return { - **super().gather_traces(), - "name": self.name, - "messages": list(self._messages), - } - - def gather_config(self) -> Dict[str, Any]: - """Gather configuration.""" - return { - **super().gather_config(), - "name": self.name, - "model_id": self.model.model_id, - } - - -class MMLUAgentAdapter(AgentAdapter): - """AgentAdapter wrapper for MMLUModelAgent.""" - - def __init__(self, agent: MMLUModelAgent, name: str): - """Initialize adapter. - - Args: - agent: MMLUModelAgent instance. - name: Adapter name. 
- """ - super().__init__(agent, name) - - def _run_agent(self, query: str) -> Any: - """Execute the agent.""" - return self.agent.run(query) - - def get_messages(self) -> MessageHistory: - """Get agent messages.""" - return MessageHistory(self.agent._messages) - - def gather_traces(self) -> Dict[str, Any]: - """Gather execution traces from this agent.""" - from maseval.core.tracing import TraceableMixin - - messages = self.get_messages() - return { - **TraceableMixin.gather_traces(self), - "name": self.name, - "agent_type": type(self.agent).__name__, - "message_count": len(messages), - "messages": messages.to_list(), - "callbacks": [type(cb).__name__ for cb in self.callbacks], - "logs": self.logs, - } - - # ============================================================================= # Benchmark # ============================================================================= @@ -435,12 +336,14 @@ class DefaultMMLUBenchmark(MMLUBenchmark): """MMLU Benchmark using HuggingFace transformers models. This concrete implementation uses log-likelihood based MCQ evaluation - with the same optimizations as lm-evaluation-harness: + via ``HuggingFaceModelScorer``, with the same optimisations as + lm-evaluation-harness: - 1. Single forward pass per question (one-token continuation optimization) - 2. Batching multiple questions together - 3. Efficient log-softmax computation - 4. Proper left-padding for batch processing + 1. Single forward pass per question (one-token continuation optimisation) + 2. Efficient log-softmax computation + 3. Proper left-padding for batch processing + + Agents are created using the generic ``ModelAgentAdapter``. """ def __init__( @@ -459,16 +362,22 @@ def __init__( device: Device to run model on. trust_remote_code: Trust remote code when loading model (default True). use_full_prompt: Use full prompt with few-shot examples (default True). - batch_size: Batch size for evaluation (number of questions per batch). 
- **kwargs: Additional arguments passed to MMLUBenchmark. + batch_size: Batch size for lm-eval batching (number of questions per batch). + **kwargs: Additional arguments passed to ``MMLUBenchmark``. """ super().__init__(use_full_prompt=use_full_prompt, **kwargs) self._model_id = model_id self._device = device self._trust_remote_code = trust_remote_code self._batch_size = batch_size - self._model = None - self._tokenizer = None + + from maseval.interface.inference.huggingface_scorer import HuggingFaceModelScorer + + self._scorer = HuggingFaceModelScorer( + model_id=model_id, + device=device, + trust_remote_code=trust_remote_code, + ) def setup_agents( self, @@ -492,189 +401,9 @@ def setup_agents( """ model_id = agent_data["model_id"] model = self.get_model_adapter(model_id, register_name=DEFAULT_MODEL_REGISTER_NAME) - - agent = MMLUModelAgent(model, name=DEFAULT_AGENT_NAME) - adapter = MMLUAgentAdapter(agent, DEFAULT_AGENT_NAME) - + adapter = ModelAgentAdapter(model, DEFAULT_AGENT_NAME) return [adapter], {DEFAULT_AGENT_NAME: adapter} - def _load_model(self): - """Lazy load the model and tokenizer for log-likelihood computation.""" - if self._model is None: - from transformers import AutoModelForCausalLM, AutoTokenizer - - print(f"Loading model: {self._model_id}") - self._tokenizer = AutoTokenizer.from_pretrained( - self._model_id, - trust_remote_code=self._trust_remote_code, - ) - self._tokenizer.padding_side = "left" - if self._tokenizer.pad_token is None: - self._tokenizer.pad_token = self._tokenizer.eos_token - - # Load model with torch_dtype="auto" to match lm-evaluation-harness exactly - # This uses the model's native dtype (bfloat16 for most modern models) - # Then move to device manually - self._model = AutoModelForCausalLM.from_pretrained( - self._model_id, - trust_remote_code=self._trust_remote_code, - torch_dtype="auto", - ) - self._model = self._model.to(self._device) - self._model.eval() - - # Note: We don't pre-cache choice token IDs here because they 
depend on context. - # Token IDs are computed dynamically in _get_choice_token_id_in_context() - # to match lm-evaluation-harness behavior exactly. - - return self._model, self._tokenizer - - def _get_choice_token_id_separate(self, choice: str) -> Optional[int]: - """Get the token ID for a choice when tokenized SEPARATELY. - - CRITICAL: lm-evaluation-harness encodes context and continuation separately, - then concatenates. This means "A" is always tokenized standalone (token 330), - NOT in context after "Answer:" (which would be token 28741). - - We must match this behavior to get identical log-likelihood values. - - Args: - choice: The choice string (e.g., "A"). - - Returns: - Token ID for the choice (standalone tokenization), or None if multi-token. - """ - _, tokenizer = self._load_model() - - # Tokenize choice ALONE (not in context) - this is how lm-eval does it - choice_tokens = tokenizer.encode(choice, add_special_tokens=False) - - if len(choice_tokens) == 1: - return choice_tokens[0] - else: - # Multi-token choice - return None to trigger multi-token fallback - return None - - def _encode_pair(self, context: str, continuation: str) -> tuple: - """Encode a context-continuation pair like lm-evaluation-harness. - - This matches lm-eval's _encode_pair method exactly: - 1. Encode whole = context + continuation - 2. Encode context alone - 3. continuation_enc = whole[len(context_enc):] - - This handles tokenization boundary effects correctly. - - Args: - context: The context/prompt string. - continuation: The continuation string (e.g., " A" with target_delimiter). - - Returns: - Tuple of (context_enc, continuation_enc) token lists. 
- """ - _, tokenizer = self._load_model() - - # Handle trailing spaces in context (move to continuation) - n_spaces = len(context) - len(context.rstrip()) - if n_spaces > 0: - continuation = context[-n_spaces:] + continuation - context = context[:-n_spaces] - - # Encode whole string together, then split - whole_enc = tokenizer.encode(context + continuation, add_special_tokens=True) - context_enc = tokenizer.encode(context, add_special_tokens=True) - - # Continuation tokens are what's left after context - continuation_enc = whole_enc[len(context_enc) :] - - return context_enc, continuation_enc - - def _compute_logprobs_single_token(self, prompt: str, choices: list) -> list: - """Compute log-likelihoods using single-token optimization. - - For MCQ with single-letter answers (A, B, C, D), we can compute all - choices in one forward pass since they share the same context. - - IMPORTANT: To match lm-evaluation-harness EXACTLY: - 1. Use target_delimiter=" " before choices (e.g., " A" not "A") - 2. Use _encode_pair to handle tokenization boundaries correctly - 3. Input = (context + continuation)[:-1] - 4. Apply log_softmax to get log probabilities - - Args: - prompt: The prompt/question text. - choices: List of answer choice strings (e.g., ["A", "B", "C", "D"]). - - Returns: - List of log-likelihoods, one per choice. 
- """ - import torch - - model, _ = self._load_model() - - # lm-eval uses target_delimiter=" " for multiple choice tasks - target_delimiter = TARGET_DELIMITER - - # Encode first choice to get the shared context - first_continuation = f"{target_delimiter}{choices[0]}" - context_enc, first_cont_enc = self._encode_pair(prompt, first_continuation) - - # Build input: (context + continuation)[:-1] - full_sequence = context_enc + first_cont_enc - input_tokens = full_sequence[:-1] # Remove last token - - input_ids = torch.tensor([input_tokens], dtype=torch.long, device=self._device) - - with torch.no_grad(): - outputs = model(input_ids) - logits = outputs.logits[0] # (seq_len, vocab_size) - - # Select logits at position where continuation is predicted - # For single-token continuation, this is the last position - inplen = len(input_tokens) - contlen = len(first_cont_enc) - selected_logits = logits[inplen - contlen : inplen] - - # Compute log-softmax - log_probs = torch.nn.functional.log_softmax(selected_logits, dim=-1) - - # Get log prob for each choice's continuation token - logprobs = [] - for choice in choices: - continuation = f"{target_delimiter}{choice}" - _, cont_enc = self._encode_pair(prompt, continuation) - - # Sum log probs for multi-token continuations - total = 0.0 - for i, token_id in enumerate(cont_enc): - total += log_probs[i, token_id].item() - logprobs.append(total) - - return logprobs - - def _compute_logprobs_batched(self, prompts: list, choices_list: list) -> list: - """Compute log-likelihoods for a batch of prompts. - - For exact match with lm-evaluation-harness, we process each prompt - individually using _compute_logprobs_single_token which uses the - correct _encode_pair tokenization logic. - - Args: - prompts: List of prompt strings. - choices_list: List of choice lists (one per prompt). - - Returns: - List of log-likelihood lists, one per prompt. 
- """ - # For exact match with lm-eval, process individually - # This ensures correct tokenization via _encode_pair - all_logprobs = [] - for prompt, choices in zip(prompts, choices_list): - logprobs = self._compute_logprobs_single_token(prompt, choices) - all_logprobs.append(logprobs) - - return all_logprobs - def precompute_all_logprobs_lmeval(self, tasks: Sequence[Task]) -> Dict[Any, List[float]]: """Precompute log-likelihoods for ALL tasks using lm-eval's batching. @@ -755,60 +484,6 @@ def precompute_all_logprobs_lmeval(self, tasks: Sequence[Task]) -> Dict[Any, Lis return doc_logprobs - def _compute_logprobs_multi_token(self, prompt: str, choices: list) -> list: - """Compute log-likelihoods for multi-token continuations. - - This is the fallback for when answer choices have multiple tokens. - Uses _encode_pair to match lm-evaluation-harness exactly. - - Args: - prompt: The prompt/question text. - choices: List of answer choice strings. - - Returns: - List of log-likelihoods, one per choice. 
- """ - import torch - - model, _ = self._load_model() - - # lm-eval uses target_delimiter=" " for multiple choice tasks - target_delimiter = TARGET_DELIMITER - - all_logprobs = [] - for choice in choices: - continuation = f"{target_delimiter}{choice}" - - # Use _encode_pair for correct tokenization - context_enc, continuation_enc = self._encode_pair(prompt, continuation) - - # Build input: (context + continuation)[:-1] - full_sequence = context_enc + continuation_enc - input_tokens = full_sequence[:-1] - - input_ids = torch.tensor([input_tokens], dtype=torch.long, device=self._device) - - with torch.no_grad(): - outputs = model(input_ids) - logits = outputs.logits[0] # (seq_len, vocab_size) - - # Select continuation logits - inplen = len(input_tokens) - contlen = len(continuation_enc) - selected = logits[inplen - contlen : inplen] - - # Compute log-softmax - log_probs = torch.nn.functional.log_softmax(selected, dim=-1) - - # Sum log probs for all continuation tokens - total = 0.0 - for i, token_id in enumerate(continuation_enc): - total += log_probs[i, token_id].item() - - all_logprobs.append(total) - - return all_logprobs - def run_agents( self, agents: Sequence[AgentAdapter], @@ -819,111 +494,62 @@ def run_agents( """Execute log-likelihood based MCQ evaluation. Uses precomputed logprobs if available (for exact lm-eval match), - otherwise falls back to single-forward-pass optimization for - single-token answers, or multi-token batched computation. + otherwise delegates to ``HuggingFaceModelScorer.loglikelihood_choices()`` + which automatically picks single-token or multi-token scoring. 
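The precomputed-logprobs fast path described above can be sketched in pure Python. The `doc_id` key and the log-likelihood values below are made up for illustration; real values come from `precompute_all_logprobs_lmeval()`:

```python
# Toy sketch of the precomputed-logprobs fast path: doc_id keys map to
# per-choice log-likelihood lists, and the highest value wins.
# The key (7) and the values are illustrative, not real model output.
precomputed_logprobs = {7: [-1.2, -0.3, -2.8, -1.9]}
choices = ["A", "B", "C", "D"]

logprobs = precomputed_logprobs.get(7)
if logprobs is not None:
    best_idx = logprobs.index(max(logprobs))
    answer = choices[best_idx]
# answer == "B"
```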
""" - # Get the prompt from environment prompt = environment.get_prompt() choices = environment.state.get("choices", DEFAULT_CHOICES) doc_id = task.metadata.get("doc_id") if task else None - # Check if we have precomputed logprobs (for exact lm-eval match) if hasattr(self, "_precomputed_logprobs") and doc_id is not None: logprobs = self._precomputed_logprobs.get(doc_id) if logprobs is not None: - # Use precomputed values for exact match best_idx = logprobs.index(max(logprobs)) answer = choices[best_idx] - - # Store logprobs in environment for later retrieval environment.state["logprobs"] = logprobs environment.state["predicted_idx"] = best_idx - - # Record in agent messages for tracing agent = agents[0] - agent.agent._messages.append({"role": "user", "content": prompt}) - agent.agent._messages.append( - { - "role": "assistant", - "content": answer, - "logprobs": logprobs, - } - ) - + agent._messages.append({"role": "user", "content": prompt}) + agent._messages.append({"role": "assistant", "content": answer, "logprobs": logprobs}) return answer - # Fall back to computing logprobs on-the-fly - # Load model - self._load_model() - - # lm-eval uses target_delimiter=" " for multiple choice tasks - target_delimiter = TARGET_DELIMITER - - # Check if all choices result in single-token continuations - # using _encode_pair to get the correct tokenization - all_single_token = True - for choice in choices: - continuation = f"{target_delimiter}{choice}" - _, cont_enc = self._encode_pair(prompt, continuation) - if len(cont_enc) != 1: - all_single_token = False - break + logprobs = self._scorer.loglikelihood_choices(prompt, choices, delimiter=TARGET_DELIMITER) - if all_single_token: - # Use optimized single-token path (one forward pass) - logprobs = self._compute_logprobs_single_token(prompt, choices) - else: - # Fall back to multi-token computation - logprobs = self._compute_logprobs_multi_token(prompt, choices) - - # Select the choice with highest log-probability best_idx = 
logprobs.index(max(logprobs)) answer = choices[best_idx] - - # Store logprobs in environment for later retrieval if needed environment.state["logprobs"] = logprobs environment.state["predicted_idx"] = best_idx - # Record in agent messages for tracing agent = agents[0] - agent.agent._messages.append({"role": "user", "content": prompt}) - agent.agent._messages.append( - { - "role": "assistant", - "content": answer, - "logprobs": logprobs, - } - ) - + agent._messages.append({"role": "user", "content": prompt}) + agent._messages.append({"role": "assistant", "content": answer, "logprobs": logprobs}) return answer def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter: - """Provide a HuggingFace ModelAdapter. + """Provide a HuggingFace ``ModelAdapter``. - Note: For logprobs-based evaluation, we don't actually use the adapter - for generation. This is kept for API compatibility. + The returned adapter is a placeholder — actual evaluation uses + ``HuggingFaceModelScorer`` for log-likelihood scoring. The adapter + is required by the ``Benchmark`` contract for ``setup_agents()``. Args: model_id: Model identifier (ignored, uses instance model_id). - **kwargs: Additional arguments (e.g., register_name). + **kwargs: Additional arguments (e.g., ``register_name``). Returns: - ``HuggingFaceModelAdapter`` instance. + ``HuggingFacePipelineModelAdapter`` instance. 
""" - from maseval.interface.inference import HuggingFaceModelAdapter + from maseval.interface.inference import HuggingFacePipelineModelAdapter - # Create a minimal adapter for compatibility - # The actual evaluation uses _compute_logprobs_* - class DummyCallable: - def __call__(self, prompt, **kwargs): + class _DummyCallable: + def __call__(self, prompt: str, **kw: Any) -> str: return "" - adapter = HuggingFaceModelAdapter( - model=DummyCallable(), + adapter = HuggingFacePipelineModelAdapter( + model=_DummyCallable(), model_id=self._model_id, ) - # Register for tracing if requested register_name = kwargs.get("register_name") if register_name: self.register("models", register_name, adapter) diff --git a/maseval/core/agent.py b/maseval/core/agent.py index 97011527..e76a3ea6 100644 --- a/maseval/core/agent.py +++ b/maseval/core/agent.py @@ -1,11 +1,16 @@ +from __future__ import annotations + from abc import ABC, abstractmethod -from typing import List, Any, Optional, Dict +from typing import TYPE_CHECKING, List, Any, Optional, Dict from .callback import AgentCallback from .history import MessageHistory from .tracing import TraceableMixin from .config import ConfigurableMixin +if TYPE_CHECKING: + from .model import ModelAdapter + class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin): """Wraps an agent from any framework to provide a standard interface. @@ -186,3 +191,72 @@ def gather_config(self) -> Dict[str, Any]: def __repr__(self): return f"AgentAdapter(name={self.name}, agent_type={type(self.agent).__name__})" + + +class ModelAgentAdapter(AgentAdapter): + """Wraps a ``ModelAdapter`` as an ``AgentAdapter`` for direct model evaluation. + + Use this when a benchmark needs to plug a model directly into the agent + slot without an agentic framework. The adapter forwards queries to + ``ModelAdapter.generate()`` and records the conversation for tracing. 
+ + Example: + ```python + from maseval import ModelAgentAdapter + from maseval.interface.inference import LiteLLMModelAdapter + + model = LiteLLMModelAdapter(model_id="gpt-4") + agent = ModelAgentAdapter(model, name="evaluator") + result = agent.run("What is the capital of France?") + ``` + """ + + def __init__( + self, + model: ModelAdapter, + name: str, + callbacks: Optional[List[AgentCallback]] = None, + ): + """Initialize a model-backed agent adapter. + + Args: + model: ``ModelAdapter`` instance used for generation. + name: Agent name for tracing and identification. + callbacks: Optional agent callbacks. + """ + super().__init__(model, name, callbacks) + self._messages: List[Dict[str, Any]] = [] + + @property + def model(self) -> ModelAdapter: + """The underlying ``ModelAdapter``.""" + return self.agent + + def _run_agent(self, query: str) -> str: + """Generate a response by forwarding the query to the model. + + Args: + query: The prompt to send to the model. + + Returns: + The model's text response. + """ + self._messages.append({"role": "user", "content": query}) + response = self.agent.generate(query) + self._messages.append({"role": "assistant", "content": response}) + return response + + def get_messages(self) -> MessageHistory: + """Return the recorded conversation history.""" + return MessageHistory(self._messages) + + def gather_config(self) -> Dict[str, Any]: + """Gather configuration including model identifier. + + Returns: + Dictionary containing agent and model configuration. 
+ """ + return { + **super().gather_config(), + "model_id": self.agent.model_id, + } diff --git a/maseval/core/model.py b/maseval/core/model.py index cac1c2ed..d62d204c 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -155,7 +155,7 @@ class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin): See maseval.interface.inference for concrete implementations: - AnthropicModelAdapter - GoogleGenAIModelAdapter - - HuggingFaceModelAdapter + - HuggingFacePipelineModelAdapter (alias: HuggingFaceModelAdapter) - LiteLLMModelAdapter - OpenAIModelAdapter diff --git a/maseval/core/scorer.py b/maseval/core/scorer.py new file mode 100644 index 00000000..aed7d672 --- /dev/null +++ b/maseval/core/scorer.py @@ -0,0 +1,276 @@ +"""Core model scorer abstractions for likelihood-based evaluation. + +This module provides the base `ModelScorer` class for computing token-level +scores (log-likelihoods) from language models. While `ModelAdapter` handles +text generation (``chat``, ``generate``), ``ModelScorer`` handles scoring by +computing how likely a model considers a given continuation. + +See `maseval.interface.inference` for concrete implementations. 
+ +Example: + ```python + from maseval.interface.inference import HuggingFaceModelScorer + + scorer = HuggingFaceModelScorer( + model_id="meta-llama/Llama-2-7b-hf", + device="cuda:0", + ) + + # Single pair + ll = scorer.loglikelihood("The capital of France is", " Paris") + + # MCQ evaluation + logprobs = scorer.loglikelihood_choices( + "What is 2+2?\\nA) 3\\nB) 4\\nC) 5\\nD) 6\\nAnswer:", + choices=["A", "B", "C", "D"], + ) + best = ["A", "B", "C", "D"][logprobs.index(max(logprobs))] + ``` +""" + +from __future__ import annotations + +import time +from abc import ABC, abstractmethod +from datetime import datetime +from typing import Any, Dict, List, Optional, Tuple + +from .config import ConfigurableMixin +from .tracing import TraceableMixin + + +class ModelScorer(ABC, TraceableMixin, ConfigurableMixin): + """Abstract base class for model scorers. + + ``ModelScorer`` provides a consistent interface for computing token-level + log-likelihoods from language models. All scorers implement the same + methods, so you can swap providers without changing evaluation code. + + To use a scorer: + + 1. Create an instance with provider-specific configuration + 2. Call ``loglikelihood()`` for single context-continuation pairs + 3. Call ``loglikelihood_batch()`` for efficient batch computation + 4. Call ``loglikelihood_choices()`` for MCQ evaluation + + Implementing a custom scorer: + + Subclass ``ModelScorer`` and implement: + + - ``model_id`` property: Return the model identifier string + - ``_loglikelihood_impl()``: Score a single (context, continuation) pair + + Optionally override: + + - ``_loglikelihood_batch_impl()``: Optimised batch scoring + - ``loglikelihood_choices()``: MCQ-specific optimisations (e.g. shared-context single-pass) + """ + + def __init__(self, seed: Optional[int] = None): + """Initialize the model scorer. + + Args: + seed: Seed for deterministic scoring. Passed to the underlying + model if supported. 
+ """ + super().__init__() + self._seed = seed + self.logs: List[Dict[str, Any]] = [] + + @property + def seed(self) -> Optional[int]: + """Seed for deterministic scoring, or None if unseeded.""" + return self._seed + + @property + @abstractmethod + def model_id(self) -> str: + """The identifier for the underlying model. + + Returns: + A string identifying the model (e.g., ``"meta-llama/Llama-2-7b-hf"``). + """ + + def loglikelihood(self, context: str, continuation: str) -> float: + """Compute the log-likelihood of ``continuation`` given ``context``. + + Args: + context: The conditioning text (prompt). + continuation: The text whose likelihood is scored. + + Returns: + Log-likelihood (negative float; higher = more likely). + """ + start_time = time.time() + try: + result = self._loglikelihood_impl(context, continuation) + duration = time.time() - start_time + self.logs.append( + { + "timestamp": datetime.now().isoformat(), + "type": "loglikelihood", + "duration_seconds": duration, + "status": "success", + } + ) + return result + except Exception as e: + duration = time.time() - start_time + self.logs.append( + { + "timestamp": datetime.now().isoformat(), + "type": "loglikelihood", + "duration_seconds": duration, + "status": "error", + "error": str(e), + "error_type": type(e).__name__, + } + ) + raise + + @abstractmethod + def _loglikelihood_impl(self, context: str, continuation: str) -> float: + """Internal implementation for single-pair scoring. + + Subclasses must implement this. The base class handles timing + and error logging. + + Args: + context: The conditioning text. + continuation: The text to score. + + Returns: + Log-likelihood of the continuation. + """ + + def loglikelihood_batch(self, pairs: List[Tuple[str, str]]) -> List[float]: + """Compute log-likelihoods for a batch of (context, continuation) pairs. + + Override ``_loglikelihood_batch_impl`` for provider-specific batching + optimisations. The default loops over ``_loglikelihood_impl``. 
+ + Args: + pairs: List of (context, continuation) tuples. + + Returns: + List of log-likelihoods, one per pair. + """ + start_time = time.time() + try: + results = self._loglikelihood_batch_impl(pairs) + duration = time.time() - start_time + self.logs.append( + { + "timestamp": datetime.now().isoformat(), + "type": "loglikelihood_batch", + "batch_size": len(pairs), + "duration_seconds": duration, + "status": "success", + } + ) + return results + except Exception as e: + duration = time.time() - start_time + self.logs.append( + { + "timestamp": datetime.now().isoformat(), + "type": "loglikelihood_batch", + "batch_size": len(pairs), + "duration_seconds": duration, + "status": "error", + "error": str(e), + "error_type": type(e).__name__, + } + ) + raise + + def _loglikelihood_batch_impl(self, pairs: List[Tuple[str, str]]) -> List[float]: + """Default batch implementation — loops over ``_loglikelihood_impl``. + + Override in subclasses for provider-specific batching. + + Args: + pairs: List of (context, continuation) tuples. + + Returns: + List of log-likelihoods. + """ + return [self._loglikelihood_impl(ctx, cont) for ctx, cont in pairs] + + def loglikelihood_choices( + self, + context: str, + choices: List[str], + delimiter: str = " ", + ) -> List[float]: + """Compute log-likelihoods for multiple-choice continuations. + + Convenience method for MCQ evaluation. Each choice is prepended with + ``delimiter`` before scoring (e.g. ``" A"``, ``" B"``). + + Subclasses may override this for optimised shared-context scoring + (e.g. single forward pass for single-token choices). + + Args: + context: The question/prompt text. + choices: Answer choice strings (e.g. ``["A", "B", "C", "D"]``). + delimiter: String prepended to each choice (default ``" "``). + + Returns: + List of log-likelihoods, one per choice. 
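The default behaviour pairs each choice with the shared context via the delimiter before batch scoring; a pure-Python sketch of that pairing step:

```python
def build_choice_pairs(context, choices, delimiter=" "):
    # Prepend the delimiter to each choice, mirroring what the default
    # loglikelihood_choices() does before calling loglikelihood_batch().
    return [(context, f"{delimiter}{c}") for c in choices]

pairs = build_choice_pairs("What is 2+2? Answer:", ["A", "B"])
# pairs == [("What is 2+2? Answer:", " A"), ("What is 2+2? Answer:", " B")]
```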
+ """ + pairs = [(context, f"{delimiter}{c}") for c in choices] + return self.loglikelihood_batch(pairs) + + def gather_traces(self) -> Dict[str, Any]: + """Gather execution traces from this scorer. + + Output fields: + + - ``type`` - Component class name + - ``gathered_at`` - ISO timestamp + - ``model_id`` - Model identifier + - ``total_calls`` - Number of scoring calls + - ``successful_calls`` - Number of successful calls + - ``failed_calls`` - Number of failed calls + - ``total_duration_seconds`` - Total time spent in calls + - ``logs`` - List of individual call records + + Returns: + Dictionary containing scorer execution traces. + """ + total_calls = len(self.logs) + successful_calls = sum(1 for call in self.logs if call["status"] == "success") + failed_calls = total_calls - successful_calls + total_duration = sum(call["duration_seconds"] for call in self.logs) + + return { + **super().gather_traces(), + "model_id": self.model_id, + "total_calls": total_calls, + "successful_calls": successful_calls, + "failed_calls": failed_calls, + "total_duration_seconds": total_duration, + "logs": self.logs, + } + + def gather_config(self) -> Dict[str, Any]: + """Gather configuration from this scorer. + + Output fields: + + - ``type`` - Component class name + - ``gathered_at`` - ISO timestamp + - ``model_id`` - Model identifier + - ``scorer_type`` - The specific scorer class name + - ``seed`` - Seed for deterministic scoring, or None if unseeded + + Returns: + Dictionary containing scorer configuration. + """ + return { + **super().gather_config(), + "model_id": self.model_id, + "scorer_type": type(self).__name__, + "seed": self._seed, + } diff --git a/maseval/interface/inference/__init__.py b/maseval/interface/inference/__init__.py index e6765d1e..549c719b 100644 --- a/maseval/interface/inference/__init__.py +++ b/maseval/interface/inference/__init__.py @@ -1,14 +1,20 @@ -"""Inference model adapters for various providers. 
+"""Inference model adapters and scorers for various providers. -This package contains concrete implementations of ModelAdapter for different -inference providers. Each adapter requires the corresponding optional dependency. +This package contains concrete implementations of ``ModelAdapter`` and +``ModelScorer`` for different inference providers. Each adapter/scorer +requires the corresponding optional dependency. -Available adapters: - - AnthropicModelAdapter: Anthropic Claude models (requires anthropic) - - GoogleGenAIModelAdapter: Google Gemini models (requires google-genai) - - HuggingFaceModelAdapter: HuggingFace transformers (requires transformers) - - LiteLLMModelAdapter: 100+ providers via LiteLLM (requires litellm) - - OpenAIModelAdapter: OpenAI and compatible APIs (requires openai) +Available adapters (text generation): + +- ``AnthropicModelAdapter``: Anthropic Claude models (requires ``anthropic``) +- ``GoogleGenAIModelAdapter``: Google Gemini models (requires ``google-genai``) +- ``HuggingFacePipelineModelAdapter``: HuggingFace pipelines (requires ``transformers``) +- ``LiteLLMModelAdapter``: 100+ providers via LiteLLM (requires ``litellm``) +- ``OpenAIModelAdapter``: OpenAI and compatible APIs (requires ``openai``) + +Available scorers (log-likelihood): + +- ``HuggingFaceModelScorer``: HuggingFace causal LMs (requires ``transformers``) Example: ```python @@ -49,13 +55,26 @@ # Conditionally import HuggingFace adapter try: - from .huggingface import HuggingFaceModelAdapter, ToolCallingNotSupportedError # noqa: F401 + from .huggingface import ( # noqa: F401 + HuggingFacePipelineModelAdapter, + HuggingFaceModelAdapter, + ToolCallingNotSupportedError, + ) + __all__.append("HuggingFacePipelineModelAdapter") __all__.append("HuggingFaceModelAdapter") __all__.append("ToolCallingNotSupportedError") except ImportError: pass +# Conditionally import HuggingFace scorer +try: + from .huggingface_scorer import HuggingFaceModelScorer # noqa: F401 + + 
__all__.append("HuggingFaceModelScorer") +except ImportError: + pass + # Conditionally import LiteLLM adapter try: from .litellm import LiteLLMModelAdapter # noqa: F401 diff --git a/maseval/interface/inference/huggingface.py b/maseval/interface/inference/huggingface.py index 45fac7e8..f765eb49 100644 --- a/maseval/interface/inference/huggingface.py +++ b/maseval/interface/inference/huggingface.py @@ -1,7 +1,10 @@ -"""HuggingFace model adapter. +"""HuggingFace pipeline model adapter. -This adapter works with HuggingFace transformers pipelines and models. -It supports both simple callable models and full pipeline objects. +This adapter works with HuggingFace transformers pipelines and callables +for text generation via ``chat()`` and ``generate()``. + +For log-likelihood scoring (e.g. MCQ evaluation), see +``HuggingFaceModelScorer`` in ``maseval.interface.inference.huggingface_scorer``. Requires transformers to be installed: pip install maseval[transformers] @@ -9,11 +12,11 @@ Example: ```python from transformers import pipeline - from maseval.interface.inference import HuggingFaceModelAdapter + from maseval.interface.inference import HuggingFacePipelineModelAdapter # Using a pipeline pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct") - model = HuggingFaceModelAdapter(model=pipe, model_id="llama-3.1-8b") + model = HuggingFacePipelineModelAdapter(model=pipe, model_id="llama-3.1-8b") # Simple generation response = model.generate("Hello!") @@ -42,12 +45,18 @@ class ToolCallingNotSupportedError(Exception): pass -class HuggingFaceModelAdapter(ModelAdapter): - """Adapter for HuggingFace transformers models and pipelines. +class HuggingFacePipelineModelAdapter(ModelAdapter): + """Adapter for HuggingFace transformers pipelines and callables. + + Wraps a HuggingFace ``pipeline()`` object (or any text-generation callable) + for use with the ``ModelAdapter`` interface (``chat()``, ``generate()``). 
+ + For log-likelihood scoring, see ``HuggingFaceModelScorer``. Works with: - - transformers.pipeline() objects - - Any callable that accepts a prompt and returns text + + - ``transformers.pipeline()`` objects + - Any callable that accepts a prompt and returns text For chat functionality, the adapter uses the tokenizer's chat template if available. This provides proper formatting for instruction-tuned models. @@ -55,8 +64,8 @@ class HuggingFaceModelAdapter(ModelAdapter): Tool calling support: Tool calling is only supported if the model's chat template explicitly supports it. If you pass tools and the model doesn't support them, - a ToolCallingNotSupportedError is raised. For reliable tool calling, - consider using LiteLLMModelAdapter instead. + a ``ToolCallingNotSupportedError`` is raised. For reliable tool calling, + consider using ``LiteLLMModelAdapter`` instead. """ def __init__( @@ -378,3 +387,7 @@ def gather_config(self) -> Dict[str, Any]: base_config["pipeline_config"] = pipeline_config return base_config + + +# Backwards compatibility alias +HuggingFaceModelAdapter = HuggingFacePipelineModelAdapter diff --git a/maseval/interface/inference/huggingface_scorer.py b/maseval/interface/inference/huggingface_scorer.py new file mode 100644 index 00000000..53d43dad --- /dev/null +++ b/maseval/interface/inference/huggingface_scorer.py @@ -0,0 +1,264 @@ +"""HuggingFace model scorer for log-likelihood evaluation. + +Wraps a raw HuggingFace ``AutoModelForCausalLM`` (not a pipeline) and +exposes ``loglikelihood()`` for scoring context-continuation pairs. Designed +for MCQ-style evaluation where the best answer is chosen by highest +log-likelihood. + +For text generation (``chat()``, ``generate()``), see +``HuggingFacePipelineModelAdapter`` in ``maseval.interface.inference.huggingface``. 
+ +Requires transformers and torch: + pip install maseval[transformers] + +Example: + ```python + from maseval.interface.inference import HuggingFaceModelScorer + + scorer = HuggingFaceModelScorer( + model_id="meta-llama/Llama-2-7b-hf", + device="cuda:0", + ) + + # Score a single continuation + ll = scorer.loglikelihood("The capital of France is", " Paris") + + # MCQ: pick the most likely answer + logprobs = scorer.loglikelihood_choices( + context="What is 2+2? Answer:", + choices=["A", "B", "C", "D"], + ) + best_idx = logprobs.index(max(logprobs)) + ``` +""" + +from __future__ import annotations + +from typing import Any, Dict, List, Optional, Tuple + +from maseval.core.scorer import ModelScorer + + +class HuggingFaceModelScorer(ModelScorer): + """Log-likelihood scorer backed by a HuggingFace causal language model. + + Loads the model lazily on first use. Supports: + + - Single-token optimisation: when all continuations map to a single token, + one forward pass scores every choice. + - Multi-token fallback: separate forward pass per continuation. + - ``loglikelihood_choices()`` override that picks the optimal path + automatically. + + The tokenisation strategy matches ``lm-evaluation-harness``: context and + continuation are encoded separately, then concatenated to handle + tokenisation-boundary effects correctly. + """ + + def __init__( + self, + model_id: str, + device: str = "cuda:0", + trust_remote_code: bool = True, + seed: Optional[int] = None, + ): + """Initialize HuggingFace model scorer. + + Args: + model_id: HuggingFace model identifier + (e.g. ``"meta-llama/Llama-2-7b-hf"``). + device: Torch device string (e.g. ``"cuda:0"``, ``"cpu"``). + trust_remote_code: Trust remote code when loading the model. + seed: Seed for deterministic scoring. 
+ """ + super().__init__(seed=seed) + self._model_id = model_id + self._device = device + self._trust_remote_code = trust_remote_code + self._model: Any = None + self._tokenizer: Any = None + + @property + def model_id(self) -> str: + return self._model_id + + # ------------------------------------------------------------------ + # Model loading + # ------------------------------------------------------------------ + + def _load_model(self) -> Tuple[Any, Any]: + """Lazy-load the model and tokenizer. + + Returns: + Tuple of (model, tokenizer). + """ + if self._model is None: + from transformers import AutoModelForCausalLM, AutoTokenizer + + self._tokenizer = AutoTokenizer.from_pretrained( + self._model_id, + trust_remote_code=self._trust_remote_code, + ) + self._tokenizer.padding_side = "left" + if self._tokenizer.pad_token is None: + self._tokenizer.pad_token = self._tokenizer.eos_token + + self._model = AutoModelForCausalLM.from_pretrained( + self._model_id, + trust_remote_code=self._trust_remote_code, + torch_dtype="auto", + ) + self._model = self._model.to(self._device) + self._model.eval() + + return self._model, self._tokenizer + + # ------------------------------------------------------------------ + # Tokenisation helpers (matches lm-evaluation-harness) + # ------------------------------------------------------------------ + + def _encode_pair(self, context: str, continuation: str) -> Tuple[List[int], List[int]]: + """Encode a context-continuation pair like lm-evaluation-harness. + + 1. Encode ``whole = context + continuation`` + 2. Encode ``context`` alone + 3. ``continuation_enc = whole[len(context_enc):]`` + + Args: + context: The context/prompt string. + continuation: The continuation string. + + Returns: + Tuple of (context_enc, continuation_enc) token lists. 
+ """ + _, tokenizer = self._load_model() + + n_spaces = len(context) - len(context.rstrip()) + if n_spaces > 0: + continuation = context[-n_spaces:] + continuation + context = context[:-n_spaces] + + whole_enc = tokenizer.encode(context + continuation, add_special_tokens=True) + context_enc = tokenizer.encode(context, add_special_tokens=True) + + continuation_enc = whole_enc[len(context_enc) :] + return context_enc, continuation_enc + + # ------------------------------------------------------------------ + # Core scoring + # ------------------------------------------------------------------ + + def _loglikelihood_impl(self, context: str, continuation: str) -> float: + """Score a single (context, continuation) pair. + + Uses ``_encode_pair`` for correct tokenisation, then computes the + sum of per-token log-probabilities over the continuation. + """ + import torch + + model, _ = self._load_model() + + context_enc, continuation_enc = self._encode_pair(context, continuation) + full_sequence = context_enc + continuation_enc + input_tokens = full_sequence[:-1] + + input_ids = torch.tensor([input_tokens], dtype=torch.long, device=self._device) + + with torch.no_grad(): + logits = model(input_ids).logits[0] + inplen = len(input_tokens) + contlen = len(continuation_enc) + selected = logits[inplen - contlen : inplen] + log_probs = torch.nn.functional.log_softmax(selected, dim=-1) + + total = 0.0 + for i, token_id in enumerate(continuation_enc): + total += log_probs[i, token_id].item() + + return total + + # ------------------------------------------------------------------ + # MCQ optimisation + # ------------------------------------------------------------------ + + def loglikelihood_choices( + self, + context: str, + choices: List[str], + delimiter: str = " ", + ) -> List[float]: + """Score multiple-choice continuations with shared-context optimisation. 
+ + When every ``delimiter + choice`` maps to a single continuation token, + all choices are scored in **one** forward pass. Otherwise falls back to + per-choice scoring via ``_loglikelihood_impl``. + + Args: + context: The question/prompt text. + choices: Answer choice strings (e.g. ``["A", "B", "C", "D"]``). + delimiter: String prepended to each choice (default ``" "``). + + Returns: + List of log-likelihoods, one per choice. + """ + model, _ = self._load_model() + + continuations = [f"{delimiter}{c}" for c in choices] + encoded_continuations = [self._encode_pair(context, cont) for cont in continuations] + + all_single_token = all(len(cont_enc) == 1 for _, cont_enc in encoded_continuations) + + if all_single_token: + return self._score_single_token(context, choices, delimiter, encoded_continuations) + + return [self._loglikelihood_impl(context, cont) for cont in continuations] + + def _score_single_token( + self, + context: str, + choices: List[str], + delimiter: str, + encoded_continuations: List[Tuple[List[int], List[int]]], + ) -> List[float]: + """One-forward-pass scoring for single-token continuations.""" + import torch + + model, _ = self._load_model() + + context_enc, first_cont_enc = encoded_continuations[0] + full_sequence = context_enc + first_cont_enc + input_tokens = full_sequence[:-1] + + input_ids = torch.tensor([input_tokens], dtype=torch.long, device=self._device) + + with torch.no_grad(): + logits = model(input_ids).logits[0] + inplen = len(input_tokens) + contlen = len(first_cont_enc) + selected_logits = logits[inplen - contlen : inplen] + log_probs = torch.nn.functional.log_softmax(selected_logits, dim=-1) + + logprobs: List[float] = [] + for _, cont_enc in encoded_continuations: + total = 0.0 + for i, token_id in enumerate(cont_enc): + total += log_probs[i, token_id].item() + logprobs.append(total) + + return logprobs + + # ------------------------------------------------------------------ + # Tracing + # 
------------------------------------------------------------------ + + def gather_config(self) -> Dict[str, Any]: + """Gather configuration including device and model settings. + + Returns: + Dictionary containing scorer configuration. + """ + return { + **super().gather_config(), + "device": self._device, + "trust_remote_code": self._trust_remote_code, + } From 079ef47fae7769bca32dc4bb1aee33134f49288b Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Thu, 12 Mar 2026 09:00:35 +0100 Subject: [PATCH 06/17] [Move DISCO queue to core]: - Replace all .get() calls on required fields by explicit dict lookup. --- maseval/__init__.py | 2 ++ maseval/benchmark/mmlu/mmlu.py | 55 +++++++++++++++++----------------- maseval/core/exceptions.py | 38 +++++++++++++++++++++++ 3 files changed, 67 insertions(+), 28 deletions(-) diff --git a/maseval/__init__.py b/maseval/__init__.py index 2aa5b927..bde2e121 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -49,6 +49,7 @@ UserError, UserExhaustedError, TaskTimeoutError, + get_with_assert, validate_argument_type, validate_required_arguments, validate_no_extra_arguments, @@ -106,6 +107,7 @@ "ChatResponse", "ModelScorer", # Exceptions and validation + "get_with_assert", "MASEvalError", "AgentError", "EnvironmentError", diff --git a/maseval/benchmark/mmlu/mmlu.py b/maseval/benchmark/mmlu/mmlu.py index ef895e65..79ff8ce3 100644 --- a/maseval/benchmark/mmlu/mmlu.py +++ b/maseval/benchmark/mmlu/mmlu.py @@ -84,14 +84,14 @@ def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]: Args: task_data: Must contain ``"query"`` (str) and ``"environment_data"`` - (dict with optional ``"choices"``, ``"full_prompt"``, ``"use_full_prompt"``). + (dict with ``"choices"``, ``"full_prompt"``, ``"use_full_prompt"``). 
""" env_data = task_data["environment_data"] return { "query": task_data["query"], - "choices": env_data.get("choices", DEFAULT_CHOICES), - "full_prompt": env_data.get("full_prompt", ""), - "use_full_prompt": env_data.get("use_full_prompt", False), + "choices": env_data["choices"], + "full_prompt": env_data["full_prompt"], + "use_full_prompt": env_data["use_full_prompt"], } def create_tools(self) -> Dict[str, Any]: @@ -137,7 +137,7 @@ def __init__( self.task = task self.environment = environment self.gold = task.evaluation_data["gold"] - self.choices = task.environment_data.get("choices", DEFAULT_CHOICES) + self.choices = task.environment_data["choices"] def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]: """Extract relevant traces for evaluation. @@ -175,11 +175,11 @@ def __call__(self, traces: Dict[str, Any], final_answer: Optional[str] = None) - "predicted": predicted, "gold": self.gold, "correct": correct, - "doc_id": self.task.metadata.get("doc_id"), + "doc_id": self.task.metadata["doc_id"], } # Extract logprobs from traces if available (for logprobs-based evaluation) - messages = traces.get("messages", []) + messages = traces["messages"] for msg in messages: if isinstance(msg, dict) and "logprobs" in msg: result["logprobs"] = msg["logprobs"] @@ -445,7 +445,7 @@ def precompute_all_logprobs_lmeval(self, tasks: Sequence[Task]) -> Dict[Any, Lis instance_map = {} # (doc_id, choice_idx) -> position in results for task in tasks: - doc_id = task.metadata.get("doc_id") + doc_id = task.metadata["doc_id"] # Get prompt from task - use full_prompt from environment_data if available if self.use_full_prompt and "full_prompt" in task.environment_data: prompt = task.environment_data["full_prompt"] @@ -471,7 +471,7 @@ def precompute_all_logprobs_lmeval(self, tasks: Sequence[Task]) -> Dict[Any, Lis # Map results back to doc_ids doc_logprobs = {} for task in tasks: - doc_id = task.metadata.get("doc_id") + doc_id = task.metadata["doc_id"] logprobs = [] for i in 
range(len(choices)): pos = instance_map[(doc_id, i)] @@ -498,20 +498,19 @@ def run_agents( which automatically picks single-token or multi-token scoring. """ prompt = environment.get_prompt() - choices = environment.state.get("choices", DEFAULT_CHOICES) - doc_id = task.metadata.get("doc_id") if task else None - - if hasattr(self, "_precomputed_logprobs") and doc_id is not None: - logprobs = self._precomputed_logprobs.get(doc_id) - if logprobs is not None: - best_idx = logprobs.index(max(logprobs)) - answer = choices[best_idx] - environment.state["logprobs"] = logprobs - environment.state["predicted_idx"] = best_idx - agent = agents[0] - agent._messages.append({"role": "user", "content": prompt}) - agent._messages.append({"role": "assistant", "content": answer, "logprobs": logprobs}) - return answer + choices = environment.state["choices"] + doc_id = task.metadata["doc_id"] + + if hasattr(self, "_precomputed_logprobs") and doc_id in self._precomputed_logprobs: + logprobs = self._precomputed_logprobs[doc_id] + best_idx = logprobs.index(max(logprobs)) + answer = choices[best_idx] + environment.state["logprobs"] = logprobs + environment.state["predicted_idx"] = best_idx + agent = agents[0] + agent._messages.append({"role": "user", "content": prompt}) + agent._messages.append({"role": "assistant", "content": answer, "logprobs": logprobs}) + return answer logprobs = self._scorer.loglikelihood_choices(prompt, choices, delimiter=TARGET_DELIMITER) @@ -677,14 +676,14 @@ def compute_benchmark_metrics(results: List[Dict[str, Any]]) -> Dict[str, Any]: acc_norm_sum = 0.0 for res in results: - if res.get("status") != STATUS_SUCCESS: + if res["status"] != STATUS_SUCCESS: continue - evals = res.get("eval") or [] + evals = res["eval"] or [] for entry in evals: - acc_sum += entry.get("acc", 0.0) - acc_norm_sum += entry.get("acc_norm", 0.0) - if entry.get("correct", False): + acc_sum += entry["acc"] + acc_norm_sum += entry["acc_norm"] + if entry["correct"]: correct_count += 1 return { 
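The fail-fast pattern this patch enforces — direct indexing instead of `.get()` with a silent default — is what the `get_with_assert` helper (added in `maseval/core/exceptions.py` below) generalises to nested lookups. A standalone sketch, with the helper reproduced here so the snippet runs on its own:

```python
def get_with_assert(container, key, error_msg=None):
    # Nested access: a list of keys walks the structure one level at a time.
    if isinstance(key, list):
        assert len(key) > 0
        value = get_with_assert(container, key[0], error_msg)
        if len(key) == 1:
            return value
        return get_with_assert(value, key[1:], error_msg)
    if key not in container:
        raise KeyError(error_msg or f'Required key "{key}" not in container: {container}')
    return container[key]

task = {"metadata": {"doc_id": 42}}

# Required field present: behaves like plain chained indexing.
assert get_with_assert(task, ["metadata", "doc_id"]) == 42

# Missing required field: raises immediately instead of returning a default,
# so a data bug surfaces at the lookup site rather than as a wrong metric later.
try:
    get_with_assert(task, ["metadata", "subject"])
except KeyError:
    pass
```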
diff --git a/maseval/core/exceptions.py b/maseval/core/exceptions.py index e4c8c0f1..b3e297c0 100644 --- a/maseval/core/exceptions.py +++ b/maseval/core/exceptions.py @@ -308,6 +308,44 @@ def __init__( # ============================================================================= +def get_with_assert(container: Any, key: Any, error_msg: Optional[str] = None) -> Any: + """Get a value from a container, raising ``KeyError`` if not found. + + Use instead of ``dict.get(key, default)`` when the key is **required**. + A missing key means a bug — not a case to paper over with a fallback. + + Supports nested access via a list of keys:: + + get_with_assert(task, ["metadata", "doc_id"]) + # equivalent to: task["metadata"]["doc_id"] but with a clear error + + Args: + container: Dictionary or other container supporting ``in`` and ``[]``. + key: Key to look up. Pass a list for nested access. + error_msg: Custom error message. If ``None``, a descriptive default + is generated. + + Returns: + The value at the given key. + + Raises: + KeyError: If the key is not found in the container. + """ + if isinstance(key, list): + assert len(key) > 0 + value = get_with_assert(container, key[0], error_msg) + if len(key) == 1: + return value + return get_with_assert(value, key[1:], error_msg) + + if key not in container: + if error_msg is None: + error_msg = f'Required key "{key}" not in container: {container}' + raise KeyError(error_msg) + + return container[key] + + def validate_argument_type( value: Any, expected_type: str, From e23b1df36cc468c7b9108add2afe32f72ed69a70 Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Thu, 12 Mar 2026 10:04:39 +0100 Subject: [PATCH 07/17] [Move DISCO queue to core] Remove dummy implementations and tighten data access in MMLU benchmark - Replace silent .get() fallbacks with direct dict access for required fields (choices, doc_id, gold, acc, etc.) 
so missing data fails fast - Add get_with_assert utility to maseval.core.exceptions for required key lookups with clear error messages - Remove _DummyCallable from DefaultMMLUBenchmark.get_model_adapter(); raise NotImplementedError instead since scoring uses HuggingFaceModelScorer - Restructure DefaultMMLUBenchmark.setup_agents() to use a scorer-backed adapter directly instead of routing through get_model_adapter() - Remove redundant MMLUBenchmark.setup_user() override (base class already returns None) - Remove ModelAgentAdapter (no consumers) from core, exports, and docs --- CHANGELOG.md | 3 +- docs/benchmark/mmlu.md | 19 +++++++- maseval/__init__.py | 3 +- maseval/benchmark/mmlu/mmlu.py | 86 +++++++++++++++++----------------- maseval/core/agent.py | 74 +---------------------------- 5 files changed, 63 insertions(+), 122 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index c6508428..bea3eedd 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -43,7 +43,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Added `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import DISCOQueue`. (PR: #34) - Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #PR_NUMBER_PLACEHOLDER) -- Added `ModelAgentAdapter` in `maseval.core.agent` — a generic adapter that wraps any `ModelAdapter` as an `AgentAdapter` for direct model evaluation (replaces benchmark-specific agent wrappers). 
(PR: #PR_NUMBER_PLACEHOLDER) - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24) - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24) - Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24) @@ -93,7 +92,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** -- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). `DefaultMMLUBenchmark` now delegates log-likelihood computation to `HuggingFaceModelScorer` and uses `ModelAgentAdapter` instead of the MMLU-specific `MMLUModelAgent`/`MMLUAgentAdapter` (removed). (PR: #34) +- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). 
`DefaultMMLUBenchmark` now delegates log-likelihood computation to `HuggingFaceModelScorer` and uses a scorer-backed adapter instead of the MMLU-specific `MMLUModelAgent`/`MMLUAgentAdapter` (removed). (PR: #34) - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26) - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge` - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr` diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md index 7514ad18..d2e58544 100644 --- a/docs/benchmark/mmlu.md +++ b/docs/benchmark/mmlu.md @@ -97,13 +97,28 @@ print(f"Evaluating {len(tasks)} anchor tasks") `MMLUBenchmark` is a framework-agnostic base class. To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`: ```python -from maseval import ModelAgentAdapter +from maseval import AgentAdapter +from maseval.core.history import MessageHistory from maseval.benchmark.mmlu import MMLUBenchmark +class MyAgentAdapter(AgentAdapter): + def __init__(self, model, name): + super().__init__(model, name) + self._messages = [] + + def _run_agent(self, query): + self._messages.append({"role": "user", "content": query}) + response = self.agent.generate(query) + self._messages.append({"role": "assistant", "content": response}) + return response + + def get_messages(self): + return MessageHistory(self._messages) + class MyMMLUBenchmark(MMLUBenchmark): def setup_agents(self, agent_data, environment, task, user, seed_generator): model = self.get_model_adapter(agent_data["model_id"]) - adapter = ModelAgentAdapter(model, name="mmlu_agent") + adapter = MyAgentAdapter(model, name="mmlu_agent") return [adapter], {"mmlu_agent": adapter} def get_model_adapter(self, model_id, **kwargs): diff --git 
a/maseval/__init__.py b/maseval/__init__.py index bde2e121..c6fa6cec 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -22,7 +22,7 @@ AdaptiveTaskQueue, ) from .core.environment import Environment -from .core.agent import AgentAdapter, ModelAgentAdapter +from .core.agent import AgentAdapter from .core.benchmark import Benchmark, TaskExecutionStatus from .core.callback_handler import CallbackHandler from .core.callback import BenchmarkCallback, EnvironmentCallback, AgentCallback @@ -65,7 +65,6 @@ # Core abstractions "Environment", "AgentAdapter", - "ModelAgentAdapter", "Benchmark", "TaskExecutionStatus", # Callbacks diff --git a/maseval/benchmark/mmlu/mmlu.py b/maseval/benchmark/mmlu/mmlu.py index 79ff8ce3..e59f41b8 100644 --- a/maseval/benchmark/mmlu/mmlu.py +++ b/maseval/benchmark/mmlu/mmlu.py @@ -44,11 +44,11 @@ Environment, Evaluator, ModelAdapter, - ModelAgentAdapter, Task, User, SeedGenerator, ) +from maseval.core.history import MessageHistory from maseval.core.task import SequentialTaskQueue @@ -67,6 +67,34 @@ STATUS_SUCCESS = "success" +# ============================================================================= +# Agent adapter for scorer-based evaluation +# ============================================================================= + + +class _ScorerBackedAdapter(AgentAdapter): + """Agent adapter for benchmarks that use scorer-based evaluation. + + This adapter is a message container for tracing — the benchmark's + ``run_agents()`` drives evaluation via a ``ModelScorer`` and records + results here. Calling ``agent.run()`` directly is an error because + there is no generation model behind this adapter. + """ + + def __init__(self, scorer: Any, name: str) -> None: + super().__init__(agent_instance=scorer, name=name) + self._messages: List[Dict[str, Any]] = [] + + def _run_agent(self, query: str) -> Any: + raise NotImplementedError( + f"{type(self).__name__} is backed by a ModelScorer, not a generation model. 
" + "Use benchmark.run_agents() instead of calling agent.run() directly." + ) + + def get_messages(self) -> MessageHistory: + return MessageHistory(self._messages) + + # ============================================================================= # Environment # ============================================================================= @@ -280,16 +308,6 @@ def setup_environment( } return MMLUEnvironment(task_data) - def setup_user( - self, - agent_data: Dict[str, Any], - environment: Environment, - task: Task, - seed_generator: SeedGenerator, - ) -> Optional[User]: - """MMLU doesn't use a user simulator.""" - return None - def setup_evaluators( self, environment: Environment, @@ -343,7 +361,7 @@ class DefaultMMLUBenchmark(MMLUBenchmark): 2. Efficient log-softmax computation 3. Proper left-padding for batch processing - Agents are created using the generic ``ModelAgentAdapter``. + Agents are created using a scorer-backed adapter (see ``_ScorerBackedAdapter``). """ def __init__( @@ -387,10 +405,13 @@ def setup_agents( user: Optional[User], seed_generator: SeedGenerator, ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]: - """Create model agent for MCQ evaluation. + """Create scorer-backed agent for MCQ evaluation. + + The returned adapter is a tracing container — actual evaluation is + driven by ``self._scorer`` in ``run_agents()``. Args: - agent_data: Agent config. Must contain ``"model_id"`` (str). + agent_data: Agent config (unused; model is set at ``__init__``). environment: MMLU environment. task: Current task. user: Unused. @@ -399,9 +420,7 @@ def setup_agents( Returns: Tuple of (agents_to_run, agents_dict). 
""" - model_id = agent_data["model_id"] - model = self.get_model_adapter(model_id, register_name=DEFAULT_MODEL_REGISTER_NAME) - adapter = ModelAgentAdapter(model, DEFAULT_AGENT_NAME) + adapter = _ScorerBackedAdapter(self._scorer, DEFAULT_AGENT_NAME) return [adapter], {DEFAULT_AGENT_NAME: adapter} def precompute_all_logprobs_lmeval(self, tasks: Sequence[Task]) -> Dict[Any, List[float]]: @@ -525,36 +544,17 @@ def run_agents( return answer def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter: - """Provide a HuggingFace ``ModelAdapter``. - - The returned adapter is a placeholder — actual evaluation uses - ``HuggingFaceModelScorer`` for log-likelihood scoring. The adapter - is required by the ``Benchmark`` contract for ``setup_agents()``. + """Not used — ``DefaultMMLUBenchmark`` uses ``HuggingFaceModelScorer``. - Args: - model_id: Model identifier (ignored, uses instance model_id). - **kwargs: Additional arguments (e.g., ``register_name``). - - Returns: - ``HuggingFacePipelineModelAdapter`` instance. + Raises: + NotImplementedError: Always. Use ``HuggingFaceModelScorer`` via + ``self._scorer`` for log-likelihood evaluation. """ - from maseval.interface.inference import HuggingFacePipelineModelAdapter - - class _DummyCallable: - def __call__(self, prompt: str, **kw: Any) -> str: - return "" - - adapter = HuggingFacePipelineModelAdapter( - model=_DummyCallable(), - model_id=self._model_id, + raise NotImplementedError( + "DefaultMMLUBenchmark uses HuggingFaceModelScorer for log-likelihood " + "evaluation, not a generation ModelAdapter. Access the scorer via self._scorer." 
) - register_name = kwargs.get("register_name") - if register_name: - self.register("models", register_name, adapter) - - return adapter - # ============================================================================= # Data Loading diff --git a/maseval/core/agent.py b/maseval/core/agent.py index e76a3ea6..1f0aeb9b 100644 --- a/maseval/core/agent.py +++ b/maseval/core/agent.py @@ -1,16 +1,13 @@ from __future__ import annotations from abc import ABC, abstractmethod -from typing import TYPE_CHECKING, List, Any, Optional, Dict +from typing import List, Any, Optional, Dict from .callback import AgentCallback from .history import MessageHistory from .tracing import TraceableMixin from .config import ConfigurableMixin -if TYPE_CHECKING: - from .model import ModelAdapter - class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin): """Wraps an agent from any framework to provide a standard interface. @@ -191,72 +188,3 @@ def gather_config(self) -> Dict[str, Any]: def __repr__(self): return f"AgentAdapter(name={self.name}, agent_type={type(self.agent).__name__})" - - -class ModelAgentAdapter(AgentAdapter): - """Wraps a ``ModelAdapter`` as an ``AgentAdapter`` for direct model evaluation. - - Use this when a benchmark needs to plug a model directly into the agent - slot without an agentic framework. The adapter forwards queries to - ``ModelAdapter.generate()`` and records the conversation for tracing. - - Example: - ```python - from maseval import ModelAgentAdapter - from maseval.interface.inference import LiteLLMModelAdapter - - model = LiteLLMModelAdapter(model_id="gpt-4") - agent = ModelAgentAdapter(model, name="evaluator") - result = agent.run("What is the capital of France?") - ``` - """ - - def __init__( - self, - model: ModelAdapter, - name: str, - callbacks: Optional[List[AgentCallback]] = None, - ): - """Initialize a model-backed agent adapter. - - Args: - model: ``ModelAdapter`` instance used for generation. - name: Agent name for tracing and identification. 
- callbacks: Optional agent callbacks. - """ - super().__init__(model, name, callbacks) - self._messages: List[Dict[str, Any]] = [] - - @property - def model(self) -> ModelAdapter: - """The underlying ``ModelAdapter``.""" - return self.agent - - def _run_agent(self, query: str) -> str: - """Generate a response by forwarding the query to the model. - - Args: - query: The prompt to send to the model. - - Returns: - The model's text response. - """ - self._messages.append({"role": "user", "content": query}) - response = self.agent.generate(query) - self._messages.append({"role": "assistant", "content": response}) - return response - - def get_messages(self) -> MessageHistory: - """Return the recorded conversation history.""" - return MessageHistory(self._messages) - - def gather_config(self) -> Dict[str, Any]: - """Gather configuration including model identifier. - - Returns: - Dictionary containing agent and model configuration. - """ - return { - **super().gather_config(), - "model_id": self.agent.model_id, - } From dd46f1a35a9fff535167cc7b896a3990cc41fe08 Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Fri, 13 Mar 2026 07:54:59 +0100 Subject: [PATCH 08/17] [Move DISCO queue to core]: - Update BENCHMARKS.md --- BENCHMARKS.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/BENCHMARKS.md b/BENCHMARKS.md index 0916ef69..0cc5473c 100644 --- a/BENCHMARKS.md +++ b/BENCHMARKS.md @@ -81,10 +81,12 @@ CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses ## 6. MMLU (Massive Multitask Language Understanding) (Beta) -MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks. +MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. 
The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks. > **Beta:** This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome! +> **Implemented:** A ready-to-use implementation is available via `DefaultMMLUBenchmark` with HuggingFace model support. Install with `pip install maseval[mmlu]`. See the [MMLU documentation](docs/benchmark/mmlu.md) for usage details. + ### Source and License - **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) From 3779e2e84f304d5895432ba1e6ef66659b6f3b7c Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Fri, 13 Mar 2026 10:18:48 +0100 Subject: [PATCH 09/17] [Move DISCO queue to core]: - Update links in mmlu.md --- docs/benchmark/mmlu.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md index d2e58544..348f7aaa 100644 --- a/docs/benchmark/mmlu.md +++ b/docs/benchmark/mmlu.md @@ -3,13 +3,13 @@ !!! warning "Beta" This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome! -The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2407.12890) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks. 
+The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2510.07959) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks. ## Overview [MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features: -- **Log-likelihood MCQ evaluation** matching lm-evaluation-harness methodology +- **Log-likelihood MCQ evaluation** matching [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) methodology - **Anchor-point task selection** via `DISCOQueue` for DISCO-style subset evaluation - **HuggingFace integration** with batched log-probability computation - **lm-eval compatibility** mode for exact numerical reproduction From bf4abbb6ab51b00f6ca88119d673d0d2a34776b8 Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Sat, 14 Mar 2026 08:36:39 +0100 Subject: [PATCH 10/17] [Move DISCO queue to core]: - Update mmlu and disco dependencies - Add installation guide to mmlu example --- examples/mmlu_benchmark/README.md | 14 ++++++++++++++ pyproject.toml | 13 +++++++------ 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/examples/mmlu_benchmark/README.md b/examples/mmlu_benchmark/README.md index 62c6bafc..0e90291b 100644 --- a/examples/mmlu_benchmark/README.md +++ b/examples/mmlu_benchmark/README.md @@ -2,6 +2,20 @@ Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO](https://arxiv.org/abs/2510.07959). 
+## Installation + +For basic MMLU evaluation: + +```bash +uv pip install .[mmlu] +``` + +For DISCO prediction (includes DISCO dependencies): + +```bash +uv pip install .[disco] +``` + ## Run without DISCO (full evaluation) From the project root: diff --git a/pyproject.toml b/pyproject.toml index c252adeb..59e7eb02 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -82,22 +82,22 @@ multiagentbench = [ ] tau2 = ["docstring-parser>=0.16", "addict>=2.4.0"] converse = [] -# HuggingFace model + tokenizer, default dataset download; numpy for example script and anchor-point loading; -# lm-eval for --use_lmeval_batching (exact lm-evaluation-harness reproduction); aiohttp required by lm_eval.models.api_models +# HuggingFace model + tokenizer, default dataset download; numpy for example script and anchor-point loading. +# For exact lm-evaluation-harness reproduction (--use_lmeval_batching), also install maseval[lm-eval]. mmlu = [ + "torch>=2.0.0", "transformers>=4.37.0", "numpy>=1.20.0", - "aiohttp>=3.9.0", - "lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main", ] -# LM Evaluation Harness (same as in mmlu; aiohttp required by lm_eval.models.api_models) +# LM Evaluation Harness — requires transformers 4.x (lm-eval uses APIs removed in 5.x) lm-eval = [ "aiohttp>=3.9.0", + "transformers>=4.37.0,<5.0.0", "lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main", ] -# DISCO prediction (for MMLU benchmark example) +# DISCO prediction (for MMLU benchmark example) — requires transformers 4.x via lm-eval disco = [ "aiohttp>=3.9.0", "click>=8.1.0", @@ -108,6 +108,7 @@ disco = [ "jsonlines>=4.0.0", "lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main", "matplotlib>=3.5.0", + "transformers>=4.37.0,<5.0.0", "scikit-learn>=1.7.2", "scipy>=1.11.0", "stnd @ git+https://github.com/arubique/stnd.git@0d23b52f7742c08b28be560d2d52d450fcd274b7", From f6a5885c8762a9d49540f7dcb64a0219fefb09af Mon Sep 17 00:00:00 2001 From: Alexander 
Rubinstein Date: Sat, 14 Mar 2026 08:50:12 +0100 Subject: [PATCH 11/17] [Move DISCO queue to core]: - Update DefaultMMLUBenchmark.run_agents to pass type checks. --- maseval/benchmark/mmlu/mmlu.py | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/maseval/benchmark/mmlu/mmlu.py b/maseval/benchmark/mmlu/mmlu.py index e59f41b8..1e778169 100644 --- a/maseval/benchmark/mmlu/mmlu.py +++ b/maseval/benchmark/mmlu/mmlu.py @@ -516,17 +516,18 @@ def run_agents( otherwise delegates to ``HuggingFaceModelScorer.loglikelihood_choices()`` which automatically picks single-token or multi-token scoring. """ - prompt = environment.get_prompt() - choices = environment.state["choices"] + mmlu_env = cast(MMLUEnvironment, environment) + prompt = mmlu_env.get_prompt() + choices = mmlu_env.state["choices"] doc_id = task.metadata["doc_id"] + agent = cast(_ScorerBackedAdapter, agents[0]) if hasattr(self, "_precomputed_logprobs") and doc_id in self._precomputed_logprobs: logprobs = self._precomputed_logprobs[doc_id] best_idx = logprobs.index(max(logprobs)) answer = choices[best_idx] - environment.state["logprobs"] = logprobs - environment.state["predicted_idx"] = best_idx - agent = agents[0] + mmlu_env.state["logprobs"] = logprobs + mmlu_env.state["predicted_idx"] = best_idx agent._messages.append({"role": "user", "content": prompt}) agent._messages.append({"role": "assistant", "content": answer, "logprobs": logprobs}) return answer @@ -535,10 +536,9 @@ def run_agents( best_idx = logprobs.index(max(logprobs)) answer = choices[best_idx] - environment.state["logprobs"] = logprobs - environment.state["predicted_idx"] = best_idx + mmlu_env.state["logprobs"] = logprobs + mmlu_env.state["predicted_idx"] = best_idx - agent = agents[0] agent._messages.append({"role": "user", "content": prompt}) agent._messages.append({"role": "assistant", "content": answer, "logprobs": logprobs}) return answer From 26931972ab9801d6789570a3f5cf5a3eb849a61e Mon Sep 17 00:00:00 2001 From: 
Alexander Rubinstein Date: Mon, 16 Mar 2026 08:50:56 +0100 Subject: [PATCH 12/17] [Move DISCO queue to core]: - Add tests for get_with_assert, ModelScorer, InformativeSubsetQueue, and DISCOQueue --- tests/test_core/test_exceptions.py | 47 +++++++ tests/test_core/test_queue.py | 106 ++++++++++++++++ tests/test_core/test_scorer.py | 191 +++++++++++++++++++++++++++++ 3 files changed, 344 insertions(+) create mode 100644 tests/test_core/test_scorer.py diff --git a/tests/test_core/test_exceptions.py b/tests/test_core/test_exceptions.py index 416ebb7e..1698fa61 100644 --- a/tests/test_core/test_exceptions.py +++ b/tests/test_core/test_exceptions.py @@ -14,6 +14,7 @@ AgentError, EnvironmentError, UserError, + get_with_assert, validate_argument_type, validate_required_arguments, validate_no_extra_arguments, @@ -370,6 +371,52 @@ def test_validate_arguments_from_schema_strict_mode(self): validate_arguments_from_schema({"name": "test", "extra": 1}, schema, strict=True) +@pytest.mark.core +class TestGetWithAssert: + """Tests for get_with_assert required-key lookup.""" + + def test_single_key_present(self): + """Returns value when key exists.""" + assert get_with_assert({"a": 1}, "a") == 1 + + def test_single_key_missing_raises_key_error(self): + """Raises KeyError with descriptive message when key is missing.""" + with pytest.raises(KeyError, match='Required key "x"'): + get_with_assert({"a": 1}, "x") + + def test_nested_key_access(self): + """Supports nested access via a list of keys.""" + data = {"level1": {"level2": {"level3": "value"}}} + assert get_with_assert(data, ["level1", "level2", "level3"]) == "value" + + def test_nested_key_missing_raises_key_error(self): + """Raises KeyError when a nested key is missing.""" + data = {"level1": {"level2": {}}} + with pytest.raises(KeyError): + get_with_assert(data, ["level1", "level2", "level3"]) + + def test_custom_error_message(self): + """Uses custom error message when provided.""" + with pytest.raises(KeyError, match="MMLU 
task missing query"): + get_with_assert({}, "query", error_msg="MMLU task missing query") + + def test_single_element_list_key(self): + """List with one key behaves like a single key.""" + assert get_with_assert({"a": 42}, ["a"]) == 42 + + def test_falsy_values_returned(self): + """Falsy values (0, empty string, False, None) are returned, not treated as missing.""" + assert get_with_assert({"k": 0}, "k") == 0 + assert get_with_assert({"k": ""}, "k") == "" + assert get_with_assert({"k": False}, "k") is False + assert get_with_assert({"k": None}, "k") is None + + def test_empty_key_list_raises(self): + """Empty key list triggers assertion error.""" + with pytest.raises(AssertionError): + get_with_assert({"a": 1}, []) + + class TestFilteringByErrorType: """Tests for filtering failed tasks by error type.""" diff --git a/tests/test_core/test_queue.py b/tests/test_core/test_queue.py index 9ffdd7d9..ace96588 100644 --- a/tests/test_core/test_queue.py +++ b/tests/test_core/test_queue.py @@ -15,6 +15,8 @@ AdaptiveTaskQueue, TaskQueue, BaseTaskQueue, + InformativeSubsetQueue, + DISCOQueue, ) @@ -212,6 +214,110 @@ def test_single_task(self): assert items[0].query == "Only one" +# ==================== InformativeSubsetQueue Tests ==================== + + +@pytest.mark.core +class TestInformativeSubsetQueue: + """Tests for InformativeSubsetQueue subset filtering.""" + + def test_filters_to_indices(self, simple_tasks): + """Only tasks at the given indices should be yielded.""" + queue = InformativeSubsetQueue(simple_tasks, indices=[0, 2]) + + queries = [task.query for task in queue] + + assert queries == ["Q1", "Q3"] + + def test_preserves_index_order(self): + """Tasks should be yielded in the order given by indices, not original order.""" + tasks = [Task(query=f"Q{i}") for i in range(5)] + queue = InformativeSubsetQueue(tasks, indices=[4, 1, 3]) + + queries = [task.query for task in queue] + + assert queries == ["Q4", "Q1", "Q3"] + + def test_none_indices_yields_all(self, 
simple_tasks): + """indices=None should yield all tasks in original order.""" + queue = InformativeSubsetQueue(simple_tasks, indices=None) + + queries = [task.query for task in queue] + + assert queries == ["Q1", "Q2", "Q3"] + + def test_stores_all_tasks(self, simple_tasks): + """_all_tasks should contain the full unfiltered list.""" + queue = InformativeSubsetQueue(simple_tasks, indices=[0]) + + assert len(queue._all_tasks) == 3 + assert len(queue) == 1 + + def test_out_of_range_indices_skipped(self): + """Indices not present in the task list should be silently skipped.""" + tasks = [Task(query="Q0"), Task(query="Q1")] + queue = InformativeSubsetQueue(tasks, indices=[0, 5, 99]) + + queries = [task.query for task in queue] + + assert queries == ["Q0"] + + def test_empty_indices(self, simple_tasks): + """Empty indices list should yield no tasks.""" + queue = InformativeSubsetQueue(simple_tasks, indices=[]) + + assert list(queue) == [] + assert len(queue) == 0 + + def test_is_subclass_of_sequential(self, simple_tasks): + """InformativeSubsetQueue should be a SequentialTaskQueue.""" + queue = InformativeSubsetQueue(simple_tasks) + assert isinstance(queue, SequentialTaskQueue) + + +# ==================== DISCOQueue Tests ==================== + + +@pytest.mark.core +class TestDISCOQueue: + """Tests for DISCOQueue diversity-based subset.""" + + def test_filters_to_anchor_points(self): + """Only tasks at anchor-point indices should be yielded.""" + tasks = [Task(query=f"Q{i}") for i in range(10)] + queue = DISCOQueue(tasks, anchor_points=[2, 5, 8]) + + queries = [task.query for task in queue] + + assert queries == ["Q2", "Q5", "Q8"] + + def test_none_anchor_points_yields_all(self, simple_tasks): + """anchor_points=None should yield all tasks.""" + queue = DISCOQueue(simple_tasks, anchor_points=None) + + assert len(list(queue)) == 3 + + def test_stores_anchor_points(self): + """_anchor_points should be accessible.""" + tasks = [Task(query=f"Q{i}") for i in range(5)] + 
anchor_pts = [0, 3, 4] + queue = DISCOQueue(tasks, anchor_points=anchor_pts) + + assert queue._anchor_points == [0, 3, 4] + + def test_is_subclass_of_informative_subset(self, simple_tasks): + """DISCOQueue should be an InformativeSubsetQueue.""" + queue = DISCOQueue(simple_tasks) + assert isinstance(queue, InformativeSubsetQueue) + + def test_len_matches_anchor_count(self): + """Queue length should match number of valid anchor points.""" + tasks = [Task(query=f"Q{i}") for i in range(10)] + queue = DISCOQueue(tasks, anchor_points=[1, 3, 7]) + + assert len(queue) == 3 + + # ==================== PriorityTaskQueue Tests ==================== diff --git a/tests/test_core/test_scorer.py b/tests/test_core/test_scorer.py new file mode 100644 index 00000000..1c1570d0 --- /dev/null +++ b/tests/test_core/test_scorer.py @@ -0,0 +1,191 @@ +"""Tests for ModelScorer abstract base class. + +These tests verify that the ModelScorer ABC correctly delegates to +subclass implementations, handles logging/tracing, and provides +the expected batch and MCQ convenience methods. 
+""" + +import pytest +from typing import Dict, List, Optional, Tuple + +from maseval.core.scorer import ModelScorer + + +class StubScorer(ModelScorer): + """Minimal concrete scorer for testing the ABC contract.""" + + def __init__(self, scores: Dict[Tuple[str, str], float], seed: Optional[int] = None): + super().__init__(seed=seed) + self._scores = scores + self._call_log: List[Tuple[str, str]] = [] + + @property + def model_id(self) -> str: + return "stub-model" + + def _loglikelihood_impl(self, context: str, continuation: str) -> float: + self._call_log.append((context, continuation)) + return self._scores[(context, continuation)] + + +class FailingScorer(ModelScorer): + """Scorer that raises on every call, for error-path testing.""" + + @property + def model_id(self) -> str: + return "failing-model" + + def _loglikelihood_impl(self, context: str, continuation: str) -> float: + raise ValueError("model exploded") + + +pytestmark = pytest.mark.core + + +class TestModelScorerLoglikelihood: + """Tests for single-pair loglikelihood.""" + + def test_delegates_to_impl(self): + """loglikelihood() should delegate to _loglikelihood_impl().""" + scorer = StubScorer({("ctx", " cont"): -1.5}) + result = scorer.loglikelihood("ctx", " cont") + + assert result == -1.5 + assert scorer._call_log == [("ctx", " cont")] + + def test_logs_success(self): + """Successful call should be logged.""" + scorer = StubScorer({("a", "b"): -2.0}) + scorer.loglikelihood("a", "b") + + assert len(scorer.logs) == 1 + assert scorer.logs[0]["status"] == "success" + assert scorer.logs[0]["type"] == "loglikelihood" + assert scorer.logs[0]["duration_seconds"] >= 0 + + def test_logs_error_and_reraises(self): + """Failed call should be logged and the exception re-raised.""" + scorer = FailingScorer() + + with pytest.raises(ValueError, match="model exploded"): + scorer.loglikelihood("a", "b") + + assert len(scorer.logs) == 1 + assert scorer.logs[0]["status"] == "error" + assert scorer.logs[0]["error_type"] 
== "ValueError" + + +class TestModelScorerBatch: + """Tests for batch loglikelihood.""" + + def test_default_batch_loops_over_impl(self): + """Default _loglikelihood_batch_impl loops over _loglikelihood_impl.""" + scores = {("q", " A"): -1.0, ("q", " B"): -2.0, ("q", " C"): -0.5} + scorer = StubScorer(scores) + + results = scorer.loglikelihood_batch([("q", " A"), ("q", " B"), ("q", " C")]) + + assert results == [-1.0, -2.0, -0.5] + assert len(scorer._call_log) == 3 + + def test_batch_logs_single_entry(self): + """Batch call should produce one log entry (not per-pair).""" + scores = {("q", " A"): -1.0, ("q", " B"): -2.0} + scorer = StubScorer(scores) + + scorer.loglikelihood_batch([("q", " A"), ("q", " B")]) + + assert len(scorer.logs) == 1 + assert scorer.logs[0]["type"] == "loglikelihood_batch" + assert scorer.logs[0]["batch_size"] == 2 + + def test_empty_batch(self): + """Empty batch should return empty list.""" + scorer = StubScorer({}) + assert scorer.loglikelihood_batch([]) == [] + + +class TestModelScorerChoices: + """Tests for MCQ loglikelihood_choices.""" + + def test_prepends_delimiter(self): + """Choices should be prepended with the delimiter before scoring.""" + scores = {("Q?", " A"): -1.0, ("Q?", " B"): -0.5, ("Q?", " C"): -2.0} + scorer = StubScorer(scores) + + results = scorer.loglikelihood_choices("Q?", ["A", "B", "C"]) + + assert results == [-1.0, -0.5, -2.0] + assert scorer._call_log == [("Q?", " A"), ("Q?", " B"), ("Q?", " C")] + + def test_custom_delimiter(self): + """Custom delimiter should be used instead of default space.""" + scores = {("Q?", "\nA"): -1.0, ("Q?", "\nB"): -0.5} + scorer = StubScorer(scores) + + results = scorer.loglikelihood_choices("Q?", ["A", "B"], delimiter="\n") + + assert results == [-1.0, -0.5] + assert scorer._call_log == [("Q?", "\nA"), ("Q?", "\nB")] + + +class TestModelScorerTracing: + """Tests for gather_traces and gather_config.""" + + def test_gather_traces_includes_call_stats(self): + """Traces should contain 
call counts and timing.""" + scores = {("a", "b"): -1.0, ("c", "d"): -2.0} + scorer = StubScorer(scores) + scorer.loglikelihood("a", "b") + scorer.loglikelihood("c", "d") + + traces = scorer.gather_traces() + + assert traces["model_id"] == "stub-model" + assert traces["total_calls"] == 2 + assert traces["successful_calls"] == 2 + assert traces["failed_calls"] == 0 + assert traces["total_duration_seconds"] >= 0 + assert len(traces["logs"]) == 2 + + def test_gather_traces_counts_failures(self): + """Traces should correctly count failed calls.""" + scorer = FailingScorer() + with pytest.raises(ValueError): + scorer.loglikelihood("a", "b") + + traces = scorer.gather_traces() + + assert traces["total_calls"] == 1 + assert traces["successful_calls"] == 0 + assert traces["failed_calls"] == 1 + + def test_gather_config(self): + """Config should include model_id, scorer_type, and seed.""" + scorer = StubScorer({}, seed=42) + + config = scorer.gather_config() + + assert config["model_id"] == "stub-model" + assert config["scorer_type"] == "StubScorer" + assert config["seed"] == 42 + + def test_gather_config_seed_none(self): + """Config should report None seed when unseeded.""" + scorer = StubScorer({}) + + config = scorer.gather_config() + + assert config["seed"] is None + + +class TestModelScorerSeed: + """Tests for seed property.""" + + def test_seed_stored(self): + scorer = StubScorer({}, seed=123) + assert scorer.seed == 123 + + def test_seed_default_none(self): + scorer = StubScorer({}) + assert scorer.seed is None From afd2cf95fb70ae0deea9bf4cce0e1023d5616b4c Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Mon, 16 Mar 2026 08:58:05 +0100 Subject: [PATCH 13/17] [Move DISCO queue to core]: - Move load_anchor_points to DISCOQueue --- maseval/benchmark/mmlu/mmlu.py | 43 +--------------- maseval/core/task.py | 61 +++++++++++++++++++++-- tests/test_core/test_queue.py | 90 ++++++++++++++++++++++++++++++++++ 3 files changed, 147 insertions(+), 47 deletions(-) diff 
--git a/maseval/benchmark/mmlu/mmlu.py b/maseval/benchmark/mmlu/mmlu.py index 1e778169..d00fe5c0 100644 --- a/maseval/benchmark/mmlu/mmlu.py +++ b/maseval/benchmark/mmlu/mmlu.py @@ -24,19 +24,9 @@ """ import json -import pickle from pathlib import Path from typing import Any, Dict, List, Optional, Sequence, Tuple, Union, cast -# numpy is optional - only needed for anchor points processing -try: - import numpy as np - - HAS_NUMPY = True -except ImportError: - np = None # type: ignore[assignment] - HAS_NUMPY = False - from maseval import ( AgentAdapter, DISCOQueue, @@ -561,29 +551,6 @@ def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter: # ============================================================================= -def load_pickle(path: Union[str, Path]) -> Any: - """Load a pickle file.""" - with open(path, "rb") as f: - return pickle.load(f) - - -def load_anchor_points(path: Union[str, Path]) -> List[int]: - """Load anchor points from a .json or .pkl file. Returns a list of doc_ids.""" - path = Path(path) - if not path.exists(): - raise FileNotFoundError(f"Anchor points file not found: {path}") - if path.suffix.lower() == ".json": - with open(path) as f: - anchor_points = json.load(f) - else: - anchor_points = load_pickle(path) - if HAS_NUMPY and isinstance(anchor_points, np.ndarray): - anchor_points = anchor_points.tolist() - elif not HAS_NUMPY and hasattr(anchor_points, "tolist"): - anchor_points = anchor_points.tolist() - return list(anchor_points) - - def load_tasks( data_path: Union[str, Path], anchor_points_path: Optional[Union[str, Path]] = None, @@ -601,8 +568,6 @@ def load_tasks( Returns: TaskQueue containing MMLU tasks. - Raises: - ImportError: If anchor_points_path is provided but numpy is not installed. 
""" data_path = Path(data_path) @@ -642,14 +607,8 @@ def load_tasks( ) tasks.append(task) - # Load anchor points if provided - anchor_points = None if anchor_points_path is not None: - anchor_points = load_anchor_points(anchor_points_path) - - # Create appropriate queue - if anchor_points is not None: - return DISCOQueue(tasks, anchor_points) + return DISCOQueue(tasks, anchor_points_path=anchor_points_path) else: return SequentialTaskQueue(tasks) diff --git a/maseval/core/task.py b/maseval/core/task.py index 22ec5e0f..07a3af9b 100644 --- a/maseval/core/task.py +++ b/maseval/core/task.py @@ -5,6 +5,7 @@ from collections.abc import Sequence from typing import Iterable, List, Union, Iterator, Optional import json +import pickle from pathlib import Path from enum import Enum @@ -339,25 +340,75 @@ class DISCOQueue(InformativeSubsetQueue): Example: ```python queue = DISCOQueue(tasks, anchor_points=[0, 5, 12]) + # or load from file: + queue = DISCOQueue(tasks, anchor_points_path="anchor_points.pkl") for task in queue: - result = execute(task) # Only 3 tasks + result = execute(task) # Only anchor-point tasks ``` """ - def __init__(self, tasks: Iterable[Task], anchor_points: Optional[List[int]] = None) -> None: + def __init__( + self, + tasks: Iterable[Task], + anchor_points: Optional[List[int]] = None, + anchor_points_path: Optional[Union[str, Path]] = None, + ) -> None: """Initialize DISCO task queue. + Anchor points can be supplied directly via ``anchor_points`` or loaded + from a file via ``anchor_points_path``. Providing both is an error. + Args: tasks: Full list of tasks (ordered by index). anchor_points: Diversity-selected indices into ``tasks``. - Typically loaded from a DISCO anchor-points file or - downloaded from a HuggingFace DISCO model repo. - If ``None``, evaluates all tasks in order. + Typically downloaded from a HuggingFace DISCO model repo. + If ``None`` and ``anchor_points_path`` is also ``None``, + evaluates all tasks in order. 
+ anchor_points_path: Path to a ``.json`` or ``.pkl`` file + containing anchor-point indices. Mutually exclusive with + ``anchor_points``. """ + if anchor_points is not None and anchor_points_path is not None: + raise ValueError("Provide either anchor_points or anchor_points_path, not both.") + + if anchor_points_path is not None: + anchor_points = self.load_anchor_points(anchor_points_path) + self._anchor_points: Optional[List[int]] = anchor_points super().__init__(tasks, indices=anchor_points) + @staticmethod + def load_anchor_points(path: Union[str, Path]) -> List[int]: + """Load anchor points from a ``.json`` or ``.pkl`` file. + + Args: + path: Path to anchor points file. JSON files should contain a + list of integer indices. Pickle files may contain a list or + a numpy array. + + Returns: + List of integer anchor-point indices. + + Raises: + FileNotFoundError: If the file does not exist. + """ + path = Path(path) + if not path.exists(): + raise FileNotFoundError(f"Anchor points file not found: {path}") + + if path.suffix.lower() == ".json": + with open(path) as f: + anchor_points = json.load(f) + else: + with open(path, "rb") as f: + anchor_points = pickle.load(f) + + if hasattr(anchor_points, "tolist"): + anchor_points = anchor_points.tolist() + + return list(anchor_points) + class PriorityTaskQueue(BaseTaskQueue): """Execute tasks ordered by priority. 
diff --git a/tests/test_core/test_queue.py b/tests/test_core/test_queue.py index ace96588..35bf1933 100644 --- a/tests/test_core/test_queue.py +++ b/tests/test_core/test_queue.py @@ -20,6 +20,16 @@ ) +class _FakeArray: + """Pickle-serializable array-like for testing .tolist() conversion.""" + + def tolist(self): + return [1, 2, 3] + + def __iter__(self): + return iter([1, 2, 3]) + + # ==================== Fixtures ==================== @@ -318,6 +328,86 @@ def test_len_matches_anchor_count(self): assert len(queue) == 3 +@pytest.mark.core +class TestDISCOQueueLoadAnchorPoints: + """Tests for DISCOQueue.load_anchor_points static method.""" + + def test_load_from_json(self, tmp_path): + """Should load anchor points from a JSON file.""" + import json + + path = tmp_path / "anchors.json" + path.write_text(json.dumps([0, 5, 12, 99])) + + result = DISCOQueue.load_anchor_points(path) + + assert result == [0, 5, 12, 99] + + def test_load_from_pickle(self, tmp_path): + """Should load anchor points from a pickle file.""" + import pickle + + path = tmp_path / "anchors.pkl" + with open(path, "wb") as f: + pickle.dump([2, 7, 15], f) + + result = DISCOQueue.load_anchor_points(path) + + assert result == [2, 7, 15] + + def test_load_converts_tolist(self, tmp_path): + """Should call .tolist() on array-like objects (e.g. 
numpy arrays).""" + import pickle + + path = tmp_path / "anchors.pkl" + with open(path, "wb") as f: + pickle.dump(_FakeArray(), f) + + result = DISCOQueue.load_anchor_points(path) + + assert result == [1, 2, 3] + + def test_file_not_found(self, tmp_path): + """Should raise FileNotFoundError for missing files.""" + with pytest.raises(FileNotFoundError, match="not found"): + DISCOQueue.load_anchor_points(tmp_path / "nonexistent.json") + + def test_accepts_string_path(self, tmp_path): + """Should accept a string path, not just Path objects.""" + import json + + path = tmp_path / "anchors.json" + path.write_text(json.dumps([10, 20])) + + result = DISCOQueue.load_anchor_points(str(path)) + + assert result == [10, 20] + + def test_init_with_anchor_points_path(self, tmp_path): + """DISCOQueue should load anchor points from file when anchor_points_path is given.""" + import json + + tasks = [Task(query=f"Q{i}") for i in range(10)] + path = tmp_path / "anchors.json" + path.write_text(json.dumps([2, 5, 8])) + + queue = DISCOQueue(tasks, anchor_points_path=path) + + assert len(queue) == 3 + assert queue._anchor_points == [2, 5, 8] + + def test_init_rejects_both_anchor_args(self, tmp_path): + """DISCOQueue should raise ValueError when both anchor_points and anchor_points_path are given.""" + import json + + tasks = [Task(query=f"Q{i}") for i in range(5)] + path = tmp_path / "anchors.json" + path.write_text(json.dumps([0, 1])) + + with pytest.raises(ValueError, match="not both"): + DISCOQueue(tasks, anchor_points=[0, 1], anchor_points_path=path) + + # ==================== PriorityTaskQueue Tests ==================== From 72018322928c907c22b2efd271f4be6499cd073a Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Mon, 16 Mar 2026 09:12:29 +0100 Subject: [PATCH 14/17] [Move DISCO queue to core]: Update docs to reflect MMLU, scorer, and queue changes - Add missing documentation for new core components introduced alongside the MMLU benchmark: ModelScorer reference page, 
InformativeSubsetQueue/DISCOQueue in task reference, get_with_assert in exceptions reference, and HuggingFacePipelineModelAdapter rename in model/HuggingFace pages. - Add mmlu extra to README install section. - Fix grammar in MMLU docs and fill CHANGELOG PR placeholders. --- CHANGELOG.md | 9 +++++---- README.md | 7 +++++++ docs/benchmark/mmlu.md | 2 +- docs/interface/inference/huggingface.md | 17 ++++++++++++++--- docs/reference/exceptions.md | 4 ++++ docs/reference/model.md | 2 +- docs/reference/scorer.md | 19 +++++++++++++++++++ docs/reference/task.md | 22 +++++++++++++++------- mkdocs.yml | 1 + 9 files changed, 67 insertions(+), 16 deletions(-) create mode 100644 docs/reference/scorer.md diff --git a/CHANGELOG.md b/CHANGELOG.md index bea3eedd..e4b63450 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -41,8 +41,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Core** -- Added `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import DISCOQueue`. (PR: #34) -- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #PR_NUMBER_PLACEHOLDER) +- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34) +- Added `get_with_assert()` utility in `maseval.core.exceptions` for strict dictionary access that raises `KeyError` instead of silently returning a default. Supports nested key lookups. 
(PR: #34) +- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #34) - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24) - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24) - Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24) @@ -53,8 +54,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Interface** -- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #PR_NUMBER_PLACEHOLDER) -- Renamed `HuggingFaceModelAdapter` → `HuggingFacePipelineModelAdapter` to distinguish it from the new scorer. The old name remains as a backwards-compatible alias. (PR: #PR_NUMBER_PLACEHOLDER) +- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34) +- Renamed `HuggingFaceModelAdapter` → `HuggingFacePipelineModelAdapter` to distinguish it from the new scorer. The old name remains as a backwards-compatible alias. 
(PR: #34) - CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22) - Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22) diff --git a/README.md b/README.md index dea369c6..9f71751a 100644 --- a/README.md +++ b/README.md @@ -109,6 +109,13 @@ pip install "maseval[langgraph]" pip install "maseval[llamaindex]" ``` +Or install benchmark-specific dependencies: + +```bash +# MMLU (HuggingFace models) +pip install "maseval[mmlu]" +``` + ## Example Examples are available in the [Documentation](https://maseval.readthedocs.io/en/stable/). diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md index 348f7aaa..1b5d412b 100644 --- a/docs/benchmark/mmlu.md +++ b/docs/benchmark/mmlu.md @@ -88,7 +88,7 @@ tasks = load_tasks( anchor_points_path="/path/to/anchor_points.json", ) -# tasks is an DISCOQueue — only anchor tasks are evaluated +# tasks is a DISCOQueue — only anchor tasks are evaluated print(f"Evaluating {len(tasks)} anchor tasks") ``` diff --git a/docs/interface/inference/huggingface.md b/docs/interface/inference/huggingface.md index 00a424a4..28814b60 100644 --- a/docs/interface/inference/huggingface.md +++ b/docs/interface/inference/huggingface.md @@ -1,7 +1,18 @@ -# HuggingFace Inference Adapter +# HuggingFace Inference Adapters -This page documents the HuggingFace model adapter for MASEval. +This page documents the HuggingFace model adapters for MASEval. + +## Pipeline Model Adapter (Text Generation) [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface.py){ .md-source-file } -::: maseval.interface.inference.huggingface.HuggingFaceModelAdapter +::: maseval.interface.inference.huggingface.HuggingFacePipelineModelAdapter + +!!! note + `HuggingFaceModelAdapter` is a backwards-compatible alias for `HuggingFacePipelineModelAdapter`. 
+ +## Model Scorer (Log-Likelihood) + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface_scorer.py){ .md-source-file } + +::: maseval.interface.inference.huggingface_scorer.HuggingFaceModelScorer diff --git a/docs/reference/exceptions.md b/docs/reference/exceptions.md index ef96f9dc..99cf2c3e 100644 --- a/docs/reference/exceptions.md +++ b/docs/reference/exceptions.md @@ -38,6 +38,10 @@ SimulatorError (base for simulators) ::: maseval.core.simulator.UserSimulatorError +## Data Access Helpers + +::: maseval.core.exceptions.get_with_assert + ## Validation Helpers These functions simplify input validation and raise `AgentError` with helpful suggestions: diff --git a/docs/reference/model.md b/docs/reference/model.md index 1569d939..f0029c0d 100644 --- a/docs/reference/model.md +++ b/docs/reference/model.md @@ -20,7 +20,7 @@ The following adapter classes implement the ModelAdapter interface for specific [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface.py){ .md-source-file } -::: maseval.interface.inference.huggingface.HuggingFaceModelAdapter +::: maseval.interface.inference.huggingface.HuggingFacePipelineModelAdapter [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/google_genai.py){ .md-source-file } diff --git a/docs/reference/scorer.md b/docs/reference/scorer.md new file mode 100644 index 00000000..cf2eddd4 --- /dev/null +++ b/docs/reference/scorer.md @@ -0,0 +1,19 @@ +# Model Scorers + +Model Scorers provide a uniform interface for log-likelihood computation across model providers. Unlike `ModelAdapter` (which handles text generation and chat), scorers evaluate how likely a model considers a given continuation given some context. + +!!! note + + `ModelScorer` is the scoring counterpart to `ModelAdapter`. 
Use it when you need log-likelihood evaluation (e.g., multiple-choice benchmarks) rather than text generation. + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/scorer.py){ .md-source-file } + +::: maseval.core.scorer.ModelScorer + +## Interfaces + +The following scorer classes implement the ModelScorer interface for specific providers. + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface_scorer.py){ .md-source-file } + +::: maseval.interface.inference.huggingface_scorer.HuggingFaceModelScorer diff --git a/docs/reference/task.md b/docs/reference/task.md index b70ef13f..ad3087d6 100644 --- a/docs/reference/task.md +++ b/docs/reference/task.md @@ -2,15 +2,15 @@ Tasks define individual benchmark scenarios including inputs, expected outputs, and metadata for evaluation. Task queues control execution order and scheduling strategy. -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L55){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L56){ .md-source-file } ::: maseval.core.task.Task -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L27){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L28){ .md-source-file } ::: maseval.core.task.TaskProtocol -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L18){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L19){ .md-source-file } ::: maseval.core.task.TimeoutAction @@ -18,18 +18,26 @@ Tasks define individual benchmark scenarios including inputs, expected outputs, Task queues determine the order in which tasks are executed. 
Pass a queue to `Benchmark.run(queue=...)` to customize scheduling. -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L86){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L87){ .md-source-file } ::: maseval.core.task.BaseTaskQueue -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L256){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L257){ .md-source-file } ::: maseval.core.task.SequentialTaskQueue -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L276){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L277){ .md-source-file } + +::: maseval.core.task.InformativeSubsetQueue + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L325){ .md-source-file } + +::: maseval.core.task.DISCOQueue + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L413){ .md-source-file } ::: maseval.core.task.PriorityTaskQueue -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L322){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L459){ .md-source-file } ::: maseval.core.task.AdaptiveTaskQueue diff --git a/mkdocs.yml b/mkdocs.yml index 153215e9..dec8cc1e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -110,6 +110,7 @@ nav: - Exceptions: reference/exceptions.md - History: reference/history.md - Model: reference/model.md + - Scorer: reference/scorer.md - Seeding: reference/seeding.md - Simulator: reference/simulator.md - Tasks: reference/task.md From 
e7d15a86c98a14dd508dab1eb3c7ab67496fc244 Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Mon, 16 Mar 2026 09:15:34 +0100 Subject: [PATCH 15/17] [Move DISCO queue to core]: - Fix SmolAgents docs to make mkdocs build --strict pass --- docs/reference/environment.md | 10 ++++------ docs/reference/user.md | 23 +++++------------------ 2 files changed, 9 insertions(+), 24 deletions(-) diff --git a/docs/reference/environment.md b/docs/reference/environment.md index 77d40e30..7d65e9f1 100644 --- a/docs/reference/environment.md +++ b/docs/reference/environment.md @@ -8,10 +8,8 @@ Environments define the execution context for agents, including available tools, ## Tools and agent-provided helpers -Some agent adapters expose helper tools or user-simulation tools that can be used by the Environment. For example: +Some agent adapters expose helper tools or user-simulation tools that can be used by the Environment. See the framework-specific interface pages for details: -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/smolagents.py){ .md-source-file } - -::: maseval.interface.agents.smolagents.SmolAgentAdapter - -::: maseval.interface.agents.smolagents.SmolAgentLLMUser +- [SmolAgents](../interface/agents/smolagents.md) — `SmolAgentAdapter`, `SmolAgentLLMUser` +- [LangGraph](../interface/agents/langgraph.md) — `LangGraphAgentAdapter` +- [LlamaIndex](../interface/agents/llamaindex.md) — `LlamaIndexAgentAdapter` diff --git a/docs/reference/user.md b/docs/reference/user.md index c739ad25..c3cd1af8 100644 --- a/docs/reference/user.md +++ b/docs/reference/user.md @@ -14,22 +14,9 @@ The `LLMUser` is initialized with a persona and a scenario, both of which are ty ## Interfaces -Some integrations provide convenience user/tool implementations for specific agent frameworks. For example: +Some integrations provide convenience user implementations for specific agent frameworks. 
See the framework-specific interface pages for details: -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/smolagents.py){ .md-source-file } - -::: maseval.interface.agents.smolagents.SmolAgentLLMUser - -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/langgraph.py){ .md-source-file } - -::: maseval.interface.agents.langgraph.LangGraphLLMUser - -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/llamaindex.py){ .md-source-file } - -::: maseval.interface.agents.llamaindex.LlamaIndexLLMUser - -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/camel.py){ .md-source-file } - -::: maseval.interface.agents.camel.CamelLLMUser - -::: maseval.interface.agents.camel.CamelAgentUser +- [SmolAgents](../interface/agents/smolagents.md) — `SmolAgentLLMUser` +- [LangGraph](../interface/agents/langgraph.md) — `LangGraphLLMUser` +- [LlamaIndex](../interface/agents/llamaindex.md) — `LlamaIndexLLMUser` +- [CAMEL-AI](../interface/agents/camel.md) — `CamelLLMUser`, `CamelAgentUser` From 3aa675e3c97cf06f7d91737296b3fabbfedf660c Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Mon, 16 Mar 2026 09:22:36 +0100 Subject: [PATCH 16/17] [Move DISCO queue to core]: - Fix DISCO references. 
--- BENCHMARKS.md | 2 +- maseval/core/task.py | 7 +++---- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/BENCHMARKS.md b/BENCHMARKS.md index 0cc5473c..4cb9f74c 100644 --- a/BENCHMARKS.md +++ b/BENCHMARKS.md @@ -90,7 +90,7 @@ MMLU evaluates language models on multiple-choice questions spanning 57 academic ### Source and License - **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) -- **DISCO Paper:** [DISCO: DISCOvering key features for accurate prediction of LLM abilities on benchmarks](https://arxiv.org/abs/2407.12890) (Rubinstein et al., 2025) +- **DISCO Paper:** [DISCO: Diversifying Sample Condensation for Efficient Model Evaluation](https://arxiv.org/abs/2510.07959) (Rubinstein et al., ICLR 2026) - **Dataset:** [arubique/flattened-MMLU](https://huggingface.co/datasets/arubique/flattened-MMLU) --- diff --git a/maseval/core/task.py b/maseval/core/task.py index 07a3af9b..9a7b3aca 100644 --- a/maseval/core/task.py +++ b/maseval/core/task.py @@ -327,15 +327,14 @@ class DISCOQueue(InformativeSubsetQueue): Selects a diverse subset of tasks (anchor points) for evaluation. Full benchmark performance is then predicted from results on this subset using - DISCO (DISCOvering key features for accurate prediction of LLM abilities - on benchmarks). + DISCO (Diversifying Sample Condensation for Efficient Model Evaluation). The informativeness criterion is **diversity**: anchor points are chosen to maximise disagreement across models, so that a small evaluation set captures the discriminative structure of the full benchmark. 
- Reference: `DISCO: DISCOvering key features for accurate prediction of - LLM abilities on benchmarks <https://arxiv.org/abs/2407.12890>`_ + Reference: `DISCO: Diversifying Sample Condensation for Efficient Model + Evaluation <https://arxiv.org/abs/2510.07959>`_ Example: ```python From 6f5b0e22e6ab514d84ae853439f45980866ac2e3 Mon Sep 17 00:00:00 2001 From: Alexander Rubinstein Date: Mon, 16 Mar 2026 09:27:11 +0100 Subject: [PATCH 17/17] Add benchmark/index.md to mkdocs.yml to fix warning during docs building --- mkdocs.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/mkdocs.yml b/mkdocs.yml index dec8cc1e..4ba742bf 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -129,6 +129,7 @@ nav: - LiteLLM: interface/inference/litellm.md - OpenAI: interface/inference/openai.md - Benchmarks: + - Overview: benchmark/index.md - ConVerse: benchmark/converse.md - GAIA2: benchmark/gaia2.md - MACS: benchmark/macs.md
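The `DISCOQueue` docstring above describes its informativeness criterion as diversity: anchor points are chosen to maximise disagreement across models. That idea can be sketched as a toy in a few lines. This is an illustration only, not the maseval/DISCO implementation; `pick_anchor_points` and the per-task variance score are hypothetical names chosen for exposition.

```python
# Toy sketch of the anchor-point idea behind DISCOQueue: given a
# models x tasks matrix of 0/1 correctness, keep the k tasks on which
# models disagree most (highest across-model variance).
# Illustrative only -- not the maseval/DISCO implementation.

def pick_anchor_points(correctness, k):
    """correctness: one row of 0/1 results per model; returns k task indices."""
    n_models = len(correctness)
    n_tasks = len(correctness[0])

    def disagreement(task_idx):
        col = [row[task_idx] for row in correctness]
        p = sum(col) / n_models   # fraction of models that solve the task
        return p * (1 - p)        # Bernoulli variance: maximal at p = 0.5

    ranked = sorted(range(n_tasks), key=disagreement, reverse=True)
    return sorted(ranked[:k])

# Three models, five tasks: every model solves tasks 0 and 4, none solves
# task 2, while tasks 1 and 3 split the models -- so those are the anchors.
matrix = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]
anchors = pick_anchor_points(matrix, 2)  # [1, 3]
```

The real queue additionally predicts full-benchmark performance from the anchor results; the sketch covers only the selection step.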