diff --git a/BENCHMARKS.md b/BENCHMARKS.md index fcbde7d3..4cb9f74c 100644 --- a/BENCHMARKS.md +++ b/BENCHMARKS.md @@ -79,7 +79,23 @@ CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses --- -## 6. [Name of Next Benchmark] +## 6. MMLU (Massive Multitask Language Understanding) (Beta) + +MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks. + +> **Beta:** This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome! + +> **Implemented:** A ready-to-use implementation is available via `DefaultMMLUBenchmark` with HuggingFace model support. Install with `pip install maseval[mmlu]`. See the [MMLU documentation](docs/benchmark/mmlu.md) for usage details. + +### Source and License + +- **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) +- **DISCO Paper:** [DISCO: Diversifying Sample Condensation for Efficient Model Evaluation](https://arxiv.org/abs/2510.07959) (Rubinstein et al., ICLR 2026) +- **Dataset:** [arubique/flattened-MMLU](https://huggingface.co/datasets/arubique/flattened-MMLU) + +--- + +## 7. [Name of Next Benchmark] (Description for the next benchmark...) 
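The anchor-point evaluation described in the BENCHMARKS.md section above can be illustrated with a minimal sketch. This is not the MASEval or DISCO API — `subset_accuracy`, the toy `results` dict, and the `anchors` list are all hypothetical. DISCO additionally learns which tasks to select and how to map anchor results to a full-benchmark prediction; this shows only the naive mean-over-anchors baseline for intuition.

```python
# Hypothetical illustration (not the MASEval/DISCO API): estimating
# full-benchmark accuracy from a small anchor-point subset. DISCO learns
# both the anchor selection and the mapping to a full-set prediction;
# this is only the naive mean-over-anchors baseline.
from typing import Dict, List


def subset_accuracy(results: Dict[int, bool], anchor_points: List[int]) -> float:
    """Mean accuracy over the anchor tasks only."""
    hits = [results[doc_id] for doc_id in anchor_points]
    return sum(hits) / len(hits)


# Toy per-task correctness keyed by doc_id.
results = {0: True, 1: False, 2: True, 3: True, 4: False, 5: True}
anchors = [1, 3, 5]  # evaluate 3 tasks instead of all 6
print(f"Estimated accuracy: {subset_accuracy(results, anchors):.3f}")
```

In practice the anchor indices come from a precomputed file (the `anchor_points_path` argument shown in the diff below), not a hand-picked list.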
diff --git a/CHANGELOG.md b/CHANGELOG.md index c3f11572..e4b63450 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** -- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34) +- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34) - CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. 
(PR: #28) @@ -35,21 +35,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Examples** - MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34) +- MMLU benchmark documentation at `docs/benchmark/mmlu.md` with installation, quick start, and API reference. (PR: #34) - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28) - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26) **Core** +- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34) +- Added `get_with_assert()` utility in `maseval.core.exceptions` for strict dictionary access that raises `KeyError` instead of silently returning a default. Supports nested key lookups. (PR: #34) +- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. 
(PR: #34) - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24) - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24) - Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24) - Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24) - Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24) -- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24) +- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFacePipelineModelAdapter` pass seeds to underlying APIs (PR: #24) - Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39) **Interface** +- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34) +- Renamed `HuggingFaceModelAdapter` → `HuggingFacePipelineModelAdapter` to distinguish it from the new scorer. The old name remains as a backwards-compatible alias. 
(PR: #34) + - CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22) - Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22) - Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22) @@ -86,6 +93,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Benchmarks** +- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). `DefaultMMLUBenchmark` now delegates log-likelihood computation to `HuggingFaceModelScorer` and uses a scorer-backed adapter instead of the MMLU-specific `MMLUModelAgent`/`MMLUAgentAdapter` (removed). (PR: #34) - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. 
(PR: #26) - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge` - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr` diff --git a/README.md b/README.md index dea369c6..9f71751a 100644 --- a/README.md +++ b/README.md @@ -109,6 +109,13 @@ pip install "maseval[langgraph]" pip install "maseval[llamaindex]" ``` +Or install benchmark-specific dependencies: + +```bash +# MMLU (HuggingFace models) +pip install "maseval[mmlu]" +``` + ## Example Examples are available in the [Documentation](https://maseval.readthedocs.io/en/stable/). diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md new file mode 100644 index 00000000..1b5d412b --- /dev/null +++ b/docs/benchmark/mmlu.md @@ -0,0 +1,144 @@ +# MMLU: Massive Multitask Language Understanding (Beta) + +!!! warning "Beta" + This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome! + +The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2510.07959) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks. + +## Overview + +[MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. 
The MASEval implementation features: + +- **Log-likelihood MCQ evaluation** matching [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) methodology +- **Anchor-point task selection** via `DISCOQueue` for DISCO-style subset evaluation +- **HuggingFace integration** with batched log-probability computation +- **lm-eval compatibility** mode for exact numerical reproduction + +Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses. + +## Installation + +Install MMLU with all dependencies needed to run the HuggingFace benchmark and example script: + +```bash +pip install maseval[mmlu] +``` + +Or with uv: + +```bash +uv sync --extra mmlu +``` + +This installs `transformers`, `torch`, `numpy`, and `huggingface_hub` (the latter two via `transformers`). You can then run the example: + +```bash +python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full +``` + +For DISCO prediction support: + +```bash +pip install maseval[disco] +``` + +For exact lm-evaluation-harness reproduction: + +```bash +pip install maseval[lm-eval] +``` + +## Quick Start + +```python +from maseval.benchmark.mmlu import ( + DefaultMMLUBenchmark, + load_tasks, + compute_benchmark_metrics, +) + +# Load tasks (downloads from HuggingFace automatically) +tasks = load_tasks(data_path="/path/to/mmlu_prompts_examples.json") + +# Create benchmark with HuggingFace model +benchmark = DefaultMMLUBenchmark( + model_id="meta-llama/Llama-2-7b-hf", + device="cuda:0", +) + +# Run evaluation +results = benchmark.run( + tasks=tasks, + agent_data={"model_id": "meta-llama/Llama-2-7b-hf"}, +) + +# Compute metrics +metrics = compute_benchmark_metrics(results) +print(f"Accuracy: {metrics['acc']:.4f}") +``` + +### With Anchor Points (DISCO) + +```python +from maseval.benchmark.mmlu import load_tasks + +# Load tasks filtered to anchor points +tasks = load_tasks( + 
data_path="/path/to/mmlu_prompts_examples.json", + anchor_points_path="/path/to/anchor_points.json", +) + +# tasks is a DISCOQueue — only anchor tasks are evaluated +print(f"Evaluating {len(tasks)} anchor tasks") +``` + +## Custom Benchmark Subclass + +`MMLUBenchmark` is a framework-agnostic base class. To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`: + +```python +from maseval import AgentAdapter +from maseval.core.history import MessageHistory +from maseval.benchmark.mmlu import MMLUBenchmark + +class MyAgentAdapter(AgentAdapter): + def __init__(self, model, name): + super().__init__(model, name) + self._messages = [] + + def _run_agent(self, query): + self._messages.append({"role": "user", "content": query}) + response = self.agent.generate(query) + self._messages.append({"role": "assistant", "content": response}) + return response + + def get_messages(self): + return MessageHistory(self._messages) + +class MyMMLUBenchmark(MMLUBenchmark): + def setup_agents(self, agent_data, environment, task, user, seed_generator): + model = self.get_model_adapter(agent_data["model_id"]) + adapter = MyAgentAdapter(model, name="mmlu_agent") + return [adapter], {"mmlu_agent": adapter} + + def get_model_adapter(self, model_id, **kwargs): + adapter = MyModelAdapter(model_id) + register_name = kwargs.get("register_name") + if register_name: + self.register("models", register_name, adapter) + return adapter +``` + +## API Reference + +::: maseval.benchmark.mmlu.MMLUBenchmark + +::: maseval.benchmark.mmlu.DefaultMMLUBenchmark + +::: maseval.benchmark.mmlu.MMLUEnvironment + +::: maseval.benchmark.mmlu.MMLUEvaluator + +::: maseval.benchmark.mmlu.load_tasks + +::: maseval.benchmark.mmlu.compute_benchmark_metrics diff --git a/docs/interface/inference/huggingface.md b/docs/interface/inference/huggingface.md index 00a424a4..28814b60 100644 --- a/docs/interface/inference/huggingface.md +++ b/docs/interface/inference/huggingface.md @@ 
-1,7 +1,18 @@ -# HuggingFace Inference Adapter +# HuggingFace Inference Adapters -This page documents the HuggingFace model adapter for MASEval. +This page documents the HuggingFace model adapters for MASEval. + +## Pipeline Model Adapter (Text Generation) [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface.py){ .md-source-file } -::: maseval.interface.inference.huggingface.HuggingFaceModelAdapter +::: maseval.interface.inference.huggingface.HuggingFacePipelineModelAdapter + +!!! note + `HuggingFaceModelAdapter` is a backwards-compatible alias for `HuggingFacePipelineModelAdapter`. + +## Model Scorer (Log-Likelihood) + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface_scorer.py){ .md-source-file } + +::: maseval.interface.inference.huggingface_scorer.HuggingFaceModelScorer diff --git a/docs/reference/environment.md b/docs/reference/environment.md index 77d40e30..7d65e9f1 100644 --- a/docs/reference/environment.md +++ b/docs/reference/environment.md @@ -8,10 +8,8 @@ Environments define the execution context for agents, including available tools, ## Tools and agent-provided helpers -Some agent adapters expose helper tools or user-simulation tools that can be used by the Environment. For example: +Some agent adapters expose helper tools or user-simulation tools that can be used by the Environment. 
See the framework-specific interface pages for details: -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/smolagents.py){ .md-source-file } - -::: maseval.interface.agents.smolagents.SmolAgentAdapter - -::: maseval.interface.agents.smolagents.SmolAgentLLMUser +- [SmolAgents](../interface/agents/smolagents.md) — `SmolAgentAdapter`, `SmolAgentLLMUser` +- [LangGraph](../interface/agents/langgraph.md) — `LangGraphAgentAdapter` +- [LlamaIndex](../interface/agents/llamaindex.md) — `LlamaIndexAgentAdapter` diff --git a/docs/reference/exceptions.md b/docs/reference/exceptions.md index ef96f9dc..99cf2c3e 100644 --- a/docs/reference/exceptions.md +++ b/docs/reference/exceptions.md @@ -38,6 +38,10 @@ SimulatorError (base for simulators) ::: maseval.core.simulator.UserSimulatorError +## Data Access Helpers + +::: maseval.core.exceptions.get_with_assert + ## Validation Helpers These functions simplify input validation and raise `AgentError` with helpful suggestions: diff --git a/docs/reference/model.md b/docs/reference/model.md index 1569d939..f0029c0d 100644 --- a/docs/reference/model.md +++ b/docs/reference/model.md @@ -20,7 +20,7 @@ The following adapter classes implement the ModelAdapter interface for specific [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface.py){ .md-source-file } -::: maseval.interface.inference.huggingface.HuggingFaceModelAdapter +::: maseval.interface.inference.huggingface.HuggingFacePipelineModelAdapter [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/google_genai.py){ .md-source-file } diff --git a/docs/reference/scorer.md b/docs/reference/scorer.md new file mode 100644 index 00000000..cf2eddd4 --- /dev/null +++ b/docs/reference/scorer.md @@ -0,0 +1,19 @@ +# Model Scorers + +Model Scorers provide a uniform interface for log-likelihood computation across model 
providers. Unlike `ModelAdapter` (which handles text generation and chat), scorers evaluate how likely a model considers a given continuation given some context. + +!!! note + + `ModelScorer` is the scoring counterpart to `ModelAdapter`. Use it when you need log-likelihood evaluation (e.g., multiple-choice benchmarks) rather than text generation. + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/scorer.py){ .md-source-file } + +::: maseval.core.scorer.ModelScorer + +## Interfaces + +The following scorer classes implement the ModelScorer interface for specific providers. + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/inference/huggingface_scorer.py){ .md-source-file } + +::: maseval.interface.inference.huggingface_scorer.HuggingFaceModelScorer diff --git a/docs/reference/task.md b/docs/reference/task.md index b70ef13f..ad3087d6 100644 --- a/docs/reference/task.md +++ b/docs/reference/task.md @@ -2,15 +2,15 @@ Tasks define individual benchmark scenarios including inputs, expected outputs, and metadata for evaluation. Task queues control execution order and scheduling strategy. 
-[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L55){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L56){ .md-source-file } ::: maseval.core.task.Task -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L27){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L28){ .md-source-file } ::: maseval.core.task.TaskProtocol -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L18){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L19){ .md-source-file } ::: maseval.core.task.TimeoutAction @@ -18,18 +18,26 @@ Tasks define individual benchmark scenarios including inputs, expected outputs, Task queues determine the order in which tasks are executed. Pass a queue to `Benchmark.run(queue=...)` to customize scheduling. 
-[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L86){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L87){ .md-source-file } ::: maseval.core.task.BaseTaskQueue -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L256){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L257){ .md-source-file } ::: maseval.core.task.SequentialTaskQueue -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L276){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L277){ .md-source-file } + +::: maseval.core.task.InformativeSubsetQueue + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L325){ .md-source-file } + +::: maseval.core.task.DISCOQueue + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L413){ .md-source-file } ::: maseval.core.task.PriorityTaskQueue -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L322){ .md-source-file } +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L459){ .md-source-file } ::: maseval.core.task.AdaptiveTaskQueue diff --git a/docs/reference/user.md b/docs/reference/user.md index c739ad25..c3cd1af8 100644 --- a/docs/reference/user.md +++ b/docs/reference/user.md @@ -14,22 +14,9 @@ The `LLMUser` is initialized with a persona and a scenario, both of which are ty ## Interfaces -Some integrations provide convenience user/tool implementations for specific agent frameworks. For example: +Some integrations provide convenience user implementations for specific agent frameworks. 
See the framework-specific interface pages for details: -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/smolagents.py){ .md-source-file } - -::: maseval.interface.agents.smolagents.SmolAgentLLMUser - -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/langgraph.py){ .md-source-file } - -::: maseval.interface.agents.langgraph.LangGraphLLMUser - -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/llamaindex.py){ .md-source-file } - -::: maseval.interface.agents.llamaindex.LlamaIndexLLMUser - -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/camel.py){ .md-source-file } - -::: maseval.interface.agents.camel.CamelLLMUser - -::: maseval.interface.agents.camel.CamelAgentUser +- [SmolAgents](../interface/agents/smolagents.md) — `SmolAgentLLMUser` +- [LangGraph](../interface/agents/langgraph.md) — `LangGraphLLMUser` +- [LlamaIndex](../interface/agents/llamaindex.md) — `LlamaIndexLLMUser` +- [CAMEL-AI](../interface/agents/camel.md) — `CamelLLMUser`, `CamelAgentUser` diff --git a/examples/mmlu_benchmark/README.md b/examples/mmlu_benchmark/README.md index 62c6bafc..0e90291b 100644 --- a/examples/mmlu_benchmark/README.md +++ b/examples/mmlu_benchmark/README.md @@ -2,6 +2,20 @@ Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO](https://arxiv.org/abs/2510.07959). 
+## Installation + +For basic MMLU evaluation: + +```bash +uv pip install .[mmlu] +``` + +For DISCO prediction (includes DISCO dependencies): + +```bash +uv pip install .[disco] +``` + ## Run without DISCO (full evaluation) From the project root: diff --git a/examples/mmlu_benchmark/mmlu_benchmark.py b/examples/mmlu_benchmark/mmlu_benchmark.py index 023915bd..101aeeba 100644 --- a/examples/mmlu_benchmark/mmlu_benchmark.py +++ b/examples/mmlu_benchmark/mmlu_benchmark.py @@ -52,7 +52,7 @@ # MMLU benchmark imports from maseval.benchmark.mmlu import ( DEFAULT_DEVICE, - HuggingFaceMMLUBenchmark, + DefaultMMLUBenchmark, load_tasks, compute_benchmark_metrics, ) @@ -691,7 +691,7 @@ def main(): ) # Create benchmark - benchmark = HuggingFaceMMLUBenchmark( + benchmark = DefaultMMLUBenchmark( model_id=args.model_id, device=args.device, trust_remote_code=True, diff --git a/maseval/__init__.py b/maseval/__init__.py index 90d52cfa..c6fa6cec 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -16,6 +16,8 @@ BaseTaskQueue, TaskQueue, SequentialTaskQueue, + InformativeSubsetQueue, + DISCOQueue, PriorityTaskQueue, AdaptiveTaskQueue, ) @@ -33,6 +35,7 @@ UserSimulatorError, ) from .core.model import ModelAdapter, ChatResponse +from .core.scorer import ModelScorer from .core.user import User, LLMUser, AgenticLLMUser, TerminationReason from .core.evaluator import Evaluator from .core.history import MessageHistory, ToolInvocationHistory @@ -46,6 +49,7 @@ UserError, UserExhaustedError, TaskTimeoutError, + get_with_assert, validate_argument_type, validate_required_arguments, validate_no_extra_arguments, @@ -93,12 +97,16 @@ "BaseTaskQueue", "TaskQueue", "SequentialTaskQueue", + "InformativeSubsetQueue", + "DISCOQueue", "PriorityTaskQueue", "AdaptiveTaskQueue", - # Model adapters + # Model adapters and scorers "ModelAdapter", "ChatResponse", + "ModelScorer", # Exceptions and validation + "get_with_assert", "MASEvalError", "AgentError", "EnvironmentError", diff --git 
a/maseval/benchmark/mmlu/__init__.py b/maseval/benchmark/mmlu/__init__.py index 19e8fd32..6c6f751c 100644 --- a/maseval/benchmark/mmlu/__init__.py +++ b/maseval/benchmark/mmlu/__init__.py @@ -4,12 +4,10 @@ Usage: from maseval.benchmark.mmlu import ( - MMLUBenchmark, - MMLUEnvironment, - MMLUEvaluator, + DefaultMMLUBenchmark, load_tasks, - AnchorPointsTaskQueue, ) + from maseval import DISCOQueue, InformativeSubsetQueue # Load tasks and anchor points tasks = load_tasks( @@ -17,29 +15,27 @@ anchor_points_path="path/to/anchor_points.pkl", # Optional ) - # Create benchmark - benchmark = MMLUBenchmark() - results = benchmark.run(tasks=tasks, agent_data={"model_id": "gpt-4"}) + # Run benchmark + benchmark = DefaultMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf") + results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"}) """ +from maseval import DISCOQueue, InformativeSubsetQueue + from .mmlu import ( DEFAULT_AGENT_NAME, DEFAULT_BATCH_SIZE, DEFAULT_CHOICES, DEFAULT_DEVICE, DEFAULT_MODEL_REGISTER_NAME, - FALLBACK_MODEL_ID, MMLU_TASK_NAME, STATUS_SUCCESS, TARGET_DELIMITER, TASK_TYPE_MMLU, MMLUBenchmark, - HuggingFaceMMLUBenchmark, + DefaultMMLUBenchmark, MMLUEnvironment, MMLUEvaluator, - MMLUModelAgent, - MMLUAgentAdapter, - AnchorPointsTaskQueue, load_tasks, compute_benchmark_metrics, ) @@ -50,18 +46,16 @@ "DEFAULT_CHOICES", "DEFAULT_DEVICE", "DEFAULT_MODEL_REGISTER_NAME", - "FALLBACK_MODEL_ID", "MMLU_TASK_NAME", "STATUS_SUCCESS", "TARGET_DELIMITER", "TASK_TYPE_MMLU", "MMLUBenchmark", - "HuggingFaceMMLUBenchmark", + "DefaultMMLUBenchmark", "MMLUEnvironment", "MMLUEvaluator", - "MMLUModelAgent", - "MMLUAgentAdapter", - "AnchorPointsTaskQueue", + "InformativeSubsetQueue", + "DISCOQueue", "load_tasks", "compute_benchmark_metrics", ] diff --git a/maseval/benchmark/mmlu/mmlu.py b/maseval/benchmark/mmlu/mmlu.py index 6506402c..d00fe5c0 100644 --- a/maseval/benchmark/mmlu/mmlu.py +++ b/maseval/benchmark/mmlu/mmlu.py @@ -8,56 +8,38 @@ Usage: from maseval.benchmark.mmlu
import ( - MMLUBenchmark, load_tasks, AnchorPointsTaskQueue + DefaultMMLUBenchmark, load_tasks, ) + from maseval import DISCOQueue - # Load tasks filtered to anchor points + # Load tasks (optionally filtered to anchor points) tasks = load_tasks( data_path="/path/to/mmlu_prompts_examples.json", anchor_points_path="/path/to/anchor_points.pkl", ) - # Create benchmark with HuggingFace model - class MyMMLUBenchmark(MMLUBenchmark): - def get_model_adapter(self, model_id, **kwargs): - from transformers import pipeline - from maseval.interface.inference import HuggingFaceModelAdapter - pipe = pipeline("text-generation", model=model_id) - return HuggingFaceModelAdapter(model=pipe, model_id=model_id) - - benchmark = MyMMLUBenchmark() - results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b"}) + # Run with the HuggingFace concrete implementation + benchmark = DefaultMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf") + results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"}) """ import json -import pickle -from abc import abstractmethod from pathlib import Path -from typing import Any, Dict, Iterator, List, Optional, Sequence, Tuple, Union, cast - -# numpy is optional - only needed for anchor points processing -try: - import numpy as np - - HAS_NUMPY = True -except ImportError: - np = None # type: ignore[assignment] - HAS_NUMPY = False +from typing import Any, Dict, List, Optional, Sequence, Tuple, Union, cast from maseval import ( AgentAdapter, + DISCOQueue, Benchmark, Environment, Evaluator, - MessageHistory, ModelAdapter, Task, User, SeedGenerator, ) -from maseval.core.task import AdaptiveTaskQueue, SequentialTaskQueue -from maseval.core.tracing import TraceableMixin -from maseval.core.config import ConfigurableMixin +from maseval.core.history import MessageHistory +from maseval.core.task import SequentialTaskQueue # ============================================================================= @@ -72,107 +54,35 
@@ def get_model_adapter(self, model_id, **kwargs): TARGET_DELIMITER = " " # lm-eval convention for MCQ MMLU_TASK_NAME = "mmlu_prompts" TASK_TYPE_MMLU = "mmlu" -FALLBACK_MODEL_ID = "unknown" STATUS_SUCCESS = "success" # ============================================================================= -# Task Queue +# Agent adapter for scorer-based evaluation # ============================================================================= -class AnchorPointsTaskQueue(AdaptiveTaskQueue): - """Task queue that iterates through tasks in anchor points order. - - This queue is used for DISCO-based evaluation where we only evaluate - on a subset of anchor tasks and predict performance on the full dataset. +class _ScorerBackedAdapter(AgentAdapter): + """Agent adapter for benchmarks that use scorer-based evaluation. - The queue iterates through tasks in the order specified by anchor_points, - and stops when all anchor tasks have been processed. + This adapter is a message container for tracing — the benchmark's + ``run_agents()`` drives evaluation via a ``ModelScorer`` and records + results here. Calling ``agent.run()`` directly is an error because + there is no generation model behind this adapter. """ - def __init__(self, tasks: List[Task], anchor_points: Optional[List[int]] = None): - """Initialize anchor points task queue. - - Args: - tasks: Full list of tasks (ordered by doc_id). - anchor_points: Optional list of task indices (doc_ids) to evaluate. - If None, evaluates all tasks in order. 
- """ - # If anchor_points provided, filter tasks to only include anchor tasks - # This dramatically improves performance by avoiding O(n²) iteration - if anchor_points is not None: - # Build index mapping for quick lookup - task_by_doc_id: Dict[int, Task] = {} - for i, task in enumerate(tasks): - doc_id = task.metadata.get("doc_id", i) - task_by_doc_id[doc_id] = task - - # Filter to only anchor tasks, preserving anchor order - anchor_tasks = [] - for doc_id in anchor_points: - task = task_by_doc_id.get(doc_id) - if task is not None: - anchor_tasks.append(task) - - # Store original for reference - self._all_tasks = tasks - self._task_by_doc_id = task_by_doc_id - tasks = anchor_tasks - - super().__init__(tasks) - self._anchor_points = anchor_points - self._anchor_idx = 0 - - # Initialize state immediately (since __iter__ is overridden and skips initial_state()) - self._state = self.initial_state() - - def __iter__(self) -> Iterator[Task]: - """Yield tasks in anchor point order. - - Since tasks are pre-filtered during __init__, we simply iterate - over the stored tasks in order. This avoids the infinite loop - issue in AdaptiveTaskQueue.__iter__ which relies on on_task_repeat_end - to remove tasks from _remaining. - """ - return iter(self._tasks) - - def initial_state(self) -> Dict[str, Any]: - """Initialize state for anchor point iteration.""" - return { - "anchor_idx": 0, - "completed_anchors": [], - } - - def select_next_task(self, remaining: Sequence[Task], state: Dict[str, Any]) -> Optional[Task]: - """Select the next anchor task to execute. - - Args: - remaining: Tasks not yet executed. - state: Current state with anchor_idx. - - Returns: - Next anchor task, or None if all anchors processed. 
- """ - # Simply return the first remaining task since we pre-filtered to anchor tasks only - return remaining[0] if remaining else None - - def update_state(self, task: Task, report: Dict[str, Any], state: Dict[str, Any]) -> Dict[str, Any]: - """Update state after task completion. - - Args: - task: Completed task. - report: Execution report. - state: Current state. + def __init__(self, scorer: Any, name: str) -> None: + super().__init__(agent_instance=scorer, name=name) + self._messages: List[Dict[str, Any]] = [] - Returns: - Updated state. - """ - doc_id = task.metadata.get("doc_id") - state["completed_anchors"].append(doc_id) - state["anchor_idx"] += 1 + def _run_agent(self, query: str) -> Any: + raise NotImplementedError( + f"{type(self).__name__} is backed by a ModelScorer, not a generation model. " + "Use benchmark.run_agents() instead of calling agent.run() directly." + ) - return state + def get_messages(self) -> MessageHistory: + return MessageHistory(self._messages) # ============================================================================= @@ -188,12 +98,18 @@ class MMLUEnvironment(Environment): """ def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]: - """Initialize state from task data.""" + """Initialize state from task data. + + Args: + task_data: Must contain ``"query"`` (str) and ``"environment_data"`` + (dict with ``"choices"``, ``"full_prompt"``, ``"use_full_prompt"``). 
+ """ + env_data = task_data["environment_data"] return { - "query": task_data.get("query", ""), - "choices": task_data.get("environment_data", {}).get("choices", []), - "full_prompt": task_data.get("environment_data", {}).get("full_prompt", ""), - "use_full_prompt": task_data.get("environment_data", {}).get("use_full_prompt", False), + "query": task_data["query"], + "choices": env_data["choices"], + "full_prompt": env_data["full_prompt"], + "use_full_prompt": env_data["use_full_prompt"], } def create_tools(self) -> Dict[str, Any]: @@ -203,11 +119,11 @@ def create_tools(self) -> Dict[str, Any]: def get_prompt(self) -> str: """Get the prompt to send to the model. - Returns full_prompt if use_full_prompt is True, otherwise query. + Returns ``full_prompt`` if ``use_full_prompt`` is True, otherwise ``query``. """ - if self.state.get("use_full_prompt", False): - return self.state.get("full_prompt", self.state.get("query", "")) - return self.state.get("query", "") + if self.state["use_full_prompt"]: + return self.state["full_prompt"] + return self.state["query"] # ============================================================================= @@ -231,14 +147,15 @@ def __init__( """Initialize MMLU evaluator. Args: - task: Task being evaluated (contains gold answer). + task: Task being evaluated. Must have ``evaluation_data["gold"]`` (int) + with the correct answer index. environment: Environment (provides choices). user: Unused for MMLU. """ self.task = task self.environment = environment - self.gold = task.evaluation_data.get("gold", 0) - self.choices = task.environment_data.get("choices", DEFAULT_CHOICES) + self.gold = task.evaluation_data["gold"] + self.choices = task.environment_data["choices"] def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]: """Extract relevant traces for evaluation. 
@@ -276,11 +193,11 @@ def __call__(self, traces: Dict[str, Any], final_answer: Optional[str] = None) - "predicted": predicted, "gold": self.gold, "correct": correct, - "doc_id": self.task.metadata.get("doc_id"), + "doc_id": self.task.metadata["doc_id"], } # Extract logprobs from traces if available (for logprobs-based evaluation) - messages = traces.get("messages", []) + messages = traces["messages"] for msg in messages: if isinstance(msg, dict) and "logprobs" in msg: result["logprobs"] = msg["logprobs"] @@ -328,103 +245,6 @@ def _parse_answer(self, response: str) -> int: return -1 -# ============================================================================= -# Model Adapter Wrapper for MCQ -# ============================================================================= - - -class MMLUModelAgent(TraceableMixin, ConfigurableMixin): - """Simple agent wrapper that passes prompts to a model for MCQ evaluation. - - This is a minimal agent that just forwards prompts to the model - and returns the response. It supports tracing for MASEval integration. - """ - - def __init__(self, model: ModelAdapter, name: str = DEFAULT_AGENT_NAME): - """Initialize MMLU model agent. - - Args: - model: ModelAdapter to use for generation. - name: Agent name for tracing. - """ - super().__init__() - self.model = model - self.name = name - self._messages: List[Dict[str, Any]] = [] - - def run(self, prompt: str) -> str: - """Run the model on a prompt. - - Args: - prompt: The prompt to send to the model. - - Returns: - Model's response string. 
- """ - # Record input message - self._messages.append({"role": "user", "content": prompt}) - - # Generate response - response = self.model.generate(prompt) - - # Record output message - self._messages.append({"role": "assistant", "content": response}) - - return response - - def gather_traces(self) -> Dict[str, Any]: - """Gather traces for this agent.""" - return { - **super().gather_traces(), - "name": self.name, - "messages": list(self._messages), - } - - def gather_config(self) -> Dict[str, Any]: - """Gather configuration.""" - return { - **super().gather_config(), - "name": self.name, - "model_id": self.model.model_id, - } - - -class MMLUAgentAdapter(AgentAdapter): - """AgentAdapter wrapper for MMLUModelAgent.""" - - def __init__(self, agent: MMLUModelAgent, name: str): - """Initialize adapter. - - Args: - agent: MMLUModelAgent instance. - name: Adapter name. - """ - super().__init__(agent, name) - - def _run_agent(self, query: str) -> Any: - """Execute the agent.""" - return self.agent.run(query) - - def get_messages(self) -> MessageHistory: - """Get agent messages.""" - return MessageHistory(self.agent._messages) - - def gather_traces(self) -> Dict[str, Any]: - """Gather execution traces from this agent.""" - from maseval.core.tracing import TraceableMixin - - messages = self.get_messages() - return { - **TraceableMixin.gather_traces(self), - "name": self.name, - "agent_type": type(self.agent).__name__, - "message_count": len(messages), - "messages": messages.to_list(), - "callbacks": [type(cb).__name__ for cb in self.callbacks], - "logs": self.logs, - } - - # ============================================================================= # Benchmark # ============================================================================= @@ -436,19 +256,12 @@ class MMLUBenchmark(Benchmark): Evaluates language models on MMLU multiple choice questions. Supports anchor point-based evaluation for DISCO prediction. 
- Users must subclass and implement: - - get_model_adapter() to provide model adapters + Subclasses must implement: - Usage: - class MyMMLUBenchmark(MMLUBenchmark): - def get_model_adapter(self, model_id, **kwargs): - from transformers import pipeline - from maseval.interface.inference import HuggingFaceModelAdapter - pipe = pipeline("text-generation", model=model_id) - return HuggingFaceModelAdapter(model=pipe, model_id=model_id) + - ``setup_agents()`` - create agents for MCQ evaluation + - ``get_model_adapter()`` - provide model adapters - benchmark = MyMMLUBenchmark() - results = benchmark.run(tasks=tasks, agent_data={"model_id": "llama-7b"}) + For a ready-to-use implementation, see ``DefaultMMLUBenchmark``. """ def __init__( @@ -480,48 +293,11 @@ def setup_environment( "query": task.query, "environment_data": { **task.environment_data, - "use_full_prompt": self.use_full_prompt or agent_data.get("use_full_prompt", False), + "use_full_prompt": self.use_full_prompt, }, } return MMLUEnvironment(task_data) - def setup_user( - self, - agent_data: Dict[str, Any], - environment: Environment, - task: Task, - seed_generator: SeedGenerator, - ) -> Optional[User]: - """MMLU doesn't use a user simulator.""" - return None - - def setup_agents( - self, - agent_data: Dict[str, Any], - environment: Environment, - task: Task, - user: Optional[User], - seed_generator: SeedGenerator, - ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]: - """Create model agent for MCQ evaluation. - - Args: - agent_data: Agent config with model_id. - environment: MMLU environment. - task: Current task. - user: Unused. - - Returns: - Tuple of (agents_to_run, agents_dict). 
- """ - model_id = agent_data.get("model_id", FALLBACK_MODEL_ID) - model = self.get_model_adapter(model_id, register_name=DEFAULT_MODEL_REGISTER_NAME) - - agent = MMLUModelAgent(model, name=DEFAULT_AGENT_NAME) - adapter = MMLUAgentAdapter(agent, DEFAULT_AGENT_NAME) - - return [adapter], {DEFAULT_AGENT_NAME: adapter} - def setup_evaluators( self, environment: Environment, @@ -548,21 +324,6 @@ def run_agents( agent = agents[0] return agent.run(prompt) - @abstractmethod - def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter: - """Provide a ModelAdapter for the model. - - Must be implemented by subclass. - - Args: - model_id: Model identifier. - **kwargs: Additional arguments (e.g., register_name for tracing). - - Returns: - ModelAdapter instance. - """ - pass - def evaluate( self, evaluators: Sequence[Evaluator], @@ -579,16 +340,18 @@ def evaluate( return results -class HuggingFaceMMLUBenchmark(MMLUBenchmark): +class DefaultMMLUBenchmark(MMLUBenchmark): """MMLU Benchmark using HuggingFace transformers models. This concrete implementation uses log-likelihood based MCQ evaluation - with the same optimizations as lm-evaluation-harness: + via ``HuggingFaceModelScorer``, with the same optimisations as + lm-evaluation-harness: + + 1. Single forward pass per question (one-token continuation optimisation) + 2. Efficient log-softmax computation + 3. Proper left-padding for batch processing - 1. Single forward pass per question (one-token continuation optimization) - 2. Batching multiple questions together - 3. Efficient log-softmax computation - 4. Proper left-padding for batch processing + Agents are created using a scorer-backed adapter (see ``_ScorerBackedAdapter``). """ def __init__( @@ -598,7 +361,7 @@ def __init__( trust_remote_code: bool = True, use_full_prompt: bool = True, batch_size: int = DEFAULT_BATCH_SIZE, - **kwargs, + **kwargs: Any, ): """Initialize HuggingFace MMLU benchmark. @@ -607,195 +370,50 @@ def __init__( device: Device to run model on. 
trust_remote_code: Trust remote code when loading model (default True). use_full_prompt: Use full prompt with few-shot examples (default True). - batch_size: Batch size for evaluation (number of questions per batch). - **kwargs: Additional arguments passed to MMLUBenchmark. + batch_size: Batch size for lm-eval batching (number of questions per batch). + **kwargs: Additional arguments passed to ``MMLUBenchmark``. """ super().__init__(use_full_prompt=use_full_prompt, **kwargs) self._model_id = model_id self._device = device self._trust_remote_code = trust_remote_code self._batch_size = batch_size - self._model = None - self._tokenizer = None - - def _load_model(self): - """Lazy load the model and tokenizer for log-likelihood computation.""" - if self._model is None: - from transformers import AutoModelForCausalLM, AutoTokenizer - - print(f"Loading model: {self._model_id}") - self._tokenizer = AutoTokenizer.from_pretrained( - self._model_id, - trust_remote_code=self._trust_remote_code, - ) - self._tokenizer.padding_side = "left" - if self._tokenizer.pad_token is None: - self._tokenizer.pad_token = self._tokenizer.eos_token - - # Load model with torch_dtype="auto" to match lm-evaluation-harness exactly - # This uses the model's native dtype (bfloat16 for most modern models) - # Then move to device manually - self._model = AutoModelForCausalLM.from_pretrained( - self._model_id, - trust_remote_code=self._trust_remote_code, - torch_dtype="auto", - ) - self._model = self._model.to(self._device) - self._model.eval() - - # Note: We don't pre-cache choice token IDs here because they depend on context. - # Token IDs are computed dynamically in _get_choice_token_id_in_context() - # to match lm-evaluation-harness behavior exactly. - - return self._model, self._tokenizer - - def _get_choice_token_id_separate(self, choice: str) -> Optional[int]: - """Get the token ID for a choice when tokenized SEPARATELY. 
- - CRITICAL: lm-evaluation-harness encodes context and continuation separately, - then concatenates. This means "A" is always tokenized standalone (token 330), - NOT in context after "Answer:" (which would be token 28741). - - We must match this behavior to get identical log-likelihood values. - - Args: - choice: The choice string (e.g., "A"). - Returns: - Token ID for the choice (standalone tokenization), or None if multi-token. - """ - _, tokenizer = self._load_model() - - # Tokenize choice ALONE (not in context) - this is how lm-eval does it - choice_tokens = tokenizer.encode(choice, add_special_tokens=False) - - if len(choice_tokens) == 1: - return choice_tokens[0] - else: - # Multi-token choice - return None to trigger multi-token fallback - return None - - def _encode_pair(self, context: str, continuation: str) -> tuple: - """Encode a context-continuation pair like lm-evaluation-harness. - - This matches lm-eval's _encode_pair method exactly: - 1. Encode whole = context + continuation - 2. Encode context alone - 3. continuation_enc = whole[len(context_enc):] - - This handles tokenization boundary effects correctly. - - Args: - context: The context/prompt string. - continuation: The continuation string (e.g., " A" with target_delimiter). - - Returns: - Tuple of (context_enc, continuation_enc) token lists. 
- """ - _, tokenizer = self._load_model() - - # Handle trailing spaces in context (move to continuation) - n_spaces = len(context) - len(context.rstrip()) - if n_spaces > 0: - continuation = context[-n_spaces:] + continuation - context = context[:-n_spaces] - - # Encode whole string together, then split - whole_enc = tokenizer.encode(context + continuation, add_special_tokens=True) - context_enc = tokenizer.encode(context, add_special_tokens=True) - - # Continuation tokens are what's left after context - continuation_enc = whole_enc[len(context_enc) :] - - return context_enc, continuation_enc - - def _compute_logprobs_single_token(self, prompt: str, choices: list) -> list: - """Compute log-likelihoods using single-token optimization. - - For MCQ with single-letter answers (A, B, C, D), we can compute all - choices in one forward pass since they share the same context. - - IMPORTANT: To match lm-evaluation-harness EXACTLY: - 1. Use target_delimiter=" " before choices (e.g., " A" not "A") - 2. Use _encode_pair to handle tokenization boundaries correctly - 3. Input = (context + continuation)[:-1] - 4. Apply log_softmax to get log probabilities - - Args: - prompt: The prompt/question text. - choices: List of answer choice strings (e.g., ["A", "B", "C", "D"]). - - Returns: - List of log-likelihoods, one per choice. 
- """ - import torch - - model, _ = self._load_model() - - # lm-eval uses target_delimiter=" " for multiple choice tasks - target_delimiter = TARGET_DELIMITER + from maseval.interface.inference.huggingface_scorer import HuggingFaceModelScorer - # Encode first choice to get the shared context - first_continuation = f"{target_delimiter}{choices[0]}" - context_enc, first_cont_enc = self._encode_pair(prompt, first_continuation) - - # Build input: (context + continuation)[:-1] - full_sequence = context_enc + first_cont_enc - input_tokens = full_sequence[:-1] # Remove last token - - input_ids = torch.tensor([input_tokens], dtype=torch.long, device=self._device) - - with torch.no_grad(): - outputs = model(input_ids) - logits = outputs.logits[0] # (seq_len, vocab_size) - - # Select logits at position where continuation is predicted - # For single-token continuation, this is the last position - inplen = len(input_tokens) - contlen = len(first_cont_enc) - selected_logits = logits[inplen - contlen : inplen] - - # Compute log-softmax - log_probs = torch.nn.functional.log_softmax(selected_logits, dim=-1) - - # Get log prob for each choice's continuation token - logprobs = [] - for choice in choices: - continuation = f"{target_delimiter}{choice}" - _, cont_enc = self._encode_pair(prompt, continuation) - - # Sum log probs for multi-token continuations - total = 0.0 - for i, token_id in enumerate(cont_enc): - total += log_probs[i, token_id].item() - logprobs.append(total) - - return logprobs + self._scorer = HuggingFaceModelScorer( + model_id=model_id, + device=device, + trust_remote_code=trust_remote_code, + ) - def _compute_logprobs_batched(self, prompts: list, choices_list: list) -> list: - """Compute log-likelihoods for a batch of prompts. 
+ def setup_agents( + self, + agent_data: Dict[str, Any], + environment: Environment, + task: Task, + user: Optional[User], + seed_generator: SeedGenerator, + ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]: + """Create scorer-backed agent for MCQ evaluation. - For exact match with lm-evaluation-harness, we process each prompt - individually using _compute_logprobs_single_token which uses the - correct _encode_pair tokenization logic. + The returned adapter is a tracing container — actual evaluation is + driven by ``self._scorer`` in ``run_agents()``. Args: - prompts: List of prompt strings. - choices_list: List of choice lists (one per prompt). + agent_data: Agent config (unused; model is set at ``__init__``). + environment: MMLU environment. + task: Current task. + user: Unused. + seed_generator: Seed generator (unused for MMLU). Returns: - List of log-likelihood lists, one per prompt. + Tuple of (agents_to_run, agents_dict). """ - # For exact match with lm-eval, process individually - # This ensures correct tokenization via _encode_pair - all_logprobs = [] - for prompt, choices in zip(prompts, choices_list): - logprobs = self._compute_logprobs_single_token(prompt, choices) - all_logprobs.append(logprobs) - - return all_logprobs + adapter = _ScorerBackedAdapter(self._scorer, DEFAULT_AGENT_NAME) + return [adapter], {DEFAULT_AGENT_NAME: adapter} - def precompute_all_logprobs_lmeval(self, tasks) -> dict: + def precompute_all_logprobs_lmeval(self, tasks: Sequence[Task]) -> Dict[Any, List[float]]: """Precompute log-likelihoods for ALL tasks using lm-eval's batching. 
CRITICAL: lm-evaluation-harness batches ALL requests together and uses @@ -836,7 +454,7 @@ def precompute_all_logprobs_lmeval(self, tasks) -> dict: instance_map = {} # (doc_id, choice_idx) -> position in results for task in tasks: - doc_id = task.metadata.get("doc_id") + doc_id = task.metadata["doc_id"] # Get prompt from task - use full_prompt from environment_data if available if self.use_full_prompt and "full_prompt" in task.environment_data: prompt = task.environment_data["full_prompt"] @@ -862,7 +480,7 @@ def precompute_all_logprobs_lmeval(self, tasks) -> dict: # Map results back to doc_ids doc_logprobs = {} for task in tasks: - doc_id = task.metadata.get("doc_id") + doc_id = task.metadata["doc_id"] logprobs = [] for i in range(len(choices)): pos = instance_map[(doc_id, i)] @@ -875,229 +493,81 @@ def precompute_all_logprobs_lmeval(self, tasks) -> dict: return doc_logprobs - def _compute_logprobs_multi_token(self, prompt: str, choices: list) -> list: - """Compute log-likelihoods for multi-token continuations. - - This is the fallback for when answer choices have multiple tokens. - Uses _encode_pair to match lm-evaluation-harness exactly. - - Args: - prompt: The prompt/question text. - choices: List of answer choice strings. - - Returns: - List of log-likelihoods, one per choice. 
- """ - import torch - - model, _ = self._load_model() - - # lm-eval uses target_delimiter=" " for multiple choice tasks - target_delimiter = TARGET_DELIMITER - - all_logprobs = [] - for choice in choices: - continuation = f"{target_delimiter}{choice}" - - # Use _encode_pair for correct tokenization - context_enc, continuation_enc = self._encode_pair(prompt, continuation) - - # Build input: (context + continuation)[:-1] - full_sequence = context_enc + continuation_enc - input_tokens = full_sequence[:-1] - - input_ids = torch.tensor([input_tokens], dtype=torch.long, device=self._device) - - with torch.no_grad(): - outputs = model(input_ids) - logits = outputs.logits[0] # (seq_len, vocab_size) - - # Select continuation logits - inplen = len(input_tokens) - contlen = len(continuation_enc) - selected = logits[inplen - contlen : inplen] - - # Compute log-softmax - log_probs = torch.nn.functional.log_softmax(selected, dim=-1) - - # Sum log probs for all continuation tokens - total = 0.0 - for i, token_id in enumerate(continuation_enc): - total += log_probs[i, token_id].item() - - all_logprobs.append(total) - - return all_logprobs - def run_agents( self, - agents, - task, - environment, + agents: Sequence[AgentAdapter], + task: Task, + environment: Environment, query: str = "", - ): + ) -> Any: """Execute log-likelihood based MCQ evaluation. Uses precomputed logprobs if available (for exact lm-eval match), - otherwise falls back to single-forward-pass optimization for - single-token answers, or multi-token batched computation. + otherwise delegates to ``HuggingFaceModelScorer.loglikelihood_choices()`` + which automatically picks single-token or multi-token scoring. 
""" - # Get the prompt from environment - prompt = environment.get_prompt() - choices = environment.state.get("choices", DEFAULT_CHOICES) - doc_id = task.metadata.get("doc_id") if task else None - - # Check if we have precomputed logprobs (for exact lm-eval match) - if hasattr(self, "_precomputed_logprobs") and doc_id is not None: - logprobs = self._precomputed_logprobs.get(doc_id) - if logprobs is not None: - # Use precomputed values for exact match - best_idx = logprobs.index(max(logprobs)) - answer = choices[best_idx] - - # Store logprobs in environment for later retrieval - environment.state["logprobs"] = logprobs - environment.state["predicted_idx"] = best_idx - - # Record in agent messages for tracing - agent = agents[0] - agent.agent._messages.append({"role": "user", "content": prompt}) - agent.agent._messages.append( - { - "role": "assistant", - "content": answer, - "logprobs": logprobs, - } - ) - - return answer - - # Fall back to computing logprobs on-the-fly - # Load model - self._load_model() - - # lm-eval uses target_delimiter=" " for multiple choice tasks - target_delimiter = TARGET_DELIMITER - - # Check if all choices result in single-token continuations - # using _encode_pair to get the correct tokenization - all_single_token = True - for choice in choices: - continuation = f"{target_delimiter}{choice}" - _, cont_enc = self._encode_pair(prompt, continuation) - if len(cont_enc) != 1: - all_single_token = False - break - - if all_single_token: - # Use optimized single-token path (one forward pass) - logprobs = self._compute_logprobs_single_token(prompt, choices) - else: - # Fall back to multi-token computation - logprobs = self._compute_logprobs_multi_token(prompt, choices) + mmlu_env = cast(MMLUEnvironment, environment) + prompt = mmlu_env.get_prompt() + choices = mmlu_env.state["choices"] + doc_id = task.metadata["doc_id"] + agent = cast(_ScorerBackedAdapter, agents[0]) + + if hasattr(self, "_precomputed_logprobs") and doc_id in 
self._precomputed_logprobs: + logprobs = self._precomputed_logprobs[doc_id] + best_idx = logprobs.index(max(logprobs)) + answer = choices[best_idx] + mmlu_env.state["logprobs"] = logprobs + mmlu_env.state["predicted_idx"] = best_idx + agent._messages.append({"role": "user", "content": prompt}) + agent._messages.append({"role": "assistant", "content": answer, "logprobs": logprobs}) + return answer + + logprobs = self._scorer.loglikelihood_choices(prompt, choices, delimiter=TARGET_DELIMITER) - # Select the choice with highest log-probability best_idx = logprobs.index(max(logprobs)) answer = choices[best_idx] + mmlu_env.state["logprobs"] = logprobs + mmlu_env.state["predicted_idx"] = best_idx - # Store logprobs in environment for later retrieval if needed - environment.state["logprobs"] = logprobs - environment.state["predicted_idx"] = best_idx - - # Record in agent messages for tracing - agent = agents[0] - agent.agent._messages.append({"role": "user", "content": prompt}) - agent.agent._messages.append( - { - "role": "assistant", - "content": answer, - "logprobs": logprobs, - } - ) - + agent._messages.append({"role": "user", "content": prompt}) + agent._messages.append({"role": "assistant", "content": answer, "logprobs": logprobs}) return answer - def get_model_adapter(self, model_id: str, **kwargs): - """Provide a HuggingFace ModelAdapter. - - Note: For logprobs-based evaluation, we don't actually use the adapter - for generation. This is kept for API compatibility. - - Args: - model_id: Model identifier (ignored, uses instance model_id). - **kwargs: Additional arguments (e.g., register_name). + def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter: + """Not used — ``DefaultMMLUBenchmark`` uses ``HuggingFaceModelScorer``. - Returns: - HuggingFaceModelAdapter instance. + Raises: + NotImplementedError: Always. Use ``HuggingFaceModelScorer`` via + ``self._scorer`` for log-likelihood evaluation. 
""" - from maseval.interface.inference import HuggingFaceModelAdapter - - # Create a minimal adapter for compatibility - # The actual evaluation uses _compute_logprobs_* - class DummyCallable: - def __call__(self, prompt, **kwargs): - return "" - - adapter = HuggingFaceModelAdapter( - model=DummyCallable(), - model_id=self._model_id, + raise NotImplementedError( + "DefaultMMLUBenchmark uses HuggingFaceModelScorer for log-likelihood " + "evaluation, not a generation ModelAdapter. Access the scorer via self._scorer." ) - # Register for tracing if requested - register_name = kwargs.get("register_name") - if register_name: - self.register("models", register_name, adapter) - - return adapter - # ============================================================================= # Data Loading # ============================================================================= -def load_pickle(path: Union[str, Path]) -> Any: - """Load a pickle file.""" - with open(path, "rb") as f: - return pickle.load(f) - - -def load_anchor_points(path: Union[str, Path]) -> List[int]: - """Load anchor points from a .json or .pkl file. Returns a list of doc_ids.""" - path = Path(path) - if not path.exists(): - raise FileNotFoundError(f"Anchor points file not found: {path}") - if path.suffix.lower() == ".json": - with open(path) as f: - anchor_points = json.load(f) - else: - anchor_points = load_pickle(path) - if HAS_NUMPY and isinstance(anchor_points, np.ndarray): - anchor_points = anchor_points.tolist() - elif not HAS_NUMPY and hasattr(anchor_points, "tolist"): - anchor_points = anchor_points.tolist() - return list(anchor_points) - - def load_tasks( data_path: Union[str, Path], anchor_points_path: Optional[Union[str, Path]] = None, limit: Optional[int] = None, -) -> Union[AnchorPointsTaskQueue, SequentialTaskQueue]: +) -> Union[DISCOQueue, SequentialTaskQueue]: """Load MMLU tasks from JSON file. Args: data_path: Path to MMLU prompts JSON file (mmlu_prompts_examples.json format). 
anchor_points_path: Optional path to anchor points pickle file. - If provided, returns an AnchorPointsTaskQueue that evaluates + If provided, returns a DISCOQueue that evaluates only the anchor tasks in order. limit: Optional limit on number of tasks to load. Returns: TaskQueue containing MMLU tasks. - Raises: - ImportError: If anchor_points_path is provided but numpy is not installed. """ data_path = Path(data_path) @@ -1112,8 +582,15 @@ def load_tasks( # Convert to Tasks tasks = [] for i, item in enumerate(data): + query = item.get("query") or item.get("example") + if query is None: + raise ValueError(f"MMLU task at index {i} has neither 'query' nor 'example' field") + + if "gold" not in item: + raise ValueError(f"MMLU task at index {i} missing required 'gold' field (correct answer index)") + task = Task( - query=item.get("query", item.get("example", "")), + query=query, id=f"mmlu_{i}", environment_data={ "choices": item.get("choices", DEFAULT_CHOICES), @@ -1121,7 +598,7 @@ "example": item.get("example", ""), }, evaluation_data={ - "gold": item.get("gold", 0), + "gold": item["gold"], }, metadata={ "doc_id": i, @@ -1130,14 +607,8 @@ ) tasks.append(task) - # Load anchor points if provided - anchor_points = None if anchor_points_path is not None: - anchor_points = load_anchor_points(anchor_points_path) - - # Create appropriate queue - if anchor_points is not None: - return AnchorPointsTaskQueue(tasks, anchor_points) + return DISCOQueue(tasks, anchor_points_path=anchor_points_path) else: return SequentialTaskQueue(tasks) @@ -1164,14 +635,14 @@ def compute_benchmark_metrics(results: List[Dict[str, Any]]) -> Dict[str, Any]: acc_norm_sum = 0.0 for res in results: - if res.get("status") != STATUS_SUCCESS: + if res["status"] != STATUS_SUCCESS: continue - evals = res.get("eval") or [] + evals = res["eval"] or [] for entry in evals: - acc_sum += entry.get("acc", 0.0) - acc_norm_sum += entry.get("acc_norm", 0.0) - if entry.get("correct", 
False): + acc_sum += entry["acc"] + acc_norm_sum += entry["acc_norm"] + if entry["correct"]: correct_count += 1 return { diff --git a/maseval/core/agent.py b/maseval/core/agent.py index 97011527..1f0aeb9b 100644 --- a/maseval/core/agent.py +++ b/maseval/core/agent.py @@ -1,3 +1,5 @@ +from __future__ import annotations + from abc import ABC, abstractmethod from typing import List, Any, Optional, Dict diff --git a/maseval/core/exceptions.py b/maseval/core/exceptions.py index e4c8c0f1..b3e297c0 100644 --- a/maseval/core/exceptions.py +++ b/maseval/core/exceptions.py @@ -308,6 +308,44 @@ def __init__( # ============================================================================= +def get_with_assert(container: Any, key: Any, error_msg: Optional[str] = None) -> Any: + """Get a value from a container, raising ``KeyError`` if not found. + + Use instead of ``dict.get(key, default)`` when the key is **required**. + A missing key means a bug — not a case to paper over with a fallback. + + Supports nested access via a list of keys:: + + get_with_assert(task, ["metadata", "doc_id"]) + # equivalent to: task["metadata"]["doc_id"] but with a clear error + + Args: + container: Dictionary or other container supporting ``in`` and ``[]``. + key: Key to look up. Pass a list for nested access. + error_msg: Custom error message. If ``None``, a descriptive default + is generated. + + Returns: + The value at the given key. + + Raises: + KeyError: If the key is not found in the container. 
+ """ + if isinstance(key, list): + assert len(key) > 0 + value = get_with_assert(container, key[0], error_msg) + if len(key) == 1: + return value + return get_with_assert(value, key[1:], error_msg) + + if key not in container: + if error_msg is None: + error_msg = f'Required key "{key}" not in container: {container}' + raise KeyError(error_msg) + + return container[key] + + def validate_argument_type( value: Any, expected_type: str, diff --git a/maseval/core/model.py b/maseval/core/model.py index cac1c2ed..d62d204c 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -155,7 +155,7 @@ class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin): See maseval.interface.inference for concrete implementations: - AnthropicModelAdapter - GoogleGenAIModelAdapter - - HuggingFaceModelAdapter + - HuggingFacePipelineModelAdapter (alias: HuggingFaceModelAdapter) - LiteLLMModelAdapter - OpenAIModelAdapter diff --git a/maseval/core/scorer.py b/maseval/core/scorer.py new file mode 100644 index 00000000..aed7d672 --- /dev/null +++ b/maseval/core/scorer.py @@ -0,0 +1,276 @@ +"""Core model scorer abstractions for likelihood-based evaluation. + +This module provides the base `ModelScorer` class for computing token-level +scores (log-likelihoods) from language models. While `ModelAdapter` handles +text generation (``chat``, ``generate``), ``ModelScorer`` handles scoring by +computing how likely a model considers a given continuation. + +See `maseval.interface.inference` for concrete implementations. 
+ +Example: + ```python + from maseval.interface.inference import HuggingFaceModelScorer + + scorer = HuggingFaceModelScorer( + model_id="meta-llama/Llama-2-7b-hf", + device="cuda:0", + ) + + # Single pair + ll = scorer.loglikelihood("The capital of France is", " Paris") + + # MCQ evaluation + logprobs = scorer.loglikelihood_choices( + "What is 2+2?\\nA) 3\\nB) 4\\nC) 5\\nD) 6\\nAnswer:", + choices=["A", "B", "C", "D"], + ) + best = ["A", "B", "C", "D"][logprobs.index(max(logprobs))] + ``` +""" + +from __future__ import annotations + +import time +from abc import ABC, abstractmethod +from datetime import datetime +from typing import Any, Dict, List, Optional, Tuple + +from .config import ConfigurableMixin +from .tracing import TraceableMixin + + +class ModelScorer(ABC, TraceableMixin, ConfigurableMixin): + """Abstract base class for model scorers. + + ``ModelScorer`` provides a consistent interface for computing token-level + log-likelihoods from language models. All scorers implement the same + methods, so you can swap providers without changing evaluation code. + + To use a scorer: + + 1. Create an instance with provider-specific configuration + 2. Call ``loglikelihood()`` for single context-continuation pairs + 3. Call ``loglikelihood_batch()`` for efficient batch computation + 4. Call ``loglikelihood_choices()`` for MCQ evaluation + + Implementing a custom scorer: + + Subclass ``ModelScorer`` and implement: + + - ``model_id`` property: Return the model identifier string + - ``_loglikelihood_impl()``: Score a single (context, continuation) pair + + Optionally override: + + - ``_loglikelihood_batch_impl()``: Optimised batch scoring + - ``loglikelihood_choices()``: MCQ-specific optimisations (e.g. shared-context single-pass) + """ + + def __init__(self, seed: Optional[int] = None): + """Initialize the model scorer. + + Args: + seed: Seed for deterministic scoring. Passed to the underlying + model if supported. 
+ """ + super().__init__() + self._seed = seed + self.logs: List[Dict[str, Any]] = [] + + @property + def seed(self) -> Optional[int]: + """Seed for deterministic scoring, or None if unseeded.""" + return self._seed + + @property + @abstractmethod + def model_id(self) -> str: + """The identifier for the underlying model. + + Returns: + A string identifying the model (e.g., ``"meta-llama/Llama-2-7b-hf"``). + """ + + def loglikelihood(self, context: str, continuation: str) -> float: + """Compute the log-likelihood of ``continuation`` given ``context``. + + Args: + context: The conditioning text (prompt). + continuation: The text whose likelihood is scored. + + Returns: + Log-likelihood (negative float; higher = more likely). + """ + start_time = time.time() + try: + result = self._loglikelihood_impl(context, continuation) + duration = time.time() - start_time + self.logs.append( + { + "timestamp": datetime.now().isoformat(), + "type": "loglikelihood", + "duration_seconds": duration, + "status": "success", + } + ) + return result + except Exception as e: + duration = time.time() - start_time + self.logs.append( + { + "timestamp": datetime.now().isoformat(), + "type": "loglikelihood", + "duration_seconds": duration, + "status": "error", + "error": str(e), + "error_type": type(e).__name__, + } + ) + raise + + @abstractmethod + def _loglikelihood_impl(self, context: str, continuation: str) -> float: + """Internal implementation for single-pair scoring. + + Subclasses must implement this. The base class handles timing + and error logging. + + Args: + context: The conditioning text. + continuation: The text to score. + + Returns: + Log-likelihood of the continuation. + """ + + def loglikelihood_batch(self, pairs: List[Tuple[str, str]]) -> List[float]: + """Compute log-likelihoods for a batch of (context, continuation) pairs. + + Override ``_loglikelihood_batch_impl`` for provider-specific batching + optimisations. The default loops over ``_loglikelihood_impl``. 
+ + Args: + pairs: List of (context, continuation) tuples. + + Returns: + List of log-likelihoods, one per pair. + """ + start_time = time.time() + try: + results = self._loglikelihood_batch_impl(pairs) + duration = time.time() - start_time + self.logs.append( + { + "timestamp": datetime.now().isoformat(), + "type": "loglikelihood_batch", + "batch_size": len(pairs), + "duration_seconds": duration, + "status": "success", + } + ) + return results + except Exception as e: + duration = time.time() - start_time + self.logs.append( + { + "timestamp": datetime.now().isoformat(), + "type": "loglikelihood_batch", + "batch_size": len(pairs), + "duration_seconds": duration, + "status": "error", + "error": str(e), + "error_type": type(e).__name__, + } + ) + raise + + def _loglikelihood_batch_impl(self, pairs: List[Tuple[str, str]]) -> List[float]: + """Default batch implementation — loops over ``_loglikelihood_impl``. + + Override in subclasses for provider-specific batching. + + Args: + pairs: List of (context, continuation) tuples. + + Returns: + List of log-likelihoods. + """ + return [self._loglikelihood_impl(ctx, cont) for ctx, cont in pairs] + + def loglikelihood_choices( + self, + context: str, + choices: List[str], + delimiter: str = " ", + ) -> List[float]: + """Compute log-likelihoods for multiple-choice continuations. + + Convenience method for MCQ evaluation. Each choice is prepended with + ``delimiter`` before scoring (e.g. ``" A"``, ``" B"``). + + Subclasses may override this for optimised shared-context scoring + (e.g. single forward pass for single-token choices). + + Args: + context: The question/prompt text. + choices: Answer choice strings (e.g. ``["A", "B", "C", "D"]``). + delimiter: String prepended to each choice (default ``" "``). + + Returns: + List of log-likelihoods, one per choice. 
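The default `loglikelihood_choices` → `loglikelihood_batch` → `_loglikelihood_impl` flow described above can be sketched with a toy scorer; the length-based scoring rule is a stand-in for a real model, and the timing/trace logging of the real base class is omitted:

```python
# Toy illustration of the default MCQ scoring path: choices are turned
# into (context, delimiter + choice) pairs and scored one by one.
from typing import List, Tuple


class ToyScorer:
    """Scores a continuation by -len(continuation): shorter = 'more likely'."""

    def _loglikelihood_impl(self, context: str, continuation: str) -> float:
        return -float(len(continuation))

    def loglikelihood_batch(self, pairs: List[Tuple[str, str]]) -> List[float]:
        # Default batch path: loop over the single-pair implementation.
        return [self._loglikelihood_impl(c, k) for c, k in pairs]

    def loglikelihood_choices(
        self, context: str, choices: List[str], delimiter: str = " "
    ) -> List[float]:
        pairs = [(context, f"{delimiter}{c}") for c in choices]
        return self.loglikelihood_batch(pairs)


scorer = ToyScorer()
choices = ["A", "BB", "CCC"]
lls = scorer.loglikelihood_choices("Q:", choices)  # [-2.0, -3.0, -4.0]
best = choices[lls.index(max(lls))]                # "A"
```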
+ """ + pairs = [(context, f"{delimiter}{c}") for c in choices] + return self.loglikelihood_batch(pairs) + + def gather_traces(self) -> Dict[str, Any]: + """Gather execution traces from this scorer. + + Output fields: + + - ``type`` - Component class name + - ``gathered_at`` - ISO timestamp + - ``model_id`` - Model identifier + - ``total_calls`` - Number of scoring calls + - ``successful_calls`` - Number of successful calls + - ``failed_calls`` - Number of failed calls + - ``total_duration_seconds`` - Total time spent in calls + - ``logs`` - List of individual call records + + Returns: + Dictionary containing scorer execution traces. + """ + total_calls = len(self.logs) + successful_calls = sum(1 for call in self.logs if call["status"] == "success") + failed_calls = total_calls - successful_calls + total_duration = sum(call["duration_seconds"] for call in self.logs) + + return { + **super().gather_traces(), + "model_id": self.model_id, + "total_calls": total_calls, + "successful_calls": successful_calls, + "failed_calls": failed_calls, + "total_duration_seconds": total_duration, + "logs": self.logs, + } + + def gather_config(self) -> Dict[str, Any]: + """Gather configuration from this scorer. + + Output fields: + + - ``type`` - Component class name + - ``gathered_at`` - ISO timestamp + - ``model_id`` - Model identifier + - ``scorer_type`` - The specific scorer class name + - ``seed`` - Seed for deterministic scoring, or None if unseeded + + Returns: + Dictionary containing scorer configuration. 
+ """ + return { + **super().gather_config(), + "model_id": self.model_id, + "scorer_type": type(self).__name__, + "seed": self._seed, + } diff --git a/maseval/core/task.py b/maseval/core/task.py index ed617943..9a7b3aca 100644 --- a/maseval/core/task.py +++ b/maseval/core/task.py @@ -5,6 +5,7 @@ from collections.abc import Sequence from typing import Iterable, List, Union, Iterator, Optional import json +import pickle from pathlib import Path from enum import Enum @@ -273,6 +274,141 @@ def __iter__(self) -> Iterator[Task]: return iter(self._tasks) +class InformativeSubsetQueue(SequentialTaskQueue): + """Evaluates an informative subset of tasks in a specified order. + + Used for efficient evaluation where a carefully selected subset of tasks + can predict performance on the full dataset. The subset is defined by + ``indices`` — integer positions into the original task list. Only tasks + at those positions are yielded, in the order given by ``indices``. + + The informativeness criterion (how the indices were chosen) is determined + by the caller or by a subclass. This base class is criterion-agnostic. + + When ``indices`` is ``None``, all tasks are yielded in their original + order (equivalent to ``SequentialTaskQueue``). + + Attributes: + _all_tasks: The complete, unfiltered task list. + _indices: The subset indices, or ``None``. + + Example: + ```python + # Evaluate only tasks at indices 0, 5, 12 + queue = InformativeSubsetQueue(tasks, indices=[0, 5, 12]) + + for task in queue: + result = execute(task) # Only 3 tasks + ``` + """ + + def __init__(self, tasks: Iterable[Task], indices: Optional[List[int]] = None) -> None: + """Initialize informative-subset task queue. + + Args: + tasks: Full list of tasks (ordered by index). + indices: Positions into ``tasks`` selecting which tasks to evaluate + and in what order. If ``None``, evaluates all tasks in order. 
+ """ + all_tasks = list(tasks) + self._all_tasks: List[Task] = all_tasks + self._indices: Optional[List[int]] = indices + + if indices is not None: + task_by_index: Dict[int, Task] = {i: task for i, task in enumerate(all_tasks)} + filtered = [task_by_index[idx] for idx in indices if idx in task_by_index] + super().__init__(filtered) + else: + super().__init__(all_tasks) + + +class DISCOQueue(InformativeSubsetQueue): + """Diversity-based informative subset using DISCO anchor points. + + Selects a diverse subset of tasks (anchor points) for evaluation. Full + benchmark performance is then predicted from results on this subset using + DISCO (Diversifying Sample Condensation for Efficient Model Evaluation). + + The informativeness criterion is **diversity**: anchor points are chosen + to maximise disagreement across models, so that a small evaluation set + captures the discriminative structure of the full benchmark. + + Reference: `DISCO: Diversifying Sample Condensation for Efficient Model + Evaluation `_ + + Example: + ```python + queue = DISCOQueue(tasks, anchor_points=[0, 5, 12]) + # or load from file: + queue = DISCOQueue(tasks, anchor_points_path="anchor_points.pkl") + + for task in queue: + result = execute(task) # Only anchor-point tasks + ``` + """ + + def __init__( + self, + tasks: Iterable[Task], + anchor_points: Optional[List[int]] = None, + anchor_points_path: Optional[Union[str, Path]] = None, + ) -> None: + """Initialize DISCO task queue. + + Anchor points can be supplied directly via ``anchor_points`` or loaded + from a file via ``anchor_points_path``. Providing both is an error. + + Args: + tasks: Full list of tasks (ordered by index). + anchor_points: Diversity-selected indices into ``tasks``. + Typically downloaded from a HuggingFace DISCO model repo. + If ``None`` and ``anchor_points_path`` is also ``None``, + evaluates all tasks in order. + anchor_points_path: Path to a ``.json`` or ``.pkl`` file + containing anchor-point indices. 
Mutually exclusive with + ``anchor_points``. + """ + if anchor_points is not None and anchor_points_path is not None: + raise ValueError("Provide either anchor_points or anchor_points_path, not both.") + + if anchor_points_path is not None: + anchor_points = self.load_anchor_points(anchor_points_path) + + self._anchor_points: Optional[List[int]] = anchor_points + super().__init__(tasks, indices=anchor_points) + + @staticmethod + def load_anchor_points(path: Union[str, Path]) -> List[int]: + """Load anchor points from a ``.json`` or ``.pkl`` file. + + Args: + path: Path to anchor points file. JSON files should contain a + list of integer indices. Pickle files may contain a list or + a numpy array. + + Returns: + List of integer anchor-point indices. + + Raises: + FileNotFoundError: If the file does not exist. + """ + path = Path(path) + if not path.exists(): + raise FileNotFoundError(f"Anchor points file not found: {path}") + + if path.suffix.lower() == ".json": + with open(path) as f: + anchor_points = json.load(f) + else: + with open(path, "rb") as f: + anchor_points = pickle.load(f) + + if hasattr(anchor_points, "tolist"): + anchor_points = anchor_points.tolist() + + return list(anchor_points) + + class PriorityTaskQueue(BaseTaskQueue): """Execute tasks ordered by priority. diff --git a/maseval/interface/inference/__init__.py b/maseval/interface/inference/__init__.py index e6765d1e..549c719b 100644 --- a/maseval/interface/inference/__init__.py +++ b/maseval/interface/inference/__init__.py @@ -1,14 +1,20 @@ -"""Inference model adapters for various providers. +"""Inference model adapters and scorers for various providers. -This package contains concrete implementations of ModelAdapter for different -inference providers. Each adapter requires the corresponding optional dependency. +This package contains concrete implementations of ``ModelAdapter`` and +``ModelScorer`` for different inference providers. 
Each adapter/scorer +requires the corresponding optional dependency. -Available adapters: - - AnthropicModelAdapter: Anthropic Claude models (requires anthropic) - - GoogleGenAIModelAdapter: Google Gemini models (requires google-genai) - - HuggingFaceModelAdapter: HuggingFace transformers (requires transformers) - - LiteLLMModelAdapter: 100+ providers via LiteLLM (requires litellm) - - OpenAIModelAdapter: OpenAI and compatible APIs (requires openai) +Available adapters (text generation): + +- ``AnthropicModelAdapter``: Anthropic Claude models (requires ``anthropic``) +- ``GoogleGenAIModelAdapter``: Google Gemini models (requires ``google-genai``) +- ``HuggingFacePipelineModelAdapter``: HuggingFace pipelines (requires ``transformers``) +- ``LiteLLMModelAdapter``: 100+ providers via LiteLLM (requires ``litellm``) +- ``OpenAIModelAdapter``: OpenAI and compatible APIs (requires ``openai``) + +Available scorers (log-likelihood): + +- ``HuggingFaceModelScorer``: HuggingFace causal LMs (requires ``transformers``) Example: ```python @@ -49,13 +55,26 @@ # Conditionally import HuggingFace adapter try: - from .huggingface import HuggingFaceModelAdapter, ToolCallingNotSupportedError # noqa: F401 + from .huggingface import ( # noqa: F401 + HuggingFacePipelineModelAdapter, + HuggingFaceModelAdapter, + ToolCallingNotSupportedError, + ) + __all__.append("HuggingFacePipelineModelAdapter") __all__.append("HuggingFaceModelAdapter") __all__.append("ToolCallingNotSupportedError") except ImportError: pass +# Conditionally import HuggingFace scorer +try: + from .huggingface_scorer import HuggingFaceModelScorer # noqa: F401 + + __all__.append("HuggingFaceModelScorer") +except ImportError: + pass + # Conditionally import LiteLLM adapter try: from .litellm import LiteLLMModelAdapter # noqa: F401 diff --git a/maseval/interface/inference/huggingface.py b/maseval/interface/inference/huggingface.py index 45fac7e8..f765eb49 100644 --- a/maseval/interface/inference/huggingface.py +++ 
b/maseval/interface/inference/huggingface.py @@ -1,7 +1,10 @@ -"""HuggingFace model adapter. +"""HuggingFace pipeline model adapter. -This adapter works with HuggingFace transformers pipelines and models. -It supports both simple callable models and full pipeline objects. +This adapter works with HuggingFace transformers pipelines and callables +for text generation via ``chat()`` and ``generate()``. + +For log-likelihood scoring (e.g. MCQ evaluation), see +``HuggingFaceModelScorer`` in ``maseval.interface.inference.huggingface_scorer``. Requires transformers to be installed: pip install maseval[transformers] @@ -9,11 +12,11 @@ Example: ```python from transformers import pipeline - from maseval.interface.inference import HuggingFaceModelAdapter + from maseval.interface.inference import HuggingFacePipelineModelAdapter # Using a pipeline pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct") - model = HuggingFaceModelAdapter(model=pipe, model_id="llama-3.1-8b") + model = HuggingFacePipelineModelAdapter(model=pipe, model_id="llama-3.1-8b") # Simple generation response = model.generate("Hello!") @@ -42,12 +45,18 @@ class ToolCallingNotSupportedError(Exception): pass -class HuggingFaceModelAdapter(ModelAdapter): - """Adapter for HuggingFace transformers models and pipelines. +class HuggingFacePipelineModelAdapter(ModelAdapter): + """Adapter for HuggingFace transformers pipelines and callables. + + Wraps a HuggingFace ``pipeline()`` object (or any text-generation callable) + for use with the ``ModelAdapter`` interface (``chat()``, ``generate()``). + + For log-likelihood scoring, see ``HuggingFaceModelScorer``. Works with: - - transformers.pipeline() objects - - Any callable that accepts a prompt and returns text + + - ``transformers.pipeline()`` objects + - Any callable that accepts a prompt and returns text For chat functionality, the adapter uses the tokenizer's chat template if available. 
This provides proper formatting for instruction-tuned models. @@ -55,8 +64,8 @@ class HuggingFaceModelAdapter(ModelAdapter): Tool calling support: Tool calling is only supported if the model's chat template explicitly supports it. If you pass tools and the model doesn't support them, - a ToolCallingNotSupportedError is raised. For reliable tool calling, - consider using LiteLLMModelAdapter instead. + a ``ToolCallingNotSupportedError`` is raised. For reliable tool calling, + consider using ``LiteLLMModelAdapter`` instead. """ def __init__( @@ -378,3 +387,7 @@ def gather_config(self) -> Dict[str, Any]: base_config["pipeline_config"] = pipeline_config return base_config + + +# Backwards compatibility alias +HuggingFaceModelAdapter = HuggingFacePipelineModelAdapter diff --git a/maseval/interface/inference/huggingface_scorer.py b/maseval/interface/inference/huggingface_scorer.py new file mode 100644 index 00000000..53d43dad --- /dev/null +++ b/maseval/interface/inference/huggingface_scorer.py @@ -0,0 +1,264 @@ +"""HuggingFace model scorer for log-likelihood evaluation. + +Wraps a raw HuggingFace ``AutoModelForCausalLM`` (not a pipeline) and +exposes ``loglikelihood()`` for scoring context-continuation pairs. Designed +for MCQ-style evaluation where the best answer is chosen by highest +log-likelihood. + +For text generation (``chat()``, ``generate()``), see +``HuggingFacePipelineModelAdapter`` in ``maseval.interface.inference.huggingface``. + +Requires transformers and torch: + pip install maseval[transformers] + +Example: + ```python + from maseval.interface.inference import HuggingFaceModelScorer + + scorer = HuggingFaceModelScorer( + model_id="meta-llama/Llama-2-7b-hf", + device="cuda:0", + ) + + # Score a single continuation + ll = scorer.loglikelihood("The capital of France is", " Paris") + + # MCQ: pick the most likely answer + logprobs = scorer.loglikelihood_choices( + context="What is 2+2? 
Answer:", + choices=["A", "B", "C", "D"], + ) + best_idx = logprobs.index(max(logprobs)) + ``` +""" + +from __future__ import annotations + +from typing import Any, Dict, List, Optional, Tuple + +from maseval.core.scorer import ModelScorer + + +class HuggingFaceModelScorer(ModelScorer): + """Log-likelihood scorer backed by a HuggingFace causal language model. + + Loads the model lazily on first use. Supports: + + - Single-token optimisation: when all continuations map to a single token, + one forward pass scores every choice. + - Multi-token fallback: separate forward pass per continuation. + - ``loglikelihood_choices()`` override that picks the optimal path + automatically. + + The tokenisation strategy matches ``lm-evaluation-harness``: context and + continuation are encoded separately, then concatenated to handle + tokenisation-boundary effects correctly. + """ + + def __init__( + self, + model_id: str, + device: str = "cuda:0", + trust_remote_code: bool = True, + seed: Optional[int] = None, + ): + """Initialize HuggingFace model scorer. + + Args: + model_id: HuggingFace model identifier + (e.g. ``"meta-llama/Llama-2-7b-hf"``). + device: Torch device string (e.g. ``"cuda:0"``, ``"cpu"``). + trust_remote_code: Trust remote code when loading the model. + seed: Seed for deterministic scoring. + """ + super().__init__(seed=seed) + self._model_id = model_id + self._device = device + self._trust_remote_code = trust_remote_code + self._model: Any = None + self._tokenizer: Any = None + + @property + def model_id(self) -> str: + return self._model_id + + # ------------------------------------------------------------------ + # Model loading + # ------------------------------------------------------------------ + + def _load_model(self) -> Tuple[Any, Any]: + """Lazy-load the model and tokenizer. + + Returns: + Tuple of (model, tokenizer). 
+ """ + if self._model is None: + from transformers import AutoModelForCausalLM, AutoTokenizer + + self._tokenizer = AutoTokenizer.from_pretrained( + self._model_id, + trust_remote_code=self._trust_remote_code, + ) + self._tokenizer.padding_side = "left" + if self._tokenizer.pad_token is None: + self._tokenizer.pad_token = self._tokenizer.eos_token + + self._model = AutoModelForCausalLM.from_pretrained( + self._model_id, + trust_remote_code=self._trust_remote_code, + torch_dtype="auto", + ) + self._model = self._model.to(self._device) + self._model.eval() + + return self._model, self._tokenizer + + # ------------------------------------------------------------------ + # Tokenisation helpers (matches lm-evaluation-harness) + # ------------------------------------------------------------------ + + def _encode_pair(self, context: str, continuation: str) -> Tuple[List[int], List[int]]: + """Encode a context-continuation pair like lm-evaluation-harness. + + 1. Encode ``whole = context + continuation`` + 2. Encode ``context`` alone + 3. ``continuation_enc = whole[len(context_enc):]`` + + Args: + context: The context/prompt string. + continuation: The continuation string. + + Returns: + Tuple of (context_enc, continuation_enc) token lists. + """ + _, tokenizer = self._load_model() + + n_spaces = len(context) - len(context.rstrip()) + if n_spaces > 0: + continuation = context[-n_spaces:] + continuation + context = context[:-n_spaces] + + whole_enc = tokenizer.encode(context + continuation, add_special_tokens=True) + context_enc = tokenizer.encode(context, add_special_tokens=True) + + continuation_enc = whole_enc[len(context_enc) :] + return context_enc, continuation_enc + + # ------------------------------------------------------------------ + # Core scoring + # ------------------------------------------------------------------ + + def _loglikelihood_impl(self, context: str, continuation: str) -> float: + """Score a single (context, continuation) pair. 
+ + Uses ``_encode_pair`` for correct tokenisation, then computes the + sum of per-token log-probabilities over the continuation. + """ + import torch + + model, _ = self._load_model() + + context_enc, continuation_enc = self._encode_pair(context, continuation) + full_sequence = context_enc + continuation_enc + input_tokens = full_sequence[:-1] + + input_ids = torch.tensor([input_tokens], dtype=torch.long, device=self._device) + + with torch.no_grad(): + logits = model(input_ids).logits[0] + inplen = len(input_tokens) + contlen = len(continuation_enc) + selected = logits[inplen - contlen : inplen] + log_probs = torch.nn.functional.log_softmax(selected, dim=-1) + + total = 0.0 + for i, token_id in enumerate(continuation_enc): + total += log_probs[i, token_id].item() + + return total + + # ------------------------------------------------------------------ + # MCQ optimisation + # ------------------------------------------------------------------ + + def loglikelihood_choices( + self, + context: str, + choices: List[str], + delimiter: str = " ", + ) -> List[float]: + """Score multiple-choice continuations with shared-context optimisation. + + When every ``delimiter + choice`` maps to a single continuation token, + all choices are scored in **one** forward pass. Otherwise falls back to + per-choice scoring via ``_loglikelihood_impl``. + + Args: + context: The question/prompt text. + choices: Answer choice strings (e.g. ``["A", "B", "C", "D"]``). + delimiter: String prepended to each choice (default ``" "``). + + Returns: + List of log-likelihoods, one per choice. 
+ """ + model, _ = self._load_model() + + continuations = [f"{delimiter}{c}" for c in choices] + encoded_continuations = [self._encode_pair(context, cont) for cont in continuations] + + all_single_token = all(len(cont_enc) == 1 for _, cont_enc in encoded_continuations) + + if all_single_token: + return self._score_single_token(context, choices, delimiter, encoded_continuations) + + return [self._loglikelihood_impl(context, cont) for cont in continuations] + + def _score_single_token( + self, + context: str, + choices: List[str], + delimiter: str, + encoded_continuations: List[Tuple[List[int], List[int]]], + ) -> List[float]: + """One-forward-pass scoring for single-token continuations.""" + import torch + + model, _ = self._load_model() + + context_enc, first_cont_enc = encoded_continuations[0] + full_sequence = context_enc + first_cont_enc + input_tokens = full_sequence[:-1] + + input_ids = torch.tensor([input_tokens], dtype=torch.long, device=self._device) + + with torch.no_grad(): + logits = model(input_ids).logits[0] + inplen = len(input_tokens) + contlen = len(first_cont_enc) + selected_logits = logits[inplen - contlen : inplen] + log_probs = torch.nn.functional.log_softmax(selected_logits, dim=-1) + + logprobs: List[float] = [] + for _, cont_enc in encoded_continuations: + total = 0.0 + for i, token_id in enumerate(cont_enc): + total += log_probs[i, token_id].item() + logprobs.append(total) + + return logprobs + + # ------------------------------------------------------------------ + # Tracing + # ------------------------------------------------------------------ + + def gather_config(self) -> Dict[str, Any]: + """Gather configuration including device and model settings. + + Returns: + Dictionary containing scorer configuration. 
+ """ + return { + **super().gather_config(), + "device": self._device, + "trust_remote_code": self._trust_remote_code, + } diff --git a/mkdocs.yml b/mkdocs.yml index 4b489f50..4ba742bf 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -110,6 +110,7 @@ nav: - Exceptions: reference/exceptions.md - History: reference/history.md - Model: reference/model.md + - Scorer: reference/scorer.md - Seeding: reference/seeding.md - Simulator: reference/simulator.md - Tasks: reference/task.md @@ -128,8 +129,10 @@ nav: - LiteLLM: interface/inference/litellm.md - OpenAI: interface/inference/openai.md - Benchmarks: + - Overview: benchmark/index.md - ConVerse: benchmark/converse.md + - GAIA2: benchmark/gaia2.md - MACS: benchmark/macs.md + - MMLU: benchmark/mmlu.md - MultiAgentBench: benchmark/multiagentbench.md - Tau2: benchmark/tau2.md - - GAIA2: benchmark/gaia2.md diff --git a/pyproject.toml b/pyproject.toml index 51227d46..59e7eb02 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -82,11 +82,22 @@ multiagentbench = [ ] tau2 = ["docstring-parser>=0.16", "addict>=2.4.0"] converse = [] +# HuggingFace model + tokenizer, default dataset download; numpy for example script and anchor-point loading. +# For exact lm-evaluation-harness reproduction (--use_lmeval_batching), also install maseval[lm-eval]. 
+mmlu = [ + "torch>=2.0.0", + "transformers>=4.37.0", + "numpy>=1.20.0", +] -# LM Evaluation Harness (for HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval) -lm-eval = ["lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main"] +# LM Evaluation Harness — requires transformers 4.x (lm-eval uses APIs removed in 5.x) +lm-eval = [ + "aiohttp>=3.9.0", + "transformers>=4.37.0,<5.0.0", + "lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main", +] -# DISCO prediction (for MMLU benchmark example) +# DISCO prediction (for MMLU benchmark example) — requires transformers 4.x via lm-eval disco = [ "aiohttp>=3.9.0", "click>=8.1.0", @@ -97,6 +108,7 @@ disco = [ "jsonlines>=4.0.0", "lm-eval @ git+https://github.com/arubique/lm-evaluation-harness.git@main", "matplotlib>=3.5.0", + "transformers>=4.37.0,<5.0.0", "scikit-learn>=1.7.2", "scipy>=1.11.0", "stnd @ git+https://github.com/arubique/stnd.git@0d23b52f7742c08b28be560d2d52d450fcd274b7", diff --git a/tests/test_core/test_exceptions.py b/tests/test_core/test_exceptions.py index 416ebb7e..1698fa61 100644 --- a/tests/test_core/test_exceptions.py +++ b/tests/test_core/test_exceptions.py @@ -14,6 +14,7 @@ AgentError, EnvironmentError, UserError, + get_with_assert, validate_argument_type, validate_required_arguments, validate_no_extra_arguments, @@ -370,6 +371,52 @@ def test_validate_arguments_from_schema_strict_mode(self): validate_arguments_from_schema({"name": "test", "extra": 1}, schema, strict=True) +@pytest.mark.core +class TestGetWithAssert: + """Tests for get_with_assert required-key lookup.""" + + def test_single_key_present(self): + """Returns value when key exists.""" + assert get_with_assert({"a": 1}, "a") == 1 + + def test_single_key_missing_raises_key_error(self): + """Raises KeyError with descriptive message when key is missing.""" + with pytest.raises(KeyError, match='Required key "x"'): + get_with_assert({"a": 1}, "x") + + def test_nested_key_access(self): + """Supports nested 
access via a list of keys.""" + data = {"level1": {"level2": {"level3": "value"}}} + assert get_with_assert(data, ["level1", "level2", "level3"]) == "value" + + def test_nested_key_missing_raises_key_error(self): + """Raises KeyError when a nested key is missing.""" + data = {"level1": {"level2": {}}} + with pytest.raises(KeyError): + get_with_assert(data, ["level1", "level2", "level3"]) + + def test_custom_error_message(self): + """Uses custom error message when provided.""" + with pytest.raises(KeyError, match="MMLU task missing query"): + get_with_assert({}, "query", error_msg="MMLU task missing query") + + def test_single_element_list_key(self): + """List with one key behaves like a single key.""" + assert get_with_assert({"a": 42}, ["a"]) == 42 + + def test_falsy_values_returned(self): + """Falsy values (0, empty string, False, None) are returned, not treated as missing.""" + assert get_with_assert({"k": 0}, "k") == 0 + assert get_with_assert({"k": ""}, "k") == "" + assert get_with_assert({"k": False}, "k") is False + assert get_with_assert({"k": None}, "k") is None + + def test_empty_key_list_raises(self): + """Empty key list triggers assertion error.""" + with pytest.raises(AssertionError): + get_with_assert({"a": 1}, []) + + class TestFilteringByErrorType: """Tests for filtering failed tasks by error type.""" diff --git a/tests/test_core/test_queue.py b/tests/test_core/test_queue.py index 9ffdd7d9..35bf1933 100644 --- a/tests/test_core/test_queue.py +++ b/tests/test_core/test_queue.py @@ -15,9 +15,21 @@ AdaptiveTaskQueue, TaskQueue, BaseTaskQueue, + InformativeSubsetQueue, + DISCOQueue, ) +class _FakeArray: + """Pickle-serializable array-like for testing .tolist() conversion.""" + + def tolist(self): + return [1, 2, 3] + + def __iter__(self): + return iter([1, 2, 3]) + + # ==================== Fixtures ==================== @@ -212,6 +224,190 @@ def test_single_task(self): assert items[0].query == "Only one" +# ==================== InformativeSubsetQueue 
Tests ====================
+
+
+@pytest.mark.core
+class TestInformativeSubsetQueue:
+    """Tests for InformativeSubsetQueue subset filtering."""
+
+    def test_filters_to_indices(self, simple_tasks):
+        """Only tasks at the given indices should be yielded."""
+        queue = InformativeSubsetQueue(simple_tasks, indices=[0, 2])
+
+        queries = [task.query for task in queue]
+
+        assert queries == ["Q1", "Q3"]
+
+    def test_preserves_index_order(self):
+        """Tasks should be yielded in the order given by indices, not original order."""
+        tasks = [Task(query=f"Q{i}") for i in range(5)]
+        queue = InformativeSubsetQueue(tasks, indices=[4, 1, 3])
+
+        queries = [task.query for task in queue]
+
+        assert queries == ["Q4", "Q1", "Q3"]
+
+    def test_none_indices_yields_all(self, simple_tasks):
+        """indices=None should yield all tasks in original order."""
+        queue = InformativeSubsetQueue(simple_tasks, indices=None)
+
+        queries = [task.query for task in queue]
+
+        assert queries == ["Q1", "Q2", "Q3"]
+
+    def test_stores_all_tasks(self, simple_tasks):
+        """_all_tasks should contain the full unfiltered list."""
+        queue = InformativeSubsetQueue(simple_tasks, indices=[0])
+
+        assert len(queue._all_tasks) == 3
+        assert len(queue) == 1
+
+    def test_out_of_range_indices_skipped(self):
+        """Indices not present in the task list should be silently skipped."""
+        tasks = [Task(query="Q0"), Task(query="Q1")]
+        queue = InformativeSubsetQueue(tasks, indices=[0, 5, 99])
+
+        queries = [task.query for task in queue]
+
+        assert queries == ["Q0"]
+
+    def test_empty_indices(self, simple_tasks):
+        """Empty indices list should yield no tasks."""
+        queue = InformativeSubsetQueue(simple_tasks, indices=[])
+
+        assert list(queue) == []
+        assert len(queue) == 0
+
+    def test_is_subclass_of_sequential(self, simple_tasks):
+        """InformativeSubsetQueue should be a SequentialTaskQueue."""
+        queue = InformativeSubsetQueue(simple_tasks)
+        assert isinstance(queue, SequentialTaskQueue)
+
+
+# ==================== DISCOQueue Tests ====================
+
+
+@pytest.mark.core
+class TestDISCOQueue:
+    """Tests for DISCOQueue diversity-based subset."""
+
+    def test_filters_to_anchor_points(self):
+        """Only tasks at anchor-point indices should be yielded."""
+        tasks = [Task(query=f"Q{i}") for i in range(10)]
+        queue = DISCOQueue(tasks, anchor_points=[2, 5, 8])
+
+        queries = [task.query for task in queue]
+
+        assert queries == ["Q2", "Q5", "Q8"]
+
+    def test_none_anchor_points_yields_all(self, simple_tasks):
+        """anchor_points=None should yield all tasks."""
+        queue = DISCOQueue(simple_tasks, anchor_points=None)
+
+        assert len(list(queue)) == 3
+
+    def test_stores_anchor_points(self):
+        """_anchor_points should be accessible."""
+        tasks = [Task(query=f"Q{i}") for i in range(5)]
+        anchor_pts = [0, 3, 4]
+        queue = DISCOQueue(tasks, anchor_points=anchor_pts)
+
+        assert queue._anchor_points == [0, 3, 4]
+
+    def test_is_subclass_of_informative_subset(self, simple_tasks):
+        """DISCOQueue should be an InformativeSubsetQueue."""
+        queue = DISCOQueue(simple_tasks)
+        assert isinstance(queue, InformativeSubsetQueue)
+
+    def test_len_matches_anchor_count(self):
+        """Queue length should match number of valid anchor points."""
+        tasks = [Task(query=f"Q{i}") for i in range(10)]
+        queue = DISCOQueue(tasks, anchor_points=[1, 3, 7])
+
+        assert len(queue) == 3
+
+
+@pytest.mark.core
+class TestDISCOQueueLoadAnchorPoints:
+    """Tests for DISCOQueue.load_anchor_points static method."""
+
+    def test_load_from_json(self, tmp_path):
+        """Should load anchor points from a JSON file."""
+        import json
+
+        path = tmp_path / "anchors.json"
+        path.write_text(json.dumps([0, 5, 12, 99]))
+
+        result = DISCOQueue.load_anchor_points(path)
+
+        assert result == [0, 5, 12, 99]
+
+    def test_load_from_pickle(self, tmp_path):
+        """Should load anchor points from a pickle file."""
+        import pickle
+
+        path = tmp_path / "anchors.pkl"
+        with open(path, "wb") as f:
+            pickle.dump([2, 7, 15], f)
+
+        result = DISCOQueue.load_anchor_points(path)
+
+        assert result == [2, 7, 15]
+
+    def test_load_converts_tolist(self, tmp_path):
+        """Should call .tolist() on array-like objects (e.g. numpy arrays)."""
+        import pickle
+
+        path = tmp_path / "anchors.pkl"
+        with open(path, "wb") as f:
+            pickle.dump(_FakeArray(), f)
+
+        result = DISCOQueue.load_anchor_points(path)
+
+        assert result == [1, 2, 3]
+
+    def test_file_not_found(self, tmp_path):
+        """Should raise FileNotFoundError for missing files."""
+        with pytest.raises(FileNotFoundError, match="not found"):
+            DISCOQueue.load_anchor_points(tmp_path / "nonexistent.json")
+
+    def test_accepts_string_path(self, tmp_path):
+        """Should accept a string path, not just Path objects."""
+        import json
+
+        path = tmp_path / "anchors.json"
+        path.write_text(json.dumps([10, 20]))
+
+        result = DISCOQueue.load_anchor_points(str(path))
+
+        assert result == [10, 20]
+
+    def test_init_with_anchor_points_path(self, tmp_path):
+        """DISCOQueue should load anchor points from file when anchor_points_path is given."""
+        import json
+
+        tasks = [Task(query=f"Q{i}") for i in range(10)]
+        path = tmp_path / "anchors.json"
+        path.write_text(json.dumps([2, 5, 8]))
+
+        queue = DISCOQueue(tasks, anchor_points_path=path)
+
+        assert len(queue) == 3
+        assert queue._anchor_points == [2, 5, 8]
+
+    def test_init_rejects_both_anchor_args(self, tmp_path):
+        """DISCOQueue should raise ValueError when both anchor_points and anchor_points_path are given."""
+        import json
+
+        tasks = [Task(query=f"Q{i}") for i in range(5)]
+        path = tmp_path / "anchors.json"
+        path.write_text(json.dumps([0, 1]))
+
+        with pytest.raises(ValueError, match="not both"):
+            DISCOQueue(tasks, anchor_points=[0, 1], anchor_points_path=path)
+
+
 # ==================== PriorityTaskQueue Tests ====================
diff --git a/tests/test_core/test_scorer.py b/tests/test_core/test_scorer.py
new file mode 100644
index 00000000..1c1570d0
--- /dev/null
+++ b/tests/test_core/test_scorer.py
@@ -0,0 +1,191 @@
+"""Tests for ModelScorer abstract base class.
+
+These tests verify that the ModelScorer ABC correctly delegates to
+subclass implementations, handles logging/tracing, and provides
+the expected batch and MCQ convenience methods.
+"""
+
+import pytest
+from typing import Dict, List, Optional, Tuple
+
+from maseval.core.scorer import ModelScorer
+
+
+class StubScorer(ModelScorer):
+    """Minimal concrete scorer for testing the ABC contract."""
+
+    def __init__(self, scores: Dict[Tuple[str, str], float], seed: Optional[int] = None):
+        super().__init__(seed=seed)
+        self._scores = scores
+        self._call_log: List[Tuple[str, str]] = []
+
+    @property
+    def model_id(self) -> str:
+        return "stub-model"
+
+    def _loglikelihood_impl(self, context: str, continuation: str) -> float:
+        self._call_log.append((context, continuation))
+        return self._scores[(context, continuation)]
+
+
+class FailingScorer(ModelScorer):
+    """Scorer that raises on every call, for error-path testing."""
+
+    @property
+    def model_id(self) -> str:
+        return "failing-model"
+
+    def _loglikelihood_impl(self, context: str, continuation: str) -> float:
+        raise ValueError("model exploded")
+
+
+pytestmark = pytest.mark.core
+
+
+class TestModelScorerLoglikelihood:
+    """Tests for single-pair loglikelihood."""
+
+    def test_delegates_to_impl(self):
+        """loglikelihood() should delegate to _loglikelihood_impl()."""
+        scorer = StubScorer({("ctx", " cont"): -1.5})
+        result = scorer.loglikelihood("ctx", " cont")
+
+        assert result == -1.5
+        assert scorer._call_log == [("ctx", " cont")]
+
+    def test_logs_success(self):
+        """Successful call should be logged."""
+        scorer = StubScorer({("a", "b"): -2.0})
+        scorer.loglikelihood("a", "b")
+
+        assert len(scorer.logs) == 1
+        assert scorer.logs[0]["status"] == "success"
+        assert scorer.logs[0]["type"] == "loglikelihood"
+        assert scorer.logs[0]["duration_seconds"] >= 0
+
+    def test_logs_error_and_reraises(self):
+        """Failed call should be logged and the exception re-raised."""
+        scorer = FailingScorer()
+
+        with pytest.raises(ValueError, match="model exploded"):
+            scorer.loglikelihood("a", "b")
+
+        assert len(scorer.logs) == 1
+        assert scorer.logs[0]["status"] == "error"
+        assert scorer.logs[0]["error_type"] == "ValueError"
+
+
+class TestModelScorerBatch:
+    """Tests for batch loglikelihood."""
+
+    def test_default_batch_loops_over_impl(self):
+        """Default _loglikelihood_batch_impl loops over _loglikelihood_impl."""
+        scores = {("q", " A"): -1.0, ("q", " B"): -2.0, ("q", " C"): -0.5}
+        scorer = StubScorer(scores)
+
+        results = scorer.loglikelihood_batch([("q", " A"), ("q", " B"), ("q", " C")])
+
+        assert results == [-1.0, -2.0, -0.5]
+        assert len(scorer._call_log) == 3
+
+    def test_batch_logs_single_entry(self):
+        """Batch call should produce one log entry (not per-pair)."""
+        scores = {("q", " A"): -1.0, ("q", " B"): -2.0}
+        scorer = StubScorer(scores)
+
+        scorer.loglikelihood_batch([("q", " A"), ("q", " B")])
+
+        assert len(scorer.logs) == 1
+        assert scorer.logs[0]["type"] == "loglikelihood_batch"
+        assert scorer.logs[0]["batch_size"] == 2
+
+    def test_empty_batch(self):
+        """Empty batch should return empty list."""
+        scorer = StubScorer({})
+        assert scorer.loglikelihood_batch([]) == []
+
+
+class TestModelScorerChoices:
+    """Tests for MCQ loglikelihood_choices."""
+
+    def test_prepends_delimiter(self):
+        """Choices should be prepended with the delimiter before scoring."""
+        scores = {("Q?", " A"): -1.0, ("Q?", " B"): -0.5, ("Q?", " C"): -2.0}
+        scorer = StubScorer(scores)
+
+        results = scorer.loglikelihood_choices("Q?", ["A", "B", "C"])
+
+        assert results == [-1.0, -0.5, -2.0]
+        assert scorer._call_log == [("Q?", " A"), ("Q?", " B"), ("Q?", " C")]
+
+    def test_custom_delimiter(self):
+        """Custom delimiter should be used instead of default space."""
+        scores = {("Q?", "\nA"): -1.0, ("Q?", "\nB"): -0.5}
+        scorer = StubScorer(scores)
+
+        results = scorer.loglikelihood_choices("Q?", ["A", "B"], delimiter="\n")
+
+        assert results == [-1.0, -0.5]
+        assert scorer._call_log == [("Q?", "\nA"), ("Q?", "\nB")]
+
+
+class TestModelScorerTracing:
+    """Tests for gather_traces and gather_config."""
+
+    def test_gather_traces_includes_call_stats(self):
+        """Traces should contain call counts and timing."""
+        scores = {("a", "b"): -1.0, ("c", "d"): -2.0}
+        scorer = StubScorer(scores)
+        scorer.loglikelihood("a", "b")
+        scorer.loglikelihood("c", "d")
+
+        traces = scorer.gather_traces()
+
+        assert traces["model_id"] == "stub-model"
+        assert traces["total_calls"] == 2
+        assert traces["successful_calls"] == 2
+        assert traces["failed_calls"] == 0
+        assert traces["total_duration_seconds"] >= 0
+        assert len(traces["logs"]) == 2
+
+    def test_gather_traces_counts_failures(self):
+        """Traces should correctly count failed calls."""
+        scorer = FailingScorer()
+        with pytest.raises(ValueError):
+            scorer.loglikelihood("a", "b")
+
+        traces = scorer.gather_traces()
+
+        assert traces["total_calls"] == 1
+        assert traces["successful_calls"] == 0
+        assert traces["failed_calls"] == 1
+
+    def test_gather_config(self):
+        """Config should include model_id, scorer_type, and seed."""
+        scorer = StubScorer({}, seed=42)
+
+        config = scorer.gather_config()
+
+        assert config["model_id"] == "stub-model"
+        assert config["scorer_type"] == "StubScorer"
+        assert config["seed"] == 42
+
+    def test_gather_config_seed_none(self):
+        """Config should report None seed when unseeded."""
+        scorer = StubScorer({})
+
+        config = scorer.gather_config()
+
+        assert config["seed"] is None
+
+
+class TestModelScorerSeed:
+    """Tests for seed property."""
+
+    def test_seed_stored(self):
+        scorer = StubScorer({}, seed=123)
+        assert scorer.seed == 123
+
+    def test_seed_default_none(self):
+        scorer = StubScorer({})
+        assert scorer.seed is None
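The MCQ convention pinned down by `TestModelScorerChoices` (prepend a delimiter to each choice, score each `(context, continuation)` pair, pick the highest loglikelihood) can be sketched standalone. `ToyScorer` below is a hypothetical stand-in with a fixed score table, mirroring `StubScorer` from the tests; maseval's actual `ModelScorer` additionally provides logging, tracing, and batching:

```python
from typing import Dict, List, Tuple


class ToyScorer:
    """Fixed-table scorer mirroring the ModelScorer call pattern (illustrative only)."""

    def __init__(self, scores: Dict[Tuple[str, str], float]):
        self._scores = scores

    def loglikelihood(self, context: str, continuation: str) -> float:
        # Look up the score for one (context, continuation) pair.
        return self._scores[(context, continuation)]

    def loglikelihood_choices(
        self, context: str, choices: List[str], delimiter: str = " "
    ) -> List[float]:
        # Prepend the delimiter to each choice before scoring, matching the
        # convention asserted in TestModelScorerChoices.
        return [self.loglikelihood(context, delimiter + c) for c in choices]


scorer = ToyScorer({("Q?", " A"): -1.0, ("Q?", " B"): -0.5, ("Q?", " C"): -2.0})
scores = scorer.loglikelihood_choices("Q?", ["A", "B", "C"])
# The predicted MCQ answer is the choice with the highest loglikelihood.
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)  # [-1.0, -0.5, -2.0] 1
```

The same selection rule is what an MMLU-style evaluator would apply over the four answer letters of each question.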