Move DISCO queue to core by arubique · Pull Request #41 · parameterlab/MASEval

arubique · 2026-03-08T11:42:25Z

Anchor point queue moves to core

AnchorPointsTaskQueue was renamed to DISCOQueue and moved to maseval/core/task.py. A class hierarchy was also created: SequentialTaskQueue → InformativeSubsetQueue → DISCOQueue, where InformativeSubsetQueue is a general base for subset selection driven by an informativeness criterion, and DISCOQueue is the concrete implementation whose criterion is diversity (loaded from HuggingFace). load_anchor_points was moved from the Data Loading section of mmlu.py to DISCOQueue.
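The hierarchy can be pictured roughly as follows. This is a minimal sketch with stand-in classes, not the code in maseval/core/task.py: the constructor arguments and the `select_indices` hook are assumptions made for illustration, and the anchor points are passed in directly instead of being loaded from HuggingFace.

```python
from typing import Iterator, List, Sequence


class SequentialTaskQueue:
    """Stand-in base: yields tasks in their original order."""

    def __init__(self, tasks: Sequence[str]) -> None:
        self._tasks = list(tasks)

    def __iter__(self) -> Iterator[str]:
        return iter(self._tasks)

    def __len__(self) -> int:
        return len(self._tasks)


class InformativeSubsetQueue(SequentialTaskQueue):
    """Stand-in base: keeps only the tasks chosen by a criterion."""

    def __init__(self, tasks: Sequence[str]) -> None:
        all_tasks = list(tasks)
        # The subclass decides which indices are informative.
        keep = self.select_indices(all_tasks)
        super().__init__([all_tasks[i] for i in keep])

    def select_indices(self, tasks: List[str]) -> List[int]:
        # Hypothetical hook; the real base class may expose a different API.
        raise NotImplementedError("subclasses must define the criterion")


class DISCOQueue(InformativeSubsetQueue):
    """Stand-in concrete queue: the criterion is a fixed list of anchor points."""

    def __init__(self, tasks: Sequence[str], anchor_points: Sequence[int]) -> None:
        # Must be set before the parent constructor calls select_indices().
        self._anchor_points = list(anchor_points)
        super().__init__(tasks)

    def select_indices(self, tasks: List[str]) -> List[int]:
        # Keep anchor indices that actually exist in the task list.
        return [i for i in self._anchor_points if 0 <= i < len(tasks)]


queue = DISCOQueue(["q0", "q1", "q2", "q3"], anchor_points=[3, 1])
selected = list(queue)
```

Because the subset queues still derive from the sequential queue in this sketch, any consumer that iterates a SequentialTaskQueue would accept a DISCOQueue unchanged.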

Avoid implementing agents in MMLUBenchmark; HuggingfaceModelAdapter → HuggingfacePipelinesModelAdapter; HuggingfaceModelEvaluationAdapter; ModelAdapterAgentAdapter

All three parts of this proposal were implemented:

HuggingFaceModelAdapter renamed to HuggingFacePipelineModelAdapter (with backwards-compat alias) in maseval/interface/inference/huggingface.py
HuggingFaceModelScorer created in maseval/interface/inference/huggingface_scorer.py — implements a new ModelScorer ABC (maseval/core/scorer.py) for log-likelihood computation, extracting all scoring logic out of the benchmark
MMLUBenchmark (base) no longer implements setup_agents() or get_model_adapter() — those are abstract, left to concrete subclasses. DefaultMMLUBenchmark uses _ScorerBackedAdapter (a simple tracing container) and delegates scoring to HuggingFaceModelScorer.
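To make the scorer split concrete, here is a minimal sketch of what a log-likelihood scoring ABC could look like. The three method names come from this PR; the signatures, the default implementations, and the toy `TableScorer` are assumptions for illustration and are not the contents of maseval/core/scorer.py.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Tuple


class ModelScorer(ABC):
    """Sketch of a scoring interface (signatures are assumed)."""

    @abstractmethod
    def loglikelihood(self, context: str, continuation: str) -> float:
        """Return log P(continuation | context)."""

    def loglikelihood_batch(self, pairs: Sequence[Tuple[str, str]]) -> List[float]:
        # Naive default: score each (context, continuation) pair in turn.
        return [self.loglikelihood(ctx, cont) for ctx, cont in pairs]

    def loglikelihood_choices(self, context: str, choices: Sequence[str]) -> List[float]:
        # Score every answer option against the same context.
        return self.loglikelihood_batch([(context, choice) for choice in choices])


class TableScorer(ModelScorer):
    """Hypothetical toy scorer backed by a fixed probability table."""

    def __init__(self, probs: Dict[str, float]) -> None:
        self._probs = probs

    def loglikelihood(self, context: str, continuation: str) -> float:
        # The context is ignored here; a real scorer would condition on it.
        return math.log(self._probs[continuation])


scorer = TableScorer({" 3": 0.1, " 4": 0.7, " 5": 0.2})
scores = scorer.loglikelihood_choices("Q: 2 + 2 = ? Answer:", [" 3", " 4", " 5"])
# The predicted option is the one with the highest log-likelihood.
predicted = max(range(len(scores)), key=scores.__getitem__)
```

With scoring behind an interface like this, a benchmark only needs to pick the highest-scoring choice and compare it with the gold label, which is the separation the refactor aims for.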

Remove silent fallbacks, e.g., task_data.get('query', '')

All .get() calls on required fields were replaced with direct dict access ([]) so missing data raises KeyError immediately. Also added get_with_assert() to maseval/core/exceptions.py as a reusable utility for required key lookups with clear error messages. Remaining .get() calls are only in places where defaults are genuinely appropriate (external JSON parsing with optional fields, defensive trace inspection on partial data).
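The difference between a silent fallback and a fail-fast lookup can be sketched as below. The helper name `get_with_assert` is from this PR, but its signature, the `context` parameter, and the choice of KeyError are assumptions; the real utility in maseval/core/exceptions.py may differ.

```python
from typing import Any, Mapping


def get_with_assert(data: Mapping[str, Any], key: str, context: str = "") -> Any:
    """Return data[key], failing with a descriptive error if the key is absent."""
    if key not in data:
        where = f" in {context}" if context else ""
        raise KeyError(
            f"Required key '{key}' is missing{where}; available keys: {sorted(data)}"
        )
    return data[key]


task_data = {"choices": ["A", "B"], "gold": 1}

# Silent fallback: the missing field turns into an empty string unnoticed.
query_silent = task_data.get("query", "")

# Fail-fast lookup: the missing field is reported immediately.
try:
    get_with_assert(task_data, "query", context="task_data")
    message = ""
except KeyError as exc:
    message = str(exc)

gold = get_with_assert(task_data, "gold")
```

Plain `task_data["query"]` would also raise KeyError; a helper of this kind only adds a message that names the container and lists the keys that were present.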

Avoid dummy methods or reimplement with pass.

Three instances were fixed:

_DummyCallable in DefaultMMLUBenchmark.get_model_adapter() — removed entirely. get_model_adapter() now raises NotImplementedError with a clear message. setup_agents() was restructured to not call get_model_adapter() at all, using _ScorerBackedAdapter directly.
MMLUBenchmark.setup_user() - removed the redundant return None override since the base class already provides that default.
ModelAgentAdapter - removed entirely (zero consumers after the refactor), cleaned from exports, docs, and changelog.
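The pattern behind the first fix, replacing a placeholder object with an explicit failure, can be illustrated as follows. This is a sketch with a stand-in class: `ScorerBackedBenchmark`, the error message, and the string used in place of a real adapter are hypothetical and do not reproduce DefaultMMLUBenchmark.

```python
from typing import List


class ScorerBackedBenchmark:
    """Stand-in for a benchmark that scores with a scorer, not a generation adapter."""

    def get_model_adapter(self) -> None:
        # No dummy callable is returned; the unsupported path fails loudly.
        raise NotImplementedError(
            "This benchmark scores with a ModelScorer; "
            "no generation adapter is available."
        )

    def setup_agents(self) -> List[str]:
        # Builds what it needs directly and never calls get_model_adapter().
        return ["scorer-backed-adapter"]


bench = ScorerBackedBenchmark()
agents = bench.setup_agents()

try:
    bench.get_model_adapter()
    raised = False
except NotImplementedError:
    raised = True
```

A caller that reaches the unsupported method by mistake gets an immediate, explainable error instead of a placeholder that fails later in a less obvious place.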

Update Benchmarks.md with a note that MMLU is implemented

Added an "Implemented" callout to the MMLU section in BENCHMARKS.md pointing users to DefaultMMLUBenchmark, pip install maseval[mmlu], and the docs page.

Add to docs an entry for this (benchmark/index.md and a new mmlu.md)

docs/benchmark/mmlu.md was created with full content (overview, installation, quick start, DISCO usage, custom subclass example, API reference with ::: directives). It's wired into mkdocs.yml nav. Docs build passes (the only warnings are pre-existing duplicate SmolAgentLLMUser references unrelated to MMLU).

Introduce an extra for mmlu with all dependencies

pyproject.toml has an mmlu extra with torch, transformers, and numpy. Separate lm-eval and disco extras exist for the optional lm-evaluation-harness reproduction and DISCO prediction workflows. The lm-eval and disco extras pin transformers<5.0.0 due to API removals in transformers 5.x. Installation instructions added to examples/mmlu_benchmark/README.md.

Description

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

I have read the CONTRIBUTING.md guide.
Commits follow "How to write a good git commit message"

Documentation

Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
Updated relevant documentation in docs/ (if applicable)
Tag github issue with this PR (if applicable)

Changelog

Added entry to CHANGELOG.md under [Unreleased] section
- Use Added section for new features
- Use Changed section for modifications to existing functionality
- Use Fixed section for bug fixes
- Use Removed section for deprecated/removed features
OR this is a documentation-only change (no changelog needed)

Example:
- Support for multi-agent tracing (PR:#123)

Architecture (if applicable)

Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies

Additional Notes

- AnchorPointsTaskQueue moved to core (maseval/core/task.py)
- MMLUBenchmark no longer implements agents
- Remove silent .get() fallbacks for required fields
- Add mmlu = [] extra to pyproject.toml
- Add MMLU entry to BENCHMARKS.md
- Update documentation with MMLU
- Update CHANGELOG.md

- Update dependencies

- Add InformativeSubsetQueue
- Rename AnchorPointsTaskQueue to DISCOQueue
- Make DISCOQueue a subclass of InformativeSubsetQueue

- Rename HuggingFaceMMLUBenchmark to DefaultMMLUBenchmark for consistency with other benchmarks

…name HuggingFaceModelAdapter

Introduce two new core abstractions and refactor the HuggingFace inference layer:
- ModelScorer (maseval.core.scorer): ABC for log-likelihood scoring, parallel to ModelAdapter for generation. Methods: loglikelihood(), loglikelihood_batch(), loglikelihood_choices().
- ModelAgentAdapter (maseval.core.agent): generic adapter wrapping any ModelAdapter as an AgentAdapter, replacing benchmark-specific wrappers like MMLUModelAgent/MMLUAgentAdapter.
- HuggingFaceModelAdapter renamed to HuggingFacePipelineModelAdapter (old name kept as backwards-compatible alias).
- HuggingFaceModelScorer (maseval.interface.inference): concrete ModelScorer backed by AutoModelForCausalLM, with single-token optimisation for MCQ evaluation. Extracted from DefaultMMLUBenchmark.
- DefaultMMLUBenchmark refactored to delegate scoring to HuggingFaceModelScorer and use ModelAgentAdapter.

- Replace all .get() calls on required fields by explicit dict lookup.

…ata access in MMLU benchmark

- Replace silent .get() fallbacks with direct dict access for required fields (choices, doc_id, gold, acc, etc.) so missing data fails fast
- Add get_with_assert utility to maseval.core.exceptions for required key lookups with clear error messages
- Remove _DummyCallable from DefaultMMLUBenchmark.get_model_adapter(); raise NotImplementedError instead since scoring uses HuggingFaceModelScorer
- Restructure DefaultMMLUBenchmark.setup_agents() to use a scorer-backed adapter directly instead of routing through get_model_adapter()
- Remove redundant MMLUBenchmark.setup_user() override (base class already returns None)
- Remove ModelAgentAdapter (no consumers) from core, exports, and docs

- Update BENCHMARKS.md

- Update links in mmlu.md

- Update mmlu and disco dependencies
- Add installation guide to mmlu example

- Update DefaultMMLUBenchmark.run_agents to pass type checks.

- Add tests for get_with_assert, ModelScorer, InformativeSubsetQueue, and DISCOQueue

- Move load_anchor_points to DISCOQueue

…queue changes

- Add missing documentation for new core components introduced alongside the MMLU benchmark: ModelScorer reference page, InformativeSubsetQueue/DISCOQueue in task reference, get_with_assert in exceptions reference, and HuggingFacePipelineModelAdapter rename in model/HuggingFace pages.
- Add mmlu extra to README install section.
- Fix grammar in MMLU docs and fill CHANGELOG PR placeholders.

- Fix SmolAgents docs to make mkdocs build --strict pass

- Fix DISCO references.

arubique added 17 commits March 8, 2026 12:40

[Move DISCO queue to core]:

c0f81b9

- Update dependencies

[Move DISCO queue to core]:

6ad80a8

- Add InformativeSubsetQueue
- Rename AnchorPointsTaskQueue to DISCOQueue
- Make DISCOQueue a subclass of InformativeSubsetQueue

[Move DISCO queue to core]:

b498ce7

- Rename HuggingFaceMMLUBenchmark to DefaultMMLUBenchmark for consistency with other benchmarks

[Move DISCO queue to core]:

079ef47

- Replace all .get() calls on required fields by explicit dict lookup.

[Move DISCO queue to core]:

dd46f1a

- Update BENCHMARKS.md

[Move DISCO queue to core]:

3779e2e

- Update links in mmlu.md

[Move DISCO queue to core]:

bf4abbb

- Update mmlu and disco dependencies
- Add installation guide to mmlu example

[Move DISCO queue to core]:

f6a5885

- Update DefaultMMLUBenchmark.run_agents to pass type checks.

[Move DISCO queue to core]:

2693197

- Add tests for get_with_assert, ModelScorer, InformativeSubsetQueue, and DISCOQueue

[Move DISCO queue to core]:

afd2cf9

- Move load_anchor_points to DISCOQueue

[Move DISCO queue to core]:

e7d15a8

- Fix SmolAgents docs to make mkdocs build --strict pass

[Move DISCO queue to core]:

3aa675e

- Fix DISCO references.

Add benchmark/index.md to mkdocs.yml to fix warning during docs building

6f5b0e2

arubique wants to merge 17 commits into parameterlab:main from arubique:adaptive_queue
