Open
- AnchorPointsTaskQueue moved to core (maseval/core/task.py)
- MMLUBenchmark no longer implements agents
- Remove silent .get() fallbacks for required fields
- Add mmlu = [] extra to pyproject.toml
- Add MMLU entry to BENCHMARKS.md
- Update documentation with MMLU
- Update CHANGELOG.md
- Update dependencies
- Add InformativeSubsetQueue
- Rename AnchorPointsTaskQueue to DISCOQueue
- Make DISCOQueue a subclass of InformativeSubsetQueue
- Rename HuggingFaceMMLUBenchmark to DefaultMMLUBenchmark for consistency with other benchmarks
…name HuggingFaceModelAdapter
Introduce two new core abstractions and refactor the HuggingFace inference layer:
- ModelScorer (maseval.core.scorer): ABC for log-likelihood scoring, parallel to ModelAdapter for generation. Methods: loglikelihood(), loglikelihood_batch(), loglikelihood_choices().
- ModelAgentAdapter (maseval.core.agent): generic adapter wrapping any ModelAdapter as an AgentAdapter, replacing benchmark-specific wrappers like MMLUModelAgent/MMLUAgentAdapter.
- HuggingFaceModelAdapter renamed to HuggingFacePipelineModelAdapter (old name kept as backwards-compatible alias).
- HuggingFaceModelScorer (maseval.interface.inference): concrete ModelScorer backed by AutoModelForCausalLM, with single-token optimisation for MCQ evaluation. Extracted from DefaultMMLUBenchmark.
- DefaultMMLUBenchmark refactored to delegate scoring to HuggingFaceModelScorer and use ModelAgentAdapter.
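To make the ModelScorer idea concrete, here is a minimal sketch of how such an ABC could be laid out. Only the three method names come from the commit message; the argument lists, return types, default implementations, and the toy `LengthPenaltyScorer` are assumptions for illustration and may not match maseval.core.scorer.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class ModelScorer(ABC):
    """Illustrative stand-in; the real maseval signatures may differ."""

    @abstractmethod
    def loglikelihood(self, context: str, continuation: str) -> float:
        """Return log P(continuation | context)."""

    def loglikelihood_batch(self, pairs: Sequence[Tuple[str, str]]) -> List[float]:
        # Assumed default: score one pair at a time. A real backend would
        # likely override this with true batched inference.
        return [self.loglikelihood(ctx, cont) for ctx, cont in pairs]

    def loglikelihood_choices(self, context: str, choices: Sequence[str]) -> List[float]:
        # Score every answer option against the same prompt (the MCQ case).
        return self.loglikelihood_batch([(context, c) for c in choices])


class LengthPenaltyScorer(ModelScorer):
    """Hypothetical toy scorer: shorter continuations score higher."""

    def loglikelihood(self, context: str, continuation: str) -> float:
        return -float(len(continuation))


scorer = LengthPenaltyScorer()
scores = scorer.loglikelihood_choices("Q: 2 + 2 = ?\nAnswer:", [" A", " BB", " CCC", " DDDD"])
# The predicted option is the index with the highest log-likelihood.
predicted = max(range(len(scores)), key=scores.__getitem__)
```

With this shape, a benchmark only needs `loglikelihood_choices()` and an argmax to pick an answer, which is presumably why the scoring logic could be pulled out of DefaultMMLUBenchmark.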
- Replace all .get() calls on required fields by explicit dict lookup.
…ata access in MMLU benchmark
- Replace silent .get() fallbacks with direct dict access for required fields (choices, doc_id, gold, acc, etc.) so missing data fails fast
- Add get_with_assert utility to maseval.core.exceptions for required key lookups with clear error messages
- Remove _DummyCallable from DefaultMMLUBenchmark.get_model_adapter(); raise NotImplementedError instead since scoring uses HuggingFaceModelScorer
- Restructure DefaultMMLUBenchmark.setup_agents() to use a scorer-backed adapter directly instead of routing through get_model_adapter()
- Remove redundant MMLUBenchmark.setup_user() override (base class already returns None)
- Remove ModelAgentAdapter (no consumers) from core, exports, and docs
- Update BENCHMARKS.md
- Update links in mmlu.md
- Update mmlu and disco dependencies
- Add installation guide to mmlu example
- Update DefaultMMLUBenchmark.run_agents to pass type checks.
- Add tests for get_with_assert, ModelScorer, InformativeSubsetQueue, and DISCOQueue
- Move load_anchor_points to DISCOQueue
…queue changes
- Add missing documentation for new core components introduced alongside the MMLU benchmark: ModelScorer reference page, InformativeSubsetQueue/DISCOQueue in task reference, get_with_assert in exceptions reference, and HuggingFacePipelineModelAdapter rename in model/HuggingFace pages.
- Add mmlu extra to README install section.
- Fix grammar in MMLU docs and fill CHANGELOG PR placeholders.
- Fix SmolAgents docs to make mkdocs build --strict pass
- Fix DISCO references.
AnchorPointsTaskQueue was renamed to DISCOQueue and moved to maseval/core/task.py. This also introduced a class hierarchy: SequentialTaskQueue → InformativeSubsetQueue → DISCOQueue, where InformativeSubsetQueue is a general base for subset selection driven by an informativeness criterion, and DISCOQueue is the concrete implementation whose criterion is diversity (loaded from HuggingFace). load_anchor_points was moved from the Data Loading section of mmlu.py to DISCOQueue.
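The hierarchy described above can be sketched roughly as follows. The class names are taken from this PR, but everything else is an assumption: the constructor arguments, the use of plain dicts as tasks, and the choice to yield tasks in the order of the supplied indices are all hypothetical, and the anchor points are passed in directly rather than loaded from HuggingFace.

```python
from typing import Any, Dict, Iterator, List, Sequence

Task = Dict[str, Any]  # assumption: tasks modelled as plain dicts


class SequentialTaskQueue:
    """Stand-in for the base queue: yields tasks in the order given."""

    def __init__(self, tasks: Sequence[Task]) -> None:
        self._tasks: List[Task] = list(tasks)

    def __iter__(self) -> Iterator[Task]:
        return iter(self._tasks)

    def __len__(self) -> int:
        return len(self._tasks)


class InformativeSubsetQueue(SequentialTaskQueue):
    """Keeps only the tasks at the selected indices (hypothetical API)."""

    def __init__(self, tasks: Sequence[Task], indices: Sequence[int]) -> None:
        all_tasks = list(tasks)
        # Direct indexing so an out-of-range index raises IndexError
        # instead of being skipped silently.
        super().__init__([all_tasks[i] for i in indices])


class DISCOQueue(InformativeSubsetQueue):
    """Subset chosen by precomputed anchor points (hypothetical API)."""

    def __init__(self, tasks: Sequence[Task], anchor_points: Sequence[int]) -> None:
        super().__init__(tasks, indices=anchor_points)


tasks = [{"doc_id": i} for i in range(6)]
queue = DISCOQueue(tasks, anchor_points=[4, 0, 2])
selected = [task["doc_id"] for task in queue]
```

Keeping the selection logic in InformativeSubsetQueue means another criterion only needs to supply its own indices, while iteration stays inherited from the sequential base.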
All three parts of this proposal were implemented:
All .get() calls on required fields were replaced with direct dict access ([]) so missing data raises KeyError immediately. Also added get_with_assert() to maseval/core/exceptions.py as a reusable utility for required key lookups with clear error messages. Remaining .get() calls are only in places where defaults are genuinely appropriate (external JSON parsing with optional fields, defensive trace inspection on partial data).
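As a small illustration of the difference, the sketch below contrasts a silent `.get()` fallback with direct access and with a required-key helper. The `get_with_assert` signature, the exception type, and the message format shown here are assumptions; the actual utility in maseval/core/exceptions.py may look different.

```python
from typing import Any, Mapping


def get_with_assert(data: Mapping[str, Any], key: str, context: str = "record") -> Any:
    """Hypothetical signature: return data[key] or raise a descriptive KeyError."""
    if key not in data:
        raise KeyError(
            f"Required key {key!r} missing from {context}; available keys: {sorted(data)}"
        )
    return data[key]


doc = {"doc_id": 7, "choices": ["A", "B", "C", "D"]}

# Silent fallback: the missing label is replaced by 0 and nothing fails.
silent = doc.get("gold", 0)

# Direct access: fails immediately, but the message is only the key name.
try:
    doc["gold"]
except KeyError as exc:
    direct_error = exc

# Helper: fails immediately and says where the key was expected.
try:
    get_with_assert(doc, "gold", context="MMLU doc")
except KeyError as exc:
    helper_error = exc
```

The silent variant would quietly treat every unlabeled document as having gold answer 0, which is the kind of error the change is meant to surface.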
Two instances were fixed:
Added an "Implemented" callout to the MMLU section in BENCHMARKS.md pointing users to DefaultMMLUBenchmark, pip install maseval[mmlu], and the docs page.
docs/benchmark/mmlu.md was created with full content (overview, installation, quick start, DISCO usage, custom subclass example, API reference with ::: directives). It's wired into mkdocs.yml nav. Docs build passes (the only warnings are pre-existing duplicate SmolAgentLLMUser references unrelated to MMLU).
pyproject.toml has an mmlu extra with torch, transformers, and numpy. Separate lm-eval and disco extras exist for the optional lm-evaluation-harness reproduction and DISCO prediction workflows. The lm-eval and disco extras pin transformers<5.0.0 due to API removals in transformers 5.x. Installation instructions added to examples/mmlu_benchmark/README.md.
Description
Type of Change
Checklist
Contribution
Documentation
docs/ (if applicable)
Changelog
CHANGELOG.md under [Unreleased] section
Example:
- Support for multi-agent tracing (PR: #123)
Architecture (if applicable)
maseval/core/ do NOT import from maseval/interface/
Additional Notes