
Move DISCO queue to core #41

Open
arubique wants to merge 17 commits into parameterlab:main from
arubique:adaptive_queue

Conversation

Contributor

@arubique arubique commented Mar 8, 2026

  1. Anchor point queue moves to core

AnchorPointsTaskQueue was renamed to DISCOQueue and moved to maseval/core/task.py. The PR also introduces a class hierarchy, SequentialTaskQueue → InformativeSubsetQueue → DISCOQueue, where InformativeSubsetQueue is a general base for subset selection driven by an informativeness criterion, and DISCOQueue is the concrete implementation whose criterion is diversity (loaded from HuggingFace). load_anchor_points was moved from the Data Loading section of mmlu.py to DISCOQueue.
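A minimal sketch of how such a three-level hierarchy could fit together. This is not the code in maseval/core/task.py: the `select_indices` hook, the constructor signatures, and the dict-shaped tasks are assumptions made for illustration, and the anchor points are passed in directly instead of being loaded from HuggingFace.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Sequence


class SequentialTaskQueue:
    """Stand-in for the base queue: yields tasks in their given order."""

    def __init__(self, tasks: Sequence[dict]) -> None:
        self._tasks: List[dict] = list(tasks)

    def __iter__(self) -> Iterator[dict]:
        return iter(self._tasks)

    def __len__(self) -> int:
        return len(self._tasks)


class InformativeSubsetQueue(SequentialTaskQueue, ABC):
    """Keeps only the tasks picked by an informativeness criterion."""

    def __init__(self, tasks: Sequence[dict]) -> None:
        all_tasks = list(tasks)
        keep = set(self.select_indices(all_tasks))
        # Preserve the original task order among the selected items.
        super().__init__([t for i, t in enumerate(all_tasks) if i in keep])

    @abstractmethod
    def select_indices(self, tasks: Sequence[dict]) -> Sequence[int]:
        """Return positions of the tasks to keep (hypothetical hook)."""


class DISCOQueue(InformativeSubsetQueue):
    """Concrete criterion: a precomputed list of anchor-point indices."""

    def __init__(self, tasks: Sequence[dict], anchor_points: Sequence[int]) -> None:
        # Assumption: indices are supplied in memory for this sketch.
        self._anchor_points = list(anchor_points)
        super().__init__(tasks)

    def select_indices(self, tasks: Sequence[dict]) -> Sequence[int]:
        # Ignore anchor indices that fall outside the task list.
        return [i for i in self._anchor_points if 0 <= i < len(tasks)]


tasks = [{"doc_id": i} for i in range(5)]
queue = DISCOQueue(tasks, anchor_points=[0, 3])
print([t["doc_id"] for t in queue])  # [0, 3]
```

Under this layout, a different informativeness criterion would only need a new `select_indices` implementation; iteration stays in the sequential base.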

  1. Avoid implementing agents in MMLUBenchmark; HuggingfaceModelAdapter → HuggingfacePipelinesModelAdapter; HuggingfaceModelEvaluationAdapter; ModelAdapterAgentAdapter

All three parts of this proposal were implemented:

  • HuggingFaceModelAdapter renamed to HuggingFacePipelineModelAdapter (with backwards-compat alias) in maseval/interface/inference/huggingface.py
  • HuggingFaceModelScorer created in maseval/interface/inference/huggingface_scorer.py — implements a new ModelScorer ABC (maseval/core/scorer.py) for log-likelihood computation, extracting all scoring logic out of the benchmark
  • MMLUBenchmark (base) no longer implements setup_agents() or get_model_adapter() — those are abstract, left to concrete subclasses. DefaultMMLUBenchmark uses _ScorerBackedAdapter (a simple tracing container) and delegates scoring to HuggingFaceModelScorer.
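For orientation, a sketch of what a scorer abstraction of this shape might look like. The method names follow the commit message (loglikelihood, loglikelihood_batch, loglikelihood_choices); the argument names, return types, and default implementations are assumptions, and `LengthPenaltyScorer` is a toy stand-in, not HuggingFaceModelScorer.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class ModelScorer(ABC):
    """Illustrative ABC for log-likelihood scoring (signatures assumed)."""

    @abstractmethod
    def loglikelihood(self, context: str, continuation: str) -> float:
        """Score one continuation given a context."""

    def loglikelihood_batch(self, pairs: Sequence[Tuple[str, str]]) -> List[float]:
        # Default: score each (context, continuation) pair independently.
        return [self.loglikelihood(ctx, cont) for ctx, cont in pairs]

    def loglikelihood_choices(self, context: str, choices: Sequence[str]) -> List[float]:
        # Score every answer choice against the same context.
        return self.loglikelihood_batch([(context, c) for c in choices])


class LengthPenaltyScorer(ModelScorer):
    """Toy scorer with no model behind it: shorter continuations score higher."""

    def loglikelihood(self, context: str, continuation: str) -> float:
        return -float(len(continuation))


scorer = LengthPenaltyScorer()
choices = [" Paris", " London", " Rome"]
scores = scorer.loglikelihood_choices("The capital of Italy is", choices)
best = max(range(len(choices)), key=scores.__getitem__)
print(choices[best])  # " Rome" (the shortest choice under this toy scorer)
```

With scoring behind an interface like this, a benchmark only needs the index of the highest-scoring choice and does not have to know how the log-likelihoods are produced.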
  1. Remove silent fallbacks, e.g., task_data.get('query', '')

All .get() calls on required fields were replaced with direct dict access ([]) so missing data raises KeyError immediately. Also added get_with_assert() to maseval/core/exceptions.py as a reusable utility for required key lookups with clear error messages. Remaining .get() calls are only in places where defaults are genuinely appropriate (external JSON parsing with optional fields, defensive trace inspection on partial data).
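To illustrate the difference, a sketch of a required-key helper next to a silent fallback. The signature and the error message here are hypothetical; the actual get_with_assert() in maseval/core/exceptions.py may take different arguments or raise a different exception type.

```python
from typing import Any, Mapping


def get_with_assert(data: Mapping[str, Any], key: str, context: str = "") -> Any:
    """Return data[key]; fail loudly if the key is absent (hypothetical signature)."""
    if key not in data:
        where = f" in {context}" if context else ""
        raise KeyError(
            f"Required key {key!r} missing{where}; available keys: {sorted(data)}"
        )
    return data[key]


task_data = {"choices": ["A", "B"], "gold": 1}

# Silent fallback: the missing field becomes an empty string and the
# problem only surfaces later, if at all.
silent = task_data.get("query", "")

# Fail-fast lookup: the missing field is reported at the point of access.
try:
    get_with_assert(task_data, "query", context="MMLU task")
except KeyError as err:
    message = str(err)
    print(message)
```

The helper adds context and the list of available keys to the error, which a bare `task_data["query"]` would not provide.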

  1. Avoid dummy methods or reimplement with pass.

The following instances were fixed:

  • _DummyCallable in DefaultMMLUBenchmark.get_model_adapter() — removed entirely. get_model_adapter() now raises NotImplementedError with a clear message. setup_agents() was restructured to not call get_model_adapter() at all, using _ScorerBackedAdapter directly.
  • MMLUBenchmark.setup_user() - removed the redundant return None override since the base class already provides that default.
  • ModelAgentAdapter - removed entirely (zero consumers after the refactor), cleaned from exports, docs, and changelog.
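The first fix follows a common pattern, sketched below with hypothetical names: the unsupported method raises NotImplementedError with an explanation instead of returning a dummy object, and agent setup no longer goes through it. `ScorerBackedBenchmark` and the dict-shaped agent are illustrative stand-ins, not the classes in this PR.

```python
class ScorerBackedBenchmark:
    """Hypothetical benchmark that evaluates via a scorer, not a generation adapter."""

    def __init__(self, scorer: object) -> None:
        self.scorer = scorer

    def get_model_adapter(self):
        # No placeholder callable: state clearly that this path is unsupported.
        raise NotImplementedError(
            "This benchmark scores with a ModelScorer; "
            "no generation adapter is available."
        )

    def setup_agents(self) -> list:
        # Built directly from the scorer; never calls get_model_adapter().
        return [{"name": "scorer_agent", "scorer": self.scorer}]


bench = ScorerBackedBenchmark(scorer=object())
agents = bench.setup_agents()

try:
    bench.get_model_adapter()
except NotImplementedError as err:
    reason = str(err)
    print(reason)
```

A caller that mistakenly requests a generation adapter gets an immediate, descriptive error instead of a dummy object that fails later.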
  1. Update Benchmarks.md with a note that MMLU is implemented

Added an "Implemented" callout to the MMLU section in BENCHMARKS.md pointing users to DefaultMMLUBenchmark, pip install maseval[mmlu], and the docs page.

  1. Add to docs an entry for this (benchmark/index.md and a new mmlu.md)

docs/benchmark/mmlu.md was created with full content (overview, installation, quick start, DISCO usage, custom subclass example, API reference with ::: directives). It's wired into mkdocs.yml nav. Docs build passes (the only warnings are pre-existing duplicate SmolAgentLLMUser references unrelated to MMLU).

  1. Introduce an extra for mmlu with all dependencies

pyproject.toml has an mmlu extra with torch, transformers, and numpy. Separate lm-eval and disco extras exist for the optional lm-evaluation-harness reproduction and DISCO prediction workflows. The lm-eval and disco extras pin transformers<5.0.0 due to API removals in transformers 5.x. Installation instructions added to examples/mmlu_benchmark/README.md.

Description

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

Documentation

  • Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
  • Updated relevant documentation in docs/ (if applicable)
  • Tag GitHub issue with this PR (if applicable)

Changelog

  • Added entry to CHANGELOG.md under [Unreleased] section
    • Use Added section for new features
    • Use Changed section for modifications to existing functionality
    • Use Fixed section for bug fixes
    • Use Removed section for deprecated/removed features
  • OR this is a documentation-only change (no changelog needed)

Example:
- Support for multi-agent tracing (PR:#123)

Architecture (if applicable)

  • Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
  • Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies

Additional Notes

arubique added 17 commits March 8, 2026 12:40
- AnchorPointsTaskQueue moved to core (maseval/core/task.py)
- MMLUBenchmark no longer implements agents
- Remove silent .get() fallbacks for required fields
- Add mmlu = [] extra to pyproject.toml
- Add MMLU entry to BENCHMARKS.md
- Update documentation with MMLU
- Update CHANGELOG.md
- Update dependencies
- Add InformativeSubsetQueue
- Rename AnchorPointsTaskQueue to DISCOQueue
- Make DISCOQueue a subclass of InformativeSubsetQueue
- Rename HuggingFaceMMLUBenchmark to DefaultMMLUBenchmark for consistency with other benchmarks
…name HuggingFaceModelAdapter

Introduce two new core abstractions and refactor the HuggingFace inference layer:
- ModelScorer (maseval.core.scorer): ABC for log-likelihood scoring,
  parallel to ModelAdapter for generation. Methods: loglikelihood(),
  loglikelihood_batch(), loglikelihood_choices().
- ModelAgentAdapter (maseval.core.agent): generic adapter wrapping any
  ModelAdapter as an AgentAdapter, replacing benchmark-specific wrappers
  like MMLUModelAgent/MMLUAgentAdapter.
- HuggingFaceModelAdapter renamed to HuggingFacePipelineModelAdapter
  (old name kept as backwards-compatible alias).
- HuggingFaceModelScorer (maseval.interface.inference): concrete
  ModelScorer backed by AutoModelForCausalLM, with single-token
  optimisation for MCQ evaluation. Extracted from DefaultMMLUBenchmark.
- DefaultMMLUBenchmark refactored to delegate scoring to
  HuggingFaceModelScorer and use ModelAgentAdapter.
- Replace all .get() calls on required fields by explicit dict lookup.
…ata access in MMLU benchmark

- Replace silent .get() fallbacks with direct dict access for required
  fields (choices, doc_id, gold, acc, etc.) so missing data fails fast
- Add get_with_assert utility to maseval.core.exceptions for required
  key lookups with clear error messages
- Remove _DummyCallable from DefaultMMLUBenchmark.get_model_adapter();
  raise NotImplementedError instead since scoring uses HuggingFaceModelScorer
- Restructure DefaultMMLUBenchmark.setup_agents() to use a scorer-backed
  adapter directly instead of routing through get_model_adapter()
- Remove redundant MMLUBenchmark.setup_user() override (base class
  already returns None)
- Remove ModelAgentAdapter (no consumers) from core, exports, and docs
- Update BENCHMARKS.md
- Update links in mmlu.md
- Update mmlu and disco dependencies
- Add installation guide to mmlu example
- Update DefaultMMLUBenchmark.run_agents to pass type checks.
- Add tests for get_with_assert, ModelScorer, InformativeSubsetQueue, and DISCOQueue
- Move load_anchor_points to DISCOQueue
…queue changes

- Add missing documentation for new core components introduced
  alongside the MMLU benchmark: ModelScorer reference page,
  InformativeSubsetQueue/DISCOQueue in task reference, get_with_assert
  in exceptions reference, and HuggingFacePipelineModelAdapter rename
  in model/HuggingFace pages.
- Add mmlu extra to README install section.
- Fix grammar in MMLU docs and fill CHANGELOG PR placeholders.
- Fix SmolAgents docs to make mkdocs build --strict pass
- Fix DISCO references.