Motivation
As the LM Evaluation Harness has grown and evolved, we've accumulated some complexity in our codebase. While this flexibility has been valuable for supporting a wide range of use cases, it has also created several challenges:
- Steeper learning curve: New contributors and users face a significant ramp-up when getting familiar with the evaluation pipeline, task configurations, and filter mechanisms
- Maintenance overhead: Some abstractions could be streamlined or made more explicit
- Code clarity: The current codebase has grown organically, leading to some patterns that could be more intuitive and maintainable
I'm considering several potential modifications to streamline the harness architecture:
1. Filter/Metric Pipeline Restructuring
- Create more intuitive abstractions for common post-processing patterns (e.g. creating specific filters for common use cases)
- Make the pipeline more type-explicit (e.g. Simplify ConfigurableTask.process_results() #3082)
- Address repeat handling limitations (Fix Metric Calculation for Repeats #3080)
- Make metrics explicit rather than passthrough calculations in process_results (this requires handling different argument types across task patterns); see the first sketch after this list
2. Task Definition Ergonomics
- Create simplified interfaces for conventional task formats (MMLU-style multiple choice, cloze hybrid tasks, etc.) through templating systems (tracked in Standardize Task Templates #3081), and support converting tasks from multiple-choice to generation; see the second sketch after this list
3. CI Pain Points
- CI refactor; allow setting args in config #2893 (tracked in Streamlining lm-eval Architecture #3083)
4. Documentation and Discoverability
- Make the tasks more discoverable through better organization and indexing (maybe through some hierarchical grouping)
- Provide clearer documentation for how the pieces fit together
- Improve examples and onboarding materials
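To make the "explicit metrics" point in item 1 more concrete, here is a minimal sketch of what a self-contained metric object could look like. The `ExplicitMetric` class and its fields are hypothetical illustrations, not the current lm-eval API; today the equivalent logic is an implicit passthrough inside `ConfigurableTask.process_results()`.

```python
# Hypothetical sketch of an "explicit metric" object: each metric declares
# its per-sample scoring and aggregation up front, rather than being computed
# implicitly inside process_results(). Names here are illustrative only.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ExplicitMetric:
    """Bundles a per-sample score function with an aggregation step."""
    name: str
    # per-sample score: (gold, prediction) -> float
    sample_fn: Callable[[str, str], float]
    # aggregation over per-sample scores -> float
    agg_fn: Callable[[Sequence[float]], float]
    higher_is_better: bool = True

    def compute(self, golds: Sequence[str], preds: Sequence[str]) -> float:
        scores = [self.sample_fn(g, p) for g, p in zip(golds, preds)]
        return self.agg_fn(scores)


# Example: exact match with mean aggregation.
exact_match = ExplicitMetric(
    name="exact_match",
    sample_fn=lambda gold, pred: float(gold.strip() == pred.strip()),
    agg_fn=lambda scores: sum(scores) / len(scores) if scores else 0.0,
)

if __name__ == "__main__":
    print(exact_match.compute(["Paris", "4"], ["Paris", "5"]))  # 0.5
```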
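Similarly, for the templating idea in item 2, a simplified interface for MMLU-style multiple choice might look something like the following. The `MultipleChoiceTemplate` class, its field names, and the prompt format are all assumptions for illustration, not an existing schema; the point is only how little a conventional task would need to specify.

```python
# Hypothetical sketch of a higher-level template for MMLU-style multiple
# choice: the task author supplies dataset/field names, and the template
# fills in the prompt and target plumbing.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MultipleChoiceTemplate:
    """Declarative spec for an MMLU-style multiple-choice task."""
    task_name: str
    dataset_path: str
    question_field: str = "question"
    choices_field: str = "choices"
    answer_field: str = "answer"  # index into the choices list
    letters: List[str] = field(default_factory=lambda: ["A", "B", "C", "D"])

    def doc_to_text(self, doc: Dict[str, Any]) -> str:
        # Render the question followed by lettered answer options.
        lines = [doc[self.question_field]]
        for letter, choice in zip(self.letters, doc[self.choices_field]):
            lines.append(f"{letter}. {choice}")
        lines.append("Answer:")
        return "\n".join(lines)

    def doc_to_target(self, doc: Dict[str, Any]) -> str:
        return self.letters[doc[self.answer_field]]


# Example usage with a toy document:
template = MultipleChoiceTemplate(task_name="mmlu_anatomy", dataset_path="cais/mmlu")
doc = {
    "question": "How many bones are in the adult human body?",
    "choices": ["106", "206", "306", "406"],
    "answer": 1,
}
print(template.doc_to_text(doc))
print(template.doc_to_target(doc))  # "B"
```

A template like this could also serve as the pivot point for converting a multiple-choice task to a generation task, since the prompt and target are derived from the same declarative spec.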
Feedback
I'd love to hear from the community about:
- Which areas of complexity have been most challenging in your experience?
- What aspects of the current architecture work well and should be preserved?
- Any specific pain points or use cases that should be prioritized?
- Suggestions for maintaining backward compatibility?