Motivation
As the LM Evaluation Harness has grown and evolved, we've accumulated some complexity in our codebase. While this flexibility has been valuable for supporting a wide range of use cases, it has also created several challenges:
- Steeper learning curve: New contributors and users face a significant ramp-up when getting familiar with the evaluation pipeline, task configurations, and filter mechanisms
- Maintenance overhead: Some abstractions could be streamlined or made more explicit
- Code clarity: The current codebase has grown organically, leading to some patterns that could be more intuitive and maintainable
I'm considering several potential modifications to streamline the harness architecture:
1. Filter/Metric Pipeline Restructuring
- Create more intuitive abstractions for common post-processing patterns (e.g. creating specific filters for common use cases)
- Make the pipeline more type-explicit (e.g. Simplify ConfigurableTask.process_results() #3082)
- Address repeat handling limitations (Fix Metric Calculation for Repeats #3080)
- Make metrics explicit rather than passthrough calculations in process_results (this requires handling different argument types across task patterns); see the first sketch after this list
2. Task Definition Ergonomics
- Create simplified interfaces for conventional task formats (MMLU-style multiple choice, cloze hybrid tasks, etc.) through templating systems (tracked in Standardize Task Templates #3081), and support converting tasks from multiple-choice to generation; see the second sketch after this list
3. CI Pain Points
- CI refactor; allow setting args in config #2893 (tracked in Streamlining lm-eval Architecture #3083)
4. Documentation and Discoverability
- Make the tasks more discoverable through better organization and indexing (maybe through some hierarchical grouping)
- Provide clearer documentation for how the pieces fit together
- Improve examples and onboarding materials
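To make the "explicit metrics" point in item 1 more concrete, here is a minimal sketch of what a self-contained metric object could look like. The `ExplicitMetric` class and its fields are hypothetical illustrations, not the current lm-eval API; today the equivalent logic is an implicit passthrough inside `ConfigurableTask.process_results()`.

```python
# Hypothetical sketch of an "explicit metric" object: each metric declares
# its per-sample scoring and aggregation up front, rather than being computed
# implicitly inside process_results(). Names here are illustrative only.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ExplicitMetric:
    """Bundles a per-sample score function with an aggregation step."""
    name: str
    # per-sample score: (gold, prediction) -> float
    sample_fn: Callable[[str, str], float]
    # aggregation over per-sample scores -> float
    agg_fn: Callable[[Sequence[float]], float]
    higher_is_better: bool = True

    def compute(self, golds: Sequence[str], preds: Sequence[str]) -> float:
        scores = [self.sample_fn(g, p) for g, p in zip(golds, preds)]
        return self.agg_fn(scores)


# Example: exact match with mean aggregation.
exact_match = ExplicitMetric(
    name="exact_match",
    sample_fn=lambda gold, pred: float(gold.strip() == pred.strip()),
    agg_fn=lambda scores: sum(scores) / len(scores) if scores else 0.0,
)

if __name__ == "__main__":
    print(exact_match.compute(["Paris", "4"], ["Paris", "5"]))  # 0.5
```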
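Similarly, for the templating idea in item 2, a simplified interface for MMLU-style multiple choice might look something like the following. The `MultipleChoiceTemplate` class, its field names, and the prompt format are all assumptions for illustration, not an existing schema; the point is only how little a conventional task would need to specify.

```python
# Hypothetical sketch of a higher-level template for MMLU-style multiple
# choice: the task author supplies dataset/field names, and the template
# fills in the prompt and target plumbing.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MultipleChoiceTemplate:
    """Declarative spec for an MMLU-style multiple-choice task."""
    task_name: str
    dataset_path: str
    question_field: str = "question"
    choices_field: str = "choices"
    answer_field: str = "answer"  # index into the choices list
    letters: List[str] = field(default_factory=lambda: ["A", "B", "C", "D"])

    def doc_to_text(self, doc: Dict[str, Any]) -> str:
        # Render the question followed by lettered answer options.
        lines = [doc[self.question_field]]
        for letter, choice in zip(self.letters, doc[self.choices_field]):
            lines.append(f"{letter}. {choice}")
        lines.append("Answer:")
        return "\n".join(lines)

    def doc_to_target(self, doc: Dict[str, Any]) -> str:
        return self.letters[doc[self.answer_field]]


# Example usage with a toy document:
template = MultipleChoiceTemplate(task_name="mmlu_anatomy", dataset_path="cais/mmlu")
doc = {
    "question": "How many bones are in the adult human body?",
    "choices": ["106", "206", "306", "406"],
    "answer": 1,
}
print(template.doc_to_text(doc))
print(template.doc_to_target(doc))  # "B"
```

A template like this could also serve as the pivot point for converting a multiple-choice task to a generation task, since the prompt and target are derived from the same declarative spec.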
Feedback
I'd love to hear from the community about:
- Which areas of complexity have been most challenging in your experience?
- What aspects of the current architecture work well and should be preserved?
- Any specific pain points or use cases that should be prioritized?
- Suggestions for maintaining backward compatibility?