[Feat] LMMS-Eval 0.4 #721
Conversation
…at mode (#692)
* Update deps
* Restructured
* Delete models
* Remove deprecated models
* Set up auto doc to messages and chat models
* Lint
* Allow force simple mode
* Add auto doc to messages for audio and video
* Fix lint
* Init server structure
* Restructure to server folder
* Clean base and providers
* Add clean method for models
* Fix loggers save result
* Fix dummy server error
* Suppress llava warnings
* Sample evaluator on llava in the wild
* Update mmmu doc to messages
* Update version
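As a rough illustration of the chat/messages format these commits move toward, here is a minimal sketch of what a `doc_to_messages` hook could look like. The function name comes from the commit list above, but the message schema and field names are assumptions modeled on OpenAI-style chat messages, not the exact lmms-eval API.

```python
# Hypothetical doc_to_messages hook for the chat format; schema assumed.
def doc_to_messages(doc: dict) -> list[dict]:
    """Convert a raw dataset sample into a chat-style message list."""
    return [
        {
            "role": "user",
            "content": [
                # Interleave the sample's media and its question text.
                {"type": "image", "url": doc["image"]},
                {"type": "text", "text": doc["question"]},
            ],
        }
    ]
```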
Force-pushed from e591de0 to abc7ad6.
… protocols
Add AsyncAzureOpenAIProvider implementation and update provider factory
Refactor sample saving in EvaluationTracker to use cleaned data and improve logging
Add llm_as_judge_eval metric to multiple tasks and integrate llm_judge API for evaluation
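A minimal sketch of the provider-factory pattern this commit describes, assuming the official `openai` Python client. `AsyncAzureOpenAIProvider` and the factory are named in the commit message, but the constructor and method signatures below are illustrative, not the repo's exact interface.

```python
# Sketch only: signatures here are assumptions, not the actual API.
from openai import AsyncAzureOpenAI


class AsyncAzureOpenAIProvider:
    def __init__(self, endpoint: str, api_key: str, api_version: str):
        self.client = AsyncAzureOpenAI(
            azure_endpoint=endpoint, api_key=api_key, api_version=api_version
        )

    async def generate(self, model: str, messages: list[dict]) -> str:
        # Forward a chat request to the Azure-hosted deployment.
        response = await self.client.chat.completions.create(
            model=model, messages=messages
        )
        return response.choices[0].message.content


def create_provider(name: str, **kwargs):
    """Look up and instantiate a provider by name (factory entry point)."""
    providers = {"azure_openai": AsyncAzureOpenAIProvider}
    if name not in providers:
        raise ValueError(f"Unknown provider: {name}")
    return providers[name](**kwargs)
```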
…generation and evaluation, enhancing API integration and error handling. Update MMBench_Evaluator to streamline API key handling based on environment variables.
…d clarity and efficiency. Update MathVerseEvaluator to streamline answer scoring by eliminating unnecessary extraction steps and enhance evaluation prompts. Remove deprecated metrics from configuration files.
…d response generation and evaluation. Streamline API configuration and error handling by removing direct API key management and utilizing a custom server configuration for requests.
… 'llm_as_judge_eval' across multiple YAML files and adjust the result processing function accordingly. This change aligns with the integration of the llm_judge server for enhanced evaluation metrics.
… evaluation. Introduce 'olympiadbench_OE_MM_maths_en_COMP.yaml' and 'olympiadbench_OE_MM_physics_en_COMP.yaml' files, while removing outdated English and Chinese test configurations. Update evaluation metrics to utilize 'llm_as_judge_eval' for consistency across tasks.
…odel. Introduced `parse_reasoning_model_answer` to clean model responses and updated answer processing in the Qwen2_5_VL class to utilize this new function, enhancing response clarity and logging.
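The function below is a minimal sketch of what `parse_reasoning_model_answer` might do, assuming the reasoning model wraps its chain of thought in `<think>` tags; the tag name and the fallback regex are assumptions, not the actual implementation.

```python
import re


def parse_reasoning_model_answer(response: str) -> str:
    """Strip chain-of-thought markup and return only the final answer.

    Minimal sketch; tag names and fallback behavior are assumptions.
    """
    # Drop any <think>...</think> reasoning block the model emitted.
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # If the model labels its conclusion, keep only the text after it.
    match = re.search(
        r"(?:final answer|answer)\s*[:：]\s*(.+)",
        cleaned,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if match:
        cleaned = match.group(1)
    return cleaned.strip()
```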
…m 'answer' to 'final_answer' for improved clarity in response generation.
…pdated the return format to include question, response, and ground truth for improved evaluation context. Simplified judge result determination logic.
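A hedged sketch of a `process_results` hook with the return format this commit describes (question, response, and ground truth keyed under the judge metric). The `llm_as_judge_eval` key matches the metric named in these commits; the dataset field names are assumptions.

```python
# Illustrative process_results hook; doc field names are assumptions.
def process_results(doc: dict, results: list[str]) -> dict:
    response = results[0].strip()
    return {
        "llm_as_judge_eval": {
            "question": doc["question"],    # assumed dataset field
            "response": response,
            "ground_truth": doc["answer"],  # assumed dataset field
        }
    }
```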
…get' from 'answer' to 'final_answer' for improved clarity in response generation.
…get' from 'answer' to 'final_answer' for consistency with recent configuration updates and improved clarity in response generation.
* add mmvu task
* fix linting videomathqa
* fix mmvu to use llm judge
* add visualwebbench task
… from 'lm_eval' to 'lmms_eval'. Update task configurations and evaluation metrics accordingly for consistency across the project.
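To make the rename concrete, a hypothetical before/after import; the module path is illustrative, not a specific file from this diff.

```python
# Hypothetical illustration of the lm_eval -> lmms_eval namespace rename:
# before: from lm_eval.api.task import Task
# after:
from lmms_eval.api.task import Task  # assumed module path, for illustration
```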
…with 'max_new_tokens' for consistency across YAML files and documentation. This change aligns with recent updates in the generation parameters for improved clarity in model behavior.
@kcz358 I think we should write markdown documentation about the change in …
… practices, and error resolution strategies for the codebase.
…ipping whitespace from the score string before processing.
…ty files to enhance response handling by replacing Request object usage with direct server method calls for text generation across multiple evaluation tasks.
…ng direct server method calls with Request object usage. Update server configuration in multiple utility files to enhance response handling and streamline evaluation processes.
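Since these two commits move between direct server calls and Request-object usage, here is a rough sketch of the Request-object style; the dataclass fields and the `generate` signature are assumptions, not lmms_eval's actual interface.

```python
from dataclasses import dataclass


# Illustrative Request object; fields are assumptions, not the real schema.
@dataclass
class Request:
    prompt: str
    max_new_tokens: int = 1024
    temperature: float = 0.0


def run_generation(server, request: Request) -> str:
    # Every task hands the server one structured object, so call sites
    # stay uniform instead of passing loose keyword arguments.
    return server.generate(request)
```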
…reamline configuration. Remove redundant default settings and ensure proper handling of sampling parameters based on the do_sample flag. Update multiple YAML task files to increase max_new_tokens and comment out temperature settings for clarity. Introduce new YAML configuration for MMMU validation reasoning task.
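A sketch of the `do_sample`-gated parameter handling this commit describes; the key names follow common Hugging Face generation kwargs and are assumptions about this codebase.

```python
# Sampling knobs are only forwarded when do_sample is enabled; key names
# are assumed HF-style conventions, not the repo's exact configuration.
def build_generation_kwargs(task_cfg: dict) -> dict:
    kwargs = {"max_new_tokens": task_cfg.get("max_new_tokens", 1024)}
    if task_cfg.get("do_sample", False):
        kwargs["do_sample"] = True
        kwargs["temperature"] = task_cfg.get("temperature", 1.0)
        kwargs["top_p"] = task_cfg.get("top_p", 1.0)
    else:
        # Greedy decoding: temperature/top_p are omitted, not defaulted.
        kwargs["do_sample"] = False
    return kwargs
```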
…handling and validation. Implement robust regex patterns for score extraction, ensuring all components are accounted for and scores are clamped within valid ranges. Add logging for better traceability of errors and fallback mechanisms for invalid inputs in the mia_bench evaluation process.
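The commit describes layered regex extraction with clamping, logging, and a fallback for invalid inputs. The sketch below shows that pattern; the specific regexes, score range, and fallback value are assumptions, not mia_bench's exact implementation.

```python
import logging
import re

logger = logging.getLogger(__name__)


def extract_score(judge_output: str, min_score: float = 0.0, max_score: float = 10.0) -> float:
    """Pull a numeric score out of a judge response and clamp it to range.

    Illustrative only; the real patterns and range may differ.
    """
    patterns = [
        r"score\s*[:=]\s*(-?\d+(?:\.\d+)?)",  # e.g. "Score: 7.5"
        r"\b(-?\d+(?:\.\d+)?)\s*/\s*10\b",    # e.g. "7.5/10"
        r"(-?\d+(?:\.\d+)?)",                  # bare-number fallback
    ]
    for pattern in patterns:
        match = re.search(pattern, judge_output, flags=re.IGNORECASE)
        if match:
            score = float(match.group(1))
            # Clamp out-of-range values instead of failing the sample.
            return max(min_score, min(max_score, score))
    logger.warning("Could not parse a score from judge output; falling back to %s", min_score)
    return min_score
```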
…ion logic. Remove redundant checks for tensor parallelism and streamline generation parameter settings by eliminating unused temperature and top_p configurations.
…ontent. Remove unused distributed executor backend parameter for cleaner execution logic.
Corrected "27.8/16.40" to "27.8/26.40" in the performance comparison table. Also corrected "16.78/13.82" to "16.78/15.82" in the performance comparison table.
Before you open a pull request, please check whether a similar issue already exists or has been closed before.
When you open a pull request, please be sure to include the following:
If you encounter lint warnings, you can use the following scripts to reformat the code.
Thank you for your contributions!