The results obtained by using vllm and vllm generate differ significantly. #913

@Fu-Fu-Fu-Fu

Description

Previously, I used the simple vllm backend to test multiple models on various test sets. Recently, I discovered the updated vllm generate and retested some models. With the same set of parameters (only the --model option was changed), I found significant differences on some datasets for certain models. For example, Qwen3vl-8b-instruct scored 59.6 on mmvu and 29.4 on scivideobench using vllm, but 57.9 and 26.2 respectively using vllm generate. I saw in the vllm generate comments that it handles Qwen models specially, but all the models I tested were either Qwen series models or variants fine-tuned from Qwen models. Yet some of these models showed little difference after the switch. Is this normal?
