Previously, I used the simple vllm backend to test multiple models on various test sets. Recently, I discovered the updated vllm generate backend and retested some of them. For certain models I found significant score differences on some datasets, even though I used the exact same set of parameters and only changed the --model option.

For example, Qwen3vl-8b-instruct scored 59.6 on mmvu and 29.4 on scivideobench using vllm generate; however, with the simple vllm backend the scores were 57.9 and 26.2 respectively. I saw in the vllm generate comments that it handles Qwen models specially, but all the models I tested were Qwen-series models or variants fine-tuned from Qwen models, yet some of them showed little difference after switching backends. Is this normal?
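
To rule out sampling nondeterminism before blaming the backend, here is a minimal sketch using vLLM's Python API directly with greedy decoding and a fixed seed; the model id and prompt are placeholder assumptions, not taken from the original report, and a text-only prompt stands in for the actual video benchmark inputs.

```python
# Minimal sketch: run the same prompt through vLLM with deterministic
# (greedy) decoding, so that any remaining score gap between backends
# can be attributed to prompting/preprocessing rather than sampling noise.
from vllm import LLM, SamplingParams

# Assumed Hugging Face model id for the model under test.
llm = LLM(model="Qwen/Qwen3-VL-8B-Instruct")

# temperature=0.0 forces greedy decoding; seed pins any residual randomness.
params = SamplingParams(temperature=0.0, max_tokens=256, seed=42)

outputs = llm.generate(["Describe the key finding of the video."], params)
for out in outputs:
    print(out.outputs[0].text)
```

If both backends still produce different answers under these settings, the divergence likely comes from how each one builds the prompt (chat template, special handling for Qwen models) rather than from generation parameters.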