Checklist
Motivation
https://huggingface.co/microsoft/VibeVoice-1.5B
https://github.com/microsoft/VibeVoice
This model is an autoregressive architecture with multimodal inputs and outputs; it combines a diffusion head with a VAE to handle continuous (non-text) data.
The design provides a general-purpose interface that unifies multimodal generation and understanding.
VibeVoice has become highly popular, and many users are interested in deploying it as an API backend.
If SGLang could support this model, it would be extremely valuable for the community.
Difficulty: Transformers need to handle not only discrete text tokens but also continuous multimodal tokens.
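To make the difficulty concrete, here is a minimal sketch of feeding a mixed sequence to a Transformer: discrete tokens go through an embedding lookup, while continuous tokens are projected into the same hidden size. This is not VibeVoice's or SGLang's actual code. The names `TextToken`, `ContinuousToken`, `project`, and `build_inputs`, as well as the tiny dimensions, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union

Vector = List[float]


@dataclass
class TextToken:
    token_id: int  # discrete vocabulary index


@dataclass
class ContinuousToken:
    latent: Vector  # continuous feature, e.g. one VAE latent frame (assumed)


def project(latent: Sequence[float], weight: Sequence[Sequence[float]]) -> Vector:
    """Linear map (no bias) from latent size to hidden size.

    `weight` has one row per hidden dimension and one column per latent dimension.
    """
    return [sum(w * x for w, x in zip(row, latent)) for row in weight]


def build_inputs(
    sequence: Sequence[Union[TextToken, ContinuousToken]],
    embedding_table: Sequence[Sequence[float]],
    connector_weight: Sequence[Sequence[float]],
) -> List[Vector]:
    """Turn a mixed sequence into one list of hidden-size vectors."""
    hidden: List[Vector] = []
    for item in sequence:
        if isinstance(item, TextToken):
            # Discrete path: look up the row for this token id.
            hidden.append(list(embedding_table[item.token_id]))
        else:
            # Continuous path: there is no vocabulary entry, so project the latent.
            hidden.append(project(item.latent, connector_weight))
    return hidden


# Toy example: vocabulary of 2 tokens, hidden size 3, latent size 2.
embedding_table = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
connector_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = [TextToken(1), ContinuousToken([2.0, 3.0]), TextToken(0)]
inputs = build_inputs(mixed, embedding_table, connector_weight)
```

The point for a serving backend is that the continuous positions cannot be represented as token IDs in a request, so the input pipeline has to carry embeddings (or latents plus a projection) alongside the usual ID sequence.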
Related resources
No response