[Feature] [New Model]: VibeVoice-1.5B: A Frontier Long Conversational Text-to-Speech Model #9697

### Checklist

- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.

### Motivation

https://huggingface.co/microsoft/VibeVoice-1.5B
https://github.com/microsoft/VibeVoice

VibeVoice is an autoregressive model with multimodal inputs and outputs. It uses a diffusion head together with a VAE for multimodal processing.
The design provides a general-purpose interface that unifies multimodal generation and understanding.

VibeVoice has become highly popular, and many users are interested in deploying it as an API backend.
If SGLang could support this model, it would be extremely valuable for the community.

Difficulty: the Transformer needs to handle not only discrete text tokens but also continuous multimodal tokens.
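
To make the difficulty concrete, here is a minimal sketch of what "mixed" input could look like: some positions carry a discrete token id that is looked up in an embedding table, while others carry a continuous latent vector (for example a VAE frame) that is projected into the hidden size. Everything here is an assumption for illustration: `TextToken`, `ContinuousToken`, `project`, `build_input_embeddings`, and the toy dimensions are hypothetical and are not taken from the VibeVoice or SGLang code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union

Vector = List[float]


@dataclass
class TextToken:
    """Hypothetical discrete position: an index into the embedding table."""
    token_id: int


@dataclass
class ContinuousToken:
    """Hypothetical continuous position: a latent vector (e.g. a VAE frame)."""
    latent: Vector


def project(latent: Vector, weight: Sequence[Vector]) -> Vector:
    # weight has hidden_size rows and latent_dim columns (plain matrix-vector product).
    return [sum(w * x for w, x in zip(row, latent)) for row in weight]


def build_input_embeddings(
    sequence: Sequence[Union[TextToken, ContinuousToken]],
    embedding_table: Sequence[Vector],
    connector_weight: Sequence[Vector],
) -> List[Vector]:
    """Return one hidden vector per position, whatever its modality."""
    hidden = []
    for item in sequence:
        if isinstance(item, TextToken):
            # Discrete path: table lookup by token id.
            hidden.append(list(embedding_table[item.token_id]))
        else:
            # Continuous path: linear projection of the latent into hidden size.
            hidden.append(project(item.latent, connector_weight))
    return hidden


# Toy sizes (assumed): vocab of 3, hidden size 3, latent dimension 2.
embedding_table = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
connector_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sequence = [TextToken(1), ContinuousToken([0.5, 0.25]), TextToken(2)]

hidden = build_input_embeddings(sequence, embedding_table, connector_weight)
print(hidden)
```

A serving backend that assumes every position is an integer token id would need some equivalent of the continuous path above, on the input side and also when the model emits continuous outputs instead of sampled token ids.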

### Related resources

_No response_
