[Feature] [New Model]: VibeVoice-1.5B: A Frontier Long Conversational Text-to-Speech Model #9697

@YaoyaoChang

Description

Motivation

https://huggingface.co/microsoft/VibeVoice-1.5B
https://github.com/microsoft/VibeVoice

This model uses an autoregressive architecture with multimodal inputs and outputs, combining a diffusion head with a VAE.
The design provides a general-purpose interface that unifies multimodal generation and understanding.

VibeVoice has become highly popular, and many users are interested in deploying it as an API backend.
If SGLang could support this model, it would be extremely valuable for the community.

Difficulty: the Transformer needs to handle not only discrete text tokens but also continuous multimodal tokens.
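
To make the difficulty concrete, here is a minimal sketch of one input sequence that mixes both token kinds. It is not VibeVoice's or SGLang's actual code: `TextToken`, `ContinuousToken`, `build_inputs`, the tiny dimensions, and the idea of a single linear projection for the continuous latents are all assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union

Vector = List[float]


@dataclass
class TextToken:
    token_id: int  # index into a discrete vocabulary


@dataclass
class ContinuousToken:
    latent: Vector  # hypothetical continuous frame, e.g. a VAE latent


Token = Union[TextToken, ContinuousToken]


def embed_text(token: TextToken, table: Sequence[Vector]) -> Vector:
    """Discrete path: look up a row of the embedding table."""
    return list(table[token.token_id])


def project_latent(token: ContinuousToken, weight: Sequence[Vector]) -> Vector:
    """Continuous path: multiply the latent by a (hidden_dim x latent_dim) matrix."""
    return [sum(w * x for w, x in zip(row, token.latent)) for row in weight]


def build_inputs(
    tokens: Sequence[Token],
    table: Sequence[Vector],
    weight: Sequence[Vector],
) -> List[Vector]:
    """Map a mixed sequence to hidden-size vectors, preserving order."""
    hidden: List[Vector] = []
    for token in tokens:
        if isinstance(token, TextToken):
            hidden.append(embed_text(token, table))
        elif isinstance(token, ContinuousToken):
            hidden.append(project_latent(token, weight))
        else:
            raise TypeError(f"unsupported token type: {type(token).__name__}")
    return hidden


# Toy sizes (assumed): vocabulary of 3, hidden_dim 3, latent_dim 2.
embedding_table = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

sequence = [TextToken(1), ContinuousToken([0.5, 0.25]), TextToken(2)]
inputs = build_inputs(sequence, embedding_table, projection)
print(inputs)
```

The point of the sketch is that a serving path built around integer token IDs alone cannot represent the middle element: the request and batching layers would have to carry float vectors alongside IDs, and the output side would need a matching non-sampling path for the continuous positions.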

Related resources

No response
