SpeechGPT 2.0-preview

Our first human-like real-time interaction system towards context intelligence.

Introduction

SpeechGPT 2.0-preview is our first human-like real-time interaction system as we move towards context intelligence. Trained on millions of hours of speech data, this end-to-end spoken language model produces human-like spoken expression with millisecond-level response latency, enabling natural, fluid conversations with real-time interruption.

SpeechGPT 2.0-preview aligns the speech and text modalities well. On one hand, it shows strong speech-style generalization: it follows user instructions to control emotion, style, and tone, and switches among them intelligently. It also has strong role-playing abilities, imitating the tone and emotional state of various characters, and displays a range of vocal talents, including poetry recitation, storytelling, and speaking in dialects. On the other hand, while excelling in vocal expressiveness, it retains strong intelligence and text capabilities, supporting tool calls, online search, and external knowledge base access.

Currently, SpeechGPT 2.0-preview has been trained only on Chinese speech data and has not been trained on English speech data; as a result, the model does not yet support English conversation.

We have open-sourced the inference code, model weights, and a brief methodology introduction for SpeechGPT 2.0-preview at https://github.com/OpenMOSS/SpeechGPT-2.0-preview/blob/main/README_EN.md.

We welcome you to try our demo.

Model Summary

SpeechGPT 2.0-preview is an end-to-end spoken dialogue language model. Building on our insights and technical progress in end-to-end speech dialogue, we developed an ultra-low-bitrate streaming speech codec that jointly models semantics and acoustics. We built an efficient speech data crawling system, a versatile, high-throughput speech data cleaning pipeline, and a multi-granularity speech data annotation system, accumulating millions of hours of carefully annotated real speech data. On top of this, we developed a conversational speech synthesis system with strong voice cloning ability and used it to synthesize hundreds of thousands of hours of multi-role, multi-style speech-to-speech dialogue data.

We also proposed a new speech-text mixed-modeling architecture and a multi-stage mixed speech-text training process that balance textual and speech capabilities: the model does not sacrifice intelligence while learning speech, and it can seamlessly replace text models in various frameworks, supporting tool invocation, internet search, and external knowledge base integration. By modeling speech dialogue end to end, SpeechGPT 2.0-preview achieves a latency below 200 ms in practical tests, providing users with a smooth real-time interactive experience.

During our experiments, we also observed many interesting phenomena and conclusions. For example, after extensive pre-training on speech-text alignment, the model could "emerge" with the ability to generalize speech styles: it can control speech rate even without being trained on dialogue data with explicit speech-rate adjustments, and it can adopt the tones and styles of characters it has never seen before. Moreover, the quality of the speech data synthesis engine is key to enhancing the capabilities of the end-to-end speech model across the various training stages.

Semantic-Acoustic Joint Modeling Ultra-Low Bitrate Streaming Speech Codec

  • 24kHz speech input
  • Semantic-acoustic joint modeling
  • Ultra-low bitrate: 750 bps (75 tokens per second)
  • Supports streaming input/output
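As a back-of-the-envelope check (the actual codebook layout is not described in this preview, so the derived codebook size is an assumption), the quoted numbers are mutually consistent:

```python
# Sanity check on the codec's quoted figures: 75 tokens/s at 750 bps
# implies 10 bits per token, i.e. a single codebook of 2**10 = 1024
# entries (assumed -- the real codebook layout is not stated here).

TOKENS_PER_SECOND = 75
BITRATE_BPS = 750
SAMPLE_RATE_HZ = 24_000

bits_per_token = BITRATE_BPS / TOKENS_PER_SECOND    # 10.0 bits
codebook_size = 2 ** int(bits_per_token)            # 1024 entries

# Each token therefore covers 1/75 s of 24 kHz audio:
samples_per_token = SAMPLE_RATE_HZ / TOKENS_PER_SECOND  # 320 samples
ms_per_token = 1000 / TOKENS_PER_SECOND                 # ~13.3 ms

print(bits_per_token, codebook_size, samples_per_token, round(ms_per_token, 1))
```

The ~13 ms of audio per token is one reason a streaming codec at this rate can support sub-200 ms end-to-end latency: output audio can begin after only a handful of generated tokens.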

Codec Patchify Based Speech-Text LLM Architecture

  • Codec patchify
  • Speech-Text Joint Modeling, Flexible Cross-Combination
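The preview does not spell out the patchify details. As an illustration only, one common form of patchify groups consecutive codec tokens into fixed-size patches so the LLM operates at a lower frame rate than the codec; the patch size and padding token below are assumptions, not values from this work:

```python
# Hypothetical sketch of "codec patchify": group consecutive codec tokens
# into fixed-size, non-overlapping patches so the LLM sees a shorter
# sequence than the raw codec stream. PATCH_SIZE and PAD_TOKEN are
# illustrative assumptions.
from typing import List

PATCH_SIZE = 5   # assumed: 75 codec tokens/s -> 15 patches/s for the LLM
PAD_TOKEN = 0    # assumed padding id

def patchify(codec_tokens: List[int], patch_size: int = PATCH_SIZE) -> List[List[int]]:
    """Right-pad the token stream to a multiple of patch_size, then
    split it into consecutive, non-overlapping patches."""
    remainder = len(codec_tokens) % patch_size
    if remainder:
        codec_tokens = codec_tokens + [PAD_TOKEN] * (patch_size - remainder)
    return [codec_tokens[i:i + patch_size]
            for i in range(0, len(codec_tokens), patch_size)]

def unpatchify(patches: List[List[int]]) -> List[int]:
    """Flatten patches back into a codec token stream."""
    return [tok for patch in patches for tok in patch]

# One second of speech: 75 codec tokens -> 15 patches for the LLM.
tokens = list(range(1, 76))
patches = patchify(tokens)
print(len(patches))   # 15
print(patches[0])     # [1, 2, 3, 4, 5]
```

In a full model, each patch would be embedded into a single LLM position (e.g. by concatenating the per-token embeddings), and the LLM's output for that position would be expanded back into `patch_size` codec tokens; that embedding/expansion side is not shown here.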

Model Ability

Real-time Spoken Conversational Ability

Rich in emotion, diverse in style

Role-playing

Vocal Talent

Next Step

SpeechGPT 2.0-preview still needs improvement in model stability and sound quality. We are training and building a system for duplex models, incorporating RLHF to enhance the model's expressiveness and stability, further expanding the volume of speech data, and extending to more languages. Please look forward to the next version update.