Moshi: A Speech-Text Foundation Model for Real-Time Dialogue, 2024 (English Edition)
Resource Overview
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic inform