We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components,
namely voice activity detection, speech recognition, textual dialogue an...