Hi,

Mimi

A few weeks ago Kyutai-labs finally released Moshi, an LLM that also supports STT and TTS in real time. Alongside it, they also released Mimi, the speech codec they designed for this. Here's the Hugging Face link to Mimi.
I was wondering if this would be relevant to WhisperSpeech's future roadmap. Quoting their README:
Mimi builds on previous neural audio codecs such as SoundStream and EnCodec, adding a Transformer both in the encoder and decoder, and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the average frame rate of text tokens (~3-4 Hz), and limit the number of autoregressive steps in Moshi. Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match a self-supervised representation from WavLM, which allows modeling semantic and acoustic information with a single model. Interestingly, while Mimi is fully causal and streaming, it learns to match sufficiently well the non-causal representation from WavLM, without introducing any delays. Finally, and similarly to EBEN, Mimi uses only an adversarial training loss, along with feature matching, showing strong improvements in terms of subjective quality despite its low bitrate.
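For reference, here is roughly what a round trip through Mimi looks like. This is just a sketch assuming the Hugging Face transformers integration (MimiModel and the kyutai/mimi checkpoint), not something I have wired into WhisperSpeech:

```python
import torch
from transformers import AutoFeatureExtractor, MimiModel

# Assumes the transformers Mimi integration and the kyutai/mimi checkpoint.
model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of dummy audio at Mimi's native sampling rate (24 kHz).
audio = torch.zeros(feature_extractor.sampling_rate).numpy()
inputs = feature_extractor(raw_audio=audio,
                           sampling_rate=feature_extractor.sampling_rate,
                           return_tensors="pt")

with torch.no_grad():
    # Encode to discrete tokens at ~12.5 Hz, then decode back to a waveform.
    encoded = model.encode(inputs["input_values"])
    decoded = model.decode(encoded.audio_codes)

print(encoded.audio_codes.shape)   # (batch, num_codebooks, num_frames)
print(decoded.audio_values.shape)  # (batch, channels, num_samples)
```

The low frame rate is what seems interesting here: far fewer autoregressive steps per second of audio than typical codecs.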
FasterWhisper
Additionally, hardly anyone seems to use Whisper directly anymore; most projects use FasterWhisper instead, which reimplements parts of it and makes it both faster and more memory efficient. Is this relevant to WhisperSpeech? Maybe not at all, but I preferred to ask. A usage sketch follows below.
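To make the suggestion concrete, this is the kind of usage faster-whisper shows in its README (the model name and audio path here are just placeholders):

```python
from faster_whisper import WhisperModel

# CTranslate2 reimplementation of Whisper: same checkpoints,
# faster inference and a smaller memory footprint.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# "audio.mp3" is a placeholder path.
segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```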
Whisper v3 turbo
Thirdly, OpenAI released their v3 turbo model this week. It seems straightforward to integrate, since I saw other projects make it usable within days, so I was wondering whether you were considering moving to the v3 (turbo) version in the future.
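For example, with the reference openai-whisper package the new model is, if I'm not mistaken, just another model name:

```python
import whisper

# "turbo" is the alias openai-whisper uses for large-v3-turbo
# (assuming a recent enough version of the package).
model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")  # placeholder path
print(result["text"])
```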
Thanks!