Replies: 4 comments
-
FYI, I found a workaround for this that improves the speed without enabling realtime transcription, so that we already start transcribing during the silence duration.
-
Wow, this is actually a very good idea. I couldn't get the faster_audio_recorder.py code to work and got an error (it doesn't matter much, I get your point and you are absolutely right):

```
Say something...Traceback (most recent call last):
  File "C:\Dev\Audio\RealtimeSTT\RealtimeSTT\tests\realtimestt_faster_test.py", line 62, in <module>
    recorder.text(process_text)
  File "C:\Dev\Audio\RealtimeSTT\RealtimeSTT\RealtimeSTT\audio_recorder.py", line 1188, in text
    args=(self.transcribe(),)).start()
  File "C:\Dev\Audio\RealtimeSTT\RealtimeSTT\tests\faster_audio_recorder.py", line 227, in transcribe
    return self._preprocess_output(result)
  File "C:\Dev\Audio\RealtimeSTT\RealtimeSTT\RealtimeSTT\audio_recorder.py", line 1911, in _preprocess_output
    text = re.sub(r'\s+', ' ', text.strip())
AttributeError: 'tuple' object has no attribute 'strip'
```

The only downside I currently see is some additional load on the GPU when the user continues talking during the post_speech_silence_duration phase. But I think it's extremely unlikely that the final transcription gets blocked by this, since the user probably won't immediately stop talking again, so it's not really a problem. Thanks again for the idea, this is really brilliant. I think I can implement this within the next ~1-2 weeks.
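The traceback suggests that `transcribe()` in faster_audio_recorder.py returned a tuple where `_preprocess_output` expects a plain string (faster-whisper's `WhisperModel.transcribe`, for instance, returns a `(segments, info)` tuple). A minimal sketch of a defensive unwrap, assuming the text is the tuple's first element (the function name and tuple shape here are illustrative, not the RealtimeSTT code):

```python
import re

def preprocess_output(result):
    # Hypothetical guard: if the transcription result is a (text, info)
    # tuple, take the text part; otherwise assume it is already a string.
    # (With faster-whisper's (segments, info) you would instead join
    # segment.text over the segments iterable.)
    text = result[0] if isinstance(result, tuple) else result
    # Collapse runs of whitespace into single spaces, as the original
    # _preprocess_output does.
    return re.sub(r'\s+', ' ', text.strip())
```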
-
Got it to work. Amazing work, thank you again.
-
Thanks! Hopefully this can become a standard feature soon so I don't need to maintain a local code variant.
-
My laptop is not very good. Although faster-whisper is very fast, transcribing one sentence can still take 0.6-1 s. This adds to post_speech_silence_duration, which creates a noticeable delay. On the other hand, realtime transcription seems to be designed only as a preview, which doesn't work well at the end of an audio recording.
Could we add a new config flag that refactors the realtime worker so that, instead of transcribing continuously, it only starts transcribing at the beginning of a speech pause? That way we squeeze the time inside post_speech_silence_duration: once silence is detected we kick off a transcription attempt, and if the silence then lasts the full post_speech_silence_duration and the recording finishes, the final transcription simply waits for the previous one to finish and returns its text.
Or, maybe even better, make this part of the normal transcription flow rather than the realtime one.
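The idea above can be sketched as a small speculative-transcription helper: start the transcription in a background thread as soon as silence begins, and if the silence lasts the full post_speech_silence_duration, just wait for that thread and reuse its result. All names here are hypothetical (this is not the RealtimeSTT API), and a real implementation would also discard the speculative result if the user resumes talking:

```python
import threading

class SpeculativeTranscriber:
    """Sketch: overlap the (slow) transcription with the silence wait."""

    def __init__(self, transcribe_fn):
        self.transcribe_fn = transcribe_fn  # the slow model call
        self._thread = None
        self._result = None

    def on_silence_started(self, audio):
        # Speech just paused: start transcribing speculatively in the
        # background while we wait out post_speech_silence_duration.
        def worker():
            self._result = self.transcribe_fn(audio)
        self._thread = threading.Thread(target=worker)
        self._thread.start()

    def on_recording_finished(self):
        # The silence lasted the full post_speech_silence_duration, so the
        # recording is done: wait for the speculative transcription (which
        # has had a head start) instead of starting a fresh one.
        self._thread.join()
        return self._result
```

If the user starts talking again before post_speech_silence_duration elapses, the speculative result is thrown away and a new attempt starts at the next pause, which is exactly the extra GPU load mentioned earlier in the thread.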