Replies: 4 comments
-
FYI, I found a workaround for this that improves the speed without enabling realtime transcription, so that we already start transcribing during the silence duration.
-
Wow, this is actually a very good idea. I couldn't get the faster_audio_recorder.py code to work and got an error (it doesn't matter much, I get your point and you are absolutely right):

```
Say something...Traceback (most recent call last):
  File "C:\Dev\Audio\RealtimeSTT\RealtimeSTT\tests\realtimestt_faster_test.py", line 62, in <module>
    recorder.text(process_text)
  File "C:\Dev\Audio\RealtimeSTT\RealtimeSTT\RealtimeSTT\audio_recorder.py", line 1188, in text
    args=(self.transcribe(),)).start()
  File "C:\Dev\Audio\RealtimeSTT\RealtimeSTT\tests\faster_audio_recorder.py", line 227, in transcribe
    return self._preprocess_output(result)
  File "C:\Dev\Audio\RealtimeSTT\RealtimeSTT\RealtimeSTT\audio_recorder.py", line 1911, in _preprocess_output
    text = re.sub(r'\s+', ' ', text.strip())
AttributeError: 'tuple' object has no attribute 'strip'
```

The only downside I currently see is some additional load on the GPU when the user continues talking during the post_speech_silence_duration phase. But I think it's extremely unlikely that the final transcription gets blocked by this, since the user probably won't immediately stop talking again, so it's not really a problem. Thanks again for the idea, this is really brilliant. I think I can implement this within the next ~1-2 weeks.
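The traceback suggests that `transcribe()` in faster_audio_recorder.py returned a tuple where `_preprocess_output` expects a plain string (faster-whisper's `WhisperModel.transcribe`, for instance, returns a `(segments, info)` tuple). A minimal sketch of a defensive unwrap, assuming the text is the tuple's first element (the function name and tuple shape here are illustrative, not the RealtimeSTT code):

```python
import re

def preprocess_output(result):
    # Hypothetical guard: if the transcription result is a (text, info)
    # tuple, take the text part; otherwise assume it is already a string.
    # (With faster-whisper's (segments, info) you would instead join
    # segment.text over the segments iterable.)
    text = result[0] if isinstance(result, tuple) else result
    # Collapse runs of whitespace into single spaces, as the original
    # _preprocess_output does.
    return re.sub(r'\s+', ' ', text.strip())
```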
-
Got it to work. Amazing work, thank you again.
-
Thanks! Hopefully this can become a standard feature soon so I don't need to maintain a local code variant.
-
My laptop is not very good. Although faster-whisper is very fast, transcribing one sentence can still take 0.6-1 s. This adds to post_speech_silence_duration, which creates a noticeable delay. On the other hand, realtime transcription seems to be designed only as a preview, which doesn't work well at the end of an audio recording.
Could we add a new config flag that refactors the realtime worker so that, instead of transcribing continuously, it only starts transcribing at the beginning of a speech pause? That way we squeeze the time inside post_speech_silence_duration: once silence is detected we kick off a transcription attempt, and if the silence then lasts the full post_speech_silence_duration and the recording finishes, the final transcription simply waits for the previous one to finish and returns its text.
Or, maybe even better, make this part of the normal transcription flow rather than the realtime one.
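The idea above can be sketched as a small speculative-transcription helper: start the transcription in a background thread as soon as silence begins, and if the silence lasts the full post_speech_silence_duration, just wait for that thread and reuse its result. All names here are hypothetical (this is not the RealtimeSTT API), and a real implementation would also discard the speculative result if the user resumes talking:

```python
import threading

class SpeculativeTranscriber:
    """Sketch: overlap the (slow) transcription with the silence wait."""

    def __init__(self, transcribe_fn):
        self.transcribe_fn = transcribe_fn  # the slow model call
        self._thread = None
        self._result = None

    def on_silence_started(self, audio):
        # Speech just paused: start transcribing speculatively in the
        # background while we wait out post_speech_silence_duration.
        def worker():
            self._result = self.transcribe_fn(audio)
        self._thread = threading.Thread(target=worker)
        self._thread.start()

    def on_recording_finished(self):
        # The silence lasted the full post_speech_silence_duration, so the
        # recording is done: wait for the speculative transcription (which
        # has had a head start) instead of starting a fresh one.
        self._thread.join()
        return self._result
```

If the user starts talking again before post_speech_silence_duration elapses, the speculative result is thrown away and a new attempt starts at the next pause, which is exactly the extra GPU load mentioned earlier in the thread.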