
Long-form inference example seems to result in corrupted audio after a certain amount of time #154

Open
ajkessel opened this issue Sep 20, 2024 · 5 comments

@ajkessel commented Sep 20, 2024

I'm using the long-form inference example @jpc provided to process longer text blocks. I tried it on Donald Knuth's short essay on the hyphen in "e-mail", with my own voice as the speaker sample. At first it sounds fine, but the voice gradually fades away and eventually turns into pure noise. Here's the resulting audio and the code below, which is only slightly adapted from the example.

Am I doing something wrong? Any suggestions for how to improve the result?

Setup code:

    self.tts = Pipeline(t2s_ref=self.config.whisper_t2s_model, s2a_ref=self.config.whisper_s2a_model,
                        torch_compile=True, device=self.config.device, optimize=True)
    chunks = self.split_and_prepare_text(text)
    self.whisper_long(chunks=chunks, output=out_file, speaker=self.config.whisper_voice)
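
(For context, these snippets assume the imports below. The `Pipeline` import matches the WhisperSpeech package layout, and `sent_tokenize` is NLTK's sentence splitter, which needs the punkt tokenizer data downloaded.)

    from pathlib import Path
    import re

    import torch
    import torchaudio
    from nltk.tokenize import sent_tokenize  # run nltk.download("punkt") once beforehand
    from whisperspeech.pipeline import Pipeline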

functions:

    def split_and_prepare_text(self, text, cps=14):
        """Split text into chunks that each fit in one ~20s generation pass."""
        chunks = []
        sentences = sent_tokenize(text)
        chunk = ""
        for sentence in sentences:
            # replace fancy punctuation that was unseen during training
            sentence = re.sub("[()]", ",", sentence).strip()
            sentence = re.sub(",+", ",", sentence)
            sentence = re.sub('"+', "", sentence)
            sentence = re.sub("/", "", sentence)
            # merge sentences until the result would exceed ~20s at `cps` chars/sec
            if len(chunk) + len(sentence) < 20 * cps:
                chunk += " " + sentence
            else:
                chunks.append(chunk)
                chunk = sentence
        if chunk:
            chunks.append(chunk)
        return chunks

    def whisper_long(self, chunks=[], cps=14, overlap=100, output=None, speaker=None):
        global atoks, stoks
        if speaker is None:
            speaker = self.tts.default_speaker
        elif isinstance(speaker, (str, Path)):
            speaker = self.tts.extract_spk_emb(speaker)
        r = []
        old_stoks = None
        old_atoks = None
        for chunk in chunks:
            if self.config.debug: print(f"processing chunk {chunk}")
            stoks = self.tts.t2s.generate(chunk, cps=cps, show_progress_bar=False)[0]
            stoks = stoks[stoks != 512]  # drop padding tokens
            if old_stoks is not None:
                # keep prompt + new tokens within the 750-token window
                assert len(stoks) < 750 - overlap
                # prepend the tail of the previous chunk as a continuity prompt
                stoks = torch.cat([old_stoks[-overlap:], stoks])
                atoks_prompt = old_atoks[:, :, -overlap * 3:]  # audio tokens run at 3x the stoks rate
            else:
                atoks_prompt = None
            atoks = self.tts.s2a.generate(stoks, atoks_prompt=atoks_prompt,
                                          speakers=speaker.unsqueeze(0), show_progress_bar=False)
            # strip the prompted region so only newly generated audio is kept
            if atoks_prompt is not None: atoks = atoks[:, :, overlap * 3 + 1:]
            r.append(atoks)
            old_stoks = stoks
            old_atoks = atoks
            self.tts.vocoder.decode_to_notebook(atoks)
        audios = []
        for i, atoks in enumerate(r):
            # insert 0.5s of silence between chunks
            if i != 0: audios.append(torch.zeros((1, int(24000 * 0.5)), dtype=atoks.dtype, device=atoks.device))
            audios.append(self.tts.vocoder.decode(atoks))
        if output:
            torchaudio.save(output, torch.cat(audios, -1).cpu(), 24000)
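
For what it's worth, with the default cps=14 the chunker keeps each merged chunk at or under 20 × 14 = 280 characters, i.e. roughly 20 seconds of speech (a single sentence longer than that still becomes its own oversized chunk). A quick sanity check, where `pipeline_obj` is a hypothetical instance of this class:

    text = "This is a filler sentence for testing. " * 30
    chunks = pipeline_obj.split_and_prepare_text(text)  # pipeline_obj stands in for `self`
    print(len(chunks), [len(c) for c in chunks])
    assert all(len(c) <= 20 * 14 for c in chunks)  # holds because every sentence here is short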
@AldoKacorri

Hi, did you find the cause of the long-form inference issue? I have tried several options without success.

@ajkessel (Author)

I'm not sure. I have code that works now, but I don't remember what the fix was. You're welcome to check it out and copy it as you like.

@AldoKacorri

Thank you, will check it out!

@ajkessel (Author)

I do get a linter error that `old_atoks` is not subscriptable, and I don't entirely understand the logic there since I copied it from the inference example. But it seems to work.
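
(If anyone else hits that warning: `old_atoks` is initialized to `None`, and the checker can't tell that the `old_stoks is not None` guard implies `old_atoks` is set too, even though the two are always assigned together. Tightening the guard to check both names should silence it without changing runtime behavior, something like:)

    if old_stoks is not None and old_atoks is not None:
        # checking both names makes the None-narrowing explicit for the linter
        stoks = torch.cat([old_stoks[-overlap:], stoks])
        atoks_prompt = old_atoks[:, :, -overlap * 3:]
    else:
        atoks_prompt = None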

@ajkessel (Author)

Actually, I spoke too soon. It's still trailing off and dying when a longer text is provided.

I understand what the code is supposed to do: it takes the tail of each chunk's generated tokens and uses it as the prompt for the next chunk, to get better flow/continuity. But somehow that isn't holding up over several iterations.
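
Here's my reading of the intended mechanism as a toy sketch, with random tensors standing in for the model outputs (the overlap of 100 and the 3× semantic-to-audio token ratio are just what the code above uses, not something I've verified against the models):

    import torch

    overlap = 100
    old_stoks = old_atoks = None
    for step in range(3):
        stoks = torch.randint(0, 512, (400,))  # pretend t2s output for one chunk
        if old_stoks is not None:
            # tail of the previous chunk becomes the prompt for this one
            stoks = torch.cat([old_stoks[-overlap:], stoks])
            atoks_prompt = old_atoks[:, :, -overlap * 3:]
        else:
            atoks_prompt = None
        # a real call would be s2a.generate(stoks, atoks_prompt=...); fake the shape here
        atoks = torch.randint(0, 1024, (1, 2, stoks.shape[0] * 3))
        if atoks_prompt is not None:
            atoks = atoks[:, :, overlap * 3:]  # keep only the newly generated part
        old_stoks, old_atoks = stoks, atoks
        print(step, tuple(stoks.shape), tuple(atoks.shape))

If that reading is right, the tail of each generated chunk is fed back in as the next prompt, so any degradation at the end of one chunk (quieter audio, a drifting voice) would compound from iteration to iteration, which would match the gradual fade I'm hearing.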

I'm hoping @jpc can chime in and point us in the right direction.
