
Long-form inference example seems to result in corrupted audio after a certain amount of time #154

Open
ajkessel opened this issue Sep 20, 2024 · 5 comments

@ajkessel commented Sep 20, 2024

I'm using the long-form inference example @jpc provided to process longer text blocks. I tried it on Donald Knuth's short essay on the hyphen in "e-mail", with my own voice as the speaker sample. At first it sounds fine, but the voice gradually fades away and eventually turns into pure noise. Here's the resulting audio and the code below, which is only slightly adapted from the example.

Am I doing something wrong? Any suggestions for how to improve the result?

Setup code:

    self.tts = Pipeline(t2s_ref=self.config.whisper_t2s_model, s2a_ref=self.config.whisper_s2a_model,
                        torch_compile=True, device=self.config.device, optimize=True)
    chunks = self.split_and_prepare_text(text)
    self.whisper_long(chunks=chunks, output=out_file, speaker=self.config.whisper_voice)
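
(For context, these snippets assume the imports below. The `Pipeline` import matches the WhisperSpeech package layout, and `sent_tokenize` is NLTK's sentence splitter, which needs the punkt tokenizer data downloaded.)

    from pathlib import Path
    import re

    import torch
    import torchaudio
    from nltk.tokenize import sent_tokenize  # run nltk.download("punkt") once beforehand
    from whisperspeech.pipeline import Pipeline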

functions:

    def split_and_prepare_text(self, text, cps=14):
        """Split text into chunks that each fit in one ~20s generation pass."""
        chunks = []
        sentences = sent_tokenize(text)
        chunk = ""
        for sentence in sentences:
            # replace fancy punctuation that was unseen during training
            sentence = re.sub("[()]", ",", sentence).strip()
            sentence = re.sub(",+", ",", sentence)
            sentence = re.sub('"+', "", sentence)
            sentence = re.sub("/", "", sentence)
            # merge sentences until the result would exceed ~20s at `cps` chars/sec
            if len(chunk) + len(sentence) < 20 * cps:
                chunk += " " + sentence
            else:
                chunks.append(chunk)
                chunk = sentence
        if chunk:
            chunks.append(chunk)
        return chunks

    def whisper_long(self, chunks=[], cps=14, overlap=100, output=None, speaker=None):
        global atoks, stoks
        if speaker is None:
            speaker = self.tts.default_speaker
        elif isinstance(speaker, (str, Path)):
            speaker = self.tts.extract_spk_emb(speaker)
        r = []
        old_stoks = None
        old_atoks = None
        for chunk in chunks:
            if self.config.debug: print(f"processing chunk {chunk}")
            stoks = self.tts.t2s.generate(chunk, cps=cps, show_progress_bar=False)[0]
            stoks = stoks[stoks != 512]  # drop padding tokens
            if old_stoks is not None:
                # keep prompt + new tokens within the 750-token window
                assert len(stoks) < 750 - overlap
                # prepend the tail of the previous chunk as a continuity prompt
                stoks = torch.cat([old_stoks[-overlap:], stoks])
                atoks_prompt = old_atoks[:, :, -overlap * 3:]  # audio tokens run at 3x the stoks rate
            else:
                atoks_prompt = None
            atoks = self.tts.s2a.generate(stoks, atoks_prompt=atoks_prompt,
                                          speakers=speaker.unsqueeze(0), show_progress_bar=False)
            # strip the prompted region so only newly generated audio is kept
            if atoks_prompt is not None: atoks = atoks[:, :, overlap * 3 + 1:]
            r.append(atoks)
            old_stoks = stoks
            old_atoks = atoks
            self.tts.vocoder.decode_to_notebook(atoks)
        audios = []
        for i, atoks in enumerate(r):
            # insert 0.5s of silence between chunks
            if i != 0: audios.append(torch.zeros((1, int(24000 * 0.5)), dtype=atoks.dtype, device=atoks.device))
            audios.append(self.tts.vocoder.decode(atoks))
        if output:
            torchaudio.save(output, torch.cat(audios, -1).cpu(), 24000)
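
For what it's worth, with the default cps=14 the chunker keeps each merged chunk at or under 20 × 14 = 280 characters, i.e. roughly 20 seconds of speech (a single sentence longer than that still becomes its own oversized chunk). A quick sanity check, where `pipeline_obj` is a hypothetical instance of this class:

    text = "This is a filler sentence for testing. " * 30
    chunks = pipeline_obj.split_and_prepare_text(text)  # pipeline_obj stands in for `self`
    print(len(chunks), [len(c) for c in chunks])
    assert all(len(c) <= 20 * 14 for c in chunks)  # holds because every sentence here is short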
@AldoKacorri

Hi, did you find the cause of the long-form inference issue? I have tried several options without success.

@ajkessel (Author)

I'm not sure. I have code that works now, but I don't remember what the fix was. You're welcome to check it out and copy it as you like.

@AldoKacorri

Thank you, will check it out!

@ajkessel (Author)

I do get a linter error that `old_atoks` is not subscriptable, and I don't entirely understand the logic there since I copied it from the inference example. But it seems to work.
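
(If anyone else hits that warning: `old_atoks` is initialized to `None`, and the checker can't tell that the `old_stoks is not None` guard implies `old_atoks` is set too, even though the two are always assigned together. Tightening the guard to check both names should silence it without changing runtime behavior, something like:)

    if old_stoks is not None and old_atoks is not None:
        # checking both names makes the None-narrowing explicit for the linter
        stoks = torch.cat([old_stoks[-overlap:], stoks])
        atoks_prompt = old_atoks[:, :, -overlap * 3:]
    else:
        atoks_prompt = None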

@ajkessel (Author)

Actually, I spoke too soon. It's still trailing off and dying when a longer text is provided.

I understand what the code is supposed to do: it takes the tail of each chunk's generated tokens and uses it as the prompt for the next chunk, to get better flow/continuity. But somehow that isn't holding up over several iterations.
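
Here's my reading of the intended mechanism as a toy sketch, with random tensors standing in for the model outputs (the overlap of 100 and the 3× semantic-to-audio token ratio are just what the code above uses, not something I've verified against the models):

    import torch

    overlap = 100
    old_stoks = old_atoks = None
    for step in range(3):
        stoks = torch.randint(0, 512, (400,))  # pretend t2s output for one chunk
        if old_stoks is not None:
            # tail of the previous chunk becomes the prompt for this one
            stoks = torch.cat([old_stoks[-overlap:], stoks])
            atoks_prompt = old_atoks[:, :, -overlap * 3:]
        else:
            atoks_prompt = None
        # a real call would be s2a.generate(stoks, atoks_prompt=...); fake the shape here
        atoks = torch.randint(0, 1024, (1, 2, stoks.shape[0] * 3))
        if atoks_prompt is not None:
            atoks = atoks[:, :, overlap * 3:]  # keep only the newly generated part
        old_stoks, old_atoks = stoks, atoks
        print(step, tuple(stoks.shape), tuple(atoks.shape))

If that reading is right, the tail of each generated chunk is fed back in as the next prompt, so any degradation at the end of one chunk (quieter audio, a drifting voice) would compound from iteration to iteration, which would match the gradual fade I'm hearing.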

I'm hoping @jpc can chime in and point us in the right direction.
