Is there a way to post an additional output array of floats for Phoneme Timings?

#9 by STFUnity

I looked inside the Lessac and Ryan models after doing some experiments in Unity (it runs amazingly well), but we want to lock a skinned mesh renderer to speak along with the exact timing of the phonemes. We'd need an additional output from the ONNX models themselves; it doesn't need to change the main output tensor shape.
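To make the ask concrete, here's roughly all we'd do with that array on the app side once it exists. Everything below is a sketch with invented values; the 22050 Hz sample rate is just what the medium Piper voices appear to use:

```python
# Rough sketch of consuming a per-phoneme sample-length array (values invented).
import numpy as np

sample_rate = 22050                       # assumed output rate of the voice
phoneme_ids = [25, 14, 31, 3]             # from the phonemizer (already known)
samples_per_phoneme = np.array([4096, 2816, 5120, 1536])  # the requested output

ends = np.cumsum(samples_per_phoneme) / sample_rate    # end time of each phoneme (s)
starts = ends - samples_per_phoneme / sample_rate      # start time of each phoneme (s)

for pid, t0, t1 in zip(phoneme_ids, starts, ends):
    print(f"phoneme {pid}: {t0:.3f}s -> {t1:.3f}s")    # drive viseme blendshapes here
```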

I found the model hard to read after viewing it in onnx-modifier, and I'm not too familiar with the conventions in there, but it looked like there were a couple of Slice and Concat spots that took multiple streams and combined or split them. I'm guessing one of those areas could be a build-up point for the array of ints describing the sample length of each synthesized phoneme's audio. If a second output tensor could be added with a supplementary array of the phoneme timings (their IDs from the phonemizer output are already known), then you could make a model speak live off of it accurately, no matter how eccentric the training audio is or how blown out the parameters on the model input are.
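If the durations really are sitting in the graph as an intermediate tensor (in VITS-style models they usually show up as per-phoneme frame counts, often named something like "w_ceil"), promoting them to a second output might not even need retraining. The tensor name and dtype below are guesses, not something I've confirmed against the Lessac or Ryan files, so treat this as a sketch using the `onnx` Python package:

```python
# Sketch: expose an internal per-phoneme duration tensor as a second ONNX output.
# The tensor name "w_ceil" and its dtype are assumptions about the export layout.
import onnx
from onnx import TensorProto, helper

model = onnx.load("en_US-lessac-medium.onnx")

# First, list candidate tensors in case the name guess is wrong.
for node in model.graph.node:
    if any("dur" in name.lower() or "w_ceil" in name.lower() for name in node.output):
        print(node.op_type, list(node.output))

# Promote the (assumed) duration tensor to a graph output; shape left unspecified.
dur_info = helper.make_tensor_value_info("w_ceil", TensorProto.FLOAT, None)
model.graph.output.append(dur_info)

onnx.save(model, "en_US-lessac-medium.with-durations.onnx")
```

Each duration entry would then be a frame count per input phoneme ID, so the sample length per phoneme should be frames times the hop length (256 in the configs I've looked at), which is exactly the array the earlier sketch consumes.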

Anyway just a thought, it doesn't seem too hard but I'm not an expert.

Rhasspy org

Link to discussion on Github: https://github.com/rhasspy/piper/discussions/425
