Managed to squeeze more efficiency out of this
#20
by Wladastic - opened
https://github.com/Wladastic/omnivoice-tts-nano-webui
I made the LM load with bitsandbytes NF4. It seems to work just as well as fp32, since the quantization does not appear to affect the output much, and it saves a whole lot of VRAM.
For whatever reason, using bf16 as the dtype let me use less VRAM during inference, or at least use it more "evenly".
This only works as long as the sample voice is shorter than 4 s, but it is still miles better than anything else in German.
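For anyone who wants to try something similar, here is a minimal sketch of NF4 loading with bf16 compute. It assumes the LM part can be loaded through `AutoModelForCausalLM` from `transformers`, which may not match how the linked repo actually does it. `load_lm_nf4` and `weight_gib` are hypothetical helper names, and the size estimate covers weights only (no activations, no cache, and it ignores the small per-block constants bitsandbytes stores).

```python
# Approximate storage cost per weight for each format.
# NF4 uses 4 bits per weight; quantization constants are ignored here.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "nf4": 0.5}


def weight_gib(num_params: int, fmt: str) -> float:
    """Rough size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[fmt] / 1024**3


def load_lm_nf4(model_id: str):
    """Load a causal LM with NF4 weights and bf16 compute.

    Assumption: the LM is loadable via AutoModelForCausalLM.
    Heavy imports are kept inside the function so the size helper
    above works without torch installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # quantize weights to 4 bit
        bnb_4bit_quant_type="nf4",              # NormalFloat4 instead of plain fp4
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
```

As a rough guide, going from fp32 to NF4 cuts weight storage to about one eighth, so `weight_gib(n, "fp32")` is eight times `weight_gib(n, "nf4")` for any parameter count `n`.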
I am just wondering why it sometimes spikes up to 8 GB of VRAM usage. That causes an OOM, because I set the max VRAM usage to 4 GB, even though I clear the CUDA cache after every inference.
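One thing that may help narrow this down: clearing the cache after an inference only releases cached blocks once the pass is finished, so it cannot lower the peak reached during the pass, and activation memory usually grows with input length. The sketch below records the peak per call so spikes can be matched to specific inputs. It assumes the cap is set with `torch.cuda.set_per_process_memory_fraction`, which may not be how the cap in the repo is implemented. `cap_fraction` and `run_with_peak_report` are hypothetical names, and `infer` stands for whatever callable runs one inference.

```python
def cap_fraction(cap_bytes: int, total_bytes: int) -> float:
    """Fraction of total device memory that corresponds to a byte cap."""
    return min(1.0, cap_bytes / total_bytes)


def run_with_peak_report(infer, cap_gib: float = 4.0, device: int = 0):
    """Run infer() under a memory cap and print the peak usage of that call."""
    import torch  # imported lazily so cap_fraction works without a GPU

    total = torch.cuda.get_device_properties(device).total_memory
    fraction = cap_fraction(int(cap_gib * 1024**3), total)
    # Allocations beyond this fraction raise an out-of-memory error.
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    torch.cuda.reset_peak_memory_stats(device)
    try:
        with torch.inference_mode():
            return infer()
    finally:
        # Runs on success and on OOM, so the spike is always reported.
        peak_alloc = torch.cuda.max_memory_allocated(device) / 1024**3
        peak_reserved = torch.cuda.max_memory_reserved(device) / 1024**3
        print(f"peak allocated {peak_alloc:.2f} GiB, reserved {peak_reserved:.2f} GiB")
        torch.cuda.empty_cache()  # frees cached blocks only after the pass
```

If the reported peak tracks the length of the text or the reference audio, the spike is probably activation memory rather than a leak, but that is a guess until measured.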
EDIT: I just managed to offload parts of the model.
It now uses 1.3 GB of VRAM during inference and 2.4 GB of CPU RAM.
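The post does not say how the offloading was done. One common way is to give `from_pretrained` a per-device memory budget and let `device_map="auto"` place the layers that do not fit on the GPU in CPU RAM. The sketch below assumes the model loads through `AutoModel` from `transformers` with `accelerate` installed. `memory_budget` and `load_with_offload` are hypothetical names, and the budget values are placeholders, not the settings used above.

```python
def memory_budget(gpu_gib: float, cpu_gib: float, gpu_index: int = 0) -> dict:
    """Build a max_memory mapping, expressed in whole MiB per device."""
    return {
        gpu_index: f"{int(gpu_gib * 1024)}MiB",
        "cpu": f"{int(cpu_gib * 1024)}MiB",
    }


def load_with_offload(model_id: str, gpu_gib: float = 1.5, cpu_gib: float = 3.0):
    """Load a model so that layers exceeding the GPU budget stay on the CPU.

    Assumption: the model is loadable via AutoModel; offloaded layers are
    moved to the GPU on demand, which trades speed for VRAM.
    """
    import torch
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",                          # let accelerate place layers
        max_memory=memory_budget(gpu_gib, cpu_gib),  # per-device limits
    )
```

The budget only limits where weights are placed. Activations during inference still need headroom on the GPU, so the GPU limit should sit below the real cap.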