ecker committed
Commit 374fde8 · verified · 1 Parent(s): e302cd8

Update README.md

Files changed (1)
  1. README.md +6 -1
README.md CHANGED
@@ -100,7 +100,12 @@ This repo contains the following configurations under `./models/`:
   * The "confidence" issue on voices it hasn't seen / hasn't seen much of is much more noticeable, as RVQ level 0 is much more susceptible to it.
   * Unlike the base model, this is trained with the current dataset without iteratively drip-feeding additional sources (like tacking on Emilia afterwards).
   * ...except for STT; this received no STT training out of fear of botching the model.
- * Weights will be added as the model is trained.
+ * ~~Weights will be added as the model is trained.~~
+ * I don't think the model can perform well at the current size.
+ * Longer utterances degrade and stutter.
+ * While more training seems to make it adhere to the prompt better, more training does not make the output more stable.
+ * It seems exactly the same as the previous, erroneously trained model (where it was actually trained to predict the next token, rather than the token in place).
+ * I would say that a bigger model might help; ignoring RVQ levels 1+ and solely focusing on NAR RVQ level 0 does not seem to matter.

   Some additional configurations have been explored, but experiments have not been fruitful:
   * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.
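
As an aside on the "next token" vs. "token in place" distinction mentioned in the added notes above, here is a minimal sketch of how the two loss targets differ for a non-autoregressive RVQ level 0 objective. The tensor names, shapes, and vocabulary size are illustrative assumptions, not taken from the repo:

```python
# Minimal sketch (not from the repo): how "next token" targets differ from
# "token in place" targets for a non-autoregressive RVQ level 0 objective.
# Shapes, names, and the vocabulary size (1024) are illustrative assumptions.
import torch
import torch.nn.functional as F

vocab = 1024
codes = torch.randint(0, vocab, (1, 8))   # hypothetical RVQ level 0 codes: (batch, time)
logits = torch.randn(1, 8, vocab)         # hypothetical model outputs: (batch, time, vocab)

# Erroneous "next token" objective: position t is scored against code t+1,
# i.e. targets are shifted left by one step (an autoregressive-style target).
next_token_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    codes[:, 1:].reshape(-1),
)

# Intended "token in place" objective: position t is scored against code t itself,
# with no shift, as expected for a NAR level-0 predictor.
in_place_loss = F.cross_entropy(
    logits.reshape(-1, vocab),
    codes.reshape(-1),
)

print(next_token_loss.item(), in_place_loss.item())
```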