Update README.md
README.md
@@ -69,7 +69,12 @@ This repo contains the following configurations under `./models/`:
Using shuffled batches (where each batch has the same durations) and a modified `rvq_levels_p` to help the NAR.
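For illustration, a duration-bucketed shuffling scheme along those lines could look like the sketch below; the `durations` list, the one-second bucketing, and the function name are made up for the example, not taken from this repo's dataloader.

```python
import random
from collections import defaultdict

def duration_bucketed_batches(durations, batch_size, seed=0):
    """Group sample indices so every batch shares (roughly) the same duration.

    `durations` is a hypothetical list of per-utterance durations in seconds;
    bucketing to the nearest second is an arbitrary choice for this sketch.
    """
    buckets = defaultdict(list)
    for idx, dur in enumerate(durations):
        buckets[round(dur)].append(idx)

    rng = random.Random(seed)
    batches = []
    for bucket in buckets.values():
        rng.shuffle(bucket)                      # shuffle within a duration bucket
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])

    rng.shuffle(batches)                         # shuffle the order the batches are seen in
    return batches
```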
This model received LayerSkip-aware training, with layer dropout and an early-exit loss to help bolster the model and enable self-speculative sampling.

I *need* to do heavy evaluation against the base model to ensure output quality does not drop before considering replacing the base model with this.
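A rough sketch of what LayerSkip-style training with layer dropout and an early-exit loss can look like is below; the module names (`layers`, `norm`, `lm_head`), the linear dropout schedule, and the loss weighting are assumptions for the example, not this repo's implementation.

```python
import torch
import torch.nn.functional as F

def layerskip_forward(model, x, targets, p_max=0.1, early_exit_weight=0.1):
    """LayerSkip-style pass: stochastically drop layers (deeper layers dropped
    more often) and add an auxiliary loss on early-exit predictions so shallow
    exits stay usable for self-speculative sampling.  Module names are hypothetical."""
    num_layers = len(model.layers)
    loss = 0.0
    hidden = x
    for i, layer in enumerate(model.layers):
        # layer dropout: drop probability scales linearly with depth
        p_drop = p_max * i / max(num_layers - 1, 1)
        if model.training and torch.rand(()) < p_drop:
            continue                                  # skip this layer entirely
        hidden = layer(hidden)

        # early-exit loss: supervise the shared LM head on intermediate states
        early_logits = model.lm_head(model.norm(hidden))
        loss = loss + early_exit_weight * F.cross_entropy(
            early_logits.flatten(0, 1), targets.flatten()
        )

    final_logits = model.lm_head(model.norm(hidden))
    loss = loss + F.cross_entropy(final_logits.flatten(0, 1), targets.flatten())
    return final_logits, loss
```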
It currently does not seem to perform better, even without early exit.

The goal is to utilize self-speculative sampling to enable speedups when possible.
The current implementation will exit early if the entropy/varentropy of the logits is low enough.

There don't seem to be any significant speedups so far.
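As a reference for the gating idea, an entropy/varentropy check over the logits might be sketched as follows; the thresholds are placeholders rather than the values actually used here.

```python
import torch
import torch.nn.functional as F

def should_early_exit(logits, entropy_threshold=0.1, varentropy_threshold=0.1):
    """Decide whether an intermediate layer's prediction is confident enough
    to exit on.  `logits` is the last-position logits, shape [vocab_size];
    the thresholds are placeholder values for this sketch."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # entropy: how uncertain the distribution is
    entropy = -(probs * log_probs).sum(dim=-1)
    # varentropy: how spread out that uncertainty is across tokens
    varentropy = (probs * (log_probs + entropy.unsqueeze(-1)) ** 2).sum(dim=-1)
    return bool((entropy < entropy_threshold) & (varentropy < varentropy_threshold))
```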
Training is a pain, as float16 + AMP will fry the model fast, and training in bfloat16 (with or without AMP) seems to harm the model overall.

I'd like to think more training time would help, but it doesn't seem to be worth it for a marginal speedup.
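For context, the two precision setups being compared map onto stock PyTorch AMP roughly as below; this is illustrative only, with `model`, `batch`, `targets`, and `optimizer` as stand-ins, and the scaler for the float16 path created as a `torch.cuda.amp.GradScaler`.

```python
import torch

def train_step_float16(model, batch, targets, optimizer, scaler):
    """float16 + AMP: needs a GradScaler so gradients don't underflow."""
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch, targets)   # hypothetical forward that returns a loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss

def train_step_bfloat16(model, batch, targets, optimizer):
    """bfloat16: no scaler needed (wider exponent range, less mantissa precision)."""
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch, targets)
    loss.backward()
    optimizer.step()
    return loss
```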
Some additional configurations have been explored, but experiments have not been fruitful:
* Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.