Commit 796c86d committed by ecker · 1 Parent(s): 35058f2

Update README.md

Files changed (1): README.md (+12 -5)
README.md CHANGED
@@ -4,9 +4,7 @@ license: agpl-3.0
 
 This repo catalogs my weights for use with my [VALL-E](https://github.com/e-c-k-e-r/vall-e) implementation as I try and iron out the kinks.
 
- The model currently is in a *semi-usable* state, and I'm releasing them now in hopes that it also helps jumpstart anyone else that wants to use them.
-
- To reiterate, this is ***by no means*** complete. I am not passing this off as competitive.
+ The model is currently in a *usable* state under `ar+nar-llama-8` (the default model that's downloaded).
 
 ## Models
 
@@ -23,12 +21,15 @@ This repo contains the following configurations under `./models/`:
 + Prior testing showed that longer prompt durations result in better utterances.
 + *Can* benefit from additional training, but I recall the average loss being around `1.9` to `2.1`.
 + However, due to regressions (or bias from working under `llama`), I don't think I can optimally train with a RetNet again (both in terms of VRAM consumption and throughput).
+ + I would love to revisit this with my more-better-er training paradigms.
 + Currently does not seem to work anymore due to regressions in the code.
 
 * `config.llama.yaml` / `ar+nar-llama-8`: The most recent-ishly trained weights after learning from my mistakes.
 + This configuration utilizes Llama's attention-based transformer as the underlying architecture, making use of creature comforts like RoPE, GQA, and memory-efficient attention (trained under `xformers`, which shouldn't really affect things).
 + Prompt and response embeddings ARE summed (half the model was trained without summing, but enabling it seemed to make the most sense, and doing so didn't affect anything).
+ + However, the opposite is not true: a model trained with summed embeddings does not function after disabling this.
 + Utilizes an HF tokenizer for an "optimal" vocab.
+ + Optimal in the sense that it uses the remaining portion of the 256 indices for merged phonemes (although I imagine it would be better NOT to merge, as the model's focus isn't on phoneme output).
 + The current RVQ level is included as a token as well, to help better guide NAR tasks.
 + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and make the model better at inferencing longer utterances.
 + Some sessions end up training the current duration window for a few epochs, but I don't know how much it affected things.
@@ -67,13 +68,19 @@ This repo contains the following configurations under `./models/`:
 * `config.llama[layerskip].yaml` / `ar+nar-layerskip-llama-8`: The above, but with very brief training for LayerSkip:
 + Post-trained on a small English subset of Emilia and a small private corpus, plus Japanese+French+German from Emilia.
 + Uses shuffled batches (where each batch has the same durations) and a modified `rvq_levels_p` to help the NAR.
+ + Initially trained with LayerSkip hyperparameters `R=4` and `e_scale=0.2`, but midway through swapped to `R=2` and `e_scale=0.1` to maintain stability.
 + This model received LayerSkip-aware training, with layer dropout and early-exit loss to help bolster the model and enable self-speculation sampling.
- + I *need* to do heavy evaluation against the base model to ensure output quality does not drop before considering replacing the base model with this.
 + Goal is to utilize self-speculation sampling to enable speedups when possible.
- + Current implementation will early-exit if the entropy/varentropy of the logits are low enough
+ + Current implementation will early-exit if the entropy/varentropy of the logits are low enough (<0.1).
+ + Speedups seem to shave off about a second of inference time.
 + Training is a pain.
 + LayerSkip-aware training does *not* like to train under ROCm.
 + Training under float16+AMP with loss scaling will fry the model with a large enough de facto batch size (>512 samples/update step) and/or too low of a loss scale (<=8K).
+ + LayerSkip-aware training seems to degrade the model enough that it harms the model's ability to sound similar to the reference prompt the more it trains.
+ + I imagine this technique only really works for "large" enough models (wide and/or deep enough) that may second-guess themselves in the later layers.
+ + The current size of VALL-E doesn't seem to necessitate LayerSkip, as it instead dumbs the model down to ~9 layers rather than 12 (it typically exits early at layer 9, and the remaining layers offer little additional benefit).
+ + This *does* seem to be a nice way to shrink models, and perhaps even grow them? I remember finding that trying to grow a model causes the extra layers to be useless.
+ * Unless I get a revelation, this experiment is bunk unless it can magically live through a LoRA.
 
 Some additional configurations have been explored, but experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.
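The early-exit bullet in the hunk above only states the criterion (entropy/varentropy of the logits below ~0.1). As a hedged illustration of what such a check can look like, here is a small PyTorch sketch; the function name, the reduction over positions, and the shared threshold are assumptions rather than the repo's actual logic.

```python
import torch
import torch.nn.functional as F

def should_early_exit(logits: torch.Tensor, threshold: float = 0.1) -> bool:
    """Sketch of an entropy/varentropy gate for early exit at an intermediate layer.

    logits: [..., vocab] logits from the intermediate layer's LM head.
    Returns True when every position is confident enough to stop early.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy of the predicted distribution.
    entropy = -(probs * log_probs).sum(dim=-1)
    # Varentropy: variance of the surprisal (-log p) under that same distribution.
    varentropy = (probs * (-log_probs - entropy.unsqueeze(-1)) ** 2).sum(dim=-1)
    return bool((entropy < threshold).all() and (varentropy < threshold).all())
```

Low entropy means the intermediate distribution is already peaked, and low varentropy means that confidence is consistent, which is the usual justification for exiting early and letting self-speculation verify the draft.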
 
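For context on the float16+AMP note above: the "loss scale" is the multiplier that PyTorch's `GradScaler` applies to the loss before backward, so small float16 gradients don't underflow. A hedged sketch (illustrative values and a stand-in model, not the actual training loop) of starting that scale well above the ~8K danger zone and keeping an eye on it:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch (requires a CUDA device); values are illustrative, not the repo's settings.
scaler = torch.cuda.amp.GradScaler(
    init_scale=2.0 ** 16,   # start well above the ~8K danger zone mentioned above
    growth_factor=2.0,
    backoff_factor=0.5,     # halves the scale whenever a step produces inf/NaN grads
    growth_interval=2000,   # clean steps required before the scale is allowed to grow
)

model = torch.nn.Linear(16, 16).cuda()  # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(batch: torch.Tensor, target: torch.Tensor):
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = F.mse_loss(model(batch), target)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(opt)               # unscales grads, skips the step on inf/NaN
    scaler.update()                # grows or backs off the scale
    return loss.detach(), scaler.get_scale()
```

If `scaler.get_scale()` keeps backing off toward or below 8192 over a run, that is roughly the failure mode the bullet above warns about.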