---
license: agpl-3.0
---

This repo catalogs my weights for use with my [VALL-E](https://github.com/e-c-k-e-r/vall-e) implementation as I try and iron out the kinks.

The model currently is in a *usable* state under `ar+nar-llama-8` (the default model that's downloaded).

To reiterate, this is ***by no means*** complete. I am not passing this off as competitive.

## Models

This repo contains the following configurations under `./models/`:

* `config.retnet.yaml` / `ar+nar-retnet-8`: The prior weights, which use a RetNet as the underlying architecture.
  + Prior testing showed that longer prompt durations result in better utterances.
  + *Can* benefit from additional training, but I recall the average loss being around `1.9` to `2.1`.
  + However, due to regressions (or bias from working under `llama`), I don't think I can optimally train with a RetNet again (both in terms of VRAM consumption and throughput).
    - I would love to revisit this with my more-better-er training paradigms.
  + Currently does not seem to work anymore due to regressions in the code.
* `config.llama.yaml` / `ar+nar-llama-8`: The most recent-ishly trained weights after learning from my mistakes.
  + This configuration utilizes Llama's attention-based transformer as the underlying architecture, making use of creature comforts like RoPE, GQA, and memory-efficient attention (trained under `xformers`, which shouldn't really affect things).
  + Prompt and response embeddings ARE summed (half the model was trained without summing, but enabling it seemed to make the most sense, and doing so didn't affect anything); a rough sketch of this follows the list.
    - However, the opposite is not true: a model trained with summed embeddings does not function with summing disabled.
  + Utilizes a HF tokenizer for "optimal" vocab.
    - Optimal in the sense that it uses the remaining portion of the 256 indices for merged phonemes (although I imagine it would be better NOT to merge, as the model's focus isn't phoneme output).
  + The current RVQ level is included as a token as well, to help guide NAR tasks better.
  + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and make the model inference better for longer utterances.
  + Some sessions end up training the current duration window for a few epochs, but I don't know how much it affected things.
* `config.llama[layerskip].yaml` / `ar+nar-layerskip-llama-8`: The above, but with very brief training for LayerSkip:
  + Post-trained on a small English subset of Emilia and a small private corpus, plus Japanese+French+German from Emilia.
  + Using shuffled batches (where each batch has the same durations) and a modified `rvq_levels_p` to help the NAR.
  + Initially trained with LayerSkip hyperparameters `R=4` and `e_scale=0.2`, but midway through swapped to `R=2` and `e_scale=0.1` to maintain stability.
  + This model received LayerSkip-aware training, with layer dropout and early-exit loss to help bolster the model and enable self-speculation sampling (a sketch of the early-exit loss follows this list).
  + I *need* to do heavy evaluation against the base model, to ensure output quality does not drop, before considering replacing the base model with this.
  + Goal is to utilize self-speculation sampling to enable speedups when possible.
    - Current implementation will early-exit if the entropy/varentropy of the logits is low enough (<0.1); a sketch of this check follows the list as well.
    - Speedups seem to shave off about a second of inference time.
  + Training is a pain.
    - LayerSkip-aware training does *not* like to train under ROCm.
    - Training under float16+AMP with loss scaling will fry the model with a large enough de facto batch size (>512 samples per update step) and/or too low of a loss scale (<=8K).
    - LayerSkip-aware training seems to degrade the model enough that it harms the model's ability to sound similar to the reference prompt the more it trains.
  + I imagine this technique only really works for "large" enough models (be it wide and/or deep enough) that second-guess themselves in the later layers.
    - The current size of VALL-E doesn't seem to necessitate LayerSkip, as it seems to instead dumb the model down to ~9 layers instead of 12 (it typically exits early at layer 9, and the remaining layers offer little additional benefit).
    - This *does* seem to prove a nice way to shrink models, and perhaps even grow them? I remember finding that trying to grow a model leaves the extra layers useless.
  + Unless I get a revelation, this experiment is bunk unless it can magically live through a LoRA.
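For the curious, here's a rough sketch of what "summing the embeddings" and the RVQ-level token look like in practice. This is *not* the actual code from my implementation; the class and names are made up for illustration, and it assumes the summing happens across the RVQ levels of the audio codes:

```python
# Illustrative sketch only -- not the repo's actual code.
import torch
import torch.nn as nn

class AudioEmbeddingSketch(nn.Module):  # hypothetical name
    def __init__(self, n_levels=8, codebook_size=1024, d_model=1024):
        super().__init__()
        # one embedding table per RVQ level
        self.levels = nn.ModuleList(
            [nn.Embedding(codebook_size, d_model) for _ in range(n_levels)]
        )
        # the current RVQ level is embedded as its own token to guide NAR tasks
        self.rvq_level_emb = nn.Embedding(n_levels, d_model)

    def forward(self, codes: torch.Tensor, quant_level: int) -> torch.Tensor:
        # codes: [batch, frames, levels-so-far] of RVQ token ids
        # sum the per-level embeddings instead of only embedding a single level
        x = sum(self.levels[l](codes[..., l]) for l in range(codes.shape[-1]))
        # prepend a token telling the model which RVQ level it is working on
        level = torch.full((codes.shape[0], 1), quant_level, device=codes.device)
        return torch.cat([self.rvq_level_emb(level), x], dim=1)

emb = AudioEmbeddingSketch()
codes = torch.randint(0, 1024, (2, 75, 3))  # two utterances, 75 frames, levels 0..2
h = emb(codes, quant_level=3)               # -> [2, 76, 1024]
```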
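Likewise, a very loose sketch of what the LayerSkip-aware training objective adds on top of the normal loss (the layer-dropout half is omitted). The rotation scheme and depth weighting are assumptions modeled after the LayerSkip paper; `R` and `e_scale` correspond to the hyperparameters mentioned above:

```python
# Illustrative sketch only -- not the repo's actual code.
import torch.nn.functional as F

def early_exit_loss(hidden_states, targets, lm_head, norm, step, R=2, e_scale=0.1):
    """Auxiliary loss supervising intermediate layers through the shared LM head."""
    L = len(hidden_states)
    total, weight = 0.0, 0.0
    for i, h in enumerate(hidden_states):
        # rotational curriculum: only a rotating subset of layers is supervised each step
        if R > 1 and (i % R) != (step % R):
            continue
        logits = lm_head(norm(h))      # reuse the final norm + LM head for every layer
        loss_i = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        scale = e_scale * (i + 1) / L  # weight deeper layers more heavily
        total += scale * loss_i
        weight += scale
    return total / max(weight, 1e-8)
```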
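And the gist of the early-exit check at inference time. Also a simplification: it assumes both the entropy and the varentropy of the current layer's logits need to fall under the threshold before skipping the remaining layers:

```python
# Illustrative sketch only -- not the repo's actual code.
import torch

def should_early_exit(logits: torch.Tensor, threshold: float = 0.1) -> bool:
    # logits: [vocab] for the token currently being decoded at some layer
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1)                      # how uncertain the prediction is
    varentropy = (probs * (log_probs + entropy) ** 2).sum(-1)   # how spread out that uncertainty is
    # confident enough at this layer -> skip the remaining layers and self-speculate
    return bool(entropy < threshold) and bool(varentropy < threshold)
```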
Some additional configurations have been explored, but experiments have not been fruitful:
* Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.