Update README.md
README.md
CHANGED
@@ -81,14 +81,29 @@ This repo contains the following configurations under `./models/`:
    + The current size of VALL-E doesn't seem to necessitate LayerSkip, as it seems to instead dumb the model down to ~9 layers instead of 12 (it typically exits early at layer 9, and the remaining layers offer little additional benefit).
    + This *does* seem to offer a nice way to shrink models, and perhaps even grow them? I remember finding that trying to grow a model causes the extra layers to be useless.
  * Unless I get a revelation, this experiment is bunk unless it can magically live through a LoRA.
+  * Experiments have shown that this actively harms the model for a very negligible speed gain, as LayerSkip-aware training shifts most of the intelligence down a few layers and keeps the last couple of layers to further refine the confidence of the outputs, or something.
+  * Despite being a failure, this does pave a nice way to shrink models from an existing one (see the early-exit sketch after this list). However, it does not seem to be that useful, as dropping even two or three layers really does harm how well the prompt is followed.
+
+* `config.llama[nar-len].yaml` / `nar-len-llama-8`: A fully non-autoregressive model.
+  * These weights are a work in progress, but they currently serve as a good proof-of-concept until training is on par with the base `ar+nar-llama-8` model.
+  * A ***lot*** of pain was put into trying to get something working, from implementation issues to dumb mistakes, until the best option of just training from scratch was picked.
+  * Technically, the `ar+nar-llama-8` can be modified to be a pure non-autoregressive model, but I needed to start from scratch before dumping more time into trying to adapt it.
+  * Speedups are immense compared to the `ar+nar-llama-8`, as the entire audio output is decoded in parallel rather than causally.
+  * Throughput and memory usage should be constant between inferencing steps.
+  * The model only needs to be invoked about 5+25+7 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels) instead (see the demasking sketch after this list).
+  * Unlike the base model, this is trained on the current dataset without iteratively dripfeeding additional sources (like tacking on Emilia afterwards).
+  * Weights will be added as the model is trained.
+
+* `config.llama[experimental].yaml` / `ar+nar-experimental-llama-8`: A salvaged experiment of `ar+nar-llama-8`.
+  * These weights were from an oversight in trying to train a fully non-autoregressive model.
+  * Demasking was trained autoregressively instead of non-autoregressively, making this error possibly salvageable for the base model.
+  * This *might* have better output by accounting for possible errors from prior tokens, making it more robust, in theory.
+  * The theory is that, since training was on tokens being randomly masked off, the model should learn to account for corrupted prior tokens.
+  * These weights right now need to be "fixed" with proper, normal training before replacing the original reference model.
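
The LayerSkip notes above amount to early exit: run only the first N transformer blocks and decode from there. Below is a minimal, hypothetical sketch of what "exiting at layer 9 of 12" looks like at inference time; `embed`, `layers`, `norm`, and `lm_head` are illustrative attribute names, not this repo's actual model API.

```python
import torch

@torch.no_grad()
def forward_with_early_exit(model, tokens, exit_layer: int = 9):
    """Decode logits from an intermediate layer instead of the full stack."""
    h = model.embed(tokens)
    for i, block in enumerate(model.layers):
        if i >= exit_layer:
            break                          # the remaining blocks are skipped entirely
        h = block(h)
    return model.lm_head(model.norm(h))    # LayerSkip-style training shares the head across depths

# "Shrinking" an existing model is then just dropping the unused tail of layers:
# model.layers = model.layers[:exit_layer]
```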
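
To make the 5+25+7 invocation budget concrete, here is a rough, hypothetical sketch of a fully non-autoregressive decode built around iterative parallel demasking (an assumption about the approach, going off the "demasking" wording above). `predict_duration` and `predict_level` are placeholder names, not functions from this repo, and the step counts simply mirror the numbers quoted above.

```python
import torch

MASK = -1  # placeholder id for a masked RVQ-level-0 position

@torch.no_grad()
def nar_len_inference(model, text, prompt, duration_steps=5, demask_steps=25, n_levels=8):
    # ~5 invocations: infer how many audio tokens to generate
    length = None
    for _ in range(duration_steps):
        length = model.predict_duration(text, prompt, prev=length)

    # ~25 invocations: iterative parallel demasking of RVQ level 0.
    # Every step predicts all positions at once; the most confident predictions
    # are committed and the rest stay masked for the next step.
    codes0 = torch.full((length,), MASK)
    for step in range(1, demask_steps + 1):
        logits = model.predict_level(text, prompt, codes0, level=0)
        conf, cand = logits.softmax(dim=-1).max(dim=-1)
        conf[codes0 != MASK] = float("inf")                      # keep committed tokens committed
        keep = conf.topk(max(1, length * step // demask_steps)).indices
        codes0[keep] = torch.where(codes0[keep] == MASK, cand[keep], codes0[keep])

    # ~7 invocations: one parallel pass per remaining RVQ level, conditioned on prior levels
    codes = [codes0]
    for level in range(1, n_levels):
        logits = model.predict_level(text, prompt, torch.stack(codes), level=level)
        codes.append(logits.argmax(dim=-1))
    return torch.stack(codes)  # (n_levels, length), decoded in parallel rather than causally
```

The `ar+nar-experimental-llama-8` weights above presumably come from training this demasking step autoregressively by mistake, which is why the error might still be salvageable for the base model.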

Some additional configurations have been explored, but experiments have not been fruitful:
* Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
* Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing the compute at another ~~meme~~ arch where I can't easily make use of all the other tech I have to throw at it.
-* a pure NAR (plus length predictor) cannot be realized with the current architecture.
-  + Transformer-based (or at least attention-based) models can't seem to handle generating the initial (RVQ level 0) tokens from "thin air" (be it special tokens or repeating the input prompt).
-  + A diffusion-based model will definitely work, as those are good at generating from noise.
-  + The performance gains seem nice, as the biggest "bottleneck" is the initial (RVQ level 0) AR pass, but it seems to require a lot of effort.
* a model using [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/):
  + the 24KHz model will *not* converge no matter what. However, naively using just the first 8 RVQ levels might not be good enough, as there are too many codebooks for viable use.
  + the 44KHz model was erroneously assumed to be an even 44KHz, when in reality it's 44.1KHz. *All* of my audio has to be requantized, as there's some stuttering in it (see the resampling sketch below).
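
Since the 44KHz DAC model actually runs at 44.1KHz, requantizing simply means resampling each clip to the codec's reported sample rate before encoding. A minimal sketch using the descript-audio-codec README-style API follows (exact calls may differ between versions):

```python
import dac
from audiotools import AudioSignal

# load the 44kHz (really 44.1kHz) pretrained codec
model = dac.DAC.load(dac.utils.download(model_type="44khz"))

signal = AudioSignal("utterance.wav")
signal.resample(model.sample_rate)     # 44100 Hz, not an even 44000 Hz

dac_file = model.compress(signal)      # RVQ codes at the correct rate
dac_file.save("utterance.dac")
```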