Update README.md
README.md
CHANGED
@@ -81,14 +81,29 @@ This repo contains the following configurations under `./models/`:
    + The current size of VALL-E doesn't seem to necessitate LayerSkip, as it seems to instead dumb the model down to ~9 layers instead of 12 (it typically exits early at layer 9, and the remaining layers offer little additional benefit).
    + This *does* seem to offer a nice way to shrink models, and perhaps even grow them? I remember finding that trying to grow a model causes the extra layers to be useless.
  * Unless I get a revelation, this experiment is bunk unless it can magically live through a LoRA.
+  * Experiments have shown that this actively harms the model for a very negligible speed gain, as LayerSkip-aware training shifts most of the intelligence down a few layers and keeps the last couple of layers to further refine the confidence of the outputs, or something.
+  * Despite being a failure, this does pave a nice way to shrink models from an existing one (see the early-exit sketch after this list). However, it does not seem to be that useful, as dropping even two or three layers really does harm how well the prompt is followed.
+
+* `config.llama[nar-len].yaml` / `nar-len-llama-8`: A fully non-autoregressive model.
+  * These weights are a work in progress, but they currently serve as a good proof-of-concept until training is on par with the base `ar+nar-llama-8` model.
+  * A ***lot*** of pain was put into trying to get something working, from implementation issues to dumb mistakes, until the best option of just training from scratch was picked.
+  * Technically, the `ar+nar-llama-8` can be modified to be a pure non-autoregressive model, but I needed to start from scratch before dumping more time into trying to adapt it.
+  * Speedups are immense compared to the `ar+nar-llama-8`, as the entire audio output is decoded in parallel rather than causally.
+  * Throughput and memory usage should be constant between inferencing steps.
+  * The model only needs to be invoked about 5+25+7 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels) instead (see the demasking sketch after this list).
+  * Unlike the base model, this is trained on the current dataset without iteratively dripfeeding additional sources (like tacking on Emilia afterwards).
+  * Weights will be added as the model is trained.
+
+* `config.llama[experimental].yaml` / `ar+nar-experimental-llama-8`: A salvaged experiment of `ar+nar-llama-8`.
+  * These weights were from an oversight in trying to train a fully non-autoregressive model.
+  * Demasking was trained autoregressively instead of non-autoregressively, making this error possibly salvageable for the base model.
+  * This *might* have better output by accounting for possible errors from prior tokens, making it more robust, in theory.
+  * The theory is that, since training was on tokens being randomly masked off, the model should learn to account for corrupted prior tokens.
+  * These weights right now need to be "fixed" with proper, normal training before replacing the original reference model.
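
The LayerSkip notes above amount to early exit: run only the first N transformer blocks and decode from there. Below is a minimal, hypothetical sketch of what "exiting at layer 9 of 12" looks like at inference time; `embed`, `layers`, `norm`, and `lm_head` are illustrative attribute names, not this repo's actual model API.

```python
import torch

@torch.no_grad()
def forward_with_early_exit(model, tokens, exit_layer: int = 9):
    """Decode logits from an intermediate layer instead of the full stack."""
    h = model.embed(tokens)
    for i, block in enumerate(model.layers):
        if i >= exit_layer:
            break                          # the remaining blocks are skipped entirely
        h = block(h)
    return model.lm_head(model.norm(h))    # LayerSkip-style training shares the head across depths

# "Shrinking" an existing model is then just dropping the unused tail of layers:
# model.layers = model.layers[:exit_layer]
```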
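
To make the 5+25+7 invocation budget concrete, here is a rough, hypothetical sketch of a fully non-autoregressive decode built around iterative parallel demasking (an assumption about the approach, going off the "demasking" wording above). `predict_duration` and `predict_level` are placeholder names, not functions from this repo, and the step counts simply mirror the numbers quoted above.

```python
import torch

MASK = -1  # placeholder id for a masked RVQ-level-0 position

@torch.no_grad()
def nar_len_inference(model, text, prompt, duration_steps=5, demask_steps=25, n_levels=8):
    # ~5 invocations: infer how many audio tokens to generate
    length = None
    for _ in range(duration_steps):
        length = model.predict_duration(text, prompt, prev=length)

    # ~25 invocations: iterative parallel demasking of RVQ level 0.
    # Every step predicts all positions at once; the most confident predictions
    # are committed and the rest stay masked for the next step.
    codes0 = torch.full((length,), MASK)
    for step in range(1, demask_steps + 1):
        logits = model.predict_level(text, prompt, codes0, level=0)
        conf, cand = logits.softmax(dim=-1).max(dim=-1)
        conf[codes0 != MASK] = float("inf")                      # keep committed tokens committed
        keep = conf.topk(max(1, length * step // demask_steps)).indices
        codes0[keep] = torch.where(codes0[keep] == MASK, cand[keep], codes0[keep])

    # ~7 invocations: one parallel pass per remaining RVQ level, conditioned on prior levels
    codes = [codes0]
    for level in range(1, n_levels):
        logits = model.predict_level(text, prompt, torch.stack(codes), level=level)
        codes.append(logits.argmax(dim=-1))
    return torch.stack(codes)  # (n_levels, length), decoded in parallel rather than causally
```

The `ar+nar-experimental-llama-8` weights above presumably come from training this demasking step autoregressively by mistake, which is why the error might still be salvageable for the base model.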

Some additional configurations have been explored, but experiments have not been fruitful:
* Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
* Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing the compute at another ~~meme~~ arch where I can't easily make use of all the other tech I have to throw at it.
-* a pure NAR (plus length predictor) cannot be realized with the current architecture.
-  + Transformer-based (or at least attention-based) models can't seem to handle generating the initial (RVQ level 0) tokens from "thin air" (be it special tokens or repeating the input prompt).
-  + A diffusion-based model will definitely work, as those are good at generating from noise.
-  + The performance gains seem nice, as the biggest "bottleneck" is the initial (RVQ level 0) AR pass, but it seems to require a lot of effort.
* a model using [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/):
  + the 24KHz model will *not* converge no matter what. However, naively using just the first 8 RVQ levels might not be good enough, as there are too many codebooks for viable use.
  + the 44KHz model was erroneously assumed to be an even 44KHz, when in reality it's 44.1KHz. *All* of my audio has to be requantized, as there's some stuttering in it (see the resampling sketch below).
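
Since the 44KHz DAC model actually runs at 44.1KHz, requantizing simply means resampling each clip to the codec's reported sample rate before encoding. A minimal sketch using the descript-audio-codec README-style API follows (exact calls may differ between versions):

```python
import dac
from audiotools import AudioSignal

# load the 44kHz (really 44.1kHz) pretrained codec
model = dac.DAC.load(dac.utils.download(model_type="44khz"))

signal = AudioSignal("utterance.wav")
signal.resample(model.sample_rate)     # 44100 Hz, not an even 44000 Hz

dac_file = model.compress(signal)      # RVQ codes at the correct rate
dac_file.save("utterance.dac")
```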