ecker committed on
Commit cbdbdab
1 Parent(s): 4554bf4

Update README.md

Files changed (1):
  1. README.md +5 -1
README.md CHANGED
@@ -61,6 +61,8 @@ This repo contains the following configurations under `./models/`:
   * Despite the model *technically* receiving some (wrong) training for this modality, it does work well enough from an existing model, albeit not with quality on par with the base AR+NAR modality.
   * Weights will update as training progresses for NAR-len, and may pivot to become the default modality.
   * If all goes well, these weights will revert back to the original snapshot, while the reference model will be renamed to `ar+nar-len-llama-8` instead.
+  * Training a LoRA under the `NAR-len` modality does work, but it is still somewhat susceptible to the lesser quality of the base `NAR-len` outputs.
+    * In other words, finetuning for a specific speaker doesn't fully fix the quality issue.
 
 * ~~`config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`~~: The above, but partially trained for STT.
   + These weights use the above weights but with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
@@ -134,4 +136,6 @@ This repo also contains some LoRAs to serve as a reference under `./loras/`.
 Using a LoRA is the same as using a base model, except you're required to have the base model already (obviously). Just load from the LoRA's config YAML instead to use it.
 
 The only caveat is that my original dataset *does* contain (most of) these samples already, but given its sheer size, they're probably underutilized.
-* However, the base model already has *almost adequate* output from these speakers, but not enough to be satisfactory.
+* However, the base model already has *almost adequate* output from these speakers, but not enough to be satisfactory.
+
+LoRAs under `ckpt[ar+nar-old-llama-8]` are LoRAs married to an older checkpoint, while `ckpt` *should* work under the reference model.
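The 3:1 `tts`:`stt` task ratio mentioned above can be pictured as a weighted task sampler at batch-construction time. This is only a minimal illustrative sketch — the function and constant names are hypothetical and not taken from this repo's training code:

```python
import random

# Hypothetical task weights reflecting the 3:1 tts:stt ratio described above.
TASK_WEIGHTS = {"tts": 3, "stt": 1}

def sample_task(rng: random.Random) -> str:
    """Pick a training task with probability proportional to its weight."""
    tasks = list(TASK_WEIGHTS)
    weights = [TASK_WEIGHTS[t] for t in tasks]
    return rng.choices(tasks, weights=weights, k=1)[0]

# Sanity check: over many draws, tts should appear roughly 3x as often as stt.
rng = random.Random(0)
counts = {"tts": 0, "stt": 0}
for _ in range(10_000):
    counts[sample_task(rng)] += 1
```

With these weights, about 75% of sampled batches would train the default `tts` task and about 25% the new `stt` task.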