Fraser committed on
Commit
6306a19
1 Parent(s): 0b69648

run training with easy setup

Files changed (5)
  1. README.md +5 -25
  2. requirements.txt +0 -3
  3. setup_venv.sh +19 -0
  4. train.py +5 -4
  5. train.sh +18 -0
README.md CHANGED
@@ -1,36 +1,16 @@
-# Transformer-VAE (flax) (WIP)
+# T5-VAE-Python (flax) (WIP)
 
 A Transformer-VAE made using flax.
 
-Done as part of Huggingface community training ([see forum post](https://discuss.huggingface.co/t/train-a-vae-to-interpolate-on-english-sentences/7548)).
-
-Builds on T5, using an autoencoder to convert it into a VAE.
-
-[See training logs.](https://wandb.ai/fraser/flax-vae)
-
-## ToDo
-
-- [ ] Basic training script working. (Fraser + Theo)
-- [ ] Add MMD loss (Theo)
+It has been trained to interpolate on lines of Python code from the [python-lines dataset](https://huggingface.co/datasets/Fraser/python-lines).
 
-- [ ] Save a wikipedia sentences dataset to Huggingface (see original https://github.com/ChunyuanLI/Optimus/blob/master/data/download_datasets.md) (Mina)
-- [ ] Make a tokenizer using the OPTIMUS tokenized dataset.
-- [ ] Train on the OPTIMUS wikipedia sentences dataset.
-
-- [ ] Make Huggingface widget interpolating sentences! (???) https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects#how-to-build-a-demo
-
-Optional ToDos:
-
-- [ ] Add Funnel transformer encoder to FLAX (don't need weights).
-- [ ] Train a Funnel-encoder + T5-decoder transformer VAE.
+Done as part of Huggingface community training ([see forum post](https://discuss.huggingface.co/t/train-a-vae-to-interpolate-on-english-sentences/7548)).
 
-- [ ] Additional datasets:
-  - [ ] Poetry (https://www.gwern.net/GPT-2#data-the-project-gutenberg-poetry-corpus)
-  - [ ] 8-bit music (https://github.com/chrisdonahue/LakhNES)
+Builds on T5, using an autoencoder to convert it into an MMD-VAE.
 
 ## Setup
 
-Follow all steps to install dependencies from https://cloud.google.com/tpu/docs/jax-quickstart-tpu-vm
+Follow all steps to install dependencies from https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md#tpu-vm
 
 - [ ] Find dataset storage site.
 - [ ] Ask JAX team for dataset storage.
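The README now calls the model an MMD-VAE, but the loss itself is not shown in this commit (it lives in train.py's training step). As a rough, non-authoritative illustration of what an MMD term usually looks like in JAX, here is a minimal sketch that compares posterior latents against samples from a unit-Gaussian prior under an RBF kernel; the kernel choice, bandwidth, and array shapes are assumptions, not the repo's actual implementation.

```python
# Hypothetical sketch of an MMD term for an MMD-VAE (not the repo's code).
# Compares latent codes against samples from a unit-Gaussian prior.
import jax
import jax.numpy as jnp


def rbf_kernel(x, y, scale=1.0):
    # x: (n, d), y: (m, d) -> (n, m) kernel matrix
    sq_dists = jnp.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-sq_dists / (2.0 * scale * x.shape[-1]))


def mmd_loss(latents, rng, scale=1.0):
    # Biased MMD^2 estimate between the batch of latents and prior samples.
    prior = jax.random.normal(rng, latents.shape)
    return (
        rbf_kernel(latents, latents, scale).mean()
        + rbf_kernel(prior, prior, scale).mean()
        - 2.0 * rbf_kernel(latents, prior, scale).mean()
    )


# Example: a batch of 10 latents of size 32, matching the train.sh flags below.
z = jax.random.normal(jax.random.PRNGKey(1), (10, 32))
print(mmd_loss(z, jax.random.PRNGKey(0)))
```

Scheduling a weight on a term like this is what the remaining "schedule MMD loss weight" TODO in train.py refers to.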
requirements.txt DELETED
@@ -1,3 +0,0 @@
-jax
-jaxlib
--r requirements-tpu.txt
setup_venv.sh ADDED
@@ -0,0 +1,19 @@
+# setup training on a TPU VM
+rm -fr venv
+python3 -m venv venv
+source venv/bin/activate
+pip install -U pip
+pip install -U wheel
+pip install requests
+pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
+
+cd ..
+git clone https://github.com/huggingface/transformers.git
+cd transformers
+pip install -e ".[flax]"
+cd ..
+
+git clone https://github.com/huggingface/datasets.git
+cd datasets
+pip install -e ".[streaming]"
+cd ..
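setup_venv.sh installs the TPU build of JAX plus editable transformers and datasets checkouts. Before launching training it is worth confirming the TPU runtime is actually visible from inside the venv; a minimal check (the expected output assumes a correctly provisioned TPU VM):

```python
# Run inside the venv created by setup_venv.sh to confirm JAX can see the TPU.
import jax

print(jax.default_backend())  # expected to report "tpu" on a TPU VM
print(jax.devices())          # should list the TPU device(s)
```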
train.py CHANGED
@@ -2,8 +2,6 @@
 Pre-training/Fine-tuning seq2seq models on autoencoding a dataset.
 
 TODO:
-- [x] Get this running.
-- [x] Don't make decoder input ids.
 - [ ] Add reg loss
 - [x] calculate MMD loss
 - [ ] schedule MMD loss weight
@@ -87,6 +85,10 @@ class ModelArguments:
             "help": "Number of dimensions to use for each latent token."
         },
     )
+    add_special_tokens: bool = field(
+        default=False,
+        metadata={"help": "Add these special tokens to the tokenizer: {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}"},
+    )
     config_path: Optional[str] = field(
         default=None, metadata={"help": "Pretrained config path"}
     )
@@ -361,8 +363,7 @@ def main():
        model = FlaxT5VaeForAutoencoding.from_pretrained(
            model_args.model_name_or_path, config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype)
        )
-        # TODO assert token embedding size == len(tokenizer)
-        assert(model.params['t5']['shared'].shape[0] == len(tokenizer), "T5 Tokenizer doesn't match T5Vae embedding size.")
+        assert model.params['t5']['shared'].shape[0] == len(tokenizer), "T5 Tokenizer doesn't match T5Vae embedding size."
    else:
        vocab_size = len(tokenizer)
        config.t5.vocab_size = vocab_size
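Two of the train.py changes are about keeping the tokenizer and the T5 token-embedding matrix in sync: the new add_special_tokens flag and the repaired assert (the old `assert(cond, msg)` form asserted a non-empty tuple and could never fail). Below is a standalone sketch of that relationship using a stock t5-base checkpoint; the FlaxT5Model parameter path is the usual Flax T5 layout and the whole snippet is illustrative rather than the repo's code (the real check runs against the custom FlaxT5VaeForAutoencoding's params['t5']['shared'], as in the diff above).

```python
# Sketch of the tokenizer/embedding alignment that train.py's assert enforces.
# Uses stock t5-base for illustration; note that stock T5 checkpoints pad
# their embedding matrix, so the two sizes need not match out of the box.
from transformers import AutoTokenizer, FlaxT5Model

tokenizer = AutoTokenizer.from_pretrained("t5-base")
n_added = tokenizer.add_special_tokens(
    {"pad_token": "<PAD>", "bos_token": "<BOS>", "eos_token": "<EOS>"}
)
print(f"added {n_added} tokens; len(tokenizer) = {len(tokenizer)}")

model = FlaxT5Model.from_pretrained("t5-base")
# Usual Flax T5 parameter layout; verify against the custom VAE class.
print("embedding rows =", model.params["shared"]["embedding"].shape[0])
```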
train.sh CHANGED
@@ -0,0 +1,18 @@
+export RUN_NAME=single_latent
+
+./venv/bin/python train.py \
+    --t5_model_name_or_path="t5-base" \
+    --output_dir="output/${RUN_NAME}" \
+    --overwrite_output_dir \
+    --dataset_name="Fraser/python-lines" \
+    --do_train --do_eval \
+    --n_latent_tokens 1 \
+    --latent_token_size 32 \
+    --save_steps="2500" \
+    --eval_steps="2500" \
+    --block_size="32" \
+    --per_device_train_batch_size="10" \
+    --per_device_eval_batch_size="10" \
+    --overwrite_output_dir \
+    --num_train_epochs="1" \
+    --push_to_hub \
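train.sh points the run at the Fraser/python-lines dataset with a block size of 32 and pushes the result to the Hub. A quick way to eyeball that data before kicking off a run, using the datasets streaming support installed by setup_venv.sh; the records are printed whole because this commit does not show their column names.

```python
# Peek at the dataset train.sh trains on. Column names are not shown anywhere
# in this commit, so whole records are printed rather than guessing a field.
from datasets import load_dataset

stream = load_dataset("Fraser/python-lines", split="train", streaming=True)
for i, record in enumerate(stream):
    print(record)
    if i >= 4:
        break
```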