rynmurdock committed on
Commit
c5ca37a
1 Parent(s): 5b8f2e0
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50)
  1. Optimus/.gitignore +8 -0
  2. Optimus/README.md +121 -0
  3. Optimus/code/README.md +41 -0
  4. Optimus/code/app.py +0 -0
  5. Optimus/code/examples/README.md +392 -0
  6. Optimus/code/examples/__pycache__/utils_glue.cpython-37.pyc +0 -0
  7. Optimus/code/examples/big_ae/__pycache__/grad_app.cpython-310.pyc +0 -0
  8. Optimus/code/examples/big_ae/__pycache__/utils.cpython-37.pyc +0 -0
  9. Optimus/code/examples/big_ae/debug_data.py +6 -0
  10. Optimus/code/examples/big_ae/eval_dialog_multi_response.py +378 -0
  11. Optimus/code/examples/big_ae/eval_dialog_response.py +295 -0
  12. Optimus/code/examples/big_ae/grad_app.py +486 -0
  13. Optimus/code/examples/big_ae/metrics.py +196 -0
  14. Optimus/code/examples/big_ae/modules/__init__.py +7 -0
  15. Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-310.pyc +0 -0
  16. Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-37.pyc +0 -0
  17. Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-310.pyc +0 -0
  18. Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-37.pyc +0 -0
  19. Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-310.pyc +0 -0
  20. Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-37.pyc +0 -0
  21. Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-310.pyc +0 -0
  22. Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-37.pyc +0 -0
  23. Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-310.pyc +0 -0
  24. Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-37.pyc +0 -0
  25. Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-310.pyc +0 -0
  26. Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-37.pyc +0 -0
  27. Optimus/code/examples/big_ae/modules/arae.py +274 -0
  28. Optimus/code/examples/big_ae/modules/cara.py +374 -0
  29. Optimus/code/examples/big_ae/modules/ctrl_gen.py +371 -0
  30. Optimus/code/examples/big_ae/modules/decoders/dec_gpt2.py +358 -0
  31. Optimus/code/examples/big_ae/modules/decoders/decoder.py +79 -0
  32. Optimus/code/examples/big_ae/modules/encoders/__init__.py +1 -0
  33. Optimus/code/examples/big_ae/modules/encoders/enc_lstm.py +126 -0
  34. Optimus/code/examples/big_ae/modules/encoders/encoder.py +58 -0
  35. Optimus/code/examples/big_ae/modules/encoders/gaussian_encoder.py +147 -0
  36. Optimus/code/examples/big_ae/modules/spacefusion.py +143 -0
  37. Optimus/code/examples/big_ae/modules/utils.py +40 -0
  38. Optimus/code/examples/big_ae/modules/vae.py +638 -0
  39. Optimus/code/examples/big_ae/run_data_filtering.py +507 -0
  40. Optimus/code/examples/big_ae/run_dialog_dataloader.py +483 -0
  41. Optimus/code/examples/big_ae/run_encoding_generation.py +487 -0
  42. Optimus/code/examples/big_ae/run_generation_from_prior.py +414 -0
  43. Optimus/code/examples/big_ae/run_gpt2_generation.py +390 -0
  44. Optimus/code/examples/big_ae/run_latent_generation.py +577 -0
  45. Optimus/code/examples/big_ae/run_lm_ae_pretraining.py +692 -0
  46. Optimus/code/examples/big_ae/run_lm_causal_pretraining.py +692 -0
  47. Optimus/code/examples/big_ae/run_lm_finetuning_baseline.py +573 -0
  48. Optimus/code/examples/big_ae/run_lm_gpt2_training.py +658 -0
  49. Optimus/code/examples/big_ae/run_lm_vae_label_ctrl_gen.py +875 -0
  50. Optimus/code/examples/big_ae/run_lm_vae_pretraining.py +669 -0
Optimus/.gitignore ADDED
@@ -0,0 +1,8 @@
1
+ data/datasets/glue_data/glue_data
2
+ data/datasets/glue_data/train.tx
3
+ data/datasets/glue_data/cached_lm_gpt_bert_256_train.jsont
4
+ code/runs
5
+ output/*
6
+ code/pytorch_transformers/__pycache__/*
7
+ code/examples/big_ae/modules/encoders/__pycache__/*
8
+
Optimus/README.md ADDED
@@ -0,0 +1,121 @@
1
+ # Optimus: the first pre-trained Big VAE language model <img src="doc/figs/logo_optimus.png" width="100" align="right">
2
+
3
+ This repository contains source code necessary to reproduce the results presented in the EMNLP 2020 paper [Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space](https://arxiv.org/abs/2004.04092).
4
+
5
+
6
+ |<img src="doc/figs/optimus_scheme.png" width="350"> | <img src="doc/figs/headfig_optimus.png" width="800">
7
+ |-------------------------|:-------------------------:|
8
+ | The network architecture of Optimus: encoder for representation learning and decoder for generation | Sentences are organized and manipulated in a pre-trained compact and smooth latent space
9
+
10
+
11
+ For more on this project, see the [Microsoft Research Blog post](https://www.microsoft.com/en-us/research/blog/a-deep-generative-model-trifecta-three-advances-that-work-towards-harnessing-large-scale-power/).
12
+
13
+
14
+ ## News
15
+
16
+ May 21, 2020: Releasing a [`demo`](http://40.71.23.172:8899/) for latent space manipulation, including sentence interpolation and analogy. Check out the [`website`](http://40.71.23.172:8899/).
17
+
18
+ May 20, 2020: The latent space manipulation code is cleaned and released. See instructions at [`optimius_for_snli.md`](doc/optimius_for_snli.md).
19
+
20
+ May 13, 2020: The fine-tuning code for language modeling is released. See instructions at [`optimus_finetune_language_models.md`](doc/optimus_finetune_language_models.md)
21
+
22
+ ## Contents
23
+ There are four steps to use this codebase to reproduce the results in the paper.
24
+
25
+ 1. [Dependencies](#dependencies)
26
+ 2. [Prepare datasets](#prepare-datasets)
27
+ 3. [Model training](#Model-training)
28
+ 1. Pre-training on sentences in Wikipedia
29
+ 2. Language Modeling
30
+ 3. Guided Language Generation
31
+ 4. Low-resource Language Understanding
32
+ 4. [Collect and plot results](#collect-and-plot-results)
33
+
34
+
35
+ ## Dependencies
36
+
37
+ Pull the docker image from Docker Hub: `chunyl/pytorch-transformers:v2`. Please see the instructions at [`doc/env.md`](doc/env.md)
38
+
39
+ The project is organized into the following structure, with the essential files & folders shown. `output` stores the model checkpoints.
40
+ ```
41
+ ├── Optimus
42
+     └── code
43
+         ├── examples
44
+             ├── big_ae
45
+                 ├── modules
46
+                     ├── vae.py
47
+                     └── ...
48
+                 ├── run_lm_vae_pretraining_phdist_beta.py
49
+                 ├── run_lm_vae_training.py
50
+                 └── ...
51
+         ├── pytorch_transformers
52
+             ├── modeling_bert.py
53
+             ├── modeling_gpt2.py
54
+             └── ...
55
+         ├── scripts
56
+             ├── scripts_docker
57
+             ├── scripts_local
58
+             ├── scripts_philly
59
+     └── data
60
+         └── datasets
61
+             ├── wikipedia_json_64_filtered
62
+                 └── ...
63
+             ├── snli_data
64
+                 └── ...
65
+     └── output
66
+         ├── pretrain
67
+         ├── LM
68
+         └── ...
69
+ ```
70
+
71
+ ## Prepare Datasets
72
+
73
+ Please download or prepare the data by following the instructions at [`data/download_datasets.md`](data/download_datasets.md).
74
+
75
+ ## Model Training
76
+
77
+ **1. Pre-training on sentences in Wikipedia**
78
+
79
+ We pre-trained our models on Philly (a Microsoft internal compute cluster); the code is specialized for multi-node, multi-GPU compute on this platform. The main pre-training Python script is [`run_lm_vae_pretraining_phdist_beta.py`](code/examples/big_ae/run_lm_vae_pretraining_phdist_beta.py). You may need to adjust the distributed training scripts.
80
+
81
+ **2. Language Modeling**
82
+
83
+ To allow a fair comparison with existing VAE language models, we consider a model with latent dimension 32. The pre-trained model is fine-tuned on four commonly used datasets for one epoch. Please see the details at [`doc/optimus_finetune_language_models.md`](doc/optimus_finetune_language_models.md)
84
+
85
+ **3. Guided Language Generation**
86
+
87
+
88
+ **Latent Space Manipulation** To ensure good performance, we consider a model with latent dimension 768. The pre-trained model is fine-tuned on the SNLI dataset, where sentences show related patterns.
89
+ Please see the details at [`doc/optimius_for_snli.md`](doc/optimius_for_snli.md)
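+
+ As a minimal sketch of what these manipulations do in the latent space (the function names and the raw 768-d latent codes below are illustrative assumptions, not the released API; the actual commands are described in [`doc/optimius_for_snli.md`](doc/optimius_for_snli.md)):
+
+ ```python
+ import torch
+
+ def interpolate(z_a, z_b, steps=10):
+     # Latent codes on the straight line between two sentences' 768-d vectors;
+     # each intermediate code is then decoded back into a sentence.
+     return [(1 - t) * z_a + t * z_b for t in torch.linspace(0, 1, steps)]
+
+ def analogy(z_a, z_b, z_c):
+     # "A is to B as C is to D": apply the latent offset from A to B onto C,
+     # then decode the resulting code.
+     return z_c + (z_b - z_a)
+ ```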
90
+
91
+ **4. Low-resource Language Understanding**
92
+
93
+ ## Collect and Plot Results
94
+
95
+ Once the networks are trained and the results are saved, we extract key results using a Python script. The results can be plotted using the included IPython notebook `plots/main_plots.ipynb`.
96
+ Start the IPython Notebook server:
97
+
98
+ ```
99
+ $ cd plots
100
+ $ ipython notebook
101
+ ```
102
+
103
+ Select the `main_plots.ipynb` notebook and execute the included
104
+ code. Note that we have copied our extracted results into the notebook, so without modification the script will output the figures in the paper. If you've run your own training and wish to plot results, you'll have to organize your results in the same format instead.
105
+
106
+
107
+ ## Questions?
108
+
109
+ Please drop me ([Chunyuan](http://chunyuan.li/)) a line if you have any questions.
110
+
111
+
112
+ ```
113
+ @inproceedings{li2020_Optimus,
114
+ title={Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space},
115
+ author={Li, Chunyuan and Gao, Xiang and Li, Yuan and Li, Xiujun and Peng, Baolin and Zhang, Yizhe and Gao, Jianfeng},
116
+ booktitle={EMNLP},
117
+ year={2020}
118
+ }
119
+ ```
120
+
121
+
Optimus/code/README.md ADDED
@@ -0,0 +1,41 @@
1
+ ## Set up Environment
2
+
3
+ Pull docker from Docker Hub at: chunyl/pytorch-transformers:v2
4
+
5
+ Edit the project path to the absolute path on your computer by changing the "SCRIPTPATH" in [run_docker.sh](./scripts/scripts_docker/run_docker.sh)
6
+
7
+ In this directory ("code"), run docker:
8
+
9
+ sh scripts/scripts_docker/run_docker.sh
10
+
11
+
12
+
13
+
14
+ ## Fine-tune Language Models
15
+
16
+ sh scripts/scripts_local/run_ft_lm_vae_optimus.sh
17
+
18
+
19
+ The main training script is [`run_lm_vae_training.py`](./examples/big_ae/run_lm_vae_training.py) and conducts the fine-tuning loop, taking the following options (among others) as arguments:
20
+
21
+ - `--checkpoint_dir`: the folder where the pre-trained Optimus checkpoint is saved.
22
+ - `--gloabl_step_eval`: specifies the checkpoint to load (the global step at which Optimus was trained).
23
+ - `--train_data_file` and `--eval_data_file`: the paths to the training and test datasets for the downstream fine-tuning.
24
+ - `--dataset`: the dataset for fine-tuning, such as `Penn`.
25
+ - `--num_train_epochs`: number of training epochs (type=int); default 1.
26
+ - `--dim_target_kl`: the hyper-parameter used for dimension-wise thresholding of the KL term in fine-tuning (type=float); default 0.5.
27
+ - `--beta`: the maximum beta value in the cyclical annealing schedule used in fine-tuning (type=float); default 1.0.
28
+ - `--ratio_zero`: the proportion of one period during which beta = 0 (type=float); default 0.5.
29
+ - `--ratio_increase`: the proportion of one period during which beta increases from 0 to the maximum value (type=float); default 0.25 (a sketch of the schedule follows this list).
30
+
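+ As a rough sketch of how `--beta`, `--ratio_zero` and `--ratio_increase` interact within one period of the cyclical annealing schedule (the function below is illustrative only, not the script's actual implementation):
+
+ ```python
+ def beta_at_step(step, period, beta_max=1.0, ratio_zero=0.5, ratio_increase=0.25):
+     t = (step % period) / period            # position within the current cycle, in [0, 1)
+     if t < ratio_zero:                       # beta held at 0 for the first part of the cycle
+         return 0.0
+     if t < ratio_zero + ratio_increase:      # linear ramp from 0 up to beta_max
+         return beta_max * (t - ratio_zero) / ratio_increase
+     return beta_max                          # beta held at its maximum for the rest of the cycle
+ ```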
31
+
32
+ For more options, please see [`run_lm_vae_training.py`](./examples/big_ae/run_lm_vae_training.py), the examples we provide in [`run_ft_lm_vae_optimus.sh`](./scripts/scripts_local/run_ft_lm_vae_optimus.sh), or [more scripts we used to run the code on a cluster](./scripts/scripts_philly).
33
+
34
+
35
+ ## Play with the latent space
36
+
37
+ sh scripts/scripts_local/eval_optimus_latent_space.sh
38
+
39
+ The main script is [`run_latent_generation.py`](./examples/big_ae/run_latent_generation.py), which evaluates various ways to generate text conditioned on latent vectors, taking the following options (among others) as arguments:
40
+
41
+ - `--play_mode`: The current script supports two ways to play with the pre-trained VAE models: [`reconstrction`, `interpolation`]
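+
+ As a conceptual sketch of how the two modes differ (`encode`/`decode` below are placeholders for the fine-tuned VAE encoder and decoder, not the script's actual functions; the mode strings follow the option values listed above):
+
+ ```python
+ def play(play_mode, encode, decode, sent_a, sent_b=None, steps=10):
+     if play_mode == 'reconstrction':    # encode a sentence, then decode its own latent code
+         return [decode(encode(sent_a))]
+     if play_mode == 'interpolation':    # decode latents along the path between two sentences
+         z_a, z_b = encode(sent_a), encode(sent_b)
+         return [decode(z_a + (z_b - z_a) * t / (steps - 1)) for t in range(steps)]
+ ```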
Optimus/code/app.py ADDED
File without changes
Optimus/code/examples/README.md ADDED
@@ -0,0 +1,392 @@
1
+ # Examples
2
+
3
+ In this section a few examples are put together. All of these examples work for several models, making use of the very
4
+ similar API between the different models.
5
+
6
+ | Section | Description |
7
+ |----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
8
+ | [Language Model fine-tuning](#language-model-fine-tuning) | Fine-tuning the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
9
+ | [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
10
+ | [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
11
+ | [SQuAD](#squad) | Using BERT for question answering, examples with distributed training. |
12
+ | [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
13
+
14
+ ## Language model fine-tuning
15
+
16
+ Based on the script [`run_lm_finetuning.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_lm_finetuning.py).
17
+
18
+ Fine-tuning the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
19
+ to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
20
+ are fine-tuned using a masked language modeling (MLM) loss.
21
+
22
+ Before running the following example, you should get a file that contains text on which the language model will be
23
+ fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
24
+
25
+ We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
26
+ text that will be used for evaluation.
27
+
28
+ ### GPT-2/GPT and causal language modeling
29
+
30
+ The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
31
+ the tokenization). The loss here is that of causal language modeling.
32
+
33
+ ```bash
34
+ export TRAIN_FILE=/path/to/dataset/wiki.train.raw
35
+ export TEST_FILE=/path/to/dataset/wiki.test.raw
36
+
37
+ python run_lm_finetuning.py \
38
+ --output_dir=output \
39
+ --model_type=gpt2 \
40
+ --model_name_or_path=gpt2 \
41
+ --do_train \
42
+ --train_data_file=$TRAIN_FILE \
43
+ --do_eval \
44
+ --eval_data_file=$TEST_FILE
45
+ ```
46
+
47
+ This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
48
+ a score of ~20 perplexity once fine-tuned on the dataset.
49
+
50
+ ### RoBERTa/BERT and masked language modeling
51
+
52
+ The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
53
+ as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
54
+ pre-training: masked language modeling.
55
+
56
+ In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
57
+ slightly slower (over-fitting takes more epochs).
58
+
59
+ We use the `--mlm` flag so that the script may change its loss function.
60
+
61
+ ```bash
62
+ export TRAIN_FILE=/path/to/dataset/wiki.train.raw
63
+ export TEST_FILE=/path/to/dataset/wiki.test.raw
64
+
65
+ python run_lm_finetuning.py \
66
+ --output_dir=output \
67
+ --model_type=roberta \
68
+ --model_name_or_path=roberta-base \
69
+ --do_train \
70
+ --train_data_file=$TRAIN_FILE \
71
+ --do_eval \
72
+ --eval_data_file=$TEST_FILE \
73
+ --mlm
74
+ ```
75
+
76
+ ## Language generation
77
+
78
+ Based on the script [`run_generation.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_generation.py).
79
+
80
+ Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet.
81
+ A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
82
+ can try out the different models available in the library.
83
+
84
+ Example usage:
85
+
86
+ ```bash
87
+ python run_generation.py \
88
+ --model_type=gpt2 \
89
+ --model_name_or_path=gpt2
90
+ ```
91
+
92
+ ## GLUE
93
+
94
+ Based on the script [`run_glue.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py).
95
+
96
+ Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
97
+ Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
98
+
99
+ GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
100
+ uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on 8 V100 GPUs with a total train
101
+ batch size of 24. Some of these tasks have a small dataset and training can lead to high variance in the results
102
+ between different runs. We report the median on 5 runs (with different seeds) for each of the metrics.
103
+
104
+ | Task | Metric | Result |
105
+ |-------|------------------------------|-------------|
106
+ | CoLA | Matthew's corr | 48.87 |
107
+ | SST-2 | Accuracy | 91.74 |
108
+ | MRPC | F1/Accuracy | 90.70/86.27 |
109
+ | STS-B | Pearson/Spearman corr. | 91.39/91.04 |
110
+ | QQP | Accuracy/F1 | 90.79/87.66 |
111
+ | MNLI | Matched acc./Mismatched acc. | 83.70/84.83 |
112
+ | QNLI | Accuracy | 89.31 |
113
+ | RTE | Accuracy | 71.43 |
114
+ | WNLI | Accuracy | 43.66 |
115
+
116
+ Some of these results are significantly different from the ones reported on the test set
117
+ of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
118
+
119
+ Before running anyone of these GLUE tasks you should download the
120
+ [GLUE data](https://gluebenchmark.com/tasks) by running
121
+ [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
122
+ and unpack it to some directory `$GLUE_DIR`.
123
+
124
+ ```bash
125
+ export GLUE_DIR=/path/to/glue
126
+ export TASK_NAME=MRPC
127
+
128
+ python run_glue.py \
129
+ --model_type bert \
130
+ --model_name_or_path bert-base-cased \
131
+ --task_name $TASK_NAME \
132
+ --do_train \
133
+ --do_eval \
134
+ --do_lower_case \
135
+ --data_dir $GLUE_DIR/$TASK_NAME \
136
+ --max_seq_length 128 \
137
+ --per_gpu_train_batch_size 32 \
138
+ --learning_rate 2e-5 \
139
+ --num_train_epochs 3.0 \
140
+ --output_dir /tmp/$TASK_NAME/
141
+ ```
142
+
143
+ where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
144
+
145
+ The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
146
+ In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
147
+ output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
148
+
149
+ The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
150
+ CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
151
+ said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
152
+ since the data processor for each task inherits from the base class DataProcessor.
153
+
154
+ ### MRPC
155
+
156
+ #### Fine-tuning example
157
+
158
+ The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less
159
+ than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.
160
+
161
+ Before running anyone of these GLUE tasks you should download the
162
+ [GLUE data](https://gluebenchmark.com/tasks) by running
163
+ [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
164
+ and unpack it to some directory `$GLUE_DIR`.
165
+
166
+ ```bash
167
+ export GLUE_DIR=/path/to/glue
168
+
169
+ python run_glue.py \
170
+ --model_type bert \
171
+ --model_name_or_path bert-base-cased \
172
+ --task_name MRPC \
173
+ --do_train \
174
+ --do_eval \
175
+ --do_lower_case \
176
+ --data_dir $GLUE_DIR/MRPC/ \
177
+ --max_seq_length 128 \
178
+ --per_gpu_train_batch_size 32 \
179
+ --learning_rate 2e-5 \
180
+ --num_train_epochs 3.0 \
181
+ --output_dir /tmp/mrpc_output/
182
+ ```
183
+
184
+ Our tests, run on a few seeds with [the original implementation hyper-
185
+ parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks), gave evaluation
186
+ results between 84% and 88%.
187
+
188
+ #### Using Apex and mixed-precision
189
+
190
+ Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
191
+ [apex](https://github.com/NVIDIA/apex), then run the following example:
192
+
193
+ ```bash
194
+ export GLUE_DIR=/path/to/glue
195
+
196
+ python run_glue.py \
197
+ --model_type bert \
198
+ --model_name_or_path bert-base-cased \
199
+ --task_name MRPC \
200
+ --do_train \
201
+ --do_eval \
202
+ --do_lower_case \
203
+ --data_dir $GLUE_DIR/MRPC/ \
204
+ --max_seq_length 128 \
205
+ --per_gpu_train_batch_size 32 \
206
+ --learning_rate 2e-5 \
207
+ --num_train_epochs 3.0 \
208
+ --output_dir /tmp/mrpc_output/ \
209
+ --fp16
210
+ ```
211
+
212
+ #### Distributed training
213
+
214
+ Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking model and it
215
+ reaches an F1 > 92 on MRPC.
216
+
217
+ ```bash
218
+ export GLUE_DIR=/path/to/glue
219
+
220
+ python -m torch.distributed.launch \
221
+ --nproc_per_node 8 run_glue.py \
222
+ --model_type bert \
223
+ --model_name_or_path bert-base-cased \
224
+ --task_name MRPC \
225
+ --do_train \
226
+ --do_eval \
227
+ --do_lower_case \
228
+ --data_dir $GLUE_DIR/MRPC/ \
229
+ --max_seq_length 128 \
230
+ --per_gpu_train_batch_size 8 \
231
+ --learning_rate 2e-5 \
232
+ --num_train_epochs 3.0 \
233
+ --output_dir /tmp/mrpc_output/
234
+ ```
235
+
236
+ Training with these hyper-parameters gave us the following results:
237
+
238
+ ```bash
239
+ acc = 0.8823529411764706
240
+ acc_and_f1 = 0.901702786377709
241
+ eval_loss = 0.3418912578906332
242
+ f1 = 0.9210526315789473
243
+ global_step = 174
244
+ loss = 0.07231863956341798
245
+ ```
246
+
247
+ ### MNLI
248
+
249
+ The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
250
+
251
+ ```bash
252
+ export GLUE_DIR=/path/to/glue
253
+
254
+ python -m torch.distributed.launch \
255
+ --nproc_per_node 8 run_glue.py \
256
+ --model_type bert \
257
+ --model_name_or_path bert-base-cased \
258
+ --task_name mnli \
259
+ --do_train \
260
+ --do_eval \
261
+ --do_lower_case \
262
+ --data_dir $GLUE_DIR/MNLI/ \
263
+ --max_seq_length 128 \
264
+ --per_gpu_train_batch_size 8 \
265
+ --learning_rate 2e-5 \
266
+ --num_train_epochs 3.0 \
267
+ --output_dir output_dir \
268
+ ```
269
+
270
+ The results are the following:
271
+
272
+ ```bash
273
+ ***** Eval results *****
274
+ acc = 0.8679706601466992
275
+ eval_loss = 0.4911287787382479
276
+ global_step = 18408
277
+ loss = 0.04755385363816904
278
+
279
+ ***** Eval results *****
280
+ acc = 0.8747965825874695
281
+ eval_loss = 0.45516540421714036
282
+ global_step = 18408
283
+ loss = 0.04755385363816904
284
+ ```
285
+
286
+ ## Multiple Choice
287
+
288
+ Based on the script [`run_multiple_choice.py`]().
289
+
290
+ #### Fine-tuning on SWAG
291
+ Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
292
+
293
+ ```
294
+ #training on 4 tesla V100(16GB) GPUS
295
+ export SWAG_DIR=/path/to/swag_data_dir
296
+ python ./examples/single_model_scripts/run_multiple_choice.py \
297
+ --model_type roberta \
298
+ --task_name swag \
299
+ --model_name_or_path roberta-base \
300
+ --do_train \
301
+ --do_eval \
302
+ --do_lower_case \
303
+ --data_dir $SWAG_DIR \
304
+ --learning_rate 5e-5 \
305
+ --num_train_epochs 3 \
306
+ --max_seq_length 80 \
307
+ --output_dir models_bert/swag_base \
308
+ --per_gpu_eval_batch_size=16 \
309
+ --per_gpu_train_batch_size=16 \
310
+ --gradient_accumulation_steps 2 \
311
+ --overwrite_output
312
+ ```
313
+ Training with the defined hyper-parameters yields the following results:
314
+ ```
315
+ ***** Eval results *****
316
+ eval_acc = 0.8338998300509847
317
+ eval_loss = 0.44457291918821606
318
+ ```
319
+
320
+ ## SQuAD
321
+
322
+ Based on the script [`run_squad.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_squad.py).
323
+
324
+ #### Fine-tuning on SQuAD
325
+
326
+ This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
327
+ on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
328
+ $SQUAD_DIR directory.
329
+
330
+ * [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
331
+ * [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
332
+ * [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
333
+
334
+ ```bash
335
+ export SQUAD_DIR=/path/to/SQUAD
336
+
337
+ python run_squad.py \
338
+ --model_type bert \
339
+ --model_name_or_path bert-base-cased \
340
+ --do_train \
341
+ --do_eval \
342
+ --do_lower_case \
343
+ --train_file $SQUAD_DIR/train-v1.1.json \
344
+ --predict_file $SQUAD_DIR/dev-v1.1.json \
345
+ --per_gpu_train_batch_size 12 \
346
+ --learning_rate 3e-5 \
347
+ --num_train_epochs 2.0 \
348
+ --max_seq_length 384 \
349
+ --doc_stride 128 \
350
+ --output_dir /tmp/debug_squad/
351
+ ```
352
+
353
+ Training with the previously defined hyper-parameters yields the following results:
354
+
355
+ ```bash
356
+ f1 = 88.52
357
+ exact_match = 81.22
358
+ ```
359
+
360
+ #### Distributed training
361
+
362
+
363
+ Here is an example using distributed training on 8 V100 GPUs and the BERT whole-word-masking uncased model to reach an F1 > 93 on SQuAD:
364
+
365
+ ```bash
366
+ python -m torch.distributed.launch --nproc_per_node=8 run_squad.py \
367
+ --model_type bert \
368
+ --model_name_or_path bert-base-cased \
369
+ --do_train \
370
+ --do_eval \
371
+ --do_lower_case \
372
+ --train_file $SQUAD_DIR/train-v1.1.json \
373
+ --predict_file $SQUAD_DIR/dev-v1.1.json \
374
+ --learning_rate 3e-5 \
375
+ --num_train_epochs 2 \
376
+ --max_seq_length 384 \
377
+ --doc_stride 128 \
378
+ --output_dir ../models/wwm_uncased_finetuned_squad/ \
379
+ --per_gpu_train_batch_size 24 \
380
+ --gradient_accumulation_steps 12
381
+ ```
382
+
383
+ Training with the previously defined hyper-parameters yields the following results:
384
+
385
+ ```bash
386
+ f1 = 93.15
387
+ exact_match = 86.91
388
+ ```
389
+
390
+ This fine-tuned model is available as a checkpoint under the reference
391
+ `bert-large-uncased-whole-word-masking-finetuned-squad`.
392
+
Optimus/code/examples/__pycache__/utils_glue.cpython-37.pyc ADDED
Binary file (21.5 kB).
 
Optimus/code/examples/big_ae/__pycache__/grad_app.cpython-310.pyc ADDED
Binary file (14 kB).
 
Optimus/code/examples/big_ae/__pycache__/utils.cpython-37.pyc ADDED
Binary file (40.3 kB).
 
Optimus/code/examples/big_ae/debug_data.py ADDED
@@ -0,0 +1,6 @@
1
+ import torch
2
+ import os
3
+
4
+ output_dir = "../output/philly_rr1_vae_wikipedia_pretraining_2nd_file"
5
+
6
+ data = torch.load(os.path.join(output_dir, 'batch_debug_6621.pt'))
Optimus/code/examples/big_ae/eval_dialog_multi_response.py ADDED
@@ -0,0 +1,378 @@
1
+ import numpy as np
2
+ import torch
3
+ import torch.nn.functional as F
4
+ from nltk.translate.bleu_score import sentence_bleu
5
+ from nltk.translate.bleu_score import SmoothingFunction
6
+ from sklearn.metrics.pairwise import cosine_similarity as cosine
7
+ from collections import Counter
8
+ import os, pickle, pdb
9
+
10
+ class Metrics:
11
+ # based on https://raw.githubusercontent.com/guxd/DialogWAE/29f206af05bfe5fe28fec4448e208310a7c9258d/experiments/metrics.py
12
+
13
+ def __init__(self, path_word2vec='../data/datasets/dailydialog_data/glove.twitter.27B.200d.txt'):
14
+ """
15
+ :param word2vec - a numpy array of word2vec with shape [vocab_size x emb_size]
16
+ """
17
+ super(Metrics, self).__init__()
18
+ self.load_word2vec(path_word2vec)
19
+ #self.word2vec = dict()
20
+
21
+ def load_word2vec(self, path_word2vec):
22
+ path_pkl = path_word2vec + '.pkl'
23
+ if os.path.exists(path_pkl):
24
+ print('loading word2vec from '+path_pkl)
25
+ self.word2vec = pickle.load(open(path_pkl, 'rb'))
26
+ else:
27
+ self.word2vec = dict()
28
+ for i, line in enumerate(open(path_word2vec, encoding='utf-8')):
29
+ ss = line.strip('\n').split()
30
+ self.word2vec[ss[0]] = [float(v) for v in ss[1:]]
31
+ if i % 1e4 == 0:
32
+ print('processed %ik word2vec'%(i/1e3))
33
+ print('dumping word2vec to '+path_pkl)
34
+ pickle.dump(self.word2vec, open(path_pkl, 'wb'))
35
+ self.embed_dim = len(list(self.word2vec.values())[0])
36
+ print('loaded %i word2vec of dim %i'%(len(self.word2vec), self.embed_dim))
37
+
38
+ def embedding(self, seqs):
39
+ # note: different from original implementation
40
+ batch_size, seqlen = seqs.shape
41
+ embs = np.zeros([batch_size, seqlen, self.embed_dim])
42
+ for i in range(batch_size):
43
+ for j in range(seqlen):
44
+ w = seqs[i,j]
45
+ if w != '' and w in self.word2vec:
46
+ embs[i, j, :] = self.word2vec[w]
47
+ return embs
48
+
49
+
50
+ def extrema(self, embs, lens): # embs: [batch_size x seq_len x emb_size] lens: [batch_size]
51
+ """
52
+ computes the value of every single dimension in the word vectors which has the greatest
53
+ difference from zero.
54
+ :param seq: sequence
55
+ :param seqlen: length of sequence
56
+ """
57
+ # Find minimum and maximum value for every dimension in predictions
58
+ batch_size, seq_len, emb_size = embs.shape
59
+ max_mask = np.zeros((batch_size, seq_len, emb_size), dtype=int)
60
+ for i,length in enumerate(lens):
61
+ max_mask[i,:length,:]=1
62
+ min_mask = 1-max_mask
63
+ seq_max = (embs*max_mask).max(1) # [batch_sz x emb_sz]
64
+ seq_min = (embs+min_mask).min(1)
65
+ # Find the maximum absolute value in min and max data
66
+ comp_mask = seq_max >= np.abs(seq_min)# [batch_sz x emb_sz]
67
+ # Add vectors for finding final sequence representation for predictions
68
+ extrema_emb = seq_max* comp_mask + seq_min* np.logical_not(comp_mask)
69
+ return extrema_emb
70
+
71
+ def mean(self, embs, lens):
72
+ batch_size, seq_len, emb_size=embs.shape
73
+ mask = np.zeros((batch_size, seq_len, emb_size), dtype=int)
74
+ for i,length in enumerate(lens):
75
+ mask[i,:length,:]=1
76
+ return (embs*mask).sum(1)/(mask.sum(1)+1e-8)
77
+
78
+ def sim_bleu(self, hyps, ref):
79
+ """
80
+ :param ref - a list of tokens of the reference
81
+ :param hyps - a list of tokens of the hypothesis
82
+
83
+ :return maxbleu - recall bleu
84
+ :return avgbleu - precision bleu
85
+ """
86
+ scores = []
87
+ for hyp in hyps:
88
+ try:
89
+ scores.append(sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method7,
90
+ weights=[1./3, 1./3, 1./3]))
91
+ except:
92
+ scores.append(0.0)
93
+ return np.max(scores), np.mean(scores)
94
+
95
+
96
+ def sim_bow(self, pred, pred_lens, ref, ref_lens):
97
+ """
98
+ :param pred - ndarray [batch_size x seqlen]
99
+ :param pred_lens - list of integers
100
+ :param ref - ndarray [batch_size x seqlen]
101
+ """
102
+ # look up word embeddings for prediction and reference
103
+ emb_pred = self.embedding(pred) # [batch_sz x seqlen1 x emb_sz]
104
+ emb_ref = self.embedding(ref) # [batch_sz x seqlen2 x emb_sz]
105
+
106
+ ext_emb_pred=self.extrema(emb_pred, pred_lens)
107
+ ext_emb_ref=self.extrema(emb_ref, ref_lens)
108
+ bow_extrema=cosine(ext_emb_pred, ext_emb_ref) # [batch_sz_pred x batch_sz_ref]
109
+
110
+ avg_emb_pred = self.mean(emb_pred, pred_lens) # Calculate mean over seq
111
+ avg_emb_ref = self.mean(emb_ref, ref_lens)
112
+ bow_avg = cosine(avg_emb_pred, avg_emb_ref) # [batch_sz_pred x batch_sz_ref]
113
+
114
+
115
+ batch_pred, seqlen_pred, emb_size=emb_pred.shape
116
+ batch_ref, seqlen_ref, emb_size=emb_ref.shape
117
+ cos_sim = cosine(emb_pred.reshape((-1, emb_size)), emb_ref.reshape((-1, emb_size))) # [(batch_sz*seqlen1)x(batch_sz*seqlen2)]
118
+ cos_sim = cos_sim.reshape((batch_pred, seqlen_pred, batch_ref, seqlen_ref))
119
+ # Find words with max cosine similarity
120
+ max12 = cos_sim.max(1).mean(2) # max over seqlen_pred
121
+ max21 = cos_sim.max(3).mean(1) # max over seqlen_ref
122
+ bow_greedy=(max12+max21)/2 # [batch_pred x batch_ref(1)]
123
+ return np.max(bow_extrema), np.max(bow_avg), np.max(bow_greedy)
124
+
125
+ def div_distinct(self, seqs, seq_lens):
126
+ """
127
+ distinct-1 distinct-2 metrics for diversity measure proposed
128
+ by Li et al. "A Diversity-Promoting Objective Function for Neural Conversation Models"
129
+ we counted numbers of distinct unigrams and bigrams in the generated responses
130
+ and divide the numbers by total number of unigrams and bigrams.
131
+ The two metrics measure how informative and diverse the generated responses are.
132
+ High numbers and high ratios mean that there is much content in the generated responses,
133
+ and high numbers further indicate that the generated responses are long
134
+ """
135
+ batch_size = seqs.shape[0]
136
+ intra_dist1, intra_dist2=np.zeros(batch_size), np.zeros(batch_size)
137
+
138
+ n_unigrams, n_bigrams, n_unigrams_total , n_bigrams_total = 0. ,0., 0., 0.
139
+ unigrams_all, bigrams_all = Counter(), Counter()
140
+ for b in range(batch_size):
141
+ unigrams= Counter([tuple(seqs[b,i:i+1]) for i in range(seq_lens[b])])
142
+ bigrams = Counter([tuple(seqs[b,i:i+2]) for i in range(seq_lens[b]-1)])
143
+ intra_dist1[b]=(len(unigrams.items())+1e-12)/(seq_lens[b]+1e-5)
144
+ intra_dist2[b]=(len(bigrams.items())+1e-12)/(max(0, seq_lens[b]-1)+1e-5)
145
+
146
+ unigrams_all.update([tuple(seqs[b,i:i+1]) for i in range(seq_lens[b])])
147
+ bigrams_all.update([tuple(seqs[b,i:i+2]) for i in range(seq_lens[b]-1)])
148
+ n_unigrams_total += seq_lens[b]
149
+ n_bigrams_total += max(0, seq_lens[b]-1)
150
+
151
+ inter_dist1 = (len(unigrams_all.items())+1e-12)/(n_unigrams_total+1e-5)
152
+ inter_dist2 = (len(bigrams_all.items())+1e-12)/(n_bigrams_total+1e-5)
153
+ return intra_dist1, intra_dist2, inter_dist1, inter_dist2
154
+
155
+ import pdb
156
+
157
+ def eval_multi_ref(path, path_multi_ref=None):
158
+ """
159
+ based on: https://github.com/guxd/DialogWAE/blob/29f206af05bfe5fe28fec4448e208310a7c9258d/sample.py
160
+ path: each line is '\t'.join([src, ref, hyp])
161
+ path_multi_ref: each line is '\t'.join([src, hyp])
162
+ the order of unique src appeared in `path_multi_ref` should be the same as that in `path`
163
+ """
164
+ metrics = Metrics()
165
+ d_ref = dict()
166
+ d_hyp = dict()
167
+ src2ix = dict()
168
+ ix2src = dict()
169
+ ix = 0
170
+ for line in open(path, encoding='utf-8'):
171
+ line = line.strip('\n').strip()
172
+ if len(line) == 0:
173
+ continue
174
+
175
+ # pdb.set_trace()
176
+ src, ref, hyp = line.split('\t')
177
+ #src, ref = line.split('\t'); hyp = ref
178
+ src = src.replace(' EOS ',' [SEP] ').strip()
179
+ ref = ref.strip().split()
180
+ hyp = hyp.strip().split()
181
+ if src not in d_ref:
182
+ d_ref[src] = ref
183
+ d_hyp[src] = [hyp]
184
+ src2ix[src] = ix
185
+ ix2src[ix] = src
186
+ ix += 1
187
+ else:
188
+ d_hyp[src].append(hyp)
189
+ print('loaded %i src-ref-hyp tuples'%(len(d_ref)))
190
+
191
+ def chr_only(s):
192
+ ret = ''
193
+ for c in s:
194
+ if c.isalpha():
195
+ ret += c
196
+ return ret
197
+
198
+ if path_multi_ref is not None:
199
+ set_src4multiref = set()
200
+ ix = -1
201
+ d_multi_ref = dict()
202
+ for line in open(path_multi_ref, encoding='utf-8'):
203
+ line = line.strip('\n').strip()
204
+ if len(line) == 0:
205
+ continue
206
+ src4multiref, ref = line.split('\t')[:2]
207
+ src4multiref = src4multiref.replace(' EOS ', ' ').replace(' [SEP] ',' ').strip()
208
+ ref = ref.strip().split()
209
+ if src4multiref not in set_src4multiref:
210
+ set_src4multiref.add(src4multiref)
211
+ ix += 1
212
+ src = ix2src[ix]
213
+ id_hyp = chr_only(src)
214
+ id_multiref = chr_only(src4multiref)
215
+ if id_multiref != id_hyp:
216
+ print('[ERROR] cannot match src4multiref and src4hyp')
217
+ print('src4multiref:', src4multiref)
218
+ print('src4hyp:', ix2src[ix])
219
+ # pdb.set_trace()
220
+ raise ValueError
221
+ d_multi_ref[src] = [ref]
222
+ else:
223
+ d_multi_ref[src].append(ref)
224
+
225
+ n_ref = [len(d_multi_ref[k]) for k in d_multi_ref]
226
+ print('loaded %i src with multi-ref, avg n_ref = %.3f'%(len(d_multi_ref), np.mean(n_ref)))
227
+
228
+ n_miss = 0
229
+ for src in d_ref:
230
+ if src not in d_multi_ref:
231
+ n_miss += 1
232
+ print('[WARNING] cannot find multiref for src: '+src)
233
+ d_multi_ref[src] = [d_ref[src]]
234
+ if n_miss > 5:
235
+ raise ValueError
236
+
237
+ n = len(d_ref)
238
+ print(path)
239
+ print('n_src\t%i'%n)
240
+
241
+ avg_lens = 0
242
+ maxbleu = 0
243
+ avgbleu = 0
244
+ intra_dist1, intra_dist2, inter_dist1, inter_dist2 = 0,0,0,0
245
+ bow_extrema, bow_avg, bow_greedy = 0,0,0
246
+ for src in d_ref:
247
+
248
+ # BLEU ----
249
+
250
+ if path_multi_ref is None:
251
+ m, a = metrics.sim_bleu(d_hyp[src], d_ref[src])
252
+ else:
253
+ n_ref = len(d_multi_ref[src])
254
+ m, a = 0, 0
255
+ for ref in d_multi_ref[src]:
256
+ _m, _a = metrics.sim_bleu(d_hyp[src], ref)
257
+ m += _m
258
+ a += _a
259
+ m /= n_ref
260
+ a /= n_ref
261
+
262
+ maxbleu += m
263
+ avgbleu += a
264
+
265
+ # diversity ----
266
+
267
+ seq_len = [len(hyp) for hyp in d_hyp[src]]
268
+ max_len = max(seq_len)
269
+ seqs = []
270
+ for hyp in d_hyp[src]:
271
+ padded = hyp + [''] * (max_len - len(hyp))
272
+ seqs.append(np.reshape(padded, [1, -1]))
273
+ seqs = np.concatenate(seqs, axis=0)
274
+ intra1, intra2, inter1, inter2 = metrics.div_distinct(seqs, seq_len)
275
+ intra_dist1 += np.mean(intra1)
276
+ intra_dist2 += np.mean(intra2)
277
+ inter_dist1 += inter1
278
+ inter_dist2 += inter2
279
+
280
+ avg_lens += np.mean(seq_len)
281
+
282
+ # BOW ----
283
+
284
+ def calc_bow(ref):
285
+ n_hyp = len(d_hyp[src])
286
+ seqs_ref = np.concatenate([np.reshape(ref, [1,-1])] * n_hyp, axis=0)
287
+ seq_len_ref = [len(ref)] * n_hyp
288
+ return metrics.sim_bow(seqs, seq_len, seqs_ref, seq_len_ref)
289
+
290
+ if path_multi_ref is None:
291
+ extrema, avg, greedy = calc_bow(d_ref[src])
292
+ else:
293
+ extrema, avg, greedy = 0, 0, 0
294
+ for ref in d_multi_ref[src]:
295
+ e, a, g = calc_bow(ref)
296
+ extrema += e
297
+ avg += a
298
+ greedy += g
299
+ extrema /= n_ref
300
+ avg /= n_ref
301
+ greedy /= n_ref
302
+
303
+ bow_extrema += extrema
304
+ bow_avg += avg
305
+ bow_greedy += greedy
306
+
307
+ recall_bleu = maxbleu/n
308
+ prec_bleu = avgbleu/n
309
+ f1 = 2*(prec_bleu*recall_bleu) / (prec_bleu+recall_bleu+10e-12)
310
+
311
+ print('BLEU')
312
+ print(' R\t%.3f'%recall_bleu)
313
+ print(' P\t%.3f'%prec_bleu)
314
+ print(' F1\t%.3f'%f1)
315
+ print('BOW')
316
+ print(' A\t%.3f'%(bow_avg/n))
317
+ print(' E\t%.3f'%(bow_extrema/n))
318
+ print(' G\t%.3f'%(bow_greedy/n))
319
+ print('intra_dist')
320
+ print(' 1\t%.3f'%(intra_dist1/n))
321
+ print(' 2\t%.3f'%(intra_dist2/n))
322
+ print('inter_dist')
323
+ print(' 1\t%.3f'%(inter_dist1/n))
324
+ print(' 2\t%.3f'%(inter_dist2/n))
325
+ print('avg_L\t%.1f'%(avg_lens/n))
326
+
327
+ results = {
328
+ "BLEU_R": recall_bleu, "BLEU_P": prec_bleu, "BLEU_F1": f1, "BOW_A": bow_avg/n, "BOW_E": bow_extrema/n, "BOW_G": bow_greedy/n, "intra_dist1": intra_dist1/n, "intra_dist2": intra_dist2/n, "inter_dist1": inter_dist1/n, "inter_dist2": inter_dist2/n, "avg_L": avg_lens/n
329
+ }
330
+
331
+ return results
332
+
333
+
334
+ def create_rand_baseline():
335
+ path = 'data/datasets/dailydialog_data/test.txt'
336
+ srcs = []
337
+ refs = []
338
+ for line in open(path, encoding='utf-8'):
339
+ src, ref = line.strip('\n').split('\t')
340
+ srcs.append(src.strip())
341
+ refs.append(ref.strip())
342
+
343
+ hyps = set()
344
+ path = 'data/datasets/dailydialog_data/train.txt'
345
+ for line in open(path, encoding='utf-8'):
346
+ _, ref = line.strip('\n').split('\t')
347
+ hyps.add(ref)
348
+ if len(hyps) == len(srcs) *10:
349
+ print('collected training ref')
350
+ break
351
+
352
+ hyps = list(hyps)
353
+ lines = []
354
+ j = 0
355
+ for i in range(len(srcs)):
356
+ lines += ['\t'.join([srcs[i], refs[i], hyp]) for hyp in hyps[j:j+10]]
357
+ j = j + 10
358
+ with open('out/rand.tsv', 'w', encoding='utf-8') as f:
359
+ f.write('\n'.join(lines))
360
+
361
+
362
+ def create_human_baseline():
363
+ path = 'data/datasets/dailydialog_data/test.txt'
364
+ lines = []
365
+ for line in open(path, encoding='utf-8'):
366
+ src, ref = line.strip('\n').split('\t')
367
+ src = src.strip()
368
+ ref = ref.strip()
369
+ lines.append('\t'.join([src, ref, ref]))
370
+
371
+ with open('out/human.tsv', 'w', encoding='utf-8') as f:
372
+ f.write('\n'.join(lines))
373
+
374
+
375
+ if __name__ == "__main__":
376
+ path = 'D:/data/switchboard/test.txt.1ref'
377
+ path_multi_ref = 'D:/data/switchboard/test.txt'
378
+ eval_multi_ref(path_multi_ref, path)
Optimus/code/examples/big_ae/eval_dialog_response.py ADDED
@@ -0,0 +1,295 @@
1
+ import numpy as np
2
+ import torch
3
+ import torch.nn.functional as F
4
+ from nltk.translate.bleu_score import sentence_bleu
5
+ from nltk.translate.bleu_score import SmoothingFunction
6
+ from sklearn.metrics.pairwise import cosine_similarity as cosine
7
+ from collections import Counter
8
+ import os, pickle
9
+
10
+ class Metrics:
11
+ # based on https://raw.githubusercontent.com/guxd/DialogWAE/29f206af05bfe5fe28fec4448e208310a7c9258d/experiments/metrics.py
12
+
13
+ def __init__(self, path_word2vec='../data/datasets/dailydialog_data/glove.twitter.27B.200d.txt'):
14
+ """
15
+ :param word2vec - a numpy array of word2vec with shape [vocab_size x emb_size]
16
+ """
17
+ self.path_word2vec = path_word2vec
18
+ super(Metrics, self).__init__()
19
+ self.load_word2vec(path_word2vec)
20
+
21
+ def load_word2vec(self, path_word2vec):
22
+ path_pkl = path_word2vec + '.pkl'
23
+ if os.path.exists(path_pkl):
24
+ print('loading word2vec from '+path_pkl)
25
+ self.word2vec = pickle.load(open(path_pkl, 'rb'))
26
+ else:
27
+ self.word2vec = dict()
28
+ for i, line in enumerate(open(path_word2vec, encoding='utf-8')):
29
+ ss = line.strip('\n').split()
30
+ self.word2vec[ss[0]] = [float(v) for v in ss[1:]]
31
+ if i % 1e4 == 0:
32
+ print('processed %ik word2vec'%(i/1e3))
33
+ print('dumping word2vec to '+path_pkl)
34
+ pickle.dump(self.word2vec, open(path_pkl, 'wb'))
35
+ # pdb.set_trace()
36
+ self.embed_dim = len(self.word2vec["."]) # len(self.word2vec.values()[0])
37
+ print('loaded %i word2vec of dim %i'%(len(self.word2vec), self.embed_dim))
38
+
39
+ def embedding(self, seqs):
40
+ # note: different from original implementation
41
+ batch_size, seqlen = seqs.shape
42
+ embs = np.zeros([batch_size, seqlen, self.embed_dim])
43
+ for i in range(batch_size):
44
+ for j in range(seqlen):
45
+ w = seqs[i,j]
46
+ if w != '' and w in self.word2vec:
47
+ embs[i, j, :] = self.word2vec[w]
48
+ return embs
49
+
50
+
51
+ def extrema(self, embs, lens): # embs: [batch_size x seq_len x emb_size] lens: [batch_size]
52
+ """
53
+ computes the value of every single dimension in the word vectors which has the greatest
54
+ difference from zero.
55
+ :param seq: sequence
56
+ :param seqlen: length of sequence
57
+ """
58
+ # Find minimum and maximum value for every dimension in predictions
59
+ batch_size, seq_len, emb_size = embs.shape
60
+ max_mask = np.zeros((batch_size, seq_len, emb_size), dtype=int)
61
+ for i,length in enumerate(lens):
62
+ max_mask[i,:length,:]=1
63
+ min_mask = 1-max_mask
64
+ seq_max = (embs*max_mask).max(1) # [batch_sz x emb_sz]
65
+ seq_min = (embs+min_mask).min(1)
66
+ # Find the maximum absolute value in min and max data
67
+ comp_mask = seq_max >= np.abs(seq_min)# [batch_sz x emb_sz]
68
+ # Add vectors for finding final sequence representation for predictions
69
+ extrema_emb = seq_max* comp_mask + seq_min* np.logical_not(comp_mask)
70
+ return extrema_emb
71
+
72
+ def mean(self, embs, lens):
73
+ batch_size, seq_len, emb_size=embs.shape
74
+ mask = np.zeros((batch_size, seq_len, emb_size), dtype=int)
75
+ for i,length in enumerate(lens):
76
+ mask[i,:length,:]=1
77
+ return (embs*mask).sum(1)/(mask.sum(1)+1e-8)
78
+
79
+ def sim_bleu(self, hyps, ref):
80
+ """
81
+ :param ref - a list of tokens of the reference
82
+ :param hyps - a list of tokens of the hypothesis
83
+
84
+ :return maxbleu - recall bleu
85
+ :return avgbleu - precision bleu
86
+ """
87
+ scores = []
88
+ for hyp in hyps:
89
+ try:
90
+ scores.append(sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method7,
91
+ weights=[1./3, 1./3, 1./3]))
92
+ except:
93
+ scores.append(0.0)
94
+ return np.max(scores), np.mean(scores)
95
+
96
+
97
+ def sim_bow(self, pred, pred_lens, ref, ref_lens):
98
+ """
99
+ :param pred - ndarray [batch_size x seqlen]
100
+ :param pred_lens - list of integers
101
+ :param ref - ndarray [batch_size x seqlen]
102
+ """
103
+ # look up word embeddings for prediction and reference
104
+ emb_pred = self.embedding(pred) # [batch_sz x seqlen1 x emb_sz]
105
+ emb_ref = self.embedding(ref) # [batch_sz x seqlen2 x emb_sz]
106
+
107
+ ext_emb_pred=self.extrema(emb_pred, pred_lens)
108
+ ext_emb_ref=self.extrema(emb_ref, ref_lens)
109
+ bow_extrema=cosine(ext_emb_pred, ext_emb_ref) # [batch_sz_pred x batch_sz_ref]
110
+
111
+ avg_emb_pred = self.mean(emb_pred, pred_lens) # Calculate mean over seq
112
+ avg_emb_ref = self.mean(emb_ref, ref_lens)
113
+ bow_avg = cosine(avg_emb_pred, avg_emb_ref) # [batch_sz_pred x batch_sz_ref]
114
+
115
+
116
+ batch_pred, seqlen_pred, emb_size=emb_pred.shape
117
+ batch_ref, seqlen_ref, emb_size=emb_ref.shape
118
+ cos_sim = cosine(emb_pred.reshape((-1, emb_size)), emb_ref.reshape((-1, emb_size))) # [(batch_sz*seqlen1)x(batch_sz*seqlen2)]
119
+ cos_sim = cos_sim.reshape((batch_pred, seqlen_pred, batch_ref, seqlen_ref))
120
+ # Find words with max cosine similarity
121
+ max12 = cos_sim.max(1).mean(2) # max over seqlen_pred
122
+ max21 = cos_sim.max(3).mean(1) # max over seqlen_ref
123
+ bow_greedy=(max12+max21)/2 # [batch_pred x batch_ref(1)]
124
+ return np.max(bow_extrema), np.max(bow_avg), np.max(bow_greedy)
125
+
126
+ def div_distinct(self, seqs, seq_lens):
127
+ """
128
+ distinct-1 distinct-2 metrics for diversity measure proposed
129
+ by Li et al. "A Diversity-Promoting Objective Function for Neural Conversation Models"
130
+ we counted numbers of distinct unigrams and bigrams in the generated responses
131
+ and divide the numbers by total number of unigrams and bigrams.
132
+ The two metrics measure how informative and diverse the generated responses are.
133
+ High numbers and high ratios mean that there is much content in the generated responses,
134
+ and high numbers further indicate that the generated responses are long
135
+ """
136
+ batch_size = seqs.shape[0]
137
+ intra_dist1, intra_dist2=np.zeros(batch_size), np.zeros(batch_size)
138
+
139
+ n_unigrams, n_bigrams, n_unigrams_total , n_bigrams_total = 0. ,0., 0., 0.
140
+ unigrams_all, bigrams_all = Counter(), Counter()
141
+ for b in range(batch_size):
142
+ unigrams= Counter([tuple(seqs[b,i:i+1]) for i in range(seq_lens[b])])
143
+ bigrams = Counter([tuple(seqs[b,i:i+2]) for i in range(seq_lens[b]-1)])
144
+ intra_dist1[b]=(len(unigrams.items())+1e-12)/(seq_lens[b]+1e-5)
145
+ intra_dist2[b]=(len(bigrams.items())+1e-12)/(max(0, seq_lens[b]-1)+1e-5)
146
+
147
+ unigrams_all.update([tuple(seqs[b,i:i+1]) for i in range(seq_lens[b])])
148
+ bigrams_all.update([tuple(seqs[b,i:i+2]) for i in range(seq_lens[b]-1)])
149
+ n_unigrams_total += seq_lens[b]
150
+ n_bigrams_total += max(0, seq_lens[b]-1)
151
+
152
+ inter_dist1 = (len(unigrams_all.items())+1e-12)/(n_unigrams_total+1e-5)
153
+ inter_dist2 = (len(bigrams_all.items())+1e-12)/(n_bigrams_total+1e-5)
154
+ return intra_dist1, intra_dist2, inter_dist1, inter_dist2
155
+
156
+ import pdb
157
+
158
+ def eval_dialog_response(generated_text_file_path):
159
+ """
160
+ based on: https://github.com/guxd/DialogWAE/blob/29f206af05bfe5fe28fec4448e208310a7c9258d/sample.py
161
+ quoted from the DialogWAE paper: https://arxiv.org/pdf/1805.12352.pdf
162
+ * "For each test context, we sample 10 responses from the models and compute their BLEU scores"
163
+ * "We use Glove vectors" "For each test context, we report the maximum BOW embedding score among the 10 sampled responses."
164
+ * "intra-dist as the average of distinct values within each sampled response"
165
+ " "inter-dist as the distinct value among all sampled responses."
166
+ """
167
+ metrics = Metrics()
168
+ d_ref = dict()
169
+ d_hyp = dict()
170
+ for line in open(generated_text_file_path, encoding='utf-8'):
171
+ line = line.strip('\n').strip()
172
+ if len(line) == 0:
173
+ continue
174
+ src, ref, hyp = line.split('\t')
175
+ src = src.strip()
176
+ ref = ref.strip().split()
177
+ hyp = hyp.strip().split()
178
+ if src not in d_ref:
179
+ d_ref[src] = ref
180
+ d_hyp[src] = [hyp]
181
+ else:
182
+ d_hyp[src].append(hyp)
183
+
184
+ n = len(d_ref)
185
+ print(generated_text_file_path)
186
+ print('n_src\t%i'%n)
187
+
188
+ avg_lens = 0
189
+ maxbleu = 0
190
+ avgbleu = 0
191
+ intra_dist1, intra_dist2, inter_dist1, inter_dist2 = 0,0,0,0
192
+ bow_extrema, bow_avg, bow_greedy = 0,0,0
193
+ for src in d_ref:
194
+ m, a = metrics.sim_bleu(d_hyp[src], d_ref[src])
195
+ maxbleu += m
196
+ avgbleu += a
197
+
198
+ seq_len = [len(hyp) for hyp in d_hyp[src]]
199
+ max_len = max(seq_len)
200
+ seqs = []
201
+ for hyp in d_hyp[src]:
202
+ padded = hyp + [''] * (max_len - len(hyp))
203
+ seqs.append(np.reshape(padded, [1, -1]))
204
+ seqs = np.concatenate(seqs, axis=0)
205
+ intra1, intra2, inter1, inter2 = metrics.div_distinct(seqs, seq_len)
206
+ intra_dist1 += np.mean(intra1)
207
+ intra_dist2 += np.mean(intra2)
208
+ inter_dist1 += inter1
209
+ inter_dist2 += inter2
210
+
211
+ n_hyp = len(d_hyp[src])
212
+ seqs_ref = np.concatenate([np.reshape(d_ref[src], [1,-1])] * n_hyp, axis=0)
213
+ seq_len_ref = [len(d_ref[src])] * n_hyp
214
+ if metrics.word2vec is not None:
215
+ extrema, avg, greedy = metrics.sim_bow(seqs, seq_len, seqs_ref, seq_len_ref)
216
+ bow_extrema += extrema
217
+ bow_avg += avg
218
+ bow_greedy += greedy
219
+
220
+ avg_lens += np.mean(seq_len)
221
+
222
+ recall_bleu = maxbleu/n
223
+ prec_bleu = avgbleu/n
224
+ f1 = 2*(prec_bleu*recall_bleu) / (prec_bleu+recall_bleu+10e-12)
225
+
226
+ print('BLEU')
227
+ print(' R\t%.3f'%recall_bleu)
228
+ print(' P\t%.3f'%prec_bleu)
229
+ print(' F1\t%.3f'%f1)
230
+ print('BOW')
231
+ print(' A\t%.3f'%(bow_avg/n))
232
+ print(' E\t%.3f'%(bow_extrema/n))
233
+ print(' G\t%.3f'%(bow_greedy/n))
234
+ print('intra_dist')
235
+ print(' 1\t%.3f'%(intra_dist1/n))
236
+ print(' 2\t%.3f'%(intra_dist2/n))
237
+ print('inter_dist')
238
+ print(' 1\t%.3f'%(inter_dist1/n))
239
+ print(' 2\t%.3f'%(inter_dist2/n))
240
+ print('avg_L\t%.1f'%(avg_lens/n))
241
+
242
+ results = {
243
+ "BLEU_R": recall_bleu, "BLEU_P": prec_bleu, "BLEU_F1": f1, "BOW_A": bow_avg/n, "BOW_E": bow_extrema/n, "BOW_G": bow_greedy/n, "intra_dist1": intra_dist1/n, "intra_dist2": intra_dist2/n, "inter_dist1": inter_dist1/n, "inter_dist2": inter_dist2/n, "avg_L": avg_lens/n
244
+ }
245
+
246
+ return results
247
+
248
+
249
+
250
+ def create_rand_baseline():
251
+ path = 'data/datasets/dailydialog_data/test.txt'
252
+ srcs = []
253
+ refs = []
254
+ for line in open(path, encoding='utf-8'):
255
+ src, ref = line.strip('\n').split('\t')
256
+ srcs.append(src.strip())
257
+ refs.append(ref.strip())
258
+
259
+ hyps = set()
260
+ path = 'data/datasets/dailydialog_data/train.txt'
261
+ for line in open(path, encoding='utf-8'):
262
+ _, ref = line.strip('\n').split('\t')
263
+ hyps.add(ref)
264
+ if len(hyps) == len(srcs) *10:
265
+ print('collected training ref')
266
+ break
267
+
268
+ hyps = list(hyps)
269
+ lines = []
270
+ j = 0
271
+ for i in range(len(srcs)):
272
+ lines += ['\t'.join([srcs[i], refs[i], hyp]) for hyp in hyps[j:j+10]]
273
+ j = j + 10
274
+ with open('out/rand.tsv', 'w', encoding='utf-8') as f:
275
+ f.write('\n'.join(lines))
276
+
277
+
278
+ def create_human_baseline():
279
+ path = 'data/datasets/dailydialog_data/test.txt'
280
+ lines = []
281
+ for line in open(path, encoding='utf-8'):
282
+ src, ref = line.strip('\n').split('\t')
283
+ src = src.strip()
284
+ ref = ref.strip()
285
+ lines.append('\t'.join([src, ref, ref]))
286
+
287
+ with open('out/human.tsv', 'w', encoding='utf-8') as f:
288
+ f.write('\n'.join(lines))
289
+
290
+
291
+ if __name__ == "__main__":
292
+ #create_rand_baseline()
293
+ #create_human_baseline()
294
+ eval_dialog_response('out/eval_text_generation_results (1).txt')
295
+ #eval('out/rand.tsv')
Optimus/code/examples/big_ae/grad_app.py ADDED
@@ -0,0 +1,486 @@
1
+ # -*- coding: utf-8 -*-
2
+ """message_bottle.ipynb
3
+
4
+ Automatically generated by Colab.
5
+
6
+ Original file is located at
7
+ https://colab.research.google.com/drive/1I47sLakpuwERGzn-XoNct67mwiDS1mQD
8
+ """
9
+
10
+ import matplotlib.pyplot as plt
11
+ import matplotlib
12
+
13
+ import argparse
14
+ import glob
15
+ import logging
16
+ import os
17
+ import pickle
18
+ import random
19
+
20
+
21
+ import torch
22
+ import torch.nn.functional as F
23
+ import numpy as np
24
+
25
+ from tqdm import tqdm, trange
26
+ from types import SimpleNamespace
27
+
28
+ import sys
29
+ sys.path.append('/home/ryn_mote/Misc/generative_recommender/text_space/Optimus/code/examples/big_ae/')
30
+ sys.path.append('/home/ryn_mote/Misc/generative_recommender/text_space/Optimus/code/')
31
+ from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig
32
+ from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector
33
+ from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
34
+ from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer
35
+ from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
36
+ from pytorch_transformers import BertForLatentConnector, BertTokenizer
37
+
38
+ from modules import VAE
39
+
40
+ import torch
41
+ import torch.nn as nn
42
+ import torch.nn.functional as F
43
+ torch.set_float32_matmul_precision('high')
44
+
45
+ from tqdm import tqdm
46
+
47
+ ################################################
48
+
49
+
50
+
51
+ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
52
+ """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
53
+ Args:
54
+ logits: logits distribution shape (vocabulary size)
55
+ top_k > 0: keep only top k tokens with highest probability (top-k filtering).
56
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
57
+ Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
58
+ From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
59
+ """
60
+ assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
61
+ top_k = min(top_k, logits.size(-1)) # Safety check
62
+ if top_k > 0:
63
+ # Remove all tokens with a probability less than the last token of the top-k
64
+ indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
65
+ logits[indices_to_remove] = filter_value
66
+
67
+ if top_p > 0.0:
68
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True)
69
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
70
+
71
+ # Remove tokens with cumulative probability above the threshold
72
+ sorted_indices_to_remove = cumulative_probs > top_p
73
+ # Shift the indices to the right to keep also the first token above the threshold
74
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
75
+ sorted_indices_to_remove[..., 0] = 0
76
+
77
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
78
+ logits[indices_to_remove] = filter_value
79
+ return logits
80
+
81
+ def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None):
82
+
83
+ context = torch.tensor(context, dtype=torch.long, device=device)
84
+ context = context.unsqueeze(0).repeat(num_samples, 1)
85
+ generated = context
86
+ with torch.no_grad():
87
+ while True:
88
+ # for _ in trange(length):
89
+ inputs = {'input_ids': generated, 'past': past}
90
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
91
+ next_token_logits = outputs[0][0, -1, :] / temperature
92
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
93
+ next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
94
+ generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
95
+
96
+ # pdb.set_trace()
97
+ if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('<EOS>')[0]:
98
+ break
99
+
100
+ return generated
101
+
102
+
103
+ def latent_code_from_text(text):  # args removed; uses the module-level model and tokenizers
104
+ tokenized1 = tokenizer_encoder.encode(text)
105
+ tokenized1 = [101] + tokenized1 + [102]
106
+ coded1 = torch.Tensor([tokenized1])
107
+ coded1 = coded1.long()
108
+ with torch.no_grad():
109
+ x0 = coded1
110
+ x0 = x0.to('cuda')
111
+ pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
112
+ mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
113
+ latent_z = mean.squeeze(1)
114
+ coded_length = len(tokenized1)
115
+ return latent_z, coded_length
116
+
117
+ # args
118
+ def text_from_latent_code(latent_z):
119
+ past = latent_z
120
+ context_tokens = tokenizer_decoder.encode('<BOS>')
121
+
122
+ length = 128 # maximum length, but not used
123
+ out = sample_sequence_conditional(
124
+ model=model_vae.decoder,
125
+ context=context_tokens,
126
+ past=past,
127
+ length= length, # Chunyuan: Fix length; or use <EOS> to complete a sentence
128
+ temperature=.2,
129
+ top_k=50,
130
+ top_p=.98,
131
+ device='cuda',
132
+ decoder_tokenizer = tokenizer_decoder
133
+ )
134
+ text_x1 = tokenizer_decoder.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True)
135
+ text_x1 = text_x1.split()[1:-1]
136
+ text_x1 = ' '.join(text_x1)
137
+ return text_x1
138
+
139
+
140
+ ################################################
141
+ # Load model
142
+
143
+
144
+ MODEL_CLASSES = {
145
+ 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer),
146
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer)
147
+ }
148
+
149
+ latent_size = 768
150
+ model_path = '/home/ryn_mote/Misc/generative_recommender/text_space/1.0_checkpoint-31250/checkpoint-31250/checkpoint-full-31250/'
151
+ encoder_path = '/home/ryn_mote/Misc/generative_recommender/text_space/1.0_checkpoint-31250/checkpoint-31250/checkpoint-encoder-31250/'
152
+ decoder_path = '/home/ryn_mote/Misc/generative_recommender/text_space/1.0_checkpoint-31250/checkpoint-31250/checkpoint-decoder-31250/'
153
+ block_size = 100
154
+
155
+ # Load a trained Encoder model and vocabulary that you have fine-tuned
156
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES['bert']
157
+ model_encoder = encoder_model_class.from_pretrained(encoder_path, latent_size=latent_size)
158
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained('bert-base-cased', do_lower_case=True)
159
+
160
+ model_encoder.to('cuda')
161
+ if block_size <= 0:
162
+ block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
163
+ block_size = min(block_size, tokenizer_encoder.max_len_single_sentence)
164
+
165
+ # Load a trained Decoder model and vocabulary that you have fine-tuned
166
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES['gpt2']
167
+ model_decoder = decoder_model_class.from_pretrained(decoder_path, latent_size=latent_size)
168
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained('gpt2', do_lower_case=False)
169
+ model_decoder.to('cuda')
170
+ if block_size <= 0:
171
+ block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
172
+ block_size = min(block_size, tokenizer_decoder.max_len_single_sentence)
173
+
174
+ # Load full model
175
+ output_full_dir = '/home/ryn_mote/Misc/generative_recommender/text_space/'
176
+ checkpoint = torch.load(os.path.join(model_path, 'training.bin'))
177
+
178
+ # Chunyuan: Add Padding token to GPT2
179
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
180
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
181
+ print('We have added', num_added_toks, 'tokens to GPT2')
182
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expects the full size of the new vocabulary, i.e. the length of the tokenizer.
183
+ assert tokenizer_decoder.pad_token == '<PAD>'
184
+
185
+
186
+ # Evaluation
187
+ model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, SimpleNamespace(**{'latent_size': latent_size, 'device':'cuda'}))
188
+ model_vae.load_state_dict(checkpoint['model_state_dict'])
189
+ print("Pre-trained Optimus is successfully loaded")
190
+ model_vae.to('cuda').to(torch.bfloat16)
191
+
192
+ l = latent_code_from_text('A photo of a mountain.')[0]
193
+ t = text_from_latent_code(l)
194
+ print(t, l, l.shape)
195
+ ################################################
196
+
197
+ import gradio as gr
198
+ import numpy as np
199
+ from sklearn.svm import SVC
200
+ from sklearn.inspection import permutation_importance
201
+ from sklearn import preprocessing
202
+ import pandas as pd
203
+ import random
204
+ import time
205
+
206
+
207
+ dtype = torch.bfloat16
208
+ torch.set_grad_enabled(False)
209
+
210
+ prompt_list = [p for p in list(set(
211
+ pd.read_csv('./twitter_prompts.csv').iloc[:, 1].tolist())) if type(p) == str]
212
+
213
+ start_time = time.time()
214
+
215
+ ####################### Setup Model
216
+
217
+ # TODO put back
218
+ # @spaces.GPU()
219
+ def generate(prompt, in_embs=None,):
220
+ if prompt != '':
221
+ print(prompt)
222
+ #in_embs = in_embs / in_embs.abs().max() * .15 if in_embs != None else None
223
+ in_embs = .9 * in_embs.to('cuda') + .5 * latent_code_from_text(prompt)[0] if in_embs is not None else latent_code_from_text(prompt)[0]
224
+ else:
225
+ print('From embeds.')
226
+ in_embs = in_embs / in_embs.abs().max() * .6
227
+ in_embs = in_embs.to('cuda').to(torch.bfloat16)
228
+ plt.close('all')
229
+ plt.hist(np.array(in_embs.detach().to('cpu').to(torch.float)).flatten(), bins=5)
230
+ plt.savefig('real_im_emb_plot.jpg')
231
+
232
+
233
+ text = text_from_latent_code(in_embs)
234
+ in_embs = latent_code_from_text(text)[0]
235
+ print(text)
236
+ return text, in_embs.to('cpu')
237
+
238
+
239
+ #######################
240
+
241
+ # TODO add to state instead of shared across all
242
+ glob_idx = 0
243
+
244
+ def next_one(embs, ys, calibrate_prompts):
245
+ global glob_idx
246
+ glob_idx = glob_idx + 1
247
+
248
+ with torch.no_grad():
249
+ if len(calibrate_prompts) > 0:
250
+ print('######### Calibrating with sample prompts #########')
251
+ prompt = calibrate_prompts.pop(0)
252
+ text, img_embs = generate(prompt)
253
+ embs += img_embs
254
+ print(len(embs))
255
+ return text, embs, ys, calibrate_prompts
256
+ else:
257
+ print('######### Roaming #########')
258
+
259
+
260
+ # handle case where every instance of calibration prompts is 'Neither' or 'Like' or 'Dislike'
261
+ if len(list(set(ys))) <= 1:
262
+ embs.append(.01*torch.randn(latent_size))
263
+ embs.append(.01*torch.randn(latent_size))
264
+ ys.append(0)
265
+ ys.append(1)
266
+ if len(list(ys)) < 10:
267
+ embs += [.01*torch.randn(latent_size)] * 3
268
+ ys += [0] * 3
269
+
270
+ pos_indices = [i for i in range(len(embs)) if ys[i] == 1]
271
+ neg_indices = [i for i in range(len(embs)) if ys[i] == 0]
272
+
273
+ # the embs & ys stay tied by index but we shuffle to drop randomly
274
+ random.shuffle(pos_indices)
275
+ random.shuffle(neg_indices)
276
+
277
+ #if len(pos_indices) - len(neg_indices) > 48 and len(pos_indices) > 80:
278
+ # pos_indices = pos_indices[32:]
279
+ if len(neg_indices) - len(pos_indices) > 48/16 and len(pos_indices) > 6:
280
+ pos_indices = pos_indices[5:]
281
+ if len(neg_indices) - len(pos_indices) > 48/16 and len(neg_indices) > 6:
282
+ neg_indices = neg_indices[5:]
283
+
284
+
285
+ if len(neg_indices) > 25:
286
+ neg_indices = neg_indices[1:]
287
+
288
+ print(len(pos_indices), len(neg_indices))
289
+ indices = pos_indices + neg_indices
290
+
291
+ embs = [embs[i] for i in indices]
292
+ ys = [ys[i] for i in indices]
293
+
294
+
295
+ indices = list(range(len(embs)))
296
+
297
+ # also add the latest 0 and the latest 1
298
+ has_0 = False
299
+ has_1 = False
300
+ for i in reversed(range(len(ys))):
301
+ if ys[i] == 0 and has_0 == False:
302
+ indices.append(i)
303
+ has_0 = True
304
+ elif ys[i] == 1 and has_1 == False:
305
+ indices.append(i)
306
+ has_1 = True
307
+ if has_0 and has_1:
308
+ break
309
+
310
+ # we may have just encountered a rare multi-threading diffusers issue (https://github.com/huggingface/diffusers/issues/5749);
311
+ # this ends up adding a rating but losing an embedding, it seems.
312
+ # let's take off a rating if so to continue without indexing errors.
313
+ if len(ys) > len(embs):
314
+ print('ys are longer than embs; popping latest rating')
315
+ ys.pop(-1)
316
+
317
+ feature_embs = np.array(torch.stack([embs[i].to('cpu') for i in indices]).to('cpu'))
318
+ scaler = preprocessing.StandardScaler().fit(feature_embs)
319
+ feature_embs = scaler.transform(feature_embs)
320
+ chosen_y = np.array([ys[i] for i in indices])
321
+
322
+ print('Gathering coefficients')
323
+ lin_class = SVC(max_iter=50000, kernel='linear', class_weight='balanced', C=.1).fit(feature_embs, chosen_y)
324
+ coef_ = torch.tensor(lin_class.coef_, dtype=torch.double)
325
+ print(coef_.shape, 'COEF')
326
+ print('Gathered')
327
+
328
+ rng_prompt = random.choice(prompt_list)
329
+ w = 1# if len(embs) % 2 == 0 else 0
330
+ im_emb = w * coef_.to(dtype=dtype)
331
+
332
+ prompt= '' if glob_idx % 3 != 0 else rng_prompt
333
+ text, im_emb = generate(prompt, im_emb)
334
+ embs += im_emb
335
+
336
+
337
+ return text, embs, ys, calibrate_prompts
338
+
339
+
340
+
341
+
342
+
343
+
344
+
345
+
346
+
347
+ def start(_, embs, ys, calibrate_prompts):
348
+ text, embs, ys, calibrate_prompts = next_one(embs, ys, calibrate_prompts)
349
+ return [
350
+ gr.Button(value='Like (L)', interactive=True),
351
+ gr.Button(value='Neither (Space)', interactive=True),
352
+ gr.Button(value='Dislike (A)', interactive=True),
353
+ gr.Button(value='Start', interactive=False),
354
+ text,
355
+ embs,
356
+ ys,
357
+ calibrate_prompts
358
+ ]
359
+
360
+
361
+ def choose(text, choice, embs, ys, calibrate_prompts):
362
+ if choice == 'Like (L)':
363
+ choice = 1
364
+ elif choice == 'Neither (Space)':
365
+ embs = embs[:-1]
366
+ text, embs, ys, calibrate_prompts = next_one(embs, ys, calibrate_prompts)
367
+ return text, embs, ys, calibrate_prompts
368
+ else:
369
+ choice = 0
370
+
371
+ # if we detected NSFW, leave that area of latent space regardless of how they rated chosen.
372
+ # TODO skip allowing rating
373
+ if text is None:
374
+ print('NSFW -- choice is disliked')
375
+ choice = 0
376
+
377
+ ys += [choice]*1
378
+ text, embs, ys, calibrate_prompts = next_one(embs, ys, calibrate_prompts)
379
+ return text, embs, ys, calibrate_prompts
380
+
381
+ css = '''.gradio-container{max-width: 700px !important}
382
+ #description{text-align: center}
383
+ #description h1, #description h3{display: block}
384
+ #description p{margin-top: 0}
385
+ .fade-in-out {animation: fadeInOut 3s forwards}
386
+ @keyframes fadeInOut {
387
+ 0% {
388
+ background: var(--bg-color);
389
+ }
390
+ 100% {
391
+ background: var(--button-secondary-background-fill);
392
+ }
393
+ }
394
+ '''
395
+ js_head = '''
396
+ <script>
397
+ document.addEventListener('keydown', function(event) {
398
+ if (event.key === 'a' || event.key === 'A') {
399
+ // Trigger click on 'dislike' if 'A' is pressed
400
+ document.getElementById('dislike').click();
401
+ } else if (event.key === ' ' || event.keyCode === 32) {
402
+ // Trigger click on 'neither' if Spacebar is pressed
403
+ document.getElementById('neither').click();
404
+ } else if (event.key === 'l' || event.key === 'L') {
405
+ // Trigger click on 'like' if 'L' is pressed
406
+ document.getElementById('like').click();
407
+ }
408
+ });
409
+ function fadeInOut(button, color) {
410
+ button.style.setProperty('--bg-color', color);
411
+ button.classList.remove('fade-in-out');
412
+ void button.offsetWidth; // This line forces a repaint by accessing a DOM property
413
+
414
+ button.classList.add('fade-in-out');
415
+ button.addEventListener('animationend', () => {
416
+ button.classList.remove('fade-in-out'); // Reset the animation state
417
+ }, {once: true});
418
+ }
419
+ document.body.addEventListener('click', function(event) {
420
+ const target = event.target;
421
+ if (target.id === 'dislike') {
422
+ fadeInOut(target, '#ff1717');
423
+ } else if (target.id === 'like') {
424
+ fadeInOut(target, '#006500');
425
+ } else if (target.id === 'neither') {
426
+ fadeInOut(target, '#cccccc');
427
+ }
428
+ });
429
+
430
+ </script>
431
+ '''
432
+
433
+ with gr.Blocks(css=css, head=js_head) as demo:
434
+ gr.Markdown('''# Compass
435
+ ### Generative Recommenders for Exploration of Text
436
+
437
+ Explore the latent space based on your preferences, without writing prompts. Learn more in [the write-up](https://rynmurdock.github.io/posts/2024/3/generative_recomenders/).
438
+ ''', elem_id="description")
439
+ embs = gr.State([])
440
+ ys = gr.State([])
441
+ calibrate_prompts = gr.State([
442
+ 'the moon is melting into my glass of tea',
443
+ 'a sea slug -- pair of claws scuttling -- jelly fish glowing',
444
+ 'an adorable creature. It may be a goblin or a pig or a slug.',
445
+ 'an animation about a gorgeous nebula',
446
+ 'a sketch of an impressive mountain by da vinci',
447
+ 'a watercolor painting: the octopus writhes',
448
+ ])
449
+ def l():
450
+ return None
451
+
452
+ with gr.Row(elem_id='output-image'):
453
+ text = gr.Textbox(interactive=False, elem_id="text")
454
+ with gr.Row(equal_height=True):
455
+ b3 = gr.Button(value='Dislike (A)', interactive=False, elem_id="dislike")
456
+ b2 = gr.Button(value='Neither (Space)', interactive=False, elem_id="neither")
457
+ b1 = gr.Button(value='Like (L)', interactive=False, elem_id="like")
458
+ b1.click(
459
+ choose,
460
+ [text, b1, embs, ys, calibrate_prompts],
461
+ [text, embs, ys, calibrate_prompts]
462
+ )
463
+ b2.click(
464
+ choose,
465
+ [text, b2, embs, ys, calibrate_prompts],
466
+ [text, embs, ys, calibrate_prompts]
467
+ )
468
+ b3.click(
469
+ choose,
470
+ [text, b3, embs, ys, calibrate_prompts],
471
+ [text, embs, ys, calibrate_prompts]
472
+ )
473
+ with gr.Row():
474
+ b4 = gr.Button(value='Start')
475
+ b4.click(start,
476
+ [b4, embs, ys, calibrate_prompts],
477
+ [b1, b2, b3, b4, text, embs, ys, calibrate_prompts])
478
+ with gr.Row():
479
+ html = gr.HTML('''<div style='text-align:center; font-size:20px'>You will calibrate for several prompts and then roam. </div><br><br><br>
480
+ <div style='text-align:center; font-size:14px'>Note that while the model is unlikely to produce NSFW text, this may still occur, and users should avoid NSFW content when rating.
481
+ </div>
482
+ <br><br>
483
+ <div style='text-align:center; font-size:14px'>Thanks to @multimodalart for their contributions to the demo, especially the interface, and to @maxbittker for feedback.
484
+ </div>''')
485
+
486
+ demo.launch(share=True)
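
The `next_one` loop above fits a linear SVM on liked/disliked latent codes and reuses its weight vector `coef_` as a direction to move through the Optimus latent space. A minimal sketch of that idea, not part of the committed file, with random vectors standing in for real latents:

```python
# Minimal sketch (not part of the diff): a linear SVM's weight vector as a preference
# direction over latent codes. Random data stands in for real Optimus latents.
import numpy as np
import torch
from sklearn import preprocessing
from sklearn.svm import SVC

latent_size = 768
rng = np.random.default_rng(0)
liked = rng.normal(0.5, 1.0, size=(8, latent_size))      # hypothetical "liked" latents
disliked = rng.normal(-0.5, 1.0, size=(8, latent_size))  # hypothetical "disliked" latents

X = np.concatenate([liked, disliked])
y = np.array([1] * len(liked) + [0] * len(disliked))

X = preprocessing.StandardScaler().fit_transform(X)
clf = SVC(kernel='linear', class_weight='balanced', C=.1, max_iter=50000).fit(X, y)

# The separating hyperplane's normal points from "disliked" toward "liked";
# decoding this direction (or a point along it) yields the next candidate text.
direction = torch.tensor(clf.coef_, dtype=torch.float32).squeeze(0)  # (latent_size,)
print(direction.shape)
```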
Optimus/code/examples/big_ae/metrics.py ADDED
@@ -0,0 +1,196 @@
1
+ import os
2
+ from multiprocessing import Pool
3
+ import pdb
4
+ import numpy as np
5
+ import nltk
6
+ nltk.download('punkt')
7
+
8
+ from nltk.translate.bleu_score import SmoothingFunction
9
+
10
+ try:
11
+ from multiprocessing import cpu_count
12
+ except ImportError:
13
+ from os import cpu_count
14
+
15
+ class Metrics(object):
16
+ def __init__(self):
17
+ self.name = 'Metric'
18
+
19
+ def get_name(self):
20
+ return self.name
21
+
22
+ def set_name(self, name):
23
+ self.name = name
24
+
25
+ def get_score(self):
26
+ pass
27
+
28
+
29
+ class Bleu(Metrics):
30
+ def __init__(self, test_text='', real_text='', gram=3, num_real_sentences=500, num_fake_sentences=10000):
31
+ super(Bleu, self).__init__()
32
+ self.name = 'Bleu'
33
+ self.test_data = test_text
34
+ self.real_data = real_text
35
+ self.gram = gram
36
+ self.sample_size = num_real_sentences
37
+ self.reference = None
38
+ self.is_first = True
39
+ self.num_sentences = num_fake_sentences
40
+
41
+
42
+ def get_name(self):
43
+ return self.name
44
+
45
+ def get_score(self, is_fast=True, ignore=False):
46
+ if ignore:
47
+ return 0
48
+ if self.is_first:
49
+ self.get_reference()
50
+ self.is_first = False
51
+ if is_fast:
52
+ return self.get_bleu_fast()
53
+ return self.get_bleu_parallel()
54
+
55
+ # fetch REAL DATA
56
+ def get_reference(self):
57
+ if self.reference is None:
58
+ reference = list()
59
+ with open(self.real_data) as real_data:
60
+ for text in real_data:
61
+ text = nltk.word_tokenize(text)
62
+ reference.append(text)
63
+ self.reference = reference
64
+ return reference
65
+ else:
66
+ return self.reference
67
+
68
+ def get_bleu(self):
69
+ raise Exception('use get_bleu_parallel instead of get_bleu')
70
+ ngram = self.gram
71
+ bleu = list()
72
+ reference = self.get_reference()
73
+ weight = tuple((1. / ngram for _ in range(ngram)))
74
+ with open(self.test_data) as test_data:
75
+ for hypothesis in test_data:
76
+ hypothesis = nltk.word_tokenize(hypothesis)
77
+ bleu.append(nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight,
78
+ smoothing_function=SmoothingFunction().method1))
79
+ return sum(bleu) / len(bleu)
80
+
81
+ def calc_bleu(self, reference, hypothesis, weight):
82
+ return nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight,
83
+ smoothing_function=SmoothingFunction().method1)
84
+
85
+ def get_bleu_fast(self):
86
+ reference = self.get_reference()
87
+ reference = reference[0:self.sample_size]
88
+ return self.get_bleu_parallel(reference=reference)
89
+
90
+ def get_bleu_parallel(self, reference=None):
91
+ ngram = self.gram
92
+ if reference is None:
93
+ reference = self.get_reference()
94
+ weight = tuple((1. / ngram for _ in range(ngram)))
95
+ pool = Pool(cpu_count())
96
+ result = list()
97
+ maxx = self.num_sentences
98
+ with open(self.test_data) as test_data:
99
+ for i, hypothesis in enumerate(test_data):
100
+ #print('i : {}'.format(i))
101
+ hypothesis = nltk.word_tokenize(hypothesis)
102
+ result.append(pool.apply_async(self.calc_bleu, args=(reference, hypothesis, weight)))
103
+ if i > maxx : break
104
+ score = 0.0
105
+ cnt = 0
106
+ for it, i in enumerate(result):
107
+ #print('i : {}'.format(it))
108
+ score += i.get()
109
+ cnt += 1
110
+ pool.close()
111
+ pool.join()
112
+ return score / cnt
113
+
114
+
115
+
116
+
117
+ class SelfBleu(Metrics):
118
+ def __init__(self, test_text='', gram=3, model_path='', num_sentences=500):
119
+ super(SelfBleu, self).__init__()
120
+ self.name = 'Self-Bleu'
121
+ self.test_data = test_text
122
+ self.gram = gram
123
+ self.sample_size = num_sentences
124
+ self.reference = None
125
+ self.is_first = True
126
+
127
+
128
+ def get_name(self):
129
+ return self.name
130
+
131
+ def get_score(self, is_fast=True, ignore=False):
132
+ if ignore:
133
+ return 0
134
+ if self.is_first:
135
+ self.get_reference()
136
+ self.is_first = False
137
+ if is_fast:
138
+ return self.get_bleu_fast()
139
+ return self.get_bleu_parallel()
140
+
141
+ def get_reference(self):
142
+ if self.reference is None:
143
+ reference = list()
144
+ with open(self.test_data) as real_data:
145
+ for text in real_data:
146
+ text = nltk.word_tokenize(text)
147
+ reference.append(text)
148
+ self.reference = reference
149
+ return reference
150
+ else:
151
+ return self.reference
152
+
153
+ def get_bleu(self):
154
+ ngram = self.gram
155
+ bleu = list()
156
+ reference = self.get_reference()
157
+ weight = tuple((1. / ngram for _ in range(ngram)))
158
+ with open(self.test_data) as test_data:
159
+ for hypothesis in test_data:
160
+ hypothesis = nltk.word_tokenize(hypothesis)
161
+ bleu.append(nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight,
162
+ smoothing_function=SmoothingFunction().method1))
163
+ return sum(bleu) / len(bleu)
164
+
165
+ def calc_bleu(self, reference, hypothesis, weight):
166
+ return nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight,
167
+ smoothing_function=SmoothingFunction().method1)
168
+
169
+ def get_bleu_fast(self):
170
+ reference = self.get_reference()
171
+ # random.shuffle(reference)
172
+ reference = reference[0:self.sample_size]
173
+ return self.get_bleu_parallel(reference=reference)
174
+
175
+ def get_bleu_parallel(self, reference=None):
176
+ ngram = self.gram
177
+ if reference is None:
178
+ reference = self.get_reference()
179
+ weight = tuple((1. / ngram for _ in range(ngram)))
180
+ pool = Pool(cpu_count())
181
+ result = list()
182
+ sentence_num = len(reference)
183
+ for index in range(sentence_num):
184
+ # leave-one-out: score each sentence against all the other generations as references
185
+ hypothesis = reference[index]
186
+ other = reference[:index] + reference[index+1:]
187
+ result.append(pool.apply_async(self.calc_bleu, args=(other, hypothesis, weight)))
188
+
189
+ score = 0.0
190
+ cnt = 0
191
+ for i in result:
192
+ score += i.get()
193
+ cnt += 1
194
+ pool.close()
195
+ pool.join()
196
+ return score / cnt
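
`SelfBleu` scores each generated sentence against all the other generations as references (leave-one-out), so lower values indicate more diverse samples. A hedged usage sketch, not part of the committed file, assuming it is run from the `examples/big_ae` directory and that `generations.txt` is a hypothetical plain-text file with one generated sentence per line:

```python
# Minimal sketch (not part of the diff): Self-BLEU over a file of generations.
from metrics import SelfBleu

self_bleu = SelfBleu(test_text='generations.txt', gram=3, num_sentences=500)
score = self_bleu.get_score(is_fast=True)   # leave-one-out BLEU, averaged over sentences
print('Self-BLEU-3: %.3f (lower is more diverse)' % score)
```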
Optimus/code/examples/big_ae/modules/__init__.py ADDED
@@ -0,0 +1,7 @@
1
+ from .encoders import *
2
+ from .decoders import *
3
+ from .vae import *
4
+ from .utils import *
5
+ from .spacefusion import *
6
+ from .cara import *
7
+ from .arae import *
Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (327 Bytes)
Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-37.pyc ADDED
Binary file (270 Bytes)
Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-310.pyc ADDED
Binary file (6.64 kB)
Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-37.pyc ADDED
Binary file (6.44 kB)
Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-310.pyc ADDED
Binary file (8.63 kB)
Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-37.pyc ADDED
Binary file (8.41 kB)
Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-310.pyc ADDED
Binary file (4.44 kB)
Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-37.pyc ADDED
Binary file (4.37 kB)
Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-310.pyc ADDED
Binary file (1.34 kB)
Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-37.pyc ADDED
Binary file (1.28 kB)
Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-310.pyc ADDED
Binary file (14.8 kB)
Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-37.pyc ADDED
Binary file (15 kB)
Optimus/code/examples/big_ae/modules/arae.py ADDED
@@ -0,0 +1,274 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+ from .utils import log_sum_exp
5
+ import pdb
6
+ import sys
7
+ sys.path.append('../../')
8
+ from pytorch_transformers.modeling_bert import BertEmbeddings
9
+ import torch.nn.functional as F
10
+
11
+
12
+ class ARAE(nn.Module):
13
+ def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): #
14
+ super(ARAE, self).__init__()
15
+ self.encoder = encoder
16
+ self.decoder = decoder
17
+ self.tokenizer_encoder = tokenizer_encoder
18
+ self.tokenizer_decoder = tokenizer_decoder
19
+
20
+ self.args = args
21
+ self.nz = args.latent_size
22
+
23
+ self.bos_token_id_list = self.tokenizer_decoder.encode(self.tokenizer_decoder.bos_token)
24
+ self.pad_token_id = self.tokenizer_decoder.encode(self.tokenizer_decoder.pad_token)[0]
25
+
26
+ # connector: from Bert hidden units to the latent space
27
+ self.linear = nn.Linear(encoder.config.hidden_size, self.nz, bias=False)
28
+
29
+ # # Standard Normal prior
30
+ # loc = torch.zeros(self.nz, device=args.device)
31
+ # scale = torch.ones(self.nz, device=args.device)
32
+ # self.prior = torch.distributions.normal.Normal(loc, scale)
33
+
34
+ self.label_embedding = nn.Embedding(args.label_size, self.nz, padding_idx=0) # use the same size as latent_z so as to use the same decoder.linear()
35
+ self.latent_generator = nn.Linear(self.nz, self.nz)
36
+ self.latent_classifier = nn.Linear(self.nz, args.label_size if args.label_size > 2 else 1)
37
+ self.latent_discriminator = nn.Linear(self.nz, 1)
38
+
39
+ self.gpt_embeddings = nn.Embedding(self.decoder.config.vocab_size, self.decoder.config.n_embd)
40
+ self.gpt_embeddings.weight.data = decoder.transformer.wte.weight.data
41
+
42
+ self.conv1 = nn.Conv1d(self.encoder.config.hidden_size, self.encoder.config.hidden_size, 3)
43
+ self.classifier = nn.Linear(self.encoder.config.hidden_size, 1 if args.label_size <= 2 else args.label_size)
44
+
45
+ self.CrossEntropyLoss = torch.nn.CrossEntropyLoss()
46
+ self.BCEWithLogitsLoss = torch.nn.BCEWithLogitsLoss()
47
+
48
+ def forward(self, input_seq_ids, tgt_seq_ids, cond_labels, attention_mask=None):
49
+ # inputs: (B, seq_len)
50
+ # labels: (B, seq_len)
51
+ # cond_labels: (B), conditional labels.
52
+
53
+ ones_label = torch.ones_like(cond_labels).to(dtype=torch.float32)
54
+ zeros_label = torch.zeros_like(cond_labels).to(dtype=torch.float32)
55
+ random_noise = torch.nn.init.normal_(torch.empty(input_seq_ids.size(0), self.nz)).to(device=input_seq_ids.device, dtype=torch.float32)
56
+
57
+ # Encode inputs
58
+ outputs = self.encoder(input_seq_ids, attention_mask=attention_mask)
59
+ pooled_hidden_fea = outputs[1] # (B, dim_h)
60
+
61
+ # Encode z
62
+ latent_z = self.linear(pooled_hidden_fea) # (B, nz)
63
+
64
+ # Generate z
65
+ gen_z = self.latent_generator(random_noise) # (B, nz)
66
+
67
+ # Latent discriminator
68
+ prob_encode_z_dis = self.latent_discriminator(latent_z).squeeze(1).float() # (B)
69
+ prob_gen_z_dis = self.latent_discriminator(gen_z).squeeze(1).float() # (B)
70
+ # Train latent discriminator
71
+ loss_lsd = self.BCEWithLogitsLoss(prob_gen_z_dis, zeros_label) + self.BCEWithLogitsLoss(prob_encode_z_dis, ones_label)
72
+ acc_encode_z_dis = ((prob_encode_z_dis >= 0).float() == ones_label).float()
73
+ acc_gen_z_dis = ((prob_gen_z_dis >= 0).float() == zeros_label).float()
74
+ # Train sampler adversarially
75
+ loss_lsg = self.BCEWithLogitsLoss(prob_gen_z_dis, ones_label)
76
+
77
+ # Latent classifier
78
+ prob_encode_z_cls = self.latent_classifier(latent_z) # (B, n_labels)
79
+ if self.args.label_size <= 2:
80
+ prob_encode_z_cls = prob_encode_z_cls.squeeze(1) # (B)
81
+ # Train latent classifier
82
+ loss_lsc = self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float())
83
+ acc_encode_z_cls = ((prob_encode_z_cls >= 0).float() == cond_labels.float()).float()
84
+ # Train encoder adversarially
85
+ loss_encoder = 1 - self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float())
86
+ else:
87
+ # Train latent classifier
88
+ loss_lsc = self.CrossEntropyLoss(prob_encode_z_cls, cond_labels)
89
+ acc_encode_z_cls = (torch.argmax(prob_encode_z_cls, dim=-1) == cond_labels).float()
90
+ # Train encoder adversarially
91
+ loss_encoder = 1 - self.CrossEntropyLoss(prob_encode_z_cls, cond_labels)
92
+
93
+ # Embed labels
94
+ label_emb = self.label_embedding(cond_labels) # (B, hidden_size)
95
+ past_label = self.decoder.linear(label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now.
96
+ if self.args.label_size <= 2:
97
+ sampled_cond_labels = 1 - cond_labels
98
+ else:
99
+ raise NotImplementedError # todo: currently only implemented for binary labels. need to change for multi-class labels.
100
+ sampled_label_emb = self.label_embedding(sampled_cond_labels) # (B, hidden_size)
101
+ past_sampled_label = self.decoder.linear(sampled_label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now.
102
+
103
+ # Generate based on encoded z and gt labels. (reconstruction)
104
+ past_z = self.decoder.linear(latent_z) # (B, n_blocks * hidden_size)
105
+ gen_past_z = self.decoder.linear(gen_z) # (B, n_blocks * hidden_size)
106
+
107
+ past = torch.cat([past_z.unsqueeze(1), past_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
108
+ outputs = self.decoder(input_ids=tgt_seq_ids, past=past, labels=tgt_seq_ids, label_ignore=self.pad_token_id)
109
+ loss_rec = outputs[0]
110
+
111
+ # Train a classifier in the observation space
112
+ tgt_emb = self.gpt_embeddings(tgt_seq_ids)
113
+ tgt_encode = self.conv1(tgt_emb.transpose(1, 2)) # (B, dim_h, seq_len)
114
+ tgt_encode = torch.mean(tgt_encode, dim=-1) # (B, dim_h)
115
+ prob_cls = self.classifier(tgt_encode) # (B, n_labels)
116
+ if self.args.label_size <= 2:
117
+ prob_cls = prob_cls.squeeze(1)
118
+ loss_cls = self.BCEWithLogitsLoss(prob_cls, cond_labels.float())
119
+ pred_cls = (prob_cls >= 0).to(dtype=torch.long)
120
+ else:
121
+ loss_cls = self.CrossEntropyLoss(prob_cls, cond_labels)
122
+ pred_cls = torch.argmax(prob_cls, dim=-1)
123
+ acc_cls = (pred_cls == cond_labels).float()
124
+
125
+ # Loss
126
+ loss = loss_rec + loss_encoder + loss_lsc + loss_lsd + loss_lsg + loss_cls
127
+
128
+ if not self.training:
129
+ # Generate based on encoded z and gt labels
130
+ generated = self.sample_sequence_conditional_batch(past=past, context=self.bos_token_id_list)
131
+
132
+ # Generate based on encoded z and sampled labels (attribute transfer)
133
+ at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
134
+ at_generated = self.sample_sequence_conditional_batch(past=at_past, context=self.bos_token_id_list) # (B, seq_len)
135
+
136
+ # Generate based on sampled z and sampled labels. (conditional generation)
137
+ cg_past = torch.cat([gen_past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
138
+ cg_generated = self.sample_sequence_conditional_batch(past=cg_past, context=self.bos_token_id_list) # (B, seq_len)
139
+
140
+ # classifier on gt generated sentences.
141
+ ge_emb = self.gpt_embeddings(generated)
142
+ ge_encode = self.conv1(ge_emb.transpose(1, 2)) # (B, dim_h, seq_len)
143
+ ge_encode = torch.mean(ge_encode, dim=-1) # (B, dim_h)
144
+ prob_ge_cls = self.classifier(ge_encode) # (B, 1)
145
+
146
+ if self.args.label_size <= 2:
147
+ pred_ge_cls = (prob_ge_cls.squeeze(1) >= 0).to(torch.long)
148
+ else:
149
+ pred_ge_cls = torch.argmax(prob_ge_cls, dim=-1)
150
+ acc_ge_cls = (pred_ge_cls == cond_labels).float()
151
+
152
+ # classifier on attribute transfer generated sentences.
153
+ at_emb = self.gpt_embeddings(at_generated)
154
+ at_encode = self.conv1(at_emb.transpose(1, 2)) # (B, dim_h, seq_len)
155
+ at_encode = torch.mean(at_encode, dim=-1) # (B, dim_h)
156
+ prob_at_cls = self.classifier(at_encode) # (B, 1)
157
+ if self.args.label_size <= 2:
158
+ pred_at_cls = (prob_at_cls.squeeze(1) >= 0).to(torch.long)
159
+ else:
160
+ pred_at_cls = torch.argmax(prob_at_cls, dim=-1)
161
+ acc_at_cls = (pred_at_cls == sampled_cond_labels).float()
162
+
163
+ # classifier on conditional generated sentences.
164
+ cg_emb = self.gpt_embeddings(cg_generated)
165
+ cg_encode = self.conv1(cg_emb.transpose(1, 2)) # (B, dim_h, seq_len)
166
+ cg_encode = torch.mean(cg_encode, dim=-1) # (B, dim_h)
167
+ prob_cg_cls = self.classifier(cg_encode) # (B, 1)
168
+ if self.args.label_size <= 2:
169
+ pred_cg_cls = (prob_cg_cls.squeeze(1) >= 0).to(torch.long)
170
+ else:
171
+ pred_cg_cls = torch.argmax(prob_cg_cls, dim=-1)
172
+ acc_cg_cls = (pred_cg_cls == sampled_cond_labels).float()
173
+
174
+ result = {
175
+ 'sampled_cond_labels': sampled_cond_labels,
176
+ 'cond_labels': cond_labels,
177
+
178
+ 'tgt_seq_ids': tgt_seq_ids,
179
+ 'generated': generated,
180
+ 'at_generated': at_generated,
181
+ 'cg_generated': cg_generated,
182
+
183
+ 'acc_encode_z_dis': acc_encode_z_dis,
184
+ 'acc_gen_z_dis': acc_gen_z_dis,
185
+ 'acc_encode_z_cls': acc_encode_z_cls,
186
+ 'acc_cls': acc_cls,
187
+ 'acc_ge_cls': acc_ge_cls,
188
+ 'acc_at_cls': acc_at_cls,
189
+ 'acc_cg_cls': acc_cg_cls,
190
+
191
+ 'pred_cls': pred_cls,
192
+ 'pred_ge_cls': pred_ge_cls,
193
+ 'pred_at_cls': pred_at_cls,
194
+ 'pred_cg_cls': pred_cg_cls,
195
+ }
196
+
197
+ return result
198
+
199
+ loss_dict = {
200
+ 'loss': loss,
201
+ 'loss_rec': loss_rec,
202
+ 'loss_encoder': loss_encoder,
203
+ 'loss_lsc': loss_lsc,
204
+ 'loss_lsd': loss_lsd,
205
+ 'loss_lsg': loss_lsg,
206
+ 'loss_cls': loss_cls,
207
+ }
208
+ acc_dict = {
209
+ 'acc_encode_z_dis': acc_encode_z_dis,
210
+ 'acc_gen_z_dis': acc_gen_z_dis,
211
+ 'acc_encode_z_cls': acc_encode_z_cls,
212
+ 'acc_cls': acc_cls,
213
+ }
214
+ return loss_dict, acc_dict
215
+
216
+ def sample_sequence_conditional_batch(self, past, context):
217
+ # context: a single id of <BOS>
218
+ # past: (B, past_seq_len dim_h)
219
+ num_samples = past.size(0)
220
+ context = torch.tensor(context, dtype=torch.long, device=past.device)
221
+ context = context.unsqueeze(0).repeat(num_samples, 1)
222
+ generated = context # (B, 1)
223
+
224
+ # with torch.no_grad():
225
+ while generated.size(-1) < self.args.block_size:
226
+ inputs = {'input_ids': generated, 'past': past}
227
+ outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
228
+ lm_logits = outputs[0]
229
+ next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size)
230
+ filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, vocab_size)
231
+ filtered_logits = F.softmax(filtered_logits, dim=-1)
232
+ next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1)
233
+ generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1)
234
+
235
+ not_finished = next_tokens != self.tokenizer_decoder.encode('<EOS>')[0]
236
+ if torch.sum(not_finished) == 0:
237
+ break
238
+
239
+ return generated # (B, seq_len)
240
+
241
+ def top_k_top_p_filtering_batch(self, logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
242
+ """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
243
+ Args:
244
+ logits: logits distribution shape (vocabulary size)
245
+ top_k > 0: keep only top k tokens with highest probability (top-k filtering).
246
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
247
+ Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
248
+ From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
249
+ """
250
+ # assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
251
+
252
+ top_k = min(top_k, logits.size(-1)) # Safety check
253
+
254
+ if top_k > 0:
255
+ # Remove all tokens with a probability less than the last token of the top-k
256
+ threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None]
257
+ logits.masked_fill_(logits < threshold, filter_value) # (B, vocab_size)
258
+
259
+ if top_p > 0.0:
260
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True) # (B, vocab_size)
261
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) # (B, vocab_size)
262
+
263
+ # Remove tokens with cumulative probability above the threshold
264
+ sorted_indices_to_remove = cumulative_probs > top_p
265
+
266
+ # Shift the indices to the right to keep also the first token above the threshold
267
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
268
+ sorted_indices_to_remove[..., 0] = 0
269
+
270
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
271
+
272
+ logits.masked_fill_(indices_to_remove, filter_value)
273
+
274
+ return logits
Optimus/code/examples/big_ae/modules/cara.py ADDED
@@ -0,0 +1,374 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+ from .utils import log_sum_exp
5
+ import pdb
6
+ import sys
7
+ sys.path.append('../../')
8
+ from pytorch_transformers.modeling_bert import BertEmbeddings
9
+ import torch.nn.functional as F
10
+
11
+
12
+ class CARA(nn.Module):
13
+ def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): #
14
+ super(CARA, self).__init__()
15
+ self.encoder = encoder
16
+ self.decoder = decoder
17
+ self.tokenizer_encoder = tokenizer_encoder
18
+ self.tokenizer_decoder = tokenizer_decoder
19
+
20
+ self.args = args
21
+ self.nz = args.latent_size
22
+
23
+ self.bos_token_id_list = self.tokenizer_decoder.encode(self.tokenizer_decoder.bos_token)
24
+ self.pad_token_id = self.tokenizer_decoder.encode(self.tokenizer_decoder.pad_token)[0]
25
+
26
+ # connector: from Bert hidden units to the latent space
27
+ self.linear = nn.Linear(encoder.config.hidden_size, self.nz, bias=False)
28
+
29
+ # # Standard Normal prior
30
+ # loc = torch.zeros(self.nz, device=args.device)
31
+ # scale = torch.ones(self.nz, device=args.device)
32
+ # self.prior = torch.distributions.normal.Normal(loc, scale)
33
+
34
+ self.label_embedding = nn.Embedding(args.label_size, self.nz, padding_idx=0) # use the same size as latent_z so as to use the same decoder.linear()
35
+ self.latent_generator = nn.Linear(self.nz, self.nz)
36
+ self.latent_classifier = nn.Linear(self.nz, args.label_size if args.label_size > 2 else 1)
37
+ self.latent_discriminator = nn.Linear(self.nz, 1)
38
+
39
+ self.gpt_embeddings = nn.Embedding(self.decoder.config.vocab_size, self.decoder.config.n_embd)
40
+ self.gpt_embeddings.weight.data = decoder.transformer.wte.weight.data
41
+
42
+ self.conv1 = nn.Conv1d(self.encoder.config.hidden_size, self.encoder.config.hidden_size, 3)
43
+ self.classifier = nn.Linear(self.encoder.config.hidden_size, 1 if args.label_size <= 2 else args.label_size)
44
+
45
+ self.CrossEntropyLoss = torch.nn.CrossEntropyLoss()
46
+ self.BCEWithLogitsLoss = torch.nn.BCEWithLogitsLoss()
47
+
48
+ def forward(self, input_seq_ids, tgt_seq_ids, cond_labels, attention_mask):
49
+ # inputs: (B, seq_len)
50
+ # labels: (B, seq_len)
51
+ # cond_labels: (B), conditional labels.
52
+
53
+ ones_label = torch.ones_like(cond_labels).to(dtype=torch.float32)
54
+ zeros_label = torch.zeros_like(cond_labels).to(dtype=torch.float32)
55
+ random_noise = torch.nn.init.normal_(torch.empty(input_seq_ids.size(0), self.nz)).to(device=input_seq_ids.device, dtype=torch.float32)
56
+
57
+ # Encode inputs
58
+ outputs = self.encoder(input_seq_ids, attention_mask=attention_mask)
59
+ pooled_hidden_fea = outputs[1] # (B, dim_h)
60
+
61
+ # Encode z
62
+ latent_z = self.linear(pooled_hidden_fea) # (B, nz)
63
+
64
+ # Generate z
65
+ gen_z = self.latent_generator(random_noise) # (B, nz)
66
+
67
+ #################### Latent discriminator for sampling from a simple distribution ####################
68
+ prob_encode_z_dis = self.latent_discriminator(latent_z).squeeze(1).float() # (B)
69
+ prob_gen_z_dis = self.latent_discriminator(gen_z).squeeze(1).float() # (B)
70
+ # Train latent discriminator
71
+ loss_lsd = self.BCEWithLogitsLoss(prob_gen_z_dis, zeros_label) + self.BCEWithLogitsLoss(prob_encode_z_dis, ones_label)
72
+ acc_encode_z_dis = ((prob_encode_z_dis >= 0).float() == ones_label).float()
73
+ acc_gen_z_dis = ((prob_gen_z_dis >= 0).float() == zeros_label).float()
74
+ # Train sampler adversarially
75
+ loss_lsg = self.BCEWithLogitsLoss(prob_gen_z_dis, ones_label)
76
+
77
+ #################### Latent classifier for disentanglement ####################
78
+ prob_encode_z_cls = self.latent_classifier(latent_z) # (B, n_labels)
79
+ if self.args.label_size <= 2:
80
+ prob_encode_z_cls = prob_encode_z_cls.squeeze(1) # (B)
81
+ # Train latent classifier
82
+ loss_lsc = self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float())
83
+ acc_encode_z_cls = ((prob_encode_z_cls >= 0).float() == cond_labels.float()).float()
84
+ # Train encoder adversarially
85
+ loss_encoder = 1 - self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float())
86
+ else:
87
+ # Train latent classifier
88
+ loss_lsc = self.CrossEntropyLoss(prob_encode_z_cls, cond_labels)
89
+ acc_encode_z_cls = (torch.argmax(prob_encode_z_cls, dim=-1) == cond_labels).float()
90
+ # Train encoder adversarially
91
+ loss_encoder = 1 - self.CrossEntropyLoss(prob_encode_z_cls, cond_labels)
92
+
93
+
94
+ #################### Recontruction loss with latent z and label emb ####################
95
+ # Embed labels
96
+ label_emb = self.label_embedding(cond_labels) # (B, hidden_size)
97
+ # past_label = self.decoder.linear(label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now.
98
+ if self.args.label_size <= 2:
99
+ sampled_cond_labels = 1 - cond_labels
100
+ else:
101
+ raise NotImplementedError # todo: currently only implemented for binary labels. need to change for multi-class labels.
102
+ sampled_label_emb = self.label_embedding(sampled_cond_labels) # (B, hidden_size)
103
+ # past_sampled_label = self.decoder.linear(sampled_label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now.
104
+ past_sampled_label = sampled_label_emb
105
+
106
+ # Generate based on encoded z and gt labels. (reconstruction)
107
+ # past_z = self.decoder.linear(latent_z) # (B, n_blocks * hidden_size)
108
+ past_z = latent_z
109
+ # gen_past_z = self.decoder.linear(gen_z) # (B, n_blocks * hidden_size)
110
+ gen_past_z = gen_z # (B, n_blocks * hidden_size)
111
+
112
+ # past = torch.cat([past_z.unsqueeze(1), past_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
113
+
114
+ past = latent_z + label_emb # (B, n_blocks * hidden_size)
115
+
116
+ outputs = self.decoder(input_ids=tgt_seq_ids, past=past, labels=tgt_seq_ids, label_ignore=self.pad_token_id)
117
+ loss_rec = outputs[0]
118
+
119
+ #################### Train a classifier in the observation space ####################
120
+ tgt_emb = self.gpt_embeddings(tgt_seq_ids)
121
+ tgt_encode = self.conv1(tgt_emb.transpose(1, 2)) # (B, dim_h, seq_len)
122
+ tgt_encode = torch.mean(tgt_encode, dim=-1) # (B, dim_h)
123
+ prob_cls = self.classifier(tgt_encode) # (B, n_labels)
124
+ if self.args.label_size <= 2:
125
+ prob_cls = prob_cls.squeeze(1)
126
+ loss_cls = self.BCEWithLogitsLoss(prob_cls, cond_labels.float())
127
+ pred_cls = (prob_cls >= 0).to(dtype=torch.long)
128
+ else:
129
+ loss_cls = self.CrossEntropyLoss(prob_cls, cond_labels)
130
+ pred_cls = torch.argmax(prob_cls, dim=-1)
131
+ acc_cls = (pred_cls == cond_labels).float()
132
+
133
+ # Generate based on encoded z and sampled labels (attribute transfer)
134
+ # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
135
+ # at_generated_soft = self.sample_sequence_conditional_batch_soft(past=at_past, context=self.bos_token_id_list) # (B, seq_len, vocab_size)
136
+
137
+ # # Classifier on attribute transfer generated sentences. Train Generator on attribute transfer.
138
+ # at_soft_emb = torch.matmul(at_generated_soft, self.gpt_embeddings.weight)
139
+ # at_soft_encode = self.conv1(at_soft_emb.transpose(1, 2)) # (B, dim_h, seq_len)
140
+ # at_soft_encode = torch.mean(at_soft_encode, dim=-1) # (B, dim_h)
141
+ # prob_at_soft_cls = self.classifier(at_soft_encode) # (B, 1)
142
+ # if self.args.label_size <= 2:
143
+ # prob_at_soft_cls = prob_at_soft_cls.squeeze(1)
144
+ # loss_at_soft_cls = self.BCEWithLogitsLoss(prob_at_soft_cls, sampled_cond_labels.float())
145
+ # pred_at_soft_cls = (prob_at_soft_cls >= 0).to(torch.long)
146
+ # else:
147
+ # loss_at_soft_cls = self.CrossEntropyLoss(prob_at_soft_cls, sampled_cond_labels)
148
+ # pred_at_soft_cls = torch.argmax(prob_at_soft_cls, dim=-1)
149
+ # acc_at_soft_cls = (pred_at_soft_cls == sampled_cond_labels).float()
150
+
151
+ # Loss
152
+ loss_latent_space = (loss_encoder + loss_lsc) + (loss_lsd + loss_lsg) + self.args.beta_cls * loss_cls # + loss_at_soft_cls
153
+ loss = loss_rec + 0.0 * loss_latent_space
154
+
155
+ if not self.training:
156
+ # Generate based on encoded z and gt labels
157
+ generated = self.sample_sequence_conditional_batch(past=past, context=self.bos_token_id_list)
158
+
159
+ # Generate based on encoded z and sampled labels (attribute transfer)
160
+ # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
161
+ at_past = past_z + past_sampled_label # (B, n_blocks * hidden_size)
162
+ at_generated = self.sample_sequence_conditional_batch(past=at_past, context=self.bos_token_id_list) # (B, seq_len)
163
+
164
+ # Generate based on sampled z and sampled labels. (conditional generation)
165
+ # cg_past = torch.cat([gen_past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
166
+ cg_past = gen_past_z + past_sampled_label # (B, n_blocks * hidden_size)
167
+ cg_generated = self.sample_sequence_conditional_batch(past=cg_past, context=self.bos_token_id_list) # (B, seq_len)
168
+
169
+ # classifier on gt generated sentences.
170
+ ge_emb = self.gpt_embeddings(generated)
171
+ ge_encode = self.conv1(ge_emb.transpose(1, 2)) # (B, dim_h, seq_len)
172
+ ge_encode = torch.mean(ge_encode, dim=-1) # (B, dim_h)
173
+ prob_ge_cls = self.classifier(ge_encode) # (B, 1)
174
+
175
+ if self.args.label_size <= 2:
176
+ pred_ge_cls = (prob_ge_cls.squeeze(1) >= 0).to(torch.long)
177
+ else:
178
+ pred_ge_cls = torch.argmax(prob_ge_cls, dim=-1)
179
+ acc_ge_cls = (pred_ge_cls == cond_labels).float()
180
+
181
+ # classifier on attribute transfer generated sentences.
182
+ at_emb = self.gpt_embeddings(at_generated)
183
+ at_encode = self.conv1(at_emb.transpose(1, 2)) # (B, dim_h, seq_len)
184
+ at_encode = torch.mean(at_encode, dim=-1) # (B, dim_h)
185
+ prob_at_cls = self.classifier(at_encode) # (B, 1)
186
+ if self.args.label_size <= 2:
187
+ pred_at_cls = (prob_at_cls.squeeze(1) >= 0).to(torch.long)
188
+ else:
189
+ pred_at_cls = torch.argmax(prob_at_cls, dim=-1)
190
+ acc_at_cls = (pred_at_cls == sampled_cond_labels).float()
191
+
192
+ # classifier on conditional generated sentences.
193
+ cg_emb = self.gpt_embeddings(cg_generated)
194
+ cg_encode = self.conv1(cg_emb.transpose(1, 2)) # (B, dim_h, seq_len)
195
+ cg_encode = torch.mean(cg_encode, dim=-1) # (B, dim_h)
196
+ prob_cg_cls = self.classifier(cg_encode) # (B, 1)
197
+ if self.args.label_size <= 2:
198
+ pred_cg_cls = (prob_cg_cls.squeeze(1) >= 0).to(torch.long)
199
+ else:
200
+ pred_cg_cls = torch.argmax(prob_cg_cls, dim=-1)
201
+ acc_cg_cls = (pred_cg_cls == sampled_cond_labels).float()
202
+
203
+ result = {
204
+ 'sampled_cond_labels': sampled_cond_labels,
205
+ 'cond_labels': cond_labels,
206
+
207
+ 'tgt_seq_ids': tgt_seq_ids,
208
+ 'generated': generated,
209
+ 'at_generated': at_generated,
210
+ 'cg_generated': cg_generated,
211
+
212
+ 'acc_encode_z_dis': acc_encode_z_dis,
213
+ 'acc_gen_z_dis': acc_gen_z_dis,
214
+ 'acc_encode_z_cls': acc_encode_z_cls,
215
+ 'acc_cls': acc_cls,
216
+ 'acc_ge_cls': acc_ge_cls,
217
+ 'acc_at_cls': acc_at_cls,
218
+ 'acc_cg_cls': acc_cg_cls,
219
+
220
+ 'pred_cls': pred_cls,
221
+ 'pred_ge_cls': pred_ge_cls,
222
+ 'pred_at_cls': pred_at_cls,
223
+ 'pred_cg_cls': pred_cg_cls,
224
+ }
225
+
226
+ return result
227
+
228
+ loss_dict = {
229
+ 'loss': loss,
230
+ 'loss_rec': loss_rec,
231
+ 'loss_encoder': loss_encoder,
232
+ 'loss_lsc': loss_lsc,
233
+ 'loss_lsd': loss_lsd,
234
+ 'loss_lsg': loss_lsg,
235
+ 'loss_cls': loss_cls,
236
+ # 'loss_at_soft_cls': loss_at_soft_cls,
237
+ }
238
+ acc_dict = {
239
+ 'acc_encode_z_dis': acc_encode_z_dis,
240
+ 'acc_gen_z_dis': acc_gen_z_dis,
241
+ 'acc_encode_z_cls': acc_encode_z_cls,
242
+ 'acc_cls': acc_cls,
243
+ # 'acc_at_soft_cls': acc_at_soft_cls,
244
+ }
245
+ return loss_dict, acc_dict
246
+
247
+ def sample_sequence_conditional_batch(self, past, context):
248
+ # context: a single id of <BOS>
249
+ # past: (B, past_seq_len dim_h)
250
+ num_samples = past.size(0)
251
+ context = torch.tensor(context, dtype=torch.long, device=past.device)
252
+ context = context.unsqueeze(0).repeat(num_samples, 1)
253
+ generated = context # (B, 1)
254
+
255
+ # with torch.no_grad():
256
+ while generated.size(-1) < self.args.block_size:
257
+ inputs = {'input_ids': generated, 'past': past}
258
+ outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
259
+ lm_logits = outputs[0]
260
+
261
+ # softmax sample
262
+ next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size)
263
+ filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size)
264
+ filtered_logits = F.softmax(filtered_logits, dim=-1)
265
+ next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1)
266
+ generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1)
267
+
268
+ not_finished = next_tokens != self.tokenizer_decoder.encode('<EOS>')[0]
269
+ if torch.sum(not_finished) == 0:
270
+ break
271
+
272
+ return generated # (B, seq_len)
273
+
274
+ def top_k_top_p_filtering_batch(self, logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
275
+ """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
276
+ Args:
277
+ logits: logits distribution shape (vocabulary size)
278
+ top_k > 0: keep only top k tokens with highest probability (top-k filtering).
279
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
280
+ Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
281
+ From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
282
+ """
283
+ # assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
284
+
285
+ top_k = min(top_k, logits.size(-1)) # Safety check
286
+
287
+ if top_k > 0:
288
+ # Remove all tokens with a probability less than the last token of the top-k
289
+ threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None]
290
+ logits.masked_fill_(logits < threshold, filter_value) # (B, vocab_size)
291
+
292
+ if top_p > 0.0:
293
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True) # (B, vocab_size)
294
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) # (B, vocab_size)
295
+
296
+ # Remove tokens with cumulative probability above the threshold
297
+ sorted_indices_to_remove = cumulative_probs > top_p
298
+
299
+ # Shift the indices to the right to keep also the first token above the threshold
300
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
301
+ sorted_indices_to_remove[..., 0] = 0
302
+
303
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
304
+
305
+ logits.masked_fill_(indices_to_remove, filter_value)
306
+
307
+ return logits
308
+
309
+ def sample_sequence_conditional_batch_soft(self, past, context):
310
+ # context: a single id of <BOS>
311
+ # past: (B, past_seq_len dim_h)
312
+ num_samples = past.size(0)
313
+ context = torch.tensor(context, dtype=torch.long, device=past.device).unsqueeze(0).repeat(num_samples, 1) # (B, 1)
314
+ context_soft = torch.FloatTensor(num_samples, self.decoder.config.vocab_size).zero_().to(device=past.device) # (B, vocab_size)
315
+ context_soft.scatter_(1, context, 1) # (B, vocab_size)
316
+ generated_soft = context_soft.unsqueeze(1) # (B, 1, vocab_size)
317
+
318
+ # with torch.no_grad():
319
+ while generated_soft.size(1) < self.args.block_size: # generated_soft: (B, seq_len, vocab_size)
320
+ inputs = {'soft_ids': generated_soft, 'past': past}
321
+ outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
322
+ lm_logits = outputs[0] # (B, seq_len, vocab_size)
323
+
324
+ # Gumbel softmax sample
325
+ next_tokens_soft = gumbel_softmax(logits=lm_logits[:, -1:, :], temperature=self.args.soft_temperature, hard=False) # (B, 1, vocab_size)
326
+ generated_soft = torch.cat((generated_soft, next_tokens_soft), dim=1) # (B, seq_len+1, vocab_size)
327
+
328
+ # # softmax sample
329
+ # next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size)
330
+ # filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size)
331
+ # filtered_logits = F.softmax(filtered_logits, dim=-1)
332
+ # next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1)
333
+ # generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1)
334
+
335
+ next_tokens = torch.argmax(next_tokens_soft, dim=-1) # (B, 1)
336
+ not_finished = next_tokens != self.tokenizer_decoder.encode('<EOS>')[0]
337
+ if torch.sum(not_finished) == 0:
338
+ break
339
+
340
+ return generated_soft # (B, seq_len, vocab_size)
341
+
342
+
343
+ ### Gumbel Softmax
344
+ def gumbel_softmax(logits, temperature, hard=False):
345
+ """Sample from the Gumbel-Softmax distribution and optionally discretize.
346
+ Args:
347
+ logits: [..., n_class] unnormalized log-probs
348
+ temperature: non-negative scalar
349
+ hard: if True, take argmax, but differentiate w.r.t. soft sample y
350
+ Returns:
351
+ [..., n_class] sample from the Gumbel-Softmax distribution.
352
+ If hard=True, then the returned sample will be one-hot, otherwise it will be a probability distribution that sums to 1 across classes
353
+ """
354
+ y = gumbel_softmax_sample(logits, temperature) # (..., n_class)
355
+
356
+ if hard: # return onehot
357
+ shape = y.size()
358
+ _, ind = y.max(dim=-1)
359
+ y_hard = torch.zeros_like(y).view(-1, shape[-1])
360
+ y_hard.scatter_(1, ind.view(-1, 1), 1) # one hot
361
+ y_hard = y_hard.view(*shape)
362
+ # Set gradients w.r.t. y_hard gradients w.r.t. y
363
+ y = (y_hard - y).detach() + y
364
+
365
+ return y # (..., n_class)
366
+
367
+ from torch.nn import functional as F
368
+ def gumbel_softmax_sample(logits, temperature):
369
+ y = logits + sample_gumbel(logits.size(), logits.device)
370
+ return F.softmax(y / temperature, dim=-1)
371
+
372
+ def sample_gumbel(shape, device, eps=1e-20):
373
+ U = torch.rand(shape).to(device=device)
374
+ return -torch.log(-torch.log(U + eps) + eps)
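
Note (illustrative sketch, not part of the commit): the module-level gumbel_softmax, gumbel_softmax_sample, and sample_gumbel helpers above implement the straight-through Gumbel-Softmax estimator. A minimal self-contained check of the same trick, with assumed batch size, vocabulary size, and temperature, might look like this:

import torch
import torch.nn.functional as F

def sample_gumbel(shape, device, eps=1e-20):
    U = torch.rand(shape, device=device)
    return -torch.log(-torch.log(U + eps) + eps)

def gumbel_softmax(logits, temperature, hard=False):
    y = F.softmax((logits + sample_gumbel(logits.size(), logits.device)) / temperature, dim=-1)
    if hard:
        # forward pass uses the one-hot argmax, backward pass uses the soft sample
        index = y.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y).scatter_(-1, index, 1.0)
        y = (y_hard - y).detach() + y
    return y

logits = torch.randn(4, 1, 100, requires_grad=True)   # assumed (B, 1, vocab_size)
sample = gumbel_softmax(logits, temperature=0.5, hard=True)
sample.sum().backward()                                # gradients reach logits through the soft path
print(sample.sum(dim=-1))                              # each slice sums to 1 (one-hot in the forward pass)
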
Optimus/code/examples/big_ae/modules/ctrl_gen.py ADDED
@@ -0,0 +1,371 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+ from .utils import log_sum_exp
5
+ import pdb
6
+ import sys
7
+ sys.path.append('../../')
8
+ from pytorch_transformers.modeling_bert import BertEmbeddings
9
+ import torch.nn.functional as F
10
+
11
+
12
+ class Ctrl_Gen(nn.Module):
13
+ def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): #
14
+ super(Ctrl_Gen, self).__init__()
15
+ self.encoder = encoder
16
+ self.decoder = decoder
17
+ self.tokenizer_encoder = tokenizer_encoder
18
+ self.tokenizer_decoder = tokenizer_decoder
19
+
20
+ self.args = args
21
+ self.nz = args.latent_size
22
+
23
+ self.bos_token_id_list = self.tokenizer_decoder.encode(self.tokenizer_decoder.bos_token)
24
+ self.pad_token_id = self.tokenizer_decoder.encode(self.tokenizer_decoder.pad_token)[0]
25
+
26
+ # connector: from Bert hidden units to the latent space
27
+ self.linear = nn.Linear(encoder.config.hidden_size, self.nz, bias=False)
28
+
29
+ # # Standard Normal prior
30
+ # loc = torch.zeros(self.nz, device=args.device)
31
+ # scale = torch.ones(self.nz, device=args.device)
32
+ # self.prior = torch.distributions.normal.Normal(loc, scale)
33
+
34
+ self.label_embedding = nn.Embedding(args.label_size, self.nz, padding_idx=0) # use the same size as latent_z so as to use the same decoder.linear()
35
+ self.latent_generator = nn.Linear(self.nz, self.nz)
36
+ self.latent_classifier = nn.Linear(self.nz, args.label_size if args.label_size > 2 else 1)
37
+ self.latent_discriminator = nn.Linear(self.nz, 1)
38
+
39
+ self.gpt_embeddings = nn.Embedding(self.decoder.config.vocab_size, self.decoder.config.n_embd)
40
+ self.gpt_embeddings.weight.data = decoder.transformer.wte.weight.data
41
+
42
+ self.conv1 = nn.Conv1d(self.encoder.config.hidden_size, self.encoder.config.hidden_size, 3)
43
+ self.classifier = nn.Linear(self.encoder.config.hidden_size, 1 if args.label_size <= 2 else args.label_size)
44
+
45
+ self.CrossEntropyLoss = torch.nn.CrossEntropyLoss()
46
+ self.BCEWithLogitsLoss = torch.nn.BCEWithLogitsLoss()
47
+
48
+ def forward(self, input_seq_ids, tgt_seq_ids, cond_labels, attention_mask):
49
+ # inputs: (B, seq_len)
50
+ # labels: (B, seq_len)
51
+ # cond_labels: (B), conditional labels.
52
+
53
+ ones_label = torch.ones_like(cond_labels).to(dtype=torch.float32)
54
+ zeros_label = torch.zeros_like(cond_labels).to(dtype=torch.float32)
55
+ random_noise = torch.nn.init.normal_(torch.empty(input_seq_ids.size(0), self.nz)).to(device=input_seq_ids.device, dtype=torch.float32)
56
+
57
+ # Encode inputs
58
+ outputs = self.encoder(input_seq_ids, attention_mask=attention_mask)
59
+ pooled_hidden_fea = outputs[1] # (B, dim_h)
60
+
61
+ # Encode z
62
+ latent_z = self.linear(pooled_hidden_fea) # (B, nz)
63
+
64
+ # Generate z
65
+ gen_z = self.latent_generator(random_noise) # (B, nz)
66
+
67
+ # Latent discriminator
68
+ prob_encode_z_dis = self.latent_discriminator(latent_z).squeeze(1).float() # (B)
69
+ prob_gen_z_dis = self.latent_discriminator(gen_z).squeeze(1).float() # (B)
70
+ # Train latent discriminator
71
+ loss_lsd = self.BCEWithLogitsLoss(prob_gen_z_dis, zeros_label) + self.BCEWithLogitsLoss(prob_encode_z_dis, ones_label)
72
+ acc_encode_z_dis = ((prob_encode_z_dis >= 0).float() == ones_label).float()
73
+ acc_gen_z_dis = ((prob_gen_z_dis >= 0).float() == zeros_label).float()
74
+ # Train sampler adversarially
75
+ loss_lsg = self.BCEWithLogitsLoss(prob_gen_z_dis, ones_label)
76
+
77
+ # Latent classifier
78
+ prob_encode_z_cls = self.latent_classifier(latent_z) # (B, n_labels)
79
+ if self.args.label_size <= 2:
80
+ prob_encode_z_cls = prob_encode_z_cls.squeeze(1) # (B)
81
+ # Train latent classifier
82
+ loss_lsc = self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float())
83
+ acc_encode_z_cls = ((prob_encode_z_cls >= 0).float() == cond_labels.float()).float()
84
+ # Train encoder adversarially
85
+ loss_encoder = 1 - self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float())
86
+ else:
87
+ # Train latent classifier
88
+ loss_lsc = self.CrossEntropyLoss(prob_encode_z_cls, cond_labels)
89
+ acc_encode_z_cls = (torch.argmax(prob_encode_z_cls, dim=-1) == cond_labels).float()
90
+ # Train encoder adversarially
91
+ loss_encoder = 1 - self.CrossEntropyLoss(prob_encode_z_cls, cond_labels)
92
+
93
+ # Embed labels
94
+ label_emb = self.label_embedding(cond_labels) # (B, hidden_size)
95
+ # past_label = self.decoder.linear(label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now.
96
+ if self.args.label_size <= 2:
97
+ sampled_cond_labels = 1 - cond_labels
98
+ else:
99
+ raise NotImplementedError # todo: currently only implemented for binary labels. need to change for multi-class labels.
100
+ sampled_label_emb = self.label_embedding(sampled_cond_labels) # (B, hidden_size)
101
+ # past_sampled_label = self.decoder.linear(sampled_label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now.
102
+ past_sampled_label = sampled_label_emb
103
+
104
+ # Generate based on encoded z and gt labels. (reconstruction)
105
+ # past_z = self.decoder.linear(latent_z) # (B, n_blocks * hidden_size)
106
+ past_z = latent_z
107
+ # gen_past_z = self.decoder.linear(gen_z) # (B, n_blocks * hidden_size)
108
+ gen_past_z = gen_z # (B, n_blocks * hidden_size)
109
+
110
+ # past = torch.cat([past_z.unsqueeze(1), past_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
111
+
112
+ past = latent_z + label_emb # (B, n_blocks * hidden_size)
113
+
114
+ outputs = self.decoder(input_ids=tgt_seq_ids, past=past, labels=tgt_seq_ids, label_ignore=self.pad_token_id)
115
+ loss_rec = outputs[0]
116
+
117
+ # Train a classifier in the observation space
118
+ tgt_emb = self.gpt_embeddings(tgt_seq_ids)
119
+ tgt_encode = self.conv1(tgt_emb.transpose(1, 2)) # (B, dim_h, seq_len)
120
+ tgt_encode = torch.mean(tgt_encode, dim=-1) # (B, dim_h)
121
+ prob_cls = self.classifier(tgt_encode) # (B, n_labels)
122
+ if self.args.label_size <= 2:
123
+ prob_cls = prob_cls.squeeze(1)
124
+ loss_cls = self.BCEWithLogitsLoss(prob_cls, cond_labels.float())
125
+ pred_cls = (prob_cls >= 0).to(dtype=torch.long)
126
+ else:
127
+ loss_cls = self.CrossEntropyLoss(prob_cls, cond_labels)
128
+ pred_cls = torch.argmax(prob_cls, dim=-1)
129
+ acc_cls = (pred_cls == cond_labels).float()
130
+
131
+ # Generate based on encoded z and sampled labels (attribute transfer)
132
+ # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
133
+ # at_generated_soft = self.sample_sequence_conditional_batch_soft(past=at_past, context=self.bos_token_id_list) # (B, seq_len, vocab_size)
134
+
135
+ # # Classifier on attribute transfer generated sentences. Train Generator on attribute transfer.
136
+ # at_soft_emb = torch.matmul(at_generated_soft, self.gpt_embeddings.weight)
137
+ # at_soft_encode = self.conv1(at_soft_emb.transpose(1, 2)) # (B, dim_h, seq_len)
138
+ # at_soft_encode = torch.mean(at_soft_encode, dim=-1) # (B, dim_h)
139
+ # prob_at_soft_cls = self.classifier(at_soft_encode) # (B, 1)
140
+ # if self.args.label_size <= 2:
141
+ # prob_at_soft_cls = prob_at_soft_cls.squeeze(1)
142
+ # loss_at_soft_cls = self.BCEWithLogitsLoss(prob_at_soft_cls, sampled_cond_labels.float())
143
+ # pred_at_soft_cls = (prob_at_soft_cls >= 0).to(torch.long)
144
+ # else:
145
+ # loss_at_soft_cls = self.CrossEntropyLoss(prob_at_soft_cls, sampled_cond_labels)
146
+ # pred_at_soft_cls = torch.argmax(prob_at_soft_cls, dim=-1)
147
+ # acc_at_soft_cls = (pred_at_soft_cls == sampled_cond_labels).float()
148
+
149
+ # Loss
150
+ loss = loss_rec + loss_encoder + loss_lsc + loss_lsd + loss_lsg + self.args.beta_cls * loss_cls # + loss_at_soft_cls
151
+
152
+ if not self.training:
153
+ # Generate based on encoded z and gt labels
154
+ generated = self.sample_sequence_conditional_batch(past=past, context=self.bos_token_id_list)
155
+
156
+ # Generate based on encoded z and sampled labels (attribute transfer)
157
+ # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
158
+ at_past = past_z + past_sampled_label # (B, n_blocks * hidden_size)
159
+ at_generated = self.sample_sequence_conditional_batch(past=at_past, context=self.bos_token_id_list) # (B, seq_len)
160
+
161
+ # Generate based on sampled z and sampled labels. (conditional generation)
162
+ # cg_past = torch.cat([gen_past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size)
163
+ cg_past = gen_past_z + past_sampled_label # (B, n_blocks * hidden_size)
164
+ cg_generated = self.sample_sequence_conditional_batch(past=cg_past, context=self.bos_token_id_list) # (B, seq_len)
165
+
166
+ # classifier on gt generated sentences.
167
+ ge_emb = self.gpt_embeddings(generated)
168
+ ge_encode = self.conv1(ge_emb.transpose(1, 2)) # (B, dim_h, seq_len)
169
+ ge_encode = torch.mean(ge_encode, dim=-1) # (B, dim_h)
170
+ prob_ge_cls = self.classifier(ge_encode) # (B, 1)
171
+
172
+ if self.args.label_size <= 2:
173
+ pred_ge_cls = (prob_ge_cls.squeeze(1) >= 0).to(torch.long)
174
+ else:
175
+ pred_ge_cls = torch.argmax(prob_ge_cls, dim=-1)
176
+ acc_ge_cls = (pred_ge_cls == cond_labels).float()
177
+
178
+ # classifier on attribute transfer generated sentences.
179
+ at_emb = self.gpt_embeddings(at_generated)
180
+ at_encode = self.conv1(at_emb.transpose(1, 2)) # (B, dim_h, seq_len)
181
+ at_encode = torch.mean(at_encode, dim=-1) # (B, dim_h)
182
+ prob_at_cls = self.classifier(at_encode) # (B, 1)
183
+ if self.args.label_size <= 2:
184
+ pred_at_cls = (prob_at_cls.squeeze(1) >= 0).to(torch.long)
185
+ else:
186
+ pred_at_cls = torch.argmax(prob_at_cls, dim=-1)
187
+ acc_at_cls = (pred_at_cls == sampled_cond_labels).float()
188
+
189
+ # classifier on conditional generated sentences.
190
+ cg_emb = self.gpt_embeddings(cg_generated)
191
+ cg_encode = self.conv1(cg_emb.transpose(1, 2)) # (B, dim_h, seq_len)
192
+ cg_encode = torch.mean(cg_encode, dim=-1) # (B, dim_h)
193
+ prob_cg_cls = self.classifier(cg_encode) # (B, 1)
194
+ if self.args.label_size <= 2:
195
+ pred_cg_cls = (prob_cg_cls.squeeze(1) >= 0).to(torch.long)
196
+ else:
197
+ pred_cg_cls = torch.argmax(prob_cg_cls, dim=-1)
198
+ acc_cg_cls = (pred_cg_cls == sampled_cond_labels).float()
199
+
200
+ result = {
201
+ 'sampled_cond_labels': sampled_cond_labels,
202
+ 'cond_labels': cond_labels,
203
+
204
+ 'tgt_seq_ids': tgt_seq_ids,
205
+ 'generated': generated,
206
+ 'at_generated': at_generated,
207
+ 'cg_generated': cg_generated,
208
+
209
+ 'acc_encode_z_dis': acc_encode_z_dis,
210
+ 'acc_gen_z_dis': acc_gen_z_dis,
211
+ 'acc_encode_z_cls': acc_encode_z_cls,
212
+ 'acc_cls': acc_cls,
213
+ 'acc_ge_cls': acc_ge_cls,
214
+ 'acc_at_cls': acc_at_cls,
215
+ 'acc_cg_cls': acc_cg_cls,
216
+
217
+ 'pred_cls': pred_cls,
218
+ 'pred_ge_cls': pred_ge_cls,
219
+ 'pred_at_cls': pred_at_cls,
220
+ 'pred_cg_cls': pred_cg_cls,
221
+ }
222
+
223
+ return result
224
+
225
+ loss_dict = {
226
+ 'loss': loss,
227
+ 'loss_rec': loss_rec,
228
+ 'loss_encoder': loss_encoder,
229
+ 'loss_lsc': loss_lsc,
230
+ 'loss_lsd': loss_lsd,
231
+ 'loss_lsg': loss_lsg,
232
+ 'loss_cls': loss_cls,
233
+ # 'loss_at_soft_cls': loss_at_soft_cls,
234
+ }
235
+ acc_dict = {
236
+ 'acc_encode_z_dis': acc_encode_z_dis,
237
+ 'acc_gen_z_dis': acc_gen_z_dis,
238
+ 'acc_encode_z_cls': acc_encode_z_cls,
239
+ 'acc_cls': acc_cls,
240
+ # 'acc_at_soft_cls': acc_at_soft_cls,
241
+ }
242
+ return loss_dict, acc_dict
243
+
244
+ def sample_sequence_conditional_batch(self, past, context):
245
+ # context: a single id of <BOS>
246
+ # past: (B, past_seq_len, dim_h)
247
+ num_samples = past.size(0)
248
+ context = torch.tensor(context, dtype=torch.long, device=past.device)
249
+ context = context.unsqueeze(0).repeat(num_samples, 1)
250
+ generated = context # (B, 1)
251
+
252
+ # with torch.no_grad():
253
+ while generated.size(-1) < self.args.block_size:
254
+ inputs = {'input_ids': generated, 'past': past}
255
+ outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
256
+ lm_logits = outputs[0]
257
+
258
+ # softmax sample
259
+ next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, vocab_size)
260
+ filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, vocab_size)
261
+ filtered_logits = F.softmax(filtered_logits, dim=-1)
262
+ next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1)
263
+ generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1)
264
+
265
+ not_finished = next_tokens != self.tokenizer_decoder.encode('<EOS>')[0]
266
+ if torch.sum(not_finished) == 0:
267
+ break
268
+
269
+ return generated # (B, seq_len)
270
+
271
+ def top_k_top_p_filtering_batch(self, logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
272
+ """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
273
+ Args:
274
+ logits: logits distribution of shape (batch size, vocabulary size)
275
+ top_k > 0: keep only top k tokens with highest probability (top-k filtering).
276
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
277
+ Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
278
+ From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
279
+ """
280
+ # assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
281
+
282
+ top_k = min(top_k, logits.size(-1)) # Safety check
283
+
284
+ if top_k > 0:
285
+ # Remove all tokens with a probability less than the last token of the top-k
286
+ threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None]
287
+ logits.masked_fill_(logits < threshold, filter_value) # (B, vocab_size)
288
+
289
+ if top_p > 0.0:
290
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True) # (B, vocab_size)
291
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) # (B, vocab_size)
292
+
293
+ # Remove tokens with cumulative probability above the threshold
294
+ sorted_indices_to_remove = cumulative_probs > top_p
295
+
296
+ # Shift the indices to the right to keep also the first token above the threshold
297
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
298
+ sorted_indices_to_remove[..., 0] = 0
299
+
300
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
301
+
302
+ logits.masked_fill_(indices_to_remove, filter_value)
303
+
304
+ return logits
305
+
306
+ def sample_sequence_conditional_batch_soft(self, past, context):
307
+ # context: a single id of <BOS>
308
+ # past: (B, past_seq_len, dim_h)
309
+ num_samples = past.size(0)
310
+ context = torch.tensor(context, dtype=torch.long, device=past.device).unsqueeze(0).repeat(num_samples, 1) # (B, 1)
311
+ context_soft = torch.FloatTensor(num_samples, self.decoder.config.vocab_size).zero_().to(device=past.device) # (B, vocab_size)
312
+ context_soft.scatter_(1, context, 1) # (B, vocab_size)
313
+ generated_soft = context_soft.unsqueeze(1) # (B, 1, vocab_size)
314
+
315
+ # with torch.no_grad():
316
+ while generated_soft.size(1) < self.args.block_size: # generated_soft: (B, seq_len, vocab_size)
317
+ inputs = {'soft_ids': generated_soft, 'past': past}
318
+ outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
319
+ lm_logits = outputs[0] # (B, seq_len, vocab_size)
320
+
321
+ # Gumbel softmax sample
322
+ next_tokens_soft = gumbel_softmax(logits=lm_logits[:, -1:, :], temperature=self.args.soft_temperature, hard=False) # (B, 1, vocab_size)
323
+ generated_soft = torch.cat((generated_soft, next_tokens_soft), dim=1) # (B, seq_len+1, vocab_size)
324
+
325
+ # # softmax sample
326
+ # next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size)
327
+ # filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size)
328
+ # filtered_logits = F.softmax(filtered_logits, dim=-1)
329
+ # next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1)
330
+ # generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1)
331
+
332
+ next_tokens = torch.argmax(next_tokens_soft, dim=-1) # (B, 1)
333
+ not_finished = next_tokens != self.tokenizer_decoder.encode('<EOS>')[0]
334
+ if torch.sum(not_finished) == 0:
335
+ break
336
+
337
+ return generated_soft # (B, seq_len, vocab_size)
338
+
339
+
340
+ ### Gumbel Softmax
341
+ def gumbel_softmax(logits, temperature, hard=False):
342
+ """Sample from the Gumbel-Softmax distribution and optionally discretize.
343
+ Args:
344
+ logits: [..., n_class] unnormalized log-probs
345
+ temperature: non-negative scalar
346
+ hard: if True, take argmax, but differentiate w.r.t. soft sample y
347
+ Returns:
348
+ [..., n_class] sample from the Gumbel-Softmax distribution.
349
+ If hard=True, then the returned sample will be one-hot, otherwise it will be a probability distribution that sums to 1 across classes
350
+ """
351
+ y = gumbel_softmax_sample(logits, temperature) # (..., n_class)
352
+
353
+ if hard: # return onehot
354
+ shape = y.size()
355
+ _, ind = y.max(dim=-1)
356
+ y_hard = torch.zeros_like(y).view(-1, shape[-1])
357
+ y_hard.scatter_(1, ind.view(-1, 1), 1) # one hot
358
+ y_hard = y_hard.view(*shape)
359
+ # Set gradients w.r.t. y_hard gradients w.r.t. y
360
+ y = (y_hard - y).detach() + y
361
+
362
+ return y # (..., n_class)
363
+
364
+ from torch.nn import functional as F
365
+ def gumbel_softmax_sample(logits, temperature):
366
+ y = logits + sample_gumbel(logits.size(), logits.device)
367
+ return F.softmax(y / temperature, dim=-1)
368
+
369
+ def sample_gumbel(shape, device, eps=1e-20):
370
+ U = torch.rand(shape).to(device=device)
371
+ return -torch.log(-torch.log(U + eps) + eps)
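
Note (illustrative sketch, not part of the commit): the top-p branch of top_k_top_p_filtering_batch above selects indices with a boolean mask; a commonly used batched variant instead maps the removal mask back to vocabulary order per row. The sizes and hyper-parameters below are assumptions, not values from the repo:

import torch
import torch.nn.functional as F

def top_k_top_p_filtering_batch(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
    # logits: (B, vocab_size)
    top_k = min(top_k, logits.size(-1))
    if top_k > 0:
        threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None]
        logits = logits.masked_fill(logits < threshold, filter_value)
    if top_p > 0.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        sorted_indices_to_remove = cumulative_probs > top_p
        # shift right so the first token above the threshold is kept
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0
        # map the removal mask from sorted order back to vocabulary order
        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
        logits = logits.masked_fill(indices_to_remove, filter_value)
    return logits

logits = torch.randn(8, 50257)                                   # assumed (B, vocab_size)
filtered = top_k_top_p_filtering_batch(logits, top_k=50, top_p=0.9)
next_tokens = torch.multinomial(F.softmax(filtered, dim=-1), 1)  # (B, 1)
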
Optimus/code/examples/big_ae/modules/decoders/dec_gpt2.py ADDED
@@ -0,0 +1,358 @@
1
+ # import torch
2
+
3
+ import time
4
+ import argparse
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+
10
+ from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
11
+
12
+ import numpy as np
13
+
14
+ from .decoder import DecoderBase
15
+
16
+ class LSTMDecoder(DecoderBase):
17
+ """LSTM decoder with constant-length data"""
18
+ def __init__(self, args, vocab, model_init, emb_init):
19
+ super(LSTMDecoder, self).__init__()
20
+ self.ni = args.ni
21
+ self.nh = args.dec_nh
22
+ self.nz = args.nz
23
+ self.vocab = vocab
24
+ self.device = args.device
25
+
26
+ # no padding when setting padding_idx to -1
27
+ self.embed = nn.Embedding(len(vocab), args.ni, padding_idx=-1)
28
+
29
+ self.dropout_in = nn.Dropout(args.dec_dropout_in)
30
+ self.dropout_out = nn.Dropout(args.dec_dropout_out)
31
+
32
+ # for initializing hidden state and cell
33
+ self.trans_linear = nn.Linear(args.nz, args.dec_nh, bias=False)
34
+
35
+ # concatenate z with input
36
+ self.lstm = nn.LSTM(input_size=args.ni + args.nz,
37
+ hidden_size=args.dec_nh,
38
+ num_layers=1,
39
+ batch_first=True)
40
+
41
+ # prediction layer
42
+ self.pred_linear = nn.Linear(args.dec_nh, len(vocab), bias=False)
43
+
44
+ vocab_mask = torch.ones(len(vocab))
45
+ # vocab_mask[vocab['<pad>']] = 0
46
+ self.loss = nn.CrossEntropyLoss(weight=vocab_mask, reduction='none')
47
+
48
+ self.reset_parameters(model_init, emb_init)
49
+
50
+ def reset_parameters(self, model_init, emb_init):
51
+ # for name, param in self.lstm.named_parameters():
52
+ # # self.initializer(param)
53
+ # if 'bias' in name:
54
+ # nn.init.constant_(param, 0.0)
55
+ # # model_init(param)
56
+ # elif 'weight' in name:
57
+ # model_init(param)
58
+
59
+ # model_init(self.trans_linear.weight)
60
+ # model_init(self.pred_linear.weight)
61
+ for param in self.parameters():
62
+ model_init(param)
63
+ emb_init(self.embed.weight)
64
+
65
+ def sample_text(self, input, z, EOS, device):
66
+ sentence = [input]
67
+ max_index = 0
68
+
69
+ input_word = input
70
+ batch_size, n_sample, _ = z.size()
71
+ seq_len = 1
72
+ z_ = z.expand(batch_size, seq_len, self.nz)
73
+ seq_len = input.size(1)
74
+ softmax = torch.nn.Softmax(dim=0)
75
+ while max_index != EOS and len(sentence) < 100:
76
+ # (batch_size, seq_len, ni)
77
+ word_embed = self.embed(input_word)
78
+ word_embed = torch.cat((word_embed, z_), -1)
79
+ c_init = self.trans_linear(z).unsqueeze(0)
80
+ h_init = torch.tanh(c_init)
81
+ if len(sentence) == 1:
82
+ h_init = h_init.squeeze(dim=1)
83
+ c_init = c_init.squeeze(dim=1)
84
+ output, hidden = self.lstm.forward(word_embed, (h_init, c_init))
85
+ else:
86
+ output, hidden = self.lstm.forward(word_embed, hidden)
87
+ # (batch_size * n_sample, seq_len, vocab_size)
88
+ output_logits = self.pred_linear(output)
89
+ output_logits = output_logits.view(-1)
90
+ probs = softmax(output_logits)
91
+ # max_index = torch.argmax(output_logits)
92
+ max_index = torch.multinomial(probs, num_samples=1)
93
+ input_word = torch.tensor([[max_index]]).to(device)
94
+ sentence.append(max_index)
95
+ return sentence
96
+
97
+ def decode(self, input, z):
98
+ """
99
+ Args:
100
+ input: (batch_size, seq_len)
101
+ z: (batch_size, n_sample, nz)
102
+ """
103
+
104
+ # not predicting start symbol
105
+ # sents_len -= 1
106
+
107
+ batch_size, n_sample, _ = z.size()
108
+ seq_len = input.size(1)
109
+
110
+ # (batch_size, seq_len, ni)
111
+ word_embed = self.embed(input)
112
+ word_embed = self.dropout_in(word_embed)
113
+
114
+ if n_sample == 1:
115
+ z_ = z.expand(batch_size, seq_len, self.nz)
116
+
117
+ else:
118
+ word_embed = word_embed.unsqueeze(1).expand(batch_size, n_sample, seq_len, self.ni) \
119
+ .contiguous()
120
+
121
+ # (batch_size * n_sample, seq_len, ni)
122
+ word_embed = word_embed.view(batch_size * n_sample, seq_len, self.ni)
123
+
124
+ z_ = z.unsqueeze(2).expand(batch_size, n_sample, seq_len, self.nz).contiguous()
125
+ z_ = z_.view(batch_size * n_sample, seq_len, self.nz)
126
+
127
+ # (batch_size * n_sample, seq_len, ni + nz)
128
+ word_embed = torch.cat((word_embed, z_), -1)
129
+
130
+ z = z.view(batch_size * n_sample, self.nz)
131
+ c_init = self.trans_linear(z).unsqueeze(0)
132
+ h_init = torch.tanh(c_init)
133
+ # h_init = self.trans_linear(z).unsqueeze(0)
134
+ # c_init = h_init.new_zeros(h_init.size())
135
+ output, _ = self.lstm(word_embed, (h_init, c_init))
136
+
137
+ output = self.dropout_out(output)
138
+
139
+ # (batch_size * n_sample, seq_len, vocab_size)
140
+ output_logits = self.pred_linear(output)
141
+
142
+ return output_logits
143
+
144
+ def reconstruct_error(self, x, z):
145
+ """Cross Entropy in the language case
146
+ Args:
147
+ x: (batch_size, seq_len)
148
+ z: (batch_size, n_sample, nz)
149
+ Returns:
150
+ loss: (batch_size, n_sample). Loss
151
+ across different sentence and z
152
+ """
153
+
154
+ #remove end symbol
155
+ src = x[:, :-1]
156
+
157
+ # remove start symbol
158
+ tgt = x[:, 1:]
159
+
160
+ batch_size, seq_len = src.size()
161
+ n_sample = z.size(1)
162
+
163
+ # (batch_size * n_sample, seq_len, vocab_size)
164
+ output_logits = self.decode(src, z)
165
+
166
+ if n_sample == 1:
167
+ tgt = tgt.contiguous().view(-1)
168
+ else:
169
+ # (batch_size * n_sample * seq_len)
170
+ tgt = tgt.unsqueeze(1).expand(batch_size, n_sample, seq_len) \
171
+ .contiguous().view(-1)
172
+
173
+ # (batch_size * n_sample * seq_len)
174
+ loss = self.loss(output_logits.view(-1, output_logits.size(2)),
175
+ tgt)
176
+
177
+
178
+ # (batch_size, n_sample)
179
+ return loss.view(batch_size, n_sample, -1).sum(-1)
180
+
181
+
182
+ def log_probability(self, x, z):
183
+ """Cross Entropy in the language case
184
+ Args:
185
+ x: (batch_size, seq_len)
186
+ z: (batch_size, n_sample, nz)
187
+ Returns:
188
+ log_p: (batch_size, n_sample).
189
+ log_p(x|z) across different x and z
190
+ """
191
+
192
+ return -self.reconstruct_error(x, z)
193
+
194
+
195
+
196
+
197
+ def greedy_decode(self, z):
198
+ return self.sample_decode(z, greedy=True)
199
+
200
+ def sample_decode(self, z, greedy=False):
201
+ """sample/greedy decoding from z
202
+ Args:
203
+ z: (batch_size, nz)
204
+ Returns: List1
205
+ List1: the decoded word sentence list
206
+ """
207
+
208
+ batch_size = z.size(0)
209
+ decoded_batch = [[] for _ in range(batch_size)]
210
+
211
+ # (batch_size, 1, nz)
212
+ c_init = self.trans_linear(z).unsqueeze(0)
213
+ h_init = torch.tanh(c_init)
214
+
215
+ decoder_hidden = (h_init, c_init)
216
+ decoder_input = torch.tensor([self.vocab["<s>"]] * batch_size, dtype=torch.long, device=self.device).unsqueeze(1)
217
+ end_symbol = torch.tensor([self.vocab["</s>"]] * batch_size, dtype=torch.long, device=self.device)
218
+
219
+ mask = torch.ones((batch_size), dtype=torch.uint8, device=self.device)
220
+ length_c = 1
221
+ while mask.sum().item() != 0 and length_c < 100:
222
+
223
+ # (batch_size, 1, ni) --> (batch_size, 1, ni+nz)
224
+ word_embed = self.embed(decoder_input)
225
+ word_embed = torch.cat((word_embed, z.unsqueeze(1)), dim=-1)
226
+
227
+ output, decoder_hidden = self.lstm(word_embed, decoder_hidden)
228
+
229
+ # (batch_size, 1, vocab_size) --> (batch_size, vocab_size)
230
+ decoder_output = self.pred_linear(output)
231
+ output_logits = decoder_output.squeeze(1)
232
+
233
+ # (batch_size)
234
+ if greedy:
235
+ max_index = torch.argmax(output_logits, dim=1)
236
+ else:
237
+ probs = F.softmax(output_logits, dim=1)
238
+ max_index = torch.multinomial(probs, num_samples=1).squeeze(1)
239
+
240
+ decoder_input = max_index.unsqueeze(1)
241
+ length_c += 1
242
+
243
+ for i in range(batch_size):
244
+ word = self.vocab.id2word(max_index[i].item())
245
+ if mask[i].item():
246
+ decoded_batch[i].append(self.vocab.id2word(max_index[i].item()))
247
+
248
+ mask = torch.mul((max_index != end_symbol), mask)
249
+
250
+ return decoded_batch
251
+
252
+ class VarLSTMDecoder(LSTMDecoder):
253
+ """LSTM decoder with constant-length data"""
254
+ def __init__(self, args, vocab, model_init, emb_init):
255
+ super(VarLSTMDecoder, self).__init__(args, vocab, model_init, emb_init)
256
+
257
+ self.embed = nn.Embedding(len(vocab), args.ni, padding_idx=vocab['<pad>'])
258
+ vocab_mask = torch.ones(len(vocab))
259
+ vocab_mask[vocab['<pad>']] = 0
260
+ self.loss = nn.CrossEntropyLoss(weight=vocab_mask, reduction='none')
261
+
262
+ self.reset_parameters(model_init, emb_init)
263
+
264
+ def decode(self, input, z):
265
+ """
266
+ Args:
267
+ input: tuple which contains x and sents_len
268
+ x: (batch_size, seq_len)
269
+ sents_len: long tensor of sentence lengths
270
+ z: (batch_size, n_sample, nz)
271
+ """
272
+
273
+ input, sents_len = input
274
+
275
+ # not predicting start symbol
276
+ sents_len = sents_len - 1
277
+
278
+ batch_size, n_sample, _ = z.size()
279
+ seq_len = input.size(1)
280
+
281
+ # (batch_size, seq_len, ni)
282
+ word_embed = self.embed(input)
283
+ word_embed = self.dropout_in(word_embed)
284
+
285
+ if n_sample == 1:
286
+ z_ = z.expand(batch_size, seq_len, self.nz)
287
+
288
+ else:
289
+ word_embed = word_embed.unsqueeze(1).expand(batch_size, n_sample, seq_len, self.ni) \
290
+ .contiguous()
291
+
292
+ # (batch_size * n_sample, seq_len, ni)
293
+ word_embed = word_embed.view(batch_size * n_sample, seq_len, self.ni)
294
+
295
+ z_ = z.unsqueeze(2).expand(batch_size, n_sample, seq_len, self.nz).contiguous()
296
+ z_ = z_.view(batch_size * n_sample, seq_len, self.nz)
297
+
298
+ # (batch_size * n_sample, seq_len, ni + nz)
299
+ word_embed = torch.cat((word_embed, z_), -1)
300
+
301
+ sents_len = sents_len.unsqueeze(1).expand(batch_size, n_sample).contiguous().view(-1)
302
+ packed_embed = pack_padded_sequence(word_embed, sents_len.tolist(), batch_first=True)
303
+
304
+ z = z.view(batch_size * n_sample, self.nz)
305
+ # h_init = self.trans_linear(z).unsqueeze(0)
306
+ # c_init = h_init.new_zeros(h_init.size())
307
+ c_init = self.trans_linear(z).unsqueeze(0)
308
+ h_init = torch.tanh(c_init)
309
+ output, _ = self.lstm(packed_embed, (h_init, c_init))
310
+ output, _ = pad_packed_sequence(output, batch_first=True)
311
+
312
+ output = self.dropout_out(output)
313
+
314
+ # (batch_size * n_sample, seq_len, vocab_size)
315
+ output_logits = self.pred_linear(output)
316
+
317
+ return output_logits
318
+
319
+ def reconstruct_error(self, x, z):
320
+ """Cross Entropy in the language case
321
+ Args:
322
+ x: tuple which contains x_ and sents_len
323
+ x_: (batch_size, seq_len)
324
+ sents_len: long tensor of sentence lengths
325
+ z: (batch_size, n_sample, nz)
326
+ Returns:
327
+ loss: (batch_size, n_sample). Loss
328
+ across different sentence and z
329
+ """
330
+
331
+ x, sents_len = x
332
+
333
+ #remove end symbol
334
+ src = x[:, :-1]
335
+
336
+ # remove start symbol
337
+ tgt = x[:, 1:]
338
+
339
+ batch_size, seq_len = src.size()
340
+ n_sample = z.size(1)
341
+
342
+ # (batch_size * n_sample, seq_len, vocab_size)
343
+ output_logits = self.decode((src, sents_len), z)
344
+
345
+ if n_sample == 1:
346
+ tgt = tgt.contiguous().view(-1)
347
+ else:
348
+ # (batch_size * n_sample * seq_len)
349
+ tgt = tgt.unsqueeze(1).expand(batch_size, n_sample, seq_len) \
350
+ .contiguous().view(-1)
351
+
352
+ # (batch_size * n_sample * seq_len)
353
+ loss = self.loss(output_logits.view(-1, output_logits.size(2)),
354
+ tgt)
355
+
356
+
357
+ # (batch_size, n_sample)
358
+ return loss.view(batch_size, n_sample, -1).sum(-1)
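
Note (illustrative sketch, not part of the commit): reconstruct_error above pairs each target sentence with its n_sample latent draws by repeating the targets and flattening before the token-level cross entropy. A toy trace with assumed sizes:

import torch
import torch.nn as nn

batch_size, n_sample, seq_len, vocab_size = 2, 3, 5, 11
output_logits = torch.randn(batch_size * n_sample, seq_len, vocab_size)
tgt = torch.randint(0, vocab_size, (batch_size, seq_len))

loss_fn = nn.CrossEntropyLoss(reduction='none')
tgt_rep = tgt.unsqueeze(1).expand(batch_size, n_sample, seq_len).contiguous().view(-1)
loss = loss_fn(output_logits.view(-1, vocab_size), tgt_rep)   # (B * n_sample * seq_len,)
per_pair = loss.view(batch_size, n_sample, -1).sum(-1)        # (B, n_sample), summed over tokens
print(per_pair.shape)
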
Optimus/code/examples/big_ae/modules/decoders/decoder.py ADDED
@@ -0,0 +1,79 @@
1
+ import torch
2
+ import torch.nn as nn
3
+
4
+
5
+ class DecoderBase(nn.Module):
6
+ """docstring for Decoder"""
7
+ def __init__(self):
8
+ super(DecoderBase, self).__init__()
9
+
10
+
11
+ def freeze(self):
12
+ for param in self.parameters():
13
+ param.requires_grad = False
14
+
15
+ def decode(self, x, z):
16
+ """
17
+ Args:
18
+ x: (batch_size, seq_len)
19
+ z: (batch_size, n_sample, nz)
20
+ Returns: Tensor1
21
+ Tensor1: the output logits with size (batch_size * n_sample, seq_len, vocab_size)
22
+ """
23
+
24
+ raise NotImplementedError
25
+
26
+ def reconstruct_error(self, x, z):
27
+ """reconstruction loss
28
+ Args:
29
+ x: (batch_size, *)
30
+ z: (batch_size, n_sample, nz)
31
+ Returns:
32
+ loss: (batch_size, n_sample). Loss
33
+ across different sentence and z
34
+ """
35
+
36
+ raise NotImplementedError
37
+
38
+ def beam_search_decode(self, z, K):
39
+ """beam search decoding
40
+ Args:
41
+ z: (batch_size, nz)
42
+ K: the beam size
43
+ Returns: List1
44
+ List1: the decoded word sentence list
45
+ """
46
+
47
+ raise NotImplementedError
48
+
49
+ def sample_decode(self, z):
50
+ """sampling from z
51
+ Args:
52
+ z: (batch_size, nz)
53
+ Returns: List1
54
+ List1: the decoded word sentence list
55
+ """
56
+
57
+ raise NotImplementedError
58
+
59
+ def greedy_decode(self, z):
60
+ """greedy decoding from z
61
+ Args:
62
+ z: (batch_size, nz)
63
+ Returns: List1
64
+ List1: the decoded word sentence list
65
+ """
66
+
67
+ raise NotImplementedError
68
+
69
+ def log_probability(self, x, z):
70
+ """
71
+ Args:
72
+ x: (batch_size, *)
73
+ z: (batch_size, n_sample, nz)
74
+ Returns:
75
+ log_p: (batch_size, n_sample).
76
+ log_p(x|z) across different x and z
77
+ """
78
+
79
+ raise NotImplementedError
Optimus/code/examples/big_ae/modules/encoders/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .enc_lstm import *
Optimus/code/examples/big_ae/modules/encoders/enc_lstm.py ADDED
@@ -0,0 +1,126 @@
1
+ from itertools import chain
2
+ import math
3
+ import torch
4
+ import torch.nn as nn
5
+
6
+ from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
7
+ from .gaussian_encoder import GaussianEncoderBase
8
+ from ..utils import log_sum_exp
9
+
10
+ class GaussianLSTMEncoder(GaussianEncoderBase):
11
+ """Gaussian LSTM Encoder with constant-length input"""
12
+ def __init__(self, args, vocab_size, model_init, emb_init):
13
+ super(GaussianLSTMEncoder, self).__init__()
14
+ self.ni = args.ni
15
+ self.nh = args.enc_nh
16
+ self.nz = args.nz
17
+ self.args = args
18
+
19
+ self.embed = nn.Embedding(vocab_size, args.ni)
20
+
21
+ self.lstm = nn.LSTM(input_size=args.ni,
22
+ hidden_size=args.enc_nh,
23
+ num_layers=1,
24
+ batch_first=True,
25
+ dropout=0)
26
+
27
+ self.linear = nn.Linear(args.enc_nh, 2 * args.nz, bias=False)
28
+
29
+ self.reset_parameters(model_init, emb_init)
30
+
31
+ def reset_parameters(self, model_init, emb_init):
32
+ # for name, param in self.lstm.named_parameters():
33
+ # # self.initializer(param)
34
+ # if 'bias' in name:
35
+ # nn.init.constant_(param, 0.0)
36
+ # # model_init(param)
37
+ # elif 'weight' in name:
38
+ # model_init(param)
39
+
40
+ # model_init(self.linear.weight)
41
+ # emb_init(self.embed.weight)
42
+ for param in self.parameters():
43
+ model_init(param)
44
+ emb_init(self.embed.weight)
45
+
46
+
47
+ def forward(self, input):
48
+ """
49
+ Args:
50
+ x: (batch_size, seq_len)
51
+ Returns: Tensor1, Tensor2
52
+ Tensor1: the mean tensor, shape (batch, nz)
53
+ Tensor2: the logvar tensor, shape (batch, nz)
54
+ """
55
+
56
+ # (batch_size, seq_len-1, args.ni)
57
+ word_embed = self.embed(input)
58
+
59
+ _, (last_state, last_cell) = self.lstm(word_embed)
60
+
61
+ mean, logvar = self.linear(last_state).chunk(2, -1)
62
+
63
+ # fix variance as a pre-defined value
64
+ if self.args.fix_var > 0:
65
+ logvar = mean.new_tensor([[[math.log(self.args.fix_var)]]]).expand_as(mean)
66
+
67
+ return mean.squeeze(0), logvar.squeeze(0)
68
+
69
+ # def eval_inference_mode(self, x):
70
+ # """compute the mode points in the inference distribution
71
+ # (in Gaussian case)
72
+ # Returns: Tensor
73
+ # Tensor: the posterior mode points with shape (*, nz)
74
+ # """
75
+
76
+ # # (batch_size, nz)
77
+ # mu, logvar = self.forward(x)
78
+
79
+
80
+ class VarLSTMEncoder(GaussianLSTMEncoder):
81
+ """Gaussian LSTM Encoder with variable-length input"""
82
+ def __init__(self, args, vocab_size, model_init, emb_init):
83
+ super(VarLSTMEncoder, self).__init__(args, vocab_size, model_init, emb_init)
84
+
85
+
86
+ def forward(self, input):
87
+ """
88
+ Args:
89
+ input: tuple which contains x and sents_len
90
+ x: (batch_size, seq_len)
91
+ sents_len: long tensor of sentence lengths
92
+ Returns: Tensor1, Tensor2
93
+ Tensor1: the mean tensor, shape (batch, nz)
94
+ Tensor2: the logvar tensor, shape (batch, nz)
95
+ """
96
+
97
+ input, sents_len = input
98
+ # (batch_size, seq_len, args.ni)
99
+ word_embed = self.embed(input)
100
+
101
+ packed_embed = pack_padded_sequence(word_embed, sents_len.tolist(), batch_first=True)
102
+
103
+ _, (last_state, last_cell) = self.lstm(packed_embed)
104
+
105
+ mean, logvar = self.linear(last_state).chunk(2, -1)
106
+
107
+ return mean.squeeze(0), logvar.squeeze(0)
108
+
109
+ def encode(self, input, nsamples):
110
+ """perform the encoding and compute the KL term
111
+ Args:
112
+ input: tuple which contains x and sents_len
113
+ Returns: Tensor1, Tensor2
114
+ Tensor1: the tensor latent z with shape [batch, nsamples, nz]
115
+ Tensor2: the tensor of KL for each x with shape [batch]
116
+ """
117
+
118
+ # (batch_size, nz)
119
+ mu, logvar = self.forward(input)
120
+
121
+ # (batch, nsamples, nz)
122
+ z = self.reparameterize(mu, logvar, nsamples)
123
+
124
+ KL = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
125
+
126
+ return z, KL
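
For reference (illustrative, not part of the commit): the closed-form KL term used in encode() is the KL between the diagonal Gaussian posterior and a standard normal prior. A quick numerical check against torch.distributions, with assumed shapes:

import torch
from torch.distributions import Normal, kl_divergence

mu = torch.randn(4, 32)       # (batch, nz)
logvar = torch.randn(4, 32)

kl_closed_form = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
q = Normal(mu, logvar.mul(0.5).exp())
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
kl_reference = kl_divergence(q, p).sum(dim=1)

print(torch.allclose(kl_closed_form, kl_reference, atol=1e-5))   # True
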
Optimus/code/examples/big_ae/modules/encoders/encoder.py ADDED
@@ -0,0 +1,58 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+
5
+ from ..utils import log_sum_exp
6
+
7
+ class EncoderBase(nn.Module):
8
+ """docstring for EncoderBase"""
9
+ def __init__(self):
10
+ super(EncoderBase, self).__init__()
11
+
12
+ def forward(self, x):
13
+ """
14
+ Args:
15
+ x: (batch_size, *)
16
+ Returns: the tensors required to parameterize a distribution.
17
+ E.g. for Gaussian encoder it returns the mean and variance tensors
18
+ """
19
+
20
+ raise NotImplementedError
21
+
22
+ def sample(self, input, nsamples):
23
+ """sampling from the encoder
24
+ Returns: Tensor1
25
+ Tensor1: the tensor latent z with shape [batch, nsamples, nz]
26
+ """
27
+
28
+ raise NotImplementedError
29
+
30
+ def encode(self, input, nsamples):
31
+ """perform the encoding and compute the KL term
32
+ Returns: Tensor1, Tensor2
33
+ Tensor1: the tensor latent z with shape [batch, nsamples, nz]
34
+ Tensor2: the tensor of KL for each x with shape [batch]
35
+ """
36
+
37
+ raise NotImplementedError
38
+
39
+
40
+ def eval_inference_dist(self, x, z, param=None):
41
+ """this function computes log q(z | x)
42
+ Args:
43
+ z: tensor
44
+ different z points that will be evaluated, with
45
+ shape [batch, nsamples, nz]
46
+ Returns: Tensor1
47
+ Tensor1: log q(z|x) with shape [batch, nsamples]
48
+ """
49
+
50
+ raise NotImplementedError
51
+
52
+ def calc_mi(self, x):
53
+ """Approximate the mutual information between x and z
54
+ I(x, z) = E_xE_{q(z|x)}log(q(z|x)) - E_xE_{q(z|x)}log(q(z))
55
+ Returns: Float
56
+ """
57
+
58
+ raise NotImplementedError
Optimus/code/examples/big_ae/modules/encoders/gaussian_encoder.py ADDED
@@ -0,0 +1,147 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+
5
+ from .encoder import EncoderBase
6
+ from ..utils import log_sum_exp
7
+
8
+ class GaussianEncoderBase(EncoderBase):
9
+ """docstring for EncoderBase"""
10
+ def __init__(self):
11
+ super(GaussianEncoderBase, self).__init__()
12
+
13
+ def freeze(self):
14
+ for param in self.parameters():
15
+ param.requires_grad = False
16
+
17
+ def forward(self, x):
18
+ """
19
+ Args:
20
+ x: (batch_size, *)
21
+ Returns: Tensor1, Tensor2
22
+ Tensor1: the mean tensor, shape (batch, nz)
23
+ Tensor2: the logvar tensor, shape (batch, nz)
24
+ """
25
+
26
+ raise NotImplementedError
27
+
28
+ def encode_stats(self, x):
29
+
30
+ return self.forward(x)
31
+
32
+ def sample(self, input, nsamples):
33
+ """sampling from the encoder
34
+ Returns: Tensor1
35
+ Tensor1: the tensor latent z with shape [batch, nsamples, nz]
36
+ """
37
+
38
+ # (batch_size, nz)
39
+ mu, logvar = self.forward(input)
40
+
41
+ # (batch, nsamples, nz)
42
+ z = self.reparameterize(mu, logvar, nsamples)
43
+
44
+ return z, (mu, logvar)
45
+
46
+ def encode(self, input, nsamples):
47
+ """perform the encoding and compute the KL term
48
+ Returns: Tensor1, Tensor2
49
+ Tensor1: the tensor latent z with shape [batch, nsamples, nz]
50
+ Tensor2: the tensor of KL for each x with shape [batch]
51
+ """
52
+
53
+ # (batch_size, nz)
54
+ mu, logvar = self.forward(input)
55
+
56
+ # (batch, nsamples, nz)
57
+ z = self.reparameterize(mu, logvar, nsamples)
58
+
59
+ KL = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
60
+
61
+ return z, KL
62
+
63
+ def reparameterize(self, mu, logvar, nsamples=1):
64
+ """sample from posterior Gaussian family
65
+ Args:
66
+ mu: Tensor
67
+ Mean of gaussian distribution with shape (batch, nz)
68
+ logvar: Tensor
69
+ logvar of the Gaussian distribution with shape (batch, nz)
70
+ Returns: Tensor
71
+ Sampled z with shape (batch, nsamples, nz)
72
+ """
73
+ batch_size, nz = mu.size()
74
+ std = logvar.mul(0.5).exp()
75
+
76
+ mu_expd = mu.unsqueeze(1).expand(batch_size, nsamples, nz)
77
+ std_expd = std.unsqueeze(1).expand(batch_size, nsamples, nz)
78
+
79
+ eps = torch.zeros_like(std_expd).normal_()
80
+
81
+ return mu_expd + torch.mul(eps, std_expd)
82
+
83
+ def eval_inference_dist(self, x, z, param=None):
84
+ """this function computes log q(z | x)
85
+ Args:
86
+ z: tensor
87
+ different z points that will be evaluated, with
88
+ shape [batch, nsamples, nz]
89
+ Returns: Tensor1
90
+ Tensor1: log q(z|x) with shape [batch, nsamples]
91
+ """
92
+
93
+ nz = z.size(2)
94
+
95
+ if not param:
96
+ mu, logvar = self.forward(x)
97
+ else:
98
+ mu, logvar = param
99
+
100
+ # (batch_size, 1, nz)
101
+ mu, logvar = mu.unsqueeze(1), logvar.unsqueeze(1)
102
+ var = logvar.exp()
103
+
104
+ # (batch_size, nsamples, nz)
105
+ dev = z - mu
106
+
107
+ # (batch_size, nsamples)
108
+ log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \
109
+ 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1))
110
+
111
+ return log_density
112
+
113
+
114
+
115
+ def calc_mi(self, x):
116
+ """Approximate the mutual information between x and z
117
+ I(x, z) = E_xE_{q(z|x)}log(q(z|x)) - E_xE_{q(z|x)}log(q(z))
118
+ Returns: Float
119
+ """
120
+
121
+ # [x_batch, nz]
122
+ mu, logvar = self.forward(x)
123
+
124
+ x_batch, nz = mu.size()
125
+
126
+ # E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1)
127
+ neg_entropy = (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).mean()
128
+
129
+ # [z_batch, 1, nz]
130
+ z_samples = self.reparameterize(mu, logvar, 1)
131
+
132
+ # [1, x_batch, nz]
133
+ mu, logvar = mu.unsqueeze(0), logvar.unsqueeze(0)
134
+ var = logvar.exp()
135
+
136
+ # (z_batch, x_batch, nz)
137
+ dev = z_samples - mu
138
+
139
+ # (z_batch, x_batch)
140
+ log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \
141
+ 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1))
142
+
143
+ # log q(z): aggregate posterior
144
+ # [z_batch]
145
+ log_qz = log_sum_exp(log_density, dim=1) - math.log(x_batch)
146
+
147
+ return (neg_entropy - log_qz.mean(-1)).item()
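
Note (illustrative sketch, not part of the commit): reparameterize() above draws z = mu + eps * std with eps ~ N(0, 1), so gradients flow to the posterior parameters. A minimal trace with assumed shapes:

import torch

batch, nsamples, nz = 4, 2, 32
mu = torch.randn(batch, nz, requires_grad=True)
logvar = torch.randn(batch, nz, requires_grad=True)

std = logvar.mul(0.5).exp()
eps = torch.randn(batch, nsamples, nz)
z = mu.unsqueeze(1) + eps * std.unsqueeze(1)   # (batch, nsamples, nz)

z.sum().backward()
print(mu.grad.shape, logvar.grad.shape)        # gradients reach both posterior parameters
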
Optimus/code/examples/big_ae/modules/spacefusion.py ADDED
@@ -0,0 +1,143 @@
1
+ from .vae import VAE
2
+ import numpy as np
3
+ import torch, copy, pdb
4
+ import torch.nn.functional as F
5
+
6
+ from torch import nn
7
+
8
+ import pdb
9
+
10
+
11
+ def set_trainable(module, value):
12
+ for param in module.parameters():
13
+ param.requires_grad = value
14
+
15
+ class SpaceFusion(VAE):
16
+ def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args):
17
+ super(SpaceFusion, self).__init__(encoder, decoder, tokenizer_encoder, tokenizer_decoder, args)
18
+ children = [v for v in encoder.encoder.layer.children()] # list of 12 BertLayer
19
+
20
+ self.num_s2s_bert_layer = args.num_s2s_bert_layer
21
+ self.S2S_layers = nn.ModuleList([copy.deepcopy(c) for c in children[-args.num_s2s_bert_layer:] ]) # the last layer of encoder
22
+ self.S2S_pooler = copy.deepcopy(encoder.pooler)
23
+ self.ix_turn_sep = tokenizer_encoder.convert_tokens_to_ids('[SEP]')
24
+ if args.freeze_bert:
25
+ print('@'*20 + f' freezing BERT {args.num_frozen_bert_layer} layers')
26
+ for child in children[:args.num_frozen_bert_layer]:
27
+ set_trainable(child, False)
28
+
29
+
30
+
31
+ def ids2speaker(self, ids):
32
+ # 0 for speaker A, 1 for speaker B
33
+ N, T = ids.shape
34
+ speaker = np.zeros((N, T))
35
+ sep = ids == self.ix_turn_sep
36
+ for i in range(N):
37
+ is_B = False # start with speaker A
38
+ for t in range(T):
39
+ speaker[i,t] = int(is_B)
40
+ if sep[i,t].item():
41
+ is_B = not is_B
42
+
43
+ # make sure the final speaker is speaker B (so response is always speaker A)
44
+ if not is_B:
45
+ speaker = 1 - speaker
46
+
47
+ return torch.LongTensor(speaker).to(ids.device)
48
+
49
+ def forward(self, inputs_src, inputs_tgt, labels_tgt, return_vec=False): # [batch, time]
50
+ # toggle config to get desired encoder output
51
+ self.encoder.encoder.output_attentions = False
52
+ self.encoder.encoder.output_hidden_states = True
53
+
54
+
55
+ # AE encoder
56
+ mask = (inputs_tgt > 0).float().to(inputs_src.device)
57
+ outputs = self.encoder(inputs_tgt, attention_mask=mask)
58
+ z_AE, _ = self.connect(outputs[1])
59
+ z_AE = z_AE.squeeze(1)
60
+
61
+ # S2S encoder
62
+ mask = (inputs_src > 0).float()
63
+ speaker = self.ids2speaker(inputs_src)
64
+ outputs = self.encoder(inputs_src, attention_mask=mask, token_type_ids=speaker)
65
+ _, _, all_layer_attn = outputs # last_layer_attn, pooled, all_layer_attn = outputs
66
+ seq_z_prev = all_layer_attn[-self.num_s2s_bert_layer-1] # seq of z at layer 11 ()
67
+
68
+ for s2s in self.S2S_layers:
69
+ layer_outputs = s2s(seq_z_prev, attention_mask=mask.unsqueeze(1).unsqueeze(1))
70
+ seq_z_prev = layer_outputs[0]
71
+
72
+ z_S2S = self.encoder.pooler(layer_outputs[0])
73
+ z_S2S, _ = self.connect(z_S2S)
74
+ z_S2S = z_S2S.squeeze(1)
75
+
76
+ if return_vec:
77
+ return z_AE, z_S2S
78
+
79
+ # interpolation/smoothness
80
+ u = torch.FloatTensor(np.random.random((z_AE.shape[0], 1))).to(inputs_tgt.device)
81
+ z_interp = u * z_AE + (1 - u) * z_S2S
82
+ std = 0.1
83
+ noise = torch.FloatTensor(np.random.normal(size=z_interp.shape) * std).to(z_interp.device)
84
+ z_interp = z_interp + noise
85
+
86
+ loss_rec = 0
87
+ z_idx = 0
88
+ for z in [z_AE, z_S2S, z_interp]:
89
+ #pdb.set_trace()
90
+ past = z # past = self.decoder.linear(z)
91
+ outputs = self.decoder(input_ids=labels_tgt, past=past, labels=labels_tgt, label_ignore=self.pad_token_id)
92
+ if z_idx == 1:
93
+ loss_rec = loss_rec + 1.0 * outputs[0]
94
+ else:
95
+ loss_rec = loss_rec + outputs[0]
96
+ z_idx += 1
97
+ loss_rec = loss_rec/3
98
+
99
+ # fusion/regularization
100
+ L_pull = self.dist_pair(z_AE, z_S2S)
101
+ L_push = torch.stack([self.dist_batch(z) for z in [z_AE, z_S2S]]).min()
102
+ loss_reg = (L_pull - L_push * 2) / np.sqrt(z.shape[-1])
103
+
104
+ loss = loss_rec + self.args.beta * loss_reg
105
+ return loss_rec, loss_reg, loss
106
+
107
+ def sent2latent(self, inputs_src):
108
+ # toggle config to get desired encoder output
109
+ self.encoder.encoder.output_attentions = False
110
+ self.encoder.encoder.output_hidden_states = True
111
+
112
+ # S2S encoder
113
+ mask = (inputs_src > 0).float()
114
+ speaker = self.ids2speaker(inputs_src)
115
+ outputs = self.encoder(inputs_src, attention_mask=mask, token_type_ids=speaker)
116
+
117
+ _, _, all_layer_attn = outputs # last_layer_attn, pooled, all_layer_attn = outputs
118
+ # seq_z_prev = all_layer_attn[-2] # seq of z at layer 11 ()
119
+ # layer_outputs = self.S2S_layer(seq_z_prev, attention_mask=mask.unsqueeze(1).unsqueeze(1))
120
+
121
+ seq_z_prev = all_layer_attn[-self.num_s2s_bert_layer-1] # seq of z at layer 11 ()
122
+ for s2s in self.S2S_layers:
123
+ layer_outputs = s2s(seq_z_prev, attention_mask=mask.unsqueeze(1).unsqueeze(1))
124
+ seq_z_prev = layer_outputs[0]
125
+
126
+ z_S2S = self.encoder.pooler(layer_outputs[0])
127
+ z_S2S, _ = self.connect(z_S2S)
128
+ z_S2S = z_S2S.squeeze(1)
129
+
130
+ return z_S2S
131
+
132
+
133
+ def dist_pair(self, a, b):
134
+ return F.pairwise_distance(a, b).mean()
135
+
136
+
137
+ def dist_batch(self, vec):
138
+ n = vec.shape[0]
139
+ dmin = []
140
+ for i in range(n):
141
+ dd = F.pairwise_distance(vec[i:i+1,:].repeat(n,1), vec)
142
+ dmin.append(dd.min())
143
+ return torch.stack(dmin).mean()
Optimus/code/examples/big_ae/modules/utils.py ADDED
@@ -0,0 +1,40 @@
1
+ import torch
2
+
3
+ def safe_log(z):
4
+ return torch.log(z + 1e-7)
5
+
6
+ def log_sum_exp(value, dim=None, keepdim=False):
7
+ """Numerically stable implementation of the operation
8
+ value.exp().sum(dim, keepdim).log()
9
+ """
10
+ if dim is not None:
11
+ m, _ = torch.max(value, dim=dim, keepdim=True)
12
+ value0 = value - m
13
+ if keepdim is False:
14
+ m = m.squeeze(dim)
15
+ return m + torch.log(torch.sum(torch.exp(value0), dim=dim, keepdim=keepdim))
16
+ else:
17
+ m = torch.max(value)
18
+ sum_exp = torch.sum(torch.exp(value - m))
19
+ return m + torch.log(sum_exp)
20
+
21
+
22
+ def generate_grid(zmin, zmax, dz, device, ndim=2):
23
+ """generate a 1- or 2-dimensional grid
24
+ Returns: Tensor, int
25
+ Tensor: The grid tensor with shape (k^2, 2),
26
+ where k=(zmax - zmin)/dz
27
+ int: k
28
+ """
29
+
30
+ if ndim == 2:
31
+ x = torch.arange(zmin, zmax, dz)
32
+ k = x.size(0)
33
+
34
+ x1 = x.unsqueeze(1).repeat(1, k).view(-1)
35
+ x2 = x.repeat(k)
36
+
37
+ return torch.cat((x1.unsqueeze(-1), x2.unsqueeze(-1)), dim=-1).to(device), k
38
+
39
+ elif ndim == 1:
40
+ return torch.arange(zmin, zmax, dz).unsqueeze(1).to(device)
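
Note (illustrative, not part of the commit): the shifted log-sum-exp in utils.py stays finite where the naive version overflows. A small check with assumed values:

import torch

value = torch.tensor([[1000.0, 1000.5, 999.0]])

naive = torch.log(torch.exp(value).sum(dim=1))           # inf: exp(1000) overflows float32
m, _ = torch.max(value, dim=1, keepdim=True)
stable = m.squeeze(1) + torch.log(torch.exp(value - m).sum(dim=1))
print(naive, stable)                                      # tensor([inf]) vs. a finite value (about 1001.1)
# torch.logsumexp(value, dim=1) gives the same stable result
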
Optimus/code/examples/big_ae/modules/vae.py ADDED
@@ -0,0 +1,638 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+
5
+ from .utils import log_sum_exp
6
+
7
+ import pdb
8
+
9
+ import logging
10
+ logger = logging.getLogger(__name__)
11
+
12
+
13
+ class VAE(nn.Module):
14
+ """VAE with normal prior"""
15
+ def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): #
16
+ super(VAE, self).__init__()
17
+ self.encoder = encoder
18
+ self.decoder = decoder
19
+
20
+ self.args = args
21
+ self.nz = args.latent_size
22
+
23
+ self.eos_token_id = tokenizer_decoder.convert_tokens_to_ids([tokenizer_decoder.eos_token])[0]
24
+ self.pad_token_id = tokenizer_decoder.convert_tokens_to_ids([tokenizer_decoder.pad_token])[0]
25
+
26
+
27
+ # connector: from Bert hidden units to the latent space
28
+ # self.linear = nn.Linear(args.nz, 2 * args.nz, bias=False)
29
+
30
+ # Standard Normal prior
31
+ loc = torch.zeros(self.nz, device=args.device)
32
+ scale = torch.ones(self.nz, device=args.device)
33
+ self.prior = torch.distributions.normal.Normal(loc, scale)
34
+
35
+ def connect(self, bert_fea, nsamples=1):
36
+ """
37
+ Returns: Tensor1, Tensor2
38
+ Tensor1: the tensor latent z with shape [batch, nsamples, nz]
39
+ Tensor2: the tensor of KL for each x with shape [batch]
40
+ """
41
+
42
+ # (batch_size, nz)
43
+
44
+ mean, logvar = self.encoder.linear(bert_fea).chunk(2, -1)
45
+ # pdb.set_trace()
46
+ # mean, logvar = mean.squeeze(0), logvar.squeeze(0)
47
+
48
+ # (batch, nsamples, nz)
49
+ z = self.reparameterize(mean, logvar, nsamples)
50
+ KL = 0.5 * (mean.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
51
+
52
+ return z, KL
53
+
54
+ def connect_deterministic(self, bert_fea, nsamples=1):
55
+ """
56
+ Returns: Tensor1, Tensor2
57
+ Tensor1: the tensor latent z with shape [batch, nsamples, nz]
58
+ Tensor2: the tensor of KL for each x with shape [batch]
59
+ """
60
+
61
+ # (batch_size, nz)
62
+
63
+ mean, logvar = self.encoder.linear(bert_fea).chunk(2, -1)
64
+ # pdb.set_trace()
65
+ # mean, logvar = mean.squeeze(0), logvar.squeeze(0)
66
+
67
+ logvar.fill_(.0)
68
+ # (batch, nsamples, nz)
69
+ z = self.reparameterize(mean, logvar, nsamples)
70
+ KL = 0.5 * (mean.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
71
+
72
+ return z, KL
73
+
74
+
75
+
76
+ def reparameterize(self, mu, logvar, nsamples=1):
77
+ """sample from posterior Gaussian family
78
+ Args:
79
+ mu: Tensor
80
+ Mean of gaussian distribution with shape (batch, nz)
81
+ logvar: Tensor
82
+ logvar of the Gaussian distribution with shape (batch, nz)
83
+ Returns: Tensor
84
+ Sampled z with shape (batch, nsamples, nz)
85
+ """
86
+ batch_size, nz = mu.size()
87
+ std = logvar.mul(0.5).exp()
88
+
89
+ mu_expd = mu.unsqueeze(1).expand(batch_size, nsamples, nz)
90
+ std_expd = std.unsqueeze(1).expand(batch_size, nsamples, nz)
91
+
92
+ eps = torch.zeros_like(std_expd).normal_()
93
+
94
+ return mu_expd + torch.mul(eps, std_expd)
95
+
96
+ def forward(self, inputs, labels):
97
+
98
+ # pdb.set_trace()
99
+
100
+ attention_mask=(inputs > 0).float()
101
+ # logger.info(inputs)
102
+ # logger.info(attention_mask)
103
+ # logger.info(labels)
104
+ reconstruction_mask = (labels != 50257).float() # 50257 is the padding token for GPT2
105
+ sent_length = torch.sum(reconstruction_mask, dim=1)
106
+
107
+
108
+ outputs = self.encoder(inputs, attention_mask)
109
+ pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc)
110
+
111
+ if self.args.fb_mode==0:
112
+ # Connect hidden feature to the latent space
113
+ latent_z, loss_kl = self.connect(pooled_hidden_fea)
114
+ latent_z = latent_z.squeeze(1)
115
+
116
+
117
+ # Decoding
118
+ outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id)
119
+ loss_rec = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
120
+
121
+ elif self.args.fb_mode==1:
122
+ # Connect hidden feature to the latent space
123
+ mu, logvar = self.encoder.linear(pooled_hidden_fea).chunk(2, -1)
124
+ latent_z = self.reparameterize(mu, logvar, nsamples=1)
125
+ latent_z = latent_z.squeeze(1)
126
+ loss_kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1)
127
+ kl_mask = (loss_kl > self.args.dim_target_kl).float()
128
+ loss_kl = (kl_mask * loss_kl).sum(dim=1)
129
+
130
+ # pdb.set_trace()
131
+ # past = self.decoder.linear(latent_z)
132
+ # Decoding
133
+ outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id)
134
+ loss_rec = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
135
+
136
+ elif self.args.fb_mode==2:
137
+ # Connect hidden feature to the latent space
138
+ latent_z, loss_kl = self.connect_deterministic(pooled_hidden_fea)
139
+ latent_z = latent_z.squeeze(1)
140
+
141
+ # past = self.decoder.linear(latent_z)
142
+ # Decoding
143
+ outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id)
144
+ loss_rec = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
145
+
146
+
147
+ # pdb.set_trace()
148
+ if self.args.length_weighted_loss:
149
+ loss = loss_rec / sent_length + self.args.beta * loss_kl
150
+ else:
151
+ loss = loss_rec + self.args.beta * loss_kl
152
+
153
+
154
+ return loss_rec, loss_kl, loss
155
+
156
+
157
+
158
+ def encoder_sample(self, bert_fea, nsamples):
159
+ """sampling from the encoder
160
+ Returns: Tensor1
161
+ Tensor1: the tensor latent z with shape [batch, nsamples, nz]
162
+ """
163
+
164
+ # (batch_size, nz)
165
+
166
+ mu, logvar = self.encoder.linear(bert_fea).chunk(2, -1)
167
+ mu, logvar = mu.squeeze(0), logvar.squeeze(0)
168
+
169
+ # (batch, nsamples, nz)
170
+ z = self.reparameterize(mu, logvar, nsamples)
171
+
172
+ return z, (mu, logvar)
173
+
174
+
175
+ def encode_stats(self, x):
176
+ """
177
+ Returns: Tensor1, Tensor2
178
+ Tensor1: the mean of latent z with shape [batch, nz]
179
+ Tensor2: the logvar of latent z with shape [batch, nz]
180
+ """
181
+
182
+ return self.encoder.encode_stats(x)
183
+
184
+ def decode(self, z, strategy, K=10):
185
+ """generate samples from z given strategy
186
+ Args:
187
+ z: [batch, nsamples, nz]
188
+ strategy: "beam" or "greedy" or "sample"
189
+ K: the beam width parameter
190
+ Returns: List1
191
+ List1: a list of decoded word sequence
192
+ """
193
+
194
+ if strategy == "beam":
195
+ return self.decoder.beam_search_decode(z, K)
196
+ elif strategy == "greedy":
197
+ return self.decoder.greedy_decode(z)
198
+ elif strategy == "sample":
199
+ return self.decoder.sample_decode(z)
200
+ else:
201
+ raise ValueError("the decoding strategy is not supported")
202
+
203
+
204
+ def reconstruct(self, x, decoding_strategy="greedy", K=5):
205
+ """reconstruct from input x
206
+ Args:
207
+ x: (batch, *)
208
+ decoding_strategy: "beam" or "greedy" or "sample"
209
+ K: the beam width parameter
210
+ Returns: List1
211
+ List1: a list of decoded word sequence
212
+ """
213
+ z = self.sample_from_inference(x).squeeze(1)
214
+
215
+ return self.decode(z, decoding_strategy, K)
216
+
217
+ def log_probability(self, x, z):
218
+ """Cross Entropy in the language case
219
+ Args:
220
+ x: (batch_size, seq_len)
221
+ z: (batch_size, n_sample, nz)
222
+ Returns:
223
+ log_p: (batch_size, n_sample).
224
+ log_p(x|z) across different x and z
225
+ """
226
+ outputs = self.decoder(input_ids=x, past=z, labels=x, label_ignore=self.pad_token_id)
227
+ loss_rec = outputs[0]
228
+ return -loss_rec
229
+
230
+
231
+
232
+ def loss_iw(self, x0, x1, nsamples=50, ns=1):
233
+ """
234
+ Args:
235
+ x: if the data is constant-length, x is the data tensor with
236
+ shape (batch, *). Otherwise x is a tuple that contains
237
+ the data tensor and length list
238
+ Returns: Tensor1, Tensor2, Tensor3
239
+ Tensor1: total loss [batch]
240
+ Tensor2: reconstruction loss shape [batch]
241
+ Tensor3: KL loss shape [batch]
242
+ """
243
+
244
+ # encoding into bert features
245
+ bert_fea = self.encoder(x0)[1]
246
+
247
+ # (batch_size, nz)
248
+
249
+ mu, logvar = self.encoder.linear(bert_fea).chunk(2, -1)
250
+
251
+
252
+ ##################
253
+ # compute KL
254
+ ##################
255
+ # pdb.set_trace()
256
+ KL = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
257
+
258
+ # mu, logvar = mu.squeeze(0), logvar.squeeze(0)
259
+ ll_tmp, rc_tmp = [], []
260
+ for _ in range(int(nsamples / ns)):
261
+
262
+ # (batch, nsamples, nz)
263
+ z = self.reparameterize(mu, logvar, ns)
264
+ # past = self.decoder.linear(z)
265
+ past = z
266
+
267
+ # [batch, nsamples]
268
+ log_prior = self.eval_prior_dist(z)
269
+ log_gen = self.eval_cond_ll(x1, past)
270
+ log_infer = self.eval_inference_dist(z, (mu, logvar))
271
+
272
+ # pdb.set_trace()
273
+ log_gen = log_gen.unsqueeze(0).contiguous().view(z.shape[0],-1)
274
+
275
+
276
+ # pdb.set_trace()
277
+ rc_tmp.append(log_gen)
278
+ ll_tmp.append(log_gen + log_prior - log_infer)
279
+
280
+
281
+
282
+ log_prob_iw = log_sum_exp(torch.cat(ll_tmp, dim=-1), dim=-1) - math.log(nsamples)
283
+ log_gen_iw = torch.mean(torch.cat(rc_tmp, dim=-1), dim=-1)
284
+
285
+ return log_prob_iw, log_gen_iw , KL
286
+
287
+
288
+ def nll_iw(self, x0, x1, nsamples, ns=1):
289
+ """compute the importance weighting estimate of the log-likelihood
290
+ Args:
291
+ x0, x1: two different tokenization results of x, where x is the data tensor with shape (batch, *).
292
+ nsamples: Int
293
+ the number of samples required to estimate marginal data likelihood
294
+ Returns: Tensor1
295
+ Tensor1: the estimate of log p(x), shape [batch]
296
+ """
297
+
298
+ # compute iw every ns samples to address the memory issue
299
+ # nsamples = 500, ns = 100
300
+ # nsamples = 500, ns = 10
301
+
302
+ # TODO: note that x is forwarded twice in self.encoder.sample(x, ns) and self.eval_inference_dist(x, z, param)
303
+ #. this problem is to be solved in order to speed up
304
+
305
+ tmp = []
306
+ for _ in range(int(nsamples / ns)):
307
+ # [batch, ns, nz]
308
+
309
+ # Chunyuan:
310
+ # encoding into bert features
311
+ pooled_hidden_fea = self.encoder(x0)[1]
312
+
313
+ # param is the parameters required to evaluate q(z|x)
314
+ z, param = self.encoder_sample(pooled_hidden_fea, ns)
315
+
316
+ # [batch, ns]
317
+ log_comp_ll = self.eval_complete_ll(x1, z)
318
+ log_infer_ll = self.eval_inference_dist(z, param)
319
+
320
+ tmp.append(log_comp_ll - log_infer_ll)
321
+
322
+ ll_iw = log_sum_exp(torch.cat(tmp, dim=-1), dim=-1) - math.log(nsamples)
323
+
324
+ return ll_iw
325
+
326
+ def KL(self, x):
327
+ _, KL = self.encode(x, 1)
328
+
329
+ return KL
330
+
331
+ def eval_prior_dist(self, zrange):
332
+ """perform grid search to calculate the true posterior
333
+ Args:
334
+ zrange: tensor
335
+ different z points that will be evaluated, with
336
+ shape (k^2, nz), where k=(zmax - zmin)/space
337
+ """
338
+
339
+ # (k^2)
340
+ return self.prior.log_prob(zrange).sum(dim=-1)
341
+
342
+ def eval_complete_ll(self, x, z):
343
+ """compute log p(z,x)
344
+ Args:
345
+ x: Tensor
346
+ input with shape [batch, seq_len]
347
+ z: Tensor
348
+ evaluation points with shape [batch, nsamples, nz]
349
+ Returns: Tensor1
350
+ Tensor1: log p(z,x) Tensor with shape [batch, nsamples]
351
+ """
352
+
353
+ # [batch, nsamples]
354
+ log_prior = self.eval_prior_dist(z)
355
+ log_gen = self.eval_cond_ll(x, z)
356
+
357
+ return log_prior + log_gen
358
+
359
+
360
+
361
+ def eval_cond_ll(self, x, z):
362
+ """compute log p(x|z)
363
+ """
364
+ x_shape = list(x.size())
365
+ z_shape = list(z.size())
366
+ if len(z_shape) == 3:
367
+ x = x.unsqueeze(1).repeat(1, z_shape[1], 1).contiguous().view(x_shape[0]*z_shape[1], x_shape[-1])
368
+ z = z.contiguous().view(x_shape[0]*z_shape[1], z_shape[-1])
369
+
370
+ return self.log_probability(x, z)
371
+
372
+
373
+
374
+ def eval_log_model_posterior(self, x, grid_z):
375
+ """perform grid search to calculate the true posterior
376
+ this function computes p(z|x)
377
+ Args:
378
+ grid_z: tensor
379
+ different z points that will be evaluated, with
380
+ shape (k^2, nz), where k=(zmax - zmin)/pace
381
+ Returns: Tensor
382
+ Tensor: the log posterior distribution log p(z|x) with
383
+ shape [batch_size, K^2]
384
+ """
385
+ try:
386
+ batch_size = x.size(0)
387
+ except:
388
+ batch_size = x[0].size(0)
389
+
390
+ # (batch_size, k^2, nz)
391
+ grid_z = grid_z.unsqueeze(0).expand(batch_size, *grid_z.size()).contiguous()
392
+
393
+ # (batch_size, k^2)
394
+ log_comp = self.eval_complete_ll(x, grid_z)
395
+
396
+ # normalize to posterior
397
+ log_posterior = log_comp - log_sum_exp(log_comp, dim=1, keepdim=True)
398
+
399
+ return log_posterior
400
+
401
+ def sample_from_inference(self, x, nsamples=1):
402
+ """perform sampling from inference net
403
+ Returns: Tensor
404
+ Tensor: samples from infernece nets with
405
+ shape (batch_size, nsamples, nz)
406
+ """
407
+ z, _ = self.encoder.sample(x, nsamples)
408
+
409
+ return z
410
+
411
+
412
+ def sample_from_posterior(self, x, nsamples):
413
+ """perform MH sampling from model posterior
414
+ Returns: Tensor
415
+ Tensor: samples from model posterior with
416
+ shape (batch_size, nsamples, nz)
417
+ """
418
+
419
+ # use the samples from inference net as initial points
420
+ # for MCMC sampling. [batch_size, nsamples, nz]
421
+ cur = self.encoder.sample_from_inference(x, 1)
422
+ cur_ll = self.eval_complete_ll(x, cur)
423
+ total_iter = self.args.mh_burn_in + nsamples * self.args.mh_thin
424
+ samples = []
425
+ for iter_ in range(total_iter):
426
+ next = torch.normal(mean=cur,
427
+ std=cur.new_full(size=cur.size(), fill_value=self.args.mh_std))
428
+ # [batch_size, 1]
429
+ next_ll = self.eval_complete_ll(x, next)
430
+ ratio = next_ll - cur_ll
431
+
432
+ accept_prob = torch.min(ratio.exp(), ratio.new_ones(ratio.size()))
433
+
434
+ uniform_t = accept_prob.new_empty(accept_prob.size()).uniform_()
435
+
436
+ # [batch_size, 1]
437
+ mask = (uniform_t < accept_prob).float()
438
+ mask_ = mask.unsqueeze(2)
439
+
440
+ cur = mask_ * next + (1 - mask_) * cur
441
+ cur_ll = mask * next_ll + (1 - mask) * cur_ll
442
+
443
+ if iter_ >= self.args.mh_burn_in and (iter_ - self.args.mh_burn_in) % self.args.mh_thin == 0:
444
+ samples.append(cur.unsqueeze(1))
445
+
446
+ return torch.cat(samples, dim=1)
447
+
448
+
449
+ def calc_model_posterior_mean(self, x, grid_z):
450
+ """compute the mean value of model posterior, i.e. E_{z ~ p(z|x)}[z]
451
+ Args:
452
+ grid_z: different z points that will be evaluated, with
453
+ shape (k^2, nz), where k=(zmax - zmin)/pace
454
+ x: [batch, *]
455
+ Returns: Tensor1
456
+ Tensor1: the mean value tensor with shape [batch, nz]
457
+ """
458
+
459
+ # [batch, K^2]
460
+ log_posterior = self.eval_log_model_posterior(x, grid_z)
461
+ posterior = log_posterior.exp()
462
+
463
+ # [batch, nz]
464
+ return torch.mul(posterior.unsqueeze(2), grid_z.unsqueeze(0)).sum(1)
465
+
466
+ def calc_infer_mean(self, x):
467
+ """
468
+ Returns: Tensor1
469
+ Tensor1: the mean of inference distribution, with shape [batch, nz]
470
+ """
471
+
472
+ mean, logvar = self.encoder.forward(x)
473
+
474
+ return mean
475
+
476
+
477
+
478
+
479
+ def eval_inference_dist(self, z, param):
480
+ """this function computes log q(z | x)
481
+ Args:
482
+ z: tensor
483
+ different z points that will be evaluated, with
484
+ shape [batch, nsamples, nz]
485
+ Returns: Tensor1
486
+ Tensor1: log q(z|x) with shape [batch, nsamples]
487
+ """
488
+
489
+ nz = z.size(2)
490
+ mu, logvar = param
491
+
492
+ # (batch_size, 1, nz)
493
+ mu, logvar = mu.unsqueeze(1), logvar.unsqueeze(1)
494
+ var = logvar.exp()
495
+
496
+ # (batch_size, nsamples, nz)
497
+ dev = z - mu
498
+
499
+ # (batch_size, nsamples)
500
+ log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \
501
+ 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1))
502
+
503
+ return log_density
504
+
505
+
506
+
507
+ def calc_mi(self, test_data_batch, args):
508
+ # calc_mi_v3
509
+ # math and log_sum_exp are already available from the module-level imports above
510
+
511
+
512
+ mi = 0
513
+ num_examples = 0
514
+
515
+ mu_batch_list, logvar_batch_list = [], []
516
+ neg_entropy = 0.
517
+ for batch_data in test_data_batch:
518
+
519
+ x0, _, _ = batch_data
520
+ x0 = x0.to(args.device)
521
+
522
+ # encoding into bert features
523
+ bert_fea = self.encoder(x0)[1]
524
+
525
+ # (batch_size, nz)
526
+ mu, logvar = self.encoder.linear(bert_fea).chunk(2, -1)
527
+
528
+ x_batch, nz = mu.size()
529
+
530
+ #print(x_batch, end=' ')
531
+
532
+ num_examples += x_batch
533
+
534
+ # E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1)
535
+
536
+ neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).sum().item()
537
+ mu_batch_list += [mu.cpu()]
538
+ logvar_batch_list += [logvar.cpu()]
539
+
540
+ # pdb.set_trace()
541
+
542
+ neg_entropy = neg_entropy / num_examples
543
+ ##print()
544
+
545
+ num_examples = 0
546
+ log_qz = 0.
547
+ for i in range(len(mu_batch_list)):
548
+ ###############
549
+ # get z_samples
550
+ ###############
551
+ mu, logvar = mu_batch_list[i].cuda(), logvar_batch_list[i].cuda()
552
+
553
+ # [z_batch, 1, nz]
554
+
555
+ z_samples = self.reparameterize(mu, logvar, 1)
556
+
557
+ z_samples = z_samples.view(-1, 1, nz)
558
+ num_examples += z_samples.size(0)
559
+
560
+ ###############
561
+ # compute density
562
+ ###############
563
+ # [1, x_batch, nz]
564
+ #mu, logvar = mu_batch_list[i].cuda(), logvar_batch_list[i].cuda()
565
+ #indices = list(np.random.choice(np.arange(len(mu_batch_list)), 10)) + [i]
566
+ indices = np.arange(len(mu_batch_list))
567
+ mu = torch.cat([mu_batch_list[_] for _ in indices], dim=0).cuda()
568
+ logvar = torch.cat([logvar_batch_list[_] for _ in indices], dim=0).cuda()
569
+ x_batch, nz = mu.size()
570
+
571
+ mu, logvar = mu.unsqueeze(0), logvar.unsqueeze(0)
572
+ var = logvar.exp()
573
+
574
+ # (z_batch, x_batch, nz)
575
+ dev = z_samples - mu
576
+
577
+ # (z_batch, x_batch)
578
+ log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \
579
+ 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1))
580
+
581
+ # log q(z): aggregate posterior
582
+ # [z_batch]
583
+ log_qz += (log_sum_exp(log_density, dim=1) - math.log(x_batch)).sum(-1)
584
+
585
+ log_qz /= num_examples
586
+ mi = neg_entropy - log_qz
587
+
588
+ return mi
589
+
590
+
591
+
592
+ def calc_au(self, eval_dataloader, args, delta=0.01):
593
+ """compute the number of active units
594
+ """
595
+ cnt = 0
596
+ for batch_data in eval_dataloader:
597
+
598
+ x0, _, _ = batch_data
599
+ x0 = x0.to(args.device)
600
+
601
+ # encoding into bert features
602
+ bert_fea = self.encoder(x0)[1]
603
+
604
+ # (batch_size, nz)
605
+ mean, logvar = self.encoder.linear(bert_fea).chunk(2, -1)
606
+
607
+ if cnt == 0:
608
+ means_sum = mean.sum(dim=0, keepdim=True)
609
+ else:
610
+ means_sum = means_sum + mean.sum(dim=0, keepdim=True)
611
+ cnt += mean.size(0)
612
+
613
+ # (1, nz)
614
+ mean_mean = means_sum / cnt
615
+
616
+ cnt = 0
617
+ for batch_data in eval_dataloader:
618
+
619
+ x0, _, _ = batch_data
620
+ x0 = x0.to(args.device)
621
+
622
+ # encoding into bert features
623
+ bert_fea = self.encoder(x0)[1]
624
+
625
+ # (batch_size, nz)
626
+ mean, _ = self.encoder.linear(bert_fea).chunk(2, -1)
627
+
628
+ if cnt == 0:
629
+ var_sum = ((mean - mean_mean) ** 2).sum(dim=0)
630
+ else:
631
+ var_sum = var_sum + ((mean - mean_mean) ** 2).sum(dim=0)
632
+ cnt += mean.size(0)
633
+
634
+ # (nz)
635
+ au_var = var_sum / (cnt - 1)
636
+
637
+ return (au_var >= delta).sum().item(), au_var
638
+
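The core of VAE.connect() above is the reparameterization trick plus the closed-form KL divergence between the diagonal-Gaussian posterior q(z|x) = N(mu, diag(exp(logvar))) and the standard-normal prior. A self-contained sketch with toy tensors (no BERT/GPT-2 weights involved) that reproduces both expressions as they appear in the class:

    import torch

    torch.manual_seed(0)
    batch, nz = 2, 4
    mean   = torch.randn(batch, nz)    # first half of the encoder's linear head
    logvar = torch.randn(batch, nz)    # second half of the encoder's linear head

    # Reparameterization, as in VAE.reparameterize with nsamples=1: z = mu + eps * sigma
    std = logvar.mul(0.5).exp()
    eps = torch.randn_like(std)
    z = mean + eps * std                                              # (batch, nz)

    # KL(q(z|x) || N(0, I)) per example, as in VAE.connect
    kl = 0.5 * (mean.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)   # (batch,)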
Optimus/code/examples/big_ae/run_data_filtering.py ADDED
@@ -0,0 +1,507 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
18
+ GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
19
+ using a masked language modeling (MLM) loss.
20
+ """
21
+
22
+ from __future__ import absolute_import, division, print_function
23
+
24
+
25
+ import pdb
26
+ import argparse
27
+ import glob
28
+ import logging
29
+
30
+ import os
31
+ import pickle
32
+ import json
33
+ import random
34
+ from pathlib import Path
35
+
36
+ import numpy as np
37
+ import torch
38
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
39
+ from torch.utils.data.distributed import DistributedSampler
40
+ from tensorboardX import SummaryWriter
41
+ from tqdm import tqdm, trange
42
+ from collections import defaultdict
43
+
44
+ # from azure.cosmosdb.table.tableservice import TableService
45
+ # from azure.cosmosdb.table.models import Entity
46
+ from datetime import datetime
47
+
48
+
49
+ from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
50
+ BertConfig, BertForLatentConnector, BertTokenizer,
51
+ GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer,
52
+ OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
53
+ RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
54
+
55
+ from utils import (calc_iwnll, calc_mi, calc_au, BucketingDataLoader, MultipleFiles_DataLoader, BucketingMultipleFiles_DataLoader, frange_cycle_linear, frange_cycle_zero_linear)
56
+
57
+ from modules import VAE
58
+
59
+
60
+ # logging.getLogger("azure").setLevel(logging.WARNING)
61
+ # logging.getLogger("TableService").setLevel(logging.WARNING)
62
+
63
+ logger = logging.getLogger(__name__)
64
+
65
+
66
+ MODEL_CLASSES = {
67
+ 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer),
68
+ 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
69
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer),
70
+ 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
71
+ }
72
+
73
+
74
+ storage_name="textae"
75
+ key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA=="
76
+ # ts = TableService(account_name=storage_name, account_key=key)
77
+
78
+
79
+
80
+ def build_dataload_and_cache_examples(args, tokenizer, evaluate=False):
81
+ if isinstance(tokenizer, list):
82
+ args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
83
+ file_path=args.input_file_path
84
+ dataloader = MultipleFiles_DataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True, use_tensor=False)
85
+ else:
86
+ pass
87
+ return dataloader
88
+
89
+
90
+ def set_seed(args):
91
+ random.seed(args.seed)
92
+ np.random.seed(args.seed)
93
+ torch.manual_seed(args.seed)
94
+ if args.n_gpu > 0:
95
+ torch.cuda.manual_seed_all(args.seed)
96
+
97
+
98
+ def mask_tokens(inputs, tokenizer, args):
99
+ """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
100
+ labels = inputs.clone()
101
+ # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
102
+
103
+ masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8)
104
+ labels[masked_indices==1] = -1 # We only compute loss on masked tokens
105
+
106
+ # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
107
+ indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices
108
+ inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
109
+
110
+ # 10% of the time, we replace masked input tokens with random word
111
+ indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced
112
+ indices_random = indices_random
113
+ random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
114
+ inputs[indices_random] = random_words[indices_random]
115
+
116
+ # The rest of the time (10% of the time) we keep the masked input tokens unchanged
117
+ return inputs, labels
118
+
119
+
120
+ def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name):
121
+ """ Train the model """
122
+ if args.local_rank in [-1, 0]:
123
+ tb_writer = SummaryWriter()
124
+
125
+ args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
126
+ # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
127
+ # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
128
+
129
+ if args.max_steps > 0:
130
+ t_total = args.max_steps
131
+ args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
132
+ else:
133
+ t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
134
+
135
+ # Prepare optimizer and schedule (linear warmup and decay)
136
+
137
+
138
+ # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear
139
+ no_decay = ['bias', 'LayerNorm.weight']
140
+ optimizer_grouped_parameters = [
141
+ {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
142
+ {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
143
+ ]
144
+
145
+ optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
146
+ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
147
+
148
+
149
+ if args.fp16:
150
+ try:
151
+ from apex import amp
152
+ except ImportError:
153
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
154
+ model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level)
155
+
156
+ # multi-gpu training (should be after apex fp16 initialization)
157
+ if args.n_gpu > 1:
158
+ model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device)
159
+
160
+ # Distributed training (should be after apex fp16 initialization)
161
+ if args.local_rank != -1:
162
+ model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank],
163
+ output_device=args.local_rank,
164
+ find_unused_parameters=True)
165
+
166
+
167
+
168
+ files = Path(args.input_file_path)
169
+ num_files = len(list(files.glob('*seq64*.json')))
170
+
171
+ # create output file folder
172
+ if not os.path.exists(args.output_file_path) and args.local_rank in [-1, 0]:
173
+ os.makedirs(args.output_file_path)
174
+
175
+
176
+ # Train!
177
+ logger.info("***** Running training *****")
178
+ logger.info(" Num files = %d", num_files)
179
+ logger.info(" Num examples of first file = %d", train_dataloader.num_examples)
180
+ logger.info(" Num Epochs = %d", args.num_train_epochs)
181
+ logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
182
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
183
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
184
+ logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
185
+ logger.info(" Total optimization steps = %d", t_total)
186
+
187
+
188
+ num_collected, num_dropped = 0, 0
189
+
190
+ model_vae.zero_grad()
191
+ num_train_epochs_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
192
+
193
+ n_iter = int(args.num_train_epochs) * len(train_dataloader)
194
+
195
+ tmp_list = []
196
+ dict_token_length = defaultdict(int)
197
+
198
+
199
+ if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
200
+ os.makedirs(args.output_dir)
201
+
202
+ dict_file = os.path.join(args.output_dir, args.dataset.lower()+f'.length_freq.json' )
203
+
204
+ set_seed(args) # Added here for reproducibility (even between python 2 and 3)
205
+ for epoch in num_train_epochs_iterator:
206
+
207
+ for idx_file in range(num_files):
208
+
209
+ examples = []
210
+ cached_features_file = os.path.join(args.output_file_path, args.dataset.lower()+f'.segmented.nltk.split.seq64.{train_dataloader.file_idx}.json' )
211
+ logger.info(f"Epoch {epoch}, File idx {train_dataloader.file_idx}")
212
+ epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
213
+
214
+ # if idx_file > 11:
215
+ # break
216
+
217
+ for step, batch in enumerate(epoch_iterator):
218
+
219
+ inst, token_lengths = batch
220
+ dict_token_length[ token_lengths[0,0].item() ] += 1
221
+
222
+ if ( token_lengths> 256 ).sum().item()>0:
223
+ over_length_tensor = ( token_lengths> 256 ).sum(-1)
224
+ inst_ = [inst[i] for i in range(len(inst)) if over_length_tensor[i]==0 ]
225
+ examples += inst_
226
+ num_collected += len(inst_)
227
+ num_dropped += len(inst) - len(inst_)
228
+ logger.info(f"{num_dropped} files filtered.")
229
+ else:
230
+ examples += inst
231
+ num_collected += len(inst)
232
+
233
+ # Good practice: save your data multiple times on Philly
234
+
235
+ if args.use_philly:
236
+ save_solid = False
237
+ while not save_solid:
238
+ try:
239
+ with open(cached_features_file, 'w') as fp:
240
+ json.dump(examples, fp)
241
+ save_solid = True
242
+ except:
243
+ pass
244
+ else:
245
+ with open(cached_features_file, 'w') as fp:
246
+ json.dump(examples, fp)
247
+ logger.info(f"Saving features in the cached file at {cached_features_file}")
248
+
249
+ train_dataloader.reset()
250
+
251
+ if args.local_rank in [-1, 0]:
252
+ tb_writer.close()
253
+
254
+ logger.info(dict_token_length)
255
+ # Good practice: save your dict multiple times on Philly
256
+ if args.use_philly:
257
+ save_solid = False
258
+ while not save_solid:
259
+ try:
260
+ with open(dict_file, 'w') as fp:
261
+ json.dump(dict_token_length, fp)
262
+ save_solid = True
263
+ except:
264
+ pass
265
+ else:
266
+ with open(dict_file, 'w') as fp:
267
+ json.dump(dict_token_length, fp)
268
+
269
+ return num_collected, num_dropped
270
+
271
+
272
+ def main():
273
+ parser = argparse.ArgumentParser()
274
+
275
+ ## Required parameters
276
+ parser.add_argument("--input_file_path", default=None, type=str, required=True,
277
+ help="The output directory where the input files will be written.")
278
+ parser.add_argument("--output_file_path", default=None, type=str, required=True,
279
+ help="The output directory where the output files will be written.")
280
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
281
+ help="The output directory where the logs and results will be saved.")
282
+ parser.add_argument("--dataset", default=None, type=str, help="The dataset.")
283
+
284
+
285
+
286
+ ## Other parameters
287
+ parser.add_argument("--ExpName", default="", type=str,
288
+ help="The experiment name used in Azure Table.")
289
+
290
+ ## Encoder options
291
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
292
+ help="The encoder model architecture to be fine-tuned.")
293
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
294
+ help="The encoder model checkpoint for weights initialization.")
295
+ parser.add_argument("--encoder_config_name", default="", type=str,
296
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
297
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
298
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
299
+
300
+ ## Decoder options
301
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
302
+ help="The decoder model architecture to be fine-tuned.")
303
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
304
+ help="The decoder model checkpoint for weights initialization.")
305
+ parser.add_argument("--decoder_config_name", default="", type=str,
306
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
307
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
308
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
309
+
310
+ ## Variational auto-encoder
311
+ parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.")
312
+ parser.add_argument("--use_deterministic_connect", action='store_true',
313
+ help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.")
314
+
315
+ ## Objective functions
316
+ parser.add_argument("--mlm", action='store_true',
317
+ help="Train with masked-language modeling loss instead of language modeling.")
318
+ parser.add_argument("--mlm_probability", type=float, default=0.15,
319
+ help="Ratio of tokens to mask for masked language modeling loss")
320
+ parser.add_argument("--beta", type=float, default=1.0,
321
+ help="The weighting hyper-parameter of the KL term in VAE")
322
+
323
+
324
+ parser.add_argument("--cache_dir", default="", type=str,
325
+ help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)")
326
+ parser.add_argument("--max_seq_length", default=512, type=int,
327
+ help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length")
328
+ parser.add_argument("--block_size", default=-1, type=int,
329
+ help="Optional input sequence length after tokenization."
330
+ "The training dataset will be truncated in block of this size for training."
331
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
332
+ parser.add_argument("--do_train", action='store_true',
333
+ help="Whether to run training.")
334
+ parser.add_argument("--do_eval", action='store_true',
335
+ help="Whether to run eval on the dev set.")
336
+ parser.add_argument("--evaluate_during_training", action='store_true',
337
+ help="Run evaluation during training at each logging step.")
338
+ parser.add_argument("--do_lower_case", action='store_true',
339
+ help="Set this flag if you are using an uncased model.")
340
+
341
+
342
+ # Training Schedules
343
+ parser.add_argument("--ratio_increase", default=0.25, type=float,
344
+ help="Learning schedule, the percentage for the annealing stage.")
345
+ parser.add_argument("--ratio_zero", default=0.25, type=float,
346
+ help="Learning schedule, the percentage for the pure auto-encoding stage.")
347
+ parser.add_argument("--fb_mode", default=0, type=int,
348
+ help="free bit training mode.")
349
+ parser.add_argument("--dim_target_kl", default=3.0, type=float,
350
+ help="dim_target_kl free bit training mode.")
351
+ parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
352
+ help="Batch size per GPU/CPU for training.")
353
+ parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int,
354
+ help="Batch size per GPU/CPU for evaluation.")
355
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
356
+ help="Number of updates steps to accumulate before performing a backward/update pass.")
357
+ parser.add_argument("--learning_rate", default=5e-5, type=float,
358
+ help="The initial learning rate for Adam.")
359
+ parser.add_argument("--weight_decay", default=0.0, type=float,
360
+ help="Weight deay if we apply some.")
361
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float,
362
+ help="Epsilon for Adam optimizer.")
363
+ parser.add_argument("--max_grad_norm", default=1.0, type=float,
364
+ help="Max gradient norm.")
365
+ parser.add_argument("--num_train_epochs", default=1.0, type=float,
366
+ help="Total number of training epochs to perform.")
367
+ parser.add_argument("--max_steps", default=-1, type=int,
368
+ help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
369
+ parser.add_argument("--warmup_steps", default=0, type=int,
370
+ help="Linear warmup over warmup_steps.")
371
+ parser.add_argument("--use_philly", action='store_true',
372
+ help="Use Philly for computing.")
373
+
374
+ ## IO: Logging and Saving
375
+ parser.add_argument('--logging_steps', type=int, default=50,
376
+ help="Log every X updates steps.")
377
+ parser.add_argument('--save_steps', type=int, default=50,
378
+ help="Save checkpoint every X updates steps.")
379
+ parser.add_argument("--eval_all_checkpoints", action='store_true',
380
+ help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
381
+ parser.add_argument("--no_cuda", action='store_true',
382
+ help="Avoid using CUDA when available")
383
+ parser.add_argument('--overwrite_output_dir', action='store_true',
384
+ help="Overwrite the content of the output directory")
385
+ parser.add_argument('--overwrite_cache', action='store_true',
386
+ help="Overwrite the cached training and evaluation sets")
387
+ parser.add_argument('--seed', type=int, default=42,
388
+ help="random seed for initialization")
389
+ parser.add_argument('--gloabl_step_eval', type=int, default=661,
390
+ help="Evaluate the results at the given global step")
391
+
392
+ # Precision & Distributed Training
393
+ parser.add_argument('--fp16', action='store_true',
394
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
395
+ parser.add_argument('--fp16_opt_level', type=str, default='O1',
396
+ help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
397
+ "See details at https://nvidia.github.io/apex/amp.html")
398
+ parser.add_argument("--local_rank", type=int, default=-1,
399
+ help="For distributed training: local_rank")
400
+ parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
401
+ parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
402
+ args = parser.parse_args()
403
+
404
+ if args.decoder_model_type in ["bert", "roberta"] and not args.mlm:
405
+ raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
406
+ "flag (masked language modeling).")
407
+
408
+ if os.path.exists(args.output_file_path) and os.listdir(args.output_file_path) and args.do_train and not args.overwrite_output_dir:
409
+ raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_file_path))
410
+
411
+ # Setup distant debugging if needed
412
+ if args.server_ip and args.server_port:
413
+ # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
414
+ import ptvsd
415
+ print("Waiting for debugger attach")
416
+ ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
417
+ ptvsd.wait_for_attach()
418
+
419
+ # Setup CUDA, GPU & distributed training
420
+ logger.info(f'Local rank is {args.local_rank}')
421
+ if args.local_rank == -1 or args.no_cuda:
422
+ device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
423
+ args.n_gpu = torch.cuda.device_count()
424
+ else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
425
+ torch.cuda.set_device(args.local_rank)
426
+ device = torch.device("cuda", args.local_rank)
427
+ torch.distributed.init_process_group(backend='nccl')
428
+ args.n_gpu = 1
429
+ args.device = device
430
+
431
+ # Setup logging
432
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
433
+ datefmt = '%m/%d/%Y %H:%M:%S',
434
+ level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
435
+ logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
436
+ args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
437
+
438
+ args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero)
439
+ table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size)
440
+ try:
441
+ ts.create_table(table_name)
442
+ except:
443
+ pass
444
+
445
+
446
+ # Set seed
447
+ set_seed(args)
448
+
449
+ # Load pretrained model and tokenizer
450
+ if args.local_rank not in [-1, 0]:
451
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab
452
+
453
+ ## Encoder
454
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
455
+ encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path)
456
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
457
+ if args.block_size <= 0:
458
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
459
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
460
+ model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size)
461
+ # model_encoder.to(args.device)
462
+
463
+ ## Decoder
464
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
465
+ decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path)
466
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
467
+ if args.block_size <= 0:
468
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
469
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
470
+ model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size)
471
+
472
+ # Chunyuan: Add Padding token to GPT2
473
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
474
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
475
+ print('We have added', num_added_toks, 'tokens to GPT2')
476
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
477
+ assert tokenizer_decoder.pad_token == '<PAD>'
478
+
479
+ # model_decoder.to(args.device)
480
+
481
+ model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) #
482
+
483
+ # on_gpu = next(model_vae.parameters()).is_cuda
484
+
485
+
486
+
487
+ if args.local_rank == 0:
488
+ torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab
489
+
490
+ logger.info("Training/evaluation parameters %s", args)
491
+
492
+ global_step= 0
493
+ # Training
494
+ if args.do_train:
495
+ if args.local_rank not in [-1, 0]:
496
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache
497
+
498
+ train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False)
499
+
500
+ if args.local_rank == 0:
501
+ torch.distributed.barrier()
502
+
503
+ num_collected, num_dropped = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name)
504
+ logger.info(" num_collected = %s, num_dropped = %s", num_collected, num_dropped)
505
+
506
+ if __name__ == "__main__":
507
+ main()
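The filtering rule at the heart of train() above keeps only the instances whose token lengths all stay at or below 256 and re-serializes the survivors per input file. A toy sketch of that rule, with plain Python lists standing in for the dataloader batches (the 256 threshold is the one hard-coded above; the variable names are illustrative):

    import torch

    threshold = 256
    inst = ["ex0", "ex1", "ex2"]                                      # one batch of raw examples
    token_lengths = torch.tensor([[120, 80], [300, 64], [64, 64]])    # per-example token lengths

    if (token_lengths > threshold).sum().item() > 0:
        over = (token_lengths > threshold).sum(-1)                    # over-length count per example
        kept = [inst[i] for i in range(len(inst)) if over[i] == 0]
    else:
        kept = list(inst)

    print(kept)   # ['ex0', 'ex2'] -- the example containing a 300-token segment is dropped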
Optimus/code/examples/big_ae/run_dialog_dataloader.py ADDED
@@ -0,0 +1,483 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
18
+ GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
19
+ using a masked language modeling (MLM) loss.
20
+ """
21
+
22
+ from __future__ import absolute_import, division, print_function
23
+
24
+
25
+ import pdb
26
+ import argparse
27
+ import glob
28
+ import logging
29
+
30
+ import os
31
+ import pickle
32
+ import random
33
+
34
+ import numpy as np
35
+ import torch
36
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
37
+ from torch.utils.data.distributed import DistributedSampler
38
+ from tensorboardX import SummaryWriter
39
+ from tqdm import tqdm, trange
40
+ from collections import defaultdict
41
+
42
+ # from azure.cosmosdb.table.tableservice import TableService
43
+ # from azure.cosmosdb.table.models import Entity
44
+ from datetime import datetime
45
+
46
+
47
+
48
+ from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
49
+ BertConfig, BertForLatentConnector, BertTokenizer,
50
+ GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer,
51
+ OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
52
+ RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
53
+
54
+ from utils import (calc_iwnll, calc_mi, calc_au, Dialog_BucketingDataLoader, TextDataset_Split, TextDataset_2Tokenizers, frange_cycle_linear, frange_cycle_zero_linear)
55
+
56
+
57
+ from modules import VAE
58
+
59
+
60
+ # logging.getLogger("azure").setLevel(logging.WARNING)
61
+ # logging.getLogger("TableService").setLevel(logging.WARNING)
62
+
63
+ logger = logging.getLogger(__name__)
64
+
65
+
66
+ MODEL_CLASSES = {
67
+ 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer),
68
+ 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
69
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer),
70
+ 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
71
+ }
72
+
73
+
74
+ storage_name="textae"
75
+ key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA=="
76
+ # ts = TableService(account_name=storage_name, account_key=key)
77
+
78
+
79
+ def build_dataload_and_cache_examples(args, tokenizer, evaluate=False):
80
+ if isinstance(tokenizer, list):
81
+ if not evaluate:
82
+ args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
83
+ file_path=args.train_data_file
84
+ else:
85
+ args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
86
+ file_path=args.eval_data_file
87
+ dataloader = Dialog_BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True)
88
+ else:
89
+ pass
90
+ return dataloader
91
+
92
+
93
+
94
+
95
+ def set_seed(args):
96
+ random.seed(args.seed)
97
+ np.random.seed(args.seed)
98
+ torch.manual_seed(args.seed)
99
+ if args.n_gpu > 0:
100
+ torch.cuda.manual_seed_all(args.seed)
101
+
102
+
103
+
104
+ def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name):
105
+ """ Train the model """
106
+ if args.local_rank in [-1, 0]:
107
+ tb_writer = SummaryWriter()
108
+
109
+ args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
110
+ # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
111
+ # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
112
+
113
+ if args.max_steps > 0:
114
+ t_total = args.max_steps
115
+ args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
116
+ else:
117
+ t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
118
+
119
+ # Prepare optimizer and schedule (linear warmup and decay)
120
+
121
+
122
+ # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear
123
+ no_decay = ['bias', 'LayerNorm.weight']
124
+ optimizer_grouped_parameters = [
125
+ {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
126
+ {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
127
+ ]
128
+
129
+ optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
130
+ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
131
+
132
+
133
+ if args.fp16:
134
+ try:
135
+ from apex import amp
136
+ except ImportError:
137
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
138
+ model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level)
139
+
140
+ # multi-gpu training (should be after apex fp16 initialization)
141
+ if args.n_gpu > 1:
142
+ model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device)
143
+
144
+ # Distributed training (should be after apex fp16 initialization)
145
+ if args.local_rank != -1:
146
+ model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank],
147
+ output_device=args.local_rank,
148
+ find_unused_parameters=True)
149
+
150
+
151
+ # Train!
152
+ logger.info("***** Running training *****")
153
+ logger.info(" Num examples = %d", train_dataloader.num_examples)
154
+ logger.info(" Num Epochs = %d", args.num_train_epochs)
155
+ logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
156
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
157
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
158
+ logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
159
+ logger.info(" Total optimization steps = %d", t_total)
160
+
161
+ global_step = 0
162
+ tr_loss, logging_loss = 0.0, 0.0
163
+
164
+
165
+ model_vae.zero_grad()
166
+
167
+ # model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training
168
+
169
+ train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
170
+
171
+ n_iter = int(args.num_train_epochs) * len(train_dataloader)
172
+ beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero)
173
+
174
+ tmp_list = []
175
+ set_seed(args) # Added here for reproducibility (even between python 2 and 3)
176
+ for epoch in train_iterator:
177
+ epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
178
+ for step, batch in enumerate(epoch_iterator):
179
+
180
+ input_ids_bert_ctx, input_ids_bert, input_ids_gpt, token_lengths = batch
181
+
182
+ logger.info(f'Context in Bert, Length {token_lengths[0]} ; Tokens: {input_ids_bert_ctx}')
183
+ logger.info(f'Response in Bert, Length {token_lengths[1]} ; Tokens: {input_ids_bert}')
184
+ logger.info(f'Response in GPT2, Length {token_lengths[2]} ; Tokens: {input_ids_gpt}')
185
+ # TODO: write down training scripts for dialog response generation
186
+
187
+
188
+ if (step + 1) % args.gradient_accumulation_steps == 0:
189
+
190
+ global_step += 1
191
+
192
+
193
+ if args.max_steps > 0 and global_step > args.max_steps:
194
+ epoch_iterator.close()
195
+ break
196
+
197
+
198
+ if args.max_steps > 0 and global_step > args.max_steps:
199
+ train_iterator.close()
200
+ break
201
+
202
+ if args.local_rank in [-1, 0]:
203
+ tb_writer.close()
204
+
205
+ return global_step
206
+
207
+
208
+
209
+
210
+
211
+
212
+ def main():
213
+ parser = argparse.ArgumentParser()
214
+
215
+ ## Required parameters
216
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
217
+ help="The input training data file (a text file).")
218
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
219
+ help="The output directory where the model predictions and checkpoints will be written.")
220
+ parser.add_argument("--dataset", default=None, type=str, help="The dataset.")
221
+
222
+ ## Other parameters
223
+ parser.add_argument("--eval_data_file", default=None, type=str,
224
+ help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
225
+ parser.add_argument("--ExpName", default="", type=str,
226
+ help="The experiment name used in Azure Table.")
227
+
228
+ ## Encoder options
229
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
230
+ help="The encoder model architecture to be fine-tuned.")
231
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
232
+ help="The encoder model checkpoint for weights initialization.")
233
+ parser.add_argument("--encoder_config_name", default="", type=str,
234
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
235
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
236
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
237
+
238
+ ## Decoder options
239
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
240
+ help="The decoder model architecture to be fine-tuned.")
241
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
242
+ help="The decoder model checkpoint for weights initialization.")
243
+ parser.add_argument("--decoder_config_name", default="", type=str,
244
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
245
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
246
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
247
+
248
+ ## Variational auto-encoder
249
+ parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.")
250
+ parser.add_argument("--use_deterministic_connect", action='store_true',
251
+ help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.")
252
+ parser.add_argument("--use_pretrained_model", action='store_true',
253
+ help="Use pre-trained auto-encoder models as the initialization")
254
+
255
+ ## Objective functions
256
+ parser.add_argument("--mlm", action='store_true',
257
+ help="Train with masked-language modeling loss instead of language modeling.")
258
+ parser.add_argument("--mlm_probability", type=float, default=0.15,
259
+ help="Ratio of tokens to mask for masked language modeling loss")
260
+ parser.add_argument("--beta", type=float, default=1.0,
261
+ help="The weighting hyper-parameter of the KL term in VAE")
262
+
263
+
264
+ parser.add_argument("--cache_dir", default="", type=str,
265
+ help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)")
266
+ parser.add_argument("--max_seq_length", default=512, type=int,
267
+ help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length")
268
+ parser.add_argument("--block_size", default=-1, type=int,
269
+ help="Optional input sequence length after tokenization."
270
+ "The training dataset will be truncated in block of this size for training."
271
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
272
+ parser.add_argument("--do_train", action='store_true',
273
+ help="Whether to run training.")
274
+ parser.add_argument("--do_eval", action='store_true',
275
+ help="Whether to run eval on the dev set.")
276
+ parser.add_argument("--evaluate_during_training", action='store_true',
277
+ help="Run evaluation during training at each logging step.")
278
+ parser.add_argument("--do_lower_case", action='store_true',
279
+ help="Set this flag if you are using an uncased model.")
280
+
281
+
282
+ # Training Schedules
283
+ parser.add_argument("--ratio_increase", default=0.25, type=float,
284
+ help="Learning schedule, the percentage for the annealing stage.")
285
+ parser.add_argument("--ratio_zero", default=0.25, type=float,
286
+ help="Learning schedule, the percentage for the pure auto-encoding stage.")
287
+ parser.add_argument("--fb_mode", default=0, type=int,
288
+ help="free bit training mode.")
289
+ parser.add_argument("--dim_target_kl", default=3.0, type=float,
290
+ help="dim_target_kl free bit training mode.")
291
+ parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
292
+ help="Batch size per GPU/CPU for training.")
293
+ parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int,
294
+ help="Batch size per GPU/CPU for evaluation.")
295
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
296
+ help="Number of updates steps to accumulate before performing a backward/update pass.")
297
+ parser.add_argument("--learning_rate", default=5e-5, type=float,
298
+ help="The initial learning rate for Adam.")
299
+ parser.add_argument("--weight_decay", default=0.0, type=float,
300
+ help="Weight deay if we apply some.")
301
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float,
302
+ help="Epsilon for Adam optimizer.")
303
+ parser.add_argument("--max_grad_norm", default=1.0, type=float,
304
+ help="Max gradient norm.")
305
+ parser.add_argument("--num_train_epochs", default=1.0, type=float,
306
+ help="Total number of training epochs to perform.")
307
+ parser.add_argument("--max_steps", default=-1, type=int,
308
+ help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
309
+ parser.add_argument("--warmup_steps", default=0, type=int,
310
+ help="Linear warmup over warmup_steps.")
311
+ parser.add_argument("--use_philly", action='store_true',
312
+ help="Use Philly for computing.")
313
+
314
+ ## IO: Logging and Saving
315
+ parser.add_argument('--logging_steps', type=int, default=50,
316
+ help="Log every X updates steps.")
317
+ parser.add_argument('--save_steps', type=int, default=50,
318
+ help="Save checkpoint every X updates steps.")
319
+ parser.add_argument("--eval_all_checkpoints", action='store_true',
320
+ help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
321
+ parser.add_argument("--no_cuda", action='store_true',
322
+ help="Avoid using CUDA when available")
323
+ parser.add_argument('--overwrite_output_dir', action='store_true',
324
+ help="Overwrite the content of the output directory")
325
+ parser.add_argument('--overwrite_cache', action='store_true',
326
+ help="Overwrite the cached training and evaluation sets")
327
+ parser.add_argument('--seed', type=int, default=42,
328
+ help="random seed for initialization")
329
+ parser.add_argument('--gloabl_step_eval', type=int, default=661,
330
+ help="Evaluate the results at the given global step")
331
+
332
+ # Precision & Distributed Training
333
+ parser.add_argument('--fp16', action='store_true',
334
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
335
+ parser.add_argument('--fp16_opt_level', type=str, default='O1',
336
+ help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
337
+ "See details at https://nvidia.github.io/apex/amp.html")
338
+ parser.add_argument("--local_rank", type=int, default=-1,
339
+ help="For distributed training: local_rank")
340
+ parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
341
+ parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
342
+ args = parser.parse_args()
343
+
344
+ if args.decoder_model_type in ["bert", "roberta"] and not args.mlm:
345
+ raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
346
+ "flag (masked language modeling).")
347
+ if args.eval_data_file is None and args.do_eval:
348
+ raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
349
+ "or remove the --do_eval argument.")
350
+
351
+ if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
352
+ raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
353
+
354
+ # Setup distant debugging if needed
355
+ if args.server_ip and args.server_port:
356
+ # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
357
+ import ptvsd
358
+ print("Waiting for debugger attach")
359
+ ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
360
+ ptvsd.wait_for_attach()
361
+
362
+ # Setup CUDA, GPU & distributed training
363
+ if args.local_rank == -1 or args.no_cuda:
364
+ device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
365
+ args.n_gpu = torch.cuda.device_count()
366
+ else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
367
+ torch.cuda.set_device(args.local_rank)
368
+ device = torch.device("cuda", args.local_rank)
369
+ torch.distributed.init_process_group(backend='nccl')
370
+ args.n_gpu = 1
371
+ args.device = device
372
+
373
+ # Setup logging
374
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
375
+ datefmt = '%m/%d/%Y %H:%M:%S',
376
+ level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
377
+ logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
378
+ args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
379
+
380
+ args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero)
381
+ table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size)
382
+ try:
383
+ ts.create_table(table_name)
384
+ except:
385
+ pass
386
+
387
+
388
+ # Set seed
389
+ set_seed(args)
390
+
391
+ # Load pretrained model and tokenizer
392
+ if args.local_rank not in [-1, 0]:
393
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab
394
+
395
+ if args.use_pretrained_model:
396
+
397
+ args.encoder_model_type = args.encoder_model_type.lower()
398
+ args.decoder_model_type = args.decoder_model_type.lower()
399
+
400
+ global_step = args.gloabl_step_eval
401
+
402
+ output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step))
403
+ output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step))
404
+ checkpoints = [ [output_encoder_dir, output_decoder_dir] ]
405
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
406
+
407
+ # Load a trained Encoder model and vocabulary
408
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
409
+ model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size)
410
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
411
+
412
+ model_encoder.to(args.device)
413
+ if args.block_size <= 0:
414
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
415
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
416
+
417
+ # Load a trained Decoder model and vocabulary
418
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
419
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size)
420
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
421
+ model_decoder.to(args.device)
422
+ if args.block_size <= 0:
423
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
424
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
425
+
426
+ else:
427
+ ## Encoder
428
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
429
+ encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path)
430
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
431
+ if args.block_size <= 0:
432
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
433
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
434
+ model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size)
435
+ # model_encoder.to(args.device)
436
+
437
+ ## Decoder
438
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
439
+ decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path)
440
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
441
+ if args.block_size <= 0:
442
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
443
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
444
+ model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size)
445
+
446
+ # pdb.set_trace()  # debugging breakpoint; keep commented out so training runs unattended
447
+
448
+ # Chunyuan: Add Padding token to GPT2
449
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
450
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
451
+ print('We have added', num_added_toks, 'tokens to GPT2')
452
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
453
+ assert tokenizer_decoder.pad_token == '<PAD>'
454
+
455
+ # model_decoder.to(args.device)
456
+
457
+ model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) #
458
+
459
+ # on_gpu = next(model_vae.parameters()).is_cuda
460
+
461
+
462
+
463
+ if args.local_rank == 0:
464
+ torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab
465
+
466
+ logger.info("Training/evaluation parameters %s", args)
467
+
468
+ global_step= 0
469
+ # Training
470
+ if args.do_train:
471
+ if args.local_rank not in [-1, 0]:
472
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache
473
+
474
+ train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False)
475
+
476
+ if args.local_rank == 0:
477
+ torch.distributed.barrier()
478
+
479
+ global_step = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name)
480
+ logger.info(" global_step = %s", global_step)
481
+
482
+ if __name__ == "__main__":
483
+ main()
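A minimal sketch of the annealing behaviour that the training-schedule flags above (--beta, --ratio_zero, --ratio_increase, --fb_mode, --dim_target_kl) are meant to control, assuming a single annealing cycle as in the Optimus setup; the function name and exact ramp are illustrative, not the script's own implementation.

def beta_schedule(step, total_steps, beta_max=1.0, ratio_zero=0.25, ratio_increase=0.25):
    # Stage 1: pure auto-encoding, KL weight held at zero for the first ratio_zero fraction of steps.
    zero_steps = int(total_steps * ratio_zero)
    if step < zero_steps:
        return 0.0
    # Stage 2: linear annealing of the KL weight up to beta_max over the next ratio_increase fraction.
    ramp_steps = max(1, int(total_steps * ratio_increase))
    return min(beta_max, beta_max * (step - zero_steps) / ramp_steps)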
Optimus/code/examples/big_ae/run_encoding_generation.py ADDED
@@ -0,0 +1,487 @@
+ #!/usr/bin/env python3
2
+ # coding=utf-8
3
+ # Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
4
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
5
+ #
6
+ # Licensed under the Apache License, Version 2.0 (the "License");
7
+ # you may not use this file except in compliance with the License.
8
+ # You may obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15
+ # See the License for the specific language governing permissions and
16
+ # limitations under the License.
17
+ """ Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet)
18
+ """
19
+ from __future__ import absolute_import, division, print_function, unicode_literals
20
+
21
+ import argparse
22
+ import glob
23
+ import logging
24
+ import os
25
+ import pickle
26
+ import random
27
+
28
+
29
+ import torch
30
+ import torch.nn.functional as F
31
+ import numpy as np
32
+
33
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
34
+ from torch.utils.data.distributed import DistributedSampler
35
+ from tqdm import tqdm, trange
36
+
37
+
38
+ from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig
39
+ from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector
40
+ from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
41
+ from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer
42
+ from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
43
+ from pytorch_transformers import BertForLatentConnector, BertTokenizer
44
+
45
+ from collections import defaultdict
46
+ from modules import VAE
47
+ from utils import (TextDataset_Split, TextDataset_2Tokenizers, BucketingDataLoader)
48
+
49
+
50
+ import pdb
51
+
52
+
53
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
54
+ datefmt = '%m/%d/%Y %H:%M:%S',
55
+ level = logging.INFO)
56
+ logger = logging.getLogger(__name__)
57
+
58
+ MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop
59
+
60
+ ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ())
61
+
62
+ MODEL_CLASSES = {
63
+ 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer),
64
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer)
65
+ }
66
+
67
+ # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
68
+ # in https://github.com/rusiaaman/XLNet-gen#methodology
69
+ # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e
70
+ PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family
71
+ (except for Alexei and Maria) are discovered.
72
+ The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
73
+ remainder of the story. 1883 Western Siberia,
74
+ a young Grigori Rasputin is asked by his father and a group of men to perform magic.
75
+ Rasputin has a vision and denounces one of the men as a horse thief. Although his
76
+ father initially slaps him for making such an accusation, Rasputin watches as the
77
+ man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
78
+ the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
79
+ with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
80
+
81
+
82
+ def set_seed(args):
83
+ np.random.seed(args.seed)
84
+ torch.manual_seed(args.seed)
85
+ if args.n_gpu > 0:
86
+ torch.cuda.manual_seed_all(args.seed)
87
+
88
+
89
+ def load_and_cache_examples(args, tokenizer, evaluate=False):
90
+ if isinstance(tokenizer, list):
91
+ dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
92
+ else:
93
+ dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
94
+ return dataset
95
+
96
+ def build_dataload_and_cache_examples(args, tokenizer, evaluate=False):
97
+ if isinstance(tokenizer, list):
98
+ if not evaluate:
99
+ args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
100
+ file_path=args.train_data_file
101
+ else:
102
+ args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
103
+ file_path=args.eval_data_file
104
+ dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=False)
105
+ else:
106
+ pass
107
+ return dataloader
108
+
109
+
110
+ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
111
+ """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
112
+ Args:
113
+ logits: logits distribution shape (vocabulary size)
114
+ top_k > 0: keep only top k tokens with highest probability (top-k filtering).
115
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
116
+ Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
117
+ From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
118
+ """
119
+ assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
120
+ top_k = min(top_k, logits.size(-1)) # Safety check
121
+ if top_k > 0:
122
+ # Remove all tokens with a probability less than the last token of the top-k
123
+ indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
124
+ logits[indices_to_remove] = filter_value
125
+
126
+ if top_p > 0.0:
127
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True)
128
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
129
+
130
+ # Remove tokens with cumulative probability above the threshold
131
+ sorted_indices_to_remove = cumulative_probs > top_p
132
+ # Shift the indices to the right to keep also the first token above the threshold
133
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
134
+ sorted_indices_to_remove[..., 0] = 0
135
+
136
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
137
+ logits[indices_to_remove] = filter_value
138
+ return logits
139
+
140
+
141
+ def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu'):
142
+ context = torch.tensor(context, dtype=torch.long, device=device)
143
+ context = context.unsqueeze(0).repeat(num_samples, 1)
144
+ generated = context
145
+ with torch.no_grad():
146
+ for _ in trange(length):
147
+
148
+ inputs = {'input_ids': generated}
149
+ if is_xlnet:
150
+ # XLNet is a direct (predict same token, not next token) and bi-directional model by default
151
+ # => need one additional dummy token in the input (will be masked), attention mask and target mapping (see model docstring)
152
+ input_ids = torch.cat((generated, torch.zeros((1, 1), dtype=torch.long, device=device)), dim=1)
153
+ perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float, device=device)
154
+ perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
155
+ target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float, device=device)
156
+ target_mapping[0, 0, -1] = 1.0 # predict last token
157
+ inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping}
158
+
159
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
160
+ next_token_logits = outputs[0][0, -1, :] / temperature
161
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
162
+ next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
163
+ generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
164
+ return generated
165
+
166
+ def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None):
167
+
168
+ context = torch.tensor(context, dtype=torch.long, device=device)
169
+ context = context.unsqueeze(0).repeat(num_samples, 1)
170
+ generated = context
171
+ with torch.no_grad():
172
+ while True:
173
+ # for _ in trange(length):
174
+ inputs = {'input_ids': generated, 'past': past}
175
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
176
+ next_token_logits = outputs[0][0, -1, :] / temperature
177
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
178
+ next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
179
+ generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
180
+
181
+ # pdb.set_trace()
182
+ if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('<EOS>')[0]:
183
+ break
184
+
185
+ return generated
186
+
187
+
188
+
189
+ # a wrapper function to choose between different play modes
190
+ def evaluate_latent_space(args, model_vae, encoder_tokenizer, decoder_tokenizer, prefix=""):
191
+
192
+ eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=False)
193
+
194
+ # Eval!
195
+ logger.info("***** Running recontruction evaluation {} *****".format(prefix))
196
+ logger.info(" Num examples = %d", len(eval_dataloader))
197
+ logger.info(" Batch size = %d", args.per_gpu_eval_batch_size)
198
+
199
+ model_vae.eval()
200
+
201
+ model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training
202
+
203
+ if args.play_mode == 'reconstruction':
204
+ result = calc_rec(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=100)
205
+ result_file_name = "eval_recontruction_results.txt"
206
+ elif args.play_mode == 'interpolation':
207
+ result = calc_interpolate(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=100)
208
+ result_file_name = "eval_interpolation_results.txt"
209
+ else:
210
+ logger.info("Please specify the corrent play mode [reconstrction, interpolation]")
211
+
212
+
213
+ eval_output_dir = args.output_dir
214
+ output_eval_file = os.path.join(eval_output_dir, result_file_name)
215
+
216
+ with open(output_eval_file, "w") as writer:
217
+ logger.info("***** Eval {} results *****".format(args.play_mode))
218
+ for key in sorted(result.keys()):
219
+ logger.info(" %s \n %s", key, str(result[key]))
220
+ writer.write("%s \n %s\n" % (key, str(result[key])))
221
+
222
+ return result
223
+
224
+
225
+ def calc_rec(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=1):
226
+
227
+ count = 0
228
+ result = defaultdict(str)
229
+ for batch in tqdm(eval_dataloader, desc="Evaluating recontruction"):
230
+ # pdb.set_trace()
231
+ x0, x1, x_lengths = batch
232
+
233
+ max_len_values, _ = x_lengths.max(0)
234
+ x0 = x0[:,:max_len_values[0]]
235
+ x1 = x1[:,:max_len_values[1]]
236
+
237
+ x0 = x0.to(args.device)
238
+ x1 = x1.to(args.device)
239
+ x_lengths = x_lengths.to(args.device)
240
+
241
+ context_tokens = decoder_tokenizer.encode('<BOS>')
242
+
243
+ with torch.no_grad():
244
+
245
+ text_x0 = encoder_tokenizer.decode(x0[0,:x_lengths[0,0]].tolist(), clean_up_tokenization_spaces=True)[0]
246
+ # result["INPUT TEXT " + str(count)].append(text_x0)
247
+
248
+ pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
249
+
250
+ # Connect hidden feature to the latent space
251
+ # latent_z, loss_kl = model_vae.connect(pooled_hidden_fea)
252
+ mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
253
+ latent_z = mean.squeeze(1)
254
+
255
+ past = latent_z
256
+ out = sample_sequence_conditional(
257
+ model=model_vae.decoder,
258
+ context=context_tokens,
259
+ past=past,
260
+ length=x_lengths[0,1], # Chunyuan: Fix length; or use <EOS> to complete a sentence
261
+ temperature=args.temperature,
262
+ top_k=args.top_k,
263
+ top_p=args.top_p,
264
+ device=args.device,
265
+ decoder_tokenizer = decoder_tokenizer
266
+ )
267
+ text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True)
268
+ text_x1 = text_x1.split()[1:-1]
269
+ text_x1 = ' '.join(text_x1) + '\n'
270
+ result[text_x0] = text_x1
271
+
272
+ count += 1
273
+ if count>args.total_sents:
274
+ break
275
+
276
+
277
+ return result
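A minimal sketch of the "connect" step used above: the encoder's linear head emits 2 * latent_size values that are split into a mean and a log-variance, and deterministic reconstruction keeps only the mean. The 768-dimensional pooled feature is an assumption matching bert-base; the tensors here are stand-ins.

import torch
import torch.nn as nn

latent_size = 32
linear = nn.Linear(768, 2 * latent_size)      # plays the role of model_vae.encoder.linear
pooled_hidden_fea = torch.randn(1, 768)       # stand-in for the BERT pooled feature
mean, logvar = linear(pooled_hidden_fea).chunk(2, -1)
latent_z = mean                                # shape (1, latent_size); used as `past` for the GPT-2 decoder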
278
+
279
+
280
+
281
+
282
+ def calc_interpolate(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=1):
283
+
284
+ count = 0
285
+ latent_codes = []
286
+ sample_interval = 0
287
+ for batch in tqdm(eval_dataloader, desc="Evaluating interpolation"):
288
+ # pdb.set_trace()
289
+ x0, x1, x_lengths = batch
290
+
291
+ max_len_values, _ = x_lengths.max(0)
292
+ x0 = x0[:,:max_len_values[0]]
293
+ x0 = x0.to(args.device)
294
+ x_lengths = x_lengths.to(args.device)
295
+
296
+
297
+ with torch.no_grad():
298
+ if sample_interval == 0 or sample_interval == args.total_sents:
299
+ text_x0 = encoder_tokenizer.decode(x0[0,:x_lengths[0,0]].tolist(), clean_up_tokenization_spaces=True)[0]
300
+ pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
301
+
302
+ # Connect hidden feature to the latent space
303
+ mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
304
+ latent_z = mean.squeeze(1)
305
+
306
+ latent_codes.append(latent_z)
307
+
308
+ if sample_interval == 5:
309
+ latent_codes.append(latent_z)
310
+ sample_interval = 0
311
+ continue
312
+ else:
313
+ sample_interval += 1
314
+ continue
315
+
316
+ count += 1
317
+ if count>args.total_sents:
318
+ break
319
+
320
+ context_tokens = decoder_tokenizer.encode('<BOS>')
321
+ result = defaultdict(str)
322
+ latent_codes_interpolation = []
323
+ num_steps = args.num_interpolation_steps
324
+ for step in range(num_steps+1):
325
+ latent_z = latent_codes[0] + (latent_codes[1] - latent_codes[0]) * step * 1.0/num_steps
326
+
327
+ past = latent_z
328
+ out = sample_sequence_conditional(
329
+ model=model_vae.decoder,
330
+ context=context_tokens,
331
+ past=past,
332
+ length=x_lengths[0,1], # Chunyuan: Fix length; or use <EOS> to complete a sentence
333
+ temperature=args.temperature,
334
+ top_k=args.top_k,
335
+ top_p=args.top_p,
336
+ device=args.device,
337
+ decoder_tokenizer = decoder_tokenizer
338
+ )
339
+ text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True)
340
+ text_x1 = text_x1.split()[1:-1]
341
+ text_x1 = ' '.join(text_x1)
342
+ result[step] = text_x1
343
+
344
+ return result
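A minimal sketch of the interpolation schedule implemented above: the two stored latent codes are mixed linearly, z_t = z_0 + (z_1 - z_0) * t / num_steps, and each mixture is decoded into a sentence. The random codes below are stand-ins for encoded sentences.

import torch

z0, z1 = torch.randn(1, 32), torch.randn(1, 32)
num_steps = 10
interpolants = [z0 + (z1 - z0) * step / num_steps for step in range(num_steps + 1)]
# interpolants[0] equals z0 and interpolants[-1] equals z1; the rest trace a straight line in latent space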
345
+
346
+
347
+
348
+
349
+ def main():
350
+ parser = argparse.ArgumentParser()
351
+
352
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
353
+ help="The input training data file (a text file).")
354
+ parser.add_argument("--eval_data_file", default=None, type=str,
355
+ help="An input evaluation data file to evaluate the perplexity on (a text file).")
356
+ parser.add_argument("--checkpoint_dir", default=None, type=str, required=True,
357
+ help="The directory where checkpoints are saved.")
358
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
359
+ help="The output directory where the model predictions and checkpoints will be written.")
360
+ parser.add_argument("--dataset", default='Snli', type=str, help="The dataset.")
361
+
362
+ ## Variational auto-encoder
363
+ parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.")
364
+ parser.add_argument("--total_sents", default=10, type=int, help="Total sentences to test recontruction.")
365
+ parser.add_argument("--num_interpolation_steps", default=10, type=int, help="Total sentences to test recontruction.")
366
+ parser.add_argument("--play_mode", default="interpolation", type=str,
367
+ help="interpolation or reconstruction.")
368
+
369
+
370
+ ## Encoder options
371
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
372
+ help="The encoder model architecture to be fine-tuned.")
373
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
374
+ help="The encoder model checkpoint for weights initialization.")
375
+ parser.add_argument("--encoder_config_name", default="", type=str,
376
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
377
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
378
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
379
+
380
+ ## Decoder options
381
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
382
+ help="The decoder model architecture to be fine-tuned.")
383
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
384
+ help="The decoder model checkpoint for weights initialization.")
385
+ parser.add_argument("--decoder_config_name", default="", type=str,
386
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
387
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
388
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
389
+
390
+
391
+ parser.add_argument("--per_gpu_train_batch_size", default=1, type=int,
392
+ help="Batch size per GPU/CPU for training.")
393
+ parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int,
394
+ help="Batch size per GPU/CPU for evaluation.")
395
+ parser.add_argument('--gloabl_step_eval', type=int, default=661,
396
+ help="Evaluate the results at the given global step")
397
+
398
+ parser.add_argument("--max_seq_length", default=512, type=int,
399
+ help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length")
400
+
401
+
402
+ ## Variational auto-encoder
403
+ parser.add_argument("--nz", default=32, type=int,
404
+ help="Latent space dimension.")
405
+
406
+ parser.add_argument("--prompt", type=str, default="")
407
+ parser.add_argument("--padding_text", type=str, default="")
408
+ parser.add_argument("--length", type=int, default=20)
409
+ parser.add_argument("--temperature", type=float, default=1.0)
410
+ parser.add_argument("--top_k", type=int, default=0)
411
+ parser.add_argument("--top_p", type=float, default=0.9)
412
+ parser.add_argument("--no_cuda", action='store_true',
413
+ help="Avoid using CUDA when available")
414
+ parser.add_argument('--seed', type=int, default=42,
415
+ help="random seed for initialization")
416
+
417
+ parser.add_argument("--block_size", default=-1, type=int,
418
+ help="Optional input sequence length after tokenization."
419
+ "The training dataset will be truncated in block of this size for training."
420
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
421
+ parser.add_argument("--do_lower_case", action='store_true',
422
+ help="Set this flag if you are using an uncased model.")
423
+
424
+ parser.add_argument("--use_philly", action='store_true',
425
+ help="Use Philly for computing.")
426
+
427
+ args = parser.parse_args()
428
+
429
+ args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
430
+ args.n_gpu = torch.cuda.device_count()
431
+
432
+ set_seed(args)
433
+
434
+
435
+ args.encoder_model_type = args.encoder_model_type.lower()
436
+ args.decoder_model_type = args.decoder_model_type.lower()
437
+
438
+
439
+ global_step = args.gloabl_step_eval
440
+
441
+ output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step))
442
+ output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step))
443
+ checkpoints = [ [output_encoder_dir, output_decoder_dir] ]
444
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
445
+
446
+ # Load a trained Encoder model and vocabulary that you have fine-tuned
447
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
448
+ model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size)
449
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
450
+
451
+ model_encoder.to(args.device)
452
+ if args.block_size <= 0:
453
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
454
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
455
+
456
+ # Load a trained Decoder model and vocabulary that you have fine-tuned
457
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
458
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size)
459
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
460
+ model_decoder.to(args.device)
461
+ if args.block_size <= 0:
462
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
463
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
464
+
465
+ # Load full model
466
+ output_full_dir = os.path.join(args.checkpoint_dir, 'checkpoint-full-{}'.format(global_step))
467
+ checkpoint = torch.load(os.path.join(output_full_dir, 'training.bin'))
468
+
469
+ # Chunyuan: Add Padding token to GPT2
470
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
471
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
472
+ print('We have added', num_added_toks, 'tokens to GPT2')
473
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
474
+ assert tokenizer_decoder.pad_token == '<PAD>'
475
+
476
+
477
+ # Evaluation
478
+ model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args)
479
+ model_vae.load_state_dict(checkpoint['model_state_dict'])
480
+ logger.info("Pre-trained Optimus is successfully loaded")
481
+ model_vae.to(args.device)
482
+
483
+ result = evaluate_latent_space(args, model_vae, tokenizer_encoder, tokenizer_decoder, prefix=global_step)
484
+
485
+
486
+ if __name__ == '__main__':
487
+ main()
Optimus/code/examples/big_ae/run_generation_from_prior.py ADDED
@@ -0,0 +1,414 @@
+ #!/usr/bin/env python3
2
+ # coding=utf-8
3
+ # Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
4
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
5
+ #
6
+ # Licensed under the Apache License, Version 2.0 (the "License");
7
+ # you may not use this file except in compliance with the License.
8
+ # You may obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15
+ # See the License for the specific language governing permissions and
16
+ # limitations under the License.
17
+ """ Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet)
18
+ """
19
+ from __future__ import absolute_import, division, print_function, unicode_literals
20
+
21
+ import argparse
22
+ import glob
23
+ import logging
24
+ import os
25
+ import pickle
26
+ import random
27
+
28
+
29
+ cwd = os.getcwd()
30
+ print(f"Current working dir is {cwd}")
31
+
32
+ import sys
33
+ sys.path.append('./')
34
+ pt_path = os.path.join( cwd, 'pytorch_transformers')
35
+ sys.path.append(pt_path)
36
+ print(f"Pytorch Transformer {pt_path}")
37
+
38
+ import torch
39
+ import torch.nn.functional as F
40
+ import numpy as np
41
+
42
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
43
+ from torch.utils.data.distributed import DistributedSampler
44
+ from tqdm import tqdm, trange
45
+
46
+
47
+ from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig
48
+ from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector
49
+ from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
50
+ from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer
51
+ from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
52
+ from pytorch_transformers import BertForLatentConnector, BertTokenizer
53
+
54
+ import pytorch_transformers
55
+
56
+ from collections import defaultdict
57
+ from modules import VAE
58
+ from utils import (TextDataset_Split, TextDataset_2Tokenizers, BucketingDataLoader)
59
+ from metrics import Bleu, SelfBleu
60
+
61
+
62
+
63
+ import pdb
64
+
65
+
66
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
67
+ datefmt = '%m/%d/%Y %H:%M:%S',
68
+ level = logging.INFO)
69
+ logger = logging.getLogger(__name__)
70
+
71
+ MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop
72
+
73
+ ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ())
74
+
75
+ MODEL_CLASSES = {
76
+ 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer),
77
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer)
78
+ }
79
+
80
+ # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
81
+ # in https://github.com/rusiaaman/XLNet-gen#methodology
82
+ # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e
83
+ PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family
84
+ (except for Alexei and Maria) are discovered.
85
+ The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
86
+ remainder of the story. 1883 Western Siberia,
87
+ a young Grigori Rasputin is asked by his father and a group of men to perform magic.
88
+ Rasputin has a vision and denounces one of the men as a horse thief. Although his
89
+ father initially slaps him for making such an accusation, Rasputin watches as the
90
+ man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
91
+ the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
92
+ with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
93
+
94
+
95
+ def set_seed(args):
96
+ np.random.seed(args.seed)
97
+ torch.manual_seed(args.seed)
98
+ if args.n_gpu > 0:
99
+ torch.cuda.manual_seed_all(args.seed)
100
+
101
+
102
+ def load_and_cache_examples(args, tokenizer, evaluate=False):
103
+ if isinstance(tokenizer, list):
104
+ dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
105
+ else:
106
+ dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
107
+ return dataset
108
+
109
+ def build_dataload_and_cache_examples(args, tokenizer, evaluate=False):
110
+ if isinstance(tokenizer, list):
111
+ if not evaluate:
112
+ args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
113
+ file_path=args.train_data_file
114
+ else:
115
+ args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
116
+ file_path=args.eval_data_file
117
+ dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=False)
118
+ else:
119
+ pass
120
+ return dataloader
121
+
122
+
123
+ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
124
+ """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
125
+ Args:
126
+ logits: logits distribution shape (vocabulary size)
127
+ top_k > 0: keep only top k tokens with highest probability (top-k filtering).
128
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
129
+ Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
130
+ From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
131
+ """
132
+ assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
133
+
134
+ # top-k
135
+ top_k = min(top_k, logits.size(-1)) # Safety check
136
+ if top_k > 0:
137
+ # Remove all tokens with a probability less than the last token of the top-k
138
+ indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
139
+ logits[indices_to_remove] = filter_value
140
+
141
+ # top-p
142
+ if top_p > 0.0:
143
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True)
144
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
145
+
146
+ # Remove tokens with cumulative probability above the threshold
147
+ sorted_indices_to_remove = cumulative_probs > top_p
148
+ # Shift the indices to the right to keep also the first token above the threshold
149
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
150
+ sorted_indices_to_remove[..., 0] = 0
151
+
152
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
153
+ logits[indices_to_remove] = filter_value
154
+ return logits
155
+
156
+
157
+ def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu'):
158
+ context = torch.tensor(context, dtype=torch.long, device=device)
159
+ context = context.unsqueeze(0).repeat(num_samples, 1)
160
+ generated = context
161
+ with torch.no_grad():
162
+ for _ in trange(length):
163
+
164
+ inputs = {'input_ids': generated}
165
+ if is_xlnet:
166
+ # XLNet is a direct (predict same token, not next token) and bi-directional model by default
167
+ # => need one additional dummy token in the input (will be masked), attention mask and target mapping (see model docstring)
168
+ input_ids = torch.cat((generated, torch.zeros((1, 1), dtype=torch.long, device=device)), dim=1)
169
+ perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float, device=device)
170
+ perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
171
+ target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float, device=device)
172
+ target_mapping[0, 0, -1] = 1.0 # predict last token
173
+ inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping}
174
+
175
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
176
+ next_token_logits = outputs[0][0, -1, :] / temperature
177
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
178
+ next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
179
+ generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
180
+ return generated
181
+
182
+ def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None, max_seq_length=-1):
183
+
184
+ context = torch.tensor(context, dtype=torch.long, device=device)
185
+ context = context.unsqueeze(0).repeat(num_samples, 1)
186
+ generated = context
187
+ gen_seq_length = 0
188
+ with torch.no_grad():
189
+ while True:
190
+ inputs = {'input_ids': generated, 'past': past}
191
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
192
+ next_token_logits = outputs[0][0, -1, :] / temperature
193
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
194
+ next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
195
+ generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
196
+ gen_seq_length += 1
197
+ # pdb.set_trace()
198
+ if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('<EOS>')[0]:
199
+ break
200
+ if max_seq_length>0 and gen_seq_length>max_seq_length:
201
+ break
202
+
203
+ return generated
204
+
205
+
206
+ def evaluate_generation_fromp_prior(model_vae, decoder_tokenizer, args, ns=1):
207
+
208
+ loc = torch.zeros([args.nz]).to(args.device)
209
+ scale = torch.ones([args.nz]).to(args.device)
210
+ prior = torch.distributions.normal.Normal(loc, scale)
211
+
212
+ context_tokens = decoder_tokenizer.encode('<BOS>')
213
+
214
+ count = 0
215
+ result = defaultdict(str)
216
+ for i in tqdm(range(args.num_sents)):
217
+
218
+ with torch.no_grad():
219
+ latent_z = prior.sample()
220
+ # pdb.set_trace()
221
+ past = model_vae.decoder.linear(latent_z.unsqueeze(0))
222
+
223
+ # pdb.set_trace()
224
+ out = sample_sequence_conditional(
225
+ model=model_vae.decoder,
226
+ context=context_tokens,
227
+ past=past,
228
+ length=args.max_seq_length, # Chunyuan: Fix length; or use <EOS> to complete a sentence
229
+ temperature=args.temperature,
230
+ top_k=args.top_k,
231
+ top_p=args.top_p,
232
+ device=args.device,
233
+ decoder_tokenizer = decoder_tokenizer,
234
+ max_seq_length = args.max_seq_length
235
+ )
236
+ text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True)
237
+ text_x1 = text_x1.split()[1:-1]
238
+ text_x1 = ' '.join(text_x1) + '\n'
239
+ result[i] = text_x1
240
+
241
+ if args.use_philly:
242
+ print("PROGRESS: {}%".format( round(100 * i /args.num_sents , 4)))
243
+
244
+ with open(args.output_generation_file, "w") as writer:
245
+ logger.info("***** SHOW generated sentences from prior *****")
246
+ for key in sorted(result.keys()):
247
+ # logger.info(" %s \n %s", key, str(result[key]))
248
+ # writer.write("%s \n %s\n" % (key, str(result[key])))
249
+ writer.write("%s" % str(result[key]))
250
+
251
+ return result
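A minimal sketch of the prior-sampling loop above: a latent code is drawn from a standard normal prior and projected by the decoder's linear layer into the `past` that conditions GPT-2 generation. The latent size and the commented projection are assumptions mirroring the code above.

import torch

nz = 32
prior = torch.distributions.normal.Normal(torch.zeros(nz), torch.ones(nz))
latent_z = prior.sample()                                  # z ~ N(0, I), shape (nz,)
# past = model_vae.decoder.linear(latent_z.unsqueeze(0))   # conditions sample_sequence_conditional above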
252
+
253
+
254
+ # bleu = evaluate_bleu(results, args)
255
+
256
+
257
+
258
+
259
+
260
+
261
+ def main():
262
+ parser = argparse.ArgumentParser()
263
+
264
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
265
+ help="The input training data file (a text file).")
266
+ parser.add_argument("--eval_data_file", default=None, type=str,
267
+ help="An input evaluation data file to evaluate the perplexity on (a text file).")
268
+ parser.add_argument("--checkpoint_dir", default=None, type=str, required=True,
269
+ help="The directory where checkpoints are saved.")
270
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
271
+ help="The output directory where the model predictions and checkpoints will be written.")
272
+ parser.add_argument("--dataset", default='Snli', type=str, help="The dataset.")
273
+
274
+ ## Variational auto-encoder
275
+ parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.")
276
+ parser.add_argument("--total_sents", default=10, type=int, help="Total sentences to test recontruction.")
277
+ parser.add_argument("--num_sents", default=10, type=int, help="Total sentences to generate.")
278
+
279
+
280
+ ## Encoder options
281
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
282
+ help="The encoder model architecture to be fine-tuned.")
283
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
284
+ help="The encoder model checkpoint for weights initialization.")
285
+ parser.add_argument("--encoder_config_name", default="", type=str,
286
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
287
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
288
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
289
+
290
+ ## Decoder options
291
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
292
+ help="The decoder model architecture to be fine-tuned.")
293
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
294
+ help="The decoder model checkpoint for weights initialization.")
295
+ parser.add_argument("--decoder_config_name", default="", type=str,
296
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
297
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
298
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
299
+
300
+
301
+ parser.add_argument("--per_gpu_train_batch_size", default=1, type=int,
302
+ help="Batch size per GPU/CPU for training.")
303
+ parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int,
304
+ help="Batch size per GPU/CPU for evaluation.")
305
+ parser.add_argument('--gloabl_step_eval', type=int, default=661,
306
+ help="Evaluate the results at the given global step")
307
+
308
+ parser.add_argument("--max_seq_length", default=512, type=int,
309
+ help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length")
310
+
311
+
312
+ ## Variational auto-encoder
313
+ parser.add_argument("--nz", default=32, type=int,
314
+ help="Latent space dimension.")
315
+
316
+ parser.add_argument("--prompt", type=str, default="")
317
+ parser.add_argument("--padding_text", type=str, default="")
318
+ parser.add_argument("--length", type=int, default=20)
319
+ parser.add_argument("--temperature", type=float, default=1.0)
320
+ parser.add_argument("--top_k", type=int, default=0)
321
+ parser.add_argument("--top_p", type=float, default=0.9)
322
+ parser.add_argument("--no_cuda", action='store_true',
323
+ help="Avoid using CUDA when available")
324
+ parser.add_argument('--seed', type=int, default=42,
325
+ help="random seed for initialization")
326
+
327
+ parser.add_argument("--block_size", default=-1, type=int,
328
+ help="Optional input sequence length after tokenization."
329
+ "The training dataset will be truncated in block of this size for training."
330
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
331
+ parser.add_argument("--do_lower_case", action='store_true',
332
+ help="Set this flag if you are using an uncased model.")
333
+
334
+ parser.add_argument("--use_philly", action='store_true',
335
+ help="Use Philly for computing.")
336
+
337
+ args = parser.parse_args()
338
+
339
+ args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
340
+ args.n_gpu = torch.cuda.device_count()
341
+
342
+ set_seed(args)
343
+
344
+
345
+ args.encoder_model_type = args.encoder_model_type.lower()
346
+ args.decoder_model_type = args.decoder_model_type.lower()
347
+
348
+
349
+ global_step = args.gloabl_step_eval
350
+
351
+ output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step))
352
+ output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step))
353
+ checkpoints = [ [output_encoder_dir, output_decoder_dir] ]
354
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
355
+
356
+ # Load a trained Encoder model and vocabulary that you have fine-tuned
357
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
358
+ model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size)
359
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
360
+
361
+ model_encoder.to(args.device)
362
+ if args.block_size <= 0:
363
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
364
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
365
+
366
+ # Load a trained Decoder model and vocabulary that you have fine-tuned
367
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
368
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size)
369
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
370
+ model_decoder.to(args.device)
371
+ if args.block_size <= 0:
372
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
373
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
374
+
375
+ # pdb.set_trace()
376
+ # Chunyuan: Add Padding token to GPT2
377
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
378
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
379
+ print('We have added', num_added_toks, 'tokens to GPT2')
380
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
381
+ assert tokenizer_decoder.pad_token == '<PAD>'
382
+
383
+
384
+ # Evaluation
385
+ model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device)
386
+
387
+ if not os.path.exists(args.output_dir): os.makedirs(args.output_dir)
388
+ args.output_generation_file = os.path.join(args.output_dir, f"generation_from_vae_prior_t{args.temperature}_p{args.top_p}.txt")
389
+ # args.output_generation_file = args.train_data_file
390
+ result = evaluate_generation_fromp_prior(model_vae, tokenizer_decoder, args)
391
+
392
+
393
+ bleu5 = Bleu(test_text= args.output_generation_file,
394
+ real_text=args.eval_data_file,
395
+ num_real_sentences=args.num_sents,
396
+ num_fake_sentences=args.num_sents,
397
+ gram=5).get_score()
398
+ logger.info(f'The bleu score is {bleu5}')
399
+
400
+ sbleu5 = SelfBleu(test_text= args.output_generation_file,
401
+ num_sentences=args.num_sents,
402
+ gram=5).get_score()
403
+ logger.info(f'The self-bleu score is {sbleu5}')
404
+
405
+ args.eval_results_file = os.path.join(args.output_dir, f"eval_results_t{args.temperature}_p{args.top_p}.txt")
406
+ eval_results = {'bleu5':bleu5 , 'sbleu5':sbleu5}
407
+ with open(args.eval_results_file, "w") as writer:
408
+ logger.info("***** SHOW the quantative evalution results *****")
409
+ for key in sorted(eval_results.keys()):
410
+ writer.write("%s %s" % (key, str(eval_results[key])) )
411
+
412
+
413
+ if __name__ == '__main__':
414
+ main()
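The two scores above come from the repo's Bleu and SelfBleu classes in metrics.py: BLEU-5 compares the generated file against the reference corpus, while Self-BLEU-5 compares each generation against the other generations (higher Self-BLEU means less diversity). As a rough, self-contained sketch of what these numbers measure, not the repo's implementation, and assuming nltk is available:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def approx_bleu5(gen_file, ref_file, n=100):
    # Average BLEU-5 of each generated sentence against the reference corpus.
    gen = [line.split() for line in open(gen_file, encoding="utf-8") if line.strip()][:n]
    ref = [line.split() for line in open(ref_file, encoding="utf-8") if line.strip()][:n]
    smooth = SmoothingFunction().method1
    weights = (0.2,) * 5
    return sum(sentence_bleu(ref, g, weights=weights, smoothing_function=smooth)
               for g in gen) / len(gen)

def approx_self_bleu5(gen_file, n=100):
    # Self-BLEU-5: each generation is scored against all the other generations.
    gen = [line.split() for line in open(gen_file, encoding="utf-8") if line.strip()][:n]
    smooth = SmoothingFunction().method1
    weights = (0.2,) * 5
    return sum(sentence_bleu(gen[:i] + gen[i + 1:], g, weights=weights, smoothing_function=smooth)
               for i, g in enumerate(gen)) / len(gen)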
Optimus/code/examples/big_ae/run_gpt2_generation.py ADDED
@@ -0,0 +1,390 @@
1
+ #!/usr/bin/env python3
2
+ # coding=utf-8
3
+ # Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
4
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
5
+ #
6
+ # Licensed under the Apache License, Version 2.0 (the "License");
7
+ # you may not use this file except in compliance with the License.
8
+ # You may obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15
+ # See the License for the specific language governing permissions and
16
+ # limitations under the License.
17
+ """ Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet)
18
+ """
19
+ from __future__ import absolute_import, division, print_function, unicode_literals
20
+
21
+ import argparse
22
+ import glob
23
+ import logging
24
+ import os
25
+ import pickle
26
+ import random
27
+
28
+
29
+ cwd = os.getcwd()
30
+ print(f"Current working dir is {cwd}")
31
+
32
+ import sys
33
+ sys.path.append('./')
34
+ pt_path = os.path.join( cwd, 'pytorch_transformers')
35
+ sys.path.append(pt_path)
36
+ print(f"Pytorch Transformer {pt_path}")
37
+
38
+ import torch
39
+ import torch.nn.functional as F
40
+ import numpy as np
41
+
42
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
43
+ from torch.utils.data.distributed import DistributedSampler
44
+ from tqdm import tqdm, trange
45
+
46
+
47
+ from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig
48
+ from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector
49
+ from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
50
+ from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer
51
+ from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
52
+ from pytorch_transformers import BertForLatentConnector, BertTokenizer
53
+
54
+ import pytorch_transformers
55
+
56
+ from collections import defaultdict
57
+ from modules import VAE
58
+ from utils import (TextDataset_Split, TextDataset_2Tokenizers, BucketingDataLoader)
59
+ from metrics import Bleu, SelfBleu
60
+
61
+
62
+
63
+ import pdb
64
+
65
+
66
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
67
+ datefmt = '%m/%d/%Y %H:%M:%S',
68
+ level = logging.INFO)
69
+ logger = logging.getLogger(__name__)
70
+
71
+ MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop
72
+
73
+ ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ())
74
+
75
+ MODEL_CLASSES = {
76
+ 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
77
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer)
78
+ }
79
+
80
+ # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
81
+ # in https://github.com/rusiaaman/XLNet-gen#methodology
82
+ # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e
83
+ PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family
84
+ (except for Alexei and Maria) are discovered.
85
+ The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
86
+ remainder of the story. 1883 Western Siberia,
87
+ a young Grigori Rasputin is asked by his father and a group of men to perform magic.
88
+ Rasputin has a vision and denounces one of the men as a horse thief. Although his
89
+ father initially slaps him for making such an accusation, Rasputin watches as the
90
+ man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
91
+ the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
92
+ with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
93
+
94
+
95
+ def set_seed(args):
96
+ np.random.seed(args.seed)
97
+ torch.manual_seed(args.seed)
98
+ if args.n_gpu > 0:
99
+ torch.cuda.manual_seed_all(args.seed)
100
+
101
+
102
+ def load_and_cache_examples(args, tokenizer, evaluate=False):
103
+ if isinstance(tokenizer, list):
104
+ dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
105
+ else:
106
+ dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
107
+ return dataset
108
+
109
+ def build_dataload_and_cache_examples(args, tokenizer, evaluate=False):
110
+ if isinstance(tokenizer, list):
111
+ if not evaluate:
112
+ args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
113
+ file_path=args.train_data_file
114
+ else:
115
+ args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
116
+ file_path=args.eval_data_file
117
+ dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=False)
118
+ else:
119
+ pass
120
+ return dataloader
121
+
122
+
123
+ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
124
+ """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
125
+ Args:
126
+ logits: logits distribution shape (vocabulary size)
127
+ top_k > 0: keep only top k tokens with highest probability (top-k filtering).
128
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
129
+ Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
130
+ From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
131
+ """
132
+ assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
133
+
134
+ # top-k
135
+ top_k = min(top_k, logits.size(-1)) # Safety check
136
+ if top_k > 0:
137
+ # Remove all tokens with a probability less than the last token of the top-k
138
+ indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
139
+ logits[indices_to_remove] = filter_value
140
+
141
+ # top-p
142
+ if top_p > 0.0:
143
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True)
144
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
145
+
146
+ # Remove tokens with cumulative probability above the threshold
147
+ sorted_indices_to_remove = cumulative_probs > top_p
148
+ # Shift the indices to the right to keep also the first token above the threshold
149
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
150
+ sorted_indices_to_remove[..., 0] = 0
151
+
152
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
153
+ logits[indices_to_remove] = filter_value
154
+ return logits
155
+
156
+
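A minimal usage sketch for the filtering helper above, with a toy five-token vocabulary (illustrative values only; assumes the top_k_top_p_filtering function defined in this file is in scope):

import torch
import torch.nn.functional as F

toy_logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])      # pretend vocabulary of five tokens
filtered = top_k_top_p_filtering(toy_logits.clone(), top_k=3, top_p=0.9)
probs = F.softmax(filtered, dim=-1)                          # filtered-out entries get probability 0
next_token = torch.multinomial(probs, num_samples=1)         # sample one token id, as the samplers below do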
157
+ def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu', decoder_tokenizer=None, max_seq_length=-1):
158
+ context = torch.tensor(context, dtype=torch.long, device=device)
159
+ context = context.unsqueeze(0).repeat(num_samples, 1)
160
+ generated = context
161
+ gen_seq_length = 0
162
+ with torch.no_grad():
163
+ while True:
164
+
165
+ inputs = {'input_ids': generated}
166
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
167
+ next_token_logits = outputs[0][0, -1, :] / temperature
168
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
169
+ next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
170
+ generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
171
+ gen_seq_length += 1
172
+ if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('<EOS>')[0]:
173
+ break
174
+ if max_seq_length>0 and gen_seq_length>max_seq_length:
175
+ break
176
+
177
+
178
+ return generated
179
+
180
+ def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None, max_seq_length=-1):
181
+
182
+ context = torch.tensor(context, dtype=torch.long, device=device)
183
+ context = context.unsqueeze(0).repeat(num_samples, 1)
184
+ generated = context
185
+ gen_seq_length = 0
186
+ with torch.no_grad():
187
+ while True:
188
+ inputs = {'input_ids': generated, 'past': past}
189
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
190
+ next_token_logits = outputs[0][0, -1, :] / temperature
191
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
192
+ next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
193
+ generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
194
+ gen_seq_length += 1
195
+ # pdb.set_trace()
196
+ if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('<EOS>')[0]:
197
+ break
198
+ if max_seq_length>0 and gen_seq_length>max_seq_length:
199
+ break
200
+
201
+ return generated
202
+
203
+
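Both samplers above are open-ended: instead of running for a fixed number of steps, they stop when the decoder emits its <EOS> token or when gen_seq_length exceeds max_seq_length. A model-free toy sketch of that loop structure (a six-token vocabulary with id 5 standing in for <EOS>; random logits replace the model):

import torch
import torch.nn.functional as F

vocab_size, eos_id, max_seq_length = 6, 5, 20
generated = torch.tensor([[0]])                              # a single <BOS>-like start token
while True:
    next_token_logits = torch.randn(vocab_size)              # a real model would produce these
    next_token = torch.multinomial(F.softmax(next_token_logits, dim=-1), num_samples=1)
    generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
    if next_token.item() == eos_id or generated.size(1) > max_seq_length:
        break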
204
+ def evaluate_generation_from_gpt2(model, decoder_tokenizer, args, ns=1):
205
+
206
+ loc = torch.zeros([args.nz]).to(args.device)
207
+ scale = torch.ones([args.nz]).to(args.device)
208
+ prior = torch.distributions.normal.Normal(loc, scale)
209
+
210
+ context_tokens = decoder_tokenizer.encode('<BOS>')
211
+
212
+ count = 0
213
+ result = defaultdict(str)
214
+ for i in tqdm(range(args.num_sents)):
215
+
216
+ with torch.no_grad():
217
+
218
+ out = sample_sequence(
219
+ model=model,
220
+ context=context_tokens,
221
+ length=args.max_seq_length, # Chunyuan: Fix length; or use <EOS> to complete a sentence
222
+ temperature=args.temperature,
223
+ top_k=args.top_k,
224
+ top_p=args.top_p,
225
+ device=args.device,
226
+ decoder_tokenizer = decoder_tokenizer,
227
+ max_seq_length = args.max_seq_length
228
+ )
229
+ text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True)
230
+ text_x1 = text_x1.split()[1:-1]
231
+ text_x1 = ' '.join(text_x1) + '\n'
232
+ result[i] = text_x1
233
+
234
+ if args.use_philly:
235
+ print("PROGRESS: {}%".format( round(100 * i /args.num_sents , 4)))
236
+
237
+ with open(args.output_generation_file, "w") as writer:
238
+ logger.info("***** SHOW generated sentences from prior *****")
239
+ for key in sorted(result.keys()):
240
+ # logger.info(" %s \n %s", key, str(result[key]))
241
+ # writer.write("%s \n %s\n" % (key, str(result[key])))
242
+ writer.write("%s" % str(result[key]))
243
+
244
+ return result
245
+
246
+
247
+ # bleu = evaluate_bleu(results, args)
248
+
249
+
250
+
251
+
252
+
253
+
254
+ def main():
255
+ parser = argparse.ArgumentParser()
256
+
257
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
258
+ help="The input training data file (a text file).")
259
+ parser.add_argument("--eval_data_file", default=None, type=str,
260
+ help="An input evaluation data file to evaluate the perplexity on (a text file).")
261
+ parser.add_argument("--checkpoint_dir", default=None, type=str, required=True,
262
+ help="The directory where checkpoints are saved.")
263
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
264
+ help="The output directory where the model predictions and checkpoints will be written.")
265
+ parser.add_argument("--dataset", default='Snli', type=str, help="The dataset.")
266
+
267
+ ## Variational auto-encoder
268
+ parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.")
269
+ parser.add_argument("--total_sents", default=10, type=int, help="Total sentences to test recontruction.")
270
+ parser.add_argument("--num_sents", default=10, type=int, help="Total sentences to generate.")
271
+
272
+
273
+ ## Encoder options
274
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
275
+ help="The encoder model architecture to be fine-tuned.")
276
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
277
+ help="The encoder model checkpoint for weights initialization.")
278
+ parser.add_argument("--encoder_config_name", default="", type=str,
279
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
280
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
281
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
282
+
283
+ ## Decoder options
284
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
285
+ help="The decoder model architecture to be fine-tuned.")
286
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
287
+ help="The decoder model checkpoint for weights initialization.")
288
+ parser.add_argument("--decoder_config_name", default="", type=str,
289
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
290
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
291
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
292
+
293
+
294
+ parser.add_argument("--per_gpu_train_batch_size", default=1, type=int,
295
+ help="Batch size per GPU/CPU for training.")
296
+ parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int,
297
+ help="Batch size per GPU/CPU for evaluation.")
298
+ parser.add_argument('--gloabl_step_eval', type=int, default=661,
299
+ help="Evaluate the results at the given global step")
300
+
301
+ parser.add_argument("--max_seq_length", default=512, type=int,
302
+ help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length")
303
+
304
+
305
+ ## Variational auto-encoder
306
+ parser.add_argument("--nz", default=32, type=int,
307
+ help="Latent space dimension.")
308
+
309
+ parser.add_argument("--prompt", type=str, default="")
310
+ parser.add_argument("--padding_text", type=str, default="")
311
+ parser.add_argument("--length", type=int, default=20)
312
+ parser.add_argument("--temperature", type=float, default=1.0)
313
+ parser.add_argument("--top_k", type=int, default=0)
314
+ parser.add_argument("--top_p", type=float, default=0.9)
315
+ parser.add_argument("--no_cuda", action='store_true',
316
+ help="Avoid using CUDA when available")
317
+ parser.add_argument('--seed', type=int, default=42,
318
+ help="random seed for initialization")
319
+
320
+ parser.add_argument("--block_size", default=-1, type=int,
321
+ help="Optional input sequence length after tokenization."
322
+ "The training dataset will be truncated in block of this size for training."
323
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
324
+ parser.add_argument("--do_lower_case", action='store_true',
325
+ help="Set this flag if you are using an uncased model.")
326
+
327
+ parser.add_argument("--use_philly", action='store_true',
328
+ help="Use Philly for computing.")
329
+
330
+ args = parser.parse_args()
331
+
332
+ args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
333
+ args.n_gpu = torch.cuda.device_count()
334
+
335
+ set_seed(args)
336
+ args.decoder_model_type = args.decoder_model_type.lower()
337
+
338
+
339
+ global_step = args.gloabl_step_eval
340
+
341
+ output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-{}'.format(global_step))
342
+ checkpoints = [ output_decoder_dir ]
343
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
344
+
345
+ # Load a trained Decoder model and vocabulary that you have fine-tuned
346
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
347
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir)
348
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
349
+ model_decoder.to(args.device)
350
+ if args.block_size <= 0:
351
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
352
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
353
+
354
+ # pdb.set_trace()
355
+ # Chunyuan: Add Padding token to GPT2
356
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
357
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
358
+ print('We have added', num_added_toks, 'tokens to GPT2')
359
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
360
+ assert tokenizer_decoder.pad_token == '<PAD>'
361
+
362
+
363
+ # Evaluation
364
+ if not os.path.exists(args.output_dir): os.makedirs(args.output_dir)
365
+ args.output_generation_file = os.path.join(args.output_dir, f"generation_from_gpt2_t{args.temperature}_p{args.top_p}.txt")
366
+ # args.output_generation_file = args.train_data_file
367
+ result = evaluate_generation_from_gpt2(model_decoder, tokenizer_decoder, args)
368
+
369
+ bleu5 = Bleu(test_text= args.output_generation_file,
370
+ real_text=args.eval_data_file,
371
+ num_real_sentences=args.num_sents,
372
+ num_fake_sentences=args.num_sents,
373
+ gram=5).get_score()
374
+ logger.info(f'The bleu score is {bleu5}')
375
+
376
+ sbleu5 = SelfBleu(test_text= args.output_generation_file,
377
+ num_sentences=args.num_sents,
378
+ gram=5).get_score()
379
+ logger.info(f'The self-bleu score is {sbleu5}')
380
+
381
+ args.eval_results_file = os.path.join(args.output_dir, f"eval_results_t{args.temperature}_p{args.top_p}.txt")
382
+ eval_results = {'bleu5':bleu5 , 'sbleu5':sbleu5}
383
+ with open(args.eval_results_file, "w") as writer:
384
+ logger.info("***** SHOW the quantative evalution results *****")
385
+ for key in sorted(eval_results.keys()):
386
+ writer.write("%s %s" % (key, str(eval_results[key])) )
387
+
388
+
389
+ if __name__ == '__main__':
390
+ main()
Optimus/code/examples/big_ae/run_latent_generation.py ADDED
@@ -0,0 +1,577 @@
1
+ #!/usr/bin/env python3
2
+ # coding=utf-8
3
+ # Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
4
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
5
+ #
6
+ # Licensed under the Apache License, Version 2.0 (the "License");
7
+ # you may not use this file except in compliance with the License.
8
+ # You may obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15
+ # See the License for the specific language governing permissions and
16
+ # limitations under the License.
17
+ """ Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet)
18
+ """
19
+ from __future__ import absolute_import, division, print_function, unicode_literals
20
+
21
+ import argparse
22
+ import glob
23
+ import logging
24
+ import os
25
+ import pickle
26
+ import random
27
+
28
+
29
+ import torch
30
+ import torch.nn.functional as F
31
+ import numpy as np
32
+
33
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
34
+ from torch.utils.data.distributed import DistributedSampler
35
+ from tqdm import tqdm, trange
36
+
37
+
38
+ from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig
39
+ from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector
40
+ from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
41
+ from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer
42
+ from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
43
+ from pytorch_transformers import BertForLatentConnector, BertTokenizer
44
+
45
+ from collections import defaultdict
46
+ from modules import VAE
47
+ from utils import (TextDataset_Split, TextDataset_2Tokenizers, BucketingDataLoader)
48
+
49
+
50
+ import pdb
51
+
52
+
53
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
54
+ datefmt = '%m/%d/%Y %H:%M:%S',
55
+ level = logging.INFO)
56
+ logger = logging.getLogger(__name__)
57
+
58
+ MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop
59
+
60
+ ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ())
61
+
62
+ MODEL_CLASSES = {
63
+ 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer),
64
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer)
65
+ }
66
+
67
+ # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
68
+ # in https://github.com/rusiaaman/XLNet-gen#methodology
69
+ # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e
70
+ PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family
71
+ (except for Alexei and Maria) are discovered.
72
+ The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
73
+ remainder of the story. 1883 Western Siberia,
74
+ a young Grigori Rasputin is asked by his father and a group of men to perform magic.
75
+ Rasputin has a vision and denounces one of the men as a horse thief. Although his
76
+ father initially slaps him for making such an accusation, Rasputin watches as the
77
+ man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
78
+ the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
79
+ with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
80
+
81
+
82
+ def set_seed(args):
83
+ np.random.seed(args.seed)
84
+ torch.manual_seed(args.seed)
85
+ if args.n_gpu > 0:
86
+ torch.cuda.manual_seed_all(args.seed)
87
+
88
+
89
+ def load_and_cache_examples(args, tokenizer, evaluate=False):
90
+ if isinstance(tokenizer, list):
91
+ dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
92
+ else:
93
+ dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
94
+ return dataset
95
+
96
+ def build_dataload_and_cache_examples(args, tokenizer, evaluate=False):
97
+ if isinstance(tokenizer, list):
98
+ if not evaluate:
99
+ args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
100
+ file_path=args.train_data_file
101
+ else:
102
+ args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
103
+ file_path=args.eval_data_file
104
+ dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=False)
105
+ else:
106
+ pass
107
+ return dataloader
108
+
109
+
110
+ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
111
+ """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
112
+ Args:
113
+ logits: logits distribution shape (vocabulary size)
114
+ top_k > 0: keep only top k tokens with highest probability (top-k filtering).
115
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
116
+ Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
117
+ From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
118
+ """
119
+ assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
120
+ top_k = min(top_k, logits.size(-1)) # Safety check
121
+ if top_k > 0:
122
+ # Remove all tokens with a probability less than the last token of the top-k
123
+ indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
124
+ logits[indices_to_remove] = filter_value
125
+
126
+ if top_p > 0.0:
127
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True)
128
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
129
+
130
+ # Remove tokens with cumulative probability above the threshold
131
+ sorted_indices_to_remove = cumulative_probs > top_p
132
+ # Shift the indices to the right to keep also the first token above the threshold
133
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
134
+ sorted_indices_to_remove[..., 0] = 0
135
+
136
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
137
+ logits[indices_to_remove] = filter_value
138
+ return logits
139
+
140
+
141
+ def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu'):
142
+ context = torch.tensor(context, dtype=torch.long, device=device)
143
+ context = context.unsqueeze(0).repeat(num_samples, 1)
144
+ generated = context
145
+ with torch.no_grad():
146
+ for _ in trange(length):
147
+
148
+ inputs = {'input_ids': generated}
149
+ if is_xlnet:
150
+ # XLNet is a direct (predict same token, not next token) and bi-directional model by default
151
+ # => need one additional dummy token in the input (will be masked), attention mask and target mapping (see model docstring)
152
+ input_ids = torch.cat((generated, torch.zeros((1, 1), dtype=torch.long, device=device)), dim=1)
153
+ perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float, device=device)
154
+ perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
155
+ target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float, device=device)
156
+ target_mapping[0, 0, -1] = 1.0 # predict last token
157
+ inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping}
158
+
159
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
160
+ next_token_logits = outputs[0][0, -1, :] / temperature
161
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
162
+ next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
163
+ generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
164
+ return generated
165
+
166
+ def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None):
167
+
168
+ context = torch.tensor(context, dtype=torch.long, device=device)
169
+ context = context.unsqueeze(0).repeat(num_samples, 1)
170
+ generated = context
171
+ with torch.no_grad():
172
+ while True:
173
+ # for _ in trange(length):
174
+ inputs = {'input_ids': generated, 'past': past}
175
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
176
+ next_token_logits = outputs[0][0, -1, :] / temperature
177
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
178
+ next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
179
+ generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
180
+
181
+ # pdb.set_trace()
182
+ if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('<EOS>')[0]:
183
+ break
184
+
185
+ return generated
186
+
187
+
188
+ def latent_code_from_text(text, tokenizer_encoder, model_vae, args):
189
+ tokenized1 = tokenizer_encoder.encode(text)
190
+ tokenized1 = [101] + tokenized1 + [102] # wrap with BERT's [CLS] (101) and [SEP] (102) token ids
191
+ coded1 = torch.Tensor([tokenized1])
192
+ coded1 = torch.Tensor.long(coded1)
193
+ with torch.no_grad():
194
+ x0 = coded1
195
+ x0 = x0.to(args.device)
196
+ pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
197
+ mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
198
+ latent_z = mean.squeeze(1)
199
+ coded_length = len(tokenized1)
200
+ return latent_z, coded_length
201
+
202
+ def text_from_latent_code(latent_z, model_vae, args, tokenizer_decoder):
203
+ past = latent_z
204
+ context_tokens = tokenizer_decoder.encode('<BOS>')
205
+
206
+ length = 128 # maximum length, but not used
207
+ out = sample_sequence_conditional(
208
+ model=model_vae.decoder,
209
+ context=context_tokens,
210
+ past=past,
211
+ length= length, # Chunyuan: Fix length; or use <EOS> to complete a sentence
212
+ temperature=args.temperature,
213
+ top_k=args.top_k,
214
+ top_p=args.top_p,
215
+ device=args.device,
216
+ decoder_tokenizer = tokenizer_decoder
217
+ )
218
+ text_x1 = tokenizer_decoder.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True)
219
+ text_x1 = text_x1.split()[1:-1]
220
+ text_x1 = ' '.join(text_x1)
221
+ return text_x1
222
+
223
+
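latent_code_from_text takes the posterior mean as the sentence code: the encoder's pooled feature is projected to 2 * latent_size values and split into a mean and a log-variance, and only the mean is kept (no sampling). A toy sketch of that projection with random tensors (hidden size 768 and latent size 32 assumed, matching bert-base and the --latent_size default):

import torch
import torch.nn as nn

hidden_size, latent_size = 768, 32
linear = nn.Linear(hidden_size, 2 * latent_size)     # plays the role of model_vae.encoder.linear
pooled = torch.randn(1, hidden_size)                 # stands in for BERT's pooled feature
mean, logvar = linear(pooled).chunk(2, -1)           # split into posterior mean and log-variance
latent_z = mean                                      # deterministic code: keep the mean only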
224
+ # a wrapper function to choose between different play modes
225
+ def evaluate_latent_space(args, model_vae, encoder_tokenizer, decoder_tokenizer, prefix=""):
226
+
227
+ eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=False)
228
+
229
+ # Eval!
230
+ logger.info("***** Running recontruction evaluation {} *****".format(prefix))
231
+ logger.info(" Num examples = %d", len(eval_dataloader))
232
+ logger.info(" Batch size = %d", args.per_gpu_eval_batch_size)
233
+
234
+ model_vae.eval()
235
+
236
+ model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training
237
+
238
+ if args.play_mode == 'reconstruction':
239
+ result = calc_rec(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=100)
240
+ result_file_name = "eval_recontruction_results.txt"
241
+ elif args.play_mode == 'interpolation':
242
+ result = calc_interpolate(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=100)
243
+ result_file_name = "eval_interpolation_results.txt"
244
+ else:
245
+ logger.info("Please specify the corrent play mode [reconstrction, interpolation]")
246
+
247
+
248
+ eval_output_dir = args.output_dir
249
+ output_eval_file = os.path.join(eval_output_dir, result_file_name)
250
+
251
+ with open(output_eval_file, "w") as writer:
252
+ logger.info("***** Eval {} results *****".format(args.play_mode))
253
+ for key in sorted(result.keys()):
254
+ logger.info(" %s \n %s", key, str(result[key]))
255
+ writer.write("%s \n %s\n" % (key, str(result[key])))
256
+
257
+ return result
258
+
259
+
260
+ def calc_rec(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=1):
261
+
262
+ count = 0
263
+ result = defaultdict(str)
264
+ for batch in tqdm(eval_dataloader, desc="Evaluating recontruction"):
265
+ # pdb.set_trace()
266
+ x0, x1, x_lengths = batch
267
+
268
+ max_len_values, _ = x_lengths.max(0)
269
+ x0 = x0[:,:max_len_values[0]]
270
+ x1 = x1[:,:max_len_values[1]]
271
+
272
+ x0 = x0.to(args.device)
273
+ x1 = x1.to(args.device)
274
+ x_lengths = x_lengths.to(args.device)
275
+
276
+ context_tokens = decoder_tokenizer.encode('<BOS>')
277
+
278
+ with torch.no_grad():
279
+
280
+ text_x0 = encoder_tokenizer.decode(x0[0,:x_lengths[0,0]].tolist(), clean_up_tokenization_spaces=True)[0]
281
+ # result["INPUT TEXT " + str(count)].append(text_x0)
282
+
283
+ pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
284
+
285
+ # Connect hidden feature to the latent space
286
+ # latent_z, loss_kl = model_vae.connect(pooled_hidden_fea)
287
+ mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
288
+ latent_z = mean.squeeze(1)
289
+
290
+ past = latent_z
291
+ out = sample_sequence_conditional(
292
+ model=model_vae.decoder,
293
+ context=context_tokens,
294
+ past=past,
295
+ length=x_lengths[0,1], # Chunyuan: Fix length; or use <EOS> to complete a sentence
296
+ temperature=args.temperature,
297
+ top_k=args.top_k,
298
+ top_p=args.top_p,
299
+ device=args.device,
300
+ decoder_tokenizer = decoder_tokenizer
301
+ )
302
+ text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True)
303
+ text_x1 = text_x1.split()[1:-1]
304
+ text_x1 = ' '.join(text_x1) + '\n'
305
+ result[text_x0] = text_x1
306
+
307
+ count += 1
308
+ if count>args.total_sents:
309
+ break
310
+
311
+
312
+ return result
313
+
314
+
315
+
316
+
317
+ def calc_interpolate(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=1):
318
+
319
+ count = 0
320
+ latent_codes = []
321
+ sample_interval = 0
322
+ for batch in tqdm(eval_dataloader, desc="Evaluating interpolation"):
323
+ # pdb.set_trace()
324
+ x0, x1, x_lengths = batch
325
+
326
+ max_len_values, _ = x_lengths.max(0)
327
+ x0 = x0[:,:max_len_values[0]]
328
+ x0 = x0.to(args.device)
329
+ x_lengths = x_lengths.to(args.device)
330
+
331
+
332
+ with torch.no_grad():
333
+ if sample_interval == 0 or sample_interval == args.total_sents:
334
+ text_x0 = encoder_tokenizer.decode(x0[0,:x_lengths[0,0]].tolist(), clean_up_tokenization_spaces=True)[0]
335
+ pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
336
+
337
+ # Connect hidden feature to the latent space
338
+ mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
339
+ latent_z = mean.squeeze(1)
340
+
341
+ latent_codes.append(latent_z)
342
+
343
+ if sample_interval == 5:
344
+ latent_codes.append(latent_z)
345
+ sample_interval = 0
346
+ continue
347
+ else:
348
+ sample_interval += 1
349
+ continue
350
+
351
+ count += 1
352
+ if count>args.total_sents:
353
+ break
354
+
355
+ context_tokens = decoder_tokenizer.encode('<BOS>')
356
+ result = defaultdict(str)
357
+ latent_codes_interpolation = []
358
+ num_steps = args.num_interpolation_steps
359
+ for step in range(num_steps+1):
360
+ latent_z = latent_codes[0] + (latent_codes[1] - latent_codes[0]) * step * 1.0/num_steps
361
+
362
+ past = latent_z
363
+ out = sample_sequence_conditional(
364
+ model=model_vae.decoder,
365
+ context=context_tokens,
366
+ past=past,
367
+ length=x_lengths[0,1], # Chunyuan: Fix length; or use <EOS> to complete a sentence
368
+ temperature=args.temperature,
369
+ top_k=args.top_k,
370
+ top_p=args.top_p,
371
+ device=args.device,
372
+ decoder_tokenizer = decoder_tokenizer
373
+ )
374
+ text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True)
375
+ text_x1 = text_x1.split()[1:-1]
376
+ text_x1 = ' '.join(text_x1)
377
+ result[step] = text_x1
378
+
379
+ return result
380
+
381
+
382
+ def interpolate(model_vae, tokenizer_encoder, tokenizer_decoder, args):
383
+ # and then in the main function
384
+ latent_z1, coded_length1 = latent_code_from_text(args.sent_source, tokenizer_encoder, model_vae, args)
385
+ latent_z2, coded_length2 = latent_code_from_text(args.sent_target, tokenizer_encoder, model_vae, args)
386
+
387
+ result = defaultdict(str)
388
+
389
+ num_steps = args.num_interpolation_steps + 1
390
+ for step in range(num_steps+1):
391
+ latent_z = latent_z1 + (latent_z2 - latent_z1) * step * 1.0/num_steps
392
+
393
+ text_interpolate = text_from_latent_code(latent_z, model_vae, args, tokenizer_decoder)
394
+ result[step] = text_interpolate
395
+ print(text_interpolate)
396
+
397
+ return result
398
+
399
+
400
+ def analogy(model_vae, tokenizer_encoder, tokenizer_decoder, args):
401
+
402
+ latent_z1, coded_length1 = latent_code_from_text(args.sent_source, tokenizer_encoder, model_vae, args)
403
+ latent_z2, coded_length2 = latent_code_from_text(args.sent_target, tokenizer_encoder, model_vae, args)
404
+ latent_z3, coded_length3 = latent_code_from_text(args.sent_input, tokenizer_encoder, model_vae, args)
405
+
406
+ result = defaultdict(str)
407
+
408
+ latent_z = latent_z3 + args.degree_to_target * (latent_z2 - latent_z1)
409
+
410
+ text_analogy = text_from_latent_code(latent_z, model_vae, args, tokenizer_decoder)
411
+ result[0] = text_analogy
412
+ print(text_analogy)
413
+
414
+ return result
415
+
416
+
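Both interpolate and analogy above reduce to vector arithmetic on latent codes before decoding. A self-contained toy sketch of the two schedules, with random tensors standing in for real sentence encodings (latent size 32 assumed):

import torch

z1, z2, z3 = torch.randn(1, 32), torch.randn(1, 32), torch.randn(1, 32)

# Interpolation: step 0 decodes z1, the final step decodes z2.
num_steps = 10
path = [z1 + (z2 - z1) * step / num_steps for step in range(num_steps + 1)]

# Analogy: shift the input code z3 by the (source -> target) offset, scaled by degree_to_target.
degree_to_target = 1.0
z_analogy = z3 + degree_to_target * (z2 - z1)
# Each latent would then be decoded via sample_sequence_conditional(past=z, ...).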
417
+ def main():
418
+ parser = argparse.ArgumentParser()
419
+
420
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
421
+ help="The input training data file (a text file).")
422
+ parser.add_argument("--eval_data_file", default=None, type=str,
423
+ help="An input evaluation data file to evaluate the perplexity on (a text file).")
424
+ parser.add_argument("--checkpoint_dir", default=None, type=str, required=True,
425
+ help="The directory where checkpoints are saved.")
426
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
427
+ help="The output directory where the model predictions and checkpoints will be written.")
428
+ parser.add_argument("--dataset", default='Snli', type=str, help="The dataset.")
429
+
430
+ ## Variational auto-encoder
431
+ parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.")
432
+ parser.add_argument("--total_sents", default=10, type=int, help="Total sentences to test recontruction.")
433
+ parser.add_argument("--num_interpolation_steps", default=10, type=int, help="Total sentences to test recontruction.")
434
+ parser.add_argument("--play_mode", default="interpolation", type=str,
435
+ help="interpolation or reconstruction.")
436
+
437
+
438
+ ## Encoder options
439
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
440
+ help="The encoder model architecture to be fine-tuned.")
441
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
442
+ help="The encoder model checkpoint for weights initialization.")
443
+ parser.add_argument("--encoder_config_name", default="", type=str,
444
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
445
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
446
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
447
+
448
+ ## Decoder options
449
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
450
+ help="The decoder model architecture to be fine-tuned.")
451
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
452
+ help="The decoder model checkpoint for weights initialization.")
453
+ parser.add_argument("--decoder_config_name", default="", type=str,
454
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
455
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
456
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
457
+
458
+
459
+ parser.add_argument("--per_gpu_train_batch_size", default=1, type=int,
460
+ help="Batch size per GPU/CPU for training.")
461
+ parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int,
462
+ help="Batch size per GPU/CPU for evaluation.")
463
+ parser.add_argument('--gloabl_step_eval', type=int, default=661,
464
+ help="Evaluate the results at the given global step")
465
+
466
+ parser.add_argument("--max_seq_length", default=512, type=int,
467
+ help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length")
468
+
469
+ # Interact with users
470
+ parser.add_argument("--interact_with_user_input", action='store_true', help="Use user input to interact_with.")
471
+ parser.add_argument("--sent_source", type=str, default="")
472
+ parser.add_argument("--sent_target", type=str, default="")
473
+ parser.add_argument("--sent_input", type=str, default="")
474
+ parser.add_argument("--degree_to_target", type=float, default="1.0")
475
+
476
+ ## Variational auto-encoder
477
+ parser.add_argument("--nz", default=32, type=int,
478
+ help="Latent space dimension.")
479
+
480
+ parser.add_argument("--prompt", type=str, default="")
481
+ parser.add_argument("--padding_text", type=str, default="")
482
+ parser.add_argument("--length", type=int, default=20)
483
+ parser.add_argument("--temperature", type=float, default=1.0)
484
+ parser.add_argument("--top_k", type=int, default=0)
485
+ parser.add_argument("--top_p", type=float, default=1.0)
486
+ parser.add_argument("--no_cuda", action='store_true',
487
+ help="Avoid using CUDA when available")
488
+ parser.add_argument('--seed', type=int, default=42,
489
+ help="random seed for initialization")
490
+
491
+ parser.add_argument("--block_size", default=-1, type=int,
492
+ help="Optional input sequence length after tokenization."
493
+ "The training dataset will be truncated in block of this size for training."
494
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
495
+ parser.add_argument("--do_lower_case", action='store_true',
496
+ help="Set this flag if you are using an uncased model.")
497
+
498
+ parser.add_argument("--use_philly", action='store_true',
499
+ help="Use Philly for computing.")
500
+
501
+ args = parser.parse_args()
502
+
503
+ args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
504
+ args.n_gpu = torch.cuda.device_count()
505
+
506
+ set_seed(args)
507
+
508
+
509
+ args.encoder_model_type = args.encoder_model_type.lower()
510
+ args.decoder_model_type = args.decoder_model_type.lower()
511
+
512
+
513
+ global_step = args.gloabl_step_eval
514
+
515
+ output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step))
516
+ output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step))
517
+ checkpoints = [ [output_encoder_dir, output_decoder_dir] ]
518
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
519
+
520
+ # Load a trained Encoder model and vocabulary that you have fine-tuned
521
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
522
+ model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size)
523
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
524
+
525
+ model_encoder.to(args.device)
526
+ if args.block_size <= 0:
527
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
528
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
529
+
530
+ # Load a trained Decoder model and vocabulary that you have fine-tuned
531
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
532
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size)
533
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
534
+ model_decoder.to(args.device)
535
+ if args.block_size <= 0:
536
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
537
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
538
+
539
+ # Load full model
540
+ output_full_dir = os.path.join(args.checkpoint_dir, 'checkpoint-full-{}'.format(global_step))
541
+ checkpoint = torch.load(os.path.join(output_full_dir, 'training.bin'))
542
+
543
+ # Chunyuan: Add Padding token to GPT2
544
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
545
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
546
+ print('We have added', num_added_toks, 'tokens to GPT2')
547
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
548
+ assert tokenizer_decoder.pad_token == '<PAD>'
549
+
550
+
551
+ # Evaluation
552
+ model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args)
553
+ model_vae.load_state_dict(checkpoint['model_state_dict'])
554
+ logger.info("Pre-trained Optimus is successfully loaded")
555
+ model_vae.to(args.device)
556
+
557
+ if args.interact_with_user_input:
558
+
559
+ if args.play_mode == 'interpolation':
560
+ if len(args.sent_source) > 0 and len(args.sent_target) > 0:
561
+ result = interpolate(model_vae, tokenizer_encoder, tokenizer_decoder, args)
562
+ else:
563
+ print('Please check: specify the source and target sentences!')
564
+
565
+ if args.play_mode == 'analogy':
566
+ if len(args.sent_source) > 0 and len(args.sent_target) > 0 and len(args.sent_input) > 0:
567
+ result = analogy(model_vae, tokenizer_encoder, tokenizer_decoder, args)
568
+ else:
569
+ print('Please check: specify the source, target and input analogy sentences!')
570
+
571
+
572
+ else:
573
+ result = evaluate_latent_space(args, model_vae, tokenizer_encoder, tokenizer_decoder, prefix=global_step)
574
+
575
+
576
+ if __name__ == '__main__':
577
+ main()
Optimus/code/examples/big_ae/run_lm_ae_pretraining.py ADDED
@@ -0,0 +1,692 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
18
+ GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
19
+ using a masked language modeling (MLM) loss.
20
+ """
21
+
22
+ from __future__ import absolute_import, division, print_function
23
+
24
+
25
+ import pdb
26
+ import argparse
27
+ import glob
28
+ import logging
29
+ import os
30
+ import pickle
31
+ import random
32
+
33
+ import numpy as np
34
+ import torch
35
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
36
+ from torch.utils.data.distributed import DistributedSampler
37
+ from tensorboardX import SummaryWriter
38
+ from tqdm import tqdm, trange
39
+
40
+ from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
41
+ BertConfig, BertModel, BertTokenizer,
42
+ GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
43
+ OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
44
+ RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
45
+
46
+
47
+ logger = logging.getLogger(__name__)
48
+
49
+
50
+ MODEL_CLASSES = {
51
+ 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
52
+ 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
53
+ 'bert': (BertConfig, BertModel, BertTokenizer),
54
+ 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
55
+ }
56
+
57
+
58
+ class TextDataset(Dataset):
59
+ def __init__(self, tokenizer, file_path='train', block_size=512):
60
+ assert os.path.isfile(file_path)
61
+ directory, filename = os.path.split(file_path)
62
+ cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}')
63
+
64
+ if os.path.exists(cached_features_file):
65
+ logger.info("Loading features from cached file %s", cached_features_file)
66
+ with open(cached_features_file, 'rb') as handle:
67
+ self.examples = pickle.load(handle)
68
+ else:
69
+ logger.info("Creating features from dataset file at %s", directory)
70
+
71
+ self.examples = []
72
+ with open(file_path, encoding="utf-8") as f:
73
+ text = f.read()
74
+
75
+
76
+ tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
77
+
78
+ while len(tokenized_text) >= block_size: # Truncate in block of block_size
79
+ self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size]))
80
+ tokenized_text = tokenized_text[block_size:]
81
+ # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
82
+ # If your dataset is small, first you should look for a bigger one :-) and second you
83
+ # can change this behavior by adding (model specific) padding.
84
+
85
+ logger.info("Saving features into cached file %s", cached_features_file)
86
+ with open(cached_features_file, 'wb') as handle:
87
+ pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
88
+
89
+ def __len__(self):
90
+ return len(self.examples)
91
+
92
+ def __getitem__(self, item):
93
+ return torch.tensor(self.examples[item])
94
+
95
+
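TextDataset simply chops the tokenized corpus into fixed-size blocks and drops the final partial block. A toy sketch of that chunking with fake token ids (block_size 4; the add_special_tokens_single_sentence call is omitted here):

block_size = 4
tokenized_text = list(range(10))                # stand-in for tokenizer.convert_tokens_to_ids(...)
examples = []
while len(tokenized_text) >= block_size:
    examples.append(tokenized_text[:block_size])
    tokenized_text = tokenized_text[block_size:]
# examples == [[0, 1, 2, 3], [4, 5, 6, 7]]; the trailing [8, 9] is dropped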
96
+
97
+ class TextDataset_2Tokenizers(Dataset):
98
+ def __init__(self, tokenizers, file_path='train', block_size=512):
99
+ assert os.path.isfile(file_path)
100
+ directory, filename = os.path.split(file_path)
101
+ cached_features_file = os.path.join(directory, f'cached_lm_gpt_bert_{block_size}_{filename}')
102
+
103
+
104
+
105
+ if os.path.exists(cached_features_file):
106
+ logger.info("Loading features from cached file %s", cached_features_file)
107
+ with open(cached_features_file, 'rb') as handle:
108
+ self.examples = pickle.load(handle)
109
+ else:
110
+ logger.info("Creating features from dataset file at %s", directory)
111
+
112
+
113
+ with open(file_path, encoding="utf-8") as f:
114
+ text = f.read()
115
+
116
+ # pdb.set_trace()
117
+ self.examples = []
118
+ # Chunyuan: divide the linguistic text into the same length, then different tokenization schemes are applied
119
+ while len(text) >= block_size: # Truncate in block of block_size
120
+
121
+ tokenized_text0 = tokenizers[0].convert_tokens_to_ids(tokenizers[0].tokenize(text[:block_size]))
122
+ tokenized_text0 = tokenizers[0].add_special_tokens_single_sentence(tokenized_text0)
123
+ tokenized_text0_length = len(tokenized_text0)
124
+ pad_token=tokenizers[0].convert_tokens_to_ids([tokenizers[0].pad_token])[0]
125
+ tokenized_text0 = tokenized_text0 + ([pad_token] * (block_size - tokenized_text0_length) ) # Pad up to the sequence length.
126
+ assert len(tokenized_text0) == block_size
127
+
128
+ tokenized_text1 = tokenizers[1].convert_tokens_to_ids(tokenizers[1].tokenize(text[:block_size]))
129
+ tokenized_text1 = tokenizers[1].add_special_tokens_single_sentence(tokenized_text1)
130
+ tokenized_text1_length = len(tokenized_text1)
131
+ pad_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].pad_token])[0]
132
+ tokenized_text1 = tokenized_text1 + ([pad_token] * (block_size - tokenized_text1_length) ) # Pad up to the sequence length.
133
+ assert len(tokenized_text1) == block_size
134
+
135
+ self.examples.append([tokenized_text0, tokenized_text0_length, tokenized_text1, tokenized_text1_length])
136
+
137
+ text = text[block_size:]
138
+ # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
139
+ # If your dataset is small, first you should look for a bigger one :-) and second you
140
+ # can change this behavior by adding (model specific) padding.
141
+
142
+ logger.info("Saving features into cached file %s", cached_features_file)
143
+ with open(cached_features_file, 'wb') as handle:
144
+ pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
145
+
146
+ def __len__(self):
147
+ return len(self.examples)
148
+
149
+ def __getitem__(self, item):
150
+ # pdb.set_trace()
151
+ # Convert to Tensors and build dataset
152
+ tokenized_text0= torch.tensor(self.examples[item][0], dtype=torch.long)
153
+ tokenized_text1= torch.tensor(self.examples[item][2], dtype=torch.long)
154
+ tokenized_text_lengths = torch.tensor([self.examples[item][1], self.examples[item][3]], dtype=torch.long)
155
+ # pdb.set_trace()
156
+ return (tokenized_text0, tokenized_text1, tokenized_text_lengths)
157
+
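# Illustrative sketch (editor's addition, not part of the original script): how a batch
# from TextDataset_2Tokenizers is typically consumed. The two tokenizers are assumed to
# be a BERT tokenizer (encoder side) and a GPT-2 tokenizer that already has a pad token
# added, as main() below does; the file path and batch size are hypothetical placeholders.
def _example_paired_batch(tokenizer_bert, tokenizer_gpt2, path='train.txt'):
    dataset = TextDataset_2Tokenizers([tokenizer_bert, tokenizer_gpt2], file_path=path, block_size=512)
    loader = DataLoader(dataset, batch_size=4)
    bert_ids, gpt2_ids, lengths = next(iter(loader))
    # bert_ids / gpt2_ids: (4, 512) padded id tensors; lengths: (4, 2) true lengths per tokenizer
    return bert_ids, gpt2_ids, lengths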
158
+ def load_and_cache_examples(args, tokenizer, evaluate=False):
159
+ if isinstance(tokenizer, list):
160
+ dataset = TextDataset_2Tokenizers(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
161
+ else:
162
+ dataset = TextDataset(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
163
+ return dataset
164
+
165
+
166
+ def set_seed(args):
167
+ random.seed(args.seed)
168
+ np.random.seed(args.seed)
169
+ torch.manual_seed(args.seed)
170
+ if args.n_gpu > 0:
171
+ torch.cuda.manual_seed_all(args.seed)
172
+
173
+
174
+ def mask_tokens(inputs, tokenizer, args):
175
+ """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
176
+ labels = inputs.clone()
177
+ # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
178
+
179
+ masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8)
180
+ labels[masked_indices==1] = -1 # We only compute loss on masked tokens
181
+
182
+ # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
183
+ indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices
184
+ inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
185
+
186
+ # 10% of the time, we replace masked input tokens with random word
187
+ indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced
188
+ indices_random = indices_random
189
+ random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
190
+ inputs[indices_random] = random_words[indices_random]
191
+
192
+ # The rest of the time (10% of the time) we keep the masked input tokens unchanged
193
+ return inputs, labels
194
+
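# Illustrative sketch (editor's addition): mask_tokens above follows the standard BERT
# 80/10/10 recipe. Of the positions drawn with probability args.mlm_probability, 80%
# become the mask token; of the remaining 20%, half (10% overall) receive a random token,
# which is why the second Bernoulli draw uses p=0.5; the final 10% stay unchanged.
# A rough empirical check of those fractions on dummy data:
def _example_masking_fractions(n=100000, mlm_probability=0.15):
    shape = (n,)
    masked = torch.bernoulli(torch.full(shape, mlm_probability)).to(torch.uint8)
    replaced = torch.bernoulli(torch.full(shape, 0.8)).to(torch.uint8) & masked
    randomized = torch.bernoulli(torch.full(shape, 0.5)).to(torch.uint8) & masked & ~replaced
    kept = masked & ~replaced & ~randomized
    # expected ~ (0.12, 0.015, 0.015) of all tokens, i.e. 80% / 10% / 10% of the masked ones
    return replaced.float().mean().item(), randomized.float().mean().item(), kept.float().mean().item()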
195
+
196
+ def train(args, train_dataset, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer):
197
+ """ Train the model """
198
+ if args.local_rank in [-1, 0]:
199
+ tb_writer = SummaryWriter()
200
+
201
+ args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
202
+ train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
203
+ train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
204
+
205
+ if args.max_steps > 0:
206
+ t_total = args.max_steps
207
+ args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
208
+ else:
209
+ t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
210
+
211
+ # Prepare optimizer and schedule (linear warmup and decay)
212
+ no_decay = ['bias', 'LayerNorm.weight']
213
+ optimizer_grouped_encoder_parameters = [
214
+ {'params': [p for n, p in model_encoder.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
215
+ {'params': [p for n, p in model_encoder.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
216
+ ]
217
+
218
+ optimizer_grouped_decoder_parameters = [
219
+ {'params': [p for n, p in model_decoder.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
220
+ {'params': [p for n, p in model_decoder.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
221
+ ]
222
+
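# Editor's note (illustrative): the two parameter groups above apply args.weight_decay to
# ordinary weights while exempting biases and LayerNorm weights (the `no_decay` list),
# which is the usual convention when fine-tuning BERT/GPT-2 with AdamW.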
223
+
224
+ optimizer_encoder = AdamW(optimizer_grouped_encoder_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
225
+ optimizer_decoder = AdamW(optimizer_grouped_decoder_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
226
+ scheduler_encoder = WarmupLinearSchedule(optimizer_encoder, warmup_steps=args.warmup_steps, t_total=t_total)
227
+ scheduler_decoder = WarmupLinearSchedule(optimizer_decoder, warmup_steps=args.warmup_steps, t_total=t_total)
228
+
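# Editor's note (illustrative): WarmupLinearSchedule scales the base learning rate by a
# factor that ramps linearly from 0 to 1 over the first warmup_steps updates and then
# decays linearly back toward 0 at t_total, roughly
#     factor(step) = step / warmup_steps                          if step < warmup_steps
#     factor(step) = (t_total - step) / (t_total - warmup_steps)  otherwise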
229
+ if args.fp16:
230
+ try:
231
+ from apex import amp
232
+ except ImportError:
233
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
234
+ model_encoder, optimizer_encoder = amp.initialize(model_encoder, optimizer_encoder, opt_level=args.fp16_opt_level)
235
+ model_decoder, optimizer_decoder = amp.initialize(model_decoder, optimizer_decoder, opt_level=args.fp16_opt_level)
236
+
237
+ # multi-gpu training (should be after apex fp16 initialization)
238
+ if args.n_gpu > 1:
239
+ model_encoder = torch.nn.DataParallel(model_encoder)
240
+ model_decoder = torch.nn.DataParallel(model_decoder)
241
+
242
+ # Distributed training (should be after apex fp16 initialization)
243
+ if args.local_rank != -1:
244
+ model_encoder = torch.nn.parallel.DistributedDataParallel(model_encoder, device_ids=[args.local_rank],
245
+ output_device=args.local_rank,
246
+ find_unused_parameters=True)
247
+ model_decoder = torch.nn.parallel.DistributedDataParallel(model_decoder, device_ids=[args.local_rank],
248
+ output_device=args.local_rank,
249
+ find_unused_parameters=True)
250
+
251
+ # Train!
252
+ logger.info("***** Running training *****")
253
+ logger.info(" Num examples = %d", len(train_dataset))
254
+ logger.info(" Num Epochs = %d", args.num_train_epochs)
255
+ logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
256
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
257
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
258
+ logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
259
+ logger.info(" Total optimization steps = %d", t_total)
260
+
261
+ global_step = 0
262
+ tr_loss, logging_loss = 0.0, 0.0
263
+ model_encoder.zero_grad()
264
+ model_decoder.zero_grad()
265
+
266
+ train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
267
+ set_seed(args) # Added here for reproducibility (even between python 2 and 3)
268
+ for _ in train_iterator:
269
+ epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
270
+ for step, batch in enumerate(epoch_iterator):
271
+
272
+ tokenized_text0, tokenized_text1, tokenized_text_lengths = batch
273
+ # tokenized_text0 = tokenized_text0.to(args.device)
274
+ # tokenized_text1 = tokenized_text1.to(args.device)
275
+ # prepare input-output data for reconstruction
276
+ inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1)
277
+ labels = tokenized_text1
278
+
279
+ inputs = inputs.to(args.device)
280
+ labels = labels.to(args.device)
281
+
282
+ model_encoder.train()
283
+ model_decoder.train()
284
+
285
+
286
+ # Encoding
287
+ outputs = model_encoder(inputs)
288
+ pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc)
289
+
290
+
291
+ # Decoding
292
+ outputs = model_decoder(input_ids=tokenized_text1, past=pooled_hidden_fea, labels=labels)
293
+ loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
294
+
295
+
296
+ if args.n_gpu > 1:
297
+ loss = loss.mean() # mean() to average on multi-gpu parallel training
298
+ if args.gradient_accumulation_steps > 1:
299
+ loss = loss / args.gradient_accumulation_steps
300
+
301
+ if args.fp16:
302
+ with amp.scale_loss(loss, [optimizer_encoder, optimizer_decoder]) as scaled_loss:  # scale the loss w.r.t. both amp-initialized optimizers
303
+ scaled_loss.backward()
304
+ else:
305
+ loss.backward()
306
+
307
+ tr_loss += loss.item()
308
+ if (step + 1) % args.gradient_accumulation_steps == 0:
309
+ if args.fp16:
310
+ torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer_encoder), args.max_grad_norm)
311
+ torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer_decoder), args.max_grad_norm)
312
+ else:
313
+ torch.nn.utils.clip_grad_norm_(model_encoder.parameters(), args.max_grad_norm)
314
+ torch.nn.utils.clip_grad_norm_(model_decoder.parameters(), args.max_grad_norm)
315
+ optimizer_encoder.step()
316
+ optimizer_decoder.step()
317
+ scheduler_encoder.step() # Update learning rate schedule
318
+ scheduler_decoder.step()
319
+ model_encoder.zero_grad()
320
+ model_decoder.zero_grad()
321
+ global_step += 1
322
+
323
+
324
+ if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
325
+ # Log metrics
326
+ if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
327
+ results = evaluate(args, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer)
328
+ for key, value in results.items():
329
+ tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
330
+ tb_writer.add_scalar('lr_encoder', scheduler_encoder.get_lr()[0], global_step)
331
+ tb_writer.add_scalar('lr_decoder', scheduler_decoder.get_lr()[0], global_step)
332
+ tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
333
+ logging_loss = tr_loss
334
+
335
+ if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
336
+ # Save model checkpoint
337
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
338
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
339
+ if not os.path.exists(output_encoder_dir):
340
+ os.makedirs(output_encoder_dir)
341
+ if not os.path.exists(output_decoder_dir):
342
+ os.makedirs(output_decoder_dir)
343
+
344
+ model_encoder_to_save = model_encoder.module if hasattr(model_encoder, 'module') else model_encoder # Take care of distributed/parallel training
345
+ model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training
346
+
347
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
348
+ torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
349
+
350
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
351
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
352
+
353
+ logger.info("Saving model checkpoint to %s", output_encoder_dir)
354
+ logger.info("Saving model checkpoint to %s", output_decoder_dir)
355
+
356
+ if args.max_steps > 0 and global_step > args.max_steps:
357
+ epoch_iterator.close()
358
+ break
359
+ if args.max_steps > 0 and global_step > args.max_steps:
360
+ train_iterator.close()
361
+ break
362
+
363
+ if args.local_rank in [-1, 0]:
364
+ tb_writer.close()
365
+
366
+ return global_step, tr_loss / global_step
367
+
368
+
369
+ def evaluate(args, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer, prefix=""):
370
+ # Loop to handle MNLI double evaluation (matched, mis-matched)
371
+ eval_output_dir = args.output_dir
372
+
373
+ eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True)
374
+
375
+ if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
376
+ os.makedirs(eval_output_dir)
377
+
378
+ args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
379
+ # Note that DistributedSampler samples randomly
380
+ eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
381
+ eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
382
+
383
+ # Eval!
384
+ logger.info("***** Running evaluation {} *****".format(prefix))
385
+ logger.info(" Num examples = %d", len(eval_dataset))
386
+ logger.info(" Batch size = %d", args.eval_batch_size)
387
+ eval_loss = 0.0
388
+ nb_eval_steps = 0
389
+ model_encoder.eval()
390
+ model_decoder.eval()
391
+
392
+ for batch in tqdm(eval_dataloader, desc="Evaluating"):
393
+ # pdb.set_trace()
394
+ tokenized_text0, tokenized_text1, tokenized_text_lengths = batch
395
+ # prepare input-output data for evaluation
396
+ inputs, labels = tokenized_text0, tokenized_text1
397
+
398
+ tokenized_text1 = tokenized_text1.to(args.device)
399
+ inputs = inputs.to(args.device)
400
+ labels = labels.to(args.device)
401
+
402
+ with torch.no_grad():
403
+ # Encoding
404
+ outputs = model_encoder(inputs)
405
+ pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc)
406
+
407
+ # Decoding
408
+ outputs = model_decoder(input_ids=tokenized_text1, past=pooled_hidden_fea, labels=labels)
409
+ lm_loss = outputs[0]
410
+
411
+ eval_loss += lm_loss.mean().item()
412
+ nb_eval_steps += 1
413
+
414
+ eval_loss = eval_loss / nb_eval_steps
415
+ perplexity = torch.exp(torch.tensor(eval_loss))
416
+
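# Editor's note (illustrative): eval_loss is the mean per-batch cross-entropy (in nats),
# so the reported metric is perplexity = exp(mean cross-entropy). Because it averages
# batch means rather than weighting by token count, it is an approximation whenever
# batches contain different numbers of target tokens.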
417
+ result = {
418
+ "perplexity": perplexity
419
+ }
420
+
421
+ output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
422
+ with open(output_eval_file, "w") as writer:
423
+ logger.info("***** Eval results {} *****".format(prefix))
424
+ for key in sorted(result.keys()):
425
+ logger.info(" %s = %s", key, str(result[key]))
426
+ writer.write("%s = %s\n" % (key, str(result[key])))
427
+
428
+ return result
429
+
430
+
431
+ def main():
432
+ parser = argparse.ArgumentParser()
433
+
434
+ ## Required parameters
435
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
436
+ help="The input training data file (a text file).")
437
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
438
+ help="The output directory where the model predictions and checkpoints will be written.")
439
+
440
+ ## Other parameters
441
+ parser.add_argument("--eval_data_file", default=None, type=str,
442
+ help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
443
+
444
+ ## Encoder options
445
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
446
+ help="The encoder model architecture to be fine-tuned.")
447
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
448
+ help="The encoder model checkpoint for weights initialization.")
449
+ parser.add_argument("--encoder_config_name", default="", type=str,
450
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
451
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
452
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
453
+
454
+ ## Decoder options
455
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
456
+ help="The decoder model architecture to be fine-tuned.")
457
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
458
+ help="The decoder model checkpoint for weights initialization.")
459
+ parser.add_argument("--decoder_config_name", default="", type=str,
460
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
461
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
462
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
463
+
464
+ ## Objective functions
465
+ parser.add_argument("--mlm", action='store_true',
466
+ help="Train with masked-language modeling loss instead of language modeling.")
467
+ parser.add_argument("--mlm_probability", type=float, default=0.15,
468
+ help="Ratio of tokens to mask for masked language modeling loss")
469
+
470
+
471
+
472
+ parser.add_argument("--cache_dir", default="", type=str,
473
+ help="Optional directory to store the pre-trained models downloaded from s3 (instead of the default one)")
474
+ parser.add_argument("--block_size", default=-1, type=int,
475
+ help="Optional input sequence length after tokenization."
476
+ "The training dataset will be truncated in block of this size for training."
477
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
478
+ parser.add_argument("--do_train", action='store_true',
479
+ help="Whether to run training.")
480
+ parser.add_argument("--do_eval", action='store_true',
481
+ help="Whether to run eval on the dev set.")
482
+ parser.add_argument("--evaluate_during_training", action='store_true',
483
+ help="Run evaluation during training at each logging step.")
484
+ parser.add_argument("--do_lower_case", action='store_true',
485
+ help="Set this flag if you are using an uncased model.")
486
+
487
+
488
+ # Training Schedules
489
+ parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
490
+ help="Batch size per GPU/CPU for training.")
491
+ parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int,
492
+ help="Batch size per GPU/CPU for evaluation.")
493
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
494
+ help="Number of updates steps to accumulate before performing a backward/update pass.")
495
+ parser.add_argument("--learning_rate", default=5e-5, type=float,
496
+ help="The initial learning rate for Adam.")
497
+ parser.add_argument("--weight_decay", default=0.0, type=float,
498
+ help="Weight decay if we apply some.")
499
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float,
500
+ help="Epsilon for Adam optimizer.")
501
+ parser.add_argument("--max_grad_norm", default=1.0, type=float,
502
+ help="Max gradient norm.")
503
+ parser.add_argument("--num_train_epochs", default=1.0, type=float,
504
+ help="Total number of training epochs to perform.")
505
+ parser.add_argument("--max_steps", default=-1, type=int,
506
+ help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
507
+ parser.add_argument("--warmup_steps", default=0, type=int,
508
+ help="Linear warmup over warmup_steps.")
509
+
510
+
511
+ ## IO: Logging and Saving
512
+ parser.add_argument('--logging_steps', type=int, default=50,
513
+ help="Log every X updates steps.")
514
+ parser.add_argument('--save_steps', type=int, default=50,
515
+ help="Save checkpoint every X updates steps.")
516
+ parser.add_argument("--eval_all_checkpoints", action='store_true',
517
+ help="Evaluate all checkpoints starting with the same prefix as model_name_or_path and ending with the step number")
518
+ parser.add_argument("--no_cuda", action='store_true',
519
+ help="Avoid using CUDA when available")
520
+ parser.add_argument('--overwrite_output_dir', action='store_true',
521
+ help="Overwrite the content of the output directory")
522
+ parser.add_argument('--overwrite_cache', action='store_true',
523
+ help="Overwrite the cached training and evaluation sets")
524
+ parser.add_argument('--seed', type=int, default=42,
525
+ help="random seed for initialization")
526
+
527
+ # Precision & Distributed Training
528
+ parser.add_argument('--fp16', action='store_true',
529
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
530
+ parser.add_argument('--fp16_opt_level', type=str, default='O1',
531
+ help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
532
+ "See details at https://nvidia.github.io/apex/amp.html")
533
+ parser.add_argument("--local_rank", type=int, default=-1,
534
+ help="For distributed training: local_rank")
535
+ parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
536
+ parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
537
+ args = parser.parse_args()
538
+
539
+ if args.decoder_model_type in ["bert", "roberta"] and not args.mlm:
540
+ raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
541
+ "flag (masked language modeling).")
542
+ if args.eval_data_file is None and args.do_eval:
543
+ raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
544
+ "or remove the --do_eval argument.")
545
+
546
+ if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
547
+ raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
548
+
549
+ # Setup distant debugging if needed
550
+ if args.server_ip and args.server_port:
551
+ # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
552
+ import ptvsd
553
+ print("Waiting for debugger attach")
554
+ ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
555
+ ptvsd.wait_for_attach()
556
+
557
+ # Setup CUDA, GPU & distributed training
558
+ if args.local_rank == -1 or args.no_cuda:
559
+ device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
560
+ args.n_gpu = torch.cuda.device_count()
561
+ else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
562
+ torch.cuda.set_device(args.local_rank)
563
+ device = torch.device("cuda", args.local_rank)
564
+ torch.distributed.init_process_group(backend='nccl')
565
+ args.n_gpu = 1
566
+ args.device = device
567
+
568
+ # Setup logging
569
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
570
+ datefmt = '%m/%d/%Y %H:%M:%S',
571
+ level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
572
+ logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
573
+ args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
574
+
575
+ # Set seed
576
+ set_seed(args)
577
+
578
+ # Load pretrained model and tokenizer
579
+ if args.local_rank not in [-1, 0]:
580
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab
581
+
582
+ ## Encoder
583
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
584
+ encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path)
585
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
586
+ if args.block_size <= 0:
587
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
588
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
589
+ model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config)
590
+ model_encoder.to(args.device)
591
+
592
+ ## Decoder
593
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
594
+ decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path)
595
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
596
+ if args.block_size <= 0:
597
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
598
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
599
+ model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config)
600
+
601
+ # Chunyuan: Add Padding token to GPT2
602
+ special_tokens_dict = {'pad_token': '<PAD>'}
603
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
604
+ print('We have added', num_added_toks, 'tokens')
605
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
606
+ assert tokenizer_decoder.pad_token == '<PAD>'
607
+
608
+ model_decoder.to(args.device)
609
+
610
+ if args.local_rank == 0:
611
+ torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab
612
+
613
+ logger.info("Training/evaluation parameters %s", args)
614
+
615
+ global_step= 0
616
+ # Training
617
+ if args.do_train:
618
+ if args.local_rank not in [-1, 0]:
619
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache
620
+
621
+ train_dataset = load_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False)
622
+
623
+ if args.local_rank == 0:
624
+ torch.distributed.barrier()
625
+
626
+ global_step, tr_loss = train(args, train_dataset, model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder)
627
+ logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
628
+
629
+
630
+ # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
631
+ if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
632
+ # Create output directory if needed
633
+ # Save model checkpoint
634
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
635
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
636
+ if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]:
637
+ os.makedirs(output_encoder_dir)
638
+ if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]:
639
+ os.makedirs(output_decoder_dir)
640
+
641
+ logger.info("Saving encoder model checkpoint to %s", output_encoder_dir)
642
+ logger.info("Saving decoder model checkpoint to %s", output_decoder_dir)
643
+ # Save a trained model, configuration and tokenizer using `save_pretrained()`.
644
+ # They can then be reloaded using `from_pretrained()`
645
+
646
+ model_encoder_to_save = model_encoder.module if hasattr(model_encoder, 'module') else model_encoder # Take care of distributed/parallel training
647
+ model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training
648
+
649
+ # Good practice: save your training arguments together with the trained model
650
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
651
+ torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
652
+
653
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
654
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
655
+
656
+
657
+ # Load a trained model and vocabulary that you have fine-tuned
658
+ model_encoder = encoder_model_class.from_pretrained(output_encoder_dir)
659
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(output_encoder_dir, do_lower_case=args.do_lower_case)
660
+ model_encoder.to(args.device)
661
+
662
+ # Load a trained model and vocabulary that you have fine-tuned
663
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir)
664
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(output_decoder_dir, do_lower_case=args.do_lower_case)
665
+ model_decoder.to(args.device)
666
+
667
+
668
+ # Evaluation
669
+ results = {}
670
+ if args.do_eval and args.local_rank in [-1, 0]:
671
+ global_step= 881
672
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
673
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
674
+ checkpoints = [ [output_encoder_dir, output_decoder_dir] ]
675
+
676
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
677
+ for checkpoint in checkpoints:
678
+ global_step = checkpoint[0].split('-')[-1] if len(checkpoints) > 1 else ""
679
+
680
+ model_encoder = encoder_model_class.from_pretrained(checkpoint[0])
681
+ model_encoder.to(args.device)
682
+ model_decoder = decoder_model_class.from_pretrained(checkpoint[1])
683
+ model_decoder.to(args.device)
684
+ result = evaluate(args, model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, prefix=global_step)
685
+ result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
686
+ results.update(result)
687
+
688
+ return results
689
+
690
+
691
+ if __name__ == "__main__":
692
+ main()
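# Illustrative sketch (editor's addition, not part of the diff): one plausible set of
# arguments for this script, assembled from the argparse flags defined above. All file
# paths are hypothetical placeholders, not paths taken from the repository.
def _example_cli_args():
    return [
        '--train_data_file', 'data/train.txt',
        '--eval_data_file', 'data/valid.txt',
        '--output_dir', 'output/ae_pretrain',
        '--encoder_model_type', 'bert',
        '--encoder_model_name_or_path', 'bert-base-cased',
        '--decoder_model_type', 'gpt2',
        '--decoder_model_name_or_path', 'gpt2',
        '--per_gpu_train_batch_size', '2',
        '--num_train_epochs', '1',
        '--do_train',
        '--do_eval',
    ]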
Optimus/code/examples/big_ae/run_lm_causal_pretraining.py ADDED
@@ -0,0 +1,692 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
18
+ GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
19
+ using a masked language modeling (MLM) loss.
20
+ """
21
+
22
+ from __future__ import absolute_import, division, print_function
23
+
24
+
25
+ import pdb
26
+ import argparse
27
+ import glob
28
+ import logging
29
+ import os
30
+ import pickle
31
+ import random
32
+
33
+ import numpy as np
34
+ import torch
35
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
36
+ from torch.utils.data.distributed import DistributedSampler
37
+ from tensorboardX import SummaryWriter
38
+ from tqdm import tqdm, trange
39
+
40
+ from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
41
+ BertConfig, BertModel, BertTokenizer,
42
+ GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
43
+ OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
44
+ RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
45
+
46
+
47
+ logger = logging.getLogger(__name__)
48
+
49
+
50
+ MODEL_CLASSES = {
51
+ 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
52
+ 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
53
+ 'bert': (BertConfig, BertModel, BertTokenizer),
54
+ 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
55
+ }
56
+
57
+
58
+ class TextDataset(Dataset):
59
+ def __init__(self, tokenizer, file_path='train', block_size=512):
60
+ assert os.path.isfile(file_path)
61
+ directory, filename = os.path.split(file_path)
62
+ cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}')
63
+
64
+ if os.path.exists(cached_features_file):
65
+ logger.info("Loading features from cached file %s", cached_features_file)
66
+ with open(cached_features_file, 'rb') as handle:
67
+ self.examples = pickle.load(handle)
68
+ else:
69
+ logger.info("Creating features from dataset file at %s", directory)
70
+
71
+ self.examples = []
72
+ with open(file_path, encoding="utf-8") as f:
73
+ text = f.read()
74
+
75
+
76
+ tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
77
+
78
+ while len(tokenized_text) >= block_size: # Truncate in block of block_size
79
+ self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size]))
80
+ tokenized_text = tokenized_text[block_size:]
81
+ # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
82
+ # If your dataset is small, first you should look for a bigger one :-) and second you
83
+ # can change this behavior by adding (model specific) padding.
84
+
85
+ logger.info("Saving features into cached file %s", cached_features_file)
86
+ with open(cached_features_file, 'wb') as handle:
87
+ pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
88
+
89
+ def __len__(self):
90
+ return len(self.examples)
91
+
92
+ def __getitem__(self, item):
93
+ return torch.tensor(self.examples[item])
94
+
95
+
96
+
97
+ class TextDataset_2Tokenizers(Dataset):
98
+ def __init__(self, tokenizers, file_path='train', block_size=512):
99
+ assert os.path.isfile(file_path)
100
+ directory, filename = os.path.split(file_path)
101
+ cached_features_file = os.path.join(directory, f'cached_lm_gpt_bert_{block_size}_{filename}')
102
+
103
+
104
+
105
+ if os.path.exists(cached_features_file):
106
+ logger.info("Loading features from cached file %s", cached_features_file)
107
+ with open(cached_features_file, 'rb') as handle:
108
+ self.examples = pickle.load(handle)
109
+ else:
110
+ logger.info("Creating features from dataset file at %s", directory)
111
+
112
+
113
+ with open(file_path, encoding="utf-8") as f:
114
+ text = f.read()
115
+
116
+ # pdb.set_trace()
117
+ self.examples = []
118
+ # Chunyuan: divide the linguistic text into the same length, then different tokenization schemes are applied
119
+ while len(text) >= block_size: # Truncate in block of block_size
120
+
121
+ tokenized_text0 = tokenizers[0].convert_tokens_to_ids(tokenizers[0].tokenize(text[:block_size]))
122
+ tokenized_text0 = tokenizers[0].add_special_tokens_single_sentence(tokenized_text0)
123
+ tokenized_text0_length = len(tokenized_text0)
124
+ pad_token=tokenizers[0].convert_tokens_to_ids([tokenizers[0].pad_token])[0]
125
+ tokenized_text0 = tokenized_text0 + ([pad_token] * (block_size - tokenized_text0_length) ) # Pad up to the sequence length.
126
+ assert len(tokenized_text0) == block_size
127
+
128
+ tokenized_text1 = tokenizers[1].convert_tokens_to_ids(tokenizers[1].tokenize(text[:block_size]))
129
+ tokenized_text1 = tokenizers[1].add_special_tokens_single_sentence(tokenized_text1)
130
+ tokenized_text1_length = len(tokenized_text1)
131
+ pad_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].pad_token])[0]
132
+ tokenized_text1 = tokenized_text1 + ([pad_token] * (block_size - tokenized_text1_length) ) # Pad up to the sequence length.
133
+ assert len(tokenized_text1) == block_size
134
+
135
+ self.examples.append([tokenized_text0, tokenized_text0_length, tokenized_text1, tokenized_text1_length])
136
+
137
+ text = text[block_size:]
138
+ # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
139
+ # If your dataset is small, first you should look for a bigger one :-) and second you
140
+ # can change this behavior by adding (model specific) padding.
141
+
142
+ logger.info("Saving features into cached file %s", cached_features_file)
143
+ with open(cached_features_file, 'wb') as handle:
144
+ pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
145
+
146
+ def __len__(self):
147
+ return len(self.examples)
148
+
149
+ def __getitem__(self, item):
150
+ # pdb.set_trace()
151
+ # Convert to Tensors and build dataset
152
+ tokenized_text0= torch.tensor(self.examples[item][0], dtype=torch.long)
153
+ tokenized_text1= torch.tensor(self.examples[item][2], dtype=torch.long)
154
+ tokenized_text_lengths = torch.tensor([self.examples[item][1], self.examples[item][3]], dtype=torch.long)
155
+ # pdb.set_trace()
156
+ return (tokenized_text0, tokenized_text1, tokenized_text_lengths)
157
+
158
+ def load_and_cache_examples(args, tokenizer, evaluate=False):
159
+ if isinstance(tokenizer, list):
160
+ dataset = TextDataset_2Tokenizers(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
161
+ else:
162
+ dataset = TextDataset(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
163
+ return dataset
164
+
165
+
166
+ def set_seed(args):
167
+ random.seed(args.seed)
168
+ np.random.seed(args.seed)
169
+ torch.manual_seed(args.seed)
170
+ if args.n_gpu > 0:
171
+ torch.cuda.manual_seed_all(args.seed)
172
+
173
+
174
+ def mask_tokens(inputs, tokenizer, args):
175
+ """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
176
+ labels = inputs.clone()
177
+ # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
178
+
179
+ masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8)
180
+ labels[masked_indices==1] = -1 # We only compute loss on masked tokens
181
+
182
+ # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
183
+ indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices
184
+ inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
185
+
186
+ # 10% of the time, we replace masked input tokens with random word
187
+ indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced
188
+ indices_random = indices_random
189
+ random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
190
+ inputs[indices_random] = random_words[indices_random]
191
+
192
+ # The rest of the time (10% of the time) we keep the masked input tokens unchanged
193
+ return inputs, labels
194
+
195
+
196
+ def train(args, train_dataset, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer):
197
+ """ Train the model """
198
+ if args.local_rank in [-1, 0]:
199
+ tb_writer = SummaryWriter()
200
+
201
+ args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
202
+ train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
203
+ train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
204
+
205
+ if args.max_steps > 0:
206
+ t_total = args.max_steps
207
+ args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
208
+ else:
209
+ t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
210
+
211
+ # Prepare optimizer and schedule (linear warmup and decay)
212
+ no_decay = ['bias', 'LayerNorm.weight']
213
+ optimizer_grouped_encoder_parameters = [
214
+ {'params': [p for n, p in model_encoder.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
215
+ {'params': [p for n, p in model_encoder.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
216
+ ]
217
+
218
+ optimizer_grouped_decoder_parameters = [
219
+ {'params': [p for n, p in model_decoder.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
220
+ {'params': [p for n, p in model_decoder.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
221
+ ]
222
+
223
+
224
+ optimizer_encoder = AdamW(optimizer_grouped_encoder_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
225
+ optimizer_decoder = AdamW(optimizer_grouped_decoder_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
226
+ scheduler_encoder = WarmupLinearSchedule(optimizer_encoder, warmup_steps=args.warmup_steps, t_total=t_total)
227
+ scheduler_decoder = WarmupLinearSchedule(optimizer_decoder, warmup_steps=args.warmup_steps, t_total=t_total)
228
+
229
+ if args.fp16:
230
+ try:
231
+ from apex import amp
232
+ except ImportError:
233
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
234
+ model_encoder, optimizer_encoder = amp.initialize(model_encoder, optimizer_encoder, opt_level=args.fp16_opt_level)
235
+ model_decoder, optimizer_decoder = amp.initialize(model_decoder, optimizer_decoder, opt_level=args.fp16_opt_level)
236
+
237
+ # multi-gpu training (should be after apex fp16 initialization)
238
+ if args.n_gpu > 1:
239
+ model_encoder = torch.nn.DataParallel(model_encoder)
240
+ model_decoder = torch.nn.DataParallel(model_decoder)
241
+
242
+ # Distributed training (should be after apex fp16 initialization)
243
+ if args.local_rank != -1:
244
+ model_encoder = torch.nn.parallel.DistributedDataParallel(model_encoder, device_ids=[args.local_rank],
245
+ output_device=args.local_rank,
246
+ find_unused_parameters=True)
247
+ model_decoder = torch.nn.parallel.DistributedDataParallel(model_decoder, device_ids=[args.local_rank],
248
+ output_device=args.local_rank,
249
+ find_unused_parameters=True)
250
+
251
+ # Train!
252
+ logger.info("***** Running training *****")
253
+ logger.info(" Num examples = %d", len(train_dataset))
254
+ logger.info(" Num Epochs = %d", args.num_train_epochs)
255
+ logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
256
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
257
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
258
+ logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
259
+ logger.info(" Total optimization steps = %d", t_total)
260
+
261
+ global_step = 0
262
+ tr_loss, logging_loss = 0.0, 0.0
263
+ model_encoder.zero_grad()
264
+ model_decoder.zero_grad()
265
+
266
+ train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
267
+ set_seed(args) # Added here for reproducibility (even between python 2 and 3)
268
+ for _ in train_iterator:
269
+ epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
270
+ for step, batch in enumerate(epoch_iterator):
271
+
272
+ tokenized_text0, tokenized_text1, tokenized_text_lengths = batch
273
+ # tokenized_text0 = tokenized_text0.to(args.device)
274
+ # tokenized_text1 = tokenized_text1.to(args.device)
275
+ # prepare input-output data for reconstruction
276
+ inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1)
277
+ labels = tokenized_text1
278
+
279
+ inputs = inputs.to(args.device)
280
+ labels = labels.to(args.device)
281
+
282
+ model_encoder.train()
283
+ model_decoder.train()
284
+
285
+
286
+ # Encoding
287
+ outputs = model_encoder(inputs)
288
+ pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc)
289
+
290
+
291
+ # Decoding
292
+ outputs = model_decoder(input_ids=tokenized_text1, past=None, labels=labels)
293
+ loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
294
+
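# Editor's note (illustrative): unlike run_lm_ae_pretraining.py, this causal-LM variant
# does not feed the pooled encoder feature to the decoder (past=None above), so GPT-2 is
# optimized as a plain language model on the same text blocks; the BERT encoding computed
# earlier in the loop does not enter the loss.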
295
+
296
+ if args.n_gpu > 1:
297
+ loss = loss.mean() # mean() to average on multi-gpu parallel training
298
+ if args.gradient_accumulation_steps > 1:
299
+ loss = loss / args.gradient_accumulation_steps
300
+
301
+ if args.fp16:
302
+ with amp.scale_loss(loss, [optimizer_encoder, optimizer_decoder]) as scaled_loss:  # scale the loss w.r.t. both amp-initialized optimizers
303
+ scaled_loss.backward()
304
+ else:
305
+ loss.backward()
306
+
307
+ tr_loss += loss.item()
308
+ if (step + 1) % args.gradient_accumulation_steps == 0:
309
+ if args.fp16:
310
+ torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer_encoder), args.max_grad_norm)
311
+ torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer_decoder), args.max_grad_norm)
312
+ else:
313
+ torch.nn.utils.clip_grad_norm_(model_encoder.parameters(), args.max_grad_norm)
314
+ torch.nn.utils.clip_grad_norm_(model_decoder.parameters(), args.max_grad_norm)
315
+ optimizer_encoder.step()
316
+ optimizer_decoder.step()
317
+ scheduler_encoder.step() # Update learning rate schedule
318
+ scheduler_decoder.step()
319
+ model_encoder.zero_grad()
320
+ model_decoder.zero_grad()
321
+ global_step += 1
322
+
323
+
324
+ if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
325
+ # Log metrics
326
+ if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
327
+ results = evaluate(args, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer)
328
+ for key, value in results.items():
329
+ tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
330
+ tb_writer.add_scalar('lr_encoder', scheduler_encoder.get_lr()[0], global_step)
331
+ tb_writer.add_scalar('lr_decoder', scheduler_decoder.get_lr()[0], global_step)
332
+ tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
333
+ logging_loss = tr_loss
334
+
335
+ if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
336
+ # Save model checkpoint
337
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
338
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
339
+ if not os.path.exists(output_encoder_dir):
340
+ os.makedirs(output_encoder_dir)
341
+ if not os.path.exists(output_decoder_dir):
342
+ os.makedirs(output_decoder_dir)
343
+
344
+ model_encoder_to_save = model_encoder.module if hasattr(model_encoder, 'module') else model_encoder # Take care of distributed/parallel training
345
+ model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training
346
+
347
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
348
+ torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
349
+
350
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
351
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
352
+
353
+ logger.info("Saving model checkpoint to %s", output_encoder_dir)
354
+ logger.info("Saving model checkpoint to %s", output_decoder_dir)
355
+
356
+ if args.max_steps > 0 and global_step > args.max_steps:
357
+ epoch_iterator.close()
358
+ break
359
+ if args.max_steps > 0 and global_step > args.max_steps:
360
+ train_iterator.close()
361
+ break
362
+
363
+ if args.local_rank in [-1, 0]:
364
+ tb_writer.close()
365
+
366
+ return global_step, tr_loss / global_step
367
+
368
+
369
+ def evaluate(args, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer, prefix=""):
370
+ # Loop to handle MNLI double evaluation (matched, mis-matched)
371
+ eval_output_dir = args.output_dir
372
+
373
+ eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True)
374
+
375
+ if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
376
+ os.makedirs(eval_output_dir)
377
+
378
+ args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
379
+ # Note that DistributedSampler samples randomly
380
+ eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
381
+ eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
382
+
383
+ # Eval!
384
+ logger.info("***** Running evaluation {} *****".format(prefix))
385
+ logger.info(" Num examples = %d", len(eval_dataset))
386
+ logger.info(" Batch size = %d", args.eval_batch_size)
387
+ eval_loss = 0.0
388
+ nb_eval_steps = 0
389
+ model_encoder.eval()
390
+ model_decoder.eval()
391
+
392
+ for batch in tqdm(eval_dataloader, desc="Evaluating"):
393
+ # pdb.set_trace()
394
+ tokenized_text0, tokenized_text1, tokenized_text_lengths = batch
395
+ # prepare input-output data for evaluation
396
+ inputs, labels = tokenized_text0, tokenized_text1
397
+
398
+ tokenized_text1 = tokenized_text1.to(args.device)
399
+ inputs = inputs.to(args.device)
400
+ labels = labels.to(args.device)
401
+
402
+ with torch.no_grad():
403
+ # Encoding
404
+ outputs = model_encoder(inputs)
405
+ pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc)
406
+
407
+ # Decoding
408
+ outputs = model_decoder(input_ids=tokenized_text1, past=None, labels=labels)
409
+ lm_loss = outputs[0]
410
+
411
+ eval_loss += lm_loss.mean().item()
412
+ nb_eval_steps += 1
413
+
414
+ eval_loss = eval_loss / nb_eval_steps
415
+ perplexity = torch.exp(torch.tensor(eval_loss))
416
+
417
+ result = {
418
+ "perplexity": perplexity
419
+ }
420
+
421
+ output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
422
+ with open(output_eval_file, "w") as writer:
423
+ logger.info("***** Eval results {} *****".format(prefix))
424
+ for key in sorted(result.keys()):
425
+ logger.info(" %s = %s", key, str(result[key]))
426
+ writer.write("%s = %s\n" % (key, str(result[key])))
427
+
428
+ return result
429
+
430
+
431
+ def main():
432
+ parser = argparse.ArgumentParser()
433
+
434
+ ## Required parameters
435
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
436
+ help="The input training data file (a text file).")
437
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
438
+ help="The output directory where the model predictions and checkpoints will be written.")
439
+
440
+ ## Other parameters
441
+ parser.add_argument("--eval_data_file", default=None, type=str,
442
+ help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
443
+
444
+ ## Encoder options
445
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
446
+ help="The encoder model architecture to be fine-tuned.")
447
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
448
+ help="The encoder model checkpoint for weights initialization.")
449
+ parser.add_argument("--encoder_config_name", default="", type=str,
450
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
451
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
452
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
453
+
454
+ ## Decoder options
455
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
456
+ help="The decoder model architecture to be fine-tuned.")
457
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
458
+ help="The decoder model checkpoint for weights initialization.")
459
+ parser.add_argument("--decoder_config_name", default="", type=str,
460
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
461
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
462
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
463
+
464
+ ## Objective functions
465
+ parser.add_argument("--mlm", action='store_true',
466
+ help="Train with masked-language modeling loss instead of language modeling.")
467
+ parser.add_argument("--mlm_probability", type=float, default=0.15,
468
+ help="Ratio of tokens to mask for masked language modeling loss")
469
+
470
+
471
+
472
+ parser.add_argument("--cache_dir", default="", type=str,
473
+ help="Optional directory to store the pre-trained models downloaded from s3 (instead of the default one)")
474
+ parser.add_argument("--block_size", default=-1, type=int,
475
+ help="Optional input sequence length after tokenization."
476
+ "The training dataset will be truncated in block of this size for training."
477
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
478
+ parser.add_argument("--do_train", action='store_true',
479
+ help="Whether to run training.")
480
+ parser.add_argument("--do_eval", action='store_true',
481
+ help="Whether to run eval on the dev set.")
482
+ parser.add_argument("--evaluate_during_training", action='store_true',
483
+ help="Run evaluation during training at each logging step.")
484
+ parser.add_argument("--do_lower_case", action='store_true',
485
+ help="Set this flag if you are using an uncased model.")
486
+
487
+
488
+ # Training Schedules
489
+ parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
490
+ help="Batch size per GPU/CPU for training.")
491
+ parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int,
492
+ help="Batch size per GPU/CPU for evaluation.")
493
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
494
+ help="Number of updates steps to accumulate before performing a backward/update pass.")
495
+ parser.add_argument("--learning_rate", default=5e-5, type=float,
496
+ help="The initial learning rate for Adam.")
497
+ parser.add_argument("--weight_decay", default=0.0, type=float,
498
+ help="Weight decay if we apply some.")
499
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float,
500
+ help="Epsilon for Adam optimizer.")
501
+ parser.add_argument("--max_grad_norm", default=1.0, type=float,
502
+ help="Max gradient norm.")
503
+ parser.add_argument("--num_train_epochs", default=1.0, type=float,
504
+ help="Total number of training epochs to perform.")
505
+ parser.add_argument("--max_steps", default=-1, type=int,
506
+ help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
507
+ parser.add_argument("--warmup_steps", default=0, type=int,
508
+ help="Linear warmup over warmup_steps.")
509
+
510
+
511
+ ## IO: Logging and Saving
512
+ parser.add_argument('--logging_steps', type=int, default=50,
513
+ help="Log every X updates steps.")
514
+ parser.add_argument('--save_steps', type=int, default=50,
515
+ help="Save checkpoint every X updates steps.")
516
+ parser.add_argument("--eval_all_checkpoints", action='store_true',
517
+ help="Evaluate all checkpoints starting with the same prefix as model_name_or_path and ending with the step number")
518
+ parser.add_argument("--no_cuda", action='store_true',
519
+ help="Avoid using CUDA when available")
520
+ parser.add_argument('--overwrite_output_dir', action='store_true',
521
+ help="Overwrite the content of the output directory")
522
+ parser.add_argument('--overwrite_cache', action='store_true',
523
+ help="Overwrite the cached training and evaluation sets")
524
+ parser.add_argument('--seed', type=int, default=42,
525
+ help="random seed for initialization")
526
+
527
+ # Precision & Distributed Training
528
+ parser.add_argument('--fp16', action='store_true',
529
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
530
+ parser.add_argument('--fp16_opt_level', type=str, default='O1',
531
+ help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
532
+ "See details at https://nvidia.github.io/apex/amp.html")
533
+ parser.add_argument("--local_rank", type=int, default=-1,
534
+ help="For distributed training: local_rank")
535
+ parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
536
+ parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
537
+ args = parser.parse_args()
538
+
539
+ if args.decoder_model_type in ["bert", "roberta"] and not args.mlm:
540
+ raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
541
+ "flag (masked language modeling).")
542
+ if args.eval_data_file is None and args.do_eval:
543
+ raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
544
+ "or remove the --do_eval argument.")
545
+
546
+ if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
547
+ raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
548
+
549
+ # Setup distant debugging if needed
550
+ if args.server_ip and args.server_port:
551
+ # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
552
+ import ptvsd
553
+ print("Waiting for debugger attach")
554
+ ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
555
+ ptvsd.wait_for_attach()
556
+
557
+ # Setup CUDA, GPU & distributed training
558
+ if args.local_rank == -1 or args.no_cuda:
559
+ device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
560
+ args.n_gpu = torch.cuda.device_count()
561
+ else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
562
+ torch.cuda.set_device(args.local_rank)
563
+ device = torch.device("cuda", args.local_rank)
564
+ torch.distributed.init_process_group(backend='nccl')
565
+ args.n_gpu = 1
566
+ args.device = device
567
+
568
+ # Setup logging
569
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
570
+ datefmt = '%m/%d/%Y %H:%M:%S',
571
+ level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
572
+ logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
573
+ args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
574
+
575
+ # Set seed
576
+ set_seed(args)
577
+
578
+ # Load pretrained model and tokenizer
579
+ if args.local_rank not in [-1, 0]:
580
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training downloads the model & vocab
581
+
582
+ ## Encoder
583
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
584
+ encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path)
585
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
586
+ if args.block_size <= 0:
587
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
588
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
589
+ model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config)
590
+ model_encoder.to(args.device)
591
+
592
+ ## Decoder
593
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
594
+ decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path)
595
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
596
+ if args.block_size <= 0:
597
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
598
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
599
+ model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config)
600
+
601
+ # Chunyuan: Add Padding token to GPT2
602
+ special_tokens_dict = {'pad_token': '<PAD>'}
603
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
604
+ print('We have added', num_added_toks, 'tokens')
605
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
606
+ assert tokenizer_decoder.pad_token == '<PAD>'
607
+
608
+ model_decoder.to(args.device)
609
+
610
+ if args.local_rank == 0:
611
+ torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training downloads the model & vocab
612
+
613
+ logger.info("Training/evaluation parameters %s", args)
614
+
615
+ global_step= 0
616
+ # Training
617
+ if args.do_train:
618
+ if args.local_rank not in [-1, 0]:
619
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training processes the dataset, and the others will use the cache
620
+
621
+ train_dataset = load_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False)
622
+
623
+ if args.local_rank == 0:
624
+ torch.distributed.barrier()
625
+
626
+ global_step, tr_loss = train(args, train_dataset, model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder)
627
+ logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
628
+
629
+
630
+ # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
631
+ if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
632
+ # Create output directory if needed
633
+ # Save model checkpoint
634
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
635
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
636
+ if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]:
637
+ os.makedirs(output_encoder_dir)
638
+ if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]:
639
+ os.makedirs(output_decoder_dir)
640
+
641
+ logger.info("Saving encoder model checkpoint to %s", output_encoder_dir)
642
+ logger.info("Saving decoder model checkpoint to %s", output_decoder_dir)
643
+ # Save a trained model, configuration and tokenizer using `save_pretrained()`.
644
+ # They can then be reloaded using `from_pretrained()`
645
+
646
+ model_encoder_to_save = model_encoder.module if hasattr(model_encoder, 'module') else model_encoder # Take care of distributed/parallel training
647
+ model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training
648
+
649
+ # Good practice: save your training arguments together with the trained model
650
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
651
+ torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
652
+
653
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
654
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
655
+
656
+
657
+ # Load a trained model and vocabulary that you have fine-tuned
658
+ model_encoder = encoder_model_class.from_pretrained(output_encoder_dir)
659
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(output_encoder_dir, do_lower_case=args.do_lower_case)
660
+ model_encoder.to(args.device)
661
+
662
+ # Load a trained model and vocabulary that you have fine-tuned
663
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir)
664
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(output_decoder_dir, do_lower_case=args.do_lower_case)
665
+ model_decoder.to(args.device)
666
+
667
+
668
+ # Evaluation
669
+ results = {}
670
+ if args.do_eval and args.local_rank in [-1, 0]:
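+ # NOTE: the checkpoint step below is hardcoded; change it to match the checkpoint you actually want to evaluate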
671
+ global_step= 881
672
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
673
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
674
+ checkpoints = [ [output_encoder_dir, output_decoder_dir] ]
675
+
676
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
677
+ for checkpoint in checkpoints:
678
+ global_step = checkpoint[0].split('-')[-1] if len(checkpoints) > 1 else ""
679
+
680
+ model_encoder = encoder_model_class.from_pretrained(checkpoint[0])
681
+ model_encoder.to(args.device)
682
+ model_decoder = decoder_model_class.from_pretrained(checkpoint[1])
683
+ model_decoder.to(args.device)
684
+ result = evaluate(args, model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, prefix=global_step)
685
+ result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
686
+ results.update(result)
687
+
688
+ return results
689
+
690
+
691
+ if __name__ == "__main__":
692
+ main()
Optimus/code/examples/big_ae/run_lm_finetuning_baseline.py ADDED
@@ -0,0 +1,573 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
18
+ GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
19
+ using a masked language modeling (MLM) loss.
20
+ """
21
+
22
+ from __future__ import absolute_import, division, print_function
23
+
24
+ import pdb
25
+
26
+ import sys
27
+ sys.path.insert(0, '.')
28
+
29
+ import argparse
30
+ import glob
31
+ import logging
32
+ import os
33
+ import pickle
34
+ import random
35
+
36
+ import numpy as np
37
+ import torch
38
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler
39
+ from torch.utils.data.distributed import DistributedSampler
40
+ from tensorboardX import SummaryWriter
41
+ from tqdm import tqdm, trange
42
+
43
+ from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
44
+ BertConfig, BertForMaskedLM, BertTokenizer,
45
+ GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
46
+ OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
47
+ RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
48
+
49
+ from utils import (calc_iwnll, calc_mi, calc_au, TextDataset_Split, TextDataset_2Tokenizers)
50
+
51
+ import pdb
52
+
53
+ logger = logging.getLogger(__name__)
54
+
55
+
56
+ MODEL_CLASSES = {
57
+ 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
58
+ 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
59
+ 'bert': (BertConfig, BertForMaskedLM, BertTokenizer),
60
+ 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
61
+ }
62
+
63
+
64
+ class TextDataset(Dataset):
65
+ def __init__(self, tokenizer, file_path='train', block_size=512):
66
+ assert os.path.isfile(file_path)
67
+ directory, filename = os.path.split(file_path)
68
+ cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}')
69
+
70
+ if os.path.exists(cached_features_file):
71
+ logger.info("Loading features from cached file %s", cached_features_file)
72
+ with open(cached_features_file, 'rb') as handle:
73
+ self.examples = pickle.load(handle)
74
+ else:
75
+ logger.info("Creating features from dataset file at %s", directory)
76
+
77
+ self.examples = []
78
+ with open(file_path, encoding="utf-8") as f:
79
+ text = f.read()
80
+
81
+ tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
82
+
83
+ while len(tokenized_text) >= block_size: # Truncate in block of block_size
84
+ self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size]))
85
+ tokenized_text = tokenized_text[block_size:]
86
+ # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
87
+ # If your dataset is small, first you should look for a bigger one :-) and second you
88
+ # can change this behavior by adding (model specific) padding.
89
+
90
+ logger.info("Saving features into cached file %s", cached_features_file)
91
+ with open(cached_features_file, 'wb') as handle:
92
+ pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
93
+
94
+ def __len__(self):
95
+ return len(self.examples)
96
+
97
+ def __getitem__(self, item):
98
+ return torch.tensor(self.examples[item])
99
+
100
+
101
+ def load_and_cache_examples(args, tokenizer, evaluate=False):
102
+ if isinstance(tokenizer, list):
103
+ dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
104
+ else:
105
+ dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
106
+ return dataset
107
+
108
+
109
+ def set_seed(args):
110
+ random.seed(args.seed)
111
+ np.random.seed(args.seed)
112
+ torch.manual_seed(args.seed)
113
+ if args.n_gpu > 0:
114
+ torch.cuda.manual_seed_all(args.seed)
115
+
116
+
117
+ def mask_tokens(inputs, tokenizer, args):
118
+ """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
119
+ labels = inputs.clone()
120
+ # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
121
+
122
+ masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8)
123
+ labels[masked_indices==1] = -1 # We only compute loss on masked tokens
124
+
125
+ # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
126
+ indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices
127
+ inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
128
+
129
+ # 10% of the time, we replace masked input tokens with random word
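+ # (a 0.5 draw over the ~20% of masked tokens not already replaced with [MASK] gives ~10% of masked tokens overall)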
130
+ indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced
131
+ random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
132
+ inputs[indices_random] = random_words[indices_random]
133
+
134
+ # The rest of the time (10% of the time) we keep the masked input tokens unchanged
135
+ return inputs, labels
136
+
137
+
138
+ def train(args, train_dataset, model, tokenizer):
139
+ """ Train the model """
140
+ if args.local_rank in [-1, 0]:
141
+ tb_writer = SummaryWriter()
142
+
143
+ args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
144
+ train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
145
+ train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
146
+
147
+ if args.max_steps > 0:
148
+ t_total = args.max_steps
149
+ args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
150
+ else:
151
+ t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
152
+
153
+ # Prepare optimizer and schedule (linear warmup and decay)
154
+ no_decay = ['bias', 'LayerNorm.weight']
155
+ optimizer_grouped_parameters = [
156
+ {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
157
+ {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
158
+ ]
159
+ optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
160
+ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
161
+ if args.fp16:
162
+ try:
163
+ from apex import amp
164
+ except ImportError:
165
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
166
+ model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
167
+
168
+ # multi-gpu training (should be after apex fp16 initialization)
169
+ if args.n_gpu > 1:
170
+ model = torch.nn.DataParallel(model)
171
+
172
+ # Distributed training (should be after apex fp16 initialization)
173
+ if args.local_rank != -1:
174
+ model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
175
+ output_device=args.local_rank,
176
+ find_unused_parameters=True)
177
+
178
+
179
+ # Train!
180
+ logger.info("***** Running training *****")
181
+ logger.info(" Num examples = %d", len(train_dataset))
182
+ logger.info(" Num Epochs = %d", args.num_train_epochs)
183
+ logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
184
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
185
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
186
+ logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
187
+ logger.info(" Total optimization steps = %d", t_total)
188
+
189
+ global_step = 0
190
+ tr_loss, logging_loss = 0.0, 0.0
191
+ model.zero_grad()
192
+ train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
193
+ set_seed(args) # Added here for reproducibility (even between python 2 and 3)
194
+ for epoch in train_iterator:  # the epoch index is used below for Philly progress reporting
195
+ epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
196
+ for step, batch in enumerate(epoch_iterator):
197
+
198
+ tokenized_text1, tokenized_text_lengths = batch
199
+
200
+ inputs, labels = tokenized_text1, tokenized_text1
201
+
202
+ inputs = inputs.to(args.device)
203
+ labels = labels.to(args.device)
204
+
205
+ model.train()
206
+
207
+ outputs = model(inputs, labels=labels, label_ignore=tokenizer.pad_token_id)
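+ # label_ignore tells this repo's modified GPT-2 LM head which token id (here the pad token) to exclude from the loss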
208
+
209
+ # pdb.set_trace()
210
+ loss = outputs[0].mean() # model outputs are always tuple in pytorch-transformers (see doc)
211
+
212
+ if args.use_philly:
213
+ print("PROGRESS: {}%".format(round(100 * (step + epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4)))
214
+ print("EVALERR: {}%".format(loss))
215
+
216
+
217
+
218
+ if args.n_gpu > 1:
219
+ loss = loss.mean() # mean() to average on multi-gpu parallel training
220
+ if args.gradient_accumulation_steps > 1:
221
+ loss = loss / args.gradient_accumulation_steps
222
+
223
+ if args.fp16:
224
+ with amp.scale_loss(loss, optimizer) as scaled_loss:
225
+ scaled_loss.backward()
226
+ else:
227
+ loss.backward()
228
+
229
+ tr_loss += loss.item()
230
+ if (step + 1) % args.gradient_accumulation_steps == 0:
231
+ if args.fp16:
232
+ torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
233
+ else:
234
+ torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
235
+ optimizer.step()
236
+ scheduler.step() # Update learning rate schedule
237
+ model.zero_grad()
238
+ global_step += 1
239
+
240
+ if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
241
+ # Log metrics
242
+ if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
243
+ results = evaluate(args, model, tokenizer)
244
+ for key, value in results.items():
245
+ tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
246
+ tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
247
+ tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
248
+ logging_loss = tr_loss
249
+
250
+ if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
251
+ # Save model checkpoint
252
+ output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
253
+ if not os.path.exists(output_dir):
254
+ os.makedirs(output_dir)
255
+ model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
256
+ model_to_save.save_pretrained(output_dir)
257
+ torch.save(args, os.path.join(output_dir, 'training_args.bin'))
258
+ logger.info("Saving model checkpoint to %s", output_dir)
259
+
260
+ if args.max_steps > 0 and global_step > args.max_steps:
261
+ epoch_iterator.close()
262
+ break
263
+ if args.max_steps > 0 and global_step > args.max_steps:
264
+ train_iterator.close()
265
+ break
266
+
267
+ if args.local_rank in [-1, 0]:
268
+ tb_writer.close()
269
+
270
+ return global_step, tr_loss / global_step
271
+
272
+
273
+ def evaluate(args, model, tokenizer, prefix=""):
274
+ # Loop to handle MNLI double evaluation (matched, mis-matched)
275
+ eval_output_dir = args.output_dir
276
+
277
+ eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
278
+
279
+ if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
280
+ os.makedirs(eval_output_dir)
281
+
282
+ args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
283
+ # Note that DistributedSampler samples randomly
284
+ eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
285
+ eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
286
+
287
+ # Eval!
288
+ logger.info("***** Running evaluation {} *****".format(prefix))
289
+ logger.info(" Num examples = %d", len(eval_dataset))
290
+ logger.info(" Batch size = %d", args.eval_batch_size)
291
+ eval_loss = 0.0
292
+ eval_loss_sum = 0.0
293
+ nb_eval_steps = 0
294
+ report_num_words = 0
295
+
296
+ model.eval()
297
+
298
+ for batch in tqdm(eval_dataloader, desc="Evaluating"):
299
+
300
+ tokenized_text1, x_lengths = batch
301
+ x_lengths = x_lengths.to(args.device)
302
+ report_num_words += x_lengths.sum().item()
303
+
304
+ inputs, labels = tokenized_text1, tokenized_text1
305
+
306
+ inputs = inputs.to(args.device)
307
+ labels = labels.to(args.device)
308
+
309
+
310
+ with torch.no_grad():
311
+ outputs = model(inputs, labels=labels, label_ignore=tokenizer.pad_token_id)
312
+ lm_loss = outputs[0]
313
+
314
+
315
+ eval_loss += lm_loss.mean().item()/x_lengths.sum().item()
316
+ eval_loss_sum += lm_loss.sum().item()
317
+
318
+
319
+ nb_eval_steps += 1
320
+
321
+ # pdb.set_trace()
322
+
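+ # perplexity1 exponentiates the average of per-batch (loss / #tokens) ratios;
+ # perplexity2 exponentiates the corpus-level ratio (total summed loss / total #tokens)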
323
+ eval_loss = eval_loss / nb_eval_steps
324
+ perplexity1 = torch.exp(torch.tensor(eval_loss))
325
+ perplexity2 = torch.exp(torch.tensor(eval_loss_sum / report_num_words))
326
+
327
+
328
+
329
+ result = {
330
+ "perplexity1": perplexity1, "perplexity2": perplexity2
331
+ }
332
+
333
+ output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
334
+ with open(output_eval_file, "w") as writer:
335
+ logger.info("***** Eval results {} *****".format(prefix))
336
+ for key in sorted(result.keys()):
337
+ logger.info(" %s = %s", key, str(result[key]))
338
+ writer.write("%s = %s\n" % (key, str(result[key])))
339
+
340
+ return result
341
+
342
+
343
+ def main():
344
+ parser = argparse.ArgumentParser()
345
+
346
+ ## Required parameters
347
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
348
+ help="The input training data file (a text file).")
349
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
350
+ help="The output directory where the model predictions and checkpoints will be written.")
351
+ parser.add_argument("--dataset", default=None, type=str, help="The dataset.")
352
+
353
+
354
+ ## Other parameters
355
+ parser.add_argument("--eval_data_file", default=None, type=str,
356
+ help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
357
+
358
+ parser.add_argument("--model_type", default="bert", type=str,
359
+ help="The model architecture to be fine-tuned.")
360
+ parser.add_argument("--model_name_or_path", default="bert-base-cased", type=str,
361
+ help="The model checkpoint for weights initialization.")
362
+
363
+
364
+ parser.add_argument("--use_philly", action='store_true',
365
+ help="Use Philly for computing.")
366
+
367
+ parser.add_argument("--mlm", action='store_true',
368
+ help="Train with masked-language modeling loss instead of language modeling.")
369
+ parser.add_argument("--mlm_probability", type=float, default=0.15,
370
+ help="Ratio of tokens to mask for masked language modeling loss")
371
+
372
+ parser.add_argument("--config_name", default="", type=str,
373
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
374
+ parser.add_argument("--tokenizer_name", default="", type=str,
375
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
376
+ parser.add_argument("--cache_dir", default="", type=str,
377
+ help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)")
378
+ parser.add_argument("--block_size", default=-1, type=int,
379
+ help="Optional input sequence length after tokenization."
380
+ "The training dataset will be truncated in block of this size for training."
381
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
382
+ parser.add_argument("--do_train", action='store_true',
383
+ help="Whether to run training.")
384
+ parser.add_argument("--do_eval", action='store_true',
385
+ help="Whether to run eval on the dev set.")
386
+ parser.add_argument("--evaluate_during_training", action='store_true',
387
+ help="Run evaluation during training at each logging step.")
388
+ parser.add_argument("--do_lower_case", action='store_true',
389
+ help="Set this flag if you are using an uncased model.")
390
+
391
+ parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
392
+ help="Batch size per GPU/CPU for training.")
393
+ parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int,
394
+ help="Batch size per GPU/CPU for evaluation.")
395
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
396
+ help="Number of updates steps to accumulate before performing a backward/update pass.")
397
+ parser.add_argument("--learning_rate", default=5e-5, type=float,
398
+ help="The initial learning rate for Adam.")
399
+ parser.add_argument("--weight_decay", default=0.0, type=float,
400
+ help="Weight deay if we apply some.")
401
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float,
402
+ help="Epsilon for Adam optimizer.")
403
+ parser.add_argument("--max_grad_norm", default=1.0, type=float,
404
+ help="Max gradient norm.")
405
+ parser.add_argument("--num_train_epochs", default=1.0, type=float,
406
+ help="Total number of training epochs to perform.")
407
+ parser.add_argument("--max_steps", default=-1, type=int,
408
+ help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
409
+ parser.add_argument("--warmup_steps", default=0, type=int,
410
+ help="Linear warmup over warmup_steps.")
411
+
412
+ parser.add_argument('--gloabl_step_eval', type=int, default=661,
413
+ help="Evaluate the results at the given global step")
414
+
415
+
416
+
417
+ parser.add_argument('--logging_steps', type=int, default=100,
418
+ help="Log every X updates steps.")
419
+ parser.add_argument('--save_steps', type=int, default=100,
420
+ help="Save checkpoint every X updates steps.")
421
+ parser.add_argument("--eval_all_checkpoints", action='store_true',
422
+ help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
423
+ parser.add_argument("--no_cuda", action='store_true',
424
+ help="Avoid using CUDA when available")
425
+ parser.add_argument('--overwrite_output_dir', action='store_true',
426
+ help="Overwrite the content of the output directory")
427
+ parser.add_argument('--overwrite_cache', action='store_true',
428
+ help="Overwrite the cached training and evaluation sets")
429
+ parser.add_argument('--seed', type=int, default=42,
430
+ help="random seed for initialization")
431
+
432
+ parser.add_argument('--fp16', action='store_true',
433
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
434
+ parser.add_argument('--fp16_opt_level', type=str, default='O1',
435
+ help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
436
+ "See details at https://nvidia.github.io/apex/amp.html")
437
+ parser.add_argument("--local_rank", type=int, default=-1,
438
+ help="For distributed training: local_rank")
439
+ parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
440
+ parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
441
+ args = parser.parse_args()
442
+
443
+ if args.model_type in ["bert", "roberta"] and not args.mlm:
444
+ raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
445
+ "flag (masked language modeling).")
446
+ if args.eval_data_file is None and args.do_eval:
447
+ raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
448
+ "or remove the --do_eval argument.")
449
+
450
+ if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
451
+ raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
452
+
453
+ # Setup distant debugging if needed
454
+ if args.server_ip and args.server_port:
455
+ # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
456
+ import ptvsd
457
+ print("Waiting for debugger attach")
458
+ ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
459
+ ptvsd.wait_for_attach()
460
+
461
+ # Setup CUDA, GPU & distributed training
462
+ if args.local_rank == -1 or args.no_cuda:
463
+ device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
464
+ args.n_gpu = torch.cuda.device_count()
465
+ else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
466
+ torch.cuda.set_device(args.local_rank)
467
+ device = torch.device("cuda", args.local_rank)
468
+ torch.distributed.init_process_group(backend='nccl')
469
+ args.n_gpu = 1
470
+ args.device = device
471
+
472
+ # Setup logging
473
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
474
+ datefmt = '%m/%d/%Y %H:%M:%S',
475
+ level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
476
+ logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
477
+ args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
478
+
479
+ # Set seed
480
+ set_seed(args)
481
+
482
+ # Load pretrained model and tokenizer
483
+ if args.local_rank not in [-1, 0]:
484
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training downloads the model & vocab
485
+
486
+ config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
487
+ config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
488
+ tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
489
+ if args.block_size <= 0:
490
+ args.block_size = tokenizer.max_len_single_sentence # Our input block size will be the max possible for the model
491
+ args.block_size = min(args.block_size, tokenizer.max_len_single_sentence)
492
+ model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
493
+ model.to(args.device)
494
+
495
+ # Chunyuan: Add Padding token to GPT2
496
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
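+ # GPT-2 has no padding token by default, so PAD/BOS/EOS are registered here and the embedding matrix is resized below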
497
+ num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
498
+ print('We have added', num_added_toks, 'tokens to GPT2')
499
+ model.resize_token_embeddings(len(tokenizer)) # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
500
+ assert tokenizer.pad_token == '<PAD>'
501
+
502
+
503
+ # pdb.set_trace()
504
+
505
+ if args.local_rank == 0:
506
+ torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training downloads the model & vocab
507
+
508
+ logger.info("Training/evaluation parameters %s", args)
509
+
510
+ # Training
511
+ global_step= 0
512
+ if args.do_train:
513
+ if args.local_rank not in [-1, 0]:
514
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training processes the dataset, and the others will use the cache
515
+
516
+ train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
517
+
518
+ if args.local_rank == 0:
519
+ torch.distributed.barrier()
520
+
521
+ global_step, tr_loss = train(args, train_dataset, model, tokenizer)
522
+ logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
523
+
524
+
525
+ # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
526
+ if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
527
+ # Create output directory if needed
528
+ if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
529
+ os.makedirs(args.output_dir)
530
+
531
+ logger.info("Saving model checkpoint to %s", args.output_dir)
532
+ # Save a trained model, configuration and tokenizer using `save_pretrained()`.
533
+ # They can then be reloaded using `from_pretrained()`
534
+ model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
535
+ model_to_save.save_pretrained(args.output_dir)
536
+ tokenizer.save_pretrained(args.output_dir)
537
+
538
+ # Good practice: save your training arguments together with the trained model
539
+ torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
540
+
541
+ # Load a trained model and vocabulary that you have fine-tuned
542
+ model = model_class.from_pretrained(args.output_dir)
543
+ tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
544
+ model.to(args.device)
545
+
546
+
547
+ # Evaluation
548
+ results = {}
549
+ if args.do_eval and args.local_rank in [-1, 0]:
550
+
551
+ if global_step == 0:
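+ # training was skipped in this run, so fall back to the checkpoint step given by the --gloabl_step_eval argument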
552
+ global_step = args.gloabl_step_eval
553
+ output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
554
+
555
+ checkpoints = [args.output_dir]
556
+ if args.eval_all_checkpoints:
557
+ checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
558
+ logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
559
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
560
+ print("Evaluate the following checkpoints: %s", checkpoints)
561
+ for checkpoint in checkpoints:
562
+ global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
563
+ model = model_class.from_pretrained(checkpoint)
564
+ model.to(args.device)
565
+ result = evaluate(args, model, tokenizer, prefix=global_step)
566
+ result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
567
+ results.update(result)
568
+
569
+ return results
570
+
571
+
572
+ if __name__ == "__main__":
573
+ main()
Optimus/code/examples/big_ae/run_lm_gpt2_training.py ADDED
@@ -0,0 +1,658 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
18
+ GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
19
+ using a masked language modeling (MLM) loss.
20
+ """
21
+
22
+ from __future__ import absolute_import, division, print_function
23
+
24
+
25
+ import pdb
26
+ import argparse
27
+ import glob
28
+ import logging
29
+
30
+ import os
31
+ import pickle
32
+ import random
33
+
34
+ import numpy as np
35
+ import torch
36
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
37
+ from torch.utils.data.distributed import DistributedSampler
38
+ from tensorboardX import SummaryWriter
39
+ from tqdm import tqdm, trange
40
+ from collections import defaultdict
41
+
42
+ # from azure.cosmosdb.table.tableservice import TableService
43
+ # from azure.cosmosdb.table.models import Entity
44
+ from datetime import datetime
45
+
46
+
47
+
48
+ from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
49
+ BertConfig, BertForLatentConnector, BertTokenizer,
50
+ GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
51
+ OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
52
+ RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
53
+
54
+ from utils import (BucketingDataLoader, TextDataset_Split, TextDataset_2Tokenizers)
55
+
56
+
57
+ logger = logging.getLogger(__name__)
58
+
59
+
60
+ MODEL_CLASSES = {
61
+ 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
62
+ 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
63
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer),
64
+ 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
65
+ }
66
+
67
+
68
+ storage_name="textae"
69
+ key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA=="
70
+ # ts = TableService(account_name=storage_name, account_key=key)
71
+
72
+
73
+ def load_and_cache_examples(args, tokenizer, evaluate=False):
74
+ if isinstance(tokenizer, list):
75
+ dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
76
+ else:
77
+ dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
78
+ return dataset
79
+
80
+ def build_dataload_and_cache_examples(args, tokenizer, evaluate=False):
81
+ if isinstance(tokenizer, list):
82
+ if not evaluate:
83
+ args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
84
+ file_path=args.train_data_file
85
+ else:
86
+ args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
87
+ file_path=args.eval_data_file
88
+ dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True)
89
+ else:
90
+ pass
91
+ return dataloader
92
+
93
+
94
+
95
+
96
+ def set_seed(args):
97
+ random.seed(args.seed)
98
+ np.random.seed(args.seed)
99
+ torch.manual_seed(args.seed)
100
+ if args.n_gpu > 0:
101
+ torch.cuda.manual_seed_all(args.seed)
102
+
103
+
104
+ def mask_tokens(inputs, tokenizer, args):
105
+ """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
106
+ labels = inputs.clone()
107
+ # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
108
+
109
+ masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8)
110
+ labels[masked_indices==1] = -1 # We only compute loss on masked tokens
111
+
112
+ # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
113
+ indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices
114
+ inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
115
+
116
+ # 10% of the time, we replace masked input tokens with random word
117
+ indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced
118
+ indices_random = indices_random
119
+ random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
120
+ inputs[indices_random] = random_words[indices_random]
121
+
122
+ # The rest of the time (10% of the time) we keep the masked input tokens unchanged
123
+ return inputs, labels
124
+
125
+
126
+ def train(args, train_dataloader, model, encoder_tokenizer, decoder_tokenizer, table_name):
127
+ """ Train the model """
128
+ if args.local_rank in [-1, 0]:
129
+ tb_writer = SummaryWriter()
130
+
131
+ args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
132
+ # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
133
+ # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
134
+
135
+ if args.max_steps > 0:
136
+ t_total = args.max_steps
137
+ args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
138
+ else:
139
+ t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
140
+
141
+ # Prepare optimizer and schedule (linear warmup and decay)
142
+
143
+
144
+ no_decay = ['bias', 'LayerNorm.weight']
145
+ optimizer_grouped_parameters = [
146
+ {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
147
+ {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
148
+ ]
149
+
150
+ optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
151
+ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
152
+
153
+
154
+ if args.fp16:
155
+ try:
156
+ from apex import amp
157
+ except ImportError:
158
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
159
+ model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
160
+
161
+ # multi-gpu training (should be after apex fp16 initialization)
162
+ if args.n_gpu > 1:
163
+ model = torch.nn.DataParallel(model, device_ids=range(args.n_gpu)).to(args.device)
164
+
165
+ # Distributed training (should be after apex fp16 initialization)
166
+ if args.local_rank != -1:
167
+ model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
168
+ output_device=args.local_rank,
169
+ find_unused_parameters=True)
170
+
171
+
172
+ # Train!
173
+ logger.info("***** Running training *****")
174
+ logger.info(" Num examples = %d", train_dataloader.num_examples)
175
+ logger.info(" Num Epochs = %d", args.num_train_epochs)
176
+ logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
177
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
178
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
179
+ logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
180
+ logger.info(" Total optimization steps = %d", t_total)
181
+
182
+ global_step = 0
183
+ tr_loss, logging_loss = 0.0, 0.0
184
+
185
+
186
+ model.zero_grad()
187
+ train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
188
+
189
+ n_iter = int(args.num_train_epochs) * len(train_dataloader)
190
+
191
+ tmp_list = []
192
+ set_seed(args) # Added here for reproducibility (even between python 2 and 3)
193
+ for epoch in train_iterator:
194
+ epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
195
+ for step, batch in enumerate(epoch_iterator):
196
+
197
+ tokenized_text0, tokenized_text1, tokenized_text_lengths = batch
198
+ inputs, labels = tokenized_text1.to(args.device), tokenized_text1.to(args.device)
199
+
200
+ model.train()
201
+
202
+ outputs = model(inputs, labels=labels, label_ignore=decoder_tokenizer.pad_token_id)
203
+ loss = outputs[0].mean() # model outputs are always tuple in pytorch-transformers (see doc)
204
+
205
+ if args.n_gpu > 1:
206
+ loss = loss.mean()
207
+
208
+ if args.use_philly:
209
+ print("PROGRESS: {}%".format(round(100 * (step + epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4)))
210
+ print("EVALERR: {}%".format(loss))
211
+
212
+ epoch_iterator.set_description(
213
+ (
214
+ f'iter: {step + epoch*len(epoch_iterator) }; loss: {loss.item():.3f}; '
215
+ )
216
+ )
217
+
218
+ if args.gradient_accumulation_steps > 1:
219
+ loss = loss / args.gradient_accumulation_steps
220
+
221
+ if args.fp16:
222
+ with amp.scale_loss(loss, optimizer) as scaled_loss:
223
+ scaled_loss.backward()
224
+ else:
225
+ loss.backward()
226
+
227
+ tr_loss += loss.item()
228
+ if (step + 1) % args.gradient_accumulation_steps == 0:
229
+ if args.fp16:
230
+ torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
231
+ else:
232
+ torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
233
+
234
+ optimizer.step()
235
+
236
+ scheduler.step() # Update learning rate schedule
237
+
238
+ model.zero_grad()
239
+
240
+ global_step += 1
241
+
242
+
243
+ if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
244
+ # Log metrics
245
+ if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
246
+ results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer)
247
+ for key, value in results.items():
248
+ tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
249
+ tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
250
+ tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
251
+ logging_loss = tr_loss
252
+
253
+ if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
254
+
255
+ # Save decoder model checkpoint
256
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
257
+
258
+ if not os.path.exists(output_decoder_dir):
259
+ os.makedirs(output_decoder_dir)
260
+
261
+ model_decoder_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
262
+ if args.use_philly:
263
+ save_solid = False
264
+ while not save_solid:
265
+ try:
266
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
267
+ torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin'))
268
+ logger.info("Saving model checkpoint to %s", output_decoder_dir)
269
+ save_solid = True
270
+ except:
271
+ pass
272
+ else:
273
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
274
+ torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin'))
275
+ logger.info("Saving model checkpoint to %s", output_decoder_dir)
276
+
277
+
278
+ if args.max_steps > 0 and global_step > args.max_steps:
279
+ epoch_iterator.close()
280
+ break
281
+
282
+
283
+ if args.max_steps > 0 and global_step > args.max_steps:
284
+ train_iterator.close()
285
+ break
286
+
287
+ if args.local_rank in [-1, 0]:
288
+ tb_writer.close()
289
+
290
+ return global_step, tr_loss / global_step
291
+
292
+
293
+ def evaluate(args, model, encoder_tokenizer, decoder_tokenizer, table_name, prefix="", subset="test"):
294
+ # Loop to handle MNLI double evaluation (matched, mis-matched)
295
+ eval_output_dir = args.output_dir
296
+
297
+ logger.info("***** Running evaluation on {} dataset *****".format(subset))
298
+
299
+ if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
300
+ os.makedirs(eval_output_dir)
301
+
302
+ args.per_gpu_eval_batch_size = 1
303
+ args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
304
+
305
+ eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True)
306
+
307
+ # Eval!
308
+ logger.info("***** Running evaluation {} *****".format(prefix))
309
+ logger.info(" Num examples = %d", len(eval_dataloader))
310
+ logger.info(" Batch size = %d", args.eval_batch_size)
311
+ eval_loss = 0.0
312
+ eval_loss_sum = 0.0
313
+ nb_eval_steps = 0
314
+ report_num_words = 0
315
+
316
+ model.eval()
317
+
318
+ for batch in tqdm(eval_dataloader, desc="Evaluating"):
319
+
320
+ _, tokenized_text1, tokenized_text_lengths = batch
321
+ inputs, labels = tokenized_text1.to(args.device), tokenized_text1.to(args.device)
322
+
323
+ x_lengths = tokenized_text_lengths[:,1].to(args.device)
324
+ report_num_words += x_lengths.sum().item()
325
+
326
+
327
+ with torch.no_grad():
328
+ outputs = model(inputs, labels=labels, label_ignore=decoder_tokenizer.pad_token_id)
329
+ lm_loss = outputs[0]
330
+
331
+ eval_loss += lm_loss.mean().item()/x_lengths.sum().item()
332
+ eval_loss_sum += lm_loss.sum().item()
333
+
334
+ nb_eval_steps += 1
335
+
336
+ eval_loss = eval_loss / nb_eval_steps
337
+ perplexity1 = torch.exp(torch.tensor(eval_loss))
338
+ perplexity2 = torch.exp(torch.tensor(eval_loss_sum / report_num_words))
339
+
340
+
341
+ result = {
342
+ "perplexity1": perplexity1, "perplexity2": perplexity2
343
+ }
344
+
345
+ output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
346
+ with open(output_eval_file, "w") as writer:
347
+ logger.info("***** Eval results {} *****".format(prefix))
348
+ for key in sorted(result.keys()):
349
+ logger.info(" %s = %s", key, str(result[key]))
350
+ writer.write("%s = %s\n" % (key, str(result[key])))
351
+
352
+
353
+
354
+
355
+ return result
356
+
357
+
358
+ def main():
359
+ parser = argparse.ArgumentParser()
360
+
361
+ ## Required parameters
362
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
363
+ help="The input training data file (a text file).")
364
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
365
+ help="The output directory where the model predictions and checkpoints will be written.")
366
+ parser.add_argument("--dataset", default=None, type=str, help="The dataset.")
367
+
368
+ ## Other parameters
369
+ parser.add_argument("--eval_data_file", default=None, type=str,
370
+ help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
371
+ parser.add_argument("--ExpName", default="", type=str,
372
+ help="The experiment name used in Azure Table.")
373
+ parser.add_argument("--save_bert_gpt_init", action='store_true',
374
+ help="Use Philly for computing.")
375
+
376
+
377
+ ## Encoder options
378
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
379
+ help="The encoder model architecture to be fine-tuned.")
380
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
381
+ help="The encoder model checkpoint for weights initialization.")
382
+ parser.add_argument("--encoder_config_name", default="", type=str,
383
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
384
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
385
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
386
+
387
+ ## Decoder options
388
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
389
+ help="The decoder model architecture to be fine-tuned.")
390
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
391
+ help="The decoder model checkpoint for weights initialization.")
392
+ parser.add_argument("--decoder_config_name", default="", type=str,
393
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
394
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
395
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
396
+
397
+ ## Variational auto-encoder
398
+ parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.")
399
+ parser.add_argument("--use_deterministic_connect", action='store_true',
400
+ help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.")
401
+ parser.add_argument("--use_pretrained_model", action='store_true',
402
+ help="Use pre-trained auto-encoder models as the initialization")
403
+
404
+ ## Objective functions
405
+ parser.add_argument("--mlm", action='store_true',
406
+ help="Train with masked-language modeling loss instead of language modeling.")
407
+ parser.add_argument("--mlm_probability", type=float, default=0.15,
408
+ help="Ratio of tokens to mask for masked language modeling loss")
409
+ parser.add_argument("--beta", type=float, default=1.0,
410
+ help="The weighting hyper-parameter of the KL term in VAE")
411
+
412
+
413
+ parser.add_argument("--cache_dir", default="", type=str,
414
+ help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)")
415
+ parser.add_argument("--max_seq_length", default=512, type=int,
416
+ help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length")
417
+ parser.add_argument("--block_size", default=-1, type=int,
418
+ help="Optional input sequence length after tokenization."
419
+ "The training dataset will be truncated in block of this size for training."
420
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
421
+ parser.add_argument("--do_train", action='store_true',
422
+ help="Whether to run training.")
423
+ parser.add_argument("--do_eval", action='store_true',
424
+ help="Whether to run eval on the dev set.")
425
+ parser.add_argument("--evaluate_during_training", action='store_true',
426
+ help="Run evaluation during training at each logging step.")
427
+ parser.add_argument("--do_lower_case", action='store_true',
428
+ help="Set this flag if you are using an uncased model.")
429
+
430
+
431
+ # Training Schedules
432
+ parser.add_argument("--ratio_increase", default=0.25, type=float,
433
+ help="Learning schedule, the percentage for the annealing stage.")
434
+ parser.add_argument("--ratio_zero", default=0.25, type=float,
435
+ help="Learning schedule, the percentage for the pure auto-encoding stage.")
436
+ parser.add_argument("--fb_mode", default=0, type=int,
437
+ help="free bit training mode.")
438
+ parser.add_argument("--dim_target_kl", default=3.0, type=float,
439
+ help="dim_target_kl free bit training mode.")
440
+ parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
441
+ help="Batch size per GPU/CPU for training.")
442
+ parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int,
443
+ help="Batch size per GPU/CPU for evaluation.")
444
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
445
+ help="Number of updates steps to accumulate before performing a backward/update pass.")
446
+ parser.add_argument("--learning_rate", default=5e-5, type=float,
447
+ help="The initial learning rate for Adam.")
448
+ parser.add_argument("--weight_decay", default=0.0, type=float,
449
+ help="Weight deay if we apply some.")
450
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float,
451
+ help="Epsilon for Adam optimizer.")
452
+ parser.add_argument("--max_grad_norm", default=1.0, type=float,
453
+ help="Max gradient norm.")
454
+ parser.add_argument("--num_train_epochs", default=1.0, type=float,
455
+ help="Total number of training epochs to perform.")
456
+ parser.add_argument("--max_steps", default=-1, type=int,
457
+ help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
458
+ parser.add_argument("--warmup_steps", default=0, type=int,
459
+ help="Linear warmup over warmup_steps.")
460
+ parser.add_argument("--use_philly", action='store_true',
461
+ help="Use Philly for computing.")
462
+
463
+
464
+ ## IO: Logging and Saving
465
+ parser.add_argument('--logging_steps', type=int, default=50,
466
+ help="Log every X updates steps.")
467
+ parser.add_argument('--save_steps', type=int, default=50,
468
+ help="Save checkpoint every X updates steps.")
469
+ parser.add_argument("--eval_all_checkpoints", action='store_true',
470
+ help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
471
+ parser.add_argument("--no_cuda", action='store_true',
472
+ help="Avoid using CUDA when available")
473
+ parser.add_argument('--overwrite_output_dir', action='store_true',
474
+ help="Overwrite the content of the output directory")
475
+ parser.add_argument('--overwrite_cache', action='store_true',
476
+ help="Overwrite the cached training and evaluation sets")
477
+ parser.add_argument('--seed', type=int, default=42,
478
+ help="random seed for initialization")
479
+ parser.add_argument('--gloabl_step_eval', type=int, default=661,
480
+ help="Evaluate the results at the given global step")
481
+
482
+ # Precision & Distributed Training
483
+ parser.add_argument('--fp16', action='store_true',
484
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
485
+ parser.add_argument('--fp16_opt_level', type=str, default='O1',
486
+ help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
487
+ "See details at https://nvidia.github.io/apex/amp.html")
488
+ parser.add_argument("--local_rank", type=int, default=-1,
489
+ help="For distributed training: local_rank")
490
+ parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
491
+ parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
492
+ args = parser.parse_args()
493
+
494
+ if args.decoder_model_type in ["bert", "roberta"] and not args.mlm:
495
+ raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
496
+ "flag (masked language modeling).")
497
+ if args.eval_data_file is None and args.do_eval:
498
+ raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
499
+ "or remove the --do_eval argument.")
500
+
501
+ if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
502
+ raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
503
+
504
+ # Setup distant debugging if needed
505
+ if args.server_ip and args.server_port:
506
+ # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
507
+ import ptvsd
508
+ print("Waiting for debugger attach")
509
+ ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
510
+ ptvsd.wait_for_attach()
511
+
512
+ # Setup CUDA, GPU & distributed training
513
+ if args.local_rank == -1 or args.no_cuda:
514
+ device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
515
+ args.n_gpu = torch.cuda.device_count()
516
+ else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
517
+ torch.cuda.set_device(args.local_rank)
518
+ device = torch.device("cuda", args.local_rank)
519
+ torch.distributed.init_process_group(backend='nccl')
520
+ args.n_gpu = 1
521
+ args.device = device
522
+
523
+ # Setup logging
524
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
525
+ datefmt = '%m/%d/%Y %H:%M:%S',
526
+ level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
527
+ logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
528
+ args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
529
+
530
+ args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero)
531
+ table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size)
532
+ try:
533
+ ts.create_table(table_name)
534
+ except:
535
+ pass
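+ # The bare except means any failure here (e.g. the Azure TableService client ts being unavailable) silently skips table creation.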
536
+
537
+
538
+ # Set seed
539
+ set_seed(args)
540
+
541
+ # Load pretrained model and tokenizer
542
+ if args.local_rank not in [-1, 0]:
543
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training downloads the model & vocab
544
+
545
+
546
+ ## Encoder
547
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
548
+ encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path)
549
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
550
+ if args.block_size <= 0:
551
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
552
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
553
+ model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size)
554
+ # model_encoder.to(args.device)
555
+
556
+ ## Decoder
557
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
558
+ decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path)
559
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
560
+ if args.block_size <= 0:
561
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
562
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
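+ # After both clamps above, block_size cannot exceed the smaller of the encoder and decoder max single-sentence lengths.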
563
+ model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config)
564
+
565
+ # Chunyuan: Add Padding token to GPT2
566
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
567
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
568
+ print('We have added', num_added_toks, 'tokens to GPT2')
569
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
570
+ assert tokenizer_decoder.pad_token == '<PAD>'
571
+
572
+ model_decoder.to(args.device)
573
+
574
+
575
+ if args.local_rank == 0:
576
+ torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training downloads the model & vocab
577
+
578
+ logger.info("Training/evaluation parameters %s", args)
579
+
580
+ global_step= 0
581
+ # Training
582
+ if args.do_train:
583
+ if args.local_rank not in [-1, 0]:
584
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training processes the dataset; the others will use the cache
585
+
586
+ train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False)
587
+
588
+ if args.local_rank == 0:
589
+ torch.distributed.barrier()
590
+
591
+ global_step, tr_loss = train(args, train_dataloader, model_decoder, tokenizer_encoder, tokenizer_decoder, table_name)
592
+ logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
593
+
594
+
595
+ # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
596
+ if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
597
+ # Create output directory if needed
598
+ # Save model checkpoint
599
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
600
+ if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]:
601
+ os.makedirs(output_decoder_dir)
602
+
603
+
604
+ logger.info("Saving decoder model checkpoint to %s", output_decoder_dir)
605
+ # Save a trained model, configuration and tokenizer using `save_pretrained()`.
606
+ # They can then be reloaded using `from_pretrained()`
607
+
608
+ model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training
609
+
610
+ # Good practice: save your training arguments together with the trained model
611
+
612
+ if args.use_philly:
613
+ save_solid = False
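+ # On the Philly cluster, checkpoint writes can fail transiently, so keep retrying until save_pretrained succeeds.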
614
+ while not save_solid:
615
+ try:
616
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
617
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
618
+ save_solid = True
619
+ except:
620
+ pass
621
+ else:
622
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
623
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
624
+
625
+ # Load a trained model and vocabulary that you have fine-tuned
626
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir)
627
+ model_decoder.to(args.device)
628
+
629
+
630
+ # Evaluation
631
+ results = {}
632
+ if args.do_eval and args.local_rank in [-1, 0]:
633
+ if global_step == 0:
634
+ global_step = args.gloabl_step_eval
635
+
636
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
637
+ checkpoints = [ output_decoder_dir ]
638
+
639
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
640
+ for checkpoint in checkpoints:
641
+ global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
642
+
643
+ model_decoder = decoder_model_class.from_pretrained(checkpoint)
644
+ model_decoder.to(args.device)
645
+
646
+ result = evaluate(args, model_decoder, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='test')
647
+ result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
648
+ results.update(result)
649
+
650
+ # result = evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='train')
651
+ # result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
652
+ # results.update(result)
653
+
654
+ return results
655
+
656
+
657
+ if __name__ == "__main__":
658
+ main()
Optimus/code/examples/big_ae/run_lm_vae_label_ctrl_gen.py ADDED
@@ -0,0 +1,875 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
18
+ GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
19
+ using a masked language modeling (MLM) loss.
20
+ """
21
+
22
+ from __future__ import absolute_import, division, print_function
23
+ import pdb
24
+ import argparse
25
+ import glob
26
+ import logging
27
+ import os
28
+ import pickle
29
+ import random
30
+ import numpy as np
31
+ import torch
32
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
33
+ from torch.utils.data.distributed import DistributedSampler
34
+ from tensorboardX import SummaryWriter
35
+ from tqdm import tqdm, trange
36
+ from collections import defaultdict
37
+ # from azure.cosmosdb.table.tableservice import TableService
38
+ # from azure.cosmosdb.table.models import Entity
39
+ from datetime import datetime
40
+ import sys
41
+ import json
42
+ import nltk
43
+ nltk.download('punkt')
44
+
45
+ sys.path.append('../../')
46
+ from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
47
+ BertConfig, BertForLatentConnector, BertTokenizer,
48
+ GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer,
49
+ OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
50
+ RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
51
+ from utils import (TextDataset_Split, TextDataset_2Tokenizers_LCtrlG,
52
+ frange_cycle_linear, frange_cycle_zero_linear, AverageValueMeter)
53
+ # from modules import ARAE
54
+ from modules import CARA
55
+ # logging.getLogger("azure").setLevel(logging.WARNING)
56
+ # logging.getLogger("TableService").setLevel(logging.WARNING)
57
+ logger = logging.getLogger(__name__)
58
+ import time
59
+ def get_time_str():
60
+ return time.ctime().replace(' ', '_').replace(':', '-')
61
+
62
+ MODEL_CLASSES = {
63
+ 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer),
64
+ 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
65
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer),
66
+ 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
67
+ }
68
+
69
+
70
+ storage_name="textae"
71
+ key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA=="
72
+ # ts = TableService(account_name=storage_name, account_key=key)
73
+
74
+ def load_and_cache_examples(args, tokenizer, evaluate=False):
75
+ if isinstance(tokenizer, list):
76
+ dataset = TextDataset_2Tokenizers_LCtrlG(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file,
77
+ block_size=args.block_size, create_new=args.create_new)
78
+ else:
79
+ raise NotImplementedError
80
+ # dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
81
+ return dataset
82
+
83
+ def set_seed(args):
84
+ random.seed(args.seed)
85
+ np.random.seed(args.seed)
86
+ torch.manual_seed(args.seed)
87
+ if args.n_gpu > 0:
88
+ torch.cuda.manual_seed_all(args.seed)
89
+
90
+ def mask_tokens(inputs, tokenizer, args):
91
+ """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
92
+ labels = inputs.clone()
93
+ # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
94
+
95
+ masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8)
96
+ labels[masked_indices==1] = -1 # We only compute loss on masked tokens
97
+
98
+ # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
99
+ indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices
100
+ inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
101
+
102
+ # 10% of the time, we replace masked input tokens with random word
103
+ indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced
104
+ indices_random = indices_random
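+ # 0.5 of the masked positions not already replaced with [MASK] (i.e. 10% of all masked positions) receive a random token instead.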
105
+ random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
106
+ inputs[indices_random] = random_words[indices_random]
107
+
108
+ # The rest of the time (10% of the time) we keep the masked input tokens unchanged
109
+ return inputs, labels
110
+
111
+ def train(args, train_dataset, model_vae, encoder_tokenizer, decoder_tokenizer, table_name, logff):
112
+ """ Train the model """
113
+ if args.local_rank in [-1, 0]:
114
+ tb_writer = SummaryWriter()
115
+ args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
116
+ train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
117
+ train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
118
+ if args.max_steps > 0:
119
+ t_total = args.max_steps
120
+ args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
121
+ else:
122
+ t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
123
+ # Prepare optimizer and schedule (linear warmup and decay)
124
+ # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear
125
+ no_decay = ['bias', 'LayerNorm.weight']
126
+ optimizer_grouped_parameters = [
127
+ {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
128
+ {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
129
+ ]
130
+ optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
131
+ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
132
+ if args.fp16:
133
+ try:
134
+ from apex import amp
135
+ except ImportError:
136
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
137
+ model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level)
138
+ # multi-gpu training (should be after apex fp16 initialization)
139
+ if args.n_gpu > 1:
140
+ model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device)
141
+ # Distributed training (should be after apex fp16 initialization)
142
+ if args.local_rank != -1:
143
+ model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
144
+ # model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training
145
+
146
+ # Train!
147
+ logger.info("***** Running training *****")
148
+ logff.write("***** Running training *****\n")
149
+ logger.info(" Num examples = {}".format(len(train_dataset)))
150
+ logff.write(" Num examples = {}\n".format(len(train_dataset)))
151
+ logger.info(" Num Epochs = {}".format(args.num_train_epochs))
152
+ logff.write(" Num Epochs = {}\n".format(args.num_train_epochs))
153
+ logger.info(" Instantaneous batch size per GPU = {}".format(args.per_gpu_train_batch_size))
154
+ logff.write(" Instantaneous batch size per GPU = {}\n".format(args.per_gpu_train_batch_size))
155
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
156
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
157
+ logff.write(" Total train batch size (w. parallel, distributed & accumulation) = {}\n".format(
158
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)))
159
+ logger.info(" Gradient Accumulation steps = {}".format(args.gradient_accumulation_steps))
160
+ logff.write(" Gradient Accumulation steps = {}\n".format(args.gradient_accumulation_steps))
161
+ logger.info(" Total optimization steps = {}".format( t_total))
162
+ logff.write(" Total optimization steps = {}\n".format(t_total))
163
+ logff.flush()
164
+ global_step = 0
165
+ tr_loss, logging_loss = 0.0, 0.0
166
+ model_vae.zero_grad()
167
+ train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
168
+ n_iter = int(args.num_train_epochs) * len(train_dataloader)
169
+ beta_t_list = frange_cycle_zero_linear(n_iter, start=1.0, stop=args.beta_cls, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero)
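+ # beta_t_list is a cyclical annealing schedule for the classifier weight (frange_cycle_zero_linear in utils): roughly zero for the first ratio_zero fraction, then ramped toward beta_cls; each step's value overwrites model_vae.module.args.beta_cls in the loop below.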
170
+
171
+ set_seed(args) # Added here for reproducibility (even between python 2 and 3)
172
+ accmeter = {
173
+ 'acc_encode_z_dis': AverageValueMeter(),
174
+ 'acc_gen_z_dis': AverageValueMeter(),
175
+ 'acc_encode_z_cls': AverageValueMeter(),
176
+ 'acc_cls': AverageValueMeter(),
177
+ # 'acc_at_soft_cls': AverageValueMeter(),
178
+ }
179
+ lossmeter = {
180
+ 'loss': AverageValueMeter(),
181
+ 'loss_rec': AverageValueMeter(),
182
+ 'loss_encoder': AverageValueMeter(),
183
+ 'loss_lsc': AverageValueMeter(),
184
+ 'loss_lsd': AverageValueMeter(),
185
+ 'loss_lsg': AverageValueMeter(),
186
+ 'loss_cls': AverageValueMeter(),
187
+ # 'loss_at_soft_cls': AverageValueMeter(),
188
+ }
189
+ for epoch in train_iterator:
190
+ epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
191
+ # pbar = tqdm(total=(len(train_dataloader)+1) // args.gradient_accumulation_steps)
192
+ for step, batch in enumerate(train_dataloader):
193
+
194
+ # if step > 100:
195
+ # break
196
+
197
+ # Data
198
+ input_seq_ids, tgt_seq_ids, tokenized_text_lengths, cond_labels = batch
199
+ max_len_values, _ = tokenized_text_lengths.max(0)
200
+ input_seq_ids = input_seq_ids[:,:max_len_values[0]]
201
+ tgt_seq_ids = tgt_seq_ids[:,:max_len_values[1]]
202
+ input_seq_ids, tgt_seq_ids = mask_tokens(input_seq_ids, encoder_tokenizer, args) if args.mlm else (input_seq_ids, tgt_seq_ids)
203
+ input_seq_ids = input_seq_ids.to(args.device)
204
+ tgt_seq_ids = tgt_seq_ids.to(args.device)
205
+ cond_labels = cond_labels.to(args.device)
206
+ input_mask = torch.where(torch.arange(max_len_values[0].item()).unsqueeze(0).repeat(input_seq_ids.size(0), 1).type_as(tokenized_text_lengths).to(args.device)
207
+ < tokenized_text_lengths[:, 0].unsqueeze(1).to(args.device), torch.ones_like(input_seq_ids), torch.zeros_like(input_seq_ids))
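+ # input_mask marks real tokens with 1 (positions before each sequence's tokenized length) and padding positions with 0.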
208
+
209
+ # Configs
210
+ model_vae.train()
211
+ beta_t = beta_t_list[step + epoch*len(epoch_iterator)]
212
+ model_vae.module.args.beta_cls = beta_t
213
+ # if beta_t == 0.0:
214
+ # model_vae.args.fb_mode = 0
215
+ # else:
216
+ # model_vae.args.fb_mode = 1
217
+ # if args.use_deterministic_connect:
218
+ # model_vae.args.fb_mode = 2
219
+
220
+ # Model
221
+ loss_dict, acc_dict = model_vae(input_seq_ids=input_seq_ids, tgt_seq_ids=tgt_seq_ids, cond_labels=cond_labels, attention_mask=input_mask)
222
+
223
+ # Loss
224
+ for key, value in loss_dict.items():
225
+ loss_dict[key] = value.mean()
226
+
227
+ loss = loss_dict['loss']
228
+ if args.gradient_accumulation_steps > 1:
229
+ loss = loss / args.gradient_accumulation_steps
230
+ if args.fp16:
231
+ with amp.scale_loss(loss, optimizer) as scaled_loss:
232
+ scaled_loss.backward()
233
+ else:
234
+ loss.backward()
235
+ tr_loss += loss.item()
236
+
237
+ # Log
238
+ for key, value in loss_dict.items():
239
+ lossmeter[key].add(value.item())
240
+
241
+ for key, value in acc_dict.items():
242
+ value = value.cpu().tolist()
243
+ for v in value:
244
+ accmeter[key].add(float(v))
245
+
246
+ # Optimize
247
+ if (step + 1) % args.gradient_accumulation_steps == 0:
248
+ # Optimize
249
+ if args.fp16:
250
+ torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
251
+ else:
252
+ torch.nn.utils.clip_grad_norm_(model_vae.parameters(), args.max_grad_norm)
253
+ optimizer.step()
254
+ scheduler.step() # Update learning rate schedule
255
+ model_vae.zero_grad()
256
+ global_step += 1
257
+ # pbar.update(1)
258
+
259
+ # Log
260
+ if global_step % args.logging_steps == 0:
261
+ logger.info("\n")
262
+ logger.info("global_step: {}, avg loss: {:3f}".format(global_step, tr_loss/global_step))
263
+ logff.write("global_step: {}, avg loss: {:3f}\n".format(global_step, tr_loss/global_step))
264
+ logger.info("loss: {}".format(', '.join(key + ': ' + str(round(meter.mean, 3)) for key, meter in lossmeter.items())))
265
+ logff.write("loss: {}\n".format(', '.join(key + ': ' + str(round(meter.mean, 3)) for key, meter in lossmeter.items())))
266
+ logger.info("acc: {}".format(', '.join(key + ': ' + str(round(meter.mean, 3)) for key, meter in accmeter.items())))
267
+ logff.write("acc: {}\n".format(', '.join(key + ': ' + str(round(meter.mean, 3)) for key, meter in accmeter.items())))
268
+ logff.flush()
269
+
270
+
271
+ if args.use_philly:
272
+ #if args.local_rank in [-1, 0]:
273
+ if args.logging_steps > 0 and global_step % args.logging_steps == 0:
274
+ logger.info("PROGRESS: {}%".format(round(100 * (step + epoch*len(train_dataloader) ) /(int(args.num_train_epochs) * len(train_dataloader)) , 4)))
275
+ logger.info("EVALERR: {}%".format(tr_loss / global_step))
276
+
277
+
278
+ if args.local_rank in [-1, 0] and args.eval_steps > 0 and global_step % args.eval_steps == 0:
279
+ # Log metrics
280
+ if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
281
+ results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer, table_name, epoch=epoch)
282
+ for key, value in results.items():
283
+ tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
284
+ tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
285
+ tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.eval_steps, global_step)
286
+ logging_loss = tr_loss
287
+
288
+ # Save checkpoints
289
+ if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
290
+ # Save encoder model checkpoint
291
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
292
+ if not os.path.exists(output_encoder_dir):
293
+ os.makedirs(output_encoder_dir)
294
+ model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training
295
+ if args.use_philly:
296
+ save_solid = False
297
+ while not save_solid:
298
+ try:
299
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
300
+ torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin'))
301
+ logger.info("Saving model checkpoint to %s", output_encoder_dir)
302
+ save_solid = True
303
+ except:
304
+ pass
305
+ else:
306
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
307
+ torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin'))
308
+ logger.info("Saving model checkpoint to %s", output_encoder_dir)
309
+
310
+ # Save decoder model checkpoint
311
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
312
+ if not os.path.exists(output_decoder_dir):
313
+ os.makedirs(output_decoder_dir)
314
+ model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training
315
+ if args.use_philly:
316
+ save_solid = False
317
+ while not save_solid:
318
+ try:
319
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
320
+ torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin'))
321
+ logger.info("Saving model checkpoint to %s", output_decoder_dir)
322
+ save_solid = True
323
+ except:
324
+ pass
325
+ else:
326
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
327
+ torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin'))
328
+ logger.info("Saving model checkpoint to %s", output_decoder_dir)
329
+
330
+ if args.max_steps > 0 and global_step > args.max_steps:
331
+ break
332
+
333
+ if args.max_steps > 0 and global_step > args.max_steps:
334
+ train_iterator.close()
335
+ break
336
+
337
+ if args.local_rank in [-1, 0]:
338
+ tb_writer.close()
339
+
340
+ return global_step, tr_loss / global_step
341
+
342
+
343
+ def evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer, table_name, prefix="", subset="test", epoch=None):
344
+
345
+ eval_output_dir = args.output_dir
346
+
347
+ if subset == 'test':
348
+ eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True)
349
+ elif subset == 'train':
350
+ eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=False)
351
+ else:
352
+ raise ValueError
353
+
354
+ args.label_size = len(eval_dataset.get_labels())
355
+
356
+ if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
357
+ os.makedirs(eval_output_dir)
358
+
359
+ args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
360
+ # Note that DistributedSampler samples randomly
361
+ eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
362
+ eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
363
+
364
+ # Eval!
365
+ logger.info("***** Running evaluation {} *****".format(prefix))
366
+ logger.info(" Num examples = %d", len(eval_dataset))
367
+ logger.info(" Batch size = %d", args.eval_batch_size)
368
+ logger.info(" Num steps = %d", len(eval_dataset) // args.eval_batch_size)
369
+ logger.info(" eval_output_dir = %s", eval_output_dir)
370
+
371
+ model_vae.eval()
372
+ model_vae_module = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training
373
+
374
+ outputs = {
375
+ 'sampled_cond_labels': None,
376
+ 'cond_labels': None,
377
+ 'tgt_seq_ids': None,
378
+ 'generated': None,
379
+ 'at_generated': None,
380
+ 'cg_generated': None,
381
+ 'pred_cls': None,
382
+ 'pred_ge_cls': None,
383
+ 'pred_at_cls': None,
384
+ 'pred_cg_cls': None,
385
+ }
386
+
387
+ for bi, batch in enumerate(tqdm(eval_dataloader, desc="#Sentences", disable=args.local_rank not in [-1, 0]) ):
388
+ # if bi == 3:
389
+ # break
390
+
391
+ # Data
392
+ input_seq_ids, tgt_seq_ids, tokenized_text_lengths, cond_labels = batch
393
+ max_len_values, _ = tokenized_text_lengths.max(0)
394
+ input_seq_ids = input_seq_ids[:,:max_len_values[0]]
395
+ tgt_seq_ids = tgt_seq_ids[:,:max_len_values[1]]
396
+ input_seq_ids = input_seq_ids.to(args.device)
397
+ tgt_seq_ids = tgt_seq_ids.to(args.device)
398
+ cond_labels = cond_labels.to(args.device)
399
+ input_mask = torch.where(torch.arange(max_len_values[0].item()).unsqueeze(0).repeat(input_seq_ids.size(0), 1).type_as(tokenized_text_lengths).to(args.device)
400
+ < tokenized_text_lengths[:, 0].unsqueeze(1).to(args.device), torch.ones_like(input_seq_ids), torch.zeros_like(input_seq_ids))
401
+
402
+ # Model
403
+ with torch.no_grad():
404
+ result = model_vae(input_seq_ids=input_seq_ids, tgt_seq_ids=tgt_seq_ids, cond_labels=cond_labels, attention_mask=input_mask)
405
+ if bi == 0:
406
+ for key in outputs.keys():
407
+ outputs[key] = result[key].cpu().tolist()
408
+ else:
409
+ for key in outputs.keys():
410
+ outputs[key].extend(result[key].cpu().tolist())
411
+
412
+ # compute accuracies and store in results
413
+ acc = np.mean(np.array(np.array(outputs['pred_cls']) == np.array(outputs['cond_labels']), dtype=np.float))
414
+ acc_ge = np.mean(np.array(np.array(outputs['pred_ge_cls']) == np.array(outputs['cond_labels']), dtype=np.float))
415
+ acc_at = np.mean(np.array(np.array(outputs['pred_at_cls']) == np.array(outputs['sampled_cond_labels']), dtype=np.float))
416
+ acc_cg = np.mean(np.array(np.array(outputs['pred_cg_cls']) == np.array(outputs['sampled_cond_labels']), dtype=np.float))
417
+ metrics = {'acc': acc, 'acc_ge': acc_ge, 'acc_at': acc_at, 'acc_cg': acc_cg}
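+ # acc / acc_ge compare pred_cls / pred_ge_cls against the ground-truth cond_labels; acc_at / acc_cg compare pred_at_cls / pred_cg_cls against the sampled_cond_labels produced by the CARA model.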
418
+
419
+ # dump generated outputs to file.
420
+ json.dump(outputs, open(os.path.join(eval_output_dir, "outputs_{}.json".format(epoch) if epoch is not None else "outputs.json"), 'w'))
421
+
422
+ # compute BLEU
423
+ bos_token_id = model_vae_module.tokenizer_decoder.encode('<BOS>')[0]
424
+ eos_token_id = model_vae_module.tokenizer_decoder.encode('<EOS>')[0]
425
+ pad_token_id = model_vae_module.tokenizer_decoder.encode('<PAD>')[0]
426
+
427
+ generated_ids = []
428
+ generated_text = []
429
+ for g in outputs['generated']:
430
+ if g and g[0] in [eos_token_id, bos_token_id]:
431
+ g = g[1:]
432
+ if g and g[0] in [eos_token_id, bos_token_id]:
433
+ g = g[1:]
434
+ g = g[:g.index(eos_token_id)] if eos_token_id in g else g
435
+ g = g[:g.index(pad_token_id)] if pad_token_id in g else g
436
+ g_text = model_vae_module.tokenizer_decoder.decode(g, clean_up_tokenization_spaces=True)
437
+ generated_ids.append(g)
438
+ generated_text.append(g_text)
439
+
440
+ tgt_seq_ids = []
441
+ tgt_seq_text = []
442
+ for g in outputs['tgt_seq_ids']:
443
+ if g and g[0] in [eos_token_id, bos_token_id]:
444
+ g = g[1:]
445
+ if g and g[0] in [eos_token_id, bos_token_id]:
446
+ g = g[1:]
447
+ g = g[:g.index(eos_token_id)] if eos_token_id in g else g
448
+ g = g[:g.index(pad_token_id)] if pad_token_id in g else g
449
+ g_text = model_vae_module.tokenizer_decoder.decode(g, clean_up_tokenization_spaces=True)
450
+ tgt_seq_ids.append(g)
451
+ tgt_seq_text.append(g_text)
452
+
453
+ at_generated_ids = []
454
+ at_generated_text = []
455
+ for g in outputs['at_generated']:
456
+ if g and g[0] in [eos_token_id, bos_token_id]:
457
+ g = g[1:]
458
+ if g and g[0] in [eos_token_id, bos_token_id]:
459
+ g = g[1:]
460
+ g = g[:g.index(eos_token_id)] if eos_token_id in g else g
461
+ g = g[:g.index(pad_token_id)] if pad_token_id in g else g
462
+ g_text = model_vae_module.tokenizer_decoder.decode(g, clean_up_tokenization_spaces=True)
463
+ at_generated_ids.append(g)
464
+ at_generated_text.append(g_text)
465
+
466
+ cg_generated_ids = []
467
+ cg_generated_text = []
468
+ for g in outputs['cg_generated']:
469
+ if g and g[0] in [eos_token_id, bos_token_id]:
470
+ g = g[1:]
471
+ if g and g[0] in [eos_token_id, bos_token_id]:
472
+ g = g[1:]
473
+ g = g[:g.index(eos_token_id)] if eos_token_id in g else g
474
+ g = g[:g.index(pad_token_id)] if pad_token_id in g else g
475
+ g_text = model_vae_module.tokenizer_decoder.decode(g, clean_up_tokenization_spaces=True)
476
+ cg_generated_ids.append(g)
477
+ cg_generated_text.append(g_text)
478
+
479
+ f = open(os.path.join(eval_output_dir, "reconstruction{}.txt".format(('_'+str(epoch)) if epoch is not None else '')), 'w')
480
+ f.write('\n'.join([g + '\n' + t for g, t in zip(generated_text, tgt_seq_text)]))
481
+ fat = open(os.path.join(eval_output_dir, "attribute_transfer{}.txt".format(('_'+str(epoch)) if epoch is not None else '')), 'w')
482
+ fat.write('\n'.join([g + '\n' + t for g, t in zip(at_generated_text, tgt_seq_text)]))
483
+ fcg = open(os.path.join(eval_output_dir, "conditional_generation{}.txt".format(('_'+str(epoch)) if epoch is not None else '')), 'w')
484
+ fcg.write('\n'.join(cg_generated_text))
485
+
486
+ rec_bleu = nltk.translate.bleu_score.corpus_bleu(list_of_references=[[nltk.word_tokenize(t)] for t in tgt_seq_text],
487
+ hypotheses=[nltk.word_tokenize(g) for g in generated_text])
488
+
489
+ at_bleu = nltk.translate.bleu_score.corpus_bleu(list_of_references=[[nltk.word_tokenize(t)] for t in tgt_seq_text],
490
+ hypotheses=[nltk.word_tokenize(g) for g in at_generated_text])
491
+
492
+ cg_generated_text_subset = cg_generated_text[:500] # use a subset, otherwise it takes a long time to compute.
493
+ cg_bleu = nltk.translate.bleu_score.corpus_bleu(list_of_references=[[nltk.word_tokenize(t) for t in tgt_seq_text] for _ in range(len(cg_generated_text_subset))],
494
+ hypotheses=[nltk.word_tokenize(g) for g in cg_generated_text_subset])
495
+
496
+ cg_self_bleu = nltk.translate.bleu_score.corpus_bleu(list_of_references=[[nltk.word_tokenize(t) for t in cg_generated_text_subset[:i]+cg_generated_text_subset[i+1:]]
497
+ for i in range(len(cg_generated_text_subset))],
498
+ hypotheses=[nltk.word_tokenize(g) for g in cg_generated_text_subset])
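+ # cg_self_bleu scores each conditionally generated sentence against all the other generations, so a lower value indicates more diverse outputs.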
499
+
500
+ metrics['rec_bleu'] = rec_bleu
501
+ metrics['at_bleu'] = at_bleu
502
+ metrics['cg_bleu'] = cg_bleu
503
+ metrics['cg_self_bleu'] = cg_self_bleu
504
+
505
+ output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
506
+ writer = open(output_eval_file, "w")
507
+ logger.info("***** Eval results, global steps: {} *****".format(prefix))
508
+ for key, value in metrics.items():
509
+ logger.info(" %s = %s", key, str(value))
510
+ writer.write("%s = %s\n" % (key, str(value)))
511
+
512
+ return metrics
513
+
514
+ def main():
515
+ parser = argparse.ArgumentParser()
516
+
517
+ ## Required parameters
518
+ parser.add_argument("--output_dir", default='results_cara', type=str, help="The output directory where the model predictions and checkpoints will be written.")
519
+ parser.add_argument("--temperature", type=float, default=1.0)
520
+ parser.add_argument("--soft_temperature", type=float, default=0.5)
521
+ parser.add_argument("--top_k", type=int, default=5)
522
+ parser.add_argument("--top_p", type=float, default=0.0)
523
+ parser.add_argument("--num_train_epochs", default=10.0, type=float, help="Total number of training epochs to perform.")
524
+ parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
525
+ parser.add_argument("--lambda", default=0, type=float, help="")
526
+
527
+ ## Data parameters
528
+ parser.add_argument("--dataset", default='yelp', type=str, help="The dataset.")
529
+ # parser.add_argument("--train_data_file", default='../../../data/yelp/sentiment.train.tiny.text', type=str, help="The input training data file (a text file).")
530
+ parser.add_argument("--train_data_file", default='../../../data/yelp/sentiment.train.text', type=str, help="The input training data file (a text file).")
531
+ # parser.add_argument("--eval_data_file", default='../../../data/yelp/sentiment.dev.tiny.text', type=str, help="")
532
+ parser.add_argument("--eval_data_file", default='../../../data/yelp/sentiment.dev.small.text', type=str, help="2000 samples.")
533
+ parser.add_argument("--ExpName", default="local_lctrlg_yelp", type=str, help="The experiment name used in Azure Table.")
534
+ parser.add_argument("--create_new", default=0, type=int, help="")
535
+
536
+ # Training parameters
537
+ parser.add_argument("--checkpoint_dir", default='results_arae/checkpoint-47501/pytorch_model.bin', type=str, help='results/checkpoint-1212/pytorch_model.bin')
538
+ # parser.add_argument("--checkpoint", default='', type=str, help='results/checkpoint-1212/pytorch_model.bin')
539
+ parser.add_argument("--start_global_step", default=1001, type=int, help='')
540
+ parser.add_argument("--do_train", action='store_true',
541
+ help="Whether to run training.")
542
+ parser.add_argument("--do_eval", action='store_true',
543
+ help="Whether to run eval on the dev set.")
544
+ parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
545
+ parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.")
546
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1, help="Number of updates steps to accumulate before performing a backward/update pass.")
547
+ parser.add_argument("--evaluate_during_training", action='store_true', help="Run evaluation during training at each logging step.")
548
+ parser.add_argument('--gloabl_step_eval', type=int, default=0, help="Evaluate the results at the given global step")
549
+ # parser.add_argument('--logging_steps', type=int, default=2000, help="ARAE")
550
+ parser.add_argument('--logging_steps', type=int, default=10, help="CARA")
551
+ parser.add_argument('--eval_steps', type=int, default=500, help="CARA")
552
+ # parser.add_argument('--save_steps', type=int, default=5000, help="ARAE")
553
+ parser.add_argument('--save_steps', type=int, default=1000, help="CARA")
554
+ parser.add_argument("--eval_all_checkpoints", action='store_true', help="")
555
+
556
+ ## Encoder options
557
+ # parser.add_argument("--encoder_model_name_or_path", default="bert-base-uncased", type=str, )
558
+ parser.add_argument("--encoder_model_name_or_path", default="results_cara/checkpoint-encoder-1000", type=str)
559
+ # parser.add_argument("--encoder_model_name_or_path", default="results/checkpoint-encoder-55000", type=str")
560
+ parser.add_argument("--encoder_config_name", default="", type=str, help="Optional pretrained config name or path if not the same as model_name_or_path")
561
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str, help="Keep empty. Will default to decoder_model_name_or_path")
562
+ parser.add_argument("--encoder_model_type", default="bert", type=str, help="The encoder model architecture to be fine-tuned.")
563
+
564
+ ## Decoder options
565
+ # parser.add_argument("--decoder_model_name_or_path", default="gpt2", type=str)
566
+ parser.add_argument("--decoder_model_name_or_path", default="results_cara/checkpoint-decoder-1000", type=str)
567
+ # parser.add_argument("--decoder_model_name_or_path", default="results/checkpoint-decoder-55000", type=str)
568
+ parser.add_argument("--decoder_config_name", default="", type=str, help="Optional pretrained config name or path if not the same as model_name_or_path")
569
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str, help="Keep empty. Will default to decoder_model_name_or_path")
570
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str, help="The decoder model architecture to be fine-tuned.")
571
+
572
+ ## Variational auto-encoder
573
+ parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.")
574
+ parser.add_argument("--use_deterministic_connect", action='store_true', help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.")
575
+
576
+ ## Objective functions
577
+ parser.add_argument("--mlm", action='store_true', help="Train with masked-language modeling loss instead of language modeling.")
578
+ parser.add_argument("--mlm_probability", type=float, default=0.15, help="Ratio of tokens to mask for masked language modeling loss")
579
+ parser.add_argument("--cache_dir", default="", type=str, help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)")
580
+ parser.add_argument("--block_size", default=21, type=int, help="21 for Yelp and Yahoo on label-conditional text generation")
581
+ parser.add_argument("--do_lower_case", action='store_true', help="Set this flag if you are using an uncased model.")
582
+
583
+ # Training Schedules
584
+ parser.add_argument("--ratio_increase", default=0.25, type=float, help="Learning schedule, the percentage for the annealing stage.")
585
+ parser.add_argument("--ratio_zero", default=0.5, type=float, help="Learning schedule, the percentage for the pure auto-encoding stage.")
586
+ parser.add_argument("--fb_mode", default=1, type=int, help="free bit training mode.")
587
+ parser.add_argument("--dim_target_kl", default=3.0, type=float, help="dim_target_kl free bit training mode.")
588
+ parser.add_argument("--learning_rate", default=5e-6, type=float, help="The initial learning rate for Adam.")
589
+ parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.")
590
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
591
+ parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
592
+ parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
593
+ parser.add_argument("--use_philly", action='store_true', help="Use Philly for computing.")
594
+ parser.add_argument("--use_pretrained_model", action='store_true',
595
+ help="Use pre-trained auto-encoder models as the initialization")
596
+ parser.add_argument("--use_pretrained_vae", action='store_true',
597
+ help="Use use_pretrained_vae as initialization, where beta value is specified in the folder")
598
+
599
+ parser.add_argument("--beta", type=float, default=1.0, help="The weighting hyper-parameter of the KL term in VAE")
600
+ parser.add_argument("--beta_cls", type=float, default=1.0, help="The weighting hyper-parameter for the classifier on the generated sentences")
601
+
602
+ ## IO: Logging and Saving
603
+ parser.add_argument("--no_cuda", action='store_true', help="Avoid using CUDA when available")
604
+ parser.add_argument('--overwrite_output_dir', type=int, default=1, help="Overwrite the content of the output directory")
605
+ parser.add_argument('--overwrite_cache', action='store_true', help="Overwrite the cached training and evaluation sets")
606
+ parser.add_argument('--seed', type=int, default=42, help="random seed for initialization")
607
+
608
+ # Precision & Distributed Training
609
+ parser.add_argument('--fp16', action='store_true', help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
610
+ parser.add_argument('--fp16_opt_level', type=str, default='O1', help="")
611
+ parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
612
+ parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
613
+ parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
614
+
615
+ # New parameters
616
+ parser.add_argument('--label_size', type=int, default=2, help="This depends on which dataset is used.")
617
+ args = parser.parse_args()
618
+ if args.decoder_model_type in ["bert", "roberta"] and not args.mlm:
619
+ raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm flag (masked language modeling).")
620
+ if args.eval_data_file is None and args.do_eval:
621
+ raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file or remove the --do_eval argument.")
622
+ if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
623
+ raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
624
+ # Setup distant debugging if needed
625
+ if args.server_ip and args.server_port:
626
+ # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
627
+ import ptvsd
628
+ logger.info("Waiting for debugger attach")
629
+ ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
630
+ ptvsd.wait_for_attach()
631
+ # Setup CUDA, GPU & distributed training
632
+ if args.local_rank == -1 or args.no_cuda:
633
+ device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
634
+ args.n_gpu = torch.cuda.device_count()
635
+ else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
636
+ torch.cuda.set_device(args.local_rank)
637
+ device = torch.device("cuda", args.local_rank)
638
+ torch.distributed.init_process_group(backend='nccl')
639
+ args.n_gpu = 1
640
+ args.device = device
641
+ # pdb.set_trace()
642
+ # Setup logging
643
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', datefmt = '%m/%d/%Y %H:%M:%S',
644
+ level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
645
+ logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
646
+ args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
647
+
648
+ args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + \
649
+ '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero)
650
+ table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size)
651
+ set_seed(args)
652
+
653
+ # Load pretrained model and tokenizer
654
+ if args.local_rank not in [-1, 0]:
655
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training downloads the model & vocab
656
+
657
+
658
+
659
+
660
+ if args.use_pretrained_model:
661
+ args.encoder_model_type = args.encoder_model_type.lower()
662
+ args.decoder_model_type = args.decoder_model_type.lower()
663
+
664
+ global_step = args.gloabl_step_eval
665
+
666
+ if args.use_pretrained_vae:
667
+ output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}-1.0'.format(global_step))
668
+ output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}-1.0'.format(global_step))
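+ # The trailing "-1.0" in these folder names is the beta value baked into the pre-trained VAE checkpoint directories (see --use_pretrained_vae).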
669
+ else:
670
+ output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step))
671
+ output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step))
672
+
673
+ checkpoints = [ [output_encoder_dir, output_decoder_dir] ]
674
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
675
+
676
+ # Load a trained Encoder model and vocabulary
677
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
678
+ model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size)
679
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
680
+
681
+ model_encoder.to(args.device)
682
+ if args.block_size <= 0:
683
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
684
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
685
+
686
+ # Load a trained Decoder model and vocabulary
687
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
688
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size)
689
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
690
+ model_decoder.to(args.device)
691
+ if args.block_size <= 0:
692
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
693
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
694
+
695
+ else:
696
+ ## Encoder
697
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
698
+ encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path)
699
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
700
+ if args.block_size <= 0:
701
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
702
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
703
+ model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size)
704
+ # model_encoder = encoder_model_class(config=encoder_config, latent_size=args.latent_size)
705
+
706
+ ## Decoder
707
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
708
+ decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path)
709
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
710
+ if args.block_size <= 0:
711
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
712
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
713
+ setattr(decoder_config, "latent_size", args.latent_size)
714
+ model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size)
715
+ # model_decoder = decoder_model_class(config=decoder_config, latent_size=args.latent_size)
716
+
717
+ # Chunyuan: Add Padding token to GPT2
718
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
719
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
720
+ logger.info('We have added {} tokens to GPT2'.format(num_added_toks))
721
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
722
+ assert tokenizer_decoder.pad_token == '<PAD>'
723
+
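A quick sanity check can confirm that the added pad token resolves to a valid id within the resized vocabulary. This is an editorial sketch, not part of the original script, and uses only tokenizer-level calls from the pytorch_transformers API already imported above:

pad_id = tokenizer_decoder.convert_tokens_to_ids('<PAD>')
assert isinstance(pad_id, int) and 0 <= pad_id < len(tokenizer_decoder)
# resize_token_embeddings(len(tokenizer_decoder)) above guarantees the embedding matrix has a row for pad_id.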
724
+
725
+ # on_gpu = next(model_vae.parameters()).is_cuda
726
+ if args.local_rank == 0:
727
+ torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab
728
+ logger.info("Training/evaluation parameters %s", args)
729
+
730
+ if not os.path.exists(args.output_dir): os.makedirs(args.output_dir)
731
+ # Training
732
+
733
+ logff = open(os.path.join(args.output_dir, 'log_{}'.format(get_time_str())), 'a')
734
+
735
+ if args.do_train:
736
+ global_step = args.start_global_step
737
+ model_vae = CARA(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device)
738
+
739
+ # if args.checkpoint:
740
+ # logger.info("Loading checkpoint from {}".format(args.checkpoint))
741
+ # model_vae.load_state_dict(torch.load(args.checkpoint))
742
+
743
+ if args.local_rank not in [-1, 0]:
744
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache
745
+ if args.local_rank == 0:
746
+ torch.distributed.barrier()
747
+
748
+ train_dataset = load_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False)
749
+
750
+ # logger.info("Test evaluate before training.")
751
+ # evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=0, subset='test')
752
+
753
+ # Train
754
+ global_step, tr_loss = train(args, train_dataset, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, logff=logff)
755
+ logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
756
+
757
+ # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
758
+ if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
759
+ # Create output directory if needed
760
+ # Save model checkpoint
761
+ output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
762
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
763
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
764
+ if not os.path.exists(output_dir) and args.local_rank in [-1, 0]:
765
+ os.makedirs(output_dir)
766
+ if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]:
767
+ os.makedirs(output_encoder_dir)
768
+ if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]:
769
+ os.makedirs(output_decoder_dir)
770
+
771
+ logger.info("Saving encoder model checkpoint to %s", output_encoder_dir)
772
+ logger.info("Saving decoder model checkpoint to %s", output_decoder_dir)
773
+
774
+ model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training
775
+ model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training
776
+ model_to_save = model_vae.module if hasattr(model_vae, "module") else model_vae
777
+
778
+ # Good practice: save your training arguments together with the trained model
779
+ if args.use_philly:
780
+ save_solid = False
781
+ while not save_solid:
782
+ try:
783
+ torch.save(args, os.path.join(output_dir, 'training_args.bin'))
784
+ torch.save(model_to_save.state_dict(), os.path.join(output_dir, 'pytorch_model.bin'))
785
+ save_solid = True
786
+ except:
787
+ pass
788
+ else:
789
+ torch.save(args, os.path.join(output_dir, 'training_args.bin'))
790
+ torch.save(model_to_save.state_dict(), os.path.join(output_dir, 'pytorch_model.bin'))
791
+ args.checkpoint = os.path.join(output_dir, 'pytorch_model.bin')
792
+
793
+ if args.use_philly:
794
+ save_solid = False
795
+ while not save_solid:
796
+ try:
797
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
798
+ torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
799
+ save_solid = True
800
+ except:
801
+ pass
802
+ else:
803
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
804
+ torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
805
+
806
+ if args.use_philly:
807
+ save_solid = False
808
+ while not save_solid:
809
+ try:
810
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
811
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
812
+ save_solid = True
813
+ except:
814
+ pass
815
+ else:
816
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
817
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
818
+
819
+ # Load a trained model and vocabulary that you have fine-tuned
820
+ # model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size)
821
+ # model_encoder.to(args.device)
822
+ #
823
+ # # Load a trained model and vocabulary that you have fine-tuned
824
+ # model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size)
825
+ # model_decoder.to(args.device)
826
+
827
+ # Evaluation
828
+ results = {}
829
+ if args.do_eval and args.local_rank in [-1, 0]:
830
+ # if global_step == 0:
831
+ # global_step = args.gloabl_step_eval
832
+
833
+ # output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
834
+ # output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
835
+ # checkpoints = [ [output_encoder_dir, output_decoder_dir] ]
836
+
837
+ # logger.info("Evaluate the following checkpoints: %s", checkpoints)
838
+ # for checkpoint in checkpoints:
839
+
840
+ # global_step = args.checkpoint_dir.split('/')[-2].split('-')[-1] if args.checkpoint_dir else ""
841
+
842
+ # model_encoder = encoder_model_class.from_pretrained(checkpoint[0], latent_size=args.latent_size)
843
+ # model_encoder.to(args.device)
844
+ # model_decoder = decoder_model_class.from_pretrained(checkpoint[1], latent_size=args.latent_size)
845
+ # model_decoder.to(args.device)
846
+
847
+ model_vae = CARA(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device)
848
+
849
+ if args.gloabl_step_eval < 1:
850
+ args.gloabl_step_eval = global_step
851
+ args.checkpoint_dir = os.path.join(args.output_dir, 'checkpoint-{}/pytorch_model.bin'.format(args.gloabl_step_eval))
852
+ else:
853
+ global_step = args.gloabl_step_eval
854
+ args.checkpoint_dir = os.path.join(args.checkpoint_dir, 'checkpoint-{}/pytorch_model.bin'.format(args.gloabl_step_eval))
855
+
856
+
857
+ # if args.checkpoint_dir and os.path.exists(args.checkpoint_dir):
858
+ # logger.info("Loading checkpoint from {}".format(args.checkpoint_dir))
859
+ # model_vae.load_state_dict(torch.load(args.checkpoint_dir))
860
+ # else:
861
+ # raise ValueError("Cannot find checkpoint at: {}".format(args.checkpoint))
862
+
863
+ metrics = evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='test')
864
+ metrics = dict((k + '_{}'.format(global_step), v) for k, v in metrics.items())
865
+ results.update(metrics)
866
+
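The dictionary comprehension above only suffixes each metric name with the evaluation step, so a hypothetical key such as 'bleu' becomes 'bleu_2000' when global_step is 2000 (values illustrative); this keeps results from different checkpoints distinguishable when merged into results.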
867
+ # result = evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='train')
868
+ # result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
869
+ # results.update(result)
870
+
871
+ return results
872
+
873
+
874
+ if __name__ == "__main__":
875
+ main()
Optimus/code/examples/big_ae/run_lm_vae_pretraining.py ADDED
@@ -0,0 +1,669 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
18
+ GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
19
+ using a masked language modeling (MLM) loss.
20
+ """
21
+
22
+ from __future__ import absolute_import, division, print_function
23
+
24
+
25
+ import pdb
26
+ import argparse
27
+ import glob
28
+ import logging
29
+
30
+ import os
31
+ import pickle
32
+ import random
33
+ from pathlib import Path
34
+
35
+ import numpy as np
36
+ import torch
37
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset
38
+ from torch.utils.data.distributed import DistributedSampler
39
+ from tensorboardX import SummaryWriter
40
+ from tqdm import tqdm, trange
41
+ from collections import defaultdict
42
+
43
+ # from azure.cosmosdb.table.tableservice import TableService
44
+ # from azure.cosmosdb.table.models import Entity
45
+ from datetime import datetime
46
+
47
+
48
+
49
+ from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
50
+ BertConfig, BertForLatentConnector, BertTokenizer,
51
+ GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer,
52
+ OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
53
+ RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
54
+
55
+ from utils import (calc_iwnll, calc_mi, calc_au, BucketingDataLoader, BucketingMultipleFiles_DataLoader, frange_cycle_linear, frange_cycle_zero_linear)
56
+
57
+ from modules import VAE
58
+
59
+
60
+ # logging.getLogger("azure").setLevel(logging.WARNING)
61
+ # logging.getLogger("TableService").setLevel(logging.WARNING)
62
+
63
+ logger = logging.getLogger(__name__)
64
+
65
+
66
+ MODEL_CLASSES = {
67
+ 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer),
68
+ 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
69
+ 'bert': (BertConfig, BertForLatentConnector, BertTokenizer),
70
+ 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
71
+ }
72
+
73
+
74
+ storage_name="textae"
75
+ key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA=="
76
+ # ts = TableService(account_name=storage_name, account_key=key)
77
+
78
+
79
+
80
+ def build_dataload_and_cache_examples(args, tokenizer, evaluate=False):
81
+ if isinstance(tokenizer, list):
82
+ args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
83
+ file_path=args.train_data_file
84
+ dataloader = BucketingMultipleFiles_DataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True)
85
+ else:
86
+ pass
87
+ return dataloader
88
+
89
+
90
+
91
+
92
+ def set_seed(args):
93
+ random.seed(args.seed)
94
+ np.random.seed(args.seed)
95
+ torch.manual_seed(args.seed)
96
+ if args.n_gpu > 0:
97
+ torch.cuda.manual_seed_all(args.seed)
98
+
99
+
100
+ def mask_tokens(inputs, tokenizer, args):
101
+ """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
102
+ labels = inputs.clone()
103
+ # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability, defaulting to 0.15 as in BERT/RoBERTa)
104
+
105
+ masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8)
106
+ labels[masked_indices==1] = -1 # We only compute loss on masked tokens
107
+
108
+ # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
109
+ indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices
110
+ inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
111
+
112
+ # 10% of the time, we replace masked input tokens with random word
113
+ indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced
114
+ indices_random = indices_random
115
+ random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
116
+ inputs[indices_random] = random_words[indices_random]
117
+
118
+ # The rest of the time (10% of the time) we keep the masked input tokens unchanged
119
+ return inputs, labels
120
+
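The branch probabilities above compose the standard BERT recipe: each token is selected with probability args.mlm_probability (0.15 by default); of the selected tokens, 80% are replaced by the mask token, and the Bernoulli(0.5) draw splits the remaining 20% evenly, since 0.5 * (1 - 0.8) = 0.1, giving 10% random tokens and 10% left unchanged. A small self-contained check of those proportions, as an editorial sketch independent of any tokenizer:

import torch

torch.manual_seed(0)
n = 1_000_000
masked = torch.bernoulli(torch.full((n,), 0.15)).bool()
replaced = torch.bernoulli(torch.full((n,), 0.8)).bool() & masked
randomized = torch.bernoulli(torch.full((n,), 0.5)).bool() & masked & ~replaced
kept = masked & ~replaced & ~randomized
# Expected fractions of all tokens: ~0.12 replaced with the mask token, ~0.015 randomized, ~0.015 kept as-is.
print(replaced.float().mean().item(), randomized.float().mean().item(), kept.float().mean().item())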
121
+
122
+ def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name):
123
+ """ Train the model """
124
+ if args.local_rank in [-1, 0]:
125
+ tb_writer = SummaryWriter()
126
+
127
+ args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
128
+ # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
129
+ # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
130
+
131
+ if args.max_steps > 0:
132
+ t_total = args.max_steps
133
+ args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
134
+ else:
135
+ t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
136
+
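As a concrete illustration of the step accounting (numbers hypothetical): with 10,000 batches per epoch in the dataloader, gradient_accumulation_steps = 2, and num_train_epochs = 1, t_total = 10,000 // 2 * 1 = 5,000 optimizer updates; setting max_steps = 3,000 instead would cap training at 3,000 updates and recompute num_train_epochs from that cap.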
137
+ # Prepare optimizer and schedule (linear warmup and decay)
138
+
139
+
140
+ # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear
141
+ no_decay = ['bias', 'LayerNorm.weight']
142
+ optimizer_grouped_parameters = [
143
+ {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
144
+ {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
145
+ ]
146
+
147
+ optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
148
+ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
149
+
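The two parameter groups above follow the usual BERT-style setup: weight decay applies to all weights except biases and LayerNorm parameters. Assuming the standard linear warmup-then-decay behavior of WarmupLinearSchedule, the learning rate evolves roughly as sketched below (editorial comment, not part of the script):

# lr(step) = learning_rate * step / warmup_steps                          while step < warmup_steps
# lr(step) = learning_rate * (t_total - step) / (t_total - warmup_steps)  afterwards, reaching 0 at t_total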
150
+
151
+ if args.fp16:
152
+ try:
153
+ from apex import amp
154
+ except ImportError:
155
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
156
+ model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level)
157
+
158
+ # multi-gpu training (should be after apex fp16 initialization)
159
+ if args.n_gpu > 1:
160
+ model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device)
161
+
162
+ # Distributed training (should be after apex fp16 initialization)
163
+ if args.local_rank != -1:
164
+ model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank],
165
+ output_device=args.local_rank,
166
+ find_unused_parameters=True)
167
+
168
+
169
+
170
+
171
+ files = Path(args.train_data_file)
172
+ num_files = len(list(files.glob('*seq64*.json')))
173
+
174
+
175
+ # Train!
176
+ logger.info("***** Running training *****")
177
+ logger.info(" Num files = %d", num_files)
178
+ logger.info(" Num examples of first file = %d", train_dataloader.num_examples)
179
+ logger.info(" Num Epochs = %d", args.num_train_epochs)
180
+ logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
181
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
182
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
183
+ logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
184
+ logger.info(" Total optimization steps = %d", t_total)
185
+
186
+
187
+ global_step = 0
188
+ tr_loss, logging_loss = 0.0, 0.0
189
+
190
+ model_vae.zero_grad()
191
+ num_train_epochs_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
192
+
193
+ n_iter = int(args.num_train_epochs) * len(train_dataloader)
194
+ beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero)
195
+
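frange_cycle_zero_linear (imported from utils above) produces the per-iteration KL-weight schedule; note that the training loop below currently pins beta_t to 0.0, so the schedule is computed but not applied in this pretraining script. A minimal editorial sketch of an equivalent schedule, assuming the intended behavior is to hold beta at 0 for ratio_zero of each cycle, ramp linearly to stop over ratio_increase, and hold at stop for the remainder:

import numpy as np

def cyclical_beta_schedule(n_iter, stop=1.0, n_cycle=1, ratio_increase=0.25, ratio_zero=0.25):
    # Assumed behavior, not the library implementation.
    betas = np.full(n_iter, stop, dtype=np.float64)
    period = n_iter // n_cycle
    for c in range(n_cycle):
        start = c * period
        zero_end = start + int(period * ratio_zero)
        ramp_end = zero_end + int(period * ratio_increase)
        betas[start:zero_end] = 0.0  # pure auto-encoding stage
        if ramp_end > zero_end:      # annealing stage
            betas[zero_end:ramp_end] = np.linspace(0.0, stop, ramp_end - zero_end, endpoint=False)
    return betas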
196
+ tmp_list = []
197
+ dict_token_length = defaultdict(int)
198
+
199
+ set_seed(args) # Added here for reproducibility (even between python 2 and 3)
200
+ for epoch in num_train_epochs_iterator:
201
+ train_dataloader.reset()
202
+ for idx_file in range(num_files-1):
203
+ logger.info(f"Epoch {epoch}, File idx {train_dataloader.file_idx}")
204
+ epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
205
+ for step, batch in enumerate(epoch_iterator):
206
+
207
+ tokenized_text0, tokenized_text1, tokenized_text_lengths = batch
208
+
209
+ dict_token_length[ tokenized_text_lengths[0,0].item() ] += 1
210
+
211
+ # continue
212
+
213
+
214
+ # tokenized_text0 = tokenized_text0.to(args.device)
215
+ # tokenized_text1 = tokenized_text1.to(args.device)
216
+ # prepare input-output data for reconstruction
217
+
218
+
219
+
220
+ inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1)
221
+ labels = tokenized_text1
222
+
223
+ tokenized_text1 = tokenized_text1.to(args.device)
224
+ inputs = inputs.to(args.device)
225
+ labels = labels.to(args.device)
226
+
227
+ model_vae.train()
228
+
229
+ beta_t = 0.0 # beta_t_list[step + epoch*len(epoch_iterator)]
230
+ model_vae.module.args.beta = beta_t
231
+
232
+ if beta_t == 0.0:
233
+ model_vae.module.args.fb_mode = 0
234
+ else:
235
+ model_vae.module.args.fb_mode = 1
236
+
237
+ if args.use_deterministic_connect:
238
+ model_vae.module.args.fb_mode = 2
239
+
240
+ loss_rec, loss_kl, loss = model_vae(inputs, labels)
241
+
242
+ loss_rec = loss_rec.mean() # mean() to average on multi-gpu parallel training
243
+ loss_kl = loss_kl.mean()
244
+ loss = loss.mean()
245
+
246
+ if args.use_philly:
247
+ print("PROGRESS: {}%".format(round(100 * (step + epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4)))
248
+ print("EVALERR: {}%".format(loss_rec))
249
+
250
+ epoch_iterator.set_description(
251
+ (
252
+ f'iter: {step + epoch*len(epoch_iterator) }; file:{idx_file}; loss: {loss.item():.3f}; '
253
+ f'loss_rec: {loss_rec.item():.3f}; loss_kl: {loss_kl.item():.3f}; '
254
+ f'beta: {model_vae.module.args.beta:.3f}'
255
+ )
256
+ )
257
+
258
+ # if global_step % 5 == 0:
259
+ # row = {
260
+ # 'PartitionKey': 'MILU_Rule_Rule_Template',
261
+ # 'RowKey': str(datetime.now()),
262
+ # 'ExpName' : args.ExpName,
263
+ # 'iter': str( step + epoch*len(epoch_iterator) ),
264
+ # 'loss': str( loss.item()),
265
+ # 'loss_rec': str(loss_rec.item()),
266
+ # 'loss_kl': str(loss_kl.item()),
267
+ # 'beta': str(model_vae.args.beta)
268
+ # }
269
+ # # pdb.set_trace()
270
+ # ts.insert_entity(table_name, row)
271
+
272
+ # pdb.set_trace()
273
+
274
+ if args.gradient_accumulation_steps > 1:
275
+ loss = loss / args.gradient_accumulation_steps
276
+
277
+ if args.fp16:
278
+ with amp.scale_loss(loss, optimizer) as scaled_loss:
279
+ scaled_loss.backward()
280
+ else:
281
+ loss.backward()
282
+
283
+ tr_loss += loss.item()
284
+ if (step + 1) % args.gradient_accumulation_steps == 0:
285
+ if args.fp16:
286
+ torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
287
+ else:
288
+ torch.nn.utils.clip_grad_norm_(model_vae.parameters(), args.max_grad_norm)
289
+
290
+ optimizer.step()
291
+
292
+ scheduler.step() # Update learning rate schedule
293
+
294
+ model_vae.zero_grad()
295
+
296
+ global_step += 1
297
+
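Note that with gradient accumulation, global_step counts optimizer updates rather than micro-batches: for gradient_accumulation_steps = 4 (value illustrative), four forward/backward passes feed one clipped gradient step, one scheduler step, and one increment of global_step, so logging_steps and save_steps are both measured in updates.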
298
+
299
+ if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
300
+ # Log metrics
301
+ if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
302
+ results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer)
303
+ for key, value in results.items():
304
+ tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
305
+ tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
306
+ tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
307
+ logging_loss = tr_loss
308
+
309
+ if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
310
+
311
+ # Save encoder model checkpoint
312
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
313
+
314
+ if not os.path.exists(output_encoder_dir):
315
+ os.makedirs(output_encoder_dir)
316
+
317
+ model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training
318
+ if args.use_philly:
319
+ save_solid = False
320
+ while not save_solid:
321
+ try:
322
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
323
+ torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin'))
324
+ logger.info("Saving model checkpoint to %s", output_encoder_dir)
325
+ save_solid = True
326
+ except:
327
+ pass
328
+ else:
329
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
330
+ torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin'))
331
+ logger.info("Saving model checkpoint to %s", output_encoder_dir)
332
+
333
+ # Save decoder model checkpoint
334
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
335
+
336
+ if not os.path.exists(output_decoder_dir):
337
+ os.makedirs(output_decoder_dir)
338
+
339
+ model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training
340
+ if args.use_philly:
341
+ save_solid = False
342
+ while not save_solid:
343
+ try:
344
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
345
+ torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin'))
346
+ logger.info("Saving model checkpoint to %s", output_decoder_dir)
347
+ save_solid = True
348
+ except:
349
+ pass
350
+ else:
351
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
352
+ torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin'))
353
+ logger.info("Saving model checkpoint to %s", output_decoder_dir)
354
+
355
+
356
+ if args.max_steps > 0 and global_step > args.max_steps:
357
+ epoch_iterator.close()
358
+ break
359
+
360
+ if args.max_steps > 0 and global_step > args.max_steps:
361
+ num_train_epochs_iterator.close()  # the epoch-level trange defined above; 'train_iterator' is not defined in this script
362
+ break
363
+
364
+
365
+ # print(dict_token_length)
366
+ # with open('wikipedia_stats.json', 'w') as fp:
367
+ # json.dump(dict_token_length, fp)
368
+
369
+ if args.local_rank in [-1, 0]:
370
+ tb_writer.close()
371
+
372
+ return global_step, tr_loss / global_step
373
+
374
+
375
+ def main():
376
+ parser = argparse.ArgumentParser()
377
+
378
+ ## Required parameters
379
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
380
+ help="The input training data file (a text file).")
381
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
382
+ help="The output directory where the model predictions and checkpoints will be written.")
383
+ parser.add_argument("--dataset", default=None, type=str, help="The dataset.")
384
+
385
+ ## Other parameters
386
+ parser.add_argument("--eval_data_file", default=None, type=str,
387
+ help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
388
+ parser.add_argument("--ExpName", default="", type=str,
389
+ help="The experiment name used in Azure Table.")
390
+
391
+ ## Encoder options
392
+ parser.add_argument("--encoder_model_type", default="bert", type=str,
393
+ help="The encoder model architecture to be fine-tuned.")
394
+ parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str,
395
+ help="The encoder model checkpoint for weights initialization.")
396
+ parser.add_argument("--encoder_config_name", default="", type=str,
397
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
398
+ parser.add_argument("--encoder_tokenizer_name", default="", type=str,
399
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
400
+
401
+ ## Decoder options
402
+ parser.add_argument("--decoder_model_type", default="gpt2", type=str,
403
+ help="The decoder model architecture to be fine-tuned.")
404
+ parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str,
405
+ help="The decoder model checkpoint for weights initialization.")
406
+ parser.add_argument("--decoder_config_name", default="", type=str,
407
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
408
+ parser.add_argument("--decoder_tokenizer_name", default="", type=str,
409
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
410
+
411
+ ## Variational auto-encoder
412
+ parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.")
413
+ parser.add_argument("--use_deterministic_connect", action='store_true',
414
+ help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.")
415
+
416
+ ## Objective functions
417
+ parser.add_argument("--mlm", action='store_true',
418
+ help="Train with masked-language modeling loss instead of language modeling.")
419
+ parser.add_argument("--mlm_probability", type=float, default=0.15,
420
+ help="Ratio of tokens to mask for masked language modeling loss")
421
+ parser.add_argument("--beta", type=float, default=1.0,
422
+ help="The weighting hyper-parameter of the KL term in VAE")
423
+
424
+
425
+ parser.add_argument("--cache_dir", default="", type=str,
426
+ help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)")
427
+ parser.add_argument("--max_seq_length", default=512, type=int,
428
+ help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length")
429
+ parser.add_argument("--block_size", default=-1, type=int,
430
+ help="Optional input sequence length after tokenization."
431
+ "The training dataset will be truncated in block of this size for training."
432
+ "Default to the model max input length for single sentence inputs (take into account special tokens).")
433
+ parser.add_argument("--do_train", action='store_true',
434
+ help="Whether to run training.")
435
+ parser.add_argument("--do_eval", action='store_true',
436
+ help="Whether to run eval on the dev set.")
437
+ parser.add_argument("--evaluate_during_training", action='store_true',
438
+ help="Run evaluation during training at each logging step.")
439
+ parser.add_argument("--do_lower_case", action='store_true',
440
+ help="Set this flag if you are using an uncased model.")
441
+
442
+
443
+ # Training Schedules
444
+ parser.add_argument("--ratio_increase", default=0.25, type=float,
445
+ help="Learning schedule, the percentage for the annealing stage.")
446
+ parser.add_argument("--ratio_zero", default=0.25, type=float,
447
+ help="Learning schedule, the percentage for the pure auto-encoding stage.")
448
+ parser.add_argument("--fb_mode", default=0, type=int,
449
+ help="free bit training mode.")
450
+ parser.add_argument("--dim_target_kl", default=3.0, type=float,
451
+ help="dim_target_kl free bit training mode.")
452
+ parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
453
+ help="Batch size per GPU/CPU for training.")
454
+ parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int,
455
+ help="Batch size per GPU/CPU for evaluation.")
456
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
457
+ help="Number of updates steps to accumulate before performing a backward/update pass.")
458
+ parser.add_argument("--learning_rate", default=5e-5, type=float,
459
+ help="The initial learning rate for Adam.")
460
+ parser.add_argument("--weight_decay", default=0.0, type=float,
461
+ help="Weight deay if we apply some.")
462
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float,
463
+ help="Epsilon for Adam optimizer.")
464
+ parser.add_argument("--max_grad_norm", default=1.0, type=float,
465
+ help="Max gradient norm.")
466
+ parser.add_argument("--num_train_epochs", default=1.0, type=float,
467
+ help="Total number of training epochs to perform.")
468
+ parser.add_argument("--max_steps", default=-1, type=int,
469
+ help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
470
+ parser.add_argument("--warmup_steps", default=0, type=int,
471
+ help="Linear warmup over warmup_steps.")
472
+ parser.add_argument("--use_philly", action='store_true',
473
+ help="Use Philly for computing.")
474
+
475
+ ## IO: Logging and Saving
476
+ parser.add_argument('--logging_steps', type=int, default=50,
477
+ help="Log every X updates steps.")
478
+ parser.add_argument('--save_steps', type=int, default=50,
479
+ help="Save checkpoint every X updates steps.")
480
+ parser.add_argument("--eval_all_checkpoints", action='store_true',
481
+ help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
482
+ parser.add_argument("--no_cuda", action='store_true',
483
+ help="Avoid using CUDA when available")
484
+ parser.add_argument('--overwrite_output_dir', action='store_true',
485
+ help="Overwrite the content of the output directory")
486
+ parser.add_argument('--overwrite_cache', action='store_true',
487
+ help="Overwrite the cached training and evaluation sets")
488
+ parser.add_argument('--seed', type=int, default=42,
489
+ help="random seed for initialization")
490
+ parser.add_argument('--gloabl_step_eval', type=int, default=661,
491
+ help="Evaluate the results at the given global step")
492
+
493
+ # Precision & Distributed Training
494
+ parser.add_argument('--fp16', action='store_true',
495
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
496
+ parser.add_argument('--fp16_opt_level', type=str, default='O1',
497
+ help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
498
+ "See details at https://nvidia.github.io/apex/amp.html")
499
+ parser.add_argument("--local_rank", type=int, default=-1,
500
+ help="For distributed training: local_rank")
501
+ parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
502
+ parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
503
+ args = parser.parse_args()
504
+
505
+ if args.decoder_model_type in ["bert", "roberta"] and not args.mlm:
506
+ raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
507
+ "flag (masked language modeling).")
508
+ if args.eval_data_file is None and args.do_eval:
509
+ raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
510
+ "or remove the --do_eval argument.")
511
+
512
+ if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
513
+ raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
514
+
515
+ # Setup distant debugging if needed
516
+ if args.server_ip and args.server_port:
517
+ # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
518
+ import ptvsd
519
+ print("Waiting for debugger attach")
520
+ ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
521
+ ptvsd.wait_for_attach()
522
+
523
+ # Setup CUDA, GPU & distributed training
524
+ if args.local_rank == -1 or args.no_cuda:
525
+ device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
526
+ args.n_gpu = torch.cuda.device_count()
527
+ else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
528
+ torch.cuda.set_device(args.local_rank)
529
+ device = torch.device("cuda", args.local_rank)
530
+ torch.distributed.init_process_group(backend='nccl')
531
+ args.n_gpu = 1
532
+ args.device = device
533
+
534
+ # Setup logging
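Under torch.distributed.launch each process receives its own local_rank, pins one GPU, and sets args.n_gpu = 1, so multi-GPU parallelism in that mode comes from DistributedDataParallel inside train(); only the local_rank == -1 path uses every visible GPU in a single process via DataParallel.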
535
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
536
+ datefmt = '%m/%d/%Y %H:%M:%S',
537
+ level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
538
+ logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
539
+ args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
540
+
541
+ args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero)
542
+ table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size)
543
+ try:
544
+ ts.create_table(table_name)
545
+ except:
546
+ pass
547
+
548
+
549
+ # Set seed
550
+ set_seed(args)
551
+
552
+ # Load pretrained model and tokenizer
553
+ if args.local_rank not in [-1, 0]:
554
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab
555
+
556
+ ## Encoder
557
+ encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
558
+ encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path)
559
+ tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
560
+ if args.block_size <= 0:
561
+ args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model
562
+ args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
563
+ model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size)
564
+ # model_encoder.to(args.device)
565
+
566
+ ## Decoder
567
+ decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
568
+ decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path)
569
+ tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
570
+ if args.block_size <= 0:
571
+ args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model
572
+ args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
573
+ model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size)
574
+
575
+ # Chunyuan: Add Padding token to GPT2
576
+ special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
577
+ num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
578
+ print('We have added', num_added_toks, 'tokens to GPT2')
579
+ model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
580
+ assert tokenizer_decoder.pad_token == '<PAD>'
581
+
582
+ # model_decoder.to(args.device)
583
+
584
+ model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) #
585
+
586
+ # on_gpu = next(model_vae.parameters()).is_cuda
587
+
588
+
589
+
590
+ if args.local_rank == 0:
591
+ torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab
592
+
593
+ logger.info("Training/evaluation parameters %s", args)
594
+
595
+ global_step= 0
596
+ # Training
597
+ if args.do_train:
598
+ if args.local_rank not in [-1, 0]:
599
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache
600
+
601
+ train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False)
602
+
603
+ if args.local_rank == 0:
604
+ torch.distributed.barrier()
605
+
606
+ global_step, tr_loss = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name)
607
+ logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
608
+
609
+
610
+ # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
611
+ if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
612
+ # Create output directory if needed
613
+ # Save model checkpoint
614
+ output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
615
+ output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
616
+ if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]:
617
+ os.makedirs(output_encoder_dir)
618
+ if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]:
619
+ os.makedirs(output_decoder_dir)
620
+
621
+ logger.info("Saving encoder model checkpoint to %s", output_encoder_dir)
622
+ logger.info("Saving decoder model checkpoint to %s", output_decoder_dir)
623
+ # Save a trained model, configuration and tokenizer using `save_pretrained()`.
624
+ # They can then be reloaded using `from_pretrained()`
625
+
626
+ model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training
627
+ model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training
628
+
629
+ # Good practice: save your training arguments together with the trained model
630
+ if args.use_philly:
631
+ save_solid = False
632
+ while not save_solid:
633
+ try:
634
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
635
+ torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
636
+ save_solid = True
637
+ except:
638
+ pass
639
+ else:
640
+ model_encoder_to_save.save_pretrained(output_encoder_dir)
641
+ torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
642
+
643
+
644
+ if args.use_philly:
645
+ save_solid = False
646
+ while not save_solid:
647
+ try:
648
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
649
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
650
+ save_solid = True
651
+ except:
652
+ pass
653
+ else:
654
+ model_decoder_to_save.save_pretrained(output_decoder_dir)
655
+ torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
656
+
657
+
658
+ # Load a trained model and vocabulary that you have fine-tuned
659
+ model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size)
660
+ model_encoder.to(args.device)
661
+
662
+ # Load a trained model and vocabulary that you have fine-tuned
663
+ model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size)
664
+ model_decoder.to(args.device)
665
+
666
+
667
+
668
+ if __name__ == "__main__":
669
+ main()