## Instructions to run on Google Cloud TPUs

Before starting these steps, prepare the dataset (normalization -> bpe -> .. -> binarization) by following the steps in the indicTrans workflow. Alternatively, run those preparation steps on a CPU instance before launching the TPU instance, which saves time and cost.

### Creating TPU instance

- Create a CPU instance on GCP with the `torch-xla` image (replace `<instance-name>` and `<zone>` with your own values):

```bash
gcloud compute --project=${PROJECT_ID} instances create <instance-name> \
  --zone=<zone> \
  --machine-type=n1-standard-16 \
  --image-family=torch-xla \
  --image-project=ml-images \
  --boot-disk-size=200GB \
  --scopes=https://www.googleapis.com/auth/cloud-platform
```

- Once the instance is created, launch a Cloud TPU from your CPU VM instance using the following command (replace `<tpu-name>` and `<zone>`, and change `--accelerator-type` according to your needs):

```bash
gcloud compute tpus create <tpu-name> \
  --zone=<zone> \
  --network=default \
  --version=pytorch-1.7 \
  --accelerator-type=v3-8
```

(or) Create a new TPU using the GUI at https://console.cloud.google.com/compute/tpus and make sure to select `version` as `pytorch 1.7`.

- Once the TPU is launched, identify its IP address:

```bash
# Run this inside the CPU instance (use the zone your TPU was created in) and
# note down the IP address listed under the NETWORK_ENDPOINTS column.
gcloud compute tpus list --zone=us-central1-a
```

(or) Go to https://console.cloud.google.com/compute/tpus and note down the IP address of the created TPU from the `Internal IP` column.

### Installing Fairseq and getting data on the CPU instance

- Activate the `torch-xla-1.7` conda environment and install the libraries needed for IndicTrans (**excluding FairSeq**):

```bash
conda activate torch-xla-1.7
pip install sacremoses pandas mock sacrebleu tensorboardX pyarrow
```

- Configure the environment variables for the TPU:

```bash
# Replace ip-address with the TPU's internal IP address noted earlier.
export TPU_IP_ADDRESS=ip-address
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
```

- Download the prepared binarized data for FairSeq.

- Clone the latest version of Fairseq (this supports TPU) and install it from source.
  There is an [issue](https://github.com/pytorch/fairseq/issues/3259) with the latest commit, so we install from a different commit. (This may have been fixed in the latest master, but we have not tested it.)

```bash
git clone https://github.com/pytorch/fairseq.git
cd fairseq
git checkout da9eaba12d82b9bfc1442f0e2c6fc1b895f4d35d
pip install --editable ./
```

- Start TPU training:

```bash
# this is for using all tpu cores
export MKL_SERVICE_FORCE_INTEL=1

fairseq-train {expdir}/exp2_m2o_baseline/final_bin \
  --max-source-positions=200 \
  --max-target-positions=200 \
  --max-update=1000000 \
  --save-interval=5 \
  --arch=transformer \
  --attention-dropout=0.1 \
  --criterion=label_smoothed_cross_entropy \
  --source-lang=SRC \
  --lr-scheduler=inverse_sqrt \
  --skip-invalid-size-inputs-valid-test \
  --target-lang=TGT \
  --label-smoothing=0.1 \
  --update-freq=1 \
  --optimizer adam \
  --adam-betas '(0.9, 0.98)' \
  --warmup-init-lr 1e-07 \
  --lr 0.0005 \
  --warmup-updates 4000 \
  --dropout 0.2 \
  --weight-decay 0.0 \
  --tpu \
  --distributed-world-size 8 \
  --max-tokens 8192 \
  --num-batch-buckets 8 \
  --tensorboard-logdir {expdir}/exp2_m2o_baseline/tensorboard \
  --save-dir {expdir}/exp2_m2o_baseline/model \
  --keep-last-epochs 5 \
  --patience 5
```

**Note** While training, we noticed that training was slower on TPUs than on multiple GPUs. We have documented some of the problems and [filed an issue](https://github.com/pytorch/fairseq/issues/3317) at the fairseq repo for advice. We'll update this section as we learn more about efficient training on TPUs. Feel free to open an issue or pull request if you find a bug or know a more efficient way to train on TPUs.
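As a small aid for the "Configure environment variables for TPU" step, the sketch below wraps the `XRT_TPU_CONFIG` string in a shell function that rejects an unreplaced placeholder such as `ip-address`. The function name `make_xrt_tpu_config` and the address `10.0.0.2` are hypothetical and used only for illustration; they are not part of fairseq, torch-xla, or IndicTrans. The `tpu_worker;0;<ip>:8470` format is taken from the export command in that step.

```shell
# Hypothetical helper (not part of fairseq or IndicTrans): builds the
# XRT_TPU_CONFIG value for a single TPU worker from its internal IP.
make_xrt_tpu_config() {
  ip="$1"
  port="${2:-8470}"   # 8470 is the port used in the export step of this guide
  # Reject anything that does not look like a dotted-quad IPv4 address,
  # e.g. a placeholder that was never replaced.
  if ! printf '%s' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "error: '$ip' does not look like an IPv4 address" >&2
    return 1
  fi
  printf 'tpu_worker;0;%s:%s\n' "$ip" "$port"
}

# Example usage with a placeholder address; replace it with your TPU's internal IP.
export TPU_IP_ADDRESS="10.0.0.2"
XRT_TPU_CONFIG="$(make_xrt_tpu_config "$TPU_IP_ADDRESS")"
export XRT_TPU_CONFIG
echo "$XRT_TPU_CONFIG"
```

Running the assignment and `export` as separate statements keeps the function's exit status visible, so a bad address can be caught before `fairseq-train` is started.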