Welcome to the Colab notebook for [GPTNeo](https://github.com/EleutherAI/GPTNeo) - a fully open-source implementation of GPT-like models for mesh-tensorflow by [EleutherAI](https://www.eleuther.ai).

Our library provides training and inference for GPT models up to GPT3 sizes on both TPUs and GPUs. 

In this notebook we walk you through TPU training (or finetuning!) and sampling using the freely available colab TPUs.

If you find our repo useful, come join [our discord](https://discord.gg/BK2v3EJ) and say hi! 😬

Before we get going - make sure you are running this notebook with a TPU available. Go to Runtime -> Change Runtime Type and select 'TPU' under hardware accelerator.
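
One quick way to confirm the runtime change took effect is to look for the TPU address in the environment. The sketch below assumes the Colab TPU runtime exposes it through the `COLAB_TPU_ADDR` environment variable (newer runtimes may differ); `tpu_grpc_address` is a hypothetical helper, not part of GPTNeo.

```python
import os

def tpu_grpc_address(env=None):
    """Return the TPU's gRPC address, or None when no TPU is attached.

    Assumption: the Colab TPU runtime sets COLAB_TPU_ADDR to "host:port".
    """
    env = os.environ if env is None else env
    addr = env.get("COLAB_TPU_ADDR")
    return f"grpc://{addr}" if addr else None

address = tpu_grpc_address()
print(address or "No TPU found - check Runtime -> Change Runtime Type")
```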




In [None]:
#@title Setup
%tensorflow_version 2.x
!git clone https://github.com/EleutherAI/GPTNeo
%cd GPTNeo
!pip3 install -q -r requirements.txt
pretrained_model = None
dataset = None


## Set Up Google Cloud

To train on TPUs we need to store our data on a google cloud bucket - as TPUs can't read from local filesystems.

You can set up a bucket by signing up for a free trial here: https://console.cloud.google.com/

Make a bucket at https://console.cloud.google.com/storage and come back when that's done.

Make sure to select 'Uniform' access control when setting up the bucket, or the colab notebook won't have the required permissions to read from it.

The next cell sets up google authentication and gives the notebook read and write access to your bucket.


In [None]:
from google.colab import auth
auth.authenticate_user()
!gcloud init

In [3]:
path_to_cloud_bucket = 'gs://your-cloud-bucket/' #@param {type:"string"}
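
A malformed bucket path only shows up later as a failed `gsutil` call, so it can help to validate it right away. This is a minimal sketch; `normalize_bucket_path` is a hypothetical helper, and it only checks the `gs://` prefix and the trailing slash that later cells rely on.

```python
def normalize_bucket_path(path):
    """Validate a gs:// URL and make sure it ends with a slash."""
    path = path.strip()
    if not path.startswith("gs://") or len(path) <= len("gs://"):
        raise ValueError(f"expected a path like gs://bucket-name/, got {path!r}")
    return path if path.endswith("/") else path + "/"

print(normalize_bucket_path("gs://your-cloud-bucket"))
```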

## Set Up Dataset

We first need to download and tokenize a dataset. If you just want to sample from a pretrained model, you can skip this step and move on to the `Pretrained Model` section.

You can choose from:

* Sampling Only - choose this option if you only wish to sample from our trained models, then move on to the `Pretrained Model` section.

* OpenWebText - an open-source clone of OpenAI's WebText dataset, the original training data of GPT2.

* YoutubeSubtitles - a dataset of subtitles scraped from YouTube videos.

* HackerNews - comments scraped from Hacker News.

* NIHExporter - data relating to various projects from the National Institutes of Health.

* Custom - if this option is chosen you will be prompted to enter the path to your own dataset. It should be a directory containing .txt or .jsonl files.

All these datasets are from EleutherAI's side project - [The Pile™](https://github.com/EleutherAI/The-Pile) - an effort to gather a general purpose, diverse and open source plain text dataset large enough to train 1T+ parameter language models.

Even the smallest datasets are fairly large files, so this step will likely take a while. Select a dataset in the next cell, then run the next two cells, and go grab a snack and a cup of tea 😊

Alternatively, you can provide your own dataset in the form of a folder or gzip archive of .txt files. Simply select 'Custom' below, then enter the path to your data and the name of your dataset when prompted.
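
Before pointing the notebook at a custom folder, it may be worth checking that it actually contains files the tokenizer can pick up. The sketch below assumes only `.txt` and `.jsonl` files count, as described for the Custom option; `find_dataset_files` is a hypothetical helper and the demo runs on a throwaway directory.

```python
import tempfile
from pathlib import Path

# Assumption: these are the only extensions a custom dataset folder should contain.
SUPPORTED_SUFFIXES = (".txt", ".jsonl")

def find_dataset_files(folder):
    """Return the sorted names of .txt / .jsonl files directly inside folder."""
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.suffix in SUPPORTED_SUFFIXES
    )

# Demo on a temporary directory with two usable files and one that is ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.jsonl", "notes.md"):
        (Path(tmp) / name).write_text("example\n")
    found = find_dataset_files(tmp)

print(found)
```

An empty result is a sign that the path is wrong or the files need converting first.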

In [4]:
# Select a Dataset:
import os
dataset = 'Sampling_Only' #@param ["Sampling_Only", "OpenWebText", "YoutubeSubtitles", "HackerNews", "NIHExporter", "Custom"]

if dataset == "Sampling_Only":
    pass
elif dataset == 'OpenWebText':
    !wget https://the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar -O openwebtext.tar.xz
    !tar xf openwebtext.tar.xz
    dataset_path = "openwebtext"
    dataset_name = dataset_path
    out_name = dataset_name + "_tokenized"
elif dataset == 'YoutubeSubtitles':
    os.makedirs('data', exist_ok=True)
    !wget https://the-eye.eu/public/AI/pile_preliminary_components/yt_subs.jsonl.zst -O data/yt_subs.jsonl.zst
    dataset_path = 'data'
    dataset_name = 'ytsubs'
    out_name = dataset_name + "_tokenized"
elif dataset == 'HackerNews':
    os.makedirs('data', exist_ok=True)
    !wget https://the-eye.eu/public/AI/pile_preliminary_components/hn.tar.gz -O data/hn.tar.gz
    dataset_path = 'data'
    dataset_name = 'hackernews'
    out_name = dataset_name + "_tokenized"
elif dataset == "NIHExporter":
    os.makedirs('data', exist_ok=True)
    # wget writes straight into ./data, so no separate move step is needed
    !wget https://the-eye.eu/public/AI/pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst -O data/NIH_ExPORTER_awarded_grant_text.jsonl.zst
    dataset_path = 'data'
    dataset_name = 'nihexporter'
    out_name = dataset_name + "_tokenized"
elif dataset == "Custom":
    dataset_path = input('Enter the path to the folder containing your data: ')
    dataset_name = input('Enter the name of your dataset: ')
    out_name = dataset_name + "_tokenized"
else:
    raise NotImplementedError('please select from available options: ["Sampling_Only", "OpenWebText", "YoutubeSubtitles", "HackerNews", "NIHExporter", "Custom"]')


### Tokenize and Upload Data

Now tokenize the dataset and copy it over to your google cloud bucket. You may skip this step if you are sampling from a pre-trained model.

In [None]:
# Tokenize Data
!python data/create_tfrecords.py --input_dir /content/GPTNeo/$dataset_path --name $dataset_name --files_per 1000 --output_dir $out_name --write_dataset_config --processes 1

# copy the data to your bucket
if not path_to_cloud_bucket.endswith('/'):
    path_to_cloud_bucket += '/'
copy_loc = path_to_cloud_bucket + "datasets/" + dataset
!gsutil -m cp -r /content/GPTNeo/$out_name $copy_loc
!gsutil ls $path_to_cloud_bucket
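
As a rough picture of what tokenizing a corpus into fixed-length training records involves, the sketch below joins already-tokenized documents with an end-of-text id and cuts the stream into equal chunks. It is an illustration only and not the code in `data/create_tfrecords.py`, whose chunk length and handling of the leftover tokens may differ; `pack_documents` is a hypothetical name, and 50256 is the `eos_id` used in this notebook's dataset configs.

```python
EOS_ID = 50256  # end-of-text token id, matching "eos_id" in the dataset config

def pack_documents(token_docs, chunk_size, eos_id=EOS_ID):
    """Concatenate tokenized documents, separated by eos_id, and split the
    stream into full chunks of chunk_size tokens (the remainder is dropped)."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)
    n_full = len(stream) // chunk_size
    return [stream[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]

# Three tiny "documents" packed into chunks of 4 tokens.
docs = [[1, 2, 3], [4, 5], [6]]
chunks = pack_documents(docs, chunk_size=4)
print(chunks)
```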

Before starting training - you'll need to edit your dataset & model configs to point to your buckets / data. You need to do this even if you are sampling from a pre-trained model.

* First change the writefile path to point to your chosen dataset - e.g. `%%writefile configs/dataset_configs/ytsubs.json`
* Change the "path" field to point to your cloud bucket location - e.g. `gs://neo_lmdatasets/datasets/ytsubs_*.tfrecords`
* Change `dataset_name` in `%%writefile configs/dataset_configs/dataset_name.json` to the name of your chosen dataset.
* Once you've made the edits, then run the cell below to overwrite the existing files.
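
If you would rather generate the dataset config than edit it by hand, something along these lines should work. The field values mirror the dataset config cell in this notebook; `make_dataset_config` is a hypothetical helper, and the glob in the example is a placeholder you would replace with the real location of your tfrecords.

```python
import json

def make_dataset_config(tfrecords_glob):
    """Build a dataset config dict with the same fields as the notebook's cell."""
    return {
        "path": tfrecords_glob,
        "eval_path": "",
        "n_vocab": 50256,
        "tokenizer_is_pretrained": True,
        "tokenizer_path": "gpt2",
        "eos_id": 50256,
        "padding_id": 50257,
    }

# Placeholder glob: substitute your own bucket and dataset name.
config = make_dataset_config("gs://your-cloud-bucket/datasets/ytsubs/ytsubs*.tfrecords")
print(json.dumps(config, indent=2))
```

Writing the result to `configs/dataset_configs/<dataset_name>.json` with `json.dump` has the same effect as the `%%writefile` cell.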




In [None]:
%%writefile configs/dataset_configs/Sampling_Only.json

{
 "path": "gs://eleutherai/datasets/Sampling_Only/Sampling_Only*.tfrecords",
 "eval_path": "",
 "n_vocab": 50256,
 "tokenizer_is_pretrained": true,
 "tokenizer_path": "gpt2",
 "eos_id": 50256,
 "padding_id": 50257
}


## Set Model Configs

The model below is identical to our pretrained GPT3XL model (1.3B Params). 

If you want to use a smaller model, you can modify any of the config files in `./configs/` ending in `_8.json`, all of which are designed to train on 8-core TPUs (v2-8 / v3-8).

For a more detailed breakdown on what each item in the configuration file means - please read through our training and config guides in our [github README](https://github.com/EleutherAI/GPTNeo#training-guide). 

You'll want to change the first item in the `datasets` list to the name of your chosen dataset. (the filename minus .json in ./configs/dataset_configs)

You'll also want to modify the `model_path` field to point to your google cloud bucket, so checkpoints get saved to there.
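
The two edits described above can also be applied programmatically. This is a minimal sketch on a cut-down config dict; `point_config_at` is a hypothetical helper and the bucket name is a placeholder.

```python
import copy

def point_config_at(config, dataset_name, model_path):
    """Return a copy of config with the first dataset and model_path replaced."""
    updated = copy.deepcopy(config)
    updated["datasets"][0][0] = dataset_name
    updated["model_path"] = model_path
    return updated

# Only the two relevant fields are shown here; a real config has many more.
base = {"datasets": [["pile", None, None, None]], "model_path": "gs://eleutherai/GPT3_XL"}
mine = point_config_at(base, "ytsubs", "gs://your-cloud-bucket/GPT3_XL")
print(mine)
```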

In [None]:
%%writefile configs/GPT3_XL.json

{
 "n_head": 16,
 "n_vocab": 50257,
 "embed_dropout": 0,
 "lr": 0.0002,
 "lr_decay": "cosine",
 "warmup_steps": 3000,
 "beta1": 0.9,
 "beta2": 0.95,
 "epsilon": 1e-8,
 "opt_name": "adam",
 "weight_decay": 0,
 "train_batch_size": 256,
 "attn_dropout": 0,
 "train_steps": 600000,
 "eval_steps": 0,
 "predict_steps": 1,
 "res_dropout": 0,
 "eval_batch_size": 4,
 "predict_batch_size": 1,
 "iterations": 100,
 "n_embd": 2048,
 "datasets": [["pile", null, null, null]],
 "model": "GPT",
 "model_path": "gs://eleutherai/GPT3_XL",
 "n_ctx": 2048,
 "n_layer": 24,
 "scale_by_depth": true,
 "scale_by_in": false,
 "attention_types" : [[["global", "local"],12]],
 "mesh_shape": "x:4,y:2",
 "layout": "intermediate_expanded:x,heads:x,vocab:n_vocab,memory_length:y,embd:y",
 "activation_function": "gelu",
 "recompute_grad": true,
 "gradient_clipping": 1.0,
 "tokens_per_mb_per_replica": 2048,
 "precision": "bfloat16"
}
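
Two numbers in this config are easy to sanity-check by hand. The product of the `mesh_shape` dimensions should match the number of TPU cores (8 on a Colab TPU, as far as we know), and `n_embd / n_head` gives the width of each attention head. The helpers below are hypothetical and only do that arithmetic.

```python
def mesh_size(mesh_shape):
    """Parse a string like "x:4,y:2" and return the product of its dimensions."""
    size = 1
    for dim in mesh_shape.split(","):
        _name, value = dim.split(":")
        size *= int(value)
    return size

def head_width(n_embd, n_head):
    """Width of a single attention head."""
    return n_embd // n_head

print(mesh_size("x:4,y:2"))   # 8
print(head_width(2048, 16))   # 128
```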

## Training from Scratch

Now we will begin to train the model. If no previous model is found in "model_path", the model will start training from scratch. If you'd prefer to finetune from pretrained, skip to the `Finetune a Pretrained Model` section.

If everything's set up correctly, you can now run the `main.py` script to start training!

In [None]:
!python3 main.py --model GPT3_XL --steps_per_checkpoint 500 --tpu colab

## Pretrained Model

If you want to sample from or finetune a pretrained model, EleutherAI has pretrained two models for release: one with [1.3B parameters](https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/) and another with [2.7B](https://the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/).

Select an option below to download the weights locally. You will then need to upload them to your cloud bucket in order to finetune from them. If the download command isn't working, try the commented-out code to download from a different source.

The 2.7B model likely won't fit into the Colab TPU's memory, and you may need larger TPU pods to finetune it.

Sampling from it, however, works just fine.


In [None]:
# @title Download pretrained model weights:
pretrained_model = 'GPT3_2-7B' #@param ["GPT3_XL", "GPT3_2-7B"]
!wget -m -np -c -U "eye02" -w 2 -R "index.html*" "https://the-eye.eu/public/AI/gptneo-release/$pretrained_model/"
path_to_local_weights = f"/content/GPTNeo/the-eye.eu/public/AI/gptneo-release/{pretrained_model}"

# URL = f"http://eaidata.bmk.sh/data/gptneo-release/{pretrained_model}/"
# FOLDER_NAME = "GPT3_XL"
# !curl $URL | grep -i "" | sed -n 's/.*href="\([^"]*\).*/\1/p' | sed "s|^|$URL|" | xargs -n 1 -P 4 wget -P $pretrained_model
# path_to_local_weights = pretrained_model


In [9]:
# upload to your bucket
bucket_base = "gs://" + path_to_cloud_bucket.replace('gs://', '').split('/')[0]
!gsutil -m cp -r $path_to_local_weights $bucket_base

If everything has worked successfully you should now see your model listed in your bucket below.

In [None]:
!gsutil ls $bucket_base
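
For reference, `bucket_base` is just the bucket root with any sub-folders stripped, and the uploaded weights end up in a folder named after the model directly under it. The sketch below restates that logic with hypothetical helper names and a placeholder bucket.

```python
def bucket_root(path):
    """Return "gs://<bucket-name>" for any path inside that bucket."""
    name = path.replace("gs://", "", 1).split("/")[0]
    return "gs://" + name

def default_model_path(path, pretrained_model):
    """Where the weights land after copying the model folder to the bucket root."""
    return f"{bucket_root(path)}/{pretrained_model}"

print(bucket_root("gs://your-cloud-bucket/datasets/"))
print(default_model_path("gs://your-cloud-bucket/", "GPT3_2-7B"))
```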

Now we want to make a few modifications to the model config in order to get training / sampling working on colab.

If you are just sampling from our pretrained models, you can leave the settings as is, run the cell below, then move on to the `Sample from your model` section.

If finetuning, you can change parameters below. 

* `path_to_model` should point to the model weights location in your cloud bucket, and will default to `$bucket_base/${pretrained_model}` if nothing is entered.

* `batch_size` is your train batch size - if you're encountering memory errors, try lowering this.

* `dset` is the name of your dataset. If nothing is entered, it defaults to the dataset you selected in the `Set Up Dataset` section.

* `mesh_shape` specifies the way the model will be divided up across the TPU cores. We suggest leaving this alone unless you know what you're doing.

* `train_steps` specifies how many steps you want the model to finetune for. We set this to 1000 for demonstrative purposes but you may need to increase this a little depending on your goals. If you are just sampling from the model, you can leave this as is.

* `steps_per_checkpoint` specifies how often you want to save model weights during training.
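
Note that the config stores an absolute step count: the cell that follows adds your `train_steps` to the step at which the pretrained checkpoint was saved (400000 for `GPT3_2-7B`, 362000 for `GPT3_XL`). The sketch below works through that arithmetic; `finetune_schedule` is a hypothetical helper, and the checkpoint count is only an estimate of how many checkpoint intervals fit into the run.

```python
# Start steps as hard-coded in the config cell of this notebook.
START_STEPS = {"GPT3_2-7B": 400000, "GPT3_XL": 362000}

def finetune_schedule(pretrained_model, finetune_steps, steps_per_checkpoint):
    """Return (absolute train_steps written to the config, checkpoint intervals)."""
    total = START_STEPS[pretrained_model] + finetune_steps
    intervals = finetune_steps // steps_per_checkpoint
    return total, intervals

print(finetune_schedule("GPT3_2-7B", 1000, 500))  # (401000, 2)
print(finetune_schedule("GPT3_XL", 1000, 500))    # (363000, 2)
```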



In [None]:
# @title Modify config for colab. 
 
import json
from pprint import pprint

path_to_model = "" #@param {type:"string"}
batch_size = 8 #@param {type:"integer"}
dset = "" #@param {type:"string"}
mesh_shape = "x:4,y:2" #@param {type:"string"}
train_steps = 1000 #@param {type:"integer"}
steps_per_checkpoint = 500 #@param {type:"integer"}
start_step = 400000 if pretrained_model == "GPT3_2-7B" else 362000

if path_to_model == "":
    path_to_model = f'{bucket_base.strip("/")}/{pretrained_model}'
print(f'MODEL PATH: {path_to_model}\n')

# With no dataset entered, reuse the selected one, or fall back to "pile" if none was selected
if dset == "":
    if dataset is None:
        dset = "pile"
    elif dataset != "Sampling_Only":
        dset = dataset

def pad_to_multiple_of(n, mult):
    """
    pads n to a multiple of mult
    """
    extra = n % mult
    if extra > 0:
        n = n + mult - extra
    return n

with open(f'{path_to_local_weights}/config.json', 'r') as f:
    data = json.load(f)
    pprint(data)
    dset_val = [[dset, None, None, None]] if dset != "" else data["datasets"]
    mods = {
        "mesh_shape": mesh_shape,
        "layout": "intermediate_expanded:x,heads:x,memory_length:y,embd:y",
        "model_path": path_to_model,
        "datasets": dset_val,
        "train_steps": start_step + train_steps,
        "eval_steps": 0,
        "train_batch_size": batch_size,
        "predict_batch_size": batch_size
    }
    data.update(mods)
    print('\n--->\n')
    pprint(data)
    with open(f'configs/{pretrained_model}.json', 'w') as outfile:
        json.dump(data, outfile, indent=2)

### Begin Fine-Tuning

If you are fine-tuning the pretrained model, this line of code will begin the training.

In [None]:
!python3 main.py --model $pretrained_model --steps_per_checkpoint $steps_per_checkpoint --tpu colab

### Sample from your model

Once training is finished (or your pretrained model is in your bucket), you can run the same command with the `--predict` flag to sample from your model.

To pass in a prompt, save it to a .txt file, and pass in the name of the file with the `--prompt` flag.

Use the cell below to enter your prompt, and run it to save it to example_prompt.txt.

You may need to decrease the predict batch size in your config if you're facing OOM errors.
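
If you want to launch sampling from Python rather than a `!` shell line, the argument list could be assembled as in the sketch below. `build_predict_command` is a hypothetical helper; the flags are the ones this notebook already passes to `main.py`.

```python
import shlex

def build_predict_command(model, prompt_file, steps_per_checkpoint=500, tpu="colab"):
    """Assemble the main.py invocation used for sampling as an argument list."""
    return [
        "python3", "main.py",
        "--model", model,
        "--steps_per_checkpoint", str(steps_per_checkpoint),
        "--tpu", tpu,
        "--predict",
        "--prompt", prompt_file,
    ]

cmd = build_predict_command("GPT3_2-7B", "example_prompt.txt")
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd)` without worrying about shell quoting.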

Let's see if the GPTNeo model can finish coding itself, with a sample prompt consisting of the beginning of a `torch.nn.Module`:

In [13]:
%%writefile example_prompt.txt

class GPT(nn.Module):
    """ the full GPT language model, with a context size of block_size """

    def __init__(self, config):
        super().__init__()

        # input embedding stem
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
        self.drop = nn.Dropout(config.embd_pdrop)
        # transformer
        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        # decoder head
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        self.block_size = config.block_size
        self.apply(self._init_weights)

        logger.info("number of parameters: %e", sum(p.numel() for p in self.parameters()))

Overwriting example_prompt.txt


In [14]:
!python3 main.py --model $pretrained_model --steps_per_checkpoint 500 --tpu colab --predict --prompt example_prompt.txt

2021-03-22 12:20:43.411018: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Instructions for updating:
non-resource variables are not supported in the long term
Current step 400000
Saving config to gs://test-bucket-neo/GPT3_2-7B
2021-03-22 12:20:50.689547: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-22 12:20:50.691059: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-22 12:20:50.701975: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-03-22 12:20:50.702051: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (eeb4af61eb99): /proc/driver/nvidia/version does not exist
2021-03-22 12:20:52.229703: I tensorflow/compiler/mlir/mlir

# Evaluating the model

This section assumes you are using a pretrained model and relies on variables created in the `Pretrained Model` section.

## Wikitext

Download the wikitext test set:


In [None]:
wikitext103_src = "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip"
!wget $wikitext103_src
!unzip wikitext-103-raw-v1.zip

Tokenize and upload to bucket:


In [None]:

!mkdir wikitext
!mv /content/GPTNeo/wikitext-103-raw/wiki.test.raw wikitext/wikitext_test.txt

# Tokenize Data
!python data/create_tfrecords.py --input_dir wikitext --name wikitext --files_per 1000 --output_dir wikitext_tokenized --write_dataset_config --processes 1 --wikitext-detokenize

# copy the data to your bucket
if not path_to_cloud_bucket.endswith('/'):
    path_to_cloud_bucket += '/'
copy_loc = path_to_cloud_bucket 
!gsutil -m cp -r wikitext_tokenized $copy_loc
!gsutil ls $path_to_cloud_bucket

Now make a dataset config that points to the tokenized wikitext data. Replace `test-bucket-neo` in `eval_path` with the name of your own bucket:

In [None]:
%%writefile configs/dataset_configs/wikitext.json

{
 "path": "",
 "eval_path": "gs://test-bucket-neo/wikitext_tokenized/*.tfrecords",
 "n_vocab": 50256,
 "tokenizer_is_pretrained": true,
 "tokenizer_path": "gpt2",
 "eos_id": 50256,
 "padding_id": 50257
}


And update your model config to point to that dataset:


In [None]:
# @title Modify config for wikitext. 
 
import json
from pprint import pprint

batch_size = 8 #@param {type:"integer"}
assert pretrained_model is not None
with open(f'configs/{pretrained_model}.json', 'r') as f:
    data = json.load(f)
    pprint(data)
    dset_val = [["wikitext", None, None, None]]
    mods = {
        "datasets": dset_val,
        "eval_steps": 139 // batch_size,
        "train_batch_size": batch_size,
        "eval_batch_size": batch_size,
    }
    data.update(mods)
    print('\n--->\n')
    pprint(data)
    with open(f'configs/{pretrained_model}.json', 'w') as outfile:
        json.dump(data, outfile, indent=2)
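
The `eval_steps` value uses floor division, so any examples that do not fill a final batch are skipped. The sketch below assumes 139 is the number of evaluation sequences in the tokenized test set (the notebook does not say what the constant counts); `eval_plan` is a hypothetical helper.

```python
def eval_plan(n_examples, batch_size):
    """Return (number of full eval steps, examples left over and skipped)."""
    steps = n_examples // batch_size
    return steps, n_examples - steps * batch_size

print(eval_plan(139, 8))  # (17, 3)
```

With a batch size of 8, that is 17 steps covering 136 examples; a batch size of 1 would cover all 139.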

Now run the model in eval mode over the tokenized data:

In [None]:
!python3 main.py --eval --tpu colab --model $pretrained_model

## Lambada

Lambada eval is built into the codebase and can be run by adding an `eval_tasks` field to your model config:

In [None]:
# @title Modify config for Lambada. 
 
import json
from pprint import pprint

batch_size = 8 #@param {type:"integer"}
assert pretrained_model is not None
with open(f'configs/{pretrained_model}.json', 'r') as f:
    data = json.load(f)
    mods = {
        # dset_val is reused from an earlier "Modify config" cell, so run one of those first
        "datasets": dset_val,
        "eval_steps": 0,
        "train_batch_size": batch_size,
        "eval_batch_size": batch_size,
        "eval_tasks": ["lambada"]
    }
    data.update(mods)
    print('\n--->\n')
    pprint(data)
    with open(f'configs/{pretrained_model}.json', 'w') as outfile:
        json.dump(data, outfile, indent=2)

Now run the eval:

In [None]:
!python3 main.py --eval --tpu colab --model $pretrained_model