Andrew DalPino committed on commit e431f0f (parent: fc4824e)

Blanket optimizations

Files changed:
- .gitignore +1 -0
- README.md +26 -23
- beam_search.py +3 -1
- export_model.ipynb +39 -95
- generate.py +7 -1
- instruction-tune.py +11 -2
- model.py +33 -29
- model_sizing.ipynb +72 -43
- pre-train.py → pretrain.py +25 -16
- requirements.txt +3 -1
- runs/.gitignore +2 -0
.gitignore CHANGED
@@ -10,6 +10,7 @@ wheels/
 *.egg-info/
 .installed.cfg
 *.egg
+*.sarif
 .venv
 venv/
 ENV/
README.md CHANGED
@@ -13,7 +13,7 @@ tags:
 ---
 # LightGPT
 
-LightGPT is a lightweight generative pre-trained Transformer (GPT) model for the …
+LightGPT is a lightweight generative pretrained Transformer (GPT) model for the people! Built using PyTorch and trained on the Fineweb and Alpaca datasets, LightGPT can answer questions, follow instructions, summarize documents, chat, and more. Best of all, the model weights *and* code are fully open-source for you to customize, improve upon, and share with the world.
 
 ## Features
 
@@ -23,18 +23,18 @@ LightGPT is a lightweight generative pre-trained Transformer (GPT) model for the
 
 - **Fully Open-source**: Unlike closed-source LLMs, LightGPT provides both the model weights *and* the source code to train, fine-tune, export, and generate text from the model using your own hardware. With the help of the open-source software community, we aim to democratize access to AI and continually improve the models.
 
-## Suggested …
+## Suggested Pretraining Configurations
 
-Below is a table of some suggested …
+Below is a table of some suggested pretraining configurations, but feel free to experiment with settings on your own. See the `model_sizing.ipynb` notebook to estimate the memory and compute requirements for your model configuration.
 
-| Name | Vocab. Size | …
+| Name | Vocab. Size | Embedding Dim. | Attn. Heads | Layers | Parameters | Training Tokens |
 |---|---|---|---|---|---|---|
-| Small | 50,257 | 1024 | …
-| Medium | 50,257 | …
-| Large | 100,275 | …
-| X-large | 100,275 | …
-| XX-large | 200,017 | …
-| XXX-large | 200,017 | …
+| Small | 50,257 | 1024 | 16 | 24 | 353M | 7B |
+| Medium | 50,257 | 2048 | 32 | 32 | 1.7B | 34B |
+| Large | 100,275 | 4096 | 64 | 32 | 6.8B | 132B |
+| X-large | 100,275 | 4096 | 64 | 64 | 13B | 262B |
+| XX-large | 200,017 | 8192 | 128 | 64 | 53B | 1T |
+| XXX-large | 200,017 | 8192 | 128 | 128 | 105B | 2T |
 
 ## Install Project Dependencies
 
@@ -48,37 +48,37 @@ source ./.venv/bin/activate
 pip install -r requirements.txt
 ```
 
-## …
+## Pretraining
 
-For the …
+For the pretraining corpus we use the Fineweb dataset, which consists of about 15T high-quality tokens gathered from the worldwide web. The dataset has been split into 3 subsets (10BT, 100BT, and 350BT versions) for training smaller models. If you'd like to start training right away, the default settings should work on most single-GPU systems with 12G of VRAM or more.
 
 ```
-python pre-train.py
+python pretrain.py
 ```
 
 **Note** that it will take a while to download and pre-process the dataset the first time that the training script is run.
 
-To customize the default "Small" architecture you can adjust the `block_size`, `embedding_dimensions`, `num_hidden_layers`, and `num_attention_heads` arguments of the …
+To customize the default "Small" architecture you can adjust the `block_size`, `embedding_dimensions`, `num_hidden_layers`, and `num_attention_heads` arguments of the pretraining script.
 
 ```
-python pre-train.py …
+python pretrain.py --block_size=2048 --embedding_dimensions=4096 --num_hidden_layers=64 --num_attention_heads=64
 ```
 
 You can also adjust the `batch_size`, `learning_rate`, and `gradient_accumulation_steps` to suit your training setup.
 
 ```
-python pre-train.py …
+python pretrain.py --batch_size=32 --learning_rate=0.01 --gradient_accumulation_steps=128
 ```
 
 For distributed training, use PyTorch's [torchrun](https://pytorch.org/docs/stable/elastic/run.html) extension to launch a distributed data parallel (DDP) session. The example below is for executing the training script on a single node with 8 individual GPUs.
 
 ```
-torchrun --standalone --nnodes=1 --nproc-per-node=8 pre-train.py --batch_size=16 …
+torchrun --standalone --nnodes=1 --nproc-per-node=8 pretrain.py --batch_size=16 --gradient_accumulation_steps=128
 ```
 
 **Note** that when training in data-parallel mode it's important that the world size divides evenly into `gradient_accumulation_steps` for maximum performance. For example, if we have an 8 GPU cluster, we could perform 32 gradient accumulation steps in exactly 4 passes over the network.
 
-### …
+### Pretraining Arguments
 
 | Argument | Default | Type | Description |
 |---|---|---|---|
@@ -88,6 +88,7 @@ torchrun --standalone --nnodes=1 --nproc-per-node=8 pre-train.py --batch_size=16
 | --num_dataset_processes | 8 | int | The number of processes (CPUs) to use to process the dataset. |
 | --batch_size | 1 | int | The number of samples to pass through the network at a time. |
 | --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
+| --tokens_per_sample | 1024 | int | The number of tokens to pack into a single training sequence. This is sometimes called the context length or block size. |
 | --samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
 | --num_epochs | 1686 | int | The number of epochs to train for. |
 | --learning_rate | 1e-2 | float | The learning rate of the Adafactor optimizer. |
@@ -95,17 +96,17 @@ torchrun --standalone --nnodes=1 --nproc-per-node=8 pre-train.py --batch_size=16
 | --low_memory_optimizer | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
 | --max_gradient_norm | 1.0 | float | Clip gradients above this threshold before stepping. |
 | --eval_interval | 10 | int | Evaluate the model after this many epochs on the testing set. |
-| --block_size | 1024 | int | The number of tokens within the context window for every sample. |
 | --embedding_dimensions | 1024 | int | The dimensionality of the token embeddings. |
 | --num_attention_heads | 16 | int | The number of attention heads within every block. |
 | --num_hidden_layers | 24 | int | The number of attention/MLP blocks within the hidden layer of the network. |
 | --feed_forward_ratio | 4 | (1, 2, 4) | The ratio of hidden neurons to embedding dimensions in the MLP layers of the network. |
 | --dropout | 0.1 | float | The proportion of signals to send to zero during training as regularization. |
-| --activation_checkpointing | False | bool | Should we use activation checkpointing? This will reduce …
+| --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization during training at the cost of recomputing the forward pass. |
 | --ddp_sharding_level | 2 | int | The level of sharding to use for DDP training. Options are 2 or 3 for partial and full sharding respectively, or 0 for no sharding. |
-| --checkpoint_interval | 20 | int | Save the model …
+| --checkpoint_interval | 20 | int | Save the model checkpoint to disk every this many epochs. |
 | --checkpoint_path | "./checkpoints/checkpoint.pt" | str | The path to the base checkpoint file on disk. |
 | --resume | False | bool | Should we resume training from the last checkpoint? |
+| --run_dir_path | "./runs/pretrain" | str | The path to the TensorBoard run directory for this training session. |
 | --device | "cuda" | str | The device to run the computation on. |
 | --seed | None | int | The seed for the random number generator. |
 
@@ -116,12 +117,13 @@ torchrun --standalone --nnodes=1 --nproc-per-node=8 pre-train.py --batch_size=16
 | Argument | Default | Type | Description |
 |---|---|---|---|
 | --base_model_path | "./checkpoints/checkpoint.pt" | string | The path to the base checkpoint on disk. |
+| --max_tokens_per_sample | 4096 | int | The maximum number of tokens to pack into a single training sequence. |
+| --mask_input | False | bool | Should we mask the input part of the training sequences i.e. only train on the supervised output? |
 | --batch_size | 1 | int | The number of samples to pass through the network at a time. |
 | --gradient_accumulation_steps | 64 | int | The number of batches to pass through the network before updating the weights. |
 | --learning_rate | 5e-4 | float | The learning rate of the Adafactor optimizer. |
 | --rms_decay | -0.8 | float | The decay rate of the RMS coefficient of the Adafactor optimizer. |
-| --optimizer_low_memory | …
+| --optimizer_low_memory | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
-| --mask_input | False | bool | Should we mask the input part of the sample i.e. only train on the output? |
 | --rank | 8 | int | The rank of the LoRA decomposition matrices. |
 | --alpha | 1.0 | float | The strength of the LoRA signal. |
 | --dropout | 0.05 | float | The proportion of signals to send to zero during training as regularization. |
@@ -131,6 +133,7 @@ torchrun --standalone --nnodes=1 --nproc-per-node=8 pre-train.py --batch_size=16
 | --checkpoint_interval | 1 | int | Save the model parameters to disk every this many epochs. |
 | --checkpoint_path | "./checkpoints/lora_instruction.pt" | string | The path to the LoRA checkpoint. |
 | --resume | False | bool | Should we resume training from the last checkpoint? |
+| --run_dir_path | "./runs/instruction-tune" | str | The path to the TensorBoard run directory for this training session. |
 | --device | "cuda" | string | The device to run the computation on. |
 | --seed | None | int | The seed for the random number generator. |
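The parameter counts in the configurations table follow from the architecture settings alone. Below is a minimal sketch of the accounting, using the same formulas as `model_sizing.ipynb` (the output layer is weight-tied to the token embeddings, so it adds nothing); the variable names mirror the notebook and the exact total is approximate.

```
# Rough parameter estimate for the "Small" configuration.
vocabulary_size = 50257
embedding_dimensions = 1024
num_hidden_layers = 24
feed_forward_ratio = 4

token_embeddings = vocabulary_size * embedding_dimensions
attention = 4 * embedding_dimensions**2 * num_hidden_layers
mlp = 2 * feed_forward_ratio * embedding_dimensions**2 * num_hidden_layers
rms_norm = embedding_dimensions * (2 * num_hidden_layers + 1)

total = token_embeddings + attention + mlp + rms_norm

print(f"{total:,}")  # ~353.5M, matching the "Small" row of the table
```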
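To make the data-parallel note concrete: with 32 gradient accumulation steps on an 8-GPU node, each weight update takes exactly 32 / 8 = 4 forward/backward passes per GPU. A quick sketch of the divisibility check (illustrative names, not taken from the training script):

```
world_size = 8  # number of GPUs in the DDP session
gradient_accumulation_steps = 32

# The accumulation steps should split evenly across the workers.
assert gradient_accumulation_steps % world_size == 0

passes_per_update = gradient_accumulation_steps // world_size

print(f"{passes_per_update} passes over the network per weight update")  # 4
```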
beam_search.py CHANGED
@@ -22,7 +22,8 @@ def main():
         "--checkpoint_path", default="./checkpoints/checkpoint.pt", type=str
     )
     parser.add_argument("--lora_path", default=None, type=str)
-    parser.add_argument("--max_tokens", default=…
+    parser.add_argument("--max_tokens", default=100, type=int)
+    parser.add_argument("--context_length", default=1024, type=int)
     parser.add_argument("--num_candidates", default=3, type=int)
     parser.add_argument("--beam_width", default=16, type=int)
     parser.add_argument("--device", default="cuda", type=str)
@@ -92,6 +93,7 @@ def main():
     candidates = model.beam_search(
         prompt,
         args.max_tokens,
+        args.context_length,
         args.num_candidates,
         args.beam_width,
     )
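The new `--context_length` argument bounds how much of the growing candidate sequence is fed back into the model. A minimal sketch of the sliding-window slice the diff introduces (the tensor below is a stand-in for real token ids):

```
import torch

context_length = 1024

tokens = torch.arange(1500)  # stand-in for a growing sequence of token ids

context_window = tokens[-context_length:]  # keep only the most recent tokens

print(context_window.shape)  # torch.Size([1024])
```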
export_model.ipynb CHANGED
@@ -9,7 +9,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": …,
+ "execution_count": 18,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -28,25 +28,21 @@
 },
 {
  "cell_type": "code",
- "execution_count": 3,
+ "execution_count": 19,
  "metadata": {},
  "outputs": [
   {
-   "ename": "TypeError",
-   "evalue": "GPT.__init__() missing 1 required positional argument: 'feed_forward_ratio'",
-   "output_type": "error",
-   "traceback": ["…", "TypeError: GPT.__init__() missing 1 required positional argument: 'feed_forward_ratio'"]
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "Base checkpoint loaded successfully\n"
+   ]
   }
  ],
 "source": [
  "import torch\n",
  "\n",
- "from model import GPT, GPTWithLoRA\n",
+ "from model import GPT\n",
  "\n",
  "checkpoint = torch.load(checkpoint_path, map_location=\"cpu\", weights_only=True)\n",
  "\n",
@@ -68,10 +64,12 @@
 },
 {
  "cell_type": "code",
- "execution_count": …,
+ "execution_count": 20,
  "metadata": {},
  "outputs": [],
  "source": [
+  "from model import GPTWithLoRA\n",
+  "\n",
   "if lora_path != None:\n",
   "    checkpoint = torch.load(lora_path, map_location=\"cpu\", weights_only=True)\n",
   "\n",
@@ -95,14 +93,14 @@
 },
 {
  "cell_type": "code",
- "execution_count": …,
+ "execution_count": 21,
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "Model saved to ./exports/lightgpt-small…
+    "Model saved to ./exports/lightgpt-small.safetensors\n"
    ]
   }
  ],
@@ -127,66 +125,46 @@
 },
 {
  "cell_type": "code",
- "execution_count": …,
+ "execution_count": 22,
  "metadata": {},
  "outputs": [
-  {
-   "name": "stdout",
-   "output_type": "stream",
-   "text": [
-    "[torch.onnx] Obtain model graph for `OptimizedModule([...]` with `torch.export.export`...\n"
-   ]
-  },
-  {
-   "name": "stderr",
-   "output_type": "stream",
-   "text": [
-    "W0108 18:27:01.430000 5473 torch/onnx/_internal/exporter/_registration.py:73] torchvision is not installed. Skipping torchvision::nms\n"
-   ]
-  },
-  {
-   "name": "stdout",
-   "output_type": "stream",
-   "text": [
-    "[torch.onnx] Obtain model graph for `OptimizedModule([...]` with `torch.export.export`... ✅\n",
-    "[torch.onnx] Translate the graph into ONNX...\n"
-   ]
-  },
   {
    "name": "stderr",
    "output_type": "stream",
    "text": [
-    "…
+    "/home/andrew/Workspace/LightGPT/.venv/lib/python3.12/site-packages/torch/onnx/_internal/_exporter_legacy.py:116: UserWarning: torch.onnx.dynamo_export only implements opset version 18 for now. If you need to use a different opset version, please register them with register_custom_op.\n",
+    "  warnings.warn(\n"
    ]
   },
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "…
-    "Model saved to ./exports/lightgpt-small…
+    "Applied 72 of general pattern rewrite rules.\n",
+    "Model saved to ./exports/lightgpt-small.onnx\n"
    ]
   }
  ],
 "source": [
-  "from …
+  "from model import ONNXModel\n",
+  "\n",
+  "from torch.onnx import dynamo_export, ExportOptions\n",
   "\n",
   "example_input = torch.randint(0, model.vocabulary_size - 1, (1, model.block_size))\n",
   "\n",
+  "model = ONNXModel(model) # Nicer inferencing API\n",
+  "\n",
   "model.eval() # Turn off dropout and other train-time operations\n",
   "\n",
-  "…
+  "export_options = ExportOptions(\n",
+  "    dynamic_shapes=True\n",
+  ") # Necessary for variable batch and sequence lengths\n",
+  "\n",
+  "onnx_model = dynamo_export(model, example_input, export_options=export_options)\n",
   "\n",
   "onnx_path = path.join(exports_path, f\"{model_name}.onnx\")\n",
   "\n",
-  "torch.onnx.export(\n",
-  "    model,\n",
-  "    example_input,\n",
-  "    onnx_path,\n",
-  "    input_names=[\"input_tokens\", \"labels\"],\n",
-  "    output_names=[\"logits\"],\n",
-  "    dynamo=True,\n",
-  ")\n",
+  "onnx_model.save(onnx_path)\n",
   "\n",
   "print(f\"Model saved to {onnx_path}\")"
  ]
@@ -195,75 +173,41 @@
 },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "…
+  "Lastly, let's compare the output of PyTorch with the ONNX runtime to see if they are the same."
  ]
 },
 {
  "cell_type": "code",
- "execution_count": …,
- "metadata": {},
- "outputs": [
-  {
-   "name": "stdout",
-   "output_type": "stream",
-   "text": [
-    "…
-   ]
-  }
- ],
- "source": [
-  "import onnx\n",
-  "\n",
-  "onnx_model = onnx.load(onnx_path)\n",
-  "\n",
-  "onnx.checker.check_model(onnx_model)\n",
-  "\n",
-  "print(\"Looks OK\")"
- ]
-},
-{
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-  "Lastly, let's compare the output of PyTorch with the ONNX runtime to see if they are the same."
- ]
-},
-{
- "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [
   {
-   "ename": "NameError",
-   "evalue": "name 'onnx_path' is not defined",
-   "output_type": "error",
-   "traceback": ["…", "NameError: name 'onnx_path' is not defined"]
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "Looking good!\n"
+   ]
   }
  ],
 "source": [
  "import onnxruntime\n",
  "\n",
- "import numpy as np\n",
- "\n",
  "from numpy.testing import assert_allclose\n",
  "\n",
+ "pytorch_logits = model(example_input)\n",
+ "\n",
  "session = onnxruntime.InferenceSession(onnx_path, providers=[\"CPUExecutionProvider\"])\n",
  "\n",
- "onnx_input = {\"…
+ "onnx_input = {\"l_x_\": example_input.numpy()}\n",
  "\n",
- "…
+ "onnx_logits = session.run(None, onnx_input)\n",
  "\n",
- "…
- "…
+ "onnx_logits = onnx_logits[0]\n",
+ "pytorch_logits = pytorch_logits.detach().numpy()\n",
  "\n",
- "assert_allclose(…
+ "assert_allclose(pytorch_logits, onnx_logits, rtol=1e-2, atol=1e-03)\n",
  "\n",
- "print(\"…
+ "print(\"Looks good!\")"
 ]
}
],
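For reference, a minimal standalone inference sketch against the exported graph. The input name "l_x_" and the export path are taken from the notebook cells above; the greedy argmax step is illustrative and not part of the notebook.

```
import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession(
    "./exports/lightgpt-small.onnx", providers=["CPUExecutionProvider"]
)

# A batch of random token ids standing in for a real tokenized prompt.
input_tokens = np.random.randint(0, 50257, size=(1, 1024), dtype=np.int64)

logits = session.run(None, {"l_x_": input_tokens})[0]

next_token = int(logits[0, -1].argmax())  # greedy next-token prediction

print(next_token)
```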
generate.py CHANGED
@@ -23,6 +23,7 @@ def main():
     )
     parser.add_argument("--lora_path", default=None, type=str)
     parser.add_argument("--max_tokens", default=1000, type=int)
+    parser.add_argument("--context_length", default=1024, type=int)
     parser.add_argument("--temperature", default=1.0, type=float)
     parser.add_argument("--top_k", default=500, type=int)
     parser.add_argument("--top_p", default=0.9, type=float)
@@ -91,7 +92,12 @@ def main():
     prompt = torch.tensor(prompt, dtype=torch.int64, device=args.device)
 
     for token in model.generate(
-        prompt, …
+        prompt,
+        args.max_tokens,
+        args.context_length,
+        args.temperature,
+        args.top_k,
+        args.top_p,
     ):
         out = tokenizer.decode_single_token_bytes(token).decode(
             "utf-8", errors="replace"
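The `--temperature`, `--top_k`, and `--top_p` arguments shape the next-token distribution. Below is a sketch of the standard filtering technique these parameters usually drive; it mirrors the common approach, not necessarily the exact body of `model.generate`.

```
import torch

def sample_next_token(logits, temperature=1.0, top_k=500, top_p=0.9):
    logits = logits / temperature

    # Keep only the top_k highest-scoring candidates (sorted descending).
    values, indices = torch.topk(logits, k=min(top_k, logits.size(-1)))

    probabilities = torch.softmax(values, dim=-1)

    # Nucleus (top_p) filtering: keep the smallest prefix of candidates
    # whose cumulative probability covers top_p.
    cumulative = torch.cumsum(probabilities, dim=-1)
    mask = cumulative - probabilities <= top_p

    probabilities = probabilities * mask
    probabilities /= probabilities.sum()

    choice = torch.multinomial(probabilities, num_samples=1)

    return indices[choice].item()

next_token = sample_next_token(torch.randn(50257))
```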
instruction-tune.py CHANGED
@@ -9,6 +9,7 @@ from torch.optim import Adafactor
 from torch.amp import autocast
 from torch.cuda import is_available as cuda_is_available, is_bf16_supported
 from torch.utils.data import random_split
+from torch.utils.tensorboard import SummaryWriter
 
 from torchmetrics.text import Perplexity
 
@@ -26,12 +27,13 @@ def main():
     parser.add_argument(
         "--base_model_path", default="./checkpoints/checkpoint.pt", type=str
     )
+    parser.add_argument("--max_tokens_per_sample", default=4096, type=int)
+    parser.add_argument("--mask_input", action="store_true")
     parser.add_argument("--batch_size", default=1, type=int)
     parser.add_argument("--gradient_accumulation_steps", default=64, type=int)
     parser.add_argument("--learning_rate", default=5e-4, type=float)
     parser.add_argument("--rms_decay", default=-0.8, type=float)
     parser.add_argument("--optimizer_low_memory", default=True, type=bool)
-    parser.add_argument("--mask_input", default=False, type=bool)
     parser.add_argument("--num_epochs", default=4, type=int)
     parser.add_argument("--rank", default=8, type=int)
     parser.add_argument("--alpha", default=1.0, type=float)
@@ -43,6 +45,7 @@ def main():
         "--checkpoint_path", default="./checkpoints/lora_instruction.pt", type=str
     )
     parser.add_argument("--resume", action="store_true")
+    parser.add_argument("--run_dir_path", default="./runs/instruction-tune", type=str)
     parser.add_argument("--device", default="cuda", type=str)
     parser.add_argument("--seed", default=None, type=int)
 
@@ -65,6 +68,8 @@ def main():
         torch.manual_seed(args.seed)
         random.seed(args.seed)
 
+    logger = SummaryWriter(args.run_dir_path)
+
     checkpoint = torch.load(
         args.base_model_path, map_location=args.device, weights_only=True
     )
@@ -75,7 +80,7 @@ def main():
 
     dataset = Alpaca(
         tokenizer,
-        max_tokens_per_sample=…
+        max_tokens_per_sample=args.max_tokens_per_sample,
         mask_input=args.mask_input,
     )
 
@@ -173,6 +178,8 @@ def main():
 
         average_cross_entropy = total_cross_entropy / total_batches
 
+        logger.add_scalar("cross entropy", average_cross_entropy, epoch)
+
        print(
            f"Epoch {epoch}: Cross Entropy: {average_cross_entropy:.5f}",
        )
@@ -191,6 +198,8 @@ def main():
 
            perplexity = perplexity_metric.compute()
 
+            logger.add_scalar("perplexity", perplexity, epoch)
+
            print(f"Perplexity: {perplexity:.3f}")
 
            perplexity_metric.reset()
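The commit wires a TensorBoard `SummaryWriter` into the training loop, logging one scalar per epoch. A minimal sketch of the same pattern follows (the loop and metric values are placeholders); the resulting run directory is viewed with `tensorboard --logdir ./runs`.

```
from torch.utils.tensorboard import SummaryWriter

logger = SummaryWriter("./runs/instruction-tune")

for epoch in range(1, 5):
    average_cross_entropy = 1.0 / epoch  # placeholder metric value

    logger.add_scalar("cross entropy", average_cross_entropy, epoch)

logger.close()
```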
model.py CHANGED
@@ -1,4 +1,4 @@
-from math import sqrt …
+from math import sqrt
 from dataclasses import dataclass
 from functools import partial, cached_property
 from typing import Iterator, Self
@@ -13,8 +13,8 @@ from torch.nn import (
     Embedding,
     MultiheadAttention,
     Linear,
+    SiLU,
     RMSNorm,
-    GELU,
     Dropout1d,
     CrossEntropyLoss,
     Parameter,
@@ -27,25 +27,21 @@ from torch.utils.checkpoint import checkpoint as torch_checkpoint
 
 
 class GPT(Module):
-    """A generative …"""
+    """A generative pretrained transformer."""
 
     def __init__(
         self,
-        block_size: int,
+        vocabulary_size: int,
         embedding_dimensions: int,
         num_heads: int,
         num_layers: int,
         feed_forward_ratio: int,
         dropout: float,
-        vocabulary_size: int,
         padding_index: int,
         eos_index: int,
     ):
         super().__init__()
 
-        if block_size < 1:
-            raise ValueError(f"Block size must be greater than 0, {block_size} given.")
-
         if num_layers <= 0:
             raise ValueError(f"Num layers must be greater than 0, {num_layers} given.")
 
@@ -67,16 +63,10 @@
 
         self.token_embeddings = token_embeddings
 
-        causal_mask = torch.full((block_size, block_size), float("-inf"))
-        causal_mask = torch.triu(causal_mask, diagonal=1)
-
-        self.causal_mask = Buffer(causal_mask, persistent=False)
-
         self.body = ModuleList(
             [
                 CausalSelfAttentionBlock(
                     embedding_dimensions,
-                    block_size,
                     num_heads,
                     feed_forward_ratio,
                     dropout,
@@ -93,7 +83,6 @@
         self.loss_function = CrossEntropyLoss(ignore_index=padding_index)
 
         self.vocabulary_size = vocabulary_size
-        self.block_size = block_size
         self.eos_index = eos_index
 
     @cached_property
@@ -108,9 +97,10 @@
     ) -> tuple[Tensor, Tensor | None]:
         z = self.token_embeddings(x)
 
-        b, t = …
+        b, t, d = z.size()
 
-        causal_mask = …
+        causal_mask = torch.full((t, t), float("-inf"), dtype=z.dtype, device=z.device)
+        causal_mask = torch.triu(causal_mask, diagonal=1)
 
         for layer in self.body:
             z = self.checkpoint(layer, z, causal_mask)
@@ -132,14 +122,15 @@
     def generate(
         self,
         prompt: Tensor,
-        max_tokens: int = …
+        max_tokens: int = 1000,
+        context_length: int = 1024,
         temperature: float = 1.0,
         top_k: int = 500,
         top_p: float = 0.9,
     ) -> Iterator:
         """
         Given a prompt, sample the next {max_tokens} tokens from the model weighted
-        by their predicted probabilities.
+        by their predicted probabilities and filtered by the {top_k} and {top_p}.
         """
 
         if max_tokens <= 0:
@@ -161,7 +152,7 @@
         context_window = prompt
 
         for _ in range(max_tokens):
-            context_window = context_window[-…
+            context_window = context_window[-context_length:]
 
             y_pred, _ = self.forward(context_window.unsqueeze(0))
 
@@ -201,12 +192,15 @@
     def beam_search(
         self,
         prompt: Tensor,
-        max_tokens: int = …
+        max_tokens: int = 100,
+        context_length: int = 1024,
         num_candidates: int = 3,
         beam_width: int = 16,
     ) -> list:
         """
-        Given a prompt, return the {num_candidates} highest probability sequences.
+        Given a prompt, return the {num_candidates} highest probability sequences. Note that
+        this method is often best for generating shorter sequences and is typically less
+        natural sounding than sequences that are more random in nature.
         """
 
         if max_tokens <= 0:
@@ -267,7 +261,7 @@
 
             context_window = torch.cat((prompt, candidate.tokens))
 
-            context_window = context_window[-…
+            context_window = context_window[-context_length:]
 
             y_pred, _ = self.forward(context_window.unsqueeze(0))
 
@@ -293,7 +287,7 @@
 
 class GPTWithLoRA(Module):
     """
-    A wrapper for …
+    A wrapper for pretrained GPT models that applies a LoRA reparameterization
     to the intermediate layers of the network.
     """
 
@@ -382,13 +376,26 @@
         return self.model.beam_search(prompt, max_tokens, num_candidates, beam_width)
 
 
+class ONNXModel(Module):
+    """This wrapper provides a cleaner inferencing API for production models."""
+
+    def __init__(self, model: GPT | GPTWithLoRA):
+        super().__init__()
+
+        self.model = model
+
+    def forward(self, x: Tensor) -> Tensor:
+        logits, _ = self.model.forward(x, None)
+
+        return logits
+
+
 class CausalSelfAttentionBlock(Module):
     """Causal self-attention block with residual connections."""
 
     def __init__(
         self,
         embedding_dimensions: int,
-        block_size: int,
         num_heads: int,
         feed_forward_ratio: int,
         dropout: float,
@@ -400,9 +407,6 @@
             f"Embedding dimensions must be greater than 0, {embedding_dimensions} given."
         )
 
-        if block_size <= 0:
-            raise ValueError(f"Block size must be greater than 0, {block_size} given.")
-
         if num_heads <= 0:
             raise ValueError(f"Num heads must be greater than 0, {num_heads} given.")
 
@@ -459,7 +463,7 @@ class MLP(Module):
 
         self.layers = Sequential(
             Linear(embedding_dimensions, hidden_dimensions, bias=False),
-            GELU(),
+            SiLU(),
             Linear(hidden_dimensions, embedding_dimensions, bias=False),
         )
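The commit replaces the fixed `block_size` mask buffer with a mask built on every forward pass, sized to the actual sequence length. A small demo of the same two-line construction:

```
import torch

t = 4  # sequence length of the current batch

causal_mask = torch.full((t, t), float("-inf"))
causal_mask = torch.triu(causal_mask, diagonal=1)  # zero below the diagonal

print(causal_mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```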
model_sizing.ipynb
CHANGED
@@ -9,15 +9,19 @@
|
|
9 |
},
|
10 |
{
|
11 |
"cell_type": "code",
|
12 |
-
"execution_count":
|
13 |
"metadata": {},
|
14 |
"outputs": [],
|
15 |
"source": [
|
16 |
-
"
|
17 |
"vocabulary_size = 50257\n",
|
18 |
"embedding_dimensions = 1024\n",
|
19 |
"num_attention_heads = 16\n",
|
20 |
"num_hidden_layers = 24\n",
|
|
|
|
|
|
|
|
|
21 |
"samples_per_epoch = 4096"
|
22 |
]
|
23 |
},
|
@@ -30,7 +34,7 @@
|
|
30 |
},
|
31 |
{
|
32 |
"cell_type": "code",
|
33 |
-
"execution_count":
|
34 |
"metadata": {},
|
35 |
"outputs": [
|
36 |
{
|
@@ -49,12 +53,12 @@
|
|
49 |
"text": [
|
50 |
"Token Embeddings 51,463,168 14.56%\n",
|
51 |
"Attention 100,663,296 28.48%\n",
|
52 |
-
"MLP 201,326,
|
53 |
"RMS Norm 50,176 0.01%\n",
|
54 |
"Output Layer 0 0.00%\n",
|
55 |
"\n",
|
56 |
"\n",
|
57 |
-
"Total parameters: 353,503,
|
58 |
]
|
59 |
}
|
60 |
],
|
@@ -67,7 +71,11 @@
|
|
67 |
" embedding_dimensions**2 + embedding_dimensions * 3 * embedding_dimensions\n",
|
68 |
" )\n",
|
69 |
" * num_hidden_layers,\n",
|
70 |
-
" \"MLP\": embedding_dimensions
|
|
|
|
|
|
|
|
|
71 |
" \"RMS Norm\": embedding_dimensions * num_hidden_layers * 2 + embedding_dimensions,\n",
|
72 |
" \"Output Layer\": 0, # Tied to token embeddings\n",
|
73 |
"}\n",
|
@@ -99,7 +107,7 @@
|
|
99 |
},
|
100 |
{
|
101 |
"cell_type": "code",
|
102 |
-
"execution_count":
|
103 |
"metadata": {},
|
104 |
"outputs": [
|
105 |
{
|
@@ -125,7 +133,7 @@
|
|
125 |
},
|
126 |
{
|
127 |
"cell_type": "code",
|
128 |
-
"execution_count":
|
129 |
"metadata": {},
|
130 |
"outputs": [
|
131 |
{
|
@@ -151,7 +159,7 @@
|
|
151 |
},
|
152 |
{
|
153 |
"cell_type": "code",
|
154 |
-
"execution_count":
|
155 |
"metadata": {},
|
156 |
"outputs": [
|
157 |
{
|
@@ -181,14 +189,14 @@
|
|
181 |
},
|
182 |
{
|
183 |
"cell_type": "code",
|
184 |
-
"execution_count":
|
185 |
"metadata": {},
|
186 |
"outputs": [
|
187 |
{
|
188 |
"name": "stdout",
|
189 |
"output_type": "stream",
|
190 |
"text": [
|
191 |
-
"Optimal training tokens: 7,070,
|
192 |
"Epochs required: 1,686\n",
|
193 |
"\n"
|
194 |
]
|
@@ -197,7 +205,9 @@
|
|
197 |
"source": [
|
198 |
"num_training_tokens = 20 * total_parameter_count\n",
|
199 |
"\n",
|
200 |
-
"num_epochs_required = round(
|
|
|
|
|
201 |
"\n",
|
202 |
"print(f\"Optimal training tokens: {num_training_tokens:,}\")\n",
|
203 |
"\n",
|
@@ -213,7 +223,7 @@
|
|
213 |
},
|
214 |
{
|
215 |
"cell_type": "code",
|
216 |
-
"execution_count":
|
217 |
"metadata": {},
|
218 |
"outputs": [
|
219 |
{
|
@@ -231,43 +241,57 @@
|
|
231 |
"output_type": "stream",
|
232 |
"text": [
|
233 |
"Attention 309,237,645,312 37.39%\n",
|
234 |
-
"MLP 412,317,
|
235 |
"RMS Norm 179,200 0.00%\n",
|
236 |
"Output Layer 105,396,568,064 12.75%\n",
|
237 |
"\n",
|
238 |
"\n",
|
239 |
-
"Total forward FLOPs: 826,
|
240 |
]
|
241 |
}
|
242 |
],
|
243 |
"source": [
|
244 |
"ops_per_matmul = 2 # Multiply + accumulate (MAC)\n",
|
245 |
-
"ops_per_activation =
|
246 |
"ops_per_rms_norm = 7 # y = (x / sqrt(rms[x] + epsilon)) * gamma\n",
|
247 |
"\n",
|
248 |
"head_dimensions = embedding_dimensions // num_attention_heads\n",
|
249 |
"\n",
|
250 |
"# K, Q, V projections\n",
|
251 |
"attention = (\n",
|
252 |
-
" ops_per_matmul
|
|
|
|
|
253 |
")\n",
|
254 |
"\n",
|
255 |
"# Attention logits\n",
|
256 |
-
"attention +=
|
|
|
|
|
257 |
"\n",
|
258 |
"# Reductions\n",
|
259 |
"attention += (\n",
|
260 |
-
" ops_per_matmul
|
|
|
|
|
261 |
")\n",
|
262 |
"\n",
|
263 |
"# Output projection\n",
|
264 |
-
"attention += ops_per_matmul *
|
265 |
"\n",
|
266 |
"attention *= num_hidden_layers\n",
|
267 |
"\n",
|
268 |
"# Linear transformations\n",
|
269 |
-
"mlp =
|
270 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
271 |
"\n",
|
272 |
"# Non-linear activations\n",
|
273 |
"mlp += ops_per_activation * (4 * embedding_dimensions)\n",
|
@@ -276,7 +300,9 @@
|
|
276 |
"\n",
|
277 |
"rms_norm = ops_per_rms_norm * embedding_dimensions * (num_hidden_layers + 1)\n",
|
278 |
"\n",
|
279 |
-
"output_layer =
|
|
|
|
|
280 |
"\n",
|
281 |
"flops = {\n",
|
282 |
" \"Attention\": attention,\n",
|
@@ -312,14 +338,14 @@
|
|
312 |
},
|
313 |
{
|
314 |
"cell_type": "code",
|
315 |
-
"execution_count":
|
316 |
"metadata": {},
|
317 |
"outputs": [
|
318 |
{
|
319 |
"name": "stdout",
|
320 |
"output_type": "stream",
|
321 |
"text": [
|
322 |
-
"Total backward FLOPs: 1,653,
|
323 |
]
|
324 |
}
|
325 |
],
|
@@ -338,14 +364,14 @@
|
|
338 |
},
|
339 |
{
|
340 |
"cell_type": "code",
|
341 |
-
"execution_count":
|
342 |
"metadata": {},
|
343 |
"outputs": [
|
344 |
{
|
345 |
"name": "stdout",
|
346 |
"output_type": "stream",
|
347 |
"text": [
|
348 |
-
"Total roundtrip FLOPs: 2,480,
|
349 |
]
|
350 |
}
|
351 |
],
|
@@ -364,24 +390,24 @@
|
|
364 |
},
|
365 |
{
|
366 |
"cell_type": "code",
|
367 |
-
"execution_count":
|
368 |
"metadata": {},
|
369 |
"outputs": [
|
370 |
{
|
371 |
"name": "stdout",
|
372 |
"output_type": "stream",
|
373 |
"text": [
|
374 |
-
"Total PaLM FLOPs: 2,481,161,
|
375 |
]
|
376 |
}
|
377 |
],
|
378 |
"source": [
|
379 |
"palm_flops_per_token = (\n",
|
380 |
" 6 * total_parameter_count\n",
|
381 |
-
" + 12 * num_hidden_layers * num_attention_heads * head_dimensions *
|
382 |
")\n",
|
383 |
"\n",
|
384 |
-
"total_palm_flops = palm_flops_per_token *
|
385 |
"\n",
|
386 |
"print(f\"Total PaLM FLOPs: {total_palm_flops:,}\")"
|
387 |
]
|
@@ -390,23 +416,30 @@
|
|
390 |
"cell_type": "markdown",
|
391 |
"metadata": {},
|
392 |
"source": [
|
393 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
394 |
]
|
395 |
},
|
396 |
{
|
397 |
"cell_type": "code",
|
398 |
-
"execution_count":
|
399 |
"metadata": {},
|
400 |
"outputs": [
|
401 |
{
|
402 |
"name": "stdout",
|
403 |
"output_type": "stream",
|
404 |
"text": [
|
405 |
-
"RTX A2000: 935.43 seconds/epoch, 18.25 days required
|
406 |
-
"RTX A4000: 348.64 seconds/epoch, 6.80 days required
|
407 |
-
"RTX 3090: 154.75 seconds/epoch, 3.02 days required
|
408 |
-
"A100 SXM: 44.01 seconds/epoch, 0.86 days required
|
409 |
-
"HGX A100: 6.79 seconds/epoch, 0.13 days required
|
410 |
]
|
411 |
}
|
412 |
],
|
@@ -421,10 +454,6 @@
|
|
421 |
" mfu: float\n",
|
422 |
"\n",
|
423 |
" @property\n",
|
424 |
-
" def percentage_utilization(self) -> float:\n",
|
425 |
-
" return self.mfu * 100\n",
|
426 |
-
"\n",
|
427 |
-
" @property\n",
|
428 |
" def actual_flops(self) -> float:\n",
|
429 |
" return self.mfu * self.advertised_flops\n",
|
430 |
"\n",
|
@@ -443,7 +472,7 @@
|
|
443 |
" days_required = num_epochs_required * seconds_per_epoch / 60 / 60 / 24\n",
|
444 |
"\n",
|
445 |
" print(\n",
|
446 |
-
" f\"{device.name}: {seconds_per_epoch:.2f} seconds/epoch, {days_required:,.2f} days required
|
447 |
" )"
|
448 |
]
|
449 |
}
|
|
|
9 |
},
|
10 |
{
|
11 |
"cell_type": "code",
|
12 |
+
"execution_count": 188,
|
13 |
"metadata": {},
|
14 |
"outputs": [],
|
15 |
"source": [
|
16 |
+
"# Model\n",
|
17 |
"vocabulary_size = 50257\n",
|
18 |
"embedding_dimensions = 1024\n",
|
19 |
"num_attention_heads = 16\n",
|
20 |
"num_hidden_layers = 24\n",
|
21 |
+
"feed_forward_ratio = 4\n",
|
22 |
+
"\n",
|
23 |
+
"# Training set\n",
|
24 |
+
"tokens_per_sample = 1024\n",
|
25 |
"samples_per_epoch = 4096"
|
26 |
]
|
27 |
},
|
|
|
34 |
},
|
35 |
{
|
36 |
"cell_type": "code",
|
37 |
+
"execution_count": 189,
|
38 |
"metadata": {},
|
39 |
"outputs": [
|
40 |
{
|
|
|
53 |
"text": [
|
54 |
"Token Embeddings 51,463,168 14.56%\n",
|
55 |
"Attention 100,663,296 28.48%\n",
|
56 |
+
"MLP 201,326,616 56.95%\n",
|
57 |
"RMS Norm 50,176 0.01%\n",
|
58 |
"Output Layer 0 0.00%\n",
|
59 |
"\n",
|
60 |
"\n",
|
61 |
+
"Total parameters: 353,503,256\n"
|
62 |
]
|
63 |
}
|
64 |
],
|
|
|
71 |
" embedding_dimensions**2 + embedding_dimensions * 3 * embedding_dimensions\n",
|
72 |
" )\n",
|
73 |
" * num_hidden_layers,\n",
|
74 |
+
" \"MLP\": embedding_dimensions\n",
|
75 |
+
" * feed_forward_ratio\n",
|
76 |
+
" * embedding_dimensions\n",
|
77 |
+
" * 2\n",
|
78 |
+
" * num_hidden_layers,\n",
|
79 |
" \"RMS Norm\": embedding_dimensions * num_hidden_layers * 2 + embedding_dimensions,\n",
|
80 |
" \"Output Layer\": 0, # Tied to token embeddings\n",
|
81 |
"}\n",
|
|
|
107 |
},
|
108 |
{
|
109 |
"cell_type": "code",
|
110 |
+
"execution_count": 190,
|
111 |
"metadata": {},
|
112 |
"outputs": [
|
113 |
{
|
|
|
133 |
},
|
134 |
{
|
135 |
"cell_type": "code",
|
136 |
+
"execution_count": 191,
|
137 |
"metadata": {},
|
138 |
"outputs": [
|
139 |
{
|
|
|
159 |
},
|
160     {
161       "cell_type": "code",
162 +     "execution_count": 192,
163       "metadata": {},
164       "outputs": [
165        {

189     },
190     {
191       "cell_type": "code",
192 +     "execution_count": 193,
193       "metadata": {},
194       "outputs": [
195        {
196         "name": "stdout",
197         "output_type": "stream",
198         "text": [
199 +       "Optimal training tokens: 7,070,065,120\n",
200         "Epochs required: 1,686\n",
201         "\n"
202        ]

205       "source": [
206        "num_training_tokens = 20 * total_parameter_count\n",
207        "\n",
208 +      "num_epochs_required = round(\n",
209 +      "    num_training_tokens / (samples_per_epoch * tokens_per_sample)\n",
210 +      ")\n",
211        "\n",
212        "print(f\"Optimal training tokens: {num_training_tokens:,}\")\n",
213        "\n",

223     },
224     {
225       "cell_type": "code",
226 +     "execution_count": 194,
227       "metadata": {},
228       "outputs": [
229        {

241         "output_type": "stream",
242         "text": [
243         "Attention     309,237,645,312    37.39%\n",
244 +       "MLP           412,317,450,240    49.86%\n",
245         "RMS Norm              179,200     0.00%\n",
246         "Output Layer  105,396,568,064    12.75%\n",
247         "\n",
248         "\n",
249 +       "Total forward FLOPs: 826,951,842,816\n"
250        ]
251        }
252       ],
253       "source": [
254        "ops_per_matmul = 2  # Multiply + accumulate (MAC)\n",
255 +      "ops_per_activation = 5  # Assuming SiLU\n",
256        "ops_per_rms_norm = 7  # y = (x / sqrt(rms[x] + epsilon)) * gamma\n",
257        "\n",
258        "head_dimensions = embedding_dimensions // num_attention_heads\n",
259        "\n",
260        "# K, Q, V projections\n",
261        "attention = (\n",
262 +      "    ops_per_matmul\n",
263 +      "    * tokens_per_sample\n",
264 +      "    * (embedding_dimensions * 3 * embedding_dimensions)\n",
265        ")\n",
266        "\n",
267        "# Attention logits\n",
268 +      "attention += (\n",
269 +      "    ops_per_matmul * tokens_per_sample * tokens_per_sample * embedding_dimensions\n",
270 +      ")\n",
271        "\n",
272        "# Reductions\n",
273        "attention += (\n",
274 +      "    ops_per_matmul\n",
275 +      "    * num_attention_heads\n",
276 +      "    * (tokens_per_sample * tokens_per_sample * head_dimensions)\n",
277        ")\n",
278        "\n",
279        "# Output projection\n",
280 +      "attention += ops_per_matmul * tokens_per_sample * embedding_dimensions**2\n",
281        "\n",
282        "attention *= num_hidden_layers\n",
283        "\n",
284        "# Linear transformations\n",
285 +      "mlp = (\n",
286 +      "    ops_per_matmul\n",
287 +      "    * tokens_per_sample\n",
288 +      "    * (embedding_dimensions * (4 * embedding_dimensions))\n",
289 +      ")\n",
290 +      "mlp += (\n",
291 +      "    ops_per_matmul\n",
292 +      "    * tokens_per_sample\n",
293 +      "    * ((4 * embedding_dimensions) * embedding_dimensions)\n",
294 +      ")\n",
295        "\n",
296        "# Non-linear activations\n",
297        "mlp += ops_per_activation * (4 * embedding_dimensions)\n",

300        "\n",
301        "rms_norm = ops_per_rms_norm * embedding_dimensions * (num_hidden_layers + 1)\n",
302        "\n",
303 +      "output_layer = (\n",
304 +      "    ops_per_matmul * tokens_per_sample * embedding_dimensions * vocabulary_size\n",
305 +      ")\n",
306        "\n",
307        "flops = {\n",
308        "    \"Attention\": attention,\n",

338     },
339     {
340       "cell_type": "code",
341 +     "execution_count": 195,
342       "metadata": {},
343       "outputs": [
344        {
345         "name": "stdout",
346         "output_type": "stream",
347         "text": [
348 +       "Total backward FLOPs: 1,653,903,685,632\n"
349        ]
350        }
351       ],

364     },
365     {
366       "cell_type": "code",
367 +     "execution_count": 196,
368       "metadata": {},
369       "outputs": [
370        {
371         "name": "stdout",
372         "output_type": "stream",
373         "text": [
374 +       "Total roundtrip FLOPs: 2,480,855,528,448\n"
375        ]
376        }
377       ],

390     },
391     {
392       "cell_type": "code",
393 +     "execution_count": 197,
394       "metadata": {},
395       "outputs": [
396        {
397         "name": "stdout",
398         "output_type": "stream",
399         "text": [
400 +       "Total PaLM FLOPs: 2,481,161,650,176\n"
401        ]
402        }
403       ],
404       "source": [
405        "palm_flops_per_token = (\n",
406        "    6 * total_parameter_count\n",
407 +      "    + 12 * num_hidden_layers * num_attention_heads * head_dimensions * tokens_per_sample\n",
408        ")\n",
409        "\n",
410 +      "total_palm_flops = palm_flops_per_token * tokens_per_sample\n",
411        "\n",
412        "print(f\"Total PaLM FLOPs: {total_palm_flops:,}\")"
413       ]

416       "cell_type": "markdown",
417       "metadata": {},
418       "source": [
419 +      "The two estimates are pretty close, so let's proceed."
420 +     ]
421 +   },
422 +   {
423 +     "cell_type": "markdown",
424 +     "metadata": {},
425 +     "source": [
426 +      "Finally, let's estimate how long it would take to train over the optimal number of tokens on some common Nvidia Ampere-generation GPU configurations. Note that these figures are theoretical and do not factor in additional overhead such as activation checkpointing or network latency."
427       ]
428     },
429     {
430       "cell_type": "code",
431 +     "execution_count": 198,
432       "metadata": {},
433       "outputs": [
434        {
435         "name": "stdout",
436         "output_type": "stream",
437         "text": [
438 +       "RTX A2000: 935.43 seconds/epoch, 18.25 days required\n",
439 +       "RTX A4000: 348.64 seconds/epoch, 6.80 days required\n",
440 +       "RTX 3090: 154.75 seconds/epoch, 3.02 days required\n",
441 +       "A100 SXM: 44.01 seconds/epoch, 0.86 days required\n",
442 +       "HGX A100: 6.79 seconds/epoch, 0.13 days required\n"
443        ]
444        }
445       ],

454        "    mfu: float\n",
455        "\n",
456        "    @property\n",
457        "    def actual_flops(self) -> float:\n",
458        "        return self.mfu * self.advertised_flops\n",
459        "\n",

472        "    days_required = num_epochs_required * seconds_per_epoch / 60 / 60 / 24\n",
473        "\n",
474        "    print(\n",
475 +      "        f\"{device.name}: {seconds_per_epoch:.2f} seconds/epoch, {days_required:,.2f} days required\"\n",
476        "    )"
477       ]
478     }
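As a sanity check on the numbers above: the backward pass is estimated at twice the forward cost, so the roundtrip is 3 × 826,951,842,816 = 2,480,855,528,448 FLOPs per sample, which agrees with the PaLM-style estimate of 2,481,161,650,176 to within about 0.01%. The estimation loop in the final cell is only partially visible in this diff, so here is a minimal self-contained sketch of the same calculation. The `Device` dataclass and the loop structure follow the visible fragments; the advertised-throughput and MFU figures below are illustrative assumptions, not the notebook's actual values.

```
from dataclasses import dataclass

ROUNDTRIP_FLOPS = 2_480_855_528_448  # Per 1,024-token sample, from the cells above
SAMPLES_PER_EPOCH = 4096
NUM_EPOCHS_REQUIRED = 1686


@dataclass
class Device:
    name: str
    advertised_flops: float  # Peak throughput in FLOP/s (assumed figure)
    mfu: float  # Assumed model FLOPs utilization, in [0, 1]

    @property
    def actual_flops(self) -> float:
        # Sustained throughput is the advertised peak scaled by MFU.
        return self.mfu * self.advertised_flops


# Hypothetical hardware figures, for illustration only.
devices = [
    Device("RTX 3090", advertised_flops=71e12, mfu=0.9),
    Device("A100 SXM", advertised_flops=312e12, mfu=0.75),
]

for device in devices:
    seconds_per_epoch = SAMPLES_PER_EPOCH * ROUNDTRIP_FLOPS / device.actual_flops
    days_required = NUM_EPOCHS_REQUIRED * seconds_per_epoch / 60 / 60 / 24

    print(
        f"{device.name}: {seconds_per_epoch:.2f} seconds/epoch,"
        f" {days_required:,.2f} days required"
    )
```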
pre-train.py → pretrain.py
RENAMED
@@ -16,6 +16,7 @@ from torch.cuda import set_device, is_available as cuda_is_available, is_bf16_su
 from torch.nn.utils import clip_grad_norm_
 from torch.distributed import init_process_group, destroy_process_group
 from torch.distributed.fsdp import FullyShardedDataParallel, ShardingStrategy
+from torch.utils.tensorboard import SummaryWriter
 
 from torchmetrics.text import Perplexity
 
@@ -38,7 +39,7 @@ DDP_BACKEND = "nccl"
 
 
 def main():
-    parser = ArgumentParser(description="
+    parser = ArgumentParser(description="Pretrain the GPT.")
 
     parser.add_argument(
        "--dataset_subset",
@@ -54,6 +55,7 @@ def main():
     parser.add_argument("--num_dataset_processes", default=8, type=int)
     parser.add_argument("--batch_size", default=1, type=int)
     parser.add_argument("--gradient_accumulation_steps", default=128, type=int)
+    parser.add_argument("--tokens_per_sample", default=1024, type=int)
     parser.add_argument("--samples_per_epoch", default=4096, type=int)
     parser.add_argument("--num_epochs", default=1686, type=int)
     parser.add_argument("--learning_rate", default=1e-2, type=float)
@@ -61,7 +63,6 @@ def main():
     parser.add_argument("--low_memory_optimizer", action="store_true")
     parser.add_argument("--max_gradient_norm", default=1.0, type=float)
     parser.add_argument("--dropout", default=0.1, type=float)
-    parser.add_argument("--block_size", default=1024, type=int)
     parser.add_argument("--embedding_dimensions", default=1024, type=int)
     parser.add_argument("--num_attention_heads", default=16, type=int)
     parser.add_argument("--num_hidden_layers", default=24, type=int)
@@ -74,6 +75,7 @@ def main():
        "--checkpoint_path", default="./checkpoints/checkpoint.pt", type=str
     )
     parser.add_argument("--resume", action="store_true")
+    parser.add_argument("--run_dir_path", default="./runs/pretrain", type=str)
     parser.add_argument("--device", default="cuda", type=str)
     parser.add_argument("--seed", default=None, type=int)
 
@@ -156,23 +158,25 @@ def main():
     torch.manual_seed(args.seed)
     random.seed(args.seed)
 
+    logger = SummaryWriter(args.run_dir_path)
+
     tokenizer = tiktoken.get_encoding(args.token_encoding)
 
     training = Fineweb(
-        tokenizer,
+        tokenizer=tokenizer,
         root_path=args.dataset_path,
         subset=args.dataset_subset,
         split="train",
-        tokens_per_sample=args.
+        tokens_per_sample=args.tokens_per_sample,
         samples_per_epoch=args.samples_per_epoch,
         num_processes=args.num_dataset_processes,
     )
     testing = Fineweb(
-        tokenizer,
+        tokenizer=tokenizer,
         root_path=args.dataset_path,
         subset=args.dataset_subset,
         split="test",
-        tokens_per_sample=args.
+        tokens_per_sample=args.tokens_per_sample,
         samples_per_epoch=args.samples_per_epoch,
         num_processes=args.num_dataset_processes,
     )
@@ -185,13 +189,12 @@ def main():
     )
 
     model_args = {
-        "
+        "vocabulary_size": tokenizer.n_vocab,
         "embedding_dimensions": args.embedding_dimensions,
         "num_heads": args.num_attention_heads,
         "num_layers": args.num_hidden_layers,
         "feed_forward_ratio": args.feed_forward_ratio,
         "dropout": args.dropout,
-        "vocabulary_size": tokenizer.n_vocab,
         "padding_index": training.PADDING_INDEX,
         "eos_index": tokenizer.eot_token,
     }
@@ -252,7 +255,7 @@ def main():
 
     register_signal_handlers()
 
-    print("
+    print("Pretraining ...")
 
     for epoch in range(starting_epoch, args.num_epochs + 1):
         total_cross_entropy, total_gradient_norm = 0.0, 0.0
@@ -292,14 +295,18 @@ def main():
 
             total_batches += 1
 
-
-
 
-
-
-
-
-
+        if IS_MASTER:
+            average_cross_entropy = total_cross_entropy / total_batches
+            average_gradient_norm = total_gradient_norm / total_steps
+
+            logger.add_scalar("cross entropy", average_cross_entropy, epoch)
+            logger.add_scalar("gradient norm", average_gradient_norm, epoch)
+
+            print(
+                f"Epoch {epoch}:",
+                f"Cross Entropy: {average_cross_entropy:.5f},",
+                f"Gradient Norm: {average_gradient_norm:.4f}",
+            )
 
         if epoch % args.eval_interval == 0 and IS_MASTER:
             model.eval()
@@ -315,6 +322,8 @@ def main():
 
         perplexity = perplexity_metric.compute()
 
+        logger.add_scalar("perplexity", perplexity, epoch)
+
        print(f"Perplexity: {perplexity:.3f}")
 
        perplexity_metric.reset()
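With the script renamed to pretrain.py and TensorBoard logging wired in, a typical run might look like the following. The flag values are examples only, and the `tensorboard` command assumes the newly pinned tensorboard package is installed.

```
python pretrain.py --batch_size=16 --gradient_accumulation_steps=8 --tokens_per_sample=1024 --run_dir_path=./runs/pretrain

tensorboard --logdir ./runs
```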
requirements.txt
CHANGED
@@ -6,5 +6,7 @@ tiktoken==0.8.0
 tqdm==4.66.6
 matplotlib==3.9.2
 safetensors==0.5.2
-
+onnx==1.17.0
+onnxscript==0.1.0.dev20250108
 onnxruntime==1.20.1
+tensorboard==2.18.0
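The new onnx and onnxscript pins support PyTorch's dynamo-based ONNX exporter, which the reworked export_model.ipynb relies on. A rough, hypothetical sketch of that flow is below; the toy model stands in for the real LightGPT checkpoint.

```
import torch
from torch import nn


class TinyLM(nn.Module):
    """A stand-in model; the real export target is a trained LightGPT."""

    def __init__(self, vocabulary_size: int = 50257, embedding_dimensions: int = 64):
        super().__init__()

        self.embedding = nn.Embedding(vocabulary_size, embedding_dimensions)
        self.output = nn.Linear(embedding_dimensions, vocabulary_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.output(self.embedding(tokens))


model = TinyLM().eval()

example_tokens = torch.randint(0, 50257, (1, 16))

# dynamo=True routes the export through torch.export and onnxscript,
# which is what the new onnx/onnxscript requirements provide.
onnx_program = torch.onnx.export(model, (example_tokens,), dynamo=True)

onnx_program.save("tiny_lm.onnx")
```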
runs/.gitignore
ADDED
@@ -0,0 +1,2 @@
+*
+!.gitignore