Andrew DalPino committed on
Commit ab12a97 · 1 Parent(s): 3325763

Broad improvements

Files changed (6)
  1. README.md +15 -10
  2. beam_search.py +10 -3
  3. data.py +10 -16
  4. generate.py +10 -3
  5. instruction-tune.py +24 -10
  6. pre-train.py +17 -5
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 license: apache-2.0
 datasets:
-- Skylion007/openwebtext
+- HuggingFaceFW/fineweb
 - tatsu-lab/alpaca
 language:
 - en
@@ -10,7 +10,6 @@ metrics:
 pipeline_tag: text-generation
 tags:
 - LightGPT
-- Open-source
 ---
 # LightGPT
 
@@ -28,7 +27,7 @@ LightGPT is a lightweight generative pre-trained Transformer (GPT) model for the
 
 Below is a table of recommended default model training configurations but feel free to experiment with settings on your own. See the `model_sizing.ipynb` notebook to estimate the memory and compute requirements for your model configuration.
 
-| Name | Vocab. Size | Block Size | Embedding Dim. | Attn. Heads | Layers | Params | Train Tokens |
+| Name | Vocab. Size | Block Size | Embedding Dim. | Attn. Heads | Layers | Parameters | Training Tokens |
 |---|---|---|---|---|---|---|---|
 | Small | 50,257 | 1024 | 1024 | 16 | 32 | 454M | 10B |
 | Medium | 50,257 | 1024 | 2048 | 32 | 32 | 1.7B | 20B |
@@ -57,9 +56,9 @@ For the pre-training corpus we use the Fineweb dataset which consists of about 1
 python pre-train.py
 ```
 
-> Note that it will take a while to download and pre-process the dataset the first time that the training script is run.
+**Note** that it will take a while to download and pre-process the dataset the first time that the training script is run.
 
-To customize the default "lightgpt-small" architecture you can adjust the `block_size`, `embedding_dimensions`, `num_hidden_layers`, and `num_attention_heads` arguments of the pre-training script. Refer to the `model_sizing.ipynb` notebook for an estimation of the memory and compute requirements for your chosen architecture.
+To customize the default "Small" architecture you can adjust the `block_size`, `embedding_dimensions`, `num_hidden_layers`, and `num_attention_heads` arguments of the pre-training script.
 
 ```
 python pre-train.py --block_size=2048 --embedding_dimensions=4096 --num_hidden_layers=64 --num_attention_heads=64
@@ -71,13 +70,13 @@ You can also adjust the `batch_size`, `learning_rate`, and `gradient_accumulatio
 python pre-train.py --batch_size=32 --learning_rate=0.01 --gradient_accumulation_steps=128
 ```
 
-For distributed training, use PyTorch's [torchrun](https://pytorch.org/docs/stable/elastic/run.html) extension to launch a distributed data parallel session. The example below is for executing the training script on a single node with individual 8 GPUs.
+For distributed training, use PyTorch's [torchrun](https://pytorch.org/docs/stable/elastic/run.html) extension to launch a distributed data parallel (DDP) session. The example below is for executing the training script on a single node with 8 individual GPUs.
 
 ```
 torchrun --standalone --nnodes=1 --nproc-per-node=8 pre-train.py --batch_size=16 --gradient_accumulation_steps=128
 ```
 
-> Note that when training in data-parallel mode it's important that the `gradient_accumulation_steps` divides evenly into the world size for maximum performance. For example, if we have an 8 GPU cluster, we could perform 32 gradient accumulation steps in exactly 4 passes over the network.
+**Note** that when training in data-parallel mode it's important that the `gradient_accumulation_steps` divides evenly into the world size for maximum performance. For example, if we have an 8 GPU cluster, we could perform 32 gradient accumulation steps in exactly 4 passes over the network.
 
 ## Text Generation
 
@@ -108,7 +107,9 @@ Soon ...
 | --batch_size | 1 | int | The number of samples to pass through the network at a time. |
 | --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
 | --samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
-| --learning_rate | 5e-4 | float | The global step size taken after every gradient accumulation step. |
+| --learning_rate | 5e-4 | float | The learning rate of the Adafactor optimizer. |
+| --rms_decay | -0.8 | float | The decay rate of the RMS coefficient of the Adafactor optimizer. |
+| --optimizer_low_memory | True | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
 | --max_gradient_norm | 1.0 | float | Clip gradients above this threshold before stepping. |
 | --num_epochs | 2384 | int | The number of epochs to train for. |
 | --eval_interval | 10 | int | Evaluate the model after this many epochs on the testing set. |
@@ -117,7 +118,7 @@ Soon ...
 | --num_attention_heads | 16 | int | The number of attention heads within every block. |
 | --num_hidden_layers | 32 | int | The number of attention/MLP blocks within the hidden layer of the network. |
 | --dropout | 0.1 | float | The proportion of signals to send to zero during training as regularization. |
-| --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization at the cost of about 30% more runtime per epoch. |
+| --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization during training at the cost of needing to recompute the forward pass. |
 | --ddp_sharding_level | 2 | int | The level of sharding to use for DDP training. Options are 2 or 3 for partial and full sharding respectively, or 0 for no sharding. |
 | --checkpoint_interval | 20 | int | Save the model parameters to disk every this many epochs. |
 | --checkpoint_path | "./out/checkpoint.pt" | str | The path to the checkpoint file on disk. |
@@ -132,12 +133,15 @@ Soon ...
 | --base_model_path | "./out/checkpoint.pt" | string | The path to the pre-trained model. |
 | --batch_size | 1 | int | The number of samples to pass through the network at a time. |
 | --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
-| --learning_rate | 5e-4 | float | The global step size taken after every gradient accumulation step. |
+| --learning_rate | 5e-4 | float | The learning rate of the Adafactor optimizer. |
+| --rms_decay | -0.8 | float | The decay rate of the RMS coefficient of the Adafactor optimizer. |
+| --optimizer_low_memory | True | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
 | --mask_input | False | bool | Should we mask the input part of the sample i.e. only train on the output? |
 | --rank | 8 | int | The rank of the LoRA decomposition matrices. |
 | --alpha | 1.0 | float | The strength of the LoRA signal. |
 | --dropout | 0.05 | float | The proportion of signals to send to zero during training as regularization. |
 | --num_epochs | 4 | int | The number of epochs to train for. |
+| --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization during training at the cost of needing to recompute the forward pass. |
 | --eval_interval | 1 | int | Evaluate the model after this many epochs on the testing set. |
 | --checkpoint_interval | 1 | int | Save the model parameters to disk every this many epochs. |
 | --checkpoint_path | "./out/lora_instruction.pt" | string | The path to the checkpoint file on disk. |
@@ -171,6 +175,7 @@ Soon ...
 | --seed | None | int | The seed for the random number generator. |
 
 ## References:
+>- G. Penedo, et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks.
 >- A. Radford, et al. Language Models are Unsupervised Multitask Learners, OpenAI, 2019.
 >- T. Brown, et al. Language Models are Few-Shot Learners. OpenAI, 2020.
 >- A. Kazemnejad, et al. The Impact of Positional Encoding on Length Generalization in Transformers, 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
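
The divisibility note in the updated README can be illustrated with a few lines of Python. This is a sketch only; the variable names below are hypothetical and not taken from the training script.

```
# Hypothetical illustration of splitting gradient accumulation across GPUs.
gradient_accumulation_steps = 32
world_size = 8  # e.g. a single node with 8 GPUs

# For best performance the accumulation steps should split evenly across workers.
if gradient_accumulation_steps % world_size != 0:
    raise ValueError("gradient_accumulation_steps should divide evenly across the world size.")

passes_per_gpu = gradient_accumulation_steps // world_size

print(f"Each of the {world_size} GPUs accumulates {passes_per_gpu} micro-batches per weight update.")
```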
beam_search.py CHANGED
@@ -37,12 +37,12 @@ def main():
     torch.manual_seed(args.seed)
     random.seed(args.seed)
 
-    tokenizer = tiktoken.get_encoding(Alpaca.ENCODING)
-
     checkpoint = torch.load(
         args.checkpoint_path, map_location=args.device, weights_only=True
     )
 
+    tokenizer = tiktoken.get_encoding(checkpoint["token_encoding"])
+
     model = GPT(**checkpoint["model_args"])
 
     model = torch.compile(model)
@@ -74,7 +74,14 @@ def main():
     prompt = input("Enter a prompt: ")
 
     if args.lora_path:
-        prompt = Alpaca.PROMPT_TEMPLATE.format(instruction=prompt)
+        context = input("Additional context (leave blank for none): ")
+
+        if len(context) > 0:
+            prompt = Alpaca.PROMPT_TEMPLATE_WITH_INPUT.format(
+                input=context, instruction=prompt
+            )
+        else:
+            prompt = Alpaca.PROMPT_TEMPLATE.format(instruction=prompt)
 
     prompt = tokenizer.encode_ordinary(prompt)
 
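
The `Alpaca.PROMPT_TEMPLATE` and `Alpaca.PROMPT_TEMPLATE_WITH_INPUT` constants referenced above are defined in `data.py` and are not shown in this diff. A rough sketch of what they plausibly look like, assuming the standard Alpaca instruction format:

```
# Hypothetical sketch only; the actual template constants live in data.py.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

PROMPT_TEMPLATE_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

prompt = PROMPT_TEMPLATE_WITH_INPUT.format(
    instruction="Summarize the passage.", input="LightGPT is a lightweight GPT."
)
```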
 
data.py CHANGED
@@ -5,7 +5,7 @@ from copy import deepcopy
 
 from datasets import load_dataset
 
-import tiktoken
+from tiktoken import Encoding
 
 import numpy as np
 
@@ -28,12 +28,12 @@ class Fineweb(IterableDataset):
 
     def __init__(
         self,
+        tokenizer: Encoding,
         root_path: str = "./dataset",
         subset: str | None = "sample-10BT",
         split: str = "train",
         tokens_per_sample: int = 1024,
         samples_per_epoch: int = 4096,
-        token_encoding: str = "r50k_base",
         num_processes: int = 8,
     ):
         super().__init__()
@@ -51,15 +51,12 @@ class Fineweb(IterableDataset):
         if samples_per_epoch < 1:
             raise ValueError(f"Samples per epoch must be greater than 0.")
 
-        if token_encoding not in ("r50k_base", "cl100k_base", "o200k_base"):
-            raise ValueError(f"Invalid token encoding, {token_encoding} given.")
-
-        self.tokenizer = tiktoken.get_encoding(token_encoding)
-
         dataset_name = f"fineweb-{subset}" if subset != None else "fineweb"
 
-        train_path = path.join(root_path, f"{dataset_name}-train-{token_encoding}.bin")
-        test_path = path.join(root_path, f"{dataset_name}-test-{token_encoding}.bin")
+        train_path = path.join(root_path, f"{dataset_name}-train-{tokenizer.name}.bin")
+        test_path = path.join(root_path, f"{dataset_name}-test-{tokenizer.name}.bin")
+
+        self.tokenizer = tokenizer
 
         if not path.exists(train_path) or not path.exists(test_path):
             dataset = load_dataset(
@@ -70,7 +67,7 @@ class Fineweb(IterableDataset):
             ).map(
                 self.tokenize,
                 desc="Tokenizing",
-                remove_columns=["text"],
+                remove_columns=["text", "token_count"],
                 num_proc=num_processes,
             )
 
@@ -172,9 +169,9 @@ class Alpaca(Dataset):
 
     def __init__(
         self,
+        tokenizer: Encoding,
         max_tokens_per_sample: int = 1024,
-        token_encoding: str = "r50k_base",
-        mask_input: bool = True,
+        mask_input: bool = False,
     ):
         super().__init__()
 
@@ -183,10 +180,7 @@ class Alpaca(Dataset):
                 f"Max tokens per sample must be greater than 0, {max_tokens_per_sample} given."
            )
 
-        if token_encoding not in ("r50k_base", "cl100k_base", "o200k_base"):
-            raise ValueError(f"Invalid token encoding, {token_encoding} given.")
-
-        self.tokenizer = tiktoken.get_encoding(token_encoding)
+        self.tokenizer = tokenizer
 
         self.dataset = load_dataset(self.DATASET_NAME, split="train")
 
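
With the tokenizer now injected rather than constructed inside the datasets, callers build the `tiktoken` encoding once and share it. A minimal usage sketch, assuming only the constructor signatures shown in this diff (constructing `Fineweb` triggers the download and tokenization pass on first use):

```
import tiktoken

from data import Alpaca, Fineweb

# Build the encoding once and pass it to both datasets.
tokenizer = tiktoken.get_encoding("r50k_base")

pretraining = Fineweb(tokenizer, root_path="./dataset", tokens_per_sample=1024)

fine_tuning = Alpaca(tokenizer, max_tokens_per_sample=1024, mask_input=False)
```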
 
generate.py CHANGED
@@ -38,12 +38,12 @@ def main():
     torch.manual_seed(args.seed)
     random.seed(args.seed)
 
-    tokenizer = tiktoken.get_encoding(Alpaca.ENCODING)
-
     checkpoint = torch.load(
         args.checkpoint_path, map_location=args.device, weights_only=True
     )
 
+    tokenizer = tiktoken.get_encoding(checkpoint["token_encoding"])
+
     model = GPT(**checkpoint["model_args"])
 
     model = torch.compile(model)
@@ -75,7 +75,14 @@ def main():
     prompt = input("Enter a prompt: ")
 
     if args.lora_path:
-        prompt = Alpaca.PROMPT_TEMPLATE.format(instruction=prompt)
+        context = input("Additional context (leave blank for none): ")
+
+        if len(context) > 0:
+            prompt = Alpaca.PROMPT_TEMPLATE_WITH_INPUT.format(
+                input=context, instruction=prompt
+            )
+        else:
+            prompt = Alpaca.PROMPT_TEMPLATE.format(instruction=prompt)
 
     prompt = tokenizer.encode_ordinary(prompt)
 
 
instruction-tune.py CHANGED
@@ -21,17 +21,20 @@ from tqdm import tqdm
 
 
 def main():
-    parser = ArgumentParser(description="Instruction-tune the foundation model.")
+    parser = ArgumentParser(description="Instruction-tune the GPT.")
 
     parser.add_argument("--base_model_path", default="./out/checkpoint.pt", type=str)
     parser.add_argument("--batch_size", default=1, type=int)
     parser.add_argument("--gradient_accumulation_steps", default=128, type=int)
     parser.add_argument("--learning_rate", default=5e-4, type=float)
-    parser.add_argument("--mask_input", default=True, type=bool)
+    parser.add_argument("--rms_decay", default=-0.8, type=float)
+    parser.add_argument("--optimizer_low_memory", default=True, type=bool)
+    parser.add_argument("--mask_input", default=False, type=bool)
+    parser.add_argument("--num_epochs", default=4, type=int)
     parser.add_argument("--rank", default=8, type=int)
     parser.add_argument("--alpha", default=1.0, type=float)
     parser.add_argument("--dropout", default=0.05, type=float)
-    parser.add_argument("--num_epochs", default=4, type=int)
+    parser.add_argument("--activation_checkpointing", action="store_true")
     parser.add_argument("--eval_interval", default=1, type=int)
     parser.add_argument("--checkpoint_interval", default=1, type=int)
     parser.add_argument(
@@ -66,7 +69,13 @@ def main():
 
     model_args = checkpoint["model_args"]
 
-    dataset = Alpaca(model_args["block_size"], args.mask_input)
+    tokenizer = tiktoken.get_encoding(checkpoint["token_encoding"])
+
+    dataset = Alpaca(
+        tokenizer,
+        max_tokens_per_sample=model_args["block_size"],
+        mask_input=args.mask_input,
+    )
 
     training, testing = random_split(dataset, (0.9, 0.1))
 
@@ -85,7 +94,7 @@ def main():
         shuffle=False,
     )
 
-    model = GPT(**model_args)
+    model = GPT(**model_args, activation_checkpointing=args.activation_checkpointing)
 
     model = torch.compile(model)
 
@@ -104,11 +113,12 @@ def main():
         print("Compiling model")
         model.compile()
 
-    print(f"Model has {model.num_trainable_params:,} trainable parameters")
-
-    optimizer = Adafactor(model.parameters(), lr=args.learning_rate)
-
-    perplexity_metric = Perplexity(ignore_index=dataset.PADDING_INDEX).to(args.device)
+    optimizer = Adafactor(
+        model.parameters(),
+        lr=args.learning_rate,
+        beta2_decay=args.rms_decay,
+        foreach=not args.optimizer_low_memory,
+    )
 
     starting_epoch = 1
 
@@ -125,6 +135,10 @@ def main():
 
     model.train()
 
+    print(f"Model has {model.num_trainable_params:,} trainable parameters")
+
+    perplexity_metric = Perplexity(ignore_index=dataset.PADDING_INDEX).to(args.device)
+
     print("Instruction-tuning ...")
 
     for epoch in range(starting_epoch, args.num_epochs + 1):
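
The new `--rms_decay` argument is forwarded to Adafactor as `beta2_decay`, which controls how the running average of squared gradients is decayed: the coefficient at step t is roughly 1 - t**beta2_decay, starting near zero and approaching one as training progresses. A purely illustrative sketch of the schedule:

```
# Illustrative only: approximate squared-gradient averaging coefficient per step.
rms_decay = -0.8  # the default value passed to Adafactor as beta2_decay

for step in (1, 10, 100, 1000, 10000):
    beta2_t = 1.0 - step**rms_decay
    print(f"step {step:>6}: coefficient ~ {beta2_t:.4f}")
```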
pre-train.py CHANGED
@@ -19,8 +19,10 @@ from torch.distributed.fsdp import FullyShardedDataParallel, ShardingStrategy
 
 from torchmetrics.text import Perplexity
 
-from model import GPT
+import tiktoken
+
 from data import Fineweb
+from model import GPT
 
 from tqdm import tqdm
 
@@ -41,7 +43,7 @@ def main():
     parser.add_argument(
         "--dataset_subset",
         default="sample-10BT",
-        choices=("sample-10BT", "sample-100BT", "sample-350BT", None),
+        choices=(None, "sample-10BT", "sample-100BT", "sample-350BT"),
     )
     parser.add_argument(
         "--token_encoding",
@@ -54,6 +56,8 @@ def main():
     parser.add_argument("--gradient_accumulation_steps", default=128, type=int)
     parser.add_argument("--samples_per_epoch", default=4096, type=int)
     parser.add_argument("--learning_rate", default=1e-2, type=float)
+    parser.add_argument("--rms_decay", default=-0.8, type=float)
+    parser.add_argument("--optimizer_low_memory", default=True, type=bool)
     parser.add_argument("--max_gradient_norm", default=1.0, type=float)
     parser.add_argument("--dropout", default=0.1, type=float)
     parser.add_argument("--num_epochs", default=2384, type=int)
@@ -149,22 +153,24 @@ def main():
     torch.manual_seed(args.seed)
     random.seed(args.seed)
 
+    tokenizer = tiktoken.get_encoding(args.token_encoding)
+
     training = Fineweb(
+        tokenizer,
         root_path=args.dataset_path,
         subset=args.dataset_subset,
         split="train",
         tokens_per_sample=args.block_size,
         samples_per_epoch=args.samples_per_epoch,
-        token_encoding=args.token_encoding,
         num_processes=args.num_dataset_processes,
     )
     testing = Fineweb(
+        tokenizer,
         root_path=args.dataset_path,
         subset=args.dataset_subset,
         split="test",
         tokens_per_sample=args.block_size,
         samples_per_epoch=args.samples_per_epoch,
-        token_encoding=args.token_encoding,
         num_processes=args.num_dataset_processes,
     )
 
@@ -209,7 +215,12 @@ def main():
 
     model = model.to(args.device)
 
-    optimizer = Adafactor(model.parameters(), lr=args.learning_rate)
+    optimizer = Adafactor(
+        model.parameters(),
+        lr=args.learning_rate,
+        beta2_decay=args.rms_decay,
+        foreach=not args.optimizer_low_memory,
+    )
 
     starting_epoch = 1
 
@@ -309,6 +320,7 @@ def main():
             "model_args": model_args,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
+            "token_encoding": args.token_encoding,
         }
 
         torch.save(checkpoint, args.checkpoint_path)
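
Not shown in these hunks: the `--ddp_sharding_level` option documented in the README (0, 2, or 3) presumably maps onto the FSDP `ShardingStrategy` enum imported at the top of `pre-train.py`. A hypothetical sketch of such a mapping; the actual logic in the script may differ:

```
from torch.distributed.fsdp import ShardingStrategy

# Hypothetical mapping for --ddp_sharding_level: 0 = no sharding,
# 2 = shard gradients and optimizer state, 3 = fully shard parameters too.
SHARDING_STRATEGIES = {
    0: ShardingStrategy.NO_SHARD,
    2: ShardingStrategy.SHARD_GRAD_OP,
    3: ShardingStrategy.FULL_SHARD,
}

strategy = SHARDING_STRATEGIES[2]  # e.g. args.ddp_sharding_level == 2
```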