calum/tinystories-gpt2-3M

@@ -7,6 +7,8 @@ model-index:
 datasets:
 - roneneldan/TinyStories
 pipeline_tag: text-generation
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -18,22 +20,58 @@ This model is a tiny (3M trainable parameters) GPT-2 model pre-trained for 3 epo
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
-More information needed
 ## Training procedure
-Trained for 400k steps (~7 hours) on 2xH100 80GB PCIe with 32vCPU and 500GB RAM on Runpod.
 ### Training hyperparameters
 The following hyperparameters were used during training:
 - learning_rate: 5e-05
 - train_batch_size: 16
 - eval_batch_size: 16
@@ -42,9 +80,6 @@ The following hyperparameters were used during training:
 - lr_scheduler_type: linear
 - num_epochs: 3.0
-### Training results
 ### Framework versions

 datasets:
 - roneneldan/TinyStories
 pipeline_tag: text-generation
+language:
+- en
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 ## Model description
+TinyStories-GPT2-3M is a replication of the TinyStories model that uses a GPT-2 architecture in place of GPT-Neo. This was a
+deliberate choice to accelerate research, as the GPT-2 architecture is more widely supported across tooling. We do not
+claim any notable performance improvements, though, as with the original model, we find a surprising degree of coherence
+in the model's output given its size.
 ## Intended uses & limitations
+Research use only. This model is NOT suitable for commercial use, per the OpenAI TOS on using their APIs to source training data.
+Note that the vocabulary this model was trained on is quite minimal. Out-of-distribution inputs will not work as well as they would with
+a larger, more general-purpose model. To observe this behaviour, try generating a few tokens after a non-trivial word like
+"Biology". The model typically treats words that did not frequently appear in training as character names in a story.
+All training data is English. As such, input in other languages is out of distribution, and will result in the model treating
+the previous input as character names, ignoring it entirely, or generating meaningless tokens.
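The out-of-distribution behaviour described above can be checked with a short script. This is a minimal sketch, assuming the checkpoint is published under the Hub id `calum/tinystories-gpt2-3M` (taken from the repository name) and that `transformers` with a PyTorch backend is installed. `continue_prompt` is a hypothetical helper, and greedy decoding is chosen only to make runs repeatable.

```python
# Assumption: Hub id inferred from the repository name of this model card.
MODEL_ID = "calum/tinystories-gpt2-3M"


def continue_prompt(prompt: str, max_new_tokens: int = 20, model_id: str = MODEL_ID) -> str:
    """Return the prompt followed by a short greedy continuation."""
    # Imported lazily so this module loads even where transformers is absent.
    from transformers import pipeline

    # Downloads the checkpoint from the Hub on first use.
    generator = pipeline("text-generation", model=model_id)
    outputs = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    return outputs[0]["generated_text"]


# Example usage (requires network access on first run):
#   print(continue_prompt("Once upon a time"))  # in-distribution prompt
#   print(continue_prompt("Biology"))           # out-of-distribution prompt
```

Comparing the continuation of `"Biology"` with that of an in-distribution prompt such as `"Once upon a time"` should make the difference visible. No particular output is claimed here.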
 ## Training and evaluation data
+Trained for 3 epochs on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) V2 dataset, produced by GPT-4.
 ## Training procedure
+Trained for 400k steps (~7 hours) on 2xH100 80GB PCIe with 32vCPU and 500GB RAM on Runpod.
+To replicate, download the GPT-4 V2 version of the TinyStories dataset alongside HuggingFace's `train_clm.py` script. Then run the following:
+```bash
+#! /bin/bash
+python train_clm.py \
+    --model_type=gpt2 \
+    --config_overrides=n_embd=64,n_layer=8,n_head=16 \
+    --tokenizer_name=gpt2 \
+    --train_file="data/TinyStoriesV2-GPT4-train.txt" \
+    --validation_file="data/TinyStoriesV2-GPT4-valid.txt" \
+    --block_size=256 \
+    --preprocessing_num_workers=8 \
+    --output_dir="out" \
+    --logging_dir="./log" \
+    --logging_steps=100 \
+    --logging_strategy=steps \
+    --save_steps=5000 \
+    --save_total_limit=10 \
+    --do_train
+```
 ### Training hyperparameters
 The following hyperparameters were used during training:
+- n_embd: 64
+- n_layer: 8
+- n_head: 16
 - learning_rate: 5e-05
 - train_batch_size: 16
 - eval_batch_size: 16
 - lr_scheduler_type: linear
 - num_epochs: 3.0
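As a rough cross-check of the model size, the architecture values above can be turned into a parameter count. This is a sketch under stated assumptions rather than the author's own accounting: it assumes the standard GPT-2 layout with the `gpt2` tokenizer vocabulary (50257 tokens), the default context length of 1024 positions, an MLP width of `4 * n_embd`, and an output head tied to the token embedding. `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(n_embd: int, n_layer: int, n_head: int,
                     vocab_size: int = 50257, n_positions: int = 1024) -> int:
    """Count trainable parameters of a GPT-2 style model with a tied output head."""
    # Heads split the embedding evenly; n_head does not change the count itself.
    assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
    n_inner = 4 * n_embd  # assumed default MLP width

    # Token and position embeddings (the output head reuses the token embedding).
    embeddings = vocab_size * n_embd + n_positions * n_embd

    # Attention: fused QKV projection plus output projection, each with a bias.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)

    # MLP: up-projection and down-projection, each with a bias.
    mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)

    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * n_embd)

    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_layer_norm


total = gpt2_param_count(n_embd=64, n_layer=8, n_head=16)
print(total)  # 3681984
```

Under these assumptions the total is about 3.7 million, in the same range as the 3M figure in the card, and the token embedding alone (50257 × 64 = 3,216,448) accounts for most of it. Each attention head has only 64 / 16 = 4 dimensions.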
 ### Framework versions