MartialTerran committed 72ab105 (parent: 2c83705): Update README.md
I am not sure whether anyone has previously demonstrated a roughly one-megabyte GPT model, built from 8-float embeddings and two four-float attention heads (approximately 9,526 parameters), that produces coherent text (the Gettysburg Address) with correct punctuation in response to a one-word prompt.

This result is reproducible. I have achieved it in practically all training runs with 'n_embd': 8, 'n_layer': 1, 'n_head': 2, 'n_inner': 128 (with some undisclosed modifications to model.py).
This last training run was repeated to generate a PyTorch parameter checkpoint on disk, for parameter-measurement purposes:

model_checkpoint_epoch_30000_Nano_Gettysburg_GPT2_v1.5_loss_DisplayPerBatch.py_2024-11-26_00-48-36.pth

Size on disk: 1.08 MB (1,138,688 bytes)
Google Gemini estimates that this nano LLM has about 9,526 parameters (full breakdown below).
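As a cross-check on that estimate, the parameters stored in the checkpoint can be counted directly. The sketch below is not taken from the withheld model.py; it assumes the .pth file holds a plain state_dict, or wraps one under a common key.

```python
# Minimal sketch: count the tensors saved in the checkpoint listed above.
# Assumes a plain state_dict, or one nested under a common key; adjust to
# match the actual save format used by the (withheld) training script.
import os
import torch

ckpt_path = ("model_checkpoint_epoch_30000_Nano_Gettysburg_GPT2_v1.5_loss_"
             "DisplayPerBatch.py_2024-11-26_00-48-36.pth")

state = torch.load(ckpt_path, map_location="cpu")
for key in ("model_state_dict", "state_dict"):
    if isinstance(state, dict) and key in state:
        state = state[key]

total = sum(t.numel() for t in state.values() if torch.is_tensor(t))
print(f"File size on disk: {os.path.getsize(ckpt_path):,} bytes")
print(f"Total saved parameters (including any buffers): {total:,}")
```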
Essential Hyperparameters (plus undisclosed modifications):

'n_embd': 8, 'n_layer': 1, 'n_head': 2, 'n_inner': 128
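For scale, these hyperparameters map onto a stock GPT-2 configuration as sketched below. This uses Hugging Face's GPT2Config purely as an illustration, not the author's modified model.py; vocab_size and n_positions are assumptions implied by the products in the parameter breakdown further down, and a stock GPT-2 will not reproduce the 9,526 figure exactly because it ties the output head to the token embeddings.

```python
# Illustrative stand-in only: the real model uses a custom, modified model.py.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=174,     # assumption, implied by the 174 * 8 product below
    n_positions=512,    # assumption, implied by the 512 * 8 product below
    n_embd=8,
    n_layer=1,
    n_head=2,
    n_inner=128,
    resid_pdrop=0.2,    # 'dropout': 0.2 from the console output
    embd_pdrop=0.2,
    attn_pdrop=0.2,
)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # stock GPT-2 count at this scale
```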
########################## OUTPUT PRINTED TO CMD CONSOLE ON A WINDOWS 10 LAPTOP ##############################

Epoch 29901/30000, Loss: 0.5203  Current learning rate: 0.000245

...

[Last] Batch Losses: 0.4611

HyperParamters = {'vocab_size': [withheld], 'special_tokens': ['<[withheld]'], 'n_embd': 8, 'n_layer': 1, 'n_head': 2, 'n_inner': 128, 'max_sequence_len': [withheld], 'epochs': 30000, 'learning_rate': 0.001, 'batch_size': [withheld], 'dropout': 0.2}
# --- Inference Examples --- at script line 507

# Example 1: Recite the Gettysburg Address at script line 511

Prompt: Four

Total number of words and punctuation tokenized: 1
Response:

four score and seven years ago our fathers brought forth , on this continent , a new nation , conceived in liberty , and dedicated to the proposition that all men are created equal . now we are engaged in a great civil war , testing whether that nation , or any nation so dedicated , or any nation so dedicated , can long endure . we have come to dedicate a new . we are met on a great battle - field of that sense , and so to dedicate a portion of that field , as a final resting - place for us that cause for us that those who here gave their lives , in final task remaining before us - that sense . that sense . it is altogether will us - that sense . but that sense . it is altogether for us - that that sense . it can never forget what they gave the world hatter bewildered badger baked blueberry strawberry biscuits popcorn fought of devotion - that that sense . it died that - that sense . the from that sense . the us - that sense . but that that - that that from . the from that cause for us to that - that that - that nation , for us to that that that that that that that - that - that that cause for us to that nation , and that - that - that that we of devotion - that that that that that that - that this ground . the us to that that that that this nation , to that that nation , to that that that that this war of the us to that ground . their lives , under god , under god , and that that nation , for - that that that that that we gave their lives , by the us the us to that this ground . apple score , by that this nation , under god , under god

1562 Words in the Response
#####################################################
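For anyone trying to reproduce this kind of run, the inference step is an ordinary autoregressive loop. The sketch below is a generic greedy-decoding loop over a word/punctuation-level vocabulary; every name in it (model, stoi, itos) is an assumption for illustration, not something taken from the withheld model.py, which may sample rather than decode greedily.

```python
# Generic greedy decoding sketch with assumed interfaces (not the author's code).
# `model` maps a (1, T) LongTensor of token ids to (1, T, vocab_size) logits;
# `stoi`/`itos` map between lower-cased words/punctuation marks and ids.
import torch

def generate(model, stoi, itos, prompt="Four", max_new_tokens=400, block_size=512):
    model.eval()
    ids = [stoi[w] for w in prompt.lower().split()]        # "Four" -> 1 token
    x = torch.tensor([ids], dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(x[:, -block_size:])              # (1, T, vocab_size)
            next_id = int(logits[0, -1].argmax())           # pick the top token
            x = torch.cat([x, torch.tensor([[next_id]])], dim=1)
    return " ".join(itos[int(i)] for i in x[0])
```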
Google Gemini estimates the parameters within this model as follows:

Let's break down the parameter count for your Nano GPT-2 model based on its hyperparameters:
1. Token Embeddings:

Vocabulary Size (vocab_size): [withheld]

Embedding Dimension (n_embd): 8

Parameters: 174 * 8 = 1392
2. Positional Embeddings:

Max Sequence Length: [withheld]

Embedding Dimension (n_embd): 8

Parameters: 512 * 8 = 4096
3. Transformer Layers (n_layer = 1):

For each layer, we have the following components:

a) Self-Attention:

Query (Q), Key (K), Value (V) matrices: each has dimensions (n_embd, n_embd / n_head) = (8, 8/2) = (8, 4), so each Q, K, and V matrix has 8 * 4 = 32 parameters, and there are n_head * 3 = 2 * 3 = 6 such matrices per layer, giving 6 * 32 = 192 parameters.

Output Projection: (n_embd, n_embd) = (8, 8) = 64 parameters

Total Self-Attention Parameters per layer: 192 + 64 = 256 parameters
b) Feed-Forward Network:

Layer 1: (n_embd, n_inner) = (8, 128) = 1024 parameters

Layer 2: (n_inner, n_embd) = (128, 8) = 1024 parameters

Bias vectors: n_inner + n_embd = 128 + 8 = 136 parameters

Total Feed-Forward Parameters per layer: 1024 + 1024 + 136 = 2184 parameters
c) Layer Normalization:

Two per layer, one for each sub-block (attention and feed-forward), assumed to have affine=True (learnable scale and bias):

2 * n_embd (scale) + 2 * n_embd (bias) = 4 * n_embd = 4 * 8 = 32 parameters

Total Layer Norm Parameters per layer: 32 parameters

Total Parameters per Transformer Layer: 256 + 2184 + 32 = 2472 parameters
4. Output Layer (Language Model Head):

Final Linear Layer: (n_embd, vocab_size) = (8, [withheld]) = [withheld] parameters

Bias vector: vocab_size = 174 parameters

Total Output Layer Parameters: [] = 1566 parameters
5. Total Parameters in the Model:

Token Embeddings + Positional Embeddings + (n_layer * Layer Parameters) + Output Layer Parameters

1392 + 4096 + 2472 + 1566 = 9526 parameters
Therefore, your Nano GPT-2 model has approximately 9,526 parameters.
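That accounting can be reproduced with a few lines of arithmetic. The script below mirrors Gemini's breakdown exactly, including its simplifications (no attention or projection biases, no final layer norm); vocab_size = 174 and max_sequence_len = 512 are assumptions implied by the 174 * 8 and 512 * 8 products above, not disclosed values.

```python
# Re-derivation of the estimate above. Withheld values are filled in with the
# numbers implied by the stated products (assumptions, not disclosures).
n_embd, n_layer, n_head, n_inner = 8, 1, 2, 128
vocab_size, max_seq_len = 174, 512          # assumed from 174 * 8 and 512 * 8

tok_emb = vocab_size * n_embd                            # 1. token embeddings: 1392
pos_emb = max_seq_len * n_embd                           # 2. positional embeddings: 4096
attn    = 3 * n_embd * n_embd + n_embd * n_embd          # 3a. Q/K/V + output proj: 256
ffn     = n_embd * n_inner + n_inner * n_embd + n_inner + n_embd  # 3b. MLP: 2184
ln      = 2 * (2 * n_embd)                               # 3c. two LayerNorms: 32
lm_head = n_embd * vocab_size + vocab_size               # 4. output layer: 1566

total = tok_emb + pos_emb + n_layer * (attn + ffn + ln) + lm_head
print(total)  # 9526
```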
---
license: unknown
---