---
language:
- en
license: gpl-3.0
library_name: transformers
tags:
- text-generation
- gpt2
- causal-lm
- instruction-tuned
- sft
- rope
- grouped-query-attention
- rms-norm
- custom-architecture
- educational
- from-scratch
datasets:
- tatsu-lab/alpaca
- Skylion007/openwebtext
pipeline_tag: text-generation
model-index:
- name: TinyGPT2-IT
  results: []
---

<div align="center">

# TinyGPT2-IT

### A 95M parameter instruction-tuned language model trained from scratch on a single consumer GPU

[![GitHub](https://img.shields.io/badge/GitHub-NotShrirang%2Ftinygpt-blue?logo=github)](https://github.com/NotShrirang/tinygpt)
[![Demo](https://img.shields.io/badge/Demo-Streamlit-FF4B4B?logo=streamlit)](https://tinygpt.streamlit.app/)
[![License](https://img.shields.io/badge/License-GPL--3.0-green)](https://www.gnu.org/licenses/gpl-3.0.en.html)

</div>

---

## Overview

**TinyGPT2-IT** is an instruction-tuned variant of [TinyGPT2](https://github.com/NotShrirang/tinygpt) — a modern GPT-style architecture built from scratch in PyTorch. The base model was pretrained on ~6.7B tokens of OpenWebText, then instruction-tuned via supervised fine-tuning (SFT) on Stanford Alpaca's 52K instruction-response pairs.

The entire pipeline — pretraining, fine-tuning, and inference — runs on a **single NVIDIA RTX 3070 Ti (8 GB VRAM)**.

> This model uses a custom architecture and requires `trust_remote_code=True`.

---

## Architecture

| Component | Detail |
|---|---|
| **Parameters** | ~95M |
| **Layers** | 12 transformer blocks |
| **Attention** | Grouped Query Attention (12 query heads, 4 KV groups; see the sketch below) |
| **Embedding dim** | 768 |
| **FFN hidden dim** | 2048 |
| **Position encoding** | Rotary Position Embeddings (RoPE) |
| **Normalization** | RMSNorm |
| **Context window** | 512 tokens |
| **Vocabulary** | 50,304 (GPT-2 tiktoken + PAD token) |
| **Weight tying** | Token embedding ↔ LM head |
| **KV Cache** | Supported for efficient generation |
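
To make the attention configuration concrete, here is a minimal PyTorch sketch of grouped-query attention with 12 query heads sharing 4 KV heads (head dim 64, since 768 / 12 = 64). This is an illustration of the technique, not the repository's actual implementation; all names are assumptions, and RoPE application is omitted for brevity.

```python
import torch.nn.functional as F
from torch import nn

class GQASketch(nn.Module):
    """Illustrative grouped-query attention: 12 query heads share 4 KV heads."""
    def __init__(self, d_model=768, n_heads=12, n_kv_heads=4):
        super().__init__()
        self.n_heads, self.n_kv = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads  # 768 / 12 = 64
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        # K and V are projected to only 4 heads, shrinking the KV cache 3x
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):  # x: (batch, seq, d_model); RoPE omitted for brevity
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv, self.head_dim).transpose(1, 2)
        # Repeat each KV head 3x so every query head has a matching KV head
        k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```

With 4 KV heads instead of 12, the K/V projections and the KV cache are a third of their full multi-head size, which is the main memory saving GQA buys during generation.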

---

## Training

### Stage 1 — Pretraining

| Setting | Value |
|---|---|
| **Dataset** | OpenWebText (~6.7B tokens) |
| **Optimizer** | AdamW (fused) |
| **Effective batch size** | ~262K tokens/step (see the breakdown below) |
| **Precision** | bfloat16 + `torch.compile` |
| **Hardware** | NVIDIA RTX 3070 Ti (8 GB) |
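
An effective batch this large on an 8 GB card implies gradient accumulation. The decomposition below is a minimal sketch; the micro-batch and accumulation values are assumptions for illustration, not the repository's actual settings.

```python
# Hypothetical decomposition of the ~262K tokens/step effective batch.
context_len = 512    # model context window (from the table above)
micro_batch = 16     # sequences per forward pass (assumed)
accum_steps = 32     # gradient accumulation steps (assumed)

tokens_per_step = context_len * micro_batch * accum_steps
print(tokens_per_step)  # 262144, i.e. ~262K tokens per optimizer step
```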

### Stage 2 — Supervised Fine-Tuning (SFT)

| Setting | Value |
|---|---|
| **Dataset** | Stanford Alpaca (52K instructions) |
| **Epochs** | 3 |
| **Loss masking** | Response-only; instruction tokens are masked (see the sketch below) |
| **Final train loss** | 1.91 |
| **Final val loss** | 1.98 |
| **Final val perplexity** | 7.26 |
| **Tokens processed** | ~72M |
| **Prompt format** | `### Instruction: ... ### Response: ...` |
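
Response-only masking means the cross-entropy loss is computed only over response tokens, so the model is not trained to reproduce the instruction itself. A minimal sketch of the usual approach, using PyTorch's `ignore_index` convention (names are illustrative, not the repository's code):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Copy input_ids, then mask out the instruction/prompt portion."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # no loss on instruction tokens
    return labels

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift by one so each position predicts the next token
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
```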

---

## Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM
import tiktoken
import torch

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "NotShrirang/tinygpt2-it",
    trust_remote_code=True,
)
model.eval()

# Tokenize
enc = tiktoken.get_encoding("gpt2")
prompt = "### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
input_ids = torch.tensor([enc.encode(prompt)])

# Generate
with torch.no_grad():
    output = model.generate(
        input_ids, max_new_tokens=128, do_sample=True, temperature=0.7, top_k=40
    )

print(enc.decode(output[0].tolist()))
```

### Prompt Format

This model expects instructions in the following template:

```
### Instruction:
{your instruction here}

### Response:
```

For instructions with additional context:

```
### Instruction:
{your instruction here}

### Input:
{additional context}

### Response:
```
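
For programmatic use, a small helper can assemble prompts in this template. The function is illustrative and not part of the released code:

```python
def build_prompt(instruction: str, context: str | None = None) -> str:
    """Assemble an Alpaca-style prompt in the template above (illustrative helper)."""
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_prompt("Summarize the paragraph.", context="TinyGPT2-IT is a 95M parameter model.")
```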

---

## Example Outputs

**Factual Q&A**
```
>>> What is the capital of France?
The capital of France is Paris.
```

**Explanation**
```
>>> Explain what machine learning is in simple terms.
Machine learning is a branch of computer science that focuses on using algorithms to
identify patterns in data. These algorithms are used to analyze large amounts of data
and make predictions about future trends.
```

**Creative**
```
>>> Write a motivational quote.
"The only way to make a difference is to be bold and courageous."
```

---

## Limitations

- **Small model** — 95M parameters is far below production LLMs; expect factual errors, repetition, and limited reasoning.
- **Short context** — the 512-token window limits the length of conversations and documents.
- **Training data** — pretrained on web text and fine-tuned on synthetic Alpaca data, which may carry biases and inaccuracies.
- **Not safety-aligned** — no RLHF/DPO has been applied to this checkpoint; it may produce harmful or inappropriate content.

---

## Model Family

| Model | Params | Description | Link |
|---|---|---|---|
| TinyGPT | 51M | Standard GPT, TinyStories | [GitHub](https://github.com/NotShrirang/tinygpt) |
| TinyGPT-MoE | 85M | Mixture of Experts, TinyStories | [GitHub](https://github.com/NotShrirang/tinygpt) |
| Wikipedia-MoE | 135M | 8-expert MoE, Wikipedia/C4 | [GitHub](https://github.com/NotShrirang/tinygpt) |
| TinyGPT2 | 95M | RoPE + GQA + RMSNorm, OpenWebText | [GitHub](https://github.com/NotShrirang/tinygpt) |
| TinyGPT2.1 | 183M | Scaled TinyGPT2, FineWeb-Edu | [GitHub](https://github.com/NotShrirang/tinygpt) |
| **TinyGPT2-IT** | **95M** | **Instruction-tuned (this model)** | **You are here** |
| TinyGPT2-DPO | 95M | DPO-aligned with Anthropic HH-RLHF | [GitHub](https://github.com/NotShrirang/tinygpt) |

---

## Citation

```bibtex
@misc{tinygpt2-it,
  author    = {Shrirang Mahajan},
  title     = {TinyGPT2-IT: Instruction-Tuned 95M Parameter Language Model},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/NotShrirang/tinygpt2-it}
}
```

---

## License

This model is released under the [GPL-3.0 License](https://www.gnu.org/licenses/gpl-3.0.en.html).