---
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
pipeline_tag: text-generation
tags:
- small
- cpu
- open
- open-source
- crest
- lh-tech
- ai
- llm
- nanoGPT
---

# Welcome to Crest 20M Base
This is a tiny 20.75M-parameter model that demonstrates how well small language models can perform when trained on a modest amount of data.

## Training data
We trained this model on the first 100 million tokens of the 10BT sample of FineWeb-Edu for 5,000 steps, reaching a final validation loss of 4.1566.

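The actual preparation logic lives in `prepare.py` (see below), but as a rough sketch of what nanoGPT-style data prep does - tokenize text with the GPT-2 BPE and write the token ids to a flat binary file - here is an illustrative version (the function name and exact structure are ours, not necessarily the repo's):

```python
import numpy as np

def tokenize_to_bin(docs, path, enc):
    """Write token ids for `docs` to `path` as a flat uint16 array,
    appending the end-of-text id after each document (nanoGPT's data format)."""
    ids = []
    for doc in docs:
        ids.extend(enc.encode_ordinary(doc))   # a tiktoken-style encoder
        ids.append(enc.eot_token)              # 50256 for GPT-2; marks document boundaries
    arr = np.array(ids, dtype=np.uint16)       # GPT-2 ids (< 50257) fit in uint16
    arr.tofile(path)
    return arr.size

# usage sketch: tokenize_to_bin(docs, "train.bin", tiktoken.get_encoding("gpt2"))
```

The training loop then memory-maps this `.bin` file and slices random `block_size`-length windows from it.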
## Training specs
- Architecture: nanoGPT
- Parameters: 20.75M
- Train Steps: 5,000
- Learning Rate: 5e-4
- Layers: 10
- Heads: 8
- Embedding Size: 256
- Block Size (context length): 512
- Batch Size: 32
- Gradient Accumulation Steps: 4
- Compile Model: False
- Precision / Device: float16, CUDA on a Kaggle T4 16GB GPU

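The headline parameter count roughly checks out from these specs. A back-of-the-envelope sketch (this assumes nanoGPT defaults we haven't verified for this run: vocab padded to 50304, tied input/output embeddings, `bias=False`, and nanoGPT's convention of excluding position embeddings from the reported count):

```python
# Rough parameter count from the Training specs above.
n_layer, n_embd, vocab_size = 10, 256, 50304

per_block = 12 * n_embd ** 2                  # attention (4*n_embd^2) + MLP (8*n_embd^2) weights
layernorms = 2 * n_layer * n_embd + n_embd    # two LayerNorms per block + the final LayerNorm
token_emb = vocab_size * n_embd               # shared with the output head via weight tying

total = n_layer * per_block + layernorms + token_emb
print(f"{total / 1e6:.2f}M parameters")       # 20.75M
```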
## Training code
As with all of our models, you can find the full training code in this repo in the files `train.py`, `model.py`, `configurator.py`, and `prepare.py`.

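For orientation, the specs above map onto nanoGPT's standard config variables roughly like this (a sketch of a nanoGPT-style config fragment; check `train.py` in this repo for the exact names and defaults actually used):

```python
# nanoGPT-style config fragment matching the Training specs above.
n_layer = 10
n_head = 8
n_embd = 256
block_size = 512
batch_size = 32
gradient_accumulation_steps = 4
learning_rate = 5e-4
max_iters = 5000
dtype = 'float16'
compile = False  # nanoGPT's flag name; shadows the builtin inside config files
```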
## Model weights
The final model weights can be found as `model.pt` in this repo. Use `use.py` to try out the model :D

## Example outputs
**Prompt:** Artificial Intelligence is
<br>**Output:**
```plaintext
Artificial Intelligence is the ability to make intelligent decisions.
It is a process of understanding how to do things. It is designed to understand the principles of intelligence and the skills to be successful.
There are various types of intelligence and the ability to communicate information about the process. They can use more than one or more of these functions.
What is the reason for being successful is that they are successful in one or more of those of the tasks. They must be able to use the knowledge to understand and understand information about the process.
What is the best way to understand how to communicate information.
The simplest way to understand the concept of intelligence is to understand how to communicate information about the process of communication.
In addition to being successful in the process of
```

**Prompt:** The main concept of physics is
<br>**Output:**
```plaintext
The main concept of physics is the energy of the universe, the natural world, and the space in which the universe are determined.
When we are in a universe, there are no other elements to go with, or a sphere or sphere or sphere. The universe of the universe is determined by the universe, which they are based on the laws of nature and the universe.
Since we are in the universe, the universe is not just the universe, but the universe is not just the universe. The universe is determined by the universe in the universe by the universe. In the universe, the universe is determined by the universe.
For the universe, the universe is determined by the universe, because the universe is determined by the universe. The universe is determined by the universe to
```

**Prompt:** Albert Einstein was
<br>**Output:**
```plaintext
Albert Einstein was the first to study the evolution of the universe. The universe of stars in the universe is the same as the universe of stars, which is the same as the universe of stars, which is the one and the other. Astronomers are the smallest universe of stars, which are very different from other stars.
According to Einstein, this means that the universe of stars is the same with the same star, which are the same as the universe of stars. These galaxies are called stars. But if we see the universe of stars, we see the stars of stars, which are the same that are the same. As we see the universe of stars in the universe of stars in the universe of stars in the universe of stars in the universe of stars.
```

## Quick Start
*Please install tiktoken first (`pip install tiktoken`)!*

If you want to train the model yourself, boot up a fresh T4 (or any other GPU with at least 16GB of VRAM; if you have less VRAM, decrease the batch size and increase the gradient accumulation steps) and start by downloading the needed files from this repository:
```bash
mkdir crest_base_20m
cd crest_base_20m
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/prepare.py
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/model.py
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/train.py
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/configurator.py
```
The next step is to prepare the data, so run:
```bash
python3 prepare.py
```
Once all the data has been downloaded and tokenized, you can start the training:
```bash
python3 train.py
```
Then you'll have to wait until iteration 5000 is reached (the script will log something like `iter 5000: loss 4.2044, time 50601.67ms, mfu 2.23%`).

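If you're wondering how long that is in tokens: the effective batch and total token budget follow directly from the specs above, and explain the ~3.28 epochs mentioned under Limitations:

```python
# Token-budget arithmetic for this run (numbers from the Training specs).
batch_size = 32
grad_accum_steps = 4
block_size = 512
max_iters = 5000

tokens_per_iter = batch_size * grad_accum_steps * block_size  # 65,536 tokens per optimizer step
total_tokens = tokens_per_iter * max_iters                    # 327,680,000 tokens seen in total
epochs = total_tokens / 100_000_000                           # ~3.28 passes over the 100M-token set
print(tokens_per_iter, total_tokens, round(epochs, 2))
```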
## Use the final model
To use your trained model - or ours, available in this repo as `model.pt` - you can run the following (adjust `ckpt_path` if you're loading the downloaded `model.pt` instead of your own checkpoint):
```python
import os

import tiktoken
import torch

from model import GPT, GPTConfig

out_dir = 'out'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the checkpoint and rebuild the model from its saved config
ckpt_path = os.path.join(out_dir, 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)

# Strip the '_orig_mod.' prefix that torch.compile adds to state-dict keys
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k, v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.to(device)
model.eval()

enc = tiktoken.get_encoding("gpt2")

def ask_gpt(prompt, max_new_tokens=150, temperature=0.7, top_k=25):
    start_ids = enc.encode(prompt)
    x = torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...]

    with torch.no_grad():
        y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)

    # Keep only the newly generated tokens and cut off at the end-of-text marker
    full_ids = y[0].tolist()
    new_ids = full_ids[len(start_ids):]
    response = enc.decode(new_ids)
    response = response.split('<|endoftext|>')[0]
    return response

print("--- Crest Completion Chat started ---")
while True:
    user_input = input("\nYour Prompt: ")
    if user_input.lower() in ['exit', 'quit']:
        break

    completion = ask_gpt(user_input)
    print(f"\nCrest Completion: {user_input}{completion}")
    print("-" * 30)
```

This will produce something like (your prompt: "The climate change is"):
```plaintext
Crest Completion: The climate change is about as much as the global warming is changing. The climate is the result of the climate change.
In the world that is the case with extreme weather conditions and climate change, it makes the world more productive. And it makes the world more productive, like the planet’s climate change.
It’s also why we are interested in climate change, we are interested in climate change, like climate change and climate change. We are interested in climate change and climate change.
The climate change in the world is already underway. It is the next step. The world is going to grow in a world where we live in a world where we live in a global society.
While we are interested in climate change, we are interested
```

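The `temperature` and `top_k` arguments passed to `model.generate` control how each next token is picked. A minimal sketch of that sampling step (following nanoGPT's approach; illustrative only - the real logic lives in `model.py`):

```python
import torch

def sample_next_token(logits, temperature=0.7, top_k=25):
    """Pick the next token id from a (batch, vocab) tensor of logits."""
    logits = logits / temperature                        # <1 sharpens, >1 flattens the distribution
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
    logits[logits < v[:, [-1]]] = -float('inf')          # mask everything below the top_k cutoff
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)       # sample one id per batch row
```

Lower temperature and smaller `top_k` make the output more repetitive but more coherent, which matters a lot for a model this small.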
## Limitations
- This model can't chat - it's a base model!
- This model is really dumb. It was trained on only 100 million tokens, for ~3.28 epochs.
- This model is not GPT-5.4 or Opus-4.7! Definitely not. :D

## Final thoughts
We think this model is a nice demonstration of how very small models can perform on general world-knowledge data when trained for multiple epochs.
We're fairly satisfied with these results and wonder what would happen if we fine-tuned this model with SFT to make it chat.