---
base_model: Maykeye/TinyLLama-v0
language:
- en
license: apache-2.0
tags:
- llamafile
- model-conversion
- text-generation
- gguf
---

# TinyLLama-v0 - llamafile
- Model creator: [Maykeye](https://huggingface.co/Maykeye)
- Original model: [TinyLLama-v0](https://huggingface.co/Maykeye/TinyLLama-v0)

If you are interested in the internal structure of this model, you can check [Tinyllama-4.6M-v0.0-F16.dump.md](./Tinyllama-4.6M-v0.0-F16.dump.md), which is included in this repo.

## Description

* This repo is targeted towards:
  - People who just want to quickly try out the llamafile technology by running `./Tinyllama-5M-v0.2-F16.llamafile --cli -p "hello world"`, as this llamafile is only 17.6 MB in size!
  - Developers who would like a quick demo of the steps needed to convert an existing model from safetensors format to GGUF and package it into a llamafile for easy distribution (just run `llamafile-creation.sh` to retrace the steps).
  - Researchers who are curious about how far AI models can be shrunk, as the original model came from a replication attempt of a research paper.

This repo contains [llamafile](https://github.com/Mozilla-Ocho/llamafile) format model files for [Maykeye/TinyLLama-v0](https://huggingface.co/Maykeye/TinyLLama-v0), a recreation of [roneneldan/TinyStories-1M](https://huggingface.co/roneneldan/TinyStories-1M), which was part of the very interesting research paper [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759) by Ronen Eldan and Yuanzhi Li.

This is the abstract from the paper:

> Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention).

> In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.

> We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency.

> We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs. 

While Maykeye's replication effort didn't reduce the model all the way down to 1M parameters, it did get down to around 5M parameters, which is still quite an achievement among known replication efforts.

Anyway, this conversion to [llamafile](https://github.com/Mozilla-Ocho/llamafile) should give you an easy way to try out this model, and the whole [llamafile](https://github.com/Mozilla-Ocho/llamafile) ecosystem in general, since it is quite small compared to other, larger chat-capable models. As this is primarily a text generation model, it will open a web server as part of the [llamafile](https://github.com/Mozilla-Ocho/llamafile) process, but it will not engage in chat as one might expect. Instead, you give it a story prompt and it will generate a story for you. Don't expect any great stories at this size, but it's an interesting demo of how small you can squeeze AI models while still having them generate recognisable English.

## Usage In Linux

```bash
# if not already executable
chmod +x Tinyllama-5M-v0.2-F16.llamafile

# To start the llamafile in web server mode, just call it directly
./Tinyllama-5M-v0.2-F16.llamafile

# To start the llamafile in command-line mode, use this command
./Tinyllama-5M-v0.2-F16.llamafile --cli -p "A dog and a cat"
```
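
If you are running in web server mode, you can also query it over HTTP from another terminal. The following is a minimal sketch assuming the default llama.cpp server address of `http://127.0.0.1:8080` and its `/completion` endpoint; the exact port and API surface may vary between llamafile versions.

```bash
# Minimal sketch: query the running llamafile web server.
# Assumes the default listen address and the llama.cpp-style /completion endpoint.
curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A dog and a cat", "n_predict": 64}'
```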

## About llamafile

llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023. It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp binaries that run on the stock installs of six OSes for both ARM64 and AMD64.

## Replication Steps Assumptions

* You have already pulled in all the submodules, including Maykeye's model in safetensors format
* Your git has LFS configured correctly; otherwise you will hit this issue https://github.com/ggerganov/llama.cpp/issues/1994 where the safetensors file doesn't download properly (only a small pointer file is downloaded). See the setup sketch after this list.
* Within the llama.cpp repo, a [PR](https://github.com/ggerganov/llama.cpp/pull/4858) has already been merged that adds metadata override support to convert.py (to supply some missing authorship information)
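
As a rough sketch (the commands below are illustrative and assume a fresh clone of this repo), the environment prep would look something like this:

```bash
# Illustrative setup for a fresh clone; adjust to your environment.
git lfs install                          # make sure the git-lfs hooks are active
git submodule update --init --recursive  # pull in llama.cpp, llamafile and the model submodule
git lfs pull                             # fetch the real safetensors weights, not just the LFS pointer file
```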

## Replication Steps

For the most current replication steps, refer to the bash script `llamafile-creation.sh` in this repo.
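
In outline, the script converts the safetensors weights to GGUF using llama.cpp's converter (with a metadata override to restore authorship information) and then packages the GGUF into a llamafile. The sketch below is only illustrative; the exact script names, paths and flags are assumptions, so treat `llamafile-creation.sh` as the authoritative reference.

```bash
# Illustrative outline of the conversion pipeline (paths and flags are assumptions).
# 1. Convert the safetensors model to an F16 GGUF, applying metadata overrides.
python3 llama.cpp/convert_hf_to_gguf.py maykeye_tinyllama \
    --outfile maykeye_tinyllama/Tinyllama-4.6M-v0.0-F16.gguf \
    --outtype f16 \
    --metadata maykeye_tinyllama-metadata.json

# 2. Package the GGUF into a llamafile: copy the llamafile launcher, write a
#    .args file with the default arguments, then embed both with zipalign
#    (the trailing "..." line lets extra CLI arguments pass through).
cp llamafile/o/llama.cpp/main/main Tinyllama-4.6M-v0.0-F16.llamafile
printf -- '-m\nTinyllama-4.6M-v0.0-F16.gguf\n...\n' > .args
llamafile/o/llamafile/zipalign -j0 Tinyllama-4.6M-v0.0-F16.llamafile \
    maykeye_tinyllama/Tinyllama-4.6M-v0.0-F16.gguf .args
```

A full run of the script produces output along the following lines: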

```
$ ./llamafile-creation.sh
== Prep Enviroment ==
== Build and prep the llamafile engine execuable ==
~/huggingface/TinyLLama-v0-5M-F16-llamafile/llamafile ~/huggingface/TinyLLama-v0-5M-F16-llamafile
make: Nothing to be done for 'all'.
make: Nothing to be done for 'all'.
~/huggingface/TinyLLama-v0-5M-F16-llamafile
== What is our llamafile name going to be? ==
maykeye_tinyllama/Tinyllama-4.6M-v0.0-F16.gguf
We will be aiming to generate Tinyllama-4.6M-v0.0-F16.llamafile
== Convert from safetensor to gguf ==
INFO:hf-to-gguf:Loading model: maykeye_tinyllama
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,              torch.bfloat16 --> F16, shape = {64, 32000}
INFO:hf-to-gguf:token_embd.weight,          torch.bfloat16 --> F16, shape = {64, 32000}
INFO:hf-to-gguf:blk.0.attn_norm.weight,     torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.0.ffn_down.weight,      torch.bfloat16 --> F16, shape = {256, 64}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,      torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.0.ffn_up.weight,        torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.0.attn_k.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.0.attn_output.weight,   torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.0.attn_q.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.0.attn_v.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.1.attn_norm.weight,     torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.1.ffn_down.weight,      torch.bfloat16 --> F16, shape = {256, 64}
INFO:hf-to-gguf:blk.1.ffn_gate.weight,      torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.1.ffn_up.weight,        torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.1.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.1.attn_k.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.1.attn_output.weight,   torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.1.attn_q.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.1.attn_v.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.2.attn_norm.weight,     torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.2.ffn_down.weight,      torch.bfloat16 --> F16, shape = {256, 64}
INFO:hf-to-gguf:blk.2.ffn_gate.weight,      torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.2.ffn_up.weight,        torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.2.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.2.attn_k.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.2.attn_output.weight,   torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.2.attn_q.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.2.attn_v.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.3.attn_norm.weight,     torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.3.ffn_down.weight,      torch.bfloat16 --> F16, shape = {256, 64}
INFO:hf-to-gguf:blk.3.ffn_gate.weight,      torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.3.ffn_up.weight,        torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.3.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.3.attn_k.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.3.attn_output.weight,   torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.3.attn_q.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.3.attn_v.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.4.attn_norm.weight,     torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.4.ffn_down.weight,      torch.bfloat16 --> F16, shape = {256, 64}
INFO:hf-to-gguf:blk.4.ffn_gate.weight,      torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.4.ffn_up.weight,        torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.4.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.4.attn_k.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.4.attn_output.weight,   torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.4.attn_q.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.4.attn_v.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.5.attn_norm.weight,     torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.5.ffn_down.weight,      torch.bfloat16 --> F16, shape = {256, 64}
INFO:hf-to-gguf:blk.5.ffn_gate.weight,      torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.5.ffn_up.weight,        torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.5.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.5.attn_k.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.5.attn_output.weight,   torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.5.attn_q.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.5.attn_v.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.6.attn_norm.weight,     torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.6.ffn_down.weight,      torch.bfloat16 --> F16, shape = {256, 64}
INFO:hf-to-gguf:blk.6.ffn_gate.weight,      torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.6.ffn_up.weight,        torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.6.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.6.attn_k.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.6.attn_output.weight,   torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.6.attn_q.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.6.attn_v.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.7.attn_norm.weight,     torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.7.ffn_down.weight,      torch.bfloat16 --> F16, shape = {256, 64}
INFO:hf-to-gguf:blk.7.ffn_gate.weight,      torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.7.ffn_up.weight,        torch.bfloat16 --> F16, shape = {64, 256}
INFO:hf-to-gguf:blk.7.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.7.attn_k.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.7.attn_output.weight,   torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.7.attn_q.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:blk.7.attn_v.weight,        torch.bfloat16 --> F16, shape = {64, 64}
INFO:hf-to-gguf:output_norm.weight,         torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 2048
INFO:hf-to-gguf:gguf: embedding length = 64
INFO:hf-to-gguf:gguf: feed forward length = 256
INFO:hf-to-gguf:gguf: head count = 16
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 2
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting special token type pad to 0
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:maykeye_tinyllama/Tinyllama-4.6M-v0.0-F16.gguf: n_tensors = 75, total_size = 9.2M
Writing: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 9.24M/9.24M [00:00<00:00, 83.7Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to maykeye_tinyllama/Tinyllama-4.6M-v0.0-F16.gguf
== Generating Llamafile ==
== Test Output ./Tinyllama-4.6M-v0.0-F16.llamafile ==
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
main: llamafile version 0.8.9
main: seed  = 1721461448
llama_model_loader: loaded meta data with 33 key-value pairs and 75 tensors from Tinyllama-4.6M-v0.0-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = TinyLLama
llama_model_loader: - kv   3:                             general.author str              = Maykeye
llama_model_loader: - kv   4:                            general.version str              = v0.0
llama_model_loader: - kv   5:                        general.description str              = This gguf is ported from a first vers...
llama_model_loader: - kv   6:                       general.quantized_by str              = Mofosyne
llama_model_loader: - kv   7:                         general.size_label str              = 4.6M
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                                general.url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  10:                         general.source.url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  11:                               general.tags arr[str,5]       = ["text generation", "transformer", "l...
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                           general.datasets arr[str,2]       = ["https://huggingface.co/datasets/ron...
llama_model_loader: - kv  14:                          llama.block_count u32              = 8
llama_model_loader: - kv  15:                       llama.context_length u32              = 2048
llama_model_loader: - kv  16:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  17:                  llama.feed_forward_length u32              = 256
llama_model_loader: - kv  18:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  19:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                          general.file_type u32              = 1
llama_model_loader: - kv  21:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  22:                 llama.rope.dimension_count u32              = 4
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   17 tensors
llama_model_loader: - type  f16:   58 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 64
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 8
llm_load_print_meta: n_rot            = 4
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 4
llm_load_print_meta: n_embd_head_v    = 4
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 64
llm_load_print_meta: n_embd_v_gqa     = 64
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 256
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 4.62 M
llm_load_print_meta: model size       = 8.82 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = TinyLLama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.04 MiB
llm_load_tensors:        CPU buffer size =     8.82 MiB
..............
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     1.00 MiB
llama_new_context_with_model: KV self size  =    1.00 MiB, K (f16):    0.50 MiB, V (f16):    0.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    62.75 MiB
llama_new_context_with_model: graph nodes  = 262
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


 hello world the gruff man said no. The man was very sad and wanted to see what was wrong. He asked the man if they could do it. But they did not like his way to the park.
One day, the man decided to go in and he took off his own new home. He gave the bird a little bit of his friend. He said he had to find a way to hide it in his woods. The man was very happy, but he knew he needed to make it in the yard.
The man was very sad and he could not find the bird. He didn't want to get to the park and his friend was very sad. They could not find the bird and his friend. But the man was too sad. He had no friends and no friends. [end of text]


llama_print_timings:        load time =      10.26 ms
llama_print_timings:      sample time =       6.03 ms /   156 runs   (    0.04 ms per token, 25879.23 tokens per second)
llama_print_timings: prompt eval time =       2.16 ms /     8 tokens (    0.27 ms per token,  3696.86 tokens per second)
llama_print_timings:        eval time =     748.08 ms /   155 runs   (    4.83 ms per token,   207.20 tokens per second)
llama_print_timings:       total time =     800.80 ms /   163 tokens
Log end
```