mofosyne committed
Commit cde16c9 (parent: c731072)

readme update

Files changed (1): README.md (+230, -1)
README.md CHANGED
@@ -62,4 +62,233 @@ llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023. It uses C
 
 ## Replication Steps
 
- For the most current replication steps, refer to the bash script `llamafile-creation.sh` in this repo
+ For the most current replication steps, refer to the bash script `llamafile-creation.sh` in this repo. (A condensed sketch of the same flow follows the transcript below.)
+
+ ```
+ $ ./llamafile-creation.sh
+ == Prep Environment ==
+ == Build and prep the llamafile engine executable ==
+ ~/huggingface/TinyLLama-v0-llamafile/llamafile ~/huggingface/TinyLLama-v0-llamafile
+ make: Nothing to be done for 'all'.
+ make: Nothing to be done for 'all'.
+ ~/huggingface/TinyLLama-v0-llamafile
+ == What is our llamafile name going to be? ==
+ We will be aiming to generate TinyLLama-v0-5M-F16.llamafile
+ == Convert from safetensor to gguf ==
+ INFO:convert:Loading model file maykeye_tinyllama/model.safetensors
+ INFO:convert:model parameters count : 4621392 (5M)
+ INFO:convert:params = Params(n_vocab=32000, n_embd=64, n_layer=8, n_ctx=2048, n_ff=256, n_head=16, n_head_kv=16, n_experts=None, n_experts_used=None, f_norm_eps=1e-06, rope_scaling_type=None, f_rope_freq_base=None, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('maykeye_tinyllama'))
+ INFO:convert:Loaded vocab file PosixPath('maykeye_tinyllama/tokenizer.model'), type 'spm'
+ INFO:convert:Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
+ INFO:convert:Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0, 'pad': 0}, add special tokens unset>
+ INFO:convert:Writing maykeye_tinyllama/TinyLLama-v0-5M-F16.gguf, format 1
+ WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
+ INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
+ INFO:gguf.vocab:Setting special token type bos to 1
+ INFO:gguf.vocab:Setting special token type eos to 2
+ INFO:gguf.vocab:Setting special token type unk to 0
+ INFO:gguf.vocab:Setting special token type pad to 0
+ INFO:convert:[ 1/75] Writing tensor output.weight | size 32000 x 64 | type F16 | T+ 0
+ INFO:convert:[ 2/75] Writing tensor token_embd.weight | size 32000 x 64 | type F16 | T+ 0
+ INFO:convert:[ 3/75] Writing tensor blk.0.attn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[ 4/75] Writing tensor blk.0.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
+ INFO:convert:[ 5/75] Writing tensor blk.0.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[ 6/75] Writing tensor blk.0.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[ 7/75] Writing tensor blk.0.ffn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[ 8/75] Writing tensor blk.0.attn_k.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[ 9/75] Writing tensor blk.0.attn_output.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[10/75] Writing tensor blk.0.attn_q.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[11/75] Writing tensor blk.0.attn_v.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[12/75] Writing tensor blk.1.attn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[13/75] Writing tensor blk.1.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
+ INFO:convert:[14/75] Writing tensor blk.1.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[15/75] Writing tensor blk.1.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[16/75] Writing tensor blk.1.ffn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[17/75] Writing tensor blk.1.attn_k.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[18/75] Writing tensor blk.1.attn_output.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[19/75] Writing tensor blk.1.attn_q.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[20/75] Writing tensor blk.1.attn_v.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[21/75] Writing tensor blk.2.attn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[22/75] Writing tensor blk.2.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
+ INFO:convert:[23/75] Writing tensor blk.2.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[24/75] Writing tensor blk.2.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[25/75] Writing tensor blk.2.ffn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[26/75] Writing tensor blk.2.attn_k.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[27/75] Writing tensor blk.2.attn_output.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[28/75] Writing tensor blk.2.attn_q.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[29/75] Writing tensor blk.2.attn_v.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[30/75] Writing tensor blk.3.attn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[31/75] Writing tensor blk.3.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
+ INFO:convert:[32/75] Writing tensor blk.3.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[33/75] Writing tensor blk.3.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[34/75] Writing tensor blk.3.ffn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[35/75] Writing tensor blk.3.attn_k.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[36/75] Writing tensor blk.3.attn_output.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[37/75] Writing tensor blk.3.attn_q.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[38/75] Writing tensor blk.3.attn_v.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[39/75] Writing tensor blk.4.attn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[40/75] Writing tensor blk.4.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
+ INFO:convert:[41/75] Writing tensor blk.4.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[42/75] Writing tensor blk.4.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[43/75] Writing tensor blk.4.ffn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[44/75] Writing tensor blk.4.attn_k.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[45/75] Writing tensor blk.4.attn_output.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[46/75] Writing tensor blk.4.attn_q.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[47/75] Writing tensor blk.4.attn_v.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[48/75] Writing tensor blk.5.attn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[49/75] Writing tensor blk.5.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
+ INFO:convert:[50/75] Writing tensor blk.5.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[51/75] Writing tensor blk.5.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[52/75] Writing tensor blk.5.ffn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[53/75] Writing tensor blk.5.attn_k.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[54/75] Writing tensor blk.5.attn_output.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[55/75] Writing tensor blk.5.attn_q.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[56/75] Writing tensor blk.5.attn_v.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[57/75] Writing tensor blk.6.attn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[58/75] Writing tensor blk.6.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
+ INFO:convert:[59/75] Writing tensor blk.6.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[60/75] Writing tensor blk.6.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[61/75] Writing tensor blk.6.ffn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[62/75] Writing tensor blk.6.attn_k.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[63/75] Writing tensor blk.6.attn_output.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[64/75] Writing tensor blk.6.attn_q.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[65/75] Writing tensor blk.6.attn_v.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[66/75] Writing tensor blk.7.attn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[67/75] Writing tensor blk.7.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
+ INFO:convert:[68/75] Writing tensor blk.7.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[69/75] Writing tensor blk.7.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
+ INFO:convert:[70/75] Writing tensor blk.7.ffn_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:[71/75] Writing tensor blk.7.attn_k.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[72/75] Writing tensor blk.7.attn_output.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[73/75] Writing tensor blk.7.attn_q.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[74/75] Writing tensor blk.7.attn_v.weight | size 64 x 64 | type F16 | T+ 0
+ INFO:convert:[75/75] Writing tensor output_norm.weight | size 64 | type F32 | T+ 0
+ INFO:convert:Wrote maykeye_tinyllama/TinyLLama-v0-5M-F16.gguf
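+ # (Annotation, not script output: the conversion above is presumably driven by
+ # llama.cpp's convert.py; a hypothetical standalone equivalent would be
+ #   python3 convert.py maykeye_tinyllama --outtype f16 --outfile maykeye_tinyllama/TinyLLama-v0-5M-F16.gguf)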
169
+ == Generating Llamafile ==
170
+ == Test Output ==
171
+ note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
172
+ main: llamafile version 0.8.4
173
+ main: seed = 1715571182
174
+ llama_model_loader: loaded meta data with 26 key-value pairs and 75 tensors from TinyLLama-v0-5M-F16.gguf (version GGUF V3 (latest))
175
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
176
+ llama_model_loader: - kv 0: general.architecture str = llama
177
+ llama_model_loader: - kv 1: general.name str = TinyLLama
178
+ llama_model_loader: - kv 2: general.author str = mofosyne
179
+ llama_model_loader: - kv 3: general.version str = v0
180
+ llama_model_loader: - kv 4: general.url str = https://huggingface.co/mofosyne/TinyL...
181
+ llama_model_loader: - kv 5: general.description str = This gguf is ported from a first vers...
182
+ llama_model_loader: - kv 6: general.source.url str = https://huggingface.co/Maykeye/TinyLL...
183
+ llama_model_loader: - kv 7: general.source.huggingface.repository str = https://huggingface.co/Maykeye/TinyLL...
184
+ llama_model_loader: - kv 8: llama.vocab_size u32 = 32000
185
+ llama_model_loader: - kv 9: llama.context_length u32 = 2048
186
+ llama_model_loader: - kv 10: llama.embedding_length u32 = 64
187
+ llama_model_loader: - kv 11: llama.block_count u32 = 8
188
+ llama_model_loader: - kv 12: llama.feed_forward_length u32 = 256
189
+ llama_model_loader: - kv 13: llama.rope.dimension_count u32 = 4
190
+ llama_model_loader: - kv 14: llama.attention.head_count u32 = 16
191
+ llama_model_loader: - kv 15: llama.attention.head_count_kv u32 = 16
192
+ llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
193
+ llama_model_loader: - kv 17: general.file_type u32 = 1
194
+ llama_model_loader: - kv 18: tokenizer.ggml.model str = llama
195
+ llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
196
+ llama_model_loader: - kv 20: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
197
+ llama_model_loader: - kv 21: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
198
+ llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 1
199
+ llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 2
200
+ llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 0
201
+ llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 0
202
+ llama_model_loader: - type f32: 17 tensors
203
+ llama_model_loader: - type f16: 58 tensors
204
+ llm_load_vocab: special tokens definition check successful ( 259/32000 ).
205
+ llm_load_print_meta: format = GGUF V3 (latest)
206
+ llm_load_print_meta: arch = llama
207
+ llm_load_print_meta: vocab type = SPM
208
+ llm_load_print_meta: n_vocab = 32000
209
+ llm_load_print_meta: n_merges = 0
210
+ llm_load_print_meta: n_ctx_train = 2048
211
+ llm_load_print_meta: n_embd = 64
212
+ llm_load_print_meta: n_head = 16
213
+ llm_load_print_meta: n_head_kv = 16
214
+ llm_load_print_meta: n_layer = 8
215
+ llm_load_print_meta: n_rot = 4
216
+ llm_load_print_meta: n_embd_head_k = 4
217
+ llm_load_print_meta: n_embd_head_v = 4
218
+ llm_load_print_meta: n_gqa = 1
219
+ llm_load_print_meta: n_embd_k_gqa = 64
220
+ llm_load_print_meta: n_embd_v_gqa = 64
221
+ llm_load_print_meta: f_norm_eps = 0.0e+00
222
+ llm_load_print_meta: f_norm_rms_eps = 1.0e-06
223
+ llm_load_print_meta: f_clamp_kqv = 0.0e+00
224
+ llm_load_print_meta: f_max_alibi_bias = 0.0e+00
225
+ llm_load_print_meta: f_logit_scale = 0.0e+00
226
+ llm_load_print_meta: n_ff = 256
227
+ llm_load_print_meta: n_expert = 0
228
+ llm_load_print_meta: n_expert_used = 0
229
+ llm_load_print_meta: causal attn = 1
230
+ llm_load_print_meta: pooling type = 0
231
+ llm_load_print_meta: rope type = 0
232
+ llm_load_print_meta: rope scaling = linear
233
+ llm_load_print_meta: freq_base_train = 10000.0
234
+ llm_load_print_meta: freq_scale_train = 1
235
+ llm_load_print_meta: n_yarn_orig_ctx = 2048
236
+ llm_load_print_meta: rope_finetuned = unknown
237
+ llm_load_print_meta: ssm_d_conv = 0
238
+ llm_load_print_meta: ssm_d_inner = 0
239
+ llm_load_print_meta: ssm_d_state = 0
240
+ llm_load_print_meta: ssm_dt_rank = 0
241
+ llm_load_print_meta: model type = ?B
242
+ llm_load_print_meta: model ftype = F16
243
+ llm_load_print_meta: model params = 4.62 M
244
+ llm_load_print_meta: model size = 8.82 MiB (16.00 BPW)
245
+ llm_load_print_meta: general.name = TinyLLama
246
+ llm_load_print_meta: BOS token = 1 '<s>'
247
+ llm_load_print_meta: EOS token = 2 '</s>'
248
+ llm_load_print_meta: UNK token = 0 '<unk>'
249
+ llm_load_print_meta: PAD token = 0 '<unk>'
250
+ llm_load_print_meta: LF token = 13 '<0x0A>'
251
+ llm_load_tensors: ggml ctx size = 0.04 MiB
252
+ llm_load_tensors: CPU buffer size = 8.82 MiB
253
+ ..............
254
+ llama_new_context_with_model: n_ctx = 512
255
+ llama_new_context_with_model: n_batch = 512
256
+ llama_new_context_with_model: n_ubatch = 512
257
+ llama_new_context_with_model: flash_attn = 0
258
+ llama_new_context_with_model: freq_base = 10000.0
259
+ llama_new_context_with_model: freq_scale = 1
260
+ llama_kv_cache_init: CPU KV buffer size = 1.00 MiB
261
+ llama_new_context_with_model: KV self size = 1.00 MiB, K (f16): 0.50 MiB, V (f16): 0.50 MiB
262
+ llama_new_context_with_model: CPU output buffer size = 0.12 MiB
263
+ llama_new_context_with_model: CPU compute buffer size = 62.75 MiB
264
+ llama_new_context_with_model: graph nodes = 262
265
+ llama_new_context_with_model: graph splits = 1
266
+
267
+ system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
268
+ sampling:
269
+ repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
270
+ top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
271
+ mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
272
+ sampling order:
273
+ CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
274
+ generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1
275
+
276
+
277
+ <s> hello world the gruff man said no. The man was very sad and he was too scared to come back.
278
+ The man asked the man, "Why are you sad?"
279
+ The man said, "I am scared of a big surprise. I will help you."
280
+ The man looked at the boy and said, "I can help you. I can make the little boy's wings. The man makes the girl laugh. She was so kind and happy.
281
+ The boy said, "You are too mean to me. You can't give out the problem."
282
+ The girl said, "I will help you!"
283
+ The man stopped and said, "I can help you. I'm sorry for a little girl, but you must tell the boy to be careful. Do you want to be kind."
284
+ The boy smiled and said, "Yes, I want to help you. Let's go into the pond and have fun!"
285
+ The boy and the man went to the lake to the pond. They had a great time and the man was able to help.</s> [end of text]
286
+
287
+
288
+ llama_print_timings: load time = 7.35 ms
289
+ llama_print_timings: sample time = 8.40 ms / 218 runs ( 0.04 ms per token, 25958.56 tokens per second)
290
+ llama_print_timings: prompt eval time = 2.90 ms / 8 tokens ( 0.36 ms per token, 2760.52 tokens per second)
291
+ llama_print_timings: eval time = 372.10 ms / 217 runs ( 1.71 ms per token, 583.18 tokens per second)
292
+ llama_print_timings: total time = 427.19 ms / 225 tokens
293
+ Log end
294
+ ```
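
For readers adapting the process rather than rerunning it verbatim, the transcript breaks down into four steps: build the llamafile engine, convert the safetensors checkpoint to GGUF, package engine and weights into a single llamafile, and smoke-test the result. The following is a minimal sketch of those steps, assuming the standard llamafile packaging flow; the vendored paths, the `.args` contents, and the test prompt are inferred from the transcript and upstream llamafile docs, not copied from `llamafile-creation.sh`.

```sh
#!/bin/sh
set -e
# Hedged sketch of the replication flow shown in the transcript above.
# Paths and flags are assumptions, not a copy of llamafile-creation.sh.

# 1. Build the llamafile engine (the transcript shows make running in ./llamafile).
(cd llamafile && make)

# 2. Convert the upstream safetensors checkpoint to an F16 GGUF,
#    matching the INFO:convert lines above.
python3 llamafile/llama.cpp/convert.py maykeye_tinyllama \
  --outtype f16 --outfile maykeye_tinyllama/TinyLLama-v0-5M-F16.gguf

# 3. Package engine + weights + default args into one executable.
#    ".args" holds baked-in CLI defaults; the trailing "..." lets
#    run-time arguments pass through.
cp llamafile/o/llama.cpp/main/main TinyLLama-v0-5M-F16.llamafile
printf -- '%s\n' -m TinyLLama-v0-5M-F16.gguf '...' > .args
llamafile/o/llamafile/zipalign -j0 \
  TinyLLama-v0-5M-F16.llamafile \
  maykeye_tinyllama/TinyLLama-v0-5M-F16.gguf .args

# 4. Smoke-test, as in the "Test Output" section above.
./TinyLLama-v0-5M-F16.llamafile -p "hello world the gruff man said"
```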