Severian committed on
Commit 36c610d
1 Parent(s): d460968

Update README.md

Files changed (1)
  1. README.md +40 -226
README.md CHANGED
@@ -9,9 +9,47 @@ pipeline_tag: text-generation
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64740cf7485a7c8e1bd51ac9/GO4MY_3adP2G9EHKZbZpg.webp" width="500" height="500">


- This model is the second trained on the experimental 'Internal Knowledge Map' dataset. Developed with the aim of going beyond usual data-processing capabilities, the model is trained to build comprehensive understanding and reasoning across a wide range of knowledge domains, following elaborate guidelines. Its reasoning is grounded in a specially curated dataset emphasizing the interrelations of diverse disciplines, which aims to synthesize, integrate, and apply complex information in ways that mimic human abstract reasoning and creative thought.

- At the core of this model's development is the desire to have LLMs engage in cognitive activity that is not limited to memory but extends to abstract reasoning, problem-solving, and the generation of new insights. To achieve this, 'Nexus-IKM-Mistral-7B' was fine-tuned until convergence using a novel Phased Training approach on this unique dataset, which left the model demonstrably more capable of generating insights and solving problems in complex, multi-disciplinary settings. This includes an improved ability to draw links between different pieces of knowledge, reason through complex scenarios, and propose innovative solutions that cut across domains including science, technology, environmental studies, and the humanities.

  Test this out and see if you find anything interesting or intriguing. I will keep iterating more versions, but this one seems like a fun and useful way to start.
 
@@ -164,227 +202,3 @@ In summary, bees contribute significantly to ecosystems beyond pollination by en

  ---

- ## Training Snapshot
-
- ```
- Step    Training Loss
- 1       3.223000
- 2       3.221300
- 3       3.215900
- 4       3.210600
- 5       3.203000
- 6       3.193500
- 7       3.184000
- 8       3.173400
- 9       3.162400
- 10      3.151500
- 11      3.140500
- 12      3.128800
- 13      3.117600
- 14      3.106700
- 15      3.095500
- 16      3.084700
- 17      3.073700
- 18      3.062700
- 19      3.052300
- 20      3.041800
-
- 201     1.273200
- 202     1.257600
- 203     1.241900
- 204     1.226100
- 205     1.210800
- 206     1.195500
- 207     1.180800
- 208     1.166000
- 209     1.151200
- 210     1.136900
- 211     1.122000
- 212     1.106600
- 213     1.091200
- 214     1.075200
- 215     1.059200
- 216     1.042900
- 217     1.026600
- 218     1.010300
- 219     0.994200
-
- 416     0.041700
- 417     0.041700
- 418     0.041600
- 419     0.041600
- 420     0.041600
- 421     0.041600
- 422     0.041500
- 423     0.041500
- 424     0.041500
- 425     0.041400
- 426     0.041400
- 427     0.041400
- 428     0.041400
- 429     0.041300
- 430     0.041300
- 431     0.041300
- 432     0.041200
- 433     0.041200
- 434     0.041200
- 435     0.041100
- 436     0.041200
- 437     0.041100
- 438     0.041100
- 439     0.041100
- 440     0.041000
- 441     0.041000
- 442     0.041000
- 443     0.040900
- 444     0.040900
- 445     0.040900
-
- 668     0.035200
- 669     0.035100
- 670     0.035100
- 671     0.035100
- 672     0.035100
- 673     0.035000
- 674     0.035000
- 675     0.035000
- 676     0.035000
- 677     0.034900
- 678     0.034900
- 679     0.034900
- 680     0.034800
- 681     0.034800
- 682     0.034800
- 683     0.034800
- 684     0.034800
- 685     0.034700
- 686     0.034700
- 687     0.034700
- 688     0.034700
- 689     0.034600
- 690     0.034600
- 691     0.034600
- 692     0.034600
- 693     0.034500
- 694     0.034500
- 695     0.034500
- 696     0.034400
- 697     0.034400
- 698     0.034400
- 699     0.034400
- 700     0.034300
- 701     0.034300
- 702     0.034300
- 703     0.034300
- 704     0.034200
- 705     0.034200
- 706     0.034200
- 707     0.034200
- 708     0.034100
- 709     0.034100
- 710     0.034100
- 711     0.034100
- 712     0.034000
- 713     0.034000
- 714     0.034000
- 715     0.034000
- 716     0.033900
- 717     0.033900
- 718     0.033800
- 719     0.033800
- 720     0.033800
- 721     0.033800
-
- 1209    0.006600
- 1210    0.006500
- 1211    0.006300
- 1212    0.006200
- 1213    0.006100
- 1214    0.006000
- 1215    0.005800
- 1216    0.005700
- 1217    0.005600
- 1218    0.005500
- 1219    0.005400
- 1220    0.005300
- 1221    0.005100
- 1222    0.004900
- 1223    0.004800
- 1224    0.004700
- 1225    0.004600
- 1226    0.004500
- 1227    0.004400
- 1228    0.004300
- 1229    0.004200
- 1230    0.004000
- 1231    0.003900
- 1232    0.003800
- 1233    0.003700
- 1234    0.003500
- 1235    0.003400
- 1236    0.003300
- 1237    0.003200
- 1238    0.003000
- 1239    0.003000
- 1240    0.002900
- 1241    0.002800
- 1242    0.002700
- 1243    0.002600
- 1244    0.002500
- 1245    0.002400
- 1246    0.002300
- 1247    0.002200
- 1248    0.002100
- 1249    0.002000
- 1250    0.001900
- 1251    0.001800
- 1252    0.001800
- 1253    0.001700
- 1254    0.001600
- 1255    0.001600
- 1256    0.001500
- 1257    0.001400
- 1258    0.001300
- 1259    0.001300
- 1260    0.001200
- 1261    0.001200
- 1262    0.001100
- 1263    0.001100
- 1264    0.001000
- 1265    0.001000
- 1266    0.000900
- 1267    0.000900
- 1268    0.000800
- 1269    0.000800
- 1270    0.000800
- 1271    0.000800
- 1272    0.000700
- 1273    0.000700
- 1274    0.000700
- 1275    0.000600
- 1276    0.000600
- 1277    0.000600
- 1278    0.000600
- 1279    0.000500
- 1280    0.000500
- 1281    0.000500
- 1282    0.000500
- 1283    0.000500
- 1284    0.000500
- 1285    0.000500
- 1286    0.000400
- 1287    0.000400
- 1288    0.000400
- 1289    0.000400
- 1290    0.000400
- 1291    0.000400
- 1292    0.000400
- 1293    0.000400
- 1294    0.000400
- 1295    0.000400
- 1296    0.000400
- 1297    0.000300
- 1298    0.000300
- ```
 
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64740cf7485a7c8e1bd51ac9/GO4MY_3adP2G9EHKZbZpg.webp" width="500" height="500">


+ **This model is the MLX-trained version on the experimental 'Internal Knowledge Map' dataset.** Training was done 100% locally on my M2 Ultra 128GB. I've found noticeable differences between the Transformers (Unsloth) version I trained in a Colab and the ones I have been training locally using MLX. I personally prefer the MLX ones, as they seem to train MUCH better and retain more of the aspects from fine-tuning. I also ripped this model to shreds in testing, and it still seems to work amazingly well. Here was my training setup:
+
+ ```
+ # Freeze the base model; only the LoRA layers added below will train
+ model.freeze()
+ for l in model.model.layers:
+     l.self_attn.q_proj = LoRALinear.from_linear(
+         l.self_attn.q_proj, r=128, lora_alpha=256, lora_dropout=0.001
+     )
+     l.self_attn.v_proj = LoRALinear.from_linear(
+         l.self_attn.v_proj, r=128, lora_alpha=256, lora_dropout=0.001
+     )
+     l.self_attn.o_proj = LoRALinear.from_linear(
+         l.self_attn.o_proj, r=128, lora_alpha=256, lora_dropout=0.001
+     )
+     l.self_attn.k_proj = LoRALinear.from_linear(
+         l.self_attn.k_proj, r=128, lora_alpha=256
+     )
+     # MoE models expose a router gate; it gets a smaller LoRA rank
+     if hasattr(l, "block_sparse_moe"):
+         l.block_sparse_moe.gate = LoRALinear.from_linear(
+             l.block_sparse_moe.gate, r=16, lora_alpha=32, lora_dropout=0.001
+         )
+
+ # Report total vs. trainable parameter counts (in millions)
+ p = sum(v.size for _, v in tree_flatten(model.parameters())) / 10**6
+ print(f"Total parameters {p:.3f}M")
+ p = sum(v.size for _, v in tree_flatten(model.trainable_parameters())) / 10**6
+ print(f"Trainable parameters {p:.3f}M")
+
+ trainingArgs = TrainingArgs(
+     batch_size=4,
+     iters=4200,
+     val_batches=25,
+     steps_per_report=10,
+     steps_per_eval=200,
+     steps_per_save=100,
+     adapter_file="adapters.npz",
+     max_seq_length=4096,
+ )
+ ```
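For context, here is roughly how a snippet like the one above slots into a full mlx-lm LoRA run. This is a minimal sketch under assumptions, not the author's actual script: the import paths (`mlx_lm.tuner.lora`, `mlx_lm.tuner.trainer`), the base-model ID, the Adam learning rate, and the `train_set`/`val_set` dataset objects are all assumed, so verify them against your installed mlx-lm version.

```
# Minimal sketch of the surrounding harness (assumed mlx-lm tuner API;
# module paths and signatures may differ by version).
import mlx.optimizers as optim
from mlx.utils import tree_flatten
from mlx_lm import load
from mlx_lm.tuner.lora import LoRALinear          # assumed module path
from mlx_lm.tuner.trainer import TrainingArgs, train  # assumed module path

# Base-model ID is an assumption, not stated in the original snippet.
model, tokenizer = load("mistralai/Mistral-7B-v0.1")

# ... apply the freeze + LoRALinear replacements and build trainingArgs
# exactly as shown in the block above ...

optimizer = optim.Adam(learning_rate=1e-5)  # assumed hyperparameter

# train_set / val_set are hypothetical dataset objects for your data.
train(
    model=model,
    tokenizer=tokenizer,
    optimizer=optimizer,
    train_dataset=train_set,
    val_dataset=val_set,
    args=trainingArgs,
)
```

The `adapters.npz` file written during training can then be loaded alongside the base weights at inference or fused back into the model with mlx-lm's fuse utility.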
 
  Test this out and see if you find anything interesting or intriguing. I will keep iterating more versions, but this one seems like a fun and useful way to start.
 

  ---