VectorNomad committed (verified)
Commit d0e66b7 · 1 Parent(s): 3bfa762

Initial release: Arkadiko V4 base, 214M / 100B tokens

Files changed (6)
  1. README.md +165 -0
  2. config.json +22 -0
  3. model.safetensors +3 -0
  4. tokenizer.model +3 -0
  5. tokenizer_config.json +22 -0
  6. training_summary.json +35 -0
README.md ADDED
@@ -0,0 +1,165 @@
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ language:
4
+ - ar
5
+ - en
6
+ - de
7
+ - fr
8
+ - es
9
+ - it
10
+ tags:
11
+ - arkadiko
12
+ - arabic
13
+ - bilingual
14
+ - pretrained
15
+ - causal-lm
16
+ - research
17
+ library_name: transformers
18
+ pipeline_tag: text-generation
19
+ ---
20
+
21
+ # Arkadiko V4 — Base (pretrained, no SFT)
22
+
23
+ 214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. **Pretraining only — no instruction tuning, no chat alignment, no RLHF.** Released as a research artifact.
24
+
25
+ This is **V4**, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar↔en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the [Honest Limitations](#honest-limitations) section before considering use.
26
+
27
+ ## Quick facts
28
+
29
+ | | |
30
+ |---|---|
31
+ | Parameters | 213,934,720 |
32
+ | Architecture | Pure causal decoder, 18 layers |
33
+ | Hidden size | 640 |
34
+ | Attention | GQA, 10 query heads / 2 KV heads, head_dim=64 |
35
+ | FFN | SwiGLU, hidden=3456 (≈5.4×) |
36
+ | Vocab | 60,000 (SentencePiece BPE) |
37
+ | Context | 2,048 tokens |
38
+ | Position | RoPE, theta=10000 |
39
+ | Tied embeddings | No (separate `wte` and `lm_head`) |
40
+ | Tokens trained | 100,000,006,144 (~100B) |
41
+ | Training steps | 9,114,584 |
42
+ | Training hours | 524.7 |
43
+ | Hardware | 1× NVIDIA RTX PRO 4000 Blackwell (24GB) |
44
+ | Run completed | 2026-05-06 |
45
+
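+ The headline parameter count can be reproduced from the table above. A back-of-the-envelope sketch (assumes bias-free projections and untied embeddings; the small remainder is normalization weights):
+
+ ```python
+ # Rough parameter accounting from the Quick-facts table (no biases assumed).
+ vocab, d, layers, ffn, kv_heads, head_dim = 60_000, 640, 18, 3456, 2, 64
+ embeddings = vocab * d                                   # input embedding matrix
+ lm_head = vocab * d                                      # untied output head
+ attn = d * d + 2 * (kv_heads * head_dim * d) + d * d     # q, k, v, o projections
+ mlp = 3 * d * ffn                                        # SwiGLU: gate, up, down
+ total = embeddings + lm_head + layers * (attn + mlp)
+ print(f"{total:,}")   # 213,934,080, within ~1K of the reported 213,934,720
+ ```
+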
46
+ ## Final evaluation (held-out per-domain)
47
+
48
+ Loss in nats, perplexity = exp(loss). Best-ever overall val PPL was **26.6** at step 8,815k; the released final checkpoint, taken after the cosine-tail polish phase, sits at PPL ≈ 28.8.
49
+
50
+ | Domain | Val loss (MA3) | Perplexity |
51
+ |---|---|---|
52
+ | code | 1.93 | 6.9 |
53
+ | math | 3.10 | 22.1 |
54
+ | fr | 3.32 | 27.7 |
55
+ | es | 3.43 | 30.9 |
56
+ | it | 3.50 | 32.9 |
57
+ | de | 3.57 | 35.6 |
58
+ | en | 3.75 | 42.5 |
59
+ | classical (Arabic) | 3.78 | 43.7 |
60
+ | **ar (modern)** | **3.80** | **44.5** |
61
+ | **overall** | 3.36 | 28.8 |
62
+
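+ The perplexities above are just `exp(loss)` over the shipped per-domain losses and can be re-derived from `training_summary.json` in this repo:
+
+ ```python
+ # Recompute the per-domain perplexities from the shipped training summary.
+ import json
+ import math
+
+ with open("training_summary.json") as f:
+     summary = json.load(f)
+
+ for domain, loss in summary["per_domain_ma3_loss_nats"].items():
+     print(f"{domain:10s} loss={loss:.4f}  ppl={math.exp(loss):.1f}")
+ print(f"overall    ppl={math.exp(summary['final_overall_ma3_nats']):.1f}")  # ≈ 28.8
+ ```
+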
63
+ ## Training data
64
+
65
+ Approximate token counts and sources per domain:
66
+
67
+ | Domain | Tokens | Source |
68
+ |---|---|---|
69
+ | Arabic (modern) | 24B | ArabicWeb24 + cc100-ar + CulturaX-ar |
70
+ | English | 28B | FineWeb-Edu |
71
+ | German | 12B | cc100-de |
72
+ | French | 8B | cc100-fr |
73
+ | Spanish | 8B | cc100-es |
74
+ | Italian | 7B | cc100-it |
75
+ | Code | 8B | CodeParrot + StarCoderData |
76
+ | Math | 7B | OpenWebMath |
77
+ | Classical Arabic | 2.7B | Custom (hadith, tafsir, OpenITI, poetry, tashkeela) |
78
+
79
+ Single SentencePiece BPE tokenizer shared across all 9 domains. **Token-fertility is uneven** — Arabic averages roughly 2× the tokens-per-word of English in this vocab, which we believe is a primary cause of weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see [Roadmap](#roadmap)).
80
+
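+ A quick way to see the fertility gap is to encode parallel sentences with the shipped tokenizer (sketch only; assumes the `sentencepiece` package, and the sample sentences are illustrative rather than drawn from the training data):
+
+ ```python
+ # Compare tokens-per-word for a short English vs. Arabic sentence.
+ import sentencepiece as spm
+
+ sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
+ samples = {
+     "en": "The weather in the city was clear this morning.",
+     "ar": "كان الطقس في المدينة صافيا هذا الصباح.",
+ }
+ for lang, text in samples.items():
+     tokens = sp.encode(text, out_type=str)
+     words = text.split()
+     print(f"{lang}: {len(tokens)} tokens / {len(words)} words "
+           f"= {len(tokens) / len(words):.2f} tokens per word")
+ ```
+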
81
+ ## Honest limitations
82
+
83
+ This base model has known structural failures verified through completion testing across the run. Use accordingly.
84
+
85
+ 1. **Coherent generation horizon ≈ 50 tokens.** Beyond that, output drifts, loops on a topic, or collapses into repetition. This is capacity-bound at this size; SFT cannot extend it.
86
+ 2. **No factual recall in long form.** Capitals, public figures, dates — the model produces fluent confabulation, not facts. Pair it with retrieval or tools; do not deploy it as a Q&A system.
87
+ 3. **Cross-language code bleed.** Code prompts in one language frequently produce output flavored by another (JS prompt → Python output). Vocab-level issue.
88
+ 4. **Arabic — the primary target language — is the worst text domain by PPL.** Surface fluency reaches ~30-50 token spans; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
89
+ 5. **No safety alignment.** No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
90
+ 6. **No instruction-following.** Base model only. Will not reliably follow chat templates, refuse harmful requests, or call tools.
91
+
92
+ ### Configuration / tokenizer ID misalignment (read before using)
93
+
94
+ The `config.json` shipped here records the values used during training: `bos_token_id=0, eos_token_id=2, pad_token_id=1`. The actual SentencePiece model (`tokenizer.model`) defines these tokens at different IDs:
95
+
96
+ | Token | SPM ID | config.json |
97
+ |---|---|---|
98
+ | `<unk>` | 0 | (not specified) |
99
+ | `<bos>` | 1 | `bos_token_id=0` |
100
+ | `<eos>` | 2 | `eos_token_id=2` |
101
+ | `<pad>` | 3 | `pad_token_id=1` |
102
+
103
+ **Use the IDs from the SPM model when serving.** `tokenizer_config.json` lists the SPM-derived IDs in `added_tokens`. The misaligned values in `config.json` are preserved for reproducibility — the model was trained with them — but downstream code should treat the SPM model as the source of truth.
104
+
105
+ This also affects all other special tokens, which the SPM model places at IDs 7–14:
106
+
107
+ ```
108
+ <system>=7 <user>=8 <assistant>=9
109
+ <think>=10 </think>=11 <tool_call>=12 <tool_result>=13 <eot>=14
110
+ ```
111
+
112
+ `<think>` is the only special token with a paired closer; `<tool_call>` and `<tool_result>` content is bounded by `<eos>` rather than a closing tag.
113
+
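+ To confirm the tokenizer-derived IDs on your side before serving, the SPM model can be queried directly (a minimal sanity check, assuming the `sentencepiece` package):
+
+ ```python
+ # Print the SPM-defined IDs; they should match the tables above, not config.json.
+ import sentencepiece as spm
+
+ sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
+ for piece in ["<unk>", "<bos>", "<eos>", "<pad>",
+               "<system>", "<user>", "<assistant>",
+               "<think>", "</think>", "<tool_call>", "<tool_result>", "<eot>"]:
+     print(f"{piece:14s} id={sp.piece_to_id(piece)}")   # expect 0, 1, 2, 3, then 7-14
+ ```
+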
114
+ ## Loading
115
+
116
+ The model uses a custom architecture (`ArkadikoForCausalLM`) which is not part of `transformers` upstream. To load weights, use the `arkadiko/llm/model.py` definition from the project repo, or load the `safetensors` tensors directly:
117
+
118
+ ```python
119
+ import json
120
+ from safetensors.torch import load_file
121
+ state_dict = load_file("model.safetensors")
122
+ with open("config.json") as f:
+     config = json.load(f)
123
+ # Initialize your ArkadikoConfig + ArkadikoForCausalLM
124
+ # (see https://github.com/... for the model code)
125
+ # model.load_state_dict(state_dict, strict=False)
126
+ ```
127
+
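+ Even without the model code, the checkpoint can be sanity-checked by summing tensor sizes, which recovers the parameter count (assumes the `safetensors` package):
+
+ ```python
+ # Count parameters directly from the safetensors file; no model class needed.
+ from safetensors import safe_open
+
+ total = 0
+ with safe_open("model.safetensors", framework="pt", device="cpu") as f:
+     for name in f.keys():
+         total += f.get_tensor(name).numel()
+ print(f"{total:,}")   # expected: 213,934,720
+ ```
+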
128
+ The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.
129
+
130
+ ## What this artifact is good for
131
+
132
+ - **Research baseline.** Reproducible 214M / 100B-token Arabic-inclusive base.
133
+ - **SFT experiments.** Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
134
+ - **Capability-curve studies.** Final eval and run log are included; full per-checkpoint curve available on request.
135
+
136
+ ## What this artifact is **not** good for
137
+
138
+ - Production chat or assistant deployment.
139
+ - Factual question answering.
140
+ - Long-form generation (>50 tokens).
141
+ - Translation as native generation (a translation tool wrapped around any base model will likely work better than this model alone).
142
+
143
+ ## Roadmap
144
+
145
+ The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.
146
+
147
+ ## License
148
+
149
+ **CC BY-NC 4.0** — non-commercial use only. Attribution required. No warranty, no liability.
150
+
151
+ ## Citation
152
+
153
+ ```bibtex
154
+ @misc{arkadiko_v4_base_2026,
155
+ author = {{VectorNomad}},
156
+ title = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
157
+ year = {2026},
158
+ publisher = {Hugging Face},
159
+ howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
160
+ }
161
+ ```
162
+
163
+ ## Acknowledgements
164
+
165
+ Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.
config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "model_type": "arkadiko",
3
+ "architectures": [
4
+ "ArkadikoForCausalLM"
5
+ ],
6
+ "vocab_size": 60000,
7
+ "hidden_size": 640,
8
+ "num_hidden_layers": 18,
9
+ "num_attention_heads": 10,
10
+ "num_key_value_heads": 2,
11
+ "head_dim": 64,
12
+ "intermediate_size": 3456,
13
+ "ffn_mult": 5.4,
14
+ "max_position_embeddings": 2048,
15
+ "rope_theta": 10000.0,
16
+ "tie_word_embeddings": false,
17
+ "torch_dtype": "bfloat16",
18
+ "bos_token_id": 0,
19
+ "eos_token_id": 2,
20
+ "pad_token_id": 1,
21
+ "transformers_version": null
22
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:64bc2a8190c2620aaddc4151443c26fdba49f3984eaf6be643ba73ba6baa578b
3
+ size 427881984
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:45d7a06bfacb1f8112436ea508ebaac0791ea1d0c9165b0f2519d7fed5ce6168
3
+ size 1305066
tokenizer_config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "tokenizer_class": "LlamaTokenizer",
3
+ "model_max_length": 2048,
4
+ "added_tokens": {
5
+ "<unk>": 0,
6
+ "<bos>": 1,
7
+ "<eos>": 2,
8
+ "<pad>": 3,
9
+ "<system>": 7,
10
+ "<user>": 8,
11
+ "<assistant>": 9,
12
+ "<think>": 10,
13
+ "</think>": 11,
14
+ "<tool_call>": 12,
15
+ "<tool_result>": 13,
16
+ "<eot>": 14,
17
+ "<mask>": 4,
18
+ "<sep>": 5,
19
+ "<cls>": 6
20
+ },
21
+ "_arkadiko_note": "The trained model config (config.json) sets bos_token_id=0, eos_token_id=2, pad_token_id=1. The actual SPM model ships <unk>=0, <bos>=1, <eos>=2, <pad>=3. The runtime SHOULD use the tokenizer-derived IDs (this file's `added_tokens`) — config.json values are kept as-trained for reproducibility but are misaligned. See README for details."
22
+ }
training_summary.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "step": 9114584,
3
+ "total_tokens": 100000006144,
4
+ "subphase_idx": 15,
5
+ "final_eval_step": 9110000,
6
+ "final_overall_loss_nats": 3.3363,
7
+ "final_overall_ma3_nats": 3.3602,
8
+ "best_overall_loss_nats": 3.2803,
9
+ "best_overall_step": 8815000,
10
+ "per_domain_ma3_loss_nats": {
11
+ "ar": 3.7952,
12
+ "en": 3.7491,
13
+ "de": 3.5717,
14
+ "fr": 3.3201,
15
+ "es": 3.4335,
16
+ "it": 3.4953,
17
+ "code": 1.9293,
18
+ "math": 3.096,
19
+ "classical": 3.7764
20
+ },
21
+ "per_domain_ma3_ppl": {
22
+ "ar": 44.5,
23
+ "en": 42.5,
24
+ "de": 35.6,
25
+ "fr": 27.7,
26
+ "es": 30.9,
27
+ "it": 32.9,
28
+ "code": 6.9,
29
+ "math": 22.1,
30
+ "classical": 43.7
31
+ },
32
+ "training_hours": 524.7,
33
+ "hardware": "NVIDIA RTX PRO 4000 Blackwell, 24GB",
34
+ "wall_clock_end": "2026-05-06T14:43+00:00"
35
+ }