Initial release: Arkadiko V4 base, 214M / 100B tokens
- README.md +165 -0
- config.json +22 -0
- model.safetensors +3 -0
- tokenizer.model +3 -0
- tokenizer_config.json +22 -0
- training_summary.json +35 -0
README.md
ADDED
---
license: cc-by-nc-4.0
language:
- ar
- en
- de
- fr
- es
- it
tags:
- arkadiko
- arabic
- bilingual
- pretrained
- causal-lm
- research
library_name: transformers
pipeline_tag: text-generation
---

# Arkadiko V4 — Base (pretrained, no SFT)

214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. **Pretraining only — no instruction tuning, no chat alignment, no RLHF.** Released as a research artifact.

This is **V4**, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar↔en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the [Honest Limitations](#honest-limitations) section before considering use.

## Quick facts

| | |
|---|---|
| Parameters | 213,934,720 |
| Architecture | Pure causal decoder, 18 layers |
| Hidden size | 640 |
| Attention | GQA, 10 query heads / 2 KV heads, head_dim=64 |
| FFN | SwiGLU, hidden=3456 (≈5.4×) |
| Vocab | 60,000 (SentencePiece BPE) |
| Context | 2,048 tokens |
| Position | RoPE, theta=10000 |
| Tied embeddings | No (separate `wte` and `lm_head`) |
| Tokens trained | 100,000,006,144 (~100B) |
| Training steps | 9,114,584 |
| Training hours | 524.7 |
| Hardware | 1× NVIDIA RTX PRO 4000 Blackwell (24GB) |
| Run completed | 2026-05-06 |

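For anyone auditing the architecture, the parameter total can be reproduced from the table above. Below is a back-of-envelope sketch, assuming bias-free projections and a single learned final-norm vector (the norm parameterization is an assumption, not something `config.json` states); under those assumptions the arithmetic lands exactly on 213,934,720.

```python
# Back-of-envelope parameter count from the Quick facts table.
# Assumptions: no biases on projections; one learned final norm of size d.
# Per-layer norm scales, if present, would add 18 * 2 * 640 = 23,040 on top.
V, d, n_layers = 60_000, 640, 18    # vocab, hidden size, decoder layers
n_q, n_kv, head_dim = 10, 2, 64     # GQA query heads, KV heads, head dim
ffn = 3_456                         # SwiGLU hidden size

embeddings = 2 * V * d              # untied wte + lm_head
attn = (d * n_q * head_dim          # q projection
        + 2 * d * n_kv * head_dim   # k and v projections
        + n_q * head_dim * d)       # o projection
swiglu = 3 * d * ffn                # gate, up, down projections

total = embeddings + n_layers * (attn + swiglu) + d   # + final norm
print(f"{total:,}")                 # 213,934,720
```

Note that the untied embedding matrices alone are ~76.8M parameters, roughly 36% of the model at this scale.
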
## Final evaluation (held-out per-domain)

Loss is reported in nats; perplexity = exp(loss). Best-ever overall validation PPL was **26.6** at step 8,815k; the released final checkpoint sits at PPL ~28.8 (MA3 at the end of the cosine-tail polish phase).

| Domain | Val loss (MA3) | Perplexity |
|---|---|---|
| code | 1.93 | 6.9 |
| math | 3.10 | 22.1 |
| fr | 3.32 | 27.7 |
| es | 3.43 | 30.9 |
| it | 3.50 | 32.9 |
| de | 3.57 | 35.6 |
| en | 3.75 | 42.5 |
| classical (Arabic) | 3.78 | 43.7 |
| **ar (modern)** | **3.80** | **44.5** |
| **overall** | 3.36 | 28.8 |

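Since `training_summary.json` ships alongside the weights, the perplexity column above can be regenerated directly from the stored MA3 losses; a minimal sketch:

```python
import json
import math

# Recompute the per-domain PPL column from the shipped MA3 losses (nats).
with open("training_summary.json") as f:
    summary = json.load(f)

for domain, loss in sorted(summary["per_domain_ma3_loss_nats"].items(),
                           key=lambda item: item[1]):
    print(f"{domain:<10} loss={loss:.2f}  ppl={math.exp(loss):.1f}")

overall = summary["final_overall_ma3_nats"]
print(f"{'overall':<10} loss={overall:.2f}  ppl={math.exp(overall):.1f}")  # ~28.8
```
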
## Training data

Approximate token counts by domain:

| Domain | Tokens | Source |
|---|---|---|
| Arabic (modern) | 24B | ArabicWeb24 + cc100-ar + CulturaX-ar |
| English | 28B | FineWeb-Edu |
| German | 12B | cc100-de |
| French | 8B | cc100-fr |
| Spanish | 8B | cc100-es |
| Italian | 7B | cc100-it |
| Code | 8B | CodeParrot + StarCoderData |
| Math | 7B | OpenWebMath |
| Classical Arabic | 2.7B | Custom (hadith, tafsir, OpenITI, poetry, tashkeela) |

A single SentencePiece BPE tokenizer is shared across all 9 domains. **Token fertility is uneven** — Arabic averages roughly 2× the tokens-per-word of English in this vocab, which we believe is a primary cause of the weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see [Roadmap](#roadmap)).

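The fertility gap is straightforward to measure against the shipped `tokenizer.model`; a rough sketch using the `sentencepiece` library (the two sample sentences are illustrative placeholders, and the exact ratio will vary with the text):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Illustrative sentences only; fertility should be averaged over a real corpus.
samples = {
    "en": "The committee published its annual report on water policy.",
    "ar": "نشرت اللجنة تقريرها السنوي حول سياسة المياه.",
}

for lang, text in samples.items():
    n_tokens = len(sp.encode(text, out_type=str))
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens / {n_words} words "
          f"= {n_tokens / n_words:.2f} tokens per word")
```
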
## Honest limitations

This base model has known structural failures, verified through completion testing across the run. Use accordingly.

1. **Coherent generation horizon ≈ 50 tokens.** Past that, expect drift, topic looping, or repetition. This is capacity-bound at this size; SFT cannot extend it.
2. **No factual recall in long form.** Capitals, public figures, dates — the model produces fluent confabulation, not facts. Pair with retrieval/tools; do not deploy as a Q&A system.
3. **Cross-language code bleed.** Code prompts in one language frequently produce output flavored by another (JS prompt → Python output). This is a vocab-level issue.
4. **Arabic — the primary target language — is the second-worst text domain by PPL.** Surface fluency reaches ~30-50 token spans; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
5. **No safety alignment.** No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
6. **No instruction-following.** Base model only. It will not reliably follow chat templates, refuse harmful requests, or call tools.

### Configuration / tokenizer ID misalignment (read before using)

The `config.json` shipped here records the values used during training: `bos_token_id=0, eos_token_id=2, pad_token_id=1`. The actual SentencePiece model (`tokenizer.model`) defines these tokens at shifted IDs (only `<eos>` happens to agree):

| Token | SPM ID | config.json |
|---|---|---|
| `<unk>` | 0 | (not specified) |
| `<bos>` | 1 | `bos_token_id=0` |
| `<eos>` | 2 | `eos_token_id=2` |
| `<pad>` | 3 | `pad_token_id=1` |

**Use the IDs from the SPM model when serving.** `tokenizer_config.json` lists the SPM-derived IDs in `added_tokens`. The misaligned values in `config.json` are preserved for reproducibility — the model was trained with them — but downstream code should treat the SPM model as the source of truth.

This also affects all other special tokens, which the SPM model places at IDs 7–14:

```
<system>=7 <user>=8 <assistant>=9
<think>=10 </think>=11 <tool_call>=12 <tool_result>=13 <eot>=14
```

`<think>` is the only special token with a paired closer; `<tool_call>` and `<tool_result>` content is bounded by `<eos>` rather than a closing tag.

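When wiring up a serving stack, the least error-prone route is to query the SPM model itself for these IDs rather than trusting either JSON file; a minimal check, again with the `sentencepiece` library:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Print the IDs as the SPM model defines them (the source of truth for serving).
specials = ["<unk>", "<bos>", "<eos>", "<pad>",
            "<system>", "<user>", "<assistant>",
            "<think>", "</think>", "<tool_call>", "<tool_result>", "<eot>"]
for token in specials:
    print(f"{token:<14} id={sp.piece_to_id(token)}")

# Expected per the tables above: <unk>=0, <bos>=1, <eos>=2, <pad>=3, the rest at 7-14.
# If generation configs are built from config.json, override bos/pad with these values.
```
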
## Loading

The model uses a custom architecture (`ArkadikoForCausalLM`) that is not part of `transformers` upstream. To load the weights, use the `arkadiko/llm/model.py` definition from the project repo, or load the `safetensors` tensors directly:

```python
import json

from safetensors.torch import load_file

state_dict = load_file("model.safetensors")
with open("config.json") as f:
    config = json.load(f)

# Initialize your ArkadikoConfig + ArkadikoForCausalLM
# (see https://github.com/... for the model code), then:
# model.load_state_dict(state_dict, strict=False)
```

The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.

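Before mapping the tensors onto your own module definition, it can be worth sanity-checking the checkpoint contents; a small inspection sketch that makes no assumption about the tensor naming scheme:

```python
from collections import Counter

from safetensors.torch import load_file

state_dict = load_file("model.safetensors")

n_params = sum(t.numel() for t in state_dict.values())
dtypes = Counter(str(t.dtype) for t in state_dict.values())

print(f"tensors:    {len(state_dict)}")
print(f"parameters: {n_params:,}")   # expected to match the 213,934,720 in Quick facts
print(f"dtypes:     {dict(dtypes)}")

# Peek at a few entries to line names/shapes up with your ArkadikoForCausalLM fields.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```
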
## What this artifact is good for

- **Research baseline.** Reproducible 214M / 100B-token Arabic-inclusive base.
- **SFT experiments.** Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
- **Capability-curve studies.** Final eval and run log are included; full per-checkpoint curve available on request.

## What this artifact is **not** good for

- Production chat or assistant deployment.
- Factual question answering.
- Long-form generation (>50 tokens).
- Translation as native generation. (A translation-tool wrapper around any base may work better than this model alone.)

## Roadmap

The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.

## License

**CC BY-NC 4.0** — non-commercial use only. Attribution required. No warranty, no liability.

## Citation

```bibtex
@misc{arkadiko_v4_base_2026,
  author = {{VectorNomad}},
  title = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
}
```

## Acknowledgements

Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.
config.json
ADDED
{
  "model_type": "arkadiko",
  "architectures": [
    "ArkadikoForCausalLM"
  ],
  "vocab_size": 60000,
  "hidden_size": 640,
  "num_hidden_layers": 18,
  "num_attention_heads": 10,
  "num_key_value_heads": 2,
  "head_dim": 64,
  "intermediate_size": 3456,
  "ffn_mult": 5.4,
  "max_position_embeddings": 2048,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "bos_token_id": 0,
  "eos_token_id": 2,
  "pad_token_id": 1,
  "transformers_version": null
}
model.safetensors
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:64bc2a8190c2620aaddc4151443c26fdba49f3984eaf6be643ba73ba6baa578b
size 427881984
tokenizer.model
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:45d7a06bfacb1f8112436ea508ebaac0791ea1d0c9165b0f2519d7fed5ce6168
size 1305066
tokenizer_config.json
ADDED
{
  "tokenizer_class": "LlamaTokenizer",
  "model_max_length": 2048,
  "added_tokens": {
    "<unk>": 0,
    "<bos>": 1,
    "<eos>": 2,
    "<pad>": 3,
    "<system>": 7,
    "<user>": 8,
    "<assistant>": 9,
    "<think>": 10,
    "</think>": 11,
    "<tool_call>": 12,
    "<tool_result>": 13,
    "<eot>": 14,
    "<mask>": 4,
    "<sep>": 5,
    "<cls>": 6
  },
  "_arkadiko_note": "The trained model config (config.json) sets bos_token_id=0, eos_token_id=2, pad_token_id=1. The actual SPM model ships <unk>=0, <bos>=1, <eos>=2, <pad>=3. The runtime SHOULD use the tokenizer-derived IDs (this file's `added_tokens`) — config.json values are kept as-trained for reproducibility but are misaligned. See README for details."
}
training_summary.json
ADDED
{
  "step": 9114584,
  "total_tokens": 100000006144,
  "subphase_idx": 15,
  "final_eval_step": 9110000,
  "final_overall_loss_nats": 3.3363,
  "final_overall_ma3_nats": 3.3602,
  "best_overall_loss_nats": 3.2803,
  "best_overall_step": 8815000,
  "per_domain_ma3_loss_nats": {
    "ar": 3.7952,
    "en": 3.7491,
    "de": 3.5717,
    "fr": 3.3201,
    "es": 3.4335,
    "it": 3.4953,
    "code": 1.9293,
    "math": 3.096,
    "classical": 3.7764
  },
  "per_domain_ma3_ppl": {
    "ar": 44.5,
    "en": 42.5,
    "de": 35.6,
    "fr": 27.7,
    "es": 30.9,
    "it": 32.9,
    "code": 6.9,
    "math": 22.1,
    "classical": 43.7
  },
  "training_hours": 524.7,
  "hardware": "NVIDIA RTX PRO 4000 Blackwell, 24GB",
  "wall_clock_end": "2026-05-06T14:43+00:00"
}