VectorNomad committed (verified)
Commit d0e66b7 · 1 Parent(s): 3bfa762

Initial release: Arkadiko V4 base, 214M / 100B tokens

Files changed (6)
  1. README.md +165 -0
  2. config.json +22 -0
  3. model.safetensors +3 -0
  4. tokenizer.model +3 -0
  5. tokenizer_config.json +22 -0
  6. training_summary.json +35 -0
README.md ADDED
@@ -0,0 +1,165 @@
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ language:
4
+ - ar
5
+ - en
6
+ - de
7
+ - fr
8
+ - es
9
+ - it
10
+ tags:
11
+ - arkadiko
12
+ - arabic
13
+ - bilingual
14
+ - pretrained
15
+ - causal-lm
16
+ - research
17
+ library_name: transformers
18
+ pipeline_tag: text-generation
19
+ ---
20
+
21
+ # Arkadiko V4 — Base (pretrained, no SFT)
22
+
23
+ 214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. **Pretraining only — no instruction tuning, no chat alignment, no RLHF.** Released as a research artifact.
24
+
25
+ This is **V4**, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar↔en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the [Honest Limitations](#honest-limitations) section before considering use.
26
+
27
+ ## Quick facts
28
+
29
+ | | |
30
+ |---|---|
31
+ | Parameters | 213,934,720 |
32
+ | Architecture | Pure causal decoder, 18 layers |
33
+ | Hidden size | 640 |
34
+ | Attention | GQA, 10 query heads / 2 KV heads, head_dim=64 |
35
+ | FFN | SwiGLU, hidden=3456 (≈5.4×) |
36
+ | Vocab | 60,000 (SentencePiece BPE) |
37
+ | Context | 2,048 tokens |
38
+ | Position | RoPE, theta=10000 |
39
+ | Tied embeddings | No (separate `wte` and `lm_head`) |
40
+ | Tokens trained | 100,000,006,144 (~100B) |
41
+ | Training steps | 9,114,584 |
42
+ | Training hours | 524.7 |
43
+ | Hardware | 1× NVIDIA RTX PRO 4000 Blackwell (24GB) |
44
+ | Run completed | 2026-05-06 |
45
+
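+ The headline parameter count can be reproduced from the table above. A back-of-the-envelope sketch (assumes bias-free projections and untied embeddings; the small remainder is normalization weights):
+
+ ```python
+ # Rough parameter accounting from the Quick-facts table (no biases assumed).
+ vocab, d, layers, ffn, kv_heads, head_dim = 60_000, 640, 18, 3456, 2, 64
+ embeddings = vocab * d                                   # input embedding matrix
+ lm_head = vocab * d                                      # untied output head
+ attn = d * d + 2 * (kv_heads * head_dim * d) + d * d     # q, k, v, o projections
+ mlp = 3 * d * ffn                                        # SwiGLU: gate, up, down
+ total = embeddings + lm_head + layers * (attn + mlp)
+ print(f"{total:,}")   # 213,934,080, within ~1K of the reported 213,934,720
+ ```
+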
46
+ ## Final evaluation (held-out per-domain)
47
+
48
+ Loss in nats, perplexity = exp(loss). Best-ever overall val PPL was **26.6** at step 8,815k; the released final checkpoint, taken after the cosine-tail polish phase, sits at PPL ≈ 28.8.
49
+
50
+ | Domain | Val loss (MA3) | Perplexity |
51
+ |---|---|---|
52
+ | code | 1.93 | 6.9 |
53
+ | math | 3.10 | 22.1 |
54
+ | fr | 3.32 | 27.7 |
55
+ | es | 3.43 | 30.9 |
56
+ | it | 3.50 | 32.9 |
57
+ | de | 3.57 | 35.6 |
58
+ | en | 3.75 | 42.5 |
59
+ | classical (Arabic) | 3.78 | 43.7 |
60
+ | **ar (modern)** | **3.80** | **44.5** |
61
+ | **overall** | 3.36 | 28.8 |
62
+
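+ The perplexities above are just `exp(loss)` over the shipped per-domain losses and can be re-derived from `training_summary.json` in this repo:
+
+ ```python
+ # Recompute the per-domain perplexities from the shipped training summary.
+ import json
+ import math
+
+ with open("training_summary.json") as f:
+     summary = json.load(f)
+
+ for domain, loss in summary["per_domain_ma3_loss_nats"].items():
+     print(f"{domain:10s} loss={loss:.4f}  ppl={math.exp(loss):.1f}")
+ print(f"overall    ppl={math.exp(summary['final_overall_ma3_nats']):.1f}")  # ≈ 28.8
+ ```
+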
63
+ ## Training data
64
+
65
+ Approximate token counts and sources per domain:
66
+
67
+ | Domain | Tokens | Source |
68
+ |---|---|---|
69
+ | Arabic (modern) | 24B | ArabicWeb24 + cc100-ar + CulturaX-ar |
70
+ | English | 28B | FineWeb-Edu |
71
+ | German | 12B | cc100-de |
72
+ | French | 8B | cc100-fr |
73
+ | Spanish | 8B | cc100-es |
74
+ | Italian | 7B | cc100-it |
75
+ | Code | 8B | CodeParrot + StarCoderData |
76
+ | Math | 7B | OpenWebMath |
77
+ | Classical Arabic | 2.7B | Custom (hadith, tafsir, OpenITI, poetry, tashkeela) |
78
+
79
+ Single SentencePiece BPE tokenizer shared across all 9 domains. **Token-fertility is uneven** — Arabic averages roughly 2× the tokens-per-word of English in this vocab, which we believe is a primary cause of weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see [Roadmap](#roadmap)).
80
+
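+ A quick way to see the fertility gap is to encode parallel sentences with the shipped tokenizer (sketch only; assumes the `sentencepiece` package, and the sample sentences are illustrative rather than drawn from the training data):
+
+ ```python
+ # Compare tokens-per-word for a short English vs. Arabic sentence.
+ import sentencepiece as spm
+
+ sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
+ samples = {
+     "en": "The weather in the city was clear this morning.",
+     "ar": "كان الطقس في المدينة صافيا هذا الصباح.",
+ }
+ for lang, text in samples.items():
+     tokens = sp.encode(text, out_type=str)
+     words = text.split()
+     print(f"{lang}: {len(tokens)} tokens / {len(words)} words "
+           f"= {len(tokens) / len(words):.2f} tokens per word")
+ ```
+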
81
+ ## Honest limitations
82
+
83
+ This base model has known structural failures verified through completion testing across the run. Use accordingly.
84
+
85
+ 1. **Coherent generation horizon ≈ 50 tokens.** Beyond that, output drifts, loops on a topic, or collapses into repetition. This is capacity-bound at this size; SFT cannot extend it.
86
+ 2. **No factual recall in long form.** Capitals, public figures, dates — the model produces fluent confabulation, not facts. Pair it with retrieval or tools; do not deploy it as a Q&A system.
87
+ 3. **Cross-language code bleed.** Code prompts in one language frequently produce output flavored by another (JS prompt → Python output). Vocab-level issue.
88
+ 4. **Arabic — the primary target language — is the worst text domain by PPL.** Surface fluency reaches ~30-50 token spans; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
89
+ 5. **No safety alignment.** No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
90
+ 6. **No instruction-following.** Base model only. Will not reliably follow chat templates, refuse harmful requests, or call tools.
91
+
92
+ ### Configuration / tokenizer ID misalignment (read before using)
93
+
94
+ The `config.json` shipped here records the values used during training: `bos_token_id=0, eos_token_id=2, pad_token_id=1`. The actual SentencePiece model (`tokenizer.model`) defines these tokens at different IDs:
95
+
96
+ | Token | SPM ID | config.json |
97
+ |---|---|---|
98
+ | `<unk>` | 0 | (not specified) |
99
+ | `<bos>` | 1 | `bos_token_id=0` |
100
+ | `<eos>` | 2 | `eos_token_id=2` |
101
+ | `<pad>` | 3 | `pad_token_id=1` |
102
+
103
+ **Use the IDs from the SPM model when serving.** `tokenizer_config.json` lists the SPM-derived IDs in `added_tokens`. The misaligned values in `config.json` are preserved for reproducibility — the model was trained with them — but downstream code should treat the SPM model as the source of truth.
104
+
105
+ This also affects all other special tokens, which the SPM model places at IDs 7–14:
106
+
107
+ ```
108
+ <system>=7 <user>=8 <assistant>=9
109
+ <think>=10 </think>=11 <tool_call>=12 <tool_result>=13 <eot>=14
110
+ ```
111
+
112
+ `<think>` is the only special token with a paired closer; `<tool_call>` and `<tool_result>` content is bounded by `<eos>` rather than a closing tag.
113
+
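+ To confirm the tokenizer-derived IDs on your side before serving, the SPM model can be queried directly (a minimal sanity check, assuming the `sentencepiece` package):
+
+ ```python
+ # Print the SPM-defined IDs; they should match the tables above, not config.json.
+ import sentencepiece as spm
+
+ sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
+ for piece in ["<unk>", "<bos>", "<eos>", "<pad>",
+               "<system>", "<user>", "<assistant>",
+               "<think>", "</think>", "<tool_call>", "<tool_result>", "<eot>"]:
+     print(f"{piece:14s} id={sp.piece_to_id(piece)}")   # expect 0, 1, 2, 3, then 7-14
+ ```
+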
114
+ ## Loading
115
+
116
+ The model uses a custom architecture (`ArkadikoForCausalLM`) which is not part of `transformers` upstream. To load weights, use the `arkadiko/llm/model.py` definition from the project repo, or load the `safetensors` tensors directly:
117
+
118
+ ```python
119
+ import json
120
+ from safetensors.torch import load_file
121
+ state_dict = load_file("model.safetensors")
122
+ with open("config.json") as f:
+     config = json.load(f)
123
+ # Initialize your ArkadikoConfig + ArkadikoForCausalLM
124
+ # (see https://github.com/... for the model code)
125
+ # model.load_state_dict(state_dict, strict=False)
126
+ ```
127
+
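+ Even without the model code, the checkpoint can be sanity-checked by summing tensor sizes, which recovers the parameter count (assumes the `safetensors` package):
+
+ ```python
+ # Count parameters directly from the safetensors file; no model class needed.
+ from safetensors import safe_open
+
+ total = 0
+ with safe_open("model.safetensors", framework="pt", device="cpu") as f:
+     for name in f.keys():
+         total += f.get_tensor(name).numel()
+ print(f"{total:,}")   # expected: 213,934,720
+ ```
+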
128
+ The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.
129
+
130
+ ## What this artifact is good for
131
+
132
+ - **Research baseline.** Reproducible 214M / 100B-token Arabic-inclusive base.
133
+ - **SFT experiments.** Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
134
+ - **Capability-curve studies.** Final eval and run log are included; full per-checkpoint curve available on request.
135
+
136
+ ## What this artifact is **not** good for
137
+
138
+ - Production chat or assistant deployment.
139
+ - Factual question answering.
140
+ - Long-form generation (>50 tokens).
141
+ - Translation as native generation (a translation tool wrapped around any base model will likely work better than this model alone).
142
+
143
+ ## Roadmap
144
+
145
+ The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.
146
+
147
+ ## License
148
+
149
+ **CC BY-NC 4.0** — non-commercial use only. Attribution required. No warranty, no liability.
150
+
151
+ ## Citation
152
+
153
+ ```bibtex
154
+ @misc{arkadiko_v4_base_2026,
155
+ author = {{VectorNomad}},
156
+ title = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
157
+ year = {2026},
158
+ publisher = {Hugging Face},
159
+ howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
160
+ }
161
+ ```
162
+
163
+ ## Acknowledgements
164
+
165
+ Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.
config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "model_type": "arkadiko",
3
+ "architectures": [
4
+ "ArkadikoForCausalLM"
5
+ ],
6
+ "vocab_size": 60000,
7
+ "hidden_size": 640,
8
+ "num_hidden_layers": 18,
9
+ "num_attention_heads": 10,
10
+ "num_key_value_heads": 2,
11
+ "head_dim": 64,
12
+ "intermediate_size": 3456,
13
+ "ffn_mult": 5.4,
14
+ "max_position_embeddings": 2048,
15
+ "rope_theta": 10000.0,
16
+ "tie_word_embeddings": false,
17
+ "torch_dtype": "bfloat16",
18
+ "bos_token_id": 0,
19
+ "eos_token_id": 2,
20
+ "pad_token_id": 1,
21
+ "transformers_version": null
22
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:64bc2a8190c2620aaddc4151443c26fdba49f3984eaf6be643ba73ba6baa578b
3
+ size 427881984
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:45d7a06bfacb1f8112436ea508ebaac0791ea1d0c9165b0f2519d7fed5ce6168
3
+ size 1305066
tokenizer_config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "tokenizer_class": "LlamaTokenizer",
3
+ "model_max_length": 2048,
4
+ "added_tokens": {
5
+ "<unk>": 0,
6
+ "<bos>": 1,
7
+ "<eos>": 2,
8
+ "<pad>": 3,
9
+ "<system>": 7,
10
+ "<user>": 8,
11
+ "<assistant>": 9,
12
+ "<think>": 10,
13
+ "</think>": 11,
14
+ "<tool_call>": 12,
15
+ "<tool_result>": 13,
16
+ "<eot>": 14,
17
+ "<mask>": 4,
18
+ "<sep>": 5,
19
+ "<cls>": 6
20
+ },
21
+ "_arkadiko_note": "The trained model config (config.json) sets bos_token_id=0, eos_token_id=2, pad_token_id=1. The actual SPM model ships <unk>=0, <bos>=1, <eos>=2, <pad>=3. The runtime SHOULD use the tokenizer-derived IDs (this file's `added_tokens`) — config.json values are kept as-trained for reproducibility but are misaligned. See README for details."
22
+ }
training_summary.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "step": 9114584,
3
+ "total_tokens": 100000006144,
4
+ "subphase_idx": 15,
5
+ "final_eval_step": 9110000,
6
+ "final_overall_loss_nats": 3.3363,
7
+ "final_overall_ma3_nats": 3.3602,
8
+ "best_overall_loss_nats": 3.2803,
9
+ "best_overall_step": 8815000,
10
+ "per_domain_ma3_loss_nats": {
11
+ "ar": 3.7952,
12
+ "en": 3.7491,
13
+ "de": 3.5717,
14
+ "fr": 3.3201,
15
+ "es": 3.4335,
16
+ "it": 3.4953,
17
+ "code": 1.9293,
18
+ "math": 3.096,
19
+ "classical": 3.7764
20
+ },
21
+ "per_domain_ma3_ppl": {
22
+ "ar": 44.5,
23
+ "en": 42.5,
24
+ "de": 35.6,
25
+ "fr": 27.7,
26
+ "es": 30.9,
27
+ "it": 32.9,
28
+ "code": 6.9,
29
+ "math": 22.1,
30
+ "classical": 43.7
31
+ },
32
+ "training_hours": 524.7,
33
+ "hardware": "NVIDIA RTX PRO 4000 Blackwell, 24GB",
34
+ "wall_clock_end": "2026-05-06T14:43+00:00"
35
+ }