- Akkadian-English DenseLLM 1B
- What Is This?
- Relationship to the Previous Version
- Important Inference Note
- Install
- Quick Usage
- Full Gradio Inference App
- Minimal Inference Pattern
- Example Prompts
- Architecture
- Architecture Details
- Training Focus
- Dataset Focus
- Recommended Generation Settings
- Known Limitations
- Tokenizer Note
- Suggested Use Cases
- Out-of-Scope Uses
- Feedback Welcome
- Contact
- Citation
- License
Akkadian-English DenseLLM 1B
Akkadian-English DenseLLM 1B is an experimental custom DenseLLM model focused on English ↔ Akkadian / Old Babylonian translation, transliteration, glossing, and grammatical analysis.
This model is a continuation of the earlier AlgoDriveAI/Sanskrit_Akkadian_LLM project. The previous version mixed English, Sanskrit, Akkadian, and some auxiliary material. For this version, the model size was increased to approximately 1B parameters, and the training focus was narrowed to English and Akkadian only.
The main reason for this change is that early testing showed it may be easier to evaluate and improve the model by separating language targets. Instead of combining English/Akkadian/Sanskrit in one testing version, this release focuses specifically on English/Akkadian behavior.
This version is intended to achieve higher accuracy on Akkadian-focused translation, glossing, and analysis tasks than the earlier mixed-language testing versions.
What Is This?
This is a research model for ancient-language experimentation, especially:
- Akkadian-to-English translation
- English-to-Akkadian style generation
- Old Babylonian-style normalized transliteration
- Literal word-by-word glossing
- Grammatical explanation
- Case-ending analysis
- Verb-form explanation
- Ancient-language prompt-following behavior
This is not a production translation system. Outputs should be checked against reliable Akkadian grammars, dictionaries, corpora, and primary sources.
Relationship to the Previous Version
Compared with the earlier Sanskrit + Akkadian model:
- The model size was increased to approximately 1B parameters
- Sanskrit was removed from the main training target
- The focus was narrowed to English/Akkadian
- The model is intended to achieve higher accuracy on Akkadian-specific tasks
- The architecture remains a custom DenseLLM-style causal language model
- The model uses a custom MLA-style attention architecture
- Inference is handled with custom model code rather than the standard AutoModelForCausalLM
Important Inference Note
This is not a standard Hugging Face AutoModelForCausalLM checkpoint.
The repository currently uses:
- pytorch_model.bin for model weights
- config.json for partial configuration metadata
- tokenizer files, if available
Because this is a custom DenseLLM / MLA architecture, inference must instantiate the matching model class manually. Some versions of config.json may not include every MLA-specific field, so the safest approach is for the inference script to infer architecture values from the checkpoint tensor shapes first and fall back to the known training architecture defaults only when a value cannot be inferred.
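The shape-first idea can be sketched without loading any weights. The snippet below is a minimal illustration, not the release's loader: plain (rows, cols) tuples stand in for tensor shapes, and the helper name `infer_arch_from_shapes`, the `example_shapes` dictionary, and the placeholder vocabulary size are hypothetical. It assumes PyTorch's nn.Linear layout, where a weight has shape (out_features, in_features), uses the key names from the Gradio script in this card, and assumes the head count and rotary head size are taken from the training defaults.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int]


def infer_arch_from_shapes(
    shapes: Dict[str, Shape],
    n_heads: int = 12,           # assumed training default
    qk_rope_head_dim: int = 64,  # assumed training default
) -> Dict[str, int]:
    """Recover MLA hyperparameters from nn.Linear weight shapes (out, in)."""
    vocab_size, d_model = shapes["embed.weight"]
    q_lora_rank = shapes["blocks.0.attn.q_a_proj.weight"][0]
    q_b_out = shapes["blocks.0.attn.q_b_proj.weight"][0]
    kv_a_out = shapes["blocks.0.attn.kv_a_proj.weight"][0]
    o_proj_in = shapes["blocks.0.attn.o_proj.weight"][1]

    # The head count must divide both projection widths evenly.
    if q_b_out % n_heads or o_proj_in % n_heads:
        raise ValueError("n_heads does not divide the projection widths")

    q_head_dim = q_b_out // n_heads
    return {
        "vocab_size": vocab_size,
        "d_model": d_model,
        "q_lora_rank": q_lora_rank,
        # kv_a_proj emits the compressed KV latent plus the shared rope key.
        "kv_lora_rank": kv_a_out - qk_rope_head_dim,
        "qk_nope_head_dim": q_head_dim - qk_rope_head_dim,
        "qk_rope_head_dim": qk_rope_head_dim,
        "v_head_dim": o_proj_in // n_heads,
    }


# Shapes a checkpoint with this card's architecture would be expected to have.
# The vocabulary size (32768) is a placeholder, not the released value.
example_shapes = {
    "embed.weight": (32768, 1536),
    "blocks.0.attn.q_a_proj.weight": (768, 1536),
    "blocks.0.attn.q_b_proj.weight": (1536, 768),
    "blocks.0.attn.kv_a_proj.weight": (448, 1536),
    "blocks.0.attn.o_proj.weight": (1536, 1536),
}
arch = infer_arch_from_shapes(example_shapes)
print(arch)
```

With a real checkpoint, the same lookups would read `tensor.shape` from the loaded state dict instead of the hand-written tuples.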
The expected architecture for this release is:
| Hyperparameter | Value |
|---|---|
| d_model | 1536 |
| n_layers | 36 |
| n_heads | 12 |
| q_lora_rank | 768 |
| kv_lora_rank | 384 |
| qk_nope_head_dim | 64 |
| qk_rope_head_dim | 64 |
| v_head_dim | 128 |
| ff_hidden_mult | 3.5 |
| max_seq_len | 4096 |
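As a sanity check on these values, the derived dimensions and an approximate parameter count can be worked out by hand. The sketch below is only arithmetic over the listed hyperparameters. It assumes the layer layout of the Gradio script in this card (bias-free linear layers, a fused gate/up SwiGLU projection, QK norm enabled, tied input and output embeddings), and it leaves the embedding out of the total because the vocabulary size is not stated here.

```python
# Values from the architecture table.
d_model, n_layers, n_heads = 1536, 36, 12
q_lora_rank, kv_lora_rank = 768, 384
qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 64, 64, 128
ff_hidden_mult = 3.5

q_head_dim = qk_nope_head_dim + qk_rope_head_dim

# SwiGLU inner width, rounded up to a multiple of 256 as in the inference script.
ff_inner = ((int(ff_hidden_mult * d_model) + 255) // 256) * 256

attn_params = (
    d_model * q_lora_rank                                       # q_a_proj
    + q_lora_rank * n_heads * q_head_dim                        # q_b_proj
    + d_model * (kv_lora_rank + qk_rope_head_dim)               # kv_a_proj
    + kv_lora_rank * n_heads * (qk_nope_head_dim + v_head_dim)  # kv_b_proj
    + n_heads * v_head_dim * d_model                            # o_proj
)
ff_params = d_model * 2 * ff_inner + ff_inner * d_model         # gate_up + down
norm_params = (
    q_lora_rank + kv_lora_rank  # q_a_norm, kv_a_norm
    + 2 * qk_nope_head_dim      # q_nope_norm, k_nope_norm (QK norm assumed on)
    + 2 * d_model               # ln_attn, ln_ff
)
per_layer = attn_params + ff_params + norm_params
non_embedding = n_layers * per_layer + d_model                  # + final norm

print(f"q_head_dim         = {q_head_dim}")
print(f"n_heads*v_head_dim = {n_heads * v_head_dim}")
print(f"ff_inner           = {ff_inner}")
print(f"params per layer   = {per_layer:,}")
print(f"non-embedding      = {non_embedding:,}")
```

Under these assumptions the non-embedding weights come to roughly 1.12 billion parameters. The tied embedding adds vocab_size × 1536 on top of that, which depends on the tokenizer.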
Install
pip install torch transformers huggingface_hub gradio
Optional:
pip install einops
Quick Usage
The easiest way to run the model locally is to copy the full Gradio script below into a file such as:
app.py
Then run:
python app.py
The script loads:
AlgoDriveAI/Akkadian_English_DenseLLM_1B/pytorch_model.bin
and uses:
REPO_ID = "AlgoDriveAI/Akkadian_English_DenseLLM_1B"
WEIGHTS_FILENAME = "pytorch_model.bin"
CONFIG_FILENAME = "config.json"
The Gradio script does not require modeling_dense_llm.py to be uploaded, because the model architecture is included directly inside the script.
Full Gradio Inference App
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Gradio inference app for:
    AlgoDriveAI/Akkadian_English_DenseLLM_1B

Fixes included:
- Does NOT trust bad/partial config.json values for architecture.
- Infers core architecture from pytorch_model.bin tensor shapes first.
- Uses the training MLA architecture defaults:
    d_model=1536
    n_layers=36
    n_heads=12
    q_lora_rank=768
    kv_lora_rank=384
    qk_nope_head_dim=64
    qk_rope_head_dim=64
    v_head_dim=128
    ff_hidden_mult=3.5
    max_seq_len=4096
- Defines DenseLLM directly in this script.
- Does NOT require modeling_dense_llm.py to be uploaded.
- Launches a streaming Gradio UI.

Install:
    pip install torch transformers huggingface_hub gradio
"""

import os
import re
import json
from dataclasses import dataclass
from typing import Optional, Dict, Any

import torch
import torch.nn as nn
import torch.nn.functional as F
import gradio as gr
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# =============================================================================
# REPO SETTINGS
# =============================================================================
REPO_ID = "AlgoDriveAI/Akkadian_English_DenseLLM_1B"
CONFIG_FILENAME = "config.json"
WEIGHTS_FILENAME = "pytorch_model.bin"
FALLBACK_TOKENIZER = "mistralai/Mistral-7B-Instruct-v0.3"
DOC_EOS_TOKEN = "<|endoftext|>"
# =============================================================================
# KNOWN TRAINING ARCHITECTURE FALLBACKS
# =============================================================================
TRAINING_D_MODEL = 1536
TRAINING_N_LAYERS = 36
TRAINING_N_HEADS = 12
TRAINING_Q_LORA_RANK = 768
TRAINING_KV_LORA_RANK = 384
TRAINING_QK_NOPE_HEAD_DIM = 64
TRAINING_QK_ROPE_HEAD_DIM = 64
TRAINING_V_HEAD_DIM = 128
TRAINING_FF_MULT = 3.5
TRAINING_QK_NORM = True
TRAINING_MAX_SEQ_LEN = 4096
# =============================================================================
# MODEL ARCHITECTURE
# =============================================================================
try:
    from torch.nn import RMSNorm
except ImportError:
    # Fallback for PyTorch builds that do not ship nn.RMSNorm.
    class RMSNorm(nn.Module):
        def __init__(self, normalized_shape, eps: float = 1e-6):
            super().__init__()
            if isinstance(normalized_shape, int):
                normalized_shape = (normalized_shape,)
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(normalized_shape))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.weight * (
                x.float()
                * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
            ).to(x.dtype)


class RotaryEmbedding(nn.Module):
    def __init__(self, dim: int, base: float = 10000.0, max_seq_len: int = 8192):
        super().__init__()
        inv_freq = 1.0 / (
            base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
        )
        t = torch.arange(max_seq_len, dtype=torch.float32)
        freqs = torch.outer(t, inv_freq)
        self.register_buffer(
            "cos_cached",
            freqs.cos().repeat(1, 2),
            persistent=False,
        )
        self.register_buffer(
            "sin_cached",
            freqs.sin().repeat(1, 2),
            persistent=False,
        )

    def forward(self, seq_len: int, dtype: torch.dtype):
        return (
            self.cos_cached[:seq_len].to(dtype),
            self.sin_cached[:seq_len].to(dtype),
        )


def _rotate_half(x: torch.Tensor) -> torch.Tensor:
    half = x.shape[-1] // 2
    return torch.cat([-x[..., half:], x[..., :half]], dim=-1)


def apply_rotary_emb(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
):
    # Broadcast the (T, dim) tables over the batch and head dimensions.
    cos = cos.unsqueeze(0).unsqueeze(0)
    sin = sin.unsqueeze(0).unsqueeze(0)
    q_out = (q * cos) + (_rotate_half(q) * sin)
    k_out = (k * cos) + (_rotate_half(k) * sin)
    return q_out, k_out


class MLA(nn.Module):
    """
    MLA-style attention matching the training code:
    - Q compression
    - fused KV down projection
    - fused KV up projection
    - shared k_rope
    - optional QK norm
    """

    def __init__(
        self,
        d_model: int,
        n_heads: int,
        q_lora_rank: int,
        kv_lora_rank: int,
        qk_nope_head_dim: int,
        qk_rope_head_dim: int,
        v_head_dim: int,
        rope: RotaryEmbedding,
        qk_norm: bool = False,
        attn_dropout: float = 0.0,
    ):
        super().__init__()
        self.n_heads = n_heads
        self.q_lora_rank = q_lora_rank
        self.kv_lora_rank = kv_lora_rank
        self.qk_nope_head_dim = qk_nope_head_dim
        self.qk_rope_head_dim = qk_rope_head_dim
        self.q_head_dim = qk_nope_head_dim + qk_rope_head_dim
        self.v_head_dim = v_head_dim
        self.attn_drop = attn_dropout
        self.rope = rope

        self.q_a_proj = nn.Linear(d_model, q_lora_rank, bias=False)
        self.q_a_norm = RMSNorm(q_lora_rank)
        self.q_b_proj = nn.Linear(
            q_lora_rank,
            n_heads * self.q_head_dim,
            bias=False,
        )

        self.kv_a_proj = nn.Linear(
            d_model,
            kv_lora_rank + qk_rope_head_dim,
            bias=False,
        )
        self.kv_a_norm = RMSNorm(kv_lora_rank)
        self.kv_b_proj = nn.Linear(
            kv_lora_rank,
            n_heads * (qk_nope_head_dim + v_head_dim),
            bias=False,
        )

        self.o_proj = nn.Linear(
            n_heads * v_head_dim,
            d_model,
            bias=False,
        )
        self.o_proj._is_residual = True

        self.qk_norm = qk_norm
        if qk_norm:
            self.q_nope_norm = RMSNorm(qk_nope_head_dim)
            self.k_nope_norm = RMSNorm(qk_nope_head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        H = self.n_heads
        nope_dim = self.qk_nope_head_dim
        rope_dim = self.qk_rope_head_dim
        v_dim = self.v_head_dim
        q_dim = self.q_head_dim

        # Compressed query path, split into non-rotary and rotary parts.
        q = self.q_b_proj(self.q_a_norm(self.q_a_proj(x)))
        q = q.view(B, T, H, q_dim)
        q_nope = q[..., :nope_dim]
        q_rope = q[..., nope_dim:]

        # Fused KV down projection: latent c plus one shared rotary key.
        kv_a = self.kv_a_proj(x)
        c, k_rope = torch.split(
            kv_a,
            [self.kv_lora_rank, rope_dim],
            dim=-1,
        )
        c = self.kv_a_norm(c)
        k_rope = k_rope.view(B, T, 1, rope_dim)

        # Fused KV up projection into per-head keys and values.
        kv = self.kv_b_proj(c)
        kv = kv.view(B, T, H, nope_dim + v_dim)
        k_nope, v = torch.split(kv, [nope_dim, v_dim], dim=-1)

        if self.qk_norm:
            q_nope = self.q_nope_norm(q_nope)
            k_nope = self.k_nope_norm(k_nope)

        q_nope = q_nope.transpose(1, 2)
        q_rope = q_rope.transpose(1, 2)
        k_nope = k_nope.transpose(1, 2)
        k_rope = k_rope.transpose(1, 2)
        v = v.transpose(1, 2)

        cos, sin = self.rope(T, x.dtype)
        q_rope, k_rope = apply_rotary_emb(q_rope, k_rope, cos, sin)

        q = torch.cat([q_nope, q_rope], dim=-1)
        k_rope = k_rope.expand(B, H, T, rope_dim)
        k = torch.cat([k_nope, k_rope], dim=-1)

        drop_p = self.attn_drop if self.training else 0.0
        out = F.scaled_dot_product_attention(
            q,
            k,
            v,
            dropout_p=drop_p,
            is_causal=True,
        )
        out = out.transpose(1, 2).reshape(B, T, H * v_dim)
        return self.o_proj(out)


class SwiGLU(nn.Module):
    def __init__(self, d_model: int, hidden_mult: float = 3.5):
        super().__init__()
        inner = int(hidden_mult * d_model)
        # Round the inner width up to a multiple of 256.
        inner = ((inner + 255) // 256) * 256
        self.gate_up_proj = nn.Linear(d_model, 2 * inner, bias=False)
        self.down_proj = nn.Linear(inner, d_model, bias=False)
        self.down_proj._is_residual = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        return self.down_proj(F.silu(gate) * up)


class Block(nn.Module):
    def __init__(
        self,
        d_model: int,
        n_heads: int,
        q_lora_rank: int,
        kv_lora_rank: int,
        qk_nope_head_dim: int,
        qk_rope_head_dim: int,
        v_head_dim: int,
        rope: RotaryEmbedding,
        ff_hidden_mult: float = 3.5,
        qk_norm: bool = False,
        attn_dropout: float = 0.0,
        resid_dropout: float = 0.0,
    ):
        super().__init__()
        self.ln_attn = RMSNorm(d_model)
        self.ln_ff = RMSNorm(d_model)
        self.attn = MLA(
            d_model=d_model,
            n_heads=n_heads,
            q_lora_rank=q_lora_rank,
            kv_lora_rank=kv_lora_rank,
            qk_nope_head_dim=qk_nope_head_dim,
            qk_rope_head_dim=qk_rope_head_dim,
            v_head_dim=v_head_dim,
            rope=rope,
            qk_norm=qk_norm,
            attn_dropout=attn_dropout,
        )
        self.ff = SwiGLU(
            d_model=d_model,
            hidden_mult=ff_hidden_mult,
        )
        self.resid_drop = (
            nn.Dropout(resid_dropout)
            if resid_dropout > 0
            else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual block: attention, then feed-forward.
        x = x + self.resid_drop(self.attn(self.ln_attn(x)))
        x = x + self.resid_drop(self.ff(self.ln_ff(x)))
        return x


@dataclass
class ModelConfig:
    vocab_size: int
    d_model: int
    n_layers: int
    n_heads: int
    q_lora_rank: int
    kv_lora_rank: int
    qk_nope_head_dim: int
    qk_rope_head_dim: int
    v_head_dim: int
    ff_hidden_mult: float
    qk_norm: bool
    max_seq_len: int = 4096
    attn_dropout: float = 0.0
    resid_dropout: float = 0.0
    emb_dropout: float = 0.0
    label_smoothing: float = 0.0

    @property
    def q_head_dim(self) -> int:
        return self.qk_nope_head_dim + self.qk_rope_head_dim


class DenseLLM(nn.Module):
    def __init__(
        self,
        cfg: ModelConfig,
        use_gradient_checkpointing: bool = False,
    ):
        super().__init__()
        self.cfg = cfg
        self.use_gradient_checkpointing = use_gradient_checkpointing

        self.embed = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.emb_drop = (
            nn.Dropout(cfg.emb_dropout)
            if cfg.emb_dropout > 0
            else nn.Identity()
        )
        self.rope = RotaryEmbedding(
            dim=cfg.qk_rope_head_dim,
            max_seq_len=cfg.max_seq_len,
        )
        self.blocks = nn.ModuleList([
            Block(
                d_model=cfg.d_model,
                n_heads=cfg.n_heads,
                q_lora_rank=cfg.q_lora_rank,
                kv_lora_rank=cfg.kv_lora_rank,
                qk_nope_head_dim=cfg.qk_nope_head_dim,
                qk_rope_head_dim=cfg.qk_rope_head_dim,
                v_head_dim=cfg.v_head_dim,
                rope=self.rope,
                ff_hidden_mult=cfg.ff_hidden_mult,
                qk_norm=cfg.qk_norm,
                attn_dropout=cfg.attn_dropout,
                resid_dropout=cfg.resid_dropout,
            )
            for _ in range(cfg.n_layers)
        ])
        self.ln_f = RMSNorm(cfg.d_model)
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)

        self.apply(self._init_weights)

        # Scale down residual output projections by depth.
        scale = (2 * cfg.n_layers) ** -0.5
        for module in self.modules():
            if getattr(module, "_is_residual", False):
                with torch.no_grad():
                    module.weight.mul_(scale)

        # Tie input and output embeddings.
        self.lm_head.weight = self.embed.weight

    @staticmethod
    def _init_weights(module: nn.Module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(
        self,
        idx: torch.Tensor,
        targets: Optional[torch.Tensor] = None,
    ):
        x = self.emb_drop(self.embed(idx))
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits[:, :-1].contiguous().view(-1, logits.size(-1)),
                targets[:, 1:].contiguous().view(-1),
                label_smoothing=self.cfg.label_smoothing,
            )
        return logits, loss

# =============================================================================
# LOADING HELPERS
# =============================================================================
def load_json_from_hf(repo_id: str, filename: str):
    path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
    )
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data, path


def load_state_dict_safely(weights_path: str):
    try:
        obj = torch.load(
            weights_path,
            map_location="cpu",
            weights_only=True,
        )
    except TypeError:
        obj = torch.load(weights_path, map_location="cpu")
    except Exception:
        obj = torch.load(weights_path, map_location="cpu")

    # Unwrap common checkpoint containers.
    if isinstance(obj, dict):
        if "model" in obj:
            obj = obj["model"]
        elif "model_state_dict" in obj:
            obj = obj["model_state_dict"]
        elif "state_dict" in obj:
            obj = obj["state_dict"]

    if not isinstance(obj, dict):
        raise TypeError("Loaded weights object is not a state_dict dictionary.")

    # Strip prefixes added by torch.compile and DataParallel-style wrappers.
    cleaned = {}
    for key, value in obj.items():
        if key.startswith("_orig_mod."):
            key = key.removeprefix("_orig_mod.")
        if key.startswith("module."):
            key = key.removeprefix("module.")
        cleaned[key] = value
    return cleaned


def find_key(state_dict: Dict[str, torch.Tensor], *suffixes: str) -> Optional[str]:
    """
    Finds a key by exact match first, then by suffix.
    Useful if the checkpoint has prefixes.
    """
    for suffix in suffixes:
        if suffix in state_dict:
            return suffix
    for key in state_dict.keys():
        for suffix in suffixes:
            if key.endswith(suffix):
                return key
    return None


def infer_n_layers(state_dict: Dict[str, torch.Tensor]) -> Optional[int]:
    block_indices = []
    for key in state_dict.keys():
        match = re.search(r"(?:^|\.)blocks\.(\d+)\.", key)
        if match:
            block_indices.append(int(match.group(1)))
    if not block_indices:
        return None
    return max(block_indices) + 1


def choose_n_heads(
    q_b_out: int,
    o_proj_in: int,
    d_model: int,
    raw_config: Dict[str, Any],
) -> int:
    """
    Chooses n_heads from checkpoint shapes.

    For this training run:
        q_b_out = n_heads * q_head_dim
        o_proj_in = n_heads * v_head_dim
        q_head_dim = 128
        v_head_dim = 128
        n_heads = 12
        d_model = 1536
    """
    candidates = []

    # Prefer raw config only if it is compatible with checkpoint shapes.
    for key in ["n_heads", "num_attention_heads", "num_heads"]:
        if key in raw_config:
            try:
                h = int(raw_config[key])
                if h > 0 and q_b_out % h == 0 and o_proj_in % h == 0:
                    q_head_dim = q_b_out // h
                    v_head_dim = o_proj_in // h
                    candidates.append((h, q_head_dim, v_head_dim, "raw_config"))
            except Exception:
                pass

    # Add known training value if compatible.
    h = TRAINING_N_HEADS
    if q_b_out % h == 0 and o_proj_in % h == 0:
        candidates.append((h, q_b_out // h, o_proj_in // h, "training_default"))

    # General divisors.
    for h in range(1, 129):
        if q_b_out % h == 0 and o_proj_in % h == 0:
            q_head_dim = q_b_out // h
            v_head_dim = o_proj_in // h
            candidates.append((h, q_head_dim, v_head_dim, "divisor_search"))

    # Best case: q_head_dim == v_head_dim == 128 and h * v_head_dim == d_model.
    for h, qhd, vhd, source in candidates:
        if qhd == 128 and vhd == 128 and h * vhd == d_model:
            return h

    # Next: q_head_dim == v_head_dim and h * v_head_dim == d_model.
    for h, qhd, vhd, source in candidates:
        if qhd == vhd and h * vhd == d_model:
            return h

    # Next: known training default if compatible.
    for h, qhd, vhd, source in candidates:
        if source == "training_default":
            return h

    # Last: raw config if compatible.
    for h, qhd, vhd, source in candidates:
        if source == "raw_config":
            return h

    raise ValueError(
        f"Could not infer n_heads from shapes: q_b_out={q_b_out}, "
        f"o_proj_in={o_proj_in}, d_model={d_model}"
    )


def build_model_config_from_checkpoint(
    state_dict: Dict[str, torch.Tensor],
    raw_config: Dict[str, Any],
) -> Dict[str, Any]:
    """
    Build ModelConfig from checkpoint shapes first.
    This avoids trusting partial or misleading config.json values.
    The checkpoint is the source of truth.
    """
    embed_key = find_key(state_dict, "embed.weight")
    q_a_key = find_key(state_dict, "blocks.0.attn.q_a_proj.weight")
    q_b_key = find_key(state_dict, "blocks.0.attn.q_b_proj.weight")
    kv_a_key = find_key(state_dict, "blocks.0.attn.kv_a_proj.weight")
    kv_b_key = find_key(state_dict, "blocks.0.attn.kv_b_proj.weight")
    o_proj_key = find_key(state_dict, "blocks.0.attn.o_proj.weight")
    gate_up_key = find_key(state_dict, "blocks.0.ff.gate_up_proj.weight")

    required_keys = {
        "embed.weight": embed_key,
        "blocks.0.attn.q_a_proj.weight": q_a_key,
        "blocks.0.attn.q_b_proj.weight": q_b_key,
        "blocks.0.attn.kv_a_proj.weight": kv_a_key,
        "blocks.0.attn.kv_b_proj.weight": kv_b_key,
        "blocks.0.attn.o_proj.weight": o_proj_key,
        "blocks.0.ff.gate_up_proj.weight": gate_up_key,
    }
    missing = [name for name, key in required_keys.items() if key is None]
    if missing:
        print("\nAvailable state_dict keys sample:")
        for k in list(state_dict.keys())[:80]:
            print(" ", k)
        raise KeyError("Missing expected checkpoint keys: " + ", ".join(missing))

    embed = state_dict[embed_key]
    q_a = state_dict[q_a_key]
    q_b = state_dict[q_b_key]
    kv_a = state_dict[kv_a_key]
    kv_b = state_dict[kv_b_key]
    o_proj = state_dict[o_proj_key]
    gate_up = state_dict[gate_up_key]

    vocab_size = int(embed.shape[0])
    d_model = int(embed.shape[1])

    n_layers = infer_n_layers(state_dict)
    if n_layers is None:
        n_layers = TRAINING_N_LAYERS

    q_lora_rank = int(q_a.shape[0])
    q_b_out = int(q_b.shape[0])
    kv_a_out = int(kv_a.shape[0])
    kv_b_out = int(kv_b.shape[0])
    o_proj_in = int(o_proj.shape[1])

    n_heads = choose_n_heads(
        q_b_out=q_b_out,
        o_proj_in=o_proj_in,
        d_model=d_model,
        raw_config=raw_config,
    )
    q_head_dim = q_b_out // n_heads
    v_head_dim = o_proj_in // n_heads

    # Prefer the training split of q_head_dim=64+64 when q_head_dim=128.
    if q_head_dim == (
        TRAINING_QK_NOPE_HEAD_DIM + TRAINING_QK_ROPE_HEAD_DIM
    ):
        qk_nope_head_dim = TRAINING_QK_NOPE_HEAD_DIM
        qk_rope_head_dim = TRAINING_QK_ROPE_HEAD_DIM
    else:
        # Fallback: split evenly.
        qk_nope_head_dim = q_head_dim // 2
        qk_rope_head_dim = q_head_dim - qk_nope_head_dim

    kv_lora_rank = kv_a_out - qk_rope_head_dim

    # Cross-check kv_b shape:
    # kv_b_out = n_heads * (qk_nope_head_dim + v_head_dim)
    expected_kv_b_out = n_heads * (qk_nope_head_dim + v_head_dim)
    if kv_b_out != expected_kv_b_out:
        # Try the training defaults before failing.
        qk_nope_head_dim = TRAINING_QK_NOPE_HEAD_DIM
        qk_rope_head_dim = TRAINING_QK_ROPE_HEAD_DIM
        v_head_dim = TRAINING_V_HEAD_DIM
        kv_lora_rank = kv_a_out - qk_rope_head_dim
        expected_kv_b_out = n_heads * (qk_nope_head_dim + v_head_dim)
        if kv_b_out != expected_kv_b_out:
            raise ValueError(
                "Could not reconcile kv_b shape.\n"
                f"kv_b_out={kv_b_out}\n"
                f"expected={expected_kv_b_out}\n"
                f"n_heads={n_heads}, nope={qk_nope_head_dim}, v={v_head_dim}"
            )

    inner = int(gate_up.shape[0]) // 2
    inferred_ff_mult = inner / float(d_model)
    training_inner = ((int(TRAINING_FF_MULT * d_model) + 255) // 256) * 256
    if training_inner == inner:
        ff_hidden_mult = TRAINING_FF_MULT
    else:
        ff_hidden_mult = inferred_ff_mult

    max_seq_len = raw_config.get(
        "max_seq_len",
        raw_config.get("context_len", TRAINING_MAX_SEQ_LEN),
    )

    cfg = {
        "vocab_size": vocab_size,
        "d_model": d_model,
        "n_layers": n_layers,
        "n_heads": n_heads,
        "q_lora_rank": q_lora_rank,
        "kv_lora_rank": kv_lora_rank,
        "qk_nope_head_dim": qk_nope_head_dim,
        "qk_rope_head_dim": qk_rope_head_dim,
        "v_head_dim": v_head_dim,
        "ff_hidden_mult": ff_hidden_mult,
        "qk_norm": bool(raw_config.get("qk_norm", TRAINING_QK_NORM)),
        "max_seq_len": int(max_seq_len),
        # Inference-time dropout/smoothing should be off.
        "attn_dropout": 0.0,
        "resid_dropout": 0.0,
        "emb_dropout": 0.0,
        "label_smoothing": 0.0,
    }

    if cfg["qk_nope_head_dim"] + cfg["qk_rope_head_dim"] != cfg["v_head_dim"]:
        raise ValueError(
            f"Bad inferred config: q_head_dim="
            f"{cfg['qk_nope_head_dim'] + cfg['qk_rope_head_dim']} "
            f"but v_head_dim={cfg['v_head_dim']}"
        )
    if cfg["d_model"] != cfg["n_heads"] * cfg["v_head_dim"]:
        raise ValueError(
            f"Bad inferred config: d_model={cfg['d_model']} but "
            f"n_heads*v_head_dim={cfg['n_heads'] * cfg['v_head_dim']}"
        )
    return cfg


def load_tokenizer_for_model(
    repo_id: str,
    raw_config: Dict[str, Any],
    target_vocab_size: int,
):
    """
    Loads tokenizer.
    First tries repo tokenizer. If it looks suspiciously tiny, falls back to
    the training tokenizer from the training script.
    """
    tokenizer = None
    try:
        print("Loading tokenizer from model repo...")
        tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
        print(f"Repo tokenizer loaded. vocab={len(tokenizer):,}")
    except Exception as e:
        print(f"Could not load tokenizer from repo: {e}")

    if tokenizer is not None and len(tokenizer) < 1000 and target_vocab_size > 1000:
        print(
            f"Repo tokenizer looks too small: vocab={len(tokenizer):,}, "
            f"model vocab={target_vocab_size:,}. Ignoring repo tokenizer."
        )
        tokenizer = None

    if tokenizer is None:
        fallback_name = raw_config.get("vocab_name", FALLBACK_TOKENIZER)
        print(f"Loading fallback tokenizer: {fallback_name}")
        tokenizer = AutoTokenizer.from_pretrained(fallback_name, use_fast=True)

    doc_eos_token = raw_config.get("doc_eos_token", DOC_EOS_TOKEN)
    if doc_eos_token not in tokenizer.get_vocab():
        tokenizer.add_special_tokens({
            "additional_special_tokens": [doc_eos_token],
        })
    tokenizer.doc_eos_token = doc_eos_token
    tokenizer.doc_eos_token_id = tokenizer.convert_tokens_to_ids(doc_eos_token)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = doc_eos_token
        tokenizer.pad_token_id = tokenizer.doc_eos_token_id

    if len(tokenizer) < target_vocab_size:
        needed = target_vocab_size - len(tokenizer)
        print(f"Adding {needed:,} dummy tokens to match model vocab_size={target_vocab_size:,}")
        tokenizer.add_tokens(
            [f"<|dummy_infer_{i}|>" for i in range(needed)],
            special_tokens=False,
        )

    if len(tokenizer) > target_vocab_size:
        print(
            f"WARNING: tokenizer vocab={len(tokenizer):,} is larger than "
            f"model vocab={target_vocab_size:,}.\n"
            "Input token IDs above model vocab will be remapped to EOS/doc token.\n"
            "For best quality, upload the exact tokenizer files saved during training."
        )

    tokenizer.model_max_length = int(1e9)
    print(
        f"Tokenizer ready. tokenizer_vocab={len(tokenizer):,}, "
        f"model_vocab={target_vocab_size:,}, eos_id={tokenizer.doc_eos_token_id}"
    )
    return tokenizer


def sanitize_input_ids(
    input_ids: torch.Tensor,
    model_vocab_size: int,
    fallback_token_id: int,
):
    """
    Prevent embedding-index errors if fallback tokenizer produces IDs outside
    the model vocab.
    """
    if input_ids.numel() == 0:
        return input_ids
    if input_ids.max().item() >= model_vocab_size:
        input_ids = input_ids.clone()
        input_ids[input_ids >= model_vocab_size] = fallback_token_id
    return input_ids

# =============================================================================
# LOAD CONFIG, WEIGHTS, TOKENIZER, MODEL
# =============================================================================
print("Downloading config...")
raw_config, config_path = load_json_from_hf(REPO_ID, CONFIG_FILENAME)
print(f"Config path: {config_path}")

print("\nDownloading weights...")
weights_path = hf_hub_download(
    repo_id=REPO_ID,
    filename=WEIGHTS_FILENAME,
)
print(f"Weights path: {weights_path}")

print("\nLoading state dict...")
state_dict = load_state_dict_safely(weights_path)

print("\nBuilding model config from checkpoint tensor shapes...")
config = build_model_config_from_checkpoint(state_dict, raw_config)
model_cfg = ModelConfig(**config)

print("\nFinal model config:")
for key, value in config.items():
    print(f" {key}: {value}")

print("\nLoading tokenizer...")
tokenizer = load_tokenizer_for_model(
    repo_id=REPO_ID,
    raw_config=raw_config,
    target_vocab_size=model_cfg.vocab_size,
)

# Token used to replace out-of-vocabulary input IDs.
fallback_token_id = getattr(tokenizer, "doc_eos_token_id", None)
if fallback_token_id is None or fallback_token_id >= model_cfg.vocab_size:
    fallback_token_id = tokenizer.eos_token_id
if fallback_token_id is None or fallback_token_id >= model_cfg.vocab_size:
    fallback_token_id = 0

device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    torch.set_float32_matmul_precision("high")
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    try:
        torch.backends.cuda.enable_flash_sdp(True)
        torch.backends.cuda.enable_mem_efficient_sdp(True)
        torch.backends.cuda.enable_math_sdp(True)
    except Exception:
        pass
    if torch.cuda.is_bf16_supported():
        dtype = torch.bfloat16
    else:
        dtype = torch.float16
else:
    dtype = torch.float32
print(f"\nUsing device={device}, dtype={dtype}")

print("\nBuilding model...")
model = DenseLLM(
    model_cfg,
    use_gradient_checkpointing=False,
).to(device=device, dtype=dtype)

print("Loading model weights...")
try:
    model.load_state_dict(state_dict, strict=True)
    print("Weights loaded with strict=True.")
except RuntimeError as e:
    print("Strict load failed.")
    print(str(e)[:4000])
    raise

model.eval()
print("\nModel ready!")

# =============================================================================
# STREAMING GENERATION
# =============================================================================
@torch.inference_mode()
def stream_generate(
    prompt: str,
    max_new_tokens: int = 256,
    temperature: float = 0.55,
    top_k: int = 35,
    top_p: float = 0.88,
    repetition_penalty: float = 1.1,
):
    if not prompt or not prompt.strip():
        yield ""
        return

    encoded = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    )
    input_ids = encoded["input_ids"]
    input_ids = sanitize_input_ids(
        input_ids=input_ids,
        model_vocab_size=model_cfg.vocab_size,
        fallback_token_id=fallback_token_id,
    ).to(device)

    generated = input_ids.clone()
    prompt_len = input_ids.shape[1]

    eos_id = getattr(tokenizer, "doc_eos_token_id", None)
    if eos_id is None:
        eos_id = tokenizer.eos_token_id
    if eos_id is not None and eos_id >= model_cfg.vocab_size:
        eos_id = None

    max_seq_len = int(model_cfg.max_seq_len)

    for _ in range(int(max_new_tokens)):
        # Keep only the most recent max_seq_len tokens as context.
        model_input = generated[:, -max_seq_len:]
        logits, _ = model(model_input, None)
        next_logits = logits[:, -1, :].float()

        if temperature <= 0:
            # Greedy decoding.
            next_token = torch.argmax(next_logits, dim=-1, keepdim=True)
        else:
            next_logits = next_logits / max(float(temperature), 1e-8)

            # Repetition penalty
            if repetition_penalty and repetition_penalty != 1.0:
                used_tokens = torch.unique(generated[0])
                used_tokens = used_tokens[used_tokens < model_cfg.vocab_size]
                if used_tokens.numel() > 0:
                    token_scores = next_logits[0, used_tokens]
                    next_logits[0, used_tokens] = torch.where(
                        token_scores > 0,
                        token_scores / repetition_penalty,
                        token_scores * repetition_penalty,
                    )

            # Top-k filtering
            if top_k and top_k > 0:
                k = min(int(top_k), next_logits.size(-1))
                values, _ = torch.topk(next_logits, k)
                cutoff = values[:, [-1]]
                next_logits[next_logits < cutoff] = -float("inf")

            # Top-p / nucleus filtering
            if top_p and top_p < 1.0:
                sorted_logits, sorted_indices = torch.sort(next_logits, descending=True)
                sorted_probs = F.softmax(sorted_logits, dim=-1)
                cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
                remove_mask = cumulative_probs > float(top_p)
                # Shift right so the first token crossing the threshold is kept.
                remove_mask[..., 1:] = remove_mask[..., :-1].clone()
                remove_mask[..., 0] = False
                full_mask = torch.zeros_like(next_logits, dtype=torch.bool)
                full_mask.scatter_(1, sorted_indices, remove_mask)
                next_logits[full_mask] = -float("inf")

            probs = F.softmax(next_logits, dim=-1)
            if (
                not torch.isfinite(probs).all()
                or (probs.sum(dim=-1) <= 0).any()
            ):
                # Degenerate distribution: fall back to greedy on the raw logits.
                next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
            else:
                next_token = torch.multinomial(probs, num_samples=1)

        generated = torch.cat([generated, next_token], dim=-1)

        if eos_id is not None and next_token.item() == eos_id:
            break

        decoded = tokenizer.decode(
            generated[0, prompt_len:],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )
        yield decoded


def respond(
    prompt,
    max_tokens,
    temperature,
    top_k,
    top_p,
    repetition_penalty,
):
    for partial in stream_generate(
        prompt=prompt,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
    ):
        yield partial

# =============================================================================
# GRADIO UI
# =============================================================================
DEFAULT_PROMPT = """Translate the following Akkadian transliteration into English. Include a literal word-by-word gloss:
šarrum bītam rabiam ana ilim ibni.
"""

EXAMPLES = [
    [
        "Translate the following Akkadian transliteration into English. Include a literal word-by-word gloss:\nšarrum bītam rabiam ana ilim ibni."
    ],
    [
        "Translate the following Akkadian transliteration into English. Include grammatical notes:\nṭupšarrum awātim ina ṭuppim išṭur."
    ],
    [
        "Translate the following Akkadian transliteration into English. Provide a literal gloss and smooth English translation:\ntamkārum kaspam ana wardim iddin."
    ],
    [
        "Translate the following Old Babylonian-style Akkadian transliteration into English. Explain the case endings if possible:\nawīlum dannum abul ālim ina mūšim iṣṣur."
    ],
    [
        "Translate the following Akkadian transliteration into English. Give both a literal and natural translation:\nahī ana bīt abīšu īrub."
    ],
    [
        "Translate the following Akkadian transliteration into English. If uncertain, explain the possible alternatives:\nlū ilum šarram u ālam liṣṣur."
    ],
]

with gr.Blocks(
    title="Akkadian-English DenseLLM 1B",
    theme=gr.themes.Soft(),
) as demo:
    gr.Markdown(
        "# Akkadian-English DenseLLM 1B\n"
        "*AlgoDriveAI — custom DenseLLM / MLA architecture for Akkadian and Old Babylonian translation experiments*"
    )

    with gr.Row():
        with gr.Column(scale=3):
            prompt_box = gr.Textbox(
                label="Prompt",
                placeholder="Translate the following Akkadian transliteration into English...",
                lines=6,
                value=DEFAULT_PROMPT,
            )
            output_box = gr.Textbox(
                label="Output",
                lines=16,
                interactive=False,
            )
            with gr.Row():
                generate_btn = gr.Button("Generate", variant="primary")
                clear_btn = gr.ClearButton(
                    components=[prompt_box, output_box],
                    value="Clear",
                )

        with gr.Column(scale=1):
            max_tokens = gr.Slider(
                minimum=16,
                maximum=768,
                value=256,
                step=1,
                label="Max new tokens",
            )
            temperature = gr.Slider(
                minimum=0.0,
                maximum=2.0,
                value=0.55,
                step=0.05,
                label="Temperature",
            )
            top_k = gr.Slider(
                minimum=0,
                maximum=100,
                value=35,
                step=1,
                label="Top-K",
            )
            top_p = gr.Slider(
                minimum=0.0,
                maximum=1.0,
                value=0.88,
                step=0.01,
                label="Top-P",
            )
            repetition_penalty = gr.Slider(
                minimum=1.0,
                maximum=1.5,
                value=1.1,
                step=0.01,
                label="Repetition penalty",
            )

    gr.Examples(
        examples=EXAMPLES,
        inputs=prompt_box,
    )

    generate_btn.click(
        fn=respond,
        inputs=[
            prompt_box,
            max_tokens,
            temperature,
            top_k,
            top_p,
            repetition_penalty,
        ],
        outputs=output_box,
    )
    prompt_box.submit(
        fn=respond,
        inputs=[
            prompt_box,
            max_tokens,
            temperature,
            top_k,
            top_p,
            repetition_penalty,
        ],
        outputs=output_box,
    )

if __name__ == "__main__":
    demo.queue()
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=False,
    )

Minimal Inference Pattern
For users who want to build their own script, the minimum checkpoint download pattern is:
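import torch
from huggingface_hub import hf_hub_download

repo_id = "AlgoDriveAI/Akkadian_English_DenseLLM_1B"

# Download (or reuse from the local cache) the raw checkpoint weights.
weights_path = hf_hub_download(
    repo_id=repo_id,
    filename="pytorch_model.bin",
)

# Download the config file that describes the custom architecture.
config_path = hf_hub_download(
    repo_id=repo_id,
    filename="config.json",
)

# Load the tensors onto the CPU; move them to a GPU after the model is built.
state_dict = torch.load(weights_path, map_location="cpu")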
From there, the model must be loaded into the custom DenseLLM architecture used during training.
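The loading step itself depends on the custom model class, which is not shown here. As a minimal sketch, assuming the checkpoint is a flat mapping from parameter names to tensors and that its keys may carry wrapper prefixes such as `module.` (added by `torch.nn.DataParallel`) or `_orig_mod.` (added by `torch.compile`), the helpers below normalize key names and report mismatches before `load_state_dict` is called. The helper names are hypothetical, and whether this checkpoint actually contains such prefixes is an assumption.

```python
from typing import Dict, Iterable, List, Tuple

# Prefixes that PyTorch wrappers add to parameter names. Whether this
# checkpoint contains them is an assumption; stripping is a no-op otherwise.
WRAPPER_PREFIXES = ("module.", "_orig_mod.")


def normalize_key(key: str) -> str:
    """Strip any leading wrapper prefixes from a parameter name."""
    changed = True
    while changed:
        changed = False
        for prefix in WRAPPER_PREFIXES:
            if key.startswith(prefix):
                key = key[len(prefix):]
                changed = True
    return key


def normalize_state_dict(state_dict: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the state dict with wrapper prefixes removed."""
    return {normalize_key(k): v for k, v in state_dict.items()}


def diff_keys(
    checkpoint_keys: Iterable[str], model_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key names relative to the model."""
    ckpt, model = set(checkpoint_keys), set(model_keys)
    missing = sorted(model - ckpt)      # expected by the model, absent in the file
    unexpected = sorted(ckpt - model)   # present in the file, unknown to the model
    return missing, unexpected
```

With a model instance built from the matching custom code, a typical use would be `cleaned = normalize_state_dict(state_dict)`, then `diff_keys(cleaned, model.state_dict())` to inspect any mismatch, and finally `model.load_state_dict(cleaned)`.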
Example Prompts
Translate the following Akkadian transliteration into English. Include a literal word-by-word gloss:
šarrum bītam rabiam ana ilim ibni.
Expected meaning:
The king built a great temple for the god.
Translate the following Akkadian transliteration into English. Include grammatical notes:
ṭupšarrum awātim ina ṭuppim išṭur.
Expected meaning:
The scribe wrote the words on a clay tablet.
Translate the following Akkadian transliteration into English. Provide a literal gloss and smooth English translation:
tamkārum kaspam ana wardim iddin.
Expected meaning:
The merchant gave silver to the worker.
Translate the following Old Babylonian-style Akkadian transliteration into English. Explain the case endings if possible:
awīlum dannum abul ālim ina mūšim iṣṣur.
Expected meaning:
The strong man guarded the city gate at night.
Translate the following Akkadian transliteration into English. Explain the verb form:
ištarum ikrib nišī išme.
Expected meaning:
The goddess heard the prayer of the people.
Translate the following Akkadian transliteration into English. Include a word-by-word breakdown:
rē’ûm immerī ina eqlim imnu.
Expected meaning:
The shepherd counted the sheep in the field.
Translate the following Akkadian transliteration into English. Include brief grammar notes:
šumma nārum eli, ikkarū ihaddu.
Expected meaning:
If the river rises, the farmers will rejoice.
Translate the following Akkadian transliteration into English. Give both a literal and natural translation:
ahī ana bīt abīšu īrub.
Expected meaning:
My brother entered the house of his father.
Translate the following Akkadian transliteration into English. Include notes on nouns, verbs, and possessives:
šarratum ṭēmam ana mātim rūqtim išpur.
Expected meaning:
The queen sent a message to the distant land.
Translate the following Akkadian transliteration into English. If uncertain, explain the possible alternatives:
lū ilum šarram u ālam liṣṣur.
Expected meaning:
May the god protect the king and the city.
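The prompts above share one shape: a task sentence, an optional extra instruction, a colon, and then the transliteration on its own line. A small helper can keep that shape consistent across test prompts. This is only a sketch; `build_prompt` and its parameters are hypothetical names, and the model may respond well to other formats too.

```python
def build_prompt(text: str, instruction: str = "", old_babylonian: bool = False) -> str:
    """Assemble a translation prompt in the same shape as the examples above."""
    source = (
        "Old Babylonian-style Akkadian transliteration"
        if old_babylonian
        else "Akkadian transliteration"
    )
    task = f"Translate the following {source} into English."
    if instruction:
        # The extra instruction ends with the colon that introduces the text.
        return f"{task} {instruction.rstrip('.:')}:\n{text}"
    return f"{task}\n{text}"


prompt = build_prompt(
    "šarrum bītam rabiam ana ilim ibni.",
    "Include a literal word-by-word gloss",
)
```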
Architecture
| Component | Details |
|---|---|
| Type | Custom Dense Transformer / DenseLLM |
| Parameters | Approximately 1B |
| Attention | MLA-style attention |
| Positional Encoding | RoPE |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Task Type | Causal Language Modeling |
| Standard HF `AutoModelForCausalLM` support | No |
| Custom inference code required | Yes |

| Hyperparameter | Value |
|---|---|
| `d_model` | 1536 |
| `n_layers` | 36 |
| `n_heads` | 12 |
| `q_lora_rank` | 768 |
| `kv_lora_rank` | 384 |
| `qk_nope_head_dim` | 64 |
| `qk_rope_head_dim` | 64 |
| `v_head_dim` | 128 |
| `q_head_dim` | 128 |
| `ff_hidden_mult` | 3.5 |
| `max_seq_len` | 4096 |
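The hyperparameters can be turned into a rough parameter estimate. The sketch below assumes a DeepSeek-V2-style MLA layout (Q down/up projections, a KV down-projection that also emits the shared RoPE key, a KV up-projection, and an output projection), three SwiGLU weight matrices per layer, and no bias terms. The actual layer layout in this repository may differ, and norm weights are ignored. The vocabulary size is not stated in this section, so the embedding table is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenseLLMConfig:
    # Values copied from the hyperparameter table.
    d_model: int = 1536
    n_layers: int = 36
    n_heads: int = 12
    q_lora_rank: int = 768
    kv_lora_rank: int = 384
    qk_nope_head_dim: int = 64
    qk_rope_head_dim: int = 64
    v_head_dim: int = 128
    ff_hidden_mult: float = 3.5


def attention_params(cfg: DenseLLMConfig) -> int:
    """Weights in one attention block under the assumed MLA layout."""
    q_head_dim = cfg.qk_nope_head_dim + cfg.qk_rope_head_dim
    q_down = cfg.d_model * cfg.q_lora_rank
    q_up = cfg.q_lora_rank * cfg.n_heads * q_head_dim
    # Assumption: the KV down-projection also produces the shared RoPE key.
    kv_down = cfg.d_model * (cfg.kv_lora_rank + cfg.qk_rope_head_dim)
    kv_up = cfg.kv_lora_rank * cfg.n_heads * (cfg.qk_nope_head_dim + cfg.v_head_dim)
    out_proj = cfg.n_heads * cfg.v_head_dim * cfg.d_model
    return q_down + q_up + kv_down + kv_up + out_proj


def feed_forward_params(cfg: DenseLLMConfig) -> int:
    """Weights in one SwiGLU block: gate, up, and down projections."""
    hidden = int(cfg.d_model * cfg.ff_hidden_mult)
    return 3 * cfg.d_model * hidden


def non_embedding_params(cfg: DenseLLMConfig) -> int:
    """Total weights across all layers, excluding embeddings and norms."""
    return cfg.n_layers * (attention_params(cfg) + feed_forward_params(cfg))


total = non_embedding_params(DenseLLMConfig())
```

Under these assumptions the estimate comes to roughly 1.12B weights outside the embedding table, which is in the same range as the stated size. The tied embedding matrix would add `vocab_size * d_model` on top of that.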
Architecture Details
This model uses a custom DenseLLM architecture with MLA-style attention.
The attention layout includes:
- Q compression through `q_lora_rank`
- Fused KV down-projection
- Fused KV up-projection
- Shared RoPE key component
- QK normalization on non-RoPE dimensions
- RoPE positional encoding
- SwiGLU feed-forward layers
- RMSNorm normalization
- Weight tying between token embeddings and the language-model head
The intended attention dimensions are:
qk_nope_head_dim = 64
qk_rope_head_dim = 64
q_head_dim = 128
v_head_dim = 128
The equality between q_head_dim and v_head_dim is intentional.
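The relationships between these dimensions can be checked with a few lines of arithmetic. The sketch below verifies that the query head splits into its non-RoPE and RoPE parts, and compares the width of the compressed per-token KV representation with the fully expanded keys and values. Function names are hypothetical, and the comparison assumes the compressed latent plus the shared RoPE key is what gets stored per token, which this section does not confirm.

```python
def check_head_dims(qk_nope: int, qk_rope: int, q_head: int, v_head: int) -> None:
    """Raise if the stated head dimensions are inconsistent with each other."""
    if qk_nope + qk_rope != q_head:
        raise ValueError("q_head_dim must equal qk_nope_head_dim + qk_rope_head_dim")
    if q_head != v_head:
        raise ValueError("this configuration expects q_head_dim == v_head_dim")


def compressed_kv_width(kv_lora_rank: int, qk_rope: int) -> int:
    """Per-token, per-layer width of the KV latent plus the shared RoPE key."""
    return kv_lora_rank + qk_rope


def expanded_kv_width(n_heads: int, q_head: int, v_head: int) -> int:
    """Per-token, per-layer width of full per-head keys and values."""
    return n_heads * (q_head + v_head)


check_head_dims(qk_nope=64, qk_rope=64, q_head=128, v_head=128)
compressed = compressed_kv_width(kv_lora_rank=384, qk_rope=64)
expanded = expanded_kv_width(n_heads=12, q_head=128, v_head=128)
```

With the values from this model card, the compressed representation is 448 values per token per layer, against 3072 for the expanded keys and values.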
Training Focus
This version focuses on:
- English
- Akkadian
- Old Babylonian-style normalized transliteration
- Akkadian-to-English translation
- English-to-Akkadian style generation
- Literal glosses
- Grammatical explanations
- Prompt-following examples for Akkadian translation
Unlike the earlier mixed-language version, this release intentionally does not center Sanskrit as a training target.
Dataset Focus
The training focus was narrowed from a mixed Sanskrit/Akkadian setup into a more targeted English/Akkadian setup.
The goal is to make model behavior easier to evaluate in early testing:
- Fewer target-language interactions
- Less language blending
- Clearer Akkadian translation evaluation
- Easier debugging of glossing and grammar behavior
- More focused tests for Akkadian transliteration and Old Babylonian-style forms
Recommended Generation Settings
| Setting | Suggested Value |
|---|---|
| `temperature` | 0.3–0.7 |
| `top_k` | 30–50 |
| `top_p` | 0.75–0.9 |
| `repetition_penalty` | 1.05–1.15 |
| `max_new_tokens` | 100–300 |
For translation tasks, lower temperature usually gives more stable output.
A good default is:
temperature = 0.55
top_k = 35
top_p = 0.88
repetition_penalty = 1.1
max_new_tokens = 256
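To make the role of each setting concrete, here is a minimal pure-Python sketch of one sampling step over a list of logits. It applies a repetition penalty, then temperature, then top-k, then top-p, which is one common ordering. It is not taken from this repository's inference script, whose `respond` function may order or implement these steps differently, and the function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float], seen_ids: Sequence[int], penalty: float = 1.1
) -> List[float]:
    """Lower the logits of tokens that were already generated."""
    out = list(logits)
    for tok in set(seen_ids):
        # Dividing a positive logit or multiplying a negative one both reduce it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def filtered_probs(
    logits: Sequence[float],
    temperature: float = 0.55,
    top_k: int = 35,
    top_p: float = 0.88,
) -> Dict[int, float]:
    """Return {token_id: probability} after temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep only the k highest-scoring tokens
    peak = logits[order[0]]
    # Temperature-scaled softmax, shifted by the peak for numerical stability.
    weights = [math.exp((logits[i] - peak) / temperature) for i in order]
    total = sum(weights)
    kept, cumulative = [], 0.0
    for tok, w in zip(order, weights):
        p = w / total
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


def sample_next(
    logits: Sequence[float],
    seen_ids: Sequence[int],
    rng: random.Random,
    repetition_penalty: float = 1.1,
    **kwargs: float,
) -> int:
    """Sample one token id using the settings above."""
    penalized = apply_repetition_penalty(logits, seen_ids, repetition_penalty)
    probs = filtered_probs(penalized, **kwargs)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
```

Lower temperature sharpens the distribution before the top-k and top-p cuts, so fewer tokens survive the filters. That matches the observation that lower temperature gives more stable translation output.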
Known Limitations
- Not a scholarly authority: Outputs should be checked against reliable Akkadian grammars, dictionaries, and corpora.
- Hallucinated forms: The model may invent plausible-looking Akkadian words or endings.
- Translation uncertainty: Akkadian is highly context-dependent, and short isolated sentences may have multiple possible readings.
- Inconsistent transliteration: The model may mix normalized forms, ASCII approximations, sign-like conventions, or nonstandard spellings.
- Prompt sensitivity: The model may behave differently depending on how explicitly the prompt is written.
- Repetition: Long generations may become repetitive.
- Tokenizer sensitivity: Output quality depends heavily on using the same tokenizer setup used during training.
- Custom architecture: The model requires matching custom inference code and will not load directly with `AutoModelForCausalLM`.
Tokenizer Note
For best results, the repository should include the exact tokenizer files used during training.
If the tokenizer files are missing or incomplete, inference scripts may fall back to a compatible tokenizer and pad the tokenizer vocabulary to match the model’s embedding size. This can prevent runtime errors, but it may reduce output quality if the fallback tokenizer does not match the tokenizer used during training.
Recommended tokenizer-related files to include:
tokenizer.json
tokenizer_config.json
special_tokens_map.json
added_tokens.json
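Before running inference, it can help to check a local copy of the repository for these files and to see how far the tokenizer vocabulary is from the embedding size. The sketch below uses only the standard library; the function names are hypothetical, and the caller supplies the vocabulary size and embedding row count because neither is stated in this section.

```python
from pathlib import Path
from typing import List

RECOMMENDED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "added_tokens.json",
)


def missing_tokenizer_files(repo_dir: str) -> List[str]:
    """List the recommended tokenizer files that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in RECOMMENDED_TOKENIZER_FILES if not (root / name).is_file()]


def padding_needed(tokenizer_vocab_size: int, embedding_rows: int) -> int:
    """Number of placeholder tokens needed to reach the embedding size."""
    if tokenizer_vocab_size > embedding_rows:
        # Token ids would index past the embedding table; padding cannot fix this.
        raise ValueError("tokenizer vocabulary is larger than the embedding table")
    return embedding_rows - tokenizer_vocab_size
```

A non-empty result from `missing_tokenizer_files`, or a large value from `padding_needed`, suggests that a fallback tokenizer is in use and that output quality may suffer as described above.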
Suggested Use Cases
This model may be useful for:
- Testing Akkadian translation prompts
- Exploring Old Babylonian-style normalized transliteration
- Studying how small custom LLMs behave on ancient-language tasks
- Comparing mixed-language vs separated-language training
- Research experiments in low-resource language modeling
Out-of-Scope Uses
This model should not be used as the sole authority for:
- Academic publication translations
- Legal, religious, or historical claims
- Primary-source interpretation without expert review
- High-confidence philological analysis
- Production translation systems
Feedback Welcome
Feedback is especially useful on:
- Akkadian translation accuracy
- Gloss quality
- Verb parsing
- Case ending interpretation
- Old Babylonian-style grammar
- Prompt formats that work well
- Failure cases where the model produces plausible but wrong analysis
- Repetition or language-blending behavior
Contact
Organization: AlgoDriveAI
Author: Christopher Smith
Base / previous version: AlgoDriveAI/Sanskrit_Akkadian_LLM
Repository: AlgoDriveAI/Akkadian_English_DenseLLM_1B
Citation
@misc{algodrive2026akkadian_english_dense_1b,
author = {{AlgoDriveAI} and Christopher Smith},
title = {Akkadian-English DenseLLM 1B},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/AlgoDriveAI/Akkadian_English_DenseLLM_1B}
}
License
MIT