Akkadian-English DenseLLM 1B

Akkadian-English DenseLLM 1B is an experimental custom DenseLLM model focused on English ↔ Akkadian / Old Babylonian translation, transliteration, glossing, and grammatical analysis.

This model is a continuation of the earlier AlgoDriveAI/Sanskrit_Akkadian_LLM project. The previous version mixed English, Sanskrit, Akkadian, and some auxiliary material. For this version, the model size was increased to approximately 1B parameters, and the training focus was narrowed to English and Akkadian only.

The main reason for this change is that early testing suggested the model is easier to evaluate and improve when language targets are kept separate. Instead of combining English, Akkadian, and Sanskrit in one test version, this release focuses specifically on English/Akkadian behavior.

This version achieves higher accuracy on Akkadian-focused translation, glossing, and analysis tasks compared with the earlier mixed-language testing versions.

What Is This?

This is a research model for ancient-language experimentation, especially:

Akkadian-to-English translation
English-to-Akkadian style generation
Old Babylonian-style normalized transliteration
Literal word-by-word glossing
Grammatical explanation
Case-ending analysis
Verb-form explanation
Ancient-language prompt-following behavior

This is not a production translation system. Outputs should be checked against reliable Akkadian grammars, dictionaries, corpora, and primary sources.

Relationship to the Previous Version

Compared with the earlier Sanskrit + Akkadian model:

The model size was increased to approximately 1B parameters
Sanskrit was removed from the main training target
The focus was narrowed to English/Akkadian
The model is intended to achieve higher accuracy on Akkadian-specific tasks
The architecture remains a custom DenseLLM-style causal language model
The model uses a custom MLA-style attention architecture
Inference is handled with custom model code rather than the standard AutoModelForCausalLM class

Important Inference Note

This is not a standard Hugging Face AutoModelForCausalLM checkpoint.

The repository currently uses:

pytorch_model.bin for model weights
config.json for partial configuration metadata
tokenizer files, if available

Because this is a custom DenseLLM / MLA architecture, inference code must instantiate the matching model class manually. Some versions of config.json may not include every MLA-specific field, so the safest approach is to infer architecture values from the checkpoint tensor shapes first and fall back to the known training architecture defaults when needed.
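As a rough illustration of the shape-based inference described above, the sketch below recovers the MLA dimensions from the shapes of a few first-block weights. It is a simplified stand-in rather than the logic of the full script: `infer_mla_dims` and `EXAMPLE_SHAPES` are hypothetical names, the shapes are written as plain tuples instead of being read from pytorch_model.bin, the vocabulary size of 32768 is a placeholder, and the head count and RoPE width are assumed from the training defaults because weight shapes alone do not determine them uniquely.

```python
# Sketch only: derive MLA hyperparameters from weight shapes.
# `shapes` maps a checkpoint key to (out_features, in_features). In practice
# each tuple would come from `tensor.shape` in the loaded state dict.

# Assumed defaults, taken from the architecture listed for this release.
DEFAULT_N_HEADS = 12
DEFAULT_QK_ROPE_HEAD_DIM = 64


def infer_mla_dims(shapes, n_heads=DEFAULT_N_HEADS, rope_dim=DEFAULT_QK_ROPE_HEAD_DIM):
    vocab_size, d_model = shapes["embed.weight"]
    q_lora_rank = shapes["blocks.0.attn.q_a_proj.weight"][0]
    q_b_out = shapes["blocks.0.attn.q_b_proj.weight"][0]
    kv_a_out = shapes["blocks.0.attn.kv_a_proj.weight"][0]
    kv_b_out = shapes["blocks.0.attn.kv_b_proj.weight"][0]
    o_proj_in = shapes["blocks.0.attn.o_proj.weight"][1]

    # The assumed head count must divide both projection widths.
    if q_b_out % n_heads or o_proj_in % n_heads:
        raise ValueError("n_heads does not divide the projection widths")

    q_head_dim = q_b_out // n_heads
    v_head_dim = o_proj_in // n_heads
    qk_nope_head_dim = q_head_dim - rope_dim
    # kv_a_proj emits the compressed KV latent plus the shared RoPE key.
    kv_lora_rank = kv_a_out - rope_dim

    # Cross-check: kv_b_proj emits per-head (k_nope, v) pairs.
    if kv_b_out != n_heads * (qk_nope_head_dim + v_head_dim):
        raise ValueError("kv_b_proj shape disagrees with the inferred head dims")

    return {
        "vocab_size": vocab_size,
        "d_model": d_model,
        "n_heads": n_heads,
        "q_lora_rank": q_lora_rank,
        "kv_lora_rank": kv_lora_rank,
        "qk_nope_head_dim": qk_nope_head_dim,
        "qk_rope_head_dim": rope_dim,
        "v_head_dim": v_head_dim,
    }


# Hypothetical shapes consistent with the expected architecture.
# The vocabulary size (32768) is a placeholder, not a documented value.
EXAMPLE_SHAPES = {
    "embed.weight": (32768, 1536),
    "blocks.0.attn.q_a_proj.weight": (768, 1536),
    "blocks.0.attn.q_b_proj.weight": (12 * 128, 768),
    "blocks.0.attn.kv_a_proj.weight": (384 + 64, 1536),
    "blocks.0.attn.kv_b_proj.weight": (12 * (64 + 128), 384),
    "blocks.0.attn.o_proj.weight": (1536, 12 * 128),
}

if __name__ == "__main__":
    for name, value in infer_mla_dims(EXAMPLE_SHAPES).items():
        print(f"{name}: {value}")
```

The cross-check on `kv_b_proj` is what catches a wrong head count or RoPE width: if either assumption is off, the expected output width no longer matches the stored tensor.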

The expected architecture for this release is:

| Hyperparameter | Value |
|---|---|
| `d_model` | 1536 |
| `n_layers` | 36 |
| `n_heads` | 12 |
| `q_lora_rank` | 768 |
| `kv_lora_rank` | 384 |
| `qk_nope_head_dim` | 64 |
| `qk_rope_head_dim` | 64 |
| `v_head_dim` | 128 |
| `ff_hidden_mult` | 3.5 |
| `max_seq_len` | 4096 |
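These values can be cross-checked against the "approximately 1B parameters" figure with some arithmetic. The sketch below counts weights per block under assumptions drawn from the inference script in this card: bias-free linear layers, RMSNorm weights only, QK norm enabled, and a SwiGLU inner width rounded up to a multiple of 256. The function names are hypothetical, and the embedding matrix is left out because the vocabulary size is not stated here.

```python
# Sketch only: estimate the non-embedding parameter count from the
# hyperparameters listed for this release.

def swiglu_inner(d_model, hidden_mult, multiple=256):
    """Inner FFN width, rounded up to a multiple of `multiple`."""
    inner = int(hidden_mult * d_model)
    return ((inner + multiple - 1) // multiple) * multiple


def block_params(
    d_model=1536,
    n_heads=12,
    q_lora_rank=768,
    kv_lora_rank=384,
    nope=64,
    rope=64,
    v=128,
    ff_mult=3.5,
    qk_norm=True,  # assumption: QK norm is enabled, as in the script defaults
):
    q_head_dim = nope + rope
    attn = (
        d_model * q_lora_rank                    # q_a_proj
        + q_lora_rank                            # q_a_norm
        + q_lora_rank * n_heads * q_head_dim     # q_b_proj
        + d_model * (kv_lora_rank + rope)        # kv_a_proj
        + kv_lora_rank                           # kv_a_norm
        + kv_lora_rank * n_heads * (nope + v)    # kv_b_proj
        + n_heads * v * d_model                  # o_proj
    )
    if qk_norm:
        attn += 2 * nope                         # q_nope_norm + k_nope_norm
    inner = swiglu_inner(d_model, ff_mult)
    ff = d_model * 2 * inner + inner * d_model   # gate_up_proj + down_proj
    norms = 2 * d_model                          # ln_attn + ln_ff
    return attn + ff + norms


def non_embedding_params(n_layers=36, d_model=1536):
    # All blocks plus the final RMSNorm; the tied embedding is excluded.
    return n_layers * block_params(d_model=d_model) + d_model


if __name__ == "__main__":
    print("inner FFN width:", swiglu_inner(1536, 3.5))
    print("params per block:", block_params())
    print("non-embedding params:", non_embedding_params())
```

Under these assumptions each block holds 31,068,416 weights, and the 36 blocks plus the final norm hold about 1.12B. The tied embedding adds vocab_size × 1536 on top of that.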

Install

pip install torch transformers huggingface_hub gradio

Optional:

pip install einops

Quick Usage

The easiest way to run the model locally is to copy the full Gradio script below into a file such as:

app.py

Then run:

python app.py

The script loads:

AlgoDriveAI/Akkadian_English_DenseLLM_1B/pytorch_model.bin

and uses:

REPO_ID = "AlgoDriveAI/Akkadian_English_DenseLLM_1B"
WEIGHTS_FILENAME = "pytorch_model.bin"
CONFIG_FILENAME = "config.json"

The Gradio script does not require modeling_dense_llm.py to be uploaded, because the model architecture is included directly inside the script.

Full Gradio Inference App

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Gradio inference app for:
    AlgoDriveAI/Akkadian_English_DenseLLM_1B

Fixes included:
- Does NOT trust bad/partial config.json values for architecture.
- Infers core architecture from pytorch_model.bin tensor shapes first.
- Uses the training MLA architecture defaults:
    d_model=1536
    n_layers=36
    n_heads=12
    q_lora_rank=768
    kv_lora_rank=384
    qk_nope_head_dim=64
    qk_rope_head_dim=64
    v_head_dim=128
    ff_hidden_mult=3.5
    max_seq_len=4096
- Defines DenseLLM directly in this script.
- Does NOT require modeling_dense_llm.py to be uploaded.
- Launches a streaming Gradio UI.

Install:
    pip install torch transformers huggingface_hub gradio
"""

import os
import re
import json
from dataclasses import dataclass
from typing import Optional, Dict, Any

import torch
import torch.nn as nn
import torch.nn.functional as F
import gradio as gr

from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer


# =============================================================================
# REPO SETTINGS
# =============================================================================

REPO_ID = "AlgoDriveAI/Akkadian_English_DenseLLM_1B"

CONFIG_FILENAME = "config.json"
WEIGHTS_FILENAME = "pytorch_model.bin"

FALLBACK_TOKENIZER = "mistralai/Mistral-7B-Instruct-v0.3"
DOC_EOS_TOKEN = "<|endoftext|>"


# =============================================================================
# KNOWN TRAINING ARCHITECTURE FALLBACKS
# =============================================================================

TRAINING_D_MODEL = 1536
TRAINING_N_LAYERS = 36
TRAINING_N_HEADS = 12

TRAINING_Q_LORA_RANK = 768
TRAINING_KV_LORA_RANK = 384

TRAINING_QK_NOPE_HEAD_DIM = 64
TRAINING_QK_ROPE_HEAD_DIM = 64
TRAINING_V_HEAD_DIM = 128

TRAINING_FF_MULT = 3.5
TRAINING_QK_NORM = True
TRAINING_MAX_SEQ_LEN = 4096


# =============================================================================
# MODEL ARCHITECTURE
# =============================================================================

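try:
    # torch.nn.RMSNorm is available in PyTorch 2.4 and later.
    from torch.nn import RMSNorm
except ImportError:
    # Fallback for older PyTorch. The built-in module defaults to eps=None
    # (machine epsilon of the input dtype); this fallback uses eps=1e-6.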
    class RMSNorm(nn.Module):
        def __init__(self, normalized_shape, eps: float = 1e-6):
            super().__init__()
            if isinstance(normalized_shape, int):
                normalized_shape = (normalized_shape,)
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(normalized_shape))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.weight * (
                x.float()
                * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
            ).to(x.dtype)


class RotaryEmbedding(nn.Module):
    def __init__(self, dim: int, base: float = 10000.0, max_seq_len: int = 8192):
        super().__init__()

        inv_freq = 1.0 / (
            base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
        )

        t = torch.arange(max_seq_len, dtype=torch.float32)
        freqs = torch.outer(t, inv_freq)

        self.register_buffer(
            "cos_cached",
            freqs.cos().repeat(1, 2),
            persistent=False,
        )

        self.register_buffer(
            "sin_cached",
            freqs.sin().repeat(1, 2),
            persistent=False,
        )

    def forward(self, seq_len: int, dtype: torch.dtype):
        return (
            self.cos_cached[:seq_len].to(dtype),
            self.sin_cached[:seq_len].to(dtype),
        )


def _rotate_half(x: torch.Tensor) -> torch.Tensor:
    half = x.shape[-1] // 2
    return torch.cat([-x[..., half:], x[..., :half]], dim=-1)


def apply_rotary_emb(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
):
    cos = cos.unsqueeze(0).unsqueeze(0)
    sin = sin.unsqueeze(0).unsqueeze(0)

    q_out = (q * cos) + (_rotate_half(q) * sin)
    k_out = (k * cos) + (_rotate_half(k) * sin)

    return q_out, k_out


class MLA(nn.Module):
    """
    MLA-style attention matching the training code:
    - Q compression
    - fused KV down projection
    - fused KV up projection
    - shared k_rope
    - optional QK norm
    """

    def __init__(
        self,
        d_model: int,
        n_heads: int,
        q_lora_rank: int,
        kv_lora_rank: int,
        qk_nope_head_dim: int,
        qk_rope_head_dim: int,
        v_head_dim: int,
        rope: RotaryEmbedding,
        qk_norm: bool = False,
        attn_dropout: float = 0.0,
    ):
        super().__init__()

        self.n_heads = n_heads
        self.q_lora_rank = q_lora_rank
        self.kv_lora_rank = kv_lora_rank
        self.qk_nope_head_dim = qk_nope_head_dim
        self.qk_rope_head_dim = qk_rope_head_dim
        self.q_head_dim = qk_nope_head_dim + qk_rope_head_dim
        self.v_head_dim = v_head_dim
        self.attn_drop = attn_dropout
        self.rope = rope

        self.q_a_proj = nn.Linear(d_model, q_lora_rank, bias=False)
        self.q_a_norm = RMSNorm(q_lora_rank)
        self.q_b_proj = nn.Linear(
            q_lora_rank,
            n_heads * self.q_head_dim,
            bias=False,
        )

        self.kv_a_proj = nn.Linear(
            d_model,
            kv_lora_rank + qk_rope_head_dim,
            bias=False,
        )
        self.kv_a_norm = RMSNorm(kv_lora_rank)

        self.kv_b_proj = nn.Linear(
            kv_lora_rank,
            n_heads * (qk_nope_head_dim + v_head_dim),
            bias=False,
        )

        self.o_proj = nn.Linear(
            n_heads * v_head_dim,
            d_model,
            bias=False,
        )
        self.o_proj._is_residual = True

        self.qk_norm = qk_norm
        if qk_norm:
            self.q_nope_norm = RMSNorm(qk_nope_head_dim)
            self.k_nope_norm = RMSNorm(qk_nope_head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape

        H = self.n_heads
        nope_dim = self.qk_nope_head_dim
        rope_dim = self.qk_rope_head_dim
        v_dim = self.v_head_dim
        q_dim = self.q_head_dim

        q = self.q_b_proj(self.q_a_norm(self.q_a_proj(x)))
        q = q.view(B, T, H, q_dim)

        q_nope = q[..., :nope_dim]
        q_rope = q[..., nope_dim:]

        kv_a = self.kv_a_proj(x)
        c, k_rope = torch.split(
            kv_a,
            [self.kv_lora_rank, rope_dim],
            dim=-1,
        )

        c = self.kv_a_norm(c)
        k_rope = k_rope.view(B, T, 1, rope_dim)

        kv = self.kv_b_proj(c)
        kv = kv.view(B, T, H, nope_dim + v_dim)

        k_nope, v = torch.split(kv, [nope_dim, v_dim], dim=-1)

        if self.qk_norm:
            q_nope = self.q_nope_norm(q_nope)
            k_nope = self.k_nope_norm(k_nope)

        q_nope = q_nope.transpose(1, 2)
        q_rope = q_rope.transpose(1, 2)
        k_nope = k_nope.transpose(1, 2)
        k_rope = k_rope.transpose(1, 2)
        v = v.transpose(1, 2)

        cos, sin = self.rope(T, x.dtype)
        q_rope, k_rope = apply_rotary_emb(q_rope, k_rope, cos, sin)

        q = torch.cat([q_nope, q_rope], dim=-1)

        k_rope = k_rope.expand(B, H, T, rope_dim)
        k = torch.cat([k_nope, k_rope], dim=-1)

        drop_p = self.attn_drop if self.training else 0.0

        out = F.scaled_dot_product_attention(
            q,
            k,
            v,
            dropout_p=drop_p,
            is_causal=True,
        )

        out = out.transpose(1, 2).reshape(B, T, H * v_dim)

        return self.o_proj(out)


class SwiGLU(nn.Module):
    def __init__(self, d_model: int, hidden_mult: float = 3.5):
        super().__init__()

        inner = int(hidden_mult * d_model)
        inner = ((inner + 255) // 256) * 256

        self.gate_up_proj = nn.Linear(d_model, 2 * inner, bias=False)
        self.down_proj = nn.Linear(inner, d_model, bias=False)
        self.down_proj._is_residual = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        return self.down_proj(F.silu(gate) * up)


class Block(nn.Module):
    def __init__(
        self,
        d_model: int,
        n_heads: int,
        q_lora_rank: int,
        kv_lora_rank: int,
        qk_nope_head_dim: int,
        qk_rope_head_dim: int,
        v_head_dim: int,
        rope: RotaryEmbedding,
        ff_hidden_mult: float = 3.5,
        qk_norm: bool = False,
        attn_dropout: float = 0.0,
        resid_dropout: float = 0.0,
    ):
        super().__init__()

        self.ln_attn = RMSNorm(d_model)
        self.ln_ff = RMSNorm(d_model)

        self.attn = MLA(
            d_model=d_model,
            n_heads=n_heads,
            q_lora_rank=q_lora_rank,
            kv_lora_rank=kv_lora_rank,
            qk_nope_head_dim=qk_nope_head_dim,
            qk_rope_head_dim=qk_rope_head_dim,
            v_head_dim=v_head_dim,
            rope=rope,
            qk_norm=qk_norm,
            attn_dropout=attn_dropout,
        )

        self.ff = SwiGLU(
            d_model=d_model,
            hidden_mult=ff_hidden_mult,
        )

        self.resid_drop = (
            nn.Dropout(resid_dropout)
            if resid_dropout > 0
            else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.resid_drop(self.attn(self.ln_attn(x)))
        x = x + self.resid_drop(self.ff(self.ln_ff(x)))
        return x


@dataclass
class ModelConfig:
    vocab_size: int
    d_model: int
    n_layers: int
    n_heads: int
    q_lora_rank: int
    kv_lora_rank: int
    qk_nope_head_dim: int
    qk_rope_head_dim: int
    v_head_dim: int
    ff_hidden_mult: float
    qk_norm: bool

    max_seq_len: int = 4096
    attn_dropout: float = 0.0
    resid_dropout: float = 0.0
    emb_dropout: float = 0.0
    label_smoothing: float = 0.0

    @property
    def q_head_dim(self) -> int:
        return self.qk_nope_head_dim + self.qk_rope_head_dim


class DenseLLM(nn.Module):
    def __init__(
        self,
        cfg: ModelConfig,
        use_gradient_checkpointing: bool = False,
    ):
        super().__init__()

        self.cfg = cfg
        self.use_gradient_checkpointing = use_gradient_checkpointing

        self.embed = nn.Embedding(cfg.vocab_size, cfg.d_model)

        self.emb_drop = (
            nn.Dropout(cfg.emb_dropout)
            if cfg.emb_dropout > 0
            else nn.Identity()
        )

        self.rope = RotaryEmbedding(
            dim=cfg.qk_rope_head_dim,
            max_seq_len=cfg.max_seq_len,
        )

        self.blocks = nn.ModuleList([
            Block(
                d_model=cfg.d_model,
                n_heads=cfg.n_heads,
                q_lora_rank=cfg.q_lora_rank,
                kv_lora_rank=cfg.kv_lora_rank,
                qk_nope_head_dim=cfg.qk_nope_head_dim,
                qk_rope_head_dim=cfg.qk_rope_head_dim,
                v_head_dim=cfg.v_head_dim,
                rope=self.rope,
                ff_hidden_mult=cfg.ff_hidden_mult,
                qk_norm=cfg.qk_norm,
                attn_dropout=cfg.attn_dropout,
                resid_dropout=cfg.resid_dropout,
            )
            for _ in range(cfg.n_layers)
        ])

        self.ln_f = RMSNorm(cfg.d_model)
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)

        self.apply(self._init_weights)

        scale = (2 * cfg.n_layers) ** -0.5
        for module in self.modules():
            if getattr(module, "_is_residual", False):
                with torch.no_grad():
                    module.weight.mul_(scale)

        self.lm_head.weight = self.embed.weight

    @staticmethod
    def _init_weights(module: nn.Module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(
        self,
        idx: torch.Tensor,
        targets: Optional[torch.Tensor] = None,
    ):
        x = self.emb_drop(self.embed(idx))

        for block in self.blocks:
            x = block(x)

        logits = self.lm_head(self.ln_f(x))

        loss = None

        if targets is not None:
            loss = F.cross_entropy(
                logits[:, :-1].contiguous().view(-1, logits.size(-1)),
                targets[:, 1:].contiguous().view(-1),
                label_smoothing=self.cfg.label_smoothing,
            )

        return logits, loss


# =============================================================================
# LOADING HELPERS
# =============================================================================

def load_json_from_hf(repo_id: str, filename: str):
    path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
    )

    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    return data, path


def load_state_dict_safely(weights_path: str):
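    try:
        # Prefer the restricted unpickler when this torch version supports it.
        obj = torch.load(
            weights_path,
            map_location="cpu",
            weights_only=True,
        )
    except Exception as e:
        # Older torch versions reject the weights_only argument (TypeError),
        # and the restricted loader can refuse some checkpoints. The retry
        # below uses torch.load defaults, so only use it with trusted files.
        print(f"weights_only=True load failed ({e}); retrying with defaults.")
        obj = torch.load(weights_path, map_location="cpu")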

    if isinstance(obj, dict):
        if "model" in obj:
            obj = obj["model"]
        elif "model_state_dict" in obj:
            obj = obj["model_state_dict"]
        elif "state_dict" in obj:
            obj = obj["state_dict"]

    if not isinstance(obj, dict):
        raise TypeError("Loaded weights object is not a state_dict dictionary.")

    cleaned = {}

    for key, value in obj.items():
        if key.startswith("_orig_mod."):
            key = key.removeprefix("_orig_mod.")
        if key.startswith("module."):
            key = key.removeprefix("module.")
        cleaned[key] = value

    return cleaned


def find_key(state_dict: Dict[str, torch.Tensor], *suffixes: str) -> Optional[str]:
    """
    Finds a key by exact match first, then by suffix.
    Useful if the checkpoint has prefixes.
    """
    for suffix in suffixes:
        if suffix in state_dict:
            return suffix

    for key in state_dict.keys():
        for suffix in suffixes:
            if key.endswith(suffix):
                return key

    return None


def infer_n_layers(state_dict: Dict[str, torch.Tensor]) -> Optional[int]:
    block_indices = []

    for key in state_dict.keys():
        match = re.search(r"(?:^|\.)blocks\.(\d+)\.", key)
        if match:
            block_indices.append(int(match.group(1)))

    if not block_indices:
        return None

    return max(block_indices) + 1


def choose_n_heads(
    q_b_out: int,
    o_proj_in: int,
    d_model: int,
    raw_config: Dict[str, Any],
) -> int:
    """
    Chooses n_heads from checkpoint shapes.

    For this training run:
      q_b_out = n_heads * q_head_dim
      o_proj_in = n_heads * v_head_dim
      q_head_dim = 128
      v_head_dim = 128
      n_heads = 12
      d_model = 1536
    """

    candidates = []

    # Prefer raw config only if it is compatible with checkpoint shapes.
    for key in ["n_heads", "num_attention_heads", "num_heads"]:
        if key in raw_config:
            try:
                h = int(raw_config[key])
                if h > 0 and q_b_out % h == 0 and o_proj_in % h == 0:
                    q_head_dim = q_b_out // h
                    v_head_dim = o_proj_in // h
                    candidates.append((h, q_head_dim, v_head_dim, "raw_config"))
            except Exception:
                pass

    # Add known training value if compatible.
    h = TRAINING_N_HEADS
    if q_b_out % h == 0 and o_proj_in % h == 0:
        candidates.append((h, q_b_out // h, o_proj_in // h, "training_default"))

    # General divisors.
    for h in range(1, 129):
        if q_b_out % h == 0 and o_proj_in % h == 0:
            q_head_dim = q_b_out // h
            v_head_dim = o_proj_in // h
            candidates.append((h, q_head_dim, v_head_dim, "divisor_search"))

    # Best case: q_head_dim == v_head_dim == 128 and h * v_head_dim == d_model.
    for h, qhd, vhd, source in candidates:
        if qhd == 128 and vhd == 128 and h * vhd == d_model:
            return h

    # Next: q_head_dim == v_head_dim and h * v_head_dim == d_model.
    for h, qhd, vhd, source in candidates:
        if qhd == vhd and h * vhd == d_model:
            return h

    # Next: known training default if compatible.
    for h, qhd, vhd, source in candidates:
        if source == "training_default":
            return h

    # Last: raw config if compatible.
    for h, qhd, vhd, source in candidates:
        if source == "raw_config":
            return h

    raise ValueError(
        f"Could not infer n_heads from shapes: q_b_out={q_b_out}, "
        f"o_proj_in={o_proj_in}, d_model={d_model}"
    )


def build_model_config_from_checkpoint(
    state_dict: Dict[str, torch.Tensor],
    raw_config: Dict[str, Any],
) -> Dict[str, Any]:
    """
    Build the keyword arguments for ModelConfig from checkpoint shapes first.

    This avoids trusting partial or misleading config.json values.
    The checkpoint is the source of truth.
    """

    embed_key = find_key(state_dict, "embed.weight")
    q_a_key = find_key(state_dict, "blocks.0.attn.q_a_proj.weight")
    q_b_key = find_key(state_dict, "blocks.0.attn.q_b_proj.weight")
    kv_a_key = find_key(state_dict, "blocks.0.attn.kv_a_proj.weight")
    kv_b_key = find_key(state_dict, "blocks.0.attn.kv_b_proj.weight")
    o_proj_key = find_key(state_dict, "blocks.0.attn.o_proj.weight")
    gate_up_key = find_key(state_dict, "blocks.0.ff.gate_up_proj.weight")

    required_keys = {
        "embed.weight": embed_key,
        "blocks.0.attn.q_a_proj.weight": q_a_key,
        "blocks.0.attn.q_b_proj.weight": q_b_key,
        "blocks.0.attn.kv_a_proj.weight": kv_a_key,
        "blocks.0.attn.kv_b_proj.weight": kv_b_key,
        "blocks.0.attn.o_proj.weight": o_proj_key,
        "blocks.0.ff.gate_up_proj.weight": gate_up_key,
    }

    missing = [name for name, key in required_keys.items() if key is None]
    if missing:
        print("\nAvailable state_dict keys sample:")
        for k in list(state_dict.keys())[:80]:
            print(" ", k)
        raise KeyError("Missing expected checkpoint keys: " + ", ".join(missing))

    embed = state_dict[embed_key]
    q_a = state_dict[q_a_key]
    q_b = state_dict[q_b_key]
    kv_a = state_dict[kv_a_key]
    kv_b = state_dict[kv_b_key]
    o_proj = state_dict[o_proj_key]
    gate_up = state_dict[gate_up_key]

    vocab_size = int(embed.shape[0])
    d_model = int(embed.shape[1])

    n_layers = infer_n_layers(state_dict)
    if n_layers is None:
        n_layers = TRAINING_N_LAYERS

    q_lora_rank = int(q_a.shape[0])

    q_b_out = int(q_b.shape[0])
    kv_a_out = int(kv_a.shape[0])
    kv_b_out = int(kv_b.shape[0])
    o_proj_in = int(o_proj.shape[1])

    n_heads = choose_n_heads(
        q_b_out=q_b_out,
        o_proj_in=o_proj_in,
        d_model=d_model,
        raw_config=raw_config,
    )

    q_head_dim = q_b_out // n_heads
    v_head_dim = o_proj_in // n_heads

    # Prefer the training split of q_head_dim=64+64 when q_head_dim=128.
    if q_head_dim == (
        TRAINING_QK_NOPE_HEAD_DIM + TRAINING_QK_ROPE_HEAD_DIM
    ):
        qk_nope_head_dim = TRAINING_QK_NOPE_HEAD_DIM
        qk_rope_head_dim = TRAINING_QK_ROPE_HEAD_DIM
    else:
        # Fallback: split evenly.
        qk_nope_head_dim = q_head_dim // 2
        qk_rope_head_dim = q_head_dim - qk_nope_head_dim

    kv_lora_rank = kv_a_out - qk_rope_head_dim

    # Cross-check kv_b shape:
    # kv_b_out = n_heads * (qk_nope_head_dim + v_head_dim)
    expected_kv_b_out = n_heads * (qk_nope_head_dim + v_head_dim)

    if kv_b_out != expected_kv_b_out:
        # Try the training defaults before failing.
        qk_nope_head_dim = TRAINING_QK_NOPE_HEAD_DIM
        qk_rope_head_dim = TRAINING_QK_ROPE_HEAD_DIM
        v_head_dim = TRAINING_V_HEAD_DIM
        kv_lora_rank = kv_a_out - qk_rope_head_dim
        expected_kv_b_out = n_heads * (qk_nope_head_dim + v_head_dim)

        if kv_b_out != expected_kv_b_out:
            raise ValueError(
                "Could not reconcile kv_b shape.\n"
                f"kv_b_out={kv_b_out}\n"
                f"expected={expected_kv_b_out}\n"
                f"n_heads={n_heads}, nope={qk_nope_head_dim}, v={v_head_dim}"
            )

    inner = int(gate_up.shape[0]) // 2

    inferred_ff_mult = inner / float(d_model)

    training_inner = ((int(TRAINING_FF_MULT * d_model) + 255) // 256) * 256
    if training_inner == inner:
        ff_hidden_mult = TRAINING_FF_MULT
    else:
        ff_hidden_mult = inferred_ff_mult

    max_seq_len = raw_config.get(
        "max_seq_len",
        raw_config.get("context_len", TRAINING_MAX_SEQ_LEN),
    )

    cfg = {
        "vocab_size": vocab_size,
        "d_model": d_model,
        "n_layers": n_layers,
        "n_heads": n_heads,
        "q_lora_rank": q_lora_rank,
        "kv_lora_rank": kv_lora_rank,
        "qk_nope_head_dim": qk_nope_head_dim,
        "qk_rope_head_dim": qk_rope_head_dim,
        "v_head_dim": v_head_dim,
        "ff_hidden_mult": ff_hidden_mult,
        "qk_norm": bool(raw_config.get("qk_norm", TRAINING_QK_NORM)),
        "max_seq_len": int(max_seq_len),

        # Inference-time dropout/smoothing should be off.
        "attn_dropout": 0.0,
        "resid_dropout": 0.0,
        "emb_dropout": 0.0,
        "label_smoothing": 0.0,
    }

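    # The two checks below encode this release's layout
    # (q_head_dim == v_head_dim and n_heads * v_head_dim == d_model).
    # The MLA module above does not itself require either equality.
    if cfg["qk_nope_head_dim"] + cfg["qk_rope_head_dim"] != cfg["v_head_dim"]: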
        raise ValueError(
            f"Bad inferred config: q_head_dim="
            f"{cfg['qk_nope_head_dim'] + cfg['qk_rope_head_dim']} "
            f"but v_head_dim={cfg['v_head_dim']}"
        )

    if cfg["d_model"] != cfg["n_heads"] * cfg["v_head_dim"]:
        raise ValueError(
            f"Bad inferred config: d_model={cfg['d_model']} but "
            f"n_heads*v_head_dim={cfg['n_heads'] * cfg['v_head_dim']}"
        )

    return cfg


def load_tokenizer_for_model(
    repo_id: str,
    raw_config: Dict[str, Any],
    target_vocab_size: int,
):
    """
    Loads tokenizer.

    First tries repo tokenizer. If it looks suspiciously tiny, falls back to
    the training tokenizer from the training script.
    """

    tokenizer = None

    try:
        print("Loading tokenizer from model repo...")
        tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
        print(f"Repo tokenizer loaded. vocab={len(tokenizer):,}")
    except Exception as e:
        print(f"Could not load tokenizer from repo: {e}")

    if tokenizer is not None and len(tokenizer) < 1000 and target_vocab_size > 1000:
        print(
            f"Repo tokenizer looks too small: vocab={len(tokenizer):,}, "
            f"model vocab={target_vocab_size:,}. Ignoring repo tokenizer."
        )
        tokenizer = None

    if tokenizer is None:
        fallback_name = raw_config.get("vocab_name", FALLBACK_TOKENIZER)
        print(f"Loading fallback tokenizer: {fallback_name}")
        tokenizer = AutoTokenizer.from_pretrained(fallback_name, use_fast=True)

    doc_eos_token = raw_config.get("doc_eos_token", DOC_EOS_TOKEN)

    if doc_eos_token not in tokenizer.get_vocab():
        tokenizer.add_special_tokens({
            "additional_special_tokens": [doc_eos_token],
        })

    tokenizer.doc_eos_token = doc_eos_token
    tokenizer.doc_eos_token_id = tokenizer.convert_tokens_to_ids(doc_eos_token)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = doc_eos_token
        tokenizer.pad_token_id = tokenizer.doc_eos_token_id

    if len(tokenizer) < target_vocab_size:
        needed = target_vocab_size - len(tokenizer)
        print(f"Adding {needed:,} dummy tokens to match model vocab_size={target_vocab_size:,}")
        tokenizer.add_tokens(
            [f"<|dummy_infer_{i}|>" for i in range(needed)],
            special_tokens=False,
        )

    if len(tokenizer) > target_vocab_size:
        print(
            f"WARNING: tokenizer vocab={len(tokenizer):,} is larger than "
            f"model vocab={target_vocab_size:,}.\n"
            "Input token IDs above model vocab will be remapped to EOS/doc token.\n"
            "For best quality, upload the exact tokenizer files saved during training."
        )

    tokenizer.model_max_length = int(1e9)

    print(
        f"Tokenizer ready. tokenizer_vocab={len(tokenizer):,}, "
        f"model_vocab={target_vocab_size:,}, eos_id={tokenizer.doc_eos_token_id}"
    )

    return tokenizer


def sanitize_input_ids(
    input_ids: torch.Tensor,
    model_vocab_size: int,
    fallback_token_id: int,
):
    """
    Prevent embedding-index errors if fallback tokenizer produces IDs outside
    the model vocab.
    """

    if input_ids.numel() == 0:
        return input_ids

    if input_ids.max().item() >= model_vocab_size:
        input_ids = input_ids.clone()
        input_ids[input_ids >= model_vocab_size] = fallback_token_id

    return input_ids


# =============================================================================
# LOAD CONFIG, WEIGHTS, TOKENIZER, MODEL
# =============================================================================

print("Downloading config...")
raw_config, config_path = load_json_from_hf(REPO_ID, CONFIG_FILENAME)
print(f"Config path: {config_path}")

print("\nDownloading weights...")
weights_path = hf_hub_download(
    repo_id=REPO_ID,
    filename=WEIGHTS_FILENAME,
)
print(f"Weights path: {weights_path}")

print("\nLoading state dict...")
state_dict = load_state_dict_safely(weights_path)

print("\nBuilding model config from checkpoint tensor shapes...")
config = build_model_config_from_checkpoint(state_dict, raw_config)
model_cfg = ModelConfig(**config)

print("\nFinal model config:")
for key, value in config.items():
    print(f"  {key}: {value}")

print("\nLoading tokenizer...")
tokenizer = load_tokenizer_for_model(
    repo_id=REPO_ID,
    raw_config=raw_config,
    target_vocab_size=model_cfg.vocab_size,
)

fallback_token_id = getattr(tokenizer, "doc_eos_token_id", None)
if fallback_token_id is None or fallback_token_id >= model_cfg.vocab_size:
    fallback_token_id = tokenizer.eos_token_id

if fallback_token_id is None or fallback_token_id >= model_cfg.vocab_size:
    fallback_token_id = 0


device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    torch.set_float32_matmul_precision("high")
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    try:
        torch.backends.cuda.enable_flash_sdp(True)
        torch.backends.cuda.enable_mem_efficient_sdp(True)
        torch.backends.cuda.enable_math_sdp(True)
    except Exception:
        pass

    if torch.cuda.is_bf16_supported():
        dtype = torch.bfloat16
    else:
        dtype = torch.float16
else:
    dtype = torch.float32

print(f"\nUsing device={device}, dtype={dtype}")

print("\nBuilding model...")
model = DenseLLM(
    model_cfg,
    use_gradient_checkpointing=False,
).to(device=device, dtype=dtype)

print("Loading model weights...")

try:
    model.load_state_dict(state_dict, strict=True)
    print("Weights loaded with strict=True.")
except RuntimeError as e:
    print("Strict load failed.")
    print(str(e)[:4000])
    raise

model.eval()

print("\nModel ready!")


# =============================================================================
# STREAMING GENERATION
# =============================================================================

@torch.inference_mode()
def stream_generate(
    prompt: str,
    max_new_tokens: int = 256,
    temperature: float = 0.55,
    top_k: int = 35,
    top_p: float = 0.88,
    repetition_penalty: float = 1.1,
):
    if not prompt or not prompt.strip():
        yield ""
        return

    encoded = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    )

    input_ids = encoded["input_ids"]

    input_ids = sanitize_input_ids(
        input_ids=input_ids,
        model_vocab_size=model_cfg.vocab_size,
        fallback_token_id=fallback_token_id,
    ).to(device)

    generated = input_ids.clone()
    prompt_len = input_ids.shape[1]

    eos_id = getattr(tokenizer, "doc_eos_token_id", None)
    if eos_id is None:
        eos_id = tokenizer.eos_token_id

    if eos_id is not None and eos_id >= model_cfg.vocab_size:
        eos_id = None

    max_seq_len = int(model_cfg.max_seq_len)

    for _ in range(int(max_new_tokens)):
        model_input = generated[:, -max_seq_len:]

        logits, _ = model(model_input, None)
        next_logits = logits[:, -1, :].float()

        if temperature <= 0:
            next_token = torch.argmax(next_logits, dim=-1, keepdim=True)
        else:
            next_logits = next_logits / max(float(temperature), 1e-8)

            # Repetition penalty
            if repetition_penalty and repetition_penalty != 1.0:
                used_tokens = torch.unique(generated[0])
                used_tokens = used_tokens[used_tokens < model_cfg.vocab_size]

                if used_tokens.numel() > 0:
                    token_scores = next_logits[0, used_tokens]
                    next_logits[0, used_tokens] = torch.where(
                        token_scores > 0,
                        token_scores / repetition_penalty,
                        token_scores * repetition_penalty,
                    )

            # Top-k filtering
            if top_k and top_k > 0:
                k = min(int(top_k), next_logits.size(-1))
                values, _ = torch.topk(next_logits, k)
                cutoff = values[:, [-1]]
                next_logits[next_logits < cutoff] = -float("inf")

            # Top-p / nucleus filtering
            if top_p and top_p < 1.0:
                sorted_logits, sorted_indices = torch.sort(next_logits, descending=True)
                sorted_probs = F.softmax(sorted_logits, dim=-1)
                cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

                remove_mask = cumulative_probs > float(top_p)
                remove_mask[..., 1:] = remove_mask[..., :-1].clone()
                remove_mask[..., 0] = False

                full_mask = torch.zeros_like(next_logits, dtype=torch.bool)
                full_mask.scatter_(1, sorted_indices, remove_mask)
                next_logits[full_mask] = -float("inf")

            probs = F.softmax(next_logits, dim=-1)

            if (
                not torch.isfinite(probs).all()
                or (probs.sum(dim=-1) <= 0).any()
            ):
                next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
            else:
                next_token = torch.multinomial(probs, num_samples=1)

        generated = torch.cat([generated, next_token], dim=-1)

        if eos_id is not None and next_token.item() == eos_id:
            break

        decoded = tokenizer.decode(
            generated[0, prompt_len:],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )

        yield decoded


def respond(
    prompt,
    max_tokens,
    temperature,
    top_k,
    top_p,
    repetition_penalty,
):
    # Thin wrapper so Gradio can stream partial outputs from the generator.
    yield from stream_generate(
        prompt=prompt,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
    )


# =============================================================================
# GRADIO UI
# =============================================================================

DEFAULT_PROMPT = """Translate the following Akkadian transliteration into English. Include a literal word-by-word gloss:
šarrum bītam rabiam ana ilim ibni.
"""

EXAMPLES = [
    [
        "Translate the following Akkadian transliteration into English. Include a literal word-by-word gloss:\nšarrum bītam rabiam ana ilim ibni."
    ],
    [
        "Translate the following Akkadian transliteration into English. Include grammatical notes:\nṭupšarrum awātim ina ṭuppim išṭur."
    ],
    [
        "Translate the following Akkadian transliteration into English. Provide a literal gloss and smooth English translation:\ntamkārum kaspam ana wardim iddin."
    ],
    [
        "Translate the following Old Babylonian-style Akkadian transliteration into English. Explain the case endings if possible:\nawīlum dannum abul ālim ina mūšim iṣṣur."
    ],
    [
        "Translate the following Akkadian transliteration into English. Give both a literal and natural translation:\nahī ana bīt abīšu īrub."
    ],
    [
        "Translate the following Akkadian transliteration into English. If uncertain, explain the possible alternatives:\nlū ilum šarram u ālam liṣṣur."
    ],
]

with gr.Blocks(
    title="Akkadian-English DenseLLM 1B",
    theme=gr.themes.Soft(),
) as demo:
    gr.Markdown(
        "# Akkadian-English DenseLLM 1B\n"
        "*AlgoDriveAI — custom DenseLLM / MLA architecture for Akkadian and Old Babylonian translation experiments*"
    )

    with gr.Row():
        with gr.Column(scale=3):
            prompt_box = gr.Textbox(
                label="Prompt",
                placeholder="Translate the following Akkadian transliteration into English...",
                lines=6,
                value=DEFAULT_PROMPT,
            )

            output_box = gr.Textbox(
                label="Output",
                lines=16,
                interactive=False,
            )

            with gr.Row():
                generate_btn = gr.Button("Generate", variant="primary")
                clear_btn = gr.ClearButton(
                    components=[prompt_box, output_box],
                    value="Clear",
                )

        with gr.Column(scale=1):
            max_tokens = gr.Slider(
                minimum=16,
                maximum=768,
                value=256,
                step=1,
                label="Max new tokens",
            )

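            # Temperature 0 switches stream_generate to greedy decoding.
            temperature = gr.Slider(
                minimum=0.0,
                maximum=2.0,
                value=0.55,
                step=0.05,
                label="Temperature",
            )

            # Top-K 0 disables top-k filtering.
            top_k = gr.Slider(
                minimum=0,
                maximum=100,
                value=35,
                step=1,
                label="Top-K",
            )

            # Top-P 0 or 1 disables nucleus filtering.
            top_p = gr.Slider(
                minimum=0.0,
                maximum=1.0,
                value=0.88,
                step=0.01,
                label="Top-P",
            )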

            repetition_penalty = gr.Slider(
                minimum=1.0,
                maximum=1.5,
                value=1.1,
                step=0.01,
                label="Repetition penalty",
            )

    gr.Examples(
        examples=EXAMPLES,
        inputs=prompt_box,
    )

    generate_btn.click(
        fn=respond,
        inputs=[
            prompt_box,
            max_tokens,
            temperature,
            top_k,
            top_p,
            repetition_penalty,
        ],
        outputs=output_box,
    )

    prompt_box.submit(
        fn=respond,
        inputs=[
            prompt_box,
            max_tokens,
            temperature,
            top_k,
            top_p,
            repetition_penalty,
        ],
        outputs=output_box,
    )


if __name__ == "__main__":
    demo.queue()
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=False,
    )

Minimal Inference Pattern

For users who want to build their own script, the minimal pattern for downloading the checkpoint is:

import torch
from huggingface_hub import hf_hub_download

repo_id = "AlgoDriveAI/Akkadian_English_DenseLLM_1B"

weights_path = hf_hub_download(
    repo_id=repo_id,
    filename="pytorch_model.bin",
)

config_path = hf_hub_download(
    repo_id=repo_id,
    filename="config.json",
)

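# weights_only=True restricts unpickling to tensors and plain containers.
# Remove it only if the checkpoint stores other objects and you trust the source.
state_dict = torch.load(weights_path, map_location="cpu", weights_only=True)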

From there, the state dict must be loaded into the custom DenseLLM architecture used during training, as the full script above does with model.load_state_dict(state_dict, strict=True).
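
As a sketch of a possible next step, the downloaded config.json can be read with the standard library before the model is built. The key names below are an assumption taken from the hyperparameter table in the Architecture section, so the actual file may use different names. `read_model_config` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path

# Assumed key names, copied from the Architecture hyperparameter table.
# The real config.json may name these fields differently.
EXPECTED_KEYS = (
    "d_model",
    "n_layers",
    "n_heads",
    "q_lora_rank",
    "kv_lora_rank",
    "qk_nope_head_dim",
    "qk_rope_head_dim",
    "v_head_dim",
    "max_seq_len",
)


def read_model_config(config_path):
    """Return (config dict, list of expected keys that are absent)."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    missing = [key for key in EXPECTED_KEYS if key not in config]
    return config, missing
```

Checking `missing` before constructing the model may surface a naming mismatch early, rather than as a shape error during load_state_dict.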

Example Prompts

Translate the following Akkadian transliteration into English. Include a literal word-by-word gloss:
šarrum bītam rabiam ana ilim ibni.

Expected meaning:

The king built a great temple for the god.

Translate the following Akkadian transliteration into English. Include grammatical notes:
ṭupšarrum awātim ina ṭuppim išṭur.

Expected meaning:

The scribe wrote the words on a clay tablet.

Translate the following Akkadian transliteration into English. Provide a literal gloss and smooth English translation:
tamkārum kaspam ana wardim iddin.

Expected meaning:

The merchant gave silver to the slave.

Translate the following Old Babylonian-style Akkadian transliteration into English. Explain the case endings if possible:
awīlum dannum abul ālim ina mūšim iṣṣur.

Expected meaning:

The strong man guarded the city gate at night.

Translate the following Akkadian transliteration into English. Explain the verb form:
ištarum ikrib nišī išme.

Expected meaning:

The goddess heard the prayer of the people.

Translate the following Akkadian transliteration into English. Include a word-by-word breakdown:
rē’ûm immerī ina eqlim imnu.

Expected meaning:

The shepherd counted the sheep in the field.

Translate the following Akkadian transliteration into English. Include brief grammar notes:
šumma nārum eli, ikkarū ihaddu.

Expected meaning:

If the river rises, the farmers will rejoice.

Translate the following Akkadian transliteration into English. Give both a literal and natural translation:
ahī ana bīt abīšu īrub.

Expected meaning:

My brother entered the house of his father.

Translate the following Akkadian transliteration into English. Include notes on nouns, verbs, and adjectives:
šarratum ṭēmam ana mātim rūqtim išpur.

Expected meaning:

The queen sent a message to the distant land.

Translate the following Akkadian transliteration into English. If uncertain, explain the possible alternatives:
lū ilum šarram u ālam liṣṣur.

Expected meaning:

May the god protect the king and the city.
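
The prompt and expected-meaning pairs above can be turned into a rough smoke test. The sketch below is illustrative only: `build_prompt`, `keyword_hits`, and `run_cases` are hypothetical helpers, `generate` stands for any callable that maps a prompt string to output text (for example, a wrapper returning the last value yielded by stream_generate), and keyword matching is a crude proxy for translation quality.

```python
INSTRUCTION = (
    "Translate the following Akkadian transliteration into English. "
    "Include a literal word-by-word gloss:"
)

# (transliteration, keywords taken from the expected meaning listed above)
CASES = [
    ("šarrum bītam rabiam ana ilim ibni.", ("king", "built", "temple", "god")),
    ("lū ilum šarram u ālam liṣṣur.", ("god", "protect", "king", "city")),
]


def build_prompt(transliteration, instruction=INSTRUCTION):
    """Join instruction and source text in the format used by the examples."""
    return f"{instruction}\n{transliteration}"


def keyword_hits(output, keywords):
    """Return the keywords that appear in the output, ignoring case."""
    text = output.lower()
    return [word for word in keywords if word.lower() in text]


def run_cases(generate, cases=CASES):
    """Run each case and return (source, hits, total keywords) tuples."""
    results = []
    for source, keywords in cases:
        output = generate(build_prompt(source))
        results.append((source, len(keyword_hits(output, keywords)), len(keywords)))
    return results
```

A low hit count does not prove the translation is wrong, and a full hit count does not prove it is right, so flagged outputs still need to be checked by hand.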

Architecture

Component	Details
Type	Custom Dense Transformer / DenseLLM
Parameters	Approximately 1B
Attention	MLA-style attention
Positional Encoding	RoPE
Activation	SwiGLU
Normalization	RMSNorm
Task Type	Causal Language Modeling
Standard HF `AutoModelForCausalLM` support	No
Custom inference code required	Yes

Hyperparameter	Value
`d_model`	1536
`n_layers`	36
`n_heads`	12
`q_lora_rank`	768
`kv_lora_rank`	384
`qk_nope_head_dim`	64
`qk_rope_head_dim`	64
`v_head_dim`	128
`q_head_dim`	128
`ff_hidden_mult`	3.5
`max_seq_len`	4096

Architecture Details

This model uses a custom DenseLLM architecture with MLA-style attention.

The attention layout includes:

Q compression through q_lora_rank
Fused KV down-projection
Fused KV up-projection
Shared RoPE key component
QK normalization on non-RoPE dimensions
RoPE positional encoding
SwiGLU feed-forward layers
RMSNorm normalization
Weight tying between token embeddings and the language-model head

The intended attention dimensions are:

qk_nope_head_dim = 64
qk_rope_head_dim = 64
q_head_dim       = 128
v_head_dim       = 128

q_head_dim is the sum of qk_nope_head_dim and qk_rope_head_dim (64 + 64 = 128), so it equals v_head_dim. This equality is intentional.
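
To make the dimension bookkeeping concrete, the sketch below recomputes the head sizes and a rough weight count from the hyperparameter table. It assumes a DeepSeek-V2-style MLA layout (query down and up projections, a fused KV down-projection that also emits the shared RoPE key, a fused KV up-projection, and an output projection) and a three-matrix SwiGLU block with hidden size ff_hidden_mult × d_model. The class and function names are hypothetical. Biases, norm weights, and embeddings are ignored, so the real model may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLADims:
    """Hyperparameters copied from the Architecture table."""
    d_model: int = 1536
    n_layers: int = 36
    n_heads: int = 12
    q_lora_rank: int = 768
    kv_lora_rank: int = 384
    qk_nope_head_dim: int = 64
    qk_rope_head_dim: int = 64
    v_head_dim: int = 128
    ff_hidden_mult: float = 3.5

    @property
    def q_head_dim(self) -> int:
        # Each query/key head concatenates a non-RoPE part and a RoPE part.
        return self.qk_nope_head_dim + self.qk_rope_head_dim


def attention_weights(c: MLADims) -> int:
    """Weight count for one attention block under the assumed layout."""
    q_down = c.d_model * c.q_lora_rank
    q_up = c.q_lora_rank * c.n_heads * c.q_head_dim
    # Fused KV down-projection: latent KV plus one RoPE key shared by all heads.
    kv_down = c.d_model * (c.kv_lora_rank + c.qk_rope_head_dim)
    # Fused KV up-projection: per-head non-RoPE keys and values.
    kv_up = c.kv_lora_rank * c.n_heads * (c.qk_nope_head_dim + c.v_head_dim)
    out = c.n_heads * c.v_head_dim * c.d_model
    return q_down + q_up + kv_down + kv_up + out


def feed_forward_weights(c: MLADims) -> int:
    """Weight count for one SwiGLU block (gate, up, and down matrices)."""
    hidden = int(c.ff_hidden_mult * c.d_model)
    return 3 * c.d_model * hidden


def estimate_block_weights(c: MLADims) -> int:
    """Total over all layers, excluding embeddings, norms, and biases."""
    return c.n_layers * (attention_weights(c) + feed_forward_weights(c))


dims = MLADims()
print(dims.q_head_dim, dims.n_heads * dims.q_head_dim)
print(f"{estimate_block_weights(dims):,}")
```

With the table values this gives q_head_dim = 128 and n_heads × q_head_dim = 1536, which matches d_model. Under the stated assumptions the 36 blocks hold about 1.12 billion weights before embeddings, in line with the "approximately 1B" figure.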

Training Focus

This version focuses on:

English
Akkadian
Old Babylonian-style normalized transliteration
Akkadian-to-English translation
English-to-Akkadian style generation
Literal glosses
Grammatical explanations
Prompt-following examples for Akkadian translation

Unlike the earlier mixed-language version, this release intentionally drops Sanskrit as a main training target.

Dataset Focus

The training focus was narrowed from a mixed Sanskrit/Akkadian setup into a more targeted English/Akkadian setup.

The goal is to make model behavior easier to evaluate in early testing:

Fewer target-language interactions
Less language blending
Clearer Akkadian translation evaluation
Easier debugging of glossing and grammar behavior
More focused tests for Akkadian transliteration and Old Babylonian-style forms

Recommended Generation Settings

Setting	Suggested Value
`temperature`	0.3–0.7
`top_k`	30–50
`top_p`	0.75–0.9
`repetition_penalty`	1.05–1.15
`max_new_tokens`	100–300

For translation tasks, lower temperature usually gives more stable output.

A good default is:

temperature = 0.55
top_k = 35
top_p = 0.88
repetition_penalty = 1.1
max_new_tokens = 256
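
The effect of these settings can be seen on a toy distribution. The sketch below is a pure-Python illustration, not the model's sampler: the five logits are made up, and `softmax` and `nucleus_keep` are hypothetical helpers that mirror the temperature scaling and the shifted top-p mask in stream_generate. It leaves out the repetition penalty and top-k step.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_keep(probs, top_p):
    """Indices kept by top-p filtering, highest probability first.

    A token is dropped only if the cumulative probability of the tokens
    ranked above it already exceeds top_p, so the top token always survives.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        if cumulative > top_p:
            break
        kept.append(index)
        cumulative += probs[index]
    return kept


toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # made-up values for illustration
for temperature in (1.0, 0.55):
    probs = softmax(toy_logits, temperature)
    kept = nucleus_keep(probs, top_p=0.88)
    print(temperature, round(max(probs), 3), len(kept))
```

At temperature 1.0 the top toy token holds about 0.56 of the probability mass and three tokens survive the 0.88 nucleus cut. At 0.55 the top token holds about 0.79 and only two survive, which illustrates why lower temperatures tend to give more stable output.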

Known Limitations

Not a scholarly authority: Outputs should be checked against reliable Akkadian grammars, dictionaries, and corpora.
Hallucinated forms: The model may invent plausible-looking Akkadian words or endings.
Translation uncertainty: Akkadian is highly context-dependent, and short isolated sentences may have multiple possible readings.
Inconsistent transliteration: The model may mix normalized forms, ASCII approximations, sign-like conventions, or nonstandard spellings.
Prompt sensitivity: The model may behave differently depending on how explicitly the prompt is written.
Repetition: Long generations may become repetitive.
Tokenizer sensitivity: Output quality depends heavily on using the same tokenizer setup used during training.
Custom architecture: The model requires matching custom inference code and will not load directly with AutoModelForCausalLM.
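
The repetition limitation is one of the easier ones to flag automatically. The sketch below is one possible heuristic, not part of the released code: `repeated_ngrams` is a hypothetical helper that splits on whitespace rather than using the model tokenizer, and the n-gram length and threshold are arbitrary starting points.

```python
from collections import Counter


def repeated_ngrams(text, n=4, min_count=3):
    """Map each whitespace-token n-gram seen at least min_count times to its count."""
    tokens = text.split()
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {
        " ".join(gram): count
        for gram, count in grams.items()
        if count >= min_count
    }
```

A non-empty result on a long generation suggests trying a shorter max_new_tokens or a slightly higher repetition_penalty within the recommended range.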

Tokenizer Note

For best results, the repository should include the exact tokenizer files used during training.

If the tokenizer files are missing or incomplete, inference scripts may fall back to a compatible tokenizer and pad the tokenizer vocabulary to match the model’s embedding size. This can prevent runtime errors, but it may reduce output quality if the fallback tokenizer does not match the tokenizer used during training.

Recommended tokenizer-related files to include:

tokenizer.json
tokenizer_config.json
special_tokens_map.json
added_tokens.json
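
A quick local check can confirm that the files listed above are present before inference. This is a minimal sketch using only the standard library. `missing_tokenizer_files` is a hypothetical helper, and it assumes the repository has already been downloaded to a local directory. Some tokenizers ship without added_tokens.json, so a missing file is a reason to investigate rather than proof of a problem.

```python
from pathlib import Path

# File names taken from the recommended list above.
TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "added_tokens.json",
)


def missing_tokenizer_files(directory):
    """Return the recommended tokenizer files not found in the directory."""
    root = Path(directory)
    return [name for name in TOKENIZER_FILES if not (root / name).is_file()]
```

An empty list means all four files were found. It does not confirm that they match the tokenizer used during training.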

Suggested Use Cases

This model may be useful for:

Testing Akkadian translation prompts
Exploring Old Babylonian-style normalized transliteration
Studying how small custom LLMs behave on ancient-language tasks
Comparing mixed-language vs separated-language training
Research experiments in low-resource language modeling

Out-of-Scope Uses

This model should not be used as the sole authority for:

Academic publication translations
Legal, religious, or historical claims
Primary-source interpretation without expert review
High-confidence philological analysis
Production translation systems

Feedback Welcome

Feedback is especially useful on:

Akkadian translation accuracy
Gloss quality
Verb parsing
Case ending interpretation
Old Babylonian-style grammar
Prompt formats that work well
Failure cases where the model produces plausible but wrong analysis
Repetition or language-blending behavior

Contact

Organization: AlgoDriveAI
Author: Christopher Smith
Base / previous version: AlgoDriveAI/Sanskrit_Akkadian_LLM
Repository: AlgoDriveAI/Akkadian_English_DenseLLM_1B

Citation

@misc{algodrive2026akkadian_english_dense_1b,
  author = {{AlgoDriveAI} and Smith, Christopher},
  title = {Akkadian-English DenseLLM 1B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/AlgoDriveAI/Akkadian_English_DenseLLM_1B}
}

License

MIT
