vijaym committed
Commit 58b8e27 · verified · 1 parent: 20eb65e

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ onnx/model.onnx.data filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,153 @@
+ ---
+ license: apache-2.0
+ language:
+ - multilingual
+ tags:
+ - programming-language-identification
+ - code
+ - byte-level
+ - lite
+ library_name: pytorch
+ pipeline_tag: text-classification
+ metrics:
+ - f1
+ - accuracy
+ ---
+
+ # programming-language-identification-100plus-lite
+
+ Byte-level programming-language identification across **107 languages**. A lite
+ counterpart to the full ModernBERT model
+ `programming-language-identification-100plus`: **2.35M parameters**, no
+ tokenizer, shipping at **~9 MB fp32 / ~4.5 MB bf16**.
+
+ **[Open PyTorch Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_pytorch_demo.ipynb)** · **[Open ONNX Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_onnx_demo.ipynb)** · Download and run in Colab or Jupyter.
+
+ The architecture is `ByteHybrid` (3 × Conv1D → 1 × bidirectional attention with
+ RoPE → masked mean-pool → classifier head, plus a 4096-bucket trigram-hash
+ embedding), vendored from
+ [PleIAs/CommonLingua](https://huggingface.co/PleIAs/CommonLingua) (Apache-2.0)
+ and trained from scratch on Rosetta Code + The Stack v1 across 107 canonical
+ programming languages.
+
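+ The trigram-hash embedding is vocabulary-free: every byte position hashes its
+ 3-byte window into one of 4096 buckets with a rolling base-257 hash. A minimal
+ sketch of that bucketing, mirroring the vendored `ByteNgramEmbed` in the demo
+ notebook (base and bucket count are taken from it):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def trigram_buckets(byte_ids, n=3, num_buckets=4096):
+     """Map (B, T) byte ids to (B, T) embedding-bucket indices."""
+     B, T = byte_ids.shape
+     clamped = byte_ids.clamp(max=255)             # fold the padding id 256 into byte range
+     padded = F.pad(clamped, (0, n - 1), value=0)  # right-pad so every position has a full window
+     h = torch.zeros(B, T, dtype=torch.long)
+     for i in range(n):
+         h = h * 257 + padded[:, i:i + T]          # rolling base-257 hash over the trigram
+     return h % num_buckets                        # index into a 4096-row nn.Embedding
+ ```
+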
+ ## Comparison with `philomath-1209/programming-language-identification`
+
+ Evaluated on 3,057 test rows covering the **26 labels** philomath supports.
+ ONNX, `CPUExecutionProvider`, batch 64.
+
+ | model | params | accuracy | macro F1 | weighted F1 | speed |
+ |---|---:|---:|---:|---:|---:|
+ | **programming-language-identification-100plus-lite** (ONNX) | 2.35 M | 0.9094 | **0.9410** | **0.9361** | **2.37×** |
+ | philomath-1209/programming-language-identification (ONNX) | 84 M | 0.8449 | 0.8445 | 0.8467 | 1.00× |
+
+ Speed is the ratio of texts/sec relative to philomath on the same CPU
+ (`onnxruntime` `CPUExecutionProvider`, single host, no other GPU/CPU load); a
+ reproduction sketch follows below. GPU torch-vs-torch numbers are pending;
+ these CPU figures reflect the realistic edge-deployment scenario.
+
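+ A sketch of how the texts/sec figure can be reproduced, following the
+ throughput cell in the ONNX demo notebook (random bytes stand in for real
+ snippets here):
+
+ ```python
+ import time
+ import numpy as np
+ import onnxruntime as ort
+
+ sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
+ batch = np.random.randint(0, 256, size=(64, 1023), dtype=np.int64)  # 64 dummy texts
+ for _ in range(3):                                   # warm-up runs
+     sess.run(None, {"byte_ids": batch})
+ t0 = time.time()
+ for _ in range(40):                                  # timed runs: 40 batches of 64
+     sess.run(None, {"byte_ids": batch})
+ print(f"{40 * 64 / (time.time() - t0):.0f} texts/sec")
+ ```
+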
+ ## Files
+
+ ```
+ model.pt                  fp32 PyTorch checkpoint (CommonLingua format)
+ model.bf16.pt             bf16 sidecar checkpoint (smaller, same accuracy in eval)
+ lang2idx.json             107-label index
+ training_metadata.json    hyperparameters and dataset stats
+ training_history.json     per-epoch loss / val_acc / val_macro_f1
+ onnx/
+   model.onnx              ONNX export (opset 20, dynamic batch)
+   model.onnx.data         external weights blob
+   lang2idx.json           (mirror)
+   onnx_metadata.json      parity report vs PyTorch
+ ```
+
+ ## Quick start — PyTorch
+
+ ```python
+ import torch, numpy as np, sys
+ sys.path.append("path/to/code-language-id/src")  # repo that defines ByteHybrid
+ from code_language_id.byte_hybrid import ByteHybrid, CONFIGS
+
+ ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
+ model = ByteHybrid(num_classes=ckpt["num_classes"], max_len=ckpt["max_len"],
+                    **CONFIGS[ckpt["config"]]).eval()
+ model.load_state_dict(ckpt["model_state_dict"])
+ idx2lang = {v: k for k, v in ckpt["lang2idx"].items()}
+
+ def encode(texts, max_len=ckpt["max_len"]):
+     # 256 is the padding id; real byte values occupy 0-255
+     out = np.full((len(texts), max_len), 256, dtype=np.int64)
+     for i, t in enumerate(texts):
+         b = t.encode("utf-8", errors="replace")[:max_len]
+         out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
+     return torch.from_numpy(out)
+
+ with torch.no_grad():
+     logits = model(encode(["def hello():\n print('hi')"]))
+ print(idx2lang[int(logits.argmax(-1))])  # -> Python
+ ```
+
+ ## Quick start — ONNX Runtime
+
+ ```python
+ import onnxruntime as ort, numpy as np, json
+
+ # onnx/model.onnx.data (the external-weights blob) must sit next to model.onnx
+ sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
+ lang2idx = json.load(open("onnx/lang2idx.json"))
+ idx2lang = {v: k for k, v in lang2idx.items()}
+ MAX_LEN = 1023
+
+ def encode(texts, max_len=MAX_LEN):
+     out = np.full((len(texts), max_len), 256, dtype=np.int64)  # 256 = padding id
+     for i, t in enumerate(texts):
+         b = t.encode("utf-8", errors="replace")[:max_len]
+         out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
+     return out
+
+ logits = sess.run(None, {"byte_ids": encode(["fn main() {}"])})[0]
+ print(idx2lang[int(logits.argmax(-1))])  # -> Rust
+ ```
+
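+ For ranked predictions with probabilities instead of a bare argmax, apply a
+ softmax to the logits; this mirrors the `predict` helper in the demo notebooks
+ (continuing from the snippet above):
+
+ ```python
+ def softmax(x, axis=-1):
+     e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
+     return e / e.sum(axis=axis, keepdims=True)
+
+ probs = softmax(sess.run(None, {"byte_ids": encode(["print 'hi'"])})[0])
+ top3 = np.argsort(-probs[0])[:3]                     # indices of the 3 best labels
+ print([(idx2lang[int(j)], round(float(probs[0, j]), 3)) for j in top3])
+ ```
+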
+ ## Training summary
+
+ - **Data**: Rosetta Code (`cakiki/rosetta-code`) + The Stack v1
+   (`bigcode/the-stack`), task-split to prevent leakage.
+   72,549 / 9,495 / 8,880 rows (train / val / test) across 107 canonical labels.
+ - **Snippets**: variable-window (64–1023 bytes) UTF-8.
+ - **Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight decay 0.01) + cosine-with-warmup,
+   peak LR 3e-3, 5% warmup, gradient clipping 1.0 (sketched below).
+ - **Schedule**: 30 epochs, bf16 autocast, batch 128 (effective batch size 128,
+   no gradient accumulation; SDPA fused attention).
+ - **Best val macro F1**: 0.9085 @ epoch 26 (best checkpoint of the 30-epoch run;
+   the early-stopping patience of 4 never triggered).
+
+ See `training_metadata.json` for the full hyperparameter dump.
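+
+ A sketch of the optimizer and schedule described above, under the assumption of
+ a standard PyTorch training loop (`model` and `steps_per_epoch` are placeholders
+ for your module and loader length):
+
+ ```python
+ import math, torch
+
+ # values from training_metadata.json
+ EPOCHS, PEAK_LR, MIN_LR, WARMUP_RATIO = 30, 3e-3, 1e-5, 0.05
+ total_steps = EPOCHS * steps_per_epoch          # steps_per_epoch: e.g. len(train_loader)
+ warmup_steps = int(WARMUP_RATIO * total_steps)
+
+ opt = torch.optim.AdamW(model.parameters(), lr=PEAK_LR,
+                         betas=(0.9, 0.95), weight_decay=0.01)
+
+ def lr_scale(step):  # multiplier on PEAK_LR: linear warmup, then cosine decay to MIN_LR
+     if step < warmup_steps:
+         return step / max(1, warmup_steps)
+     t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+     return (MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * t))) / PEAK_LR
+
+ sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)
+ # per step: loss.backward()
+ #           torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+ #           opt.step(); sched.step(); opt.zero_grad()
+ ```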
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{mariappan2026codelangidlite,
+   author    = {Mariappan, Vijayachandran},
+   title     = {programming-language-identification-100plus-lite: Byte-level Programming Language Identification across 107 Languages},
+   year      = {2026},
+   publisher = {Hugging Face},
+   url       = {https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite}
+ }
+ ```
+
+ Upstream architecture:
+
+ ```bibtex
+ @misc{commonlingua,
+   author    = {{PleIAs}},
+   title     = {CommonLingua: Byte-level Language Identification for 334 Languages},
+   year      = {2026},
+   publisher = {Hugging Face},
+   url       = {https://huggingface.co/PleIAs/CommonLingua}
+ }
+ ```
+
+ ## License & attribution
+
+ Apache-2.0. Architecture and reference inference code derive from
+ **PleIAs/CommonLingua** (Apache-2.0). Trained weights and dataset curation are
+ original to this repository.
lang2idx.json ADDED
@@ -0,0 +1,109 @@
1
+ {
2
+ "ABAP": 82,
3
+ "APL": 56,
4
+ "ARM Assembly": 95,
5
+ "ATS": 80,
6
+ "ActionScript": 69,
7
+ "Ada": 19,
8
+ "AppleScript": 46,
9
+ "AutoHotkey": 25,
10
+ "AutoIt": 75,
11
+ "Awk": 33,
12
+ "BASIC": 90,
13
+ "BQN": 99,
14
+ "Batchfile": 59,
15
+ "Befunge": 100,
16
+ "C": 8,
17
+ "C#": 17,
18
+ "C++": 15,
19
+ "COBOL": 47,
20
+ "Ceylon": 74,
21
+ "Clojure": 29,
22
+ "CoffeeScript": 58,
23
+ "ColdFusion": 86,
24
+ "Common Lisp": 24,
25
+ "Component Pascal": 96,
26
+ "Crystal": 55,
27
+ "D": 23,
28
+ "Dart": 72,
29
+ "E": 93,
30
+ "Eiffel": 64,
31
+ "Elixir": 37,
32
+ "Emacs Lisp": 63,
33
+ "Erlang": 38,
34
+ "Euphoria": 94,
35
+ "F#": 27,
36
+ "Factor": 20,
37
+ "Fantom": 65,
38
+ "Forth": 36,
39
+ "Fortran": 30,
40
+ "FreeBASIC": 48,
41
+ "GAP": 61,
42
+ "Go": 1,
43
+ "Groovy": 40,
44
+ "Haskell": 9,
45
+ "Haxe": 88,
46
+ "IDL": 84,
47
+ "Io": 76,
48
+ "J": 7,
49
+ "Java": 12,
50
+ "JavaScript": 26,
51
+ "Julia": 0,
52
+ "Kotlin": 11,
53
+ "LFE": 79,
54
+ "LabVIEW": 85,
55
+ "Lasso": 54,
56
+ "Logtalk": 81,
57
+ "Lua": 22,
58
+ "M": 97,
59
+ "M4": 77,
60
+ "MATLAB": 51,
61
+ "MAXScript": 70,
62
+ "Mathematica/Wolfram Language": 10,
63
+ "Mercury": 105,
64
+ "Modula-2": 98,
65
+ "Modula-3": 104,
66
+ "Nemerle": 103,
67
+ "NewLisp": 102,
68
+ "Nim": 6,
69
+ "OCaml": 32,
70
+ "Objective-C": 101,
71
+ "Oz": 52,
72
+ "PHP": 43,
73
+ "Pascal": 3,
74
+ "Perl": 4,
75
+ "PicoLisp": 50,
76
+ "Pike": 67,
77
+ "PowerShell": 39,
78
+ "Processing": 73,
79
+ "Prolog": 44,
80
+ "PureBasic": 31,
81
+ "Python": 5,
82
+ "QuickBASIC": 106,
83
+ "R": 34,
84
+ "REXX": 41,
85
+ "Racket": 14,
86
+ "Raku": 2,
87
+ "Rebol": 68,
88
+ "Red": 62,
89
+ "Ring": 66,
90
+ "Ruby": 13,
91
+ "Rust": 21,
92
+ "SAS": 87,
93
+ "Scala": 18,
94
+ "Scheme": 45,
95
+ "Scilab": 83,
96
+ "Smalltalk": 49,
97
+ "Standard ML": 53,
98
+ "Stata": 57,
99
+ "Swift": 35,
100
+ "Tcl": 16,
101
+ "V": 91,
102
+ "VBA": 89,
103
+ "VBScript": 92,
104
+ "Vala": 71,
105
+ "Visual Basic .NET": 42,
106
+ "Wren": 28,
107
+ "Zig": 78,
108
+ "jq": 60
109
+ }
lite_onnx_demo.ipynb ADDED
@@ -0,0 +1,91 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# `programming-language-identification-100plus-lite` — ONNX Runtime\n",
8
+ "\n",
9
+ "Same model as the PyTorch demo, exported to ONNX (opset 20). No torch needed at inference time. CPU-friendly: ~57 texts/sec single-thread on commodity hardware (2.37× philomath-1209 on the same box).\n",
10
+ "\n",
11
+ "Run end-to-end in Colab or Jupyter."
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "markdown",
16
+ "metadata": {},
17
+ "source": [
18
+ "## Install dependencies"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": null,
24
+ "metadata": {},
25
+ "outputs": [],
26
+ "source": "%%capture\n!pip install -q -U onnxruntime huggingface_hub numpy\n"
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "metadata": {},
31
+ "source": [
32
+ "## Download ONNX model + label index"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": null,
38
+ "metadata": {},
39
+ "outputs": [],
40
+ "source": "import json\nimport numpy as np\nimport onnxruntime as ort\nfrom huggingface_hub import hf_hub_download\n\nREPO = 'FrameByFrame/programming-language-identification-100plus-lite'\nonnx_path = hf_hub_download(REPO, 'onnx/model.onnx')\n# external weight blob lives next to the .onnx file\nhf_hub_download(REPO, 'onnx/model.onnx.data')\nlang2idx = json.loads(open(hf_hub_download(REPO, 'onnx/lang2idx.json')).read())\nmeta = json.loads(open(hf_hub_download(REPO, 'onnx/onnx_metadata.json')).read())\n\nsess = ort.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])\nidx2lang = {v: k for k, v in lang2idx.items()}\nMAX_LEN = meta['max_len']\nprint(f'{len(idx2lang)} labels | max_len={MAX_LEN} | providers={sess.get_providers()}')"
41
+ },
42
+ {
43
+ "cell_type": "markdown",
44
+ "metadata": {},
45
+ "source": [
46
+ "## Helpers"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": null,
52
+ "metadata": {},
53
+ "outputs": [],
54
+ "source": "def encode(texts, max_len=MAX_LEN):\n out = np.full((len(texts), max_len), 256, dtype=np.int64)\n for i, t in enumerate(texts):\n b = t.encode('utf-8', errors='replace')[:max_len]\n out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)\n return out\n\n\ndef softmax(logits, axis=-1):\n e = np.exp(logits - logits.max(axis=axis, keepdims=True))\n return e / e.sum(axis=axis, keepdims=True)\n\n\ndef predict(texts, top_k=3):\n logits = sess.run(None, {'byte_ids': encode(texts)})[0]\n probs = softmax(logits)\n top_i = np.argsort(-probs, axis=-1)[:, :top_k]\n return [[(idx2lang[int(j)], float(probs[r, j])) for j in row]\n for r, row in enumerate(top_i)]"
55
+ },
56
+ {
57
+ "cell_type": "markdown",
58
+ "metadata": {},
59
+ "source": [
60
+ "## Predict"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": null,
66
+ "metadata": {},
67
+ "outputs": [],
68
+ "source": "samples = [\n \"def fib(n):\\n return n if n < 2 else fib(n-1) + fib(n-2)\",\n \"fn main() {\\n println!(\\\"hello, world\\\");\\n}\",\n \"package main\\nimport \\\"fmt\\\"\\nfunc main() { fmt.Println(\\\"hi\\\") }\",\n \"#include <stdio.h>\\nint main() { printf(\\\"hi\\\\n\\\"); return 0; }\",\n \"SELECT name FROM users WHERE id = 42;\",\n]\nfor text, top in zip(samples, predict(samples)):\n print(f'{top[0][0]:<14s} {top[0][1]:.3f} ({top[1][0]} {top[1][1]:.2f}, {top[2][0]} {top[2][1]:.2f}) | {text[:60]!r}')"
69
+ },
70
+ {
71
+ "cell_type": "markdown",
72
+ "metadata": {},
73
+ "source": [
74
+ "## Throughput sanity check"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "code",
79
+ "execution_count": null,
80
+ "metadata": {},
81
+ "outputs": [],
82
+ "source": "import time\nwarm = encode(samples * 13)[:64]\nfor _ in range(3):\n sess.run(None, {'byte_ids': warm})\nt0 = time.time()\nfor _ in range(40):\n sess.run(None, {'byte_ids': warm})\nelapsed = time.time() - t0\nprint(f'{40*64/elapsed:.0f} texts/sec ({elapsed:.2f}s for 40 batches of 64)')"
83
+ }
84
+ ],
85
+ "metadata": {
86
+ "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
87
+ "language_info": {"name": "python", "version": "3.11"}
88
+ },
89
+ "nbformat": 4,
90
+ "nbformat_minor": 5
91
+ }
lite_pytorch_demo.ipynb ADDED
@@ -0,0 +1,91 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# `programming-language-identification-100plus-lite` — PyTorch\n",
8
+ "\n",
9
+ "2.35M-param byte-level classifier across 107 programming languages. No tokenizer; raw UTF-8 bytes padded to 1023.\n",
10
+ "\n",
11
+ "Self-contained: this notebook inlines the model definition (vendored from PleIAs/CommonLingua, Apache-2.0) and downloads the checkpoint from the Hub. Run end-to-end in Colab or Jupyter."
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "markdown",
16
+ "metadata": {},
17
+ "source": [
18
+ "## Install dependencies"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": null,
24
+ "metadata": {},
25
+ "outputs": [],
26
+ "source": "%%capture\n!pip install -q -U torch huggingface_hub\n"
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "metadata": {},
31
+ "source": [
32
+ "## Model definition (ByteHybrid — vendored from PleIAs/CommonLingua, Apache-2.0)"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": null,
38
+ "metadata": {},
39
+ "outputs": [],
40
+ "source": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass ByteNgramEmbed(nn.Module):\n def __init__(self, num_buckets=4096, embed_dim=64, n=3):\n super().__init__()\n self.n, self.num_buckets = n, num_buckets\n self.embed = nn.Embedding(num_buckets, embed_dim)\n\n def forward(self, byte_ids):\n B, T = byte_ids.shape\n clamped = byte_ids.clamp(max=255)\n padded = F.pad(clamped, (0, self.n - 1), value=0)\n h = torch.zeros(B, T, dtype=torch.long, device=byte_ids.device)\n for i in range(self.n):\n h = h * 257 + padded[:, i:i + T]\n return self.embed(h % self.num_buckets)\n\n\nclass ByteConvBlock(nn.Module):\n def __init__(self, d_model, kernel_size=15, expand=2):\n super().__init__()\n self.norm1 = nn.LayerNorm(d_model)\n self.pad = kernel_size - 1\n self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)\n self.norm2 = nn.LayerNorm(d_model)\n ffn = d_model * expand\n self.ffn_gate = nn.Linear(d_model, ffn, bias=False)\n self.ffn_up = nn.Linear(d_model, ffn, bias=False)\n self.ffn_down = nn.Linear(ffn, d_model, bias=False)\n\n def forward(self, x):\n residual = x\n x = self.norm1(x).transpose(1, 2)\n x = F.pad(x, (self.pad, 0))\n x = F.silu(self.conv(x)).transpose(1, 2)\n x = residual + x\n residual = x\n x = self.norm2(x)\n return residual + self.ffn_down(F.silu(self.ffn_gate(x)) * self.ffn_up(x))\n\n\ndef _rope(q, k):\n head_dim, seq_len = q.shape[-1], q.shape[-2]\n freqs = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2, device=q.device).float() / head_dim))\n a = torch.outer(torch.arange(seq_len, device=q.device), freqs)\n cos, sin = a.cos().to(q.dtype), a.sin().to(q.dtype)\n def rot(x):\n x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2:]\n return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)\n return rot(q), rot(k)\n\n\nclass ByteAttnBlock(nn.Module):\n def __init__(self, d_model, n_heads=4, expand=2):\n super().__init__()\n self.n_heads, self.head_dim = n_heads, d_model // n_heads\n self.norm1 = nn.LayerNorm(d_model)\n self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)\n self.out_proj = nn.Linear(d_model, d_model, bias=False)\n self.norm2 = nn.LayerNorm(d_model)\n ffn = d_model * expand\n self.ffn_gate = nn.Linear(d_model, ffn, bias=False)\n self.ffn_up = nn.Linear(d_model, ffn, bias=False)\n self.ffn_down = nn.Linear(ffn, d_model, bias=False)\n\n def forward(self, x):\n B, T, D = x.shape\n residual = x\n h = self.norm1(x)\n qkv = self.qkv(h).reshape(B, T, 3, self.n_heads, self.head_dim)\n q, k, v = (t.transpose(1, 2) for t in qkv.unbind(dim=2))\n q, k = _rope(q, k)\n out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False)\n out = out.transpose(1, 2).contiguous().view(B, T, D)\n x = residual + self.out_proj(out)\n residual = x\n h = self.norm2(x)\n return residual + self.ffn_down(F.silu(self.ffn_gate(h)) * self.ffn_up(h))\n\n\nclass ByteHybrid(nn.Module):\n def __init__(self, num_classes, d_model=256, n_conv=3, n_attn=1, n_heads=4,\n ffn_expand=2, max_len=512, conv_kernel=15, ngram_buckets=4096, ngram_dim=64):\n super().__init__()\n self.max_len = max_len\n self.embed = nn.Embedding(257, d_model, padding_idx=256)\n self.ngram_embed = ByteNgramEmbed(ngram_buckets, ngram_dim, n=3) if ngram_buckets else None\n if self.ngram_embed is not None:\n self.ngram_proj = nn.Linear(ngram_dim, d_model, bias=False)\n self.conv_layers = nn.ModuleList([ByteConvBlock(d_model, conv_kernel, ffn_expand) for _ in range(n_conv)])\n self.attn_layers = nn.ModuleList([ByteAttnBlock(d_model, n_heads, ffn_expand) for _ 
in range(n_attn)])\n self.final_norm = nn.LayerNorm(d_model)\n self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Dropout(0.1), nn.Linear(d_model, num_classes))\n\n def forward(self, byte_ids):\n pad_mask = byte_ids != 256\n x = self.embed(byte_ids)\n if self.ngram_embed is not None:\n x = x + self.ngram_proj(self.ngram_embed(byte_ids))\n for layer in self.conv_layers:\n x = layer(x)\n for layer in self.attn_layers:\n x = layer(x)\n x = self.final_norm(x)\n mask = pad_mask.unsqueeze(-1).to(x.dtype)\n x = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)\n return self.head(x)\n"
41
+ },
42
+ {
43
+ "cell_type": "markdown",
44
+ "metadata": {},
45
+ "source": [
46
+ "## Load checkpoint from the Hub"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": null,
52
+ "metadata": {},
53
+ "outputs": [],
54
+ "source": "from huggingface_hub import hf_hub_download\nimport numpy as np\n\nREPO = 'FrameByFrame/programming-language-identification-100plus-lite'\nckpt_path = hf_hub_download(REPO, 'model.pt')\nckpt = torch.load(ckpt_path, map_location='cpu', weights_only=False)\n\nBASE_NGRAM = dict(d_model=256, n_conv=3, n_attn=1, n_heads=4, conv_kernel=15,\n ngram_buckets=4096, ngram_dim=64)\nmodel = ByteHybrid(num_classes=ckpt['num_classes'], max_len=ckpt['max_len'], **BASE_NGRAM).eval()\nmodel.load_state_dict(ckpt['model_state_dict'])\nidx2lang = {v: k for k, v in ckpt['lang2idx'].items()}\nMAX_LEN = ckpt['max_len']\nprint(f'{ckpt[\"num_classes\"]} labels | max_len={MAX_LEN} | params={sum(p.numel() for p in model.parameters())/1e6:.2f}M')"
55
+ },
56
+ {
57
+ "cell_type": "markdown",
58
+ "metadata": {},
59
+ "source": [
60
+ "## Helpers"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": null,
66
+ "metadata": {},
67
+ "outputs": [],
68
+ "source": "def encode(texts, max_len=MAX_LEN):\n out = np.full((len(texts), max_len), 256, dtype=np.int64)\n for i, t in enumerate(texts):\n b = t.encode('utf-8', errors='replace')[:max_len]\n out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)\n return torch.from_numpy(out)\n\n\n@torch.no_grad()\ndef predict(texts, top_k=3):\n probs = torch.softmax(model(encode(texts)).float(), dim=-1)\n top_p, top_i = probs.topk(top_k, dim=-1)\n return [[(idx2lang[int(j)], float(p)) for p, j in zip(pr, ix)]\n for pr, ix in zip(top_p.tolist(), top_i.tolist())]"
69
+ },
70
+ {
71
+ "cell_type": "markdown",
72
+ "metadata": {},
73
+ "source": [
74
+ "## Predict"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "code",
79
+ "execution_count": null,
80
+ "metadata": {},
81
+ "outputs": [],
82
+ "source": "samples = [\n \"def fib(n):\\n return n if n < 2 else fib(n-1) + fib(n-2)\",\n \"fn main() {\\n println!(\\\"hello, world\\\");\\n}\",\n \"package main\\nimport \\\"fmt\\\"\\nfunc main() { fmt.Println(\\\"hi\\\") }\",\n \"#include <stdio.h>\\nint main() { printf(\\\"hi\\\\n\\\"); return 0; }\",\n \"SELECT name FROM users WHERE id = 42;\",\n]\nfor text, top in zip(samples, predict(samples)):\n print(f'{top[0][0]:<14s} {top[0][1]:.3f} ({top[1][0]} {top[1][1]:.2f}, {top[2][0]} {top[2][1]:.2f}) | {text[:60]!r}')"
83
+ }
84
+ ],
85
+ "metadata": {
86
+ "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
87
+ "language_info": {"name": "python", "version": "3.11"}
88
+ },
89
+ "nbformat": 4,
90
+ "nbformat_minor": 5
91
+ }
model.bf16.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eb17daed0d0b5007d27133abd67b305a3059665713a34c9942dd6ff28b6545f3
3
+ size 4595893
model.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:16090268695bb1c44ee411e1e78893850d4cae90753faf2a45eaecc9c2444216
3
+ size 9173558
onnx/lang2idx.json ADDED
@@ -0,0 +1,109 @@
1
+ {
2
+ "ABAP": 82,
3
+ "APL": 56,
4
+ "ARM Assembly": 95,
5
+ "ATS": 80,
6
+ "ActionScript": 69,
7
+ "Ada": 19,
8
+ "AppleScript": 46,
9
+ "AutoHotkey": 25,
10
+ "AutoIt": 75,
11
+ "Awk": 33,
12
+ "BASIC": 90,
13
+ "BQN": 99,
14
+ "Batchfile": 59,
15
+ "Befunge": 100,
16
+ "C": 8,
17
+ "C#": 17,
18
+ "C++": 15,
19
+ "COBOL": 47,
20
+ "Ceylon": 74,
21
+ "Clojure": 29,
22
+ "CoffeeScript": 58,
23
+ "ColdFusion": 86,
24
+ "Common Lisp": 24,
25
+ "Component Pascal": 96,
26
+ "Crystal": 55,
27
+ "D": 23,
28
+ "Dart": 72,
29
+ "E": 93,
30
+ "Eiffel": 64,
31
+ "Elixir": 37,
32
+ "Emacs Lisp": 63,
33
+ "Erlang": 38,
34
+ "Euphoria": 94,
35
+ "F#": 27,
36
+ "Factor": 20,
37
+ "Fantom": 65,
38
+ "Forth": 36,
39
+ "Fortran": 30,
40
+ "FreeBASIC": 48,
41
+ "GAP": 61,
42
+ "Go": 1,
43
+ "Groovy": 40,
44
+ "Haskell": 9,
45
+ "Haxe": 88,
46
+ "IDL": 84,
47
+ "Io": 76,
48
+ "J": 7,
49
+ "Java": 12,
50
+ "JavaScript": 26,
51
+ "Julia": 0,
52
+ "Kotlin": 11,
53
+ "LFE": 79,
54
+ "LabVIEW": 85,
55
+ "Lasso": 54,
56
+ "Logtalk": 81,
57
+ "Lua": 22,
58
+ "M": 97,
59
+ "M4": 77,
60
+ "MATLAB": 51,
61
+ "MAXScript": 70,
62
+ "Mathematica/Wolfram Language": 10,
63
+ "Mercury": 105,
64
+ "Modula-2": 98,
65
+ "Modula-3": 104,
66
+ "Nemerle": 103,
67
+ "NewLisp": 102,
68
+ "Nim": 6,
69
+ "OCaml": 32,
70
+ "Objective-C": 101,
71
+ "Oz": 52,
72
+ "PHP": 43,
73
+ "Pascal": 3,
74
+ "Perl": 4,
75
+ "PicoLisp": 50,
76
+ "Pike": 67,
77
+ "PowerShell": 39,
78
+ "Processing": 73,
79
+ "Prolog": 44,
80
+ "PureBasic": 31,
81
+ "Python": 5,
82
+ "QuickBASIC": 106,
83
+ "R": 34,
84
+ "REXX": 41,
85
+ "Racket": 14,
86
+ "Raku": 2,
87
+ "Rebol": 68,
88
+ "Red": 62,
89
+ "Ring": 66,
90
+ "Ruby": 13,
91
+ "Rust": 21,
92
+ "SAS": 87,
93
+ "Scala": 18,
94
+ "Scheme": 45,
95
+ "Scilab": 83,
96
+ "Smalltalk": 49,
97
+ "Standard ML": 53,
98
+ "Stata": 57,
99
+ "Swift": 35,
100
+ "Tcl": 16,
101
+ "V": 91,
102
+ "VBA": 89,
103
+ "VBScript": 92,
104
+ "Vala": 71,
105
+ "Visual Basic .NET": 42,
106
+ "Wren": 28,
107
+ "Zig": 78,
108
+ "jq": 60
109
+ }
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6f0d666c7861a9c60f9d0b8f02ff9f47d04aa64237929db7d7ec9255b297668d
3
+ size 145695
onnx/model.onnx.data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0843fd67445b7e0447627cfa4d04b2b60f5fa0c2f8cf3e4d3e7cf5b5dcf1d2d0
3
+ size 9289004
onnx/onnx_metadata.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "config": "base_ngram",
3
+ "max_len": 1023,
4
+ "num_classes": 107,
5
+ "opset": 20,
6
+ "parity": {
7
+ "argmax_match": 1.0,
8
+ "max_abs_diff": 3.62396240234375e-05,
9
+ "max_rel_diff": 4.5247681555338204e-05,
10
+ "samples": 8
11
+ },
12
+ "source_checkpoint": "/models/guardrail_code_models/programming-language-identification-100plus-lite/model.pt"
13
+ }
training_history.json ADDED
@@ -0,0 +1,272 @@
1
+ [
2
+ {
3
+ "epoch": 0,
4
+ "train_loss": 2.9708972894443297,
5
+ "lr_end": 0.0020011764705882354,
6
+ "elapsed_seconds": 66.09606695175171,
7
+ "accuracy": 0.6856240126382307,
8
+ "macro_f1": 0.612607947403309,
9
+ "num_eval": 9495
10
+ },
11
+ {
12
+ "epoch": 1,
13
+ "train_loss": 1.3010274481744786,
14
+ "lr_end": 0.002997738005457924,
15
+ "elapsed_seconds": 67.71248269081116,
16
+ "accuracy": 0.7769352290679304,
17
+ "macro_f1": 0.7317482364510279,
18
+ "num_eval": 9495
19
+ },
20
+ {
21
+ "epoch": 2,
22
+ "train_loss": 1.0224097316606127,
23
+ "lr_end": 0.0029796353183136315,
24
+ "elapsed_seconds": 67.818186044693,
25
+ "accuracy": 0.8169562927856767,
26
+ "macro_f1": 0.78981588471773,
27
+ "num_eval": 9495
28
+ },
29
+ {
30
+ "epoch": 3,
31
+ "train_loss": 0.8672642597971545,
32
+ "lr_end": 0.0029436336625379826,
33
+ "elapsed_seconds": 67.77718615531921,
34
+ "accuracy": 0.8457082675092154,
35
+ "macro_f1": 0.8202063248482179,
36
+ "num_eval": 9495
37
+ },
38
+ {
39
+ "epoch": 4,
40
+ "train_loss": 0.7868876520952358,
41
+ "lr_end": 0.002890170022447983,
42
+ "elapsed_seconds": 67.75592398643494,
43
+ "accuracy": 0.8608741442864666,
44
+ "macro_f1": 0.8353007183490015,
45
+ "num_eval": 9495
46
+ },
47
+ {
48
+ "epoch": 5,
49
+ "train_loss": 0.7165160898856732,
50
+ "lr_end": 0.0028198933340924342,
51
+ "elapsed_seconds": 67.73916673660278,
52
+ "accuracy": 0.8655081621906267,
53
+ "macro_f1": 0.8460978781253898,
54
+ "num_eval": 9495
55
+ },
56
+ {
57
+ "epoch": 6,
58
+ "train_loss": 0.6669132913845994,
59
+ "lr_end": 0.0027336566085343216,
60
+ "elapsed_seconds": 67.72217750549316,
61
+ "accuracy": 0.8682464454976303,
62
+ "macro_f1": 0.845950703682244,
63
+ "num_eval": 9495
64
+ },
65
+ {
66
+ "epoch": 7,
67
+ "train_loss": 0.61862961949253,
68
+ "lr_end": 0.002632506578092115,
69
+ "elapsed_seconds": 67.68294095993042,
70
+ "accuracy": 0.8798314902580305,
71
+ "macro_f1": 0.8561684647466593,
72
+ "num_eval": 9495
73
+ },
74
+ {
75
+ "epoch": 8,
76
+ "train_loss": 0.5789330196325734,
77
+ "lr_end": 0.0025176709912128107,
78
+ "elapsed_seconds": 67.72237920761108,
79
+ "accuracy": 0.8833070036861506,
80
+ "macro_f1": 0.8626534044519957,
81
+ "num_eval": 9495
82
+ },
83
+ {
84
+ "epoch": 9,
85
+ "train_loss": 0.5419822454815652,
86
+ "lr_end": 0.002390543710190218,
87
+ "elapsed_seconds": 67.71956896781921,
88
+ "accuracy": 0.8906793048973144,
89
+ "macro_f1": 0.8723699895551144,
90
+ "num_eval": 9495
91
+ },
92
+ {
93
+ "epoch": 10,
94
+ "train_loss": 0.5017477142915819,
95
+ "lr_end": 0.0022526677926108145,
96
+ "elapsed_seconds": 67.73478865623474,
97
+ "accuracy": 0.893417588204318,
98
+ "macro_f1": 0.876241729959754,
99
+ "num_eval": 9495
100
+ },
101
+ {
102
+ "epoch": 11,
103
+ "train_loss": 0.47520807102780815,
104
+ "lr_end": 0.002105716761882813,
105
+ "elapsed_seconds": 67.72090244293213,
106
+ "accuracy": 0.8967877830437072,
107
+ "macro_f1": 0.877042583233766,
108
+ "num_eval": 9495
109
+ },
110
+ {
111
+ "epoch": 12,
112
+ "train_loss": 0.4500465152393418,
113
+ "lr_end": 0.0019514742941847767,
114
+ "elapsed_seconds": 67.71593427658081,
115
+ "accuracy": 0.9001579778830964,
116
+ "macro_f1": 0.8804974841902045,
117
+ "num_eval": 9495
118
+ },
119
+ {
120
+ "epoch": 13,
121
+ "train_loss": 0.42171269541808853,
122
+ "lr_end": 0.0017918125683914858,
123
+ "elapsed_seconds": 67.82398700714111,
124
+ "accuracy": 0.9034228541337546,
125
+ "macro_f1": 0.8822687844188983,
126
+ "num_eval": 9495
127
+ },
128
+ {
129
+ "epoch": 14,
130
+ "train_loss": 0.3915723686496021,
131
+ "lr_end": 0.0016286695417633624,
132
+ "elapsed_seconds": 67.73259091377258,
133
+ "accuracy": 0.9016324381253291,
134
+ "macro_f1": 0.8837984553247075,
135
+ "num_eval": 9495
136
+ },
137
+ {
138
+ "epoch": 15,
139
+ "train_loss": 0.3607438381742673,
140
+ "lr_end": 0.0014640254272247667,
141
+ "elapsed_seconds": 67.74051308631897,
142
+ "accuracy": 0.9083728278041074,
143
+ "macro_f1": 0.8927869860007333,
144
+ "num_eval": 9495
145
+ },
146
+ {
147
+ "epoch": 16,
148
+ "train_loss": 0.3410214547847712,
149
+ "lr_end": 0.0012998786577474743,
150
+ "elapsed_seconds": 67.72065258026123,
151
+ "accuracy": 0.905423907319642,
152
+ "macro_f1": 0.8922696411524683,
153
+ "num_eval": 9495
154
+ },
155
+ {
156
+ "epoch": 17,
157
+ "train_loss": 0.320683361050666,
158
+ "lr_end": 0.0011382216295811381,
159
+ "elapsed_seconds": 67.73088240623474,
160
+ "accuracy": 0.9070036861506056,
161
+ "macro_f1": 0.8910737901237671,
162
+ "num_eval": 9495
163
+ },
164
+ {
165
+ "epoch": 18,
166
+ "train_loss": 0.3039343846906733,
167
+ "lr_end": 0.0009810165187568425,
168
+ "elapsed_seconds": 67.75304675102234,
169
+ "accuracy": 0.9130068457082675,
170
+ "macro_f1": 0.8995491401951339,
171
+ "num_eval": 9495
172
+ },
173
+ {
174
+ "epoch": 19,
175
+ "train_loss": 0.27867962774641303,
176
+ "lr_end": 0.0008301714644005056,
177
+ "elapsed_seconds": 67.74108362197876,
178
+ "accuracy": 0.9141653501843076,
179
+ "macro_f1": 0.9006247702483327,
180
+ "num_eval": 9495
181
+ },
182
+ {
183
+ "epoch": 20,
184
+ "train_loss": 0.26235123887701456,
185
+ "lr_end": 0.0006875174079405514,
186
+ "elapsed_seconds": 67.73433208465576,
187
+ "accuracy": 0.9152185360716166,
188
+ "macro_f1": 0.8962481088006313,
189
+ "num_eval": 9495
190
+ },
191
+ {
192
+ "epoch": 21,
193
+ "train_loss": 0.25165350653420004,
194
+ "lr_end": 0.0005547858693331366,
195
+ "elapsed_seconds": 67.7448239326477,
196
+ "accuracy": 0.9149025803054239,
197
+ "macro_f1": 0.9016399778230075,
198
+ "num_eval": 9495
199
+ },
200
+ {
201
+ "epoch": 22,
202
+ "train_loss": 0.23295196383896977,
203
+ "lr_end": 0.00043358793005475964,
204
+ "elapsed_seconds": 67.78508257865906,
205
+ "accuracy": 0.9162717219589257,
206
+ "macro_f1": 0.902575877603001,
207
+ "num_eval": 9495
208
+ },
209
+ {
210
+ "epoch": 23,
211
+ "train_loss": 0.22248560638001763,
212
+ "lr_end": 0.0003253946779644913,
213
+ "elapsed_seconds": 67.751882314682,
214
+ "accuracy": 0.9191153238546603,
215
+ "macro_f1": 0.904615785832685,
216
+ "num_eval": 9495
217
+ },
218
+ {
219
+ "epoch": 24,
220
+ "train_loss": 0.21600844006860187,
221
+ "lr_end": 0.00023151935139403203,
222
+ "elapsed_seconds": 67.75333857536316,
223
+ "accuracy": 0.9215376513954713,
224
+ "macro_f1": 0.9077659376985585,
225
+ "num_eval": 9495
226
+ },
227
+ {
228
+ "epoch": 25,
229
+ "train_loss": 0.20916091654575555,
230
+ "lr_end": 0.0001531013991987532,
231
+ "elapsed_seconds": 68.03508472442627,
232
+ "accuracy": 0.921853607161664,
233
+ "macro_f1": 0.90760546460601,
234
+ "num_eval": 9495
235
+ },
236
+ {
237
+ "epoch": 26,
238
+ "train_loss": 0.1965290139273287,
239
+ "lr_end": 9.109265024715332e-05,
240
+ "elapsed_seconds": 67.76995801925659,
241
+ "accuracy": 0.9228014744602422,
242
+ "macro_f1": 0.9085411885232777,
243
+ "num_eval": 9495
244
+ },
245
+ {
246
+ "epoch": 27,
247
+ "train_loss": 0.19573741489643806,
248
+ "lr_end": 4.6245760222010575e-05,
249
+ "elapsed_seconds": 67.75984287261963,
250
+ "accuracy": 0.921853607161664,
251
+ "macro_f1": 0.9077462312429695,
252
+ "num_eval": 9495
253
+ },
254
+ {
255
+ "epoch": 28,
256
+ "train_loss": 0.19350992440770687,
257
+ "lr_end": 1.910507596474794e-05,
258
+ "elapsed_seconds": 67.75555872917175,
259
+ "accuracy": 0.921853607161664,
260
+ "macro_f1": 0.9073942641770526,
261
+ "num_eval": 9495
262
+ },
263
+ {
264
+ "epoch": 29,
265
+ "train_loss": 0.19182746196034614,
266
+ "lr_end": 1.0000028250635854e-05,
267
+ "elapsed_seconds": 67.74632787704468,
268
+ "accuracy": 0.9216429699842023,
269
+ "macro_f1": 0.9066374838199934,
270
+ "num_eval": 9495
271
+ }
272
+ ]
training_metadata.json ADDED
@@ -0,0 +1,257 @@
1
+ {
2
+ "autocast_dtype": "bf16",
3
+ "config": {
4
+ "conv_kernel": 15,
5
+ "d_model": 256,
6
+ "n_attn": 1,
7
+ "n_conv": 3,
8
+ "n_heads": 4,
9
+ "ngram_buckets": 4096,
10
+ "ngram_dim": 64
11
+ },
12
+ "config_name": "base_ngram",
13
+ "device": "cuda",
14
+ "early_stopping_patience": 4,
15
+ "early_stopping_threshold": 0.0,
16
+ "eval_batch_size": 256,
17
+ "eval_rows": 9495,
18
+ "id2label": {
19
+ "0": "Julia",
20
+ "1": "Go",
21
+ "2": "Raku",
22
+ "3": "Pascal",
23
+ "4": "Perl",
24
+ "5": "Python",
25
+ "6": "Nim",
26
+ "7": "J",
27
+ "8": "C",
28
+ "9": "Haskell",
29
+ "10": "Mathematica/Wolfram Language",
30
+ "11": "Kotlin",
31
+ "12": "Java",
32
+ "13": "Ruby",
33
+ "14": "Racket",
34
+ "15": "C++",
35
+ "16": "Tcl",
36
+ "17": "C#",
37
+ "18": "Scala",
38
+ "19": "Ada",
39
+ "20": "Factor",
40
+ "21": "Rust",
41
+ "22": "Lua",
42
+ "23": "D",
43
+ "24": "Common Lisp",
44
+ "25": "AutoHotkey",
45
+ "26": "JavaScript",
46
+ "27": "F#",
47
+ "28": "Wren",
48
+ "29": "Clojure",
49
+ "30": "Fortran",
50
+ "31": "PureBasic",
51
+ "32": "OCaml",
52
+ "33": "Awk",
53
+ "34": "R",
54
+ "35": "Swift",
55
+ "36": "Forth",
56
+ "37": "Elixir",
57
+ "38": "Erlang",
58
+ "39": "PowerShell",
59
+ "40": "Groovy",
60
+ "41": "REXX",
61
+ "42": "Visual Basic .NET",
62
+ "43": "PHP",
63
+ "44": "Prolog",
64
+ "45": "Scheme",
65
+ "46": "AppleScript",
66
+ "47": "COBOL",
67
+ "48": "FreeBASIC",
68
+ "49": "Smalltalk",
69
+ "50": "PicoLisp",
70
+ "51": "MATLAB",
71
+ "52": "Oz",
72
+ "53": "Standard ML",
73
+ "54": "Lasso",
74
+ "55": "Crystal",
75
+ "56": "APL",
76
+ "57": "Stata",
77
+ "58": "CoffeeScript",
78
+ "59": "Batchfile",
79
+ "60": "jq",
80
+ "61": "GAP",
81
+ "62": "Red",
82
+ "63": "Emacs Lisp",
83
+ "64": "Eiffel",
84
+ "65": "Fantom",
85
+ "66": "Ring",
86
+ "67": "Pike",
87
+ "68": "Rebol",
88
+ "69": "ActionScript",
89
+ "70": "MAXScript",
90
+ "71": "Vala",
91
+ "72": "Dart",
92
+ "73": "Processing",
93
+ "74": "Ceylon",
94
+ "75": "AutoIt",
95
+ "76": "Io",
96
+ "77": "M4",
97
+ "78": "Zig",
98
+ "79": "LFE",
99
+ "80": "ATS",
100
+ "81": "Logtalk",
101
+ "82": "ABAP",
102
+ "83": "Scilab",
103
+ "84": "IDL",
104
+ "85": "LabVIEW",
105
+ "86": "ColdFusion",
106
+ "87": "SAS",
107
+ "88": "Haxe",
108
+ "89": "VBA",
109
+ "90": "BASIC",
110
+ "91": "V",
111
+ "92": "VBScript",
112
+ "93": "E",
113
+ "94": "Euphoria",
114
+ "95": "ARM Assembly",
115
+ "96": "Component Pascal",
116
+ "97": "M",
117
+ "98": "Modula-2",
118
+ "99": "BQN",
119
+ "100": "Befunge",
120
+ "101": "Objective-C",
121
+ "102": "NewLisp",
122
+ "103": "Nemerle",
123
+ "104": "Modula-3",
124
+ "105": "Mercury",
125
+ "106": "QuickBASIC"
126
+ },
127
+ "label2id": {
128
+ "ABAP": 82,
129
+ "APL": 56,
130
+ "ARM Assembly": 95,
131
+ "ATS": 80,
132
+ "ActionScript": 69,
133
+ "Ada": 19,
134
+ "AppleScript": 46,
135
+ "AutoHotkey": 25,
136
+ "AutoIt": 75,
137
+ "Awk": 33,
138
+ "BASIC": 90,
139
+ "BQN": 99,
140
+ "Batchfile": 59,
141
+ "Befunge": 100,
142
+ "C": 8,
143
+ "C#": 17,
144
+ "C++": 15,
145
+ "COBOL": 47,
146
+ "Ceylon": 74,
147
+ "Clojure": 29,
148
+ "CoffeeScript": 58,
149
+ "ColdFusion": 86,
150
+ "Common Lisp": 24,
151
+ "Component Pascal": 96,
152
+ "Crystal": 55,
153
+ "D": 23,
154
+ "Dart": 72,
155
+ "E": 93,
156
+ "Eiffel": 64,
157
+ "Elixir": 37,
158
+ "Emacs Lisp": 63,
159
+ "Erlang": 38,
160
+ "Euphoria": 94,
161
+ "F#": 27,
162
+ "Factor": 20,
163
+ "Fantom": 65,
164
+ "Forth": 36,
165
+ "Fortran": 30,
166
+ "FreeBASIC": 48,
167
+ "GAP": 61,
168
+ "Go": 1,
169
+ "Groovy": 40,
170
+ "Haskell": 9,
171
+ "Haxe": 88,
172
+ "IDL": 84,
173
+ "Io": 76,
174
+ "J": 7,
175
+ "Java": 12,
176
+ "JavaScript": 26,
177
+ "Julia": 0,
178
+ "Kotlin": 11,
179
+ "LFE": 79,
180
+ "LabVIEW": 85,
181
+ "Lasso": 54,
182
+ "Logtalk": 81,
183
+ "Lua": 22,
184
+ "M": 97,
185
+ "M4": 77,
186
+ "MATLAB": 51,
187
+ "MAXScript": 70,
188
+ "Mathematica/Wolfram Language": 10,
189
+ "Mercury": 105,
190
+ "Modula-2": 98,
191
+ "Modula-3": 104,
192
+ "Nemerle": 103,
193
+ "NewLisp": 102,
194
+ "Nim": 6,
195
+ "OCaml": 32,
196
+ "Objective-C": 101,
197
+ "Oz": 52,
198
+ "PHP": 43,
199
+ "Pascal": 3,
200
+ "Perl": 4,
201
+ "PicoLisp": 50,
202
+ "Pike": 67,
203
+ "PowerShell": 39,
204
+ "Processing": 73,
205
+ "Prolog": 44,
206
+ "PureBasic": 31,
207
+ "Python": 5,
208
+ "QuickBASIC": 106,
209
+ "R": 34,
210
+ "REXX": 41,
211
+ "Racket": 14,
212
+ "Raku": 2,
213
+ "Rebol": 68,
214
+ "Red": 62,
215
+ "Ring": 66,
216
+ "Ruby": 13,
217
+ "Rust": 21,
218
+ "SAS": 87,
219
+ "Scala": 18,
220
+ "Scheme": 45,
221
+ "Scilab": 83,
222
+ "Smalltalk": 49,
223
+ "Standard ML": 53,
224
+ "Stata": 57,
225
+ "Swift": 35,
226
+ "Tcl": 16,
227
+ "V": 91,
228
+ "VBA": 89,
229
+ "VBScript": 92,
230
+ "Vala": 71,
231
+ "Visual Basic .NET": 42,
232
+ "Wren": 28,
233
+ "Zig": 78,
234
+ "jq": 60
235
+ },
236
+ "learning_rate": 0.003,
237
+ "max_len": 1023,
238
+ "min_learning_rate": 1e-05,
239
+ "model_arch": "ByteHybrid",
240
+ "n_params": 2289515,
241
+ "num_classes": 107,
242
+ "num_train_epochs": 30,
243
+ "snippet_config": {
244
+ "eval_strategy": "head",
245
+ "max_chars": 1023,
246
+ "min_chars": 64,
247
+ "seed": 20260420,
248
+ "short_chars": 128,
249
+ "train_strategies": [
250
+ "variable_window"
251
+ ]
252
+ },
253
+ "train_batch_size": 128,
254
+ "train_rows": 72549,
255
+ "warmup_ratio": 0.05,
256
+ "weight_decay": 0.01
257
+ }
training_summary.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "best_epoch": 26,
3
+ "best_macro_f1": 0.9085411885232777,
4
+ "epochs_run": 30,
5
+ "history_path": "training_history.json"
6
+ }