vijaym committed
Commit 58b8e27 · verified · 1 parent: 20eb65e

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ onnx/model.onnx.data filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,153 @@
+ ---
+ license: apache-2.0
+ language:
+ - multilingual
+ tags:
+ - programming-language-identification
+ - code
+ - byte-level
+ - lite
+ library_name: pytorch
+ pipeline_tag: text-classification
+ metrics:
+ - f1
+ - accuracy
+ ---
+
+ # programming-language-identification-100plus-lite
+
+ Byte-level programming-language identification across **107 languages**. A lite
+ counterpart to the full ModernBERT model
+ `programming-language-identification-100plus`: **2.35M parameters**, no
+ tokenizer, shipping at **~9 MB fp32 / ~4.5 MB bf16**.
+
+ **[Open PyTorch Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_pytorch_demo.ipynb)** · **[Open ONNX Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_onnx_demo.ipynb)** · Download and run in Colab or Jupyter.
+
+ The architecture is `ByteHybrid` (3 × Conv1D → 1 × bidirectional attention with
+ RoPE → masked mean-pool → classifier head, plus a 4096-bucket trigram-hash
+ embedding), vendored from
+ [PleIAs/CommonLingua](https://huggingface.co/PleIAs/CommonLingua) (Apache-2.0)
+ and trained from scratch on Rosetta Code + The Stack v1 across 107 canonical
+ programming languages.
+
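+ The trigram-hash embedding is vocabulary-free: every byte position hashes its
+ 3-byte window into one of 4096 buckets with a rolling base-257 hash. A minimal
+ sketch of that bucketing, mirroring the vendored `ByteNgramEmbed` in the demo
+ notebook (base and bucket count are taken from it):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def trigram_buckets(byte_ids, n=3, num_buckets=4096):
+     """Map (B, T) byte ids to (B, T) embedding-bucket indices."""
+     B, T = byte_ids.shape
+     clamped = byte_ids.clamp(max=255)             # fold the padding id 256 into byte range
+     padded = F.pad(clamped, (0, n - 1), value=0)  # right-pad so every position has a full window
+     h = torch.zeros(B, T, dtype=torch.long)
+     for i in range(n):
+         h = h * 257 + padded[:, i:i + T]          # rolling base-257 hash over the trigram
+     return h % num_buckets                        # index into a 4096-row nn.Embedding
+ ```
+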
+ ## Comparison with `philomath-1209/programming-language-identification`
+
+ Evaluated on 3,057 test rows covering the **26 labels** philomath supports.
+ ONNX, `CPUExecutionProvider`, batch 64.
+
+ | model | params | accuracy | macro F1 | weighted F1 | speed |
+ |---|---:|---:|---:|---:|---:|
+ | **programming-language-identification-100plus-lite** (ONNX) | 2.35 M | 0.9094 | **0.9410** | **0.9361** | **2.37×** |
+ | philomath-1209/programming-language-identification (ONNX) | 84 M | 0.8449 | 0.8445 | 0.8467 | 1.00× |
+
+ Speed is the ratio of texts/sec relative to philomath on the same CPU
+ (`onnxruntime` `CPUExecutionProvider`, single host, no other GPU/CPU load); a
+ reproduction sketch follows below. GPU torch-vs-torch numbers are pending;
+ these CPU figures reflect the realistic edge-deployment scenario.
+
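+ A sketch of how the texts/sec figure can be reproduced, following the
+ throughput cell in the ONNX demo notebook (random bytes stand in for real
+ snippets here):
+
+ ```python
+ import time
+ import numpy as np
+ import onnxruntime as ort
+
+ sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
+ batch = np.random.randint(0, 256, size=(64, 1023), dtype=np.int64)  # 64 dummy texts
+ for _ in range(3):                                   # warm-up runs
+     sess.run(None, {"byte_ids": batch})
+ t0 = time.time()
+ for _ in range(40):                                  # timed runs: 40 batches of 64
+     sess.run(None, {"byte_ids": batch})
+ print(f"{40 * 64 / (time.time() - t0):.0f} texts/sec")
+ ```
+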
+ ## Files
+
+ ```
+ model.pt                  fp32 PyTorch checkpoint (CommonLingua format)
+ model.bf16.pt             bf16 sidecar checkpoint (smaller, same accuracy in eval)
+ lang2idx.json             107-label index
+ training_metadata.json    hyperparameters and dataset stats
+ training_history.json     per-epoch loss / val_acc / val_macro_f1
+ onnx/
+   model.onnx              ONNX export (opset 20, dynamic batch)
+   model.onnx.data         external weights blob
+   lang2idx.json           (mirror)
+   onnx_metadata.json      parity report vs PyTorch
+ ```
+
+ ## Quick start — PyTorch
+
+ ```python
+ import torch, numpy as np, sys
+ sys.path.append("path/to/code-language-id/src")  # repo that defines ByteHybrid
+ from code_language_id.byte_hybrid import ByteHybrid, CONFIGS
+
+ ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
+ model = ByteHybrid(num_classes=ckpt["num_classes"], max_len=ckpt["max_len"],
+                    **CONFIGS[ckpt["config"]]).eval()
+ model.load_state_dict(ckpt["model_state_dict"])
+ idx2lang = {v: k for k, v in ckpt["lang2idx"].items()}
+
+ def encode(texts, max_len=ckpt["max_len"]):
+     # 256 is the padding id; real byte values occupy 0-255
+     out = np.full((len(texts), max_len), 256, dtype=np.int64)
+     for i, t in enumerate(texts):
+         b = t.encode("utf-8", errors="replace")[:max_len]
+         out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
+     return torch.from_numpy(out)
+
+ with torch.no_grad():
+     logits = model(encode(["def hello():\n print('hi')"]))
+ print(idx2lang[int(logits.argmax(-1))])  # -> Python
+ ```
+
+ ## Quick start — ONNX Runtime
+
+ ```python
+ import onnxruntime as ort, numpy as np, json
+
+ # onnx/model.onnx.data (the external-weights blob) must sit next to model.onnx
+ sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
+ lang2idx = json.load(open("onnx/lang2idx.json"))
+ idx2lang = {v: k for k, v in lang2idx.items()}
+ MAX_LEN = 1023
+
+ def encode(texts, max_len=MAX_LEN):
+     out = np.full((len(texts), max_len), 256, dtype=np.int64)  # 256 = padding id
+     for i, t in enumerate(texts):
+         b = t.encode("utf-8", errors="replace")[:max_len]
+         out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
+     return out
+
+ logits = sess.run(None, {"byte_ids": encode(["fn main() {}"])})[0]
+ print(idx2lang[int(logits.argmax(-1))])  # -> Rust
+ ```
+
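+ For ranked predictions with probabilities instead of a bare argmax, apply a
+ softmax to the logits; this mirrors the `predict` helper in the demo notebooks
+ (continuing from the snippet above):
+
+ ```python
+ def softmax(x, axis=-1):
+     e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
+     return e / e.sum(axis=axis, keepdims=True)
+
+ probs = softmax(sess.run(None, {"byte_ids": encode(["print 'hi'"])})[0])
+ top3 = np.argsort(-probs[0])[:3]                     # indices of the 3 best labels
+ print([(idx2lang[int(j)], round(float(probs[0, j]), 3)) for j in top3])
+ ```
+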
+ ## Training summary
+
+ - **Data**: Rosetta Code (`cakiki/rosetta-code`) + The Stack v1
+   (`bigcode/the-stack`), task-split to prevent leakage.
+   72,549 / 9,495 / 8,880 rows (train / val / test) across 107 canonical labels.
+ - **Snippets**: variable-window (64–1023 bytes) UTF-8.
+ - **Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight decay 0.01) + cosine-with-warmup,
+   peak LR 3e-3, 5% warmup, gradient clipping 1.0 (sketched below).
+ - **Schedule**: 30 epochs, bf16 autocast, batch 128 (effective batch size 128,
+   no gradient accumulation; SDPA fused attention).
+ - **Best val macro F1**: 0.9085 @ epoch 26 (best checkpoint of the 30-epoch run;
+   the early-stopping patience of 4 never triggered).
+
+ See `training_metadata.json` for the full hyperparameter dump.
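+
+ A sketch of the optimizer and schedule described above, under the assumption of
+ a standard PyTorch training loop (`model` and `steps_per_epoch` are placeholders
+ for your module and loader length):
+
+ ```python
+ import math, torch
+
+ # values from training_metadata.json
+ EPOCHS, PEAK_LR, MIN_LR, WARMUP_RATIO = 30, 3e-3, 1e-5, 0.05
+ total_steps = EPOCHS * steps_per_epoch          # steps_per_epoch: e.g. len(train_loader)
+ warmup_steps = int(WARMUP_RATIO * total_steps)
+
+ opt = torch.optim.AdamW(model.parameters(), lr=PEAK_LR,
+                         betas=(0.9, 0.95), weight_decay=0.01)
+
+ def lr_scale(step):  # multiplier on PEAK_LR: linear warmup, then cosine decay to MIN_LR
+     if step < warmup_steps:
+         return step / max(1, warmup_steps)
+     t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+     return (MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * t))) / PEAK_LR
+
+ sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)
+ # per step: loss.backward()
+ #           torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+ #           opt.step(); sched.step(); opt.zero_grad()
+ ```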
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{mariappan2026codelangidlite,
+   author    = {Mariappan, Vijayachandran},
+   title     = {programming-language-identification-100plus-lite: Byte-level Programming Language Identification across 107 Languages},
+   year      = {2026},
+   publisher = {Hugging Face},
+   url       = {https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite}
+ }
+ ```
+
+ Upstream architecture:
+
+ ```bibtex
+ @misc{commonlingua,
+   author    = {{PleIAs}},
+   title     = {CommonLingua: Byte-level Language Identification for 334 Languages},
+   year      = {2026},
+   publisher = {Hugging Face},
+   url       = {https://huggingface.co/PleIAs/CommonLingua}
+ }
+ ```
+
+ ## License & attribution
+
+ Apache-2.0. Architecture and reference inference code derive from
+ **PleIAs/CommonLingua** (Apache-2.0). Trained weights and dataset curation are
+ original to this repository.
lang2idx.json ADDED
@@ -0,0 +1,109 @@
1
+ {
2
+ "ABAP": 82,
3
+ "APL": 56,
4
+ "ARM Assembly": 95,
5
+ "ATS": 80,
6
+ "ActionScript": 69,
7
+ "Ada": 19,
8
+ "AppleScript": 46,
9
+ "AutoHotkey": 25,
10
+ "AutoIt": 75,
11
+ "Awk": 33,
12
+ "BASIC": 90,
13
+ "BQN": 99,
14
+ "Batchfile": 59,
15
+ "Befunge": 100,
16
+ "C": 8,
17
+ "C#": 17,
18
+ "C++": 15,
19
+ "COBOL": 47,
20
+ "Ceylon": 74,
21
+ "Clojure": 29,
22
+ "CoffeeScript": 58,
23
+ "ColdFusion": 86,
24
+ "Common Lisp": 24,
25
+ "Component Pascal": 96,
26
+ "Crystal": 55,
27
+ "D": 23,
28
+ "Dart": 72,
29
+ "E": 93,
30
+ "Eiffel": 64,
31
+ "Elixir": 37,
32
+ "Emacs Lisp": 63,
33
+ "Erlang": 38,
34
+ "Euphoria": 94,
35
+ "F#": 27,
36
+ "Factor": 20,
37
+ "Fantom": 65,
38
+ "Forth": 36,
39
+ "Fortran": 30,
40
+ "FreeBASIC": 48,
41
+ "GAP": 61,
42
+ "Go": 1,
43
+ "Groovy": 40,
44
+ "Haskell": 9,
45
+ "Haxe": 88,
46
+ "IDL": 84,
47
+ "Io": 76,
48
+ "J": 7,
49
+ "Java": 12,
50
+ "JavaScript": 26,
51
+ "Julia": 0,
52
+ "Kotlin": 11,
53
+ "LFE": 79,
54
+ "LabVIEW": 85,
55
+ "Lasso": 54,
56
+ "Logtalk": 81,
57
+ "Lua": 22,
58
+ "M": 97,
59
+ "M4": 77,
60
+ "MATLAB": 51,
61
+ "MAXScript": 70,
62
+ "Mathematica/Wolfram Language": 10,
63
+ "Mercury": 105,
64
+ "Modula-2": 98,
65
+ "Modula-3": 104,
66
+ "Nemerle": 103,
67
+ "NewLisp": 102,
68
+ "Nim": 6,
69
+ "OCaml": 32,
70
+ "Objective-C": 101,
71
+ "Oz": 52,
72
+ "PHP": 43,
73
+ "Pascal": 3,
74
+ "Perl": 4,
75
+ "PicoLisp": 50,
76
+ "Pike": 67,
77
+ "PowerShell": 39,
78
+ "Processing": 73,
79
+ "Prolog": 44,
80
+ "PureBasic": 31,
81
+ "Python": 5,
82
+ "QuickBASIC": 106,
83
+ "R": 34,
84
+ "REXX": 41,
85
+ "Racket": 14,
86
+ "Raku": 2,
87
+ "Rebol": 68,
88
+ "Red": 62,
89
+ "Ring": 66,
90
+ "Ruby": 13,
91
+ "Rust": 21,
92
+ "SAS": 87,
93
+ "Scala": 18,
94
+ "Scheme": 45,
95
+ "Scilab": 83,
96
+ "Smalltalk": 49,
97
+ "Standard ML": 53,
98
+ "Stata": 57,
99
+ "Swift": 35,
100
+ "Tcl": 16,
101
+ "V": 91,
102
+ "VBA": 89,
103
+ "VBScript": 92,
104
+ "Vala": 71,
105
+ "Visual Basic .NET": 42,
106
+ "Wren": 28,
107
+ "Zig": 78,
108
+ "jq": 60
109
+ }
lite_onnx_demo.ipynb ADDED
@@ -0,0 +1,91 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# `programming-language-identification-100plus-lite` — ONNX Runtime\n",
8
+ "\n",
9
+ "Same model as the PyTorch demo, exported to ONNX (opset 20). No torch needed at inference time. CPU-friendly: ~57 texts/sec single-thread on commodity hardware (2.37× philomath-1209 on the same box).\n",
10
+ "\n",
11
+ "Run end-to-end in Colab or Jupyter."
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "markdown",
16
+ "metadata": {},
17
+ "source": [
18
+ "## Install dependencies"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": null,
24
+ "metadata": {},
25
+ "outputs": [],
26
+ "source": "%%capture\n!pip install -q -U onnxruntime huggingface_hub numpy\n"
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "metadata": {},
31
+ "source": [
32
+ "## Download ONNX model + label index"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": null,
38
+ "metadata": {},
39
+ "outputs": [],
40
+ "source": "import json\nimport numpy as np\nimport onnxruntime as ort\nfrom huggingface_hub import hf_hub_download\n\nREPO = 'FrameByFrame/programming-language-identification-100plus-lite'\nonnx_path = hf_hub_download(REPO, 'onnx/model.onnx')\n# external weight blob lives next to the .onnx file\nhf_hub_download(REPO, 'onnx/model.onnx.data')\nlang2idx = json.loads(open(hf_hub_download(REPO, 'onnx/lang2idx.json')).read())\nmeta = json.loads(open(hf_hub_download(REPO, 'onnx/onnx_metadata.json')).read())\n\nsess = ort.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])\nidx2lang = {v: k for k, v in lang2idx.items()}\nMAX_LEN = meta['max_len']\nprint(f'{len(idx2lang)} labels | max_len={MAX_LEN} | providers={sess.get_providers()}')"
41
+ },
42
+ {
43
+ "cell_type": "markdown",
44
+ "metadata": {},
45
+ "source": [
46
+ "## Helpers"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": null,
52
+ "metadata": {},
53
+ "outputs": [],
54
+ "source": "def encode(texts, max_len=MAX_LEN):\n out = np.full((len(texts), max_len), 256, dtype=np.int64)\n for i, t in enumerate(texts):\n b = t.encode('utf-8', errors='replace')[:max_len]\n out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)\n return out\n\n\ndef softmax(logits, axis=-1):\n e = np.exp(logits - logits.max(axis=axis, keepdims=True))\n return e / e.sum(axis=axis, keepdims=True)\n\n\ndef predict(texts, top_k=3):\n logits = sess.run(None, {'byte_ids': encode(texts)})[0]\n probs = softmax(logits)\n top_i = np.argsort(-probs, axis=-1)[:, :top_k]\n return [[(idx2lang[int(j)], float(probs[r, j])) for j in row]\n for r, row in enumerate(top_i)]"
55
+ },
56
+ {
57
+ "cell_type": "markdown",
58
+ "metadata": {},
59
+ "source": [
60
+ "## Predict"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": null,
66
+ "metadata": {},
67
+ "outputs": [],
68
+ "source": "samples = [\n \"def fib(n):\\n return n if n < 2 else fib(n-1) + fib(n-2)\",\n \"fn main() {\\n println!(\\\"hello, world\\\");\\n}\",\n \"package main\\nimport \\\"fmt\\\"\\nfunc main() { fmt.Println(\\\"hi\\\") }\",\n \"#include <stdio.h>\\nint main() { printf(\\\"hi\\\\n\\\"); return 0; }\",\n \"SELECT name FROM users WHERE id = 42;\",\n]\nfor text, top in zip(samples, predict(samples)):\n print(f'{top[0][0]:<14s} {top[0][1]:.3f} ({top[1][0]} {top[1][1]:.2f}, {top[2][0]} {top[2][1]:.2f}) | {text[:60]!r}')"
69
+ },
70
+ {
71
+ "cell_type": "markdown",
72
+ "metadata": {},
73
+ "source": [
74
+ "## Throughput sanity check"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "code",
79
+ "execution_count": null,
80
+ "metadata": {},
81
+ "outputs": [],
82
+ "source": "import time\nwarm = encode(samples * 13)[:64]\nfor _ in range(3):\n sess.run(None, {'byte_ids': warm})\nt0 = time.time()\nfor _ in range(40):\n sess.run(None, {'byte_ids': warm})\nelapsed = time.time() - t0\nprint(f'{40*64/elapsed:.0f} texts/sec ({elapsed:.2f}s for 40 batches of 64)')"
83
+ }
84
+ ],
85
+ "metadata": {
86
+ "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
87
+ "language_info": {"name": "python", "version": "3.11"}
88
+ },
89
+ "nbformat": 4,
90
+ "nbformat_minor": 5
91
+ }
lite_pytorch_demo.ipynb ADDED
@@ -0,0 +1,91 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# `programming-language-identification-100plus-lite` — PyTorch\n",
8
+ "\n",
9
+ "2.35M-param byte-level classifier across 107 programming languages. No tokenizer; raw UTF-8 bytes padded to 1023.\n",
10
+ "\n",
11
+ "Self-contained: this notebook inlines the model definition (vendored from PleIAs/CommonLingua, Apache-2.0) and downloads the checkpoint from the Hub. Run end-to-end in Colab or Jupyter."
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "markdown",
16
+ "metadata": {},
17
+ "source": [
18
+ "## Install dependencies"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": null,
24
+ "metadata": {},
25
+ "outputs": [],
26
+ "source": "%%capture\n!pip install -q -U torch huggingface_hub\n"
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "metadata": {},
31
+ "source": [
32
+ "## Model definition (ByteHybrid — vendored from PleIAs/CommonLingua, Apache-2.0)"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": null,
38
+ "metadata": {},
39
+ "outputs": [],
40
+ "source": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass ByteNgramEmbed(nn.Module):\n def __init__(self, num_buckets=4096, embed_dim=64, n=3):\n super().__init__()\n self.n, self.num_buckets = n, num_buckets\n self.embed = nn.Embedding(num_buckets, embed_dim)\n\n def forward(self, byte_ids):\n B, T = byte_ids.shape\n clamped = byte_ids.clamp(max=255)\n padded = F.pad(clamped, (0, self.n - 1), value=0)\n h = torch.zeros(B, T, dtype=torch.long, device=byte_ids.device)\n for i in range(self.n):\n h = h * 257 + padded[:, i:i + T]\n return self.embed(h % self.num_buckets)\n\n\nclass ByteConvBlock(nn.Module):\n def __init__(self, d_model, kernel_size=15, expand=2):\n super().__init__()\n self.norm1 = nn.LayerNorm(d_model)\n self.pad = kernel_size - 1\n self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)\n self.norm2 = nn.LayerNorm(d_model)\n ffn = d_model * expand\n self.ffn_gate = nn.Linear(d_model, ffn, bias=False)\n self.ffn_up = nn.Linear(d_model, ffn, bias=False)\n self.ffn_down = nn.Linear(ffn, d_model, bias=False)\n\n def forward(self, x):\n residual = x\n x = self.norm1(x).transpose(1, 2)\n x = F.pad(x, (self.pad, 0))\n x = F.silu(self.conv(x)).transpose(1, 2)\n x = residual + x\n residual = x\n x = self.norm2(x)\n return residual + self.ffn_down(F.silu(self.ffn_gate(x)) * self.ffn_up(x))\n\n\ndef _rope(q, k):\n head_dim, seq_len = q.shape[-1], q.shape[-2]\n freqs = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2, device=q.device).float() / head_dim))\n a = torch.outer(torch.arange(seq_len, device=q.device), freqs)\n cos, sin = a.cos().to(q.dtype), a.sin().to(q.dtype)\n def rot(x):\n x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2:]\n return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)\n return rot(q), rot(k)\n\n\nclass ByteAttnBlock(nn.Module):\n def __init__(self, d_model, n_heads=4, expand=2):\n super().__init__()\n self.n_heads, self.head_dim = n_heads, d_model // n_heads\n self.norm1 = nn.LayerNorm(d_model)\n self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)\n self.out_proj = nn.Linear(d_model, d_model, bias=False)\n self.norm2 = nn.LayerNorm(d_model)\n ffn = d_model * expand\n self.ffn_gate = nn.Linear(d_model, ffn, bias=False)\n self.ffn_up = nn.Linear(d_model, ffn, bias=False)\n self.ffn_down = nn.Linear(ffn, d_model, bias=False)\n\n def forward(self, x):\n B, T, D = x.shape\n residual = x\n h = self.norm1(x)\n qkv = self.qkv(h).reshape(B, T, 3, self.n_heads, self.head_dim)\n q, k, v = (t.transpose(1, 2) for t in qkv.unbind(dim=2))\n q, k = _rope(q, k)\n out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False)\n out = out.transpose(1, 2).contiguous().view(B, T, D)\n x = residual + self.out_proj(out)\n residual = x\n h = self.norm2(x)\n return residual + self.ffn_down(F.silu(self.ffn_gate(h)) * self.ffn_up(h))\n\n\nclass ByteHybrid(nn.Module):\n def __init__(self, num_classes, d_model=256, n_conv=3, n_attn=1, n_heads=4,\n ffn_expand=2, max_len=512, conv_kernel=15, ngram_buckets=4096, ngram_dim=64):\n super().__init__()\n self.max_len = max_len\n self.embed = nn.Embedding(257, d_model, padding_idx=256)\n self.ngram_embed = ByteNgramEmbed(ngram_buckets, ngram_dim, n=3) if ngram_buckets else None\n if self.ngram_embed is not None:\n self.ngram_proj = nn.Linear(ngram_dim, d_model, bias=False)\n self.conv_layers = nn.ModuleList([ByteConvBlock(d_model, conv_kernel, ffn_expand) for _ in range(n_conv)])\n self.attn_layers = nn.ModuleList([ByteAttnBlock(d_model, n_heads, ffn_expand) for _ 
in range(n_attn)])\n self.final_norm = nn.LayerNorm(d_model)\n self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Dropout(0.1), nn.Linear(d_model, num_classes))\n\n def forward(self, byte_ids):\n pad_mask = byte_ids != 256\n x = self.embed(byte_ids)\n if self.ngram_embed is not None:\n x = x + self.ngram_proj(self.ngram_embed(byte_ids))\n for layer in self.conv_layers:\n x = layer(x)\n for layer in self.attn_layers:\n x = layer(x)\n x = self.final_norm(x)\n mask = pad_mask.unsqueeze(-1).to(x.dtype)\n x = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)\n return self.head(x)\n"
41
+ },
42
+ {
43
+ "cell_type": "markdown",
44
+ "metadata": {},
45
+ "source": [
46
+ "## Load checkpoint from the Hub"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": null,
52
+ "metadata": {},
53
+ "outputs": [],
54
+ "source": "from huggingface_hub import hf_hub_download\nimport numpy as np\n\nREPO = 'FrameByFrame/programming-language-identification-100plus-lite'\nckpt_path = hf_hub_download(REPO, 'model.pt')\nckpt = torch.load(ckpt_path, map_location='cpu', weights_only=False)\n\nBASE_NGRAM = dict(d_model=256, n_conv=3, n_attn=1, n_heads=4, conv_kernel=15,\n ngram_buckets=4096, ngram_dim=64)\nmodel = ByteHybrid(num_classes=ckpt['num_classes'], max_len=ckpt['max_len'], **BASE_NGRAM).eval()\nmodel.load_state_dict(ckpt['model_state_dict'])\nidx2lang = {v: k for k, v in ckpt['lang2idx'].items()}\nMAX_LEN = ckpt['max_len']\nprint(f'{ckpt[\"num_classes\"]} labels | max_len={MAX_LEN} | params={sum(p.numel() for p in model.parameters())/1e6:.2f}M')"
55
+ },
56
+ {
57
+ "cell_type": "markdown",
58
+ "metadata": {},
59
+ "source": [
60
+ "## Helpers"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": null,
66
+ "metadata": {},
67
+ "outputs": [],
68
+ "source": "def encode(texts, max_len=MAX_LEN):\n out = np.full((len(texts), max_len), 256, dtype=np.int64)\n for i, t in enumerate(texts):\n b = t.encode('utf-8', errors='replace')[:max_len]\n out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)\n return torch.from_numpy(out)\n\n\n@torch.no_grad()\ndef predict(texts, top_k=3):\n probs = torch.softmax(model(encode(texts)).float(), dim=-1)\n top_p, top_i = probs.topk(top_k, dim=-1)\n return [[(idx2lang[int(j)], float(p)) for p, j in zip(pr, ix)]\n for pr, ix in zip(top_p.tolist(), top_i.tolist())]"
69
+ },
70
+ {
71
+ "cell_type": "markdown",
72
+ "metadata": {},
73
+ "source": [
74
+ "## Predict"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "code",
79
+ "execution_count": null,
80
+ "metadata": {},
81
+ "outputs": [],
82
+ "source": "samples = [\n \"def fib(n):\\n return n if n < 2 else fib(n-1) + fib(n-2)\",\n \"fn main() {\\n println!(\\\"hello, world\\\");\\n}\",\n \"package main\\nimport \\\"fmt\\\"\\nfunc main() { fmt.Println(\\\"hi\\\") }\",\n \"#include <stdio.h>\\nint main() { printf(\\\"hi\\\\n\\\"); return 0; }\",\n \"SELECT name FROM users WHERE id = 42;\",\n]\nfor text, top in zip(samples, predict(samples)):\n print(f'{top[0][0]:<14s} {top[0][1]:.3f} ({top[1][0]} {top[1][1]:.2f}, {top[2][0]} {top[2][1]:.2f}) | {text[:60]!r}')"
83
+ }
84
+ ],
85
+ "metadata": {
86
+ "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
87
+ "language_info": {"name": "python", "version": "3.11"}
88
+ },
89
+ "nbformat": 4,
90
+ "nbformat_minor": 5
91
+ }
model.bf16.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eb17daed0d0b5007d27133abd67b305a3059665713a34c9942dd6ff28b6545f3
3
+ size 4595893
model.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:16090268695bb1c44ee411e1e78893850d4cae90753faf2a45eaecc9c2444216
3
+ size 9173558
onnx/lang2idx.json ADDED
@@ -0,0 +1,109 @@
1
+ {
2
+ "ABAP": 82,
3
+ "APL": 56,
4
+ "ARM Assembly": 95,
5
+ "ATS": 80,
6
+ "ActionScript": 69,
7
+ "Ada": 19,
8
+ "AppleScript": 46,
9
+ "AutoHotkey": 25,
10
+ "AutoIt": 75,
11
+ "Awk": 33,
12
+ "BASIC": 90,
13
+ "BQN": 99,
14
+ "Batchfile": 59,
15
+ "Befunge": 100,
16
+ "C": 8,
17
+ "C#": 17,
18
+ "C++": 15,
19
+ "COBOL": 47,
20
+ "Ceylon": 74,
21
+ "Clojure": 29,
22
+ "CoffeeScript": 58,
23
+ "ColdFusion": 86,
24
+ "Common Lisp": 24,
25
+ "Component Pascal": 96,
26
+ "Crystal": 55,
27
+ "D": 23,
28
+ "Dart": 72,
29
+ "E": 93,
30
+ "Eiffel": 64,
31
+ "Elixir": 37,
32
+ "Emacs Lisp": 63,
33
+ "Erlang": 38,
34
+ "Euphoria": 94,
35
+ "F#": 27,
36
+ "Factor": 20,
37
+ "Fantom": 65,
38
+ "Forth": 36,
39
+ "Fortran": 30,
40
+ "FreeBASIC": 48,
41
+ "GAP": 61,
42
+ "Go": 1,
43
+ "Groovy": 40,
44
+ "Haskell": 9,
45
+ "Haxe": 88,
46
+ "IDL": 84,
47
+ "Io": 76,
48
+ "J": 7,
49
+ "Java": 12,
50
+ "JavaScript": 26,
51
+ "Julia": 0,
52
+ "Kotlin": 11,
53
+ "LFE": 79,
54
+ "LabVIEW": 85,
55
+ "Lasso": 54,
56
+ "Logtalk": 81,
57
+ "Lua": 22,
58
+ "M": 97,
59
+ "M4": 77,
60
+ "MATLAB": 51,
61
+ "MAXScript": 70,
62
+ "Mathematica/Wolfram Language": 10,
63
+ "Mercury": 105,
64
+ "Modula-2": 98,
65
+ "Modula-3": 104,
66
+ "Nemerle": 103,
67
+ "NewLisp": 102,
68
+ "Nim": 6,
69
+ "OCaml": 32,
70
+ "Objective-C": 101,
71
+ "Oz": 52,
72
+ "PHP": 43,
73
+ "Pascal": 3,
74
+ "Perl": 4,
75
+ "PicoLisp": 50,
76
+ "Pike": 67,
77
+ "PowerShell": 39,
78
+ "Processing": 73,
79
+ "Prolog": 44,
80
+ "PureBasic": 31,
81
+ "Python": 5,
82
+ "QuickBASIC": 106,
83
+ "R": 34,
84
+ "REXX": 41,
85
+ "Racket": 14,
86
+ "Raku": 2,
87
+ "Rebol": 68,
88
+ "Red": 62,
89
+ "Ring": 66,
90
+ "Ruby": 13,
91
+ "Rust": 21,
92
+ "SAS": 87,
93
+ "Scala": 18,
94
+ "Scheme": 45,
95
+ "Scilab": 83,
96
+ "Smalltalk": 49,
97
+ "Standard ML": 53,
98
+ "Stata": 57,
99
+ "Swift": 35,
100
+ "Tcl": 16,
101
+ "V": 91,
102
+ "VBA": 89,
103
+ "VBScript": 92,
104
+ "Vala": 71,
105
+ "Visual Basic .NET": 42,
106
+ "Wren": 28,
107
+ "Zig": 78,
108
+ "jq": 60
109
+ }
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6f0d666c7861a9c60f9d0b8f02ff9f47d04aa64237929db7d7ec9255b297668d
3
+ size 145695
onnx/model.onnx.data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0843fd67445b7e0447627cfa4d04b2b60f5fa0c2f8cf3e4d3e7cf5b5dcf1d2d0
3
+ size 9289004
onnx/onnx_metadata.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "config": "base_ngram",
3
+ "max_len": 1023,
4
+ "num_classes": 107,
5
+ "opset": 20,
6
+ "parity": {
7
+ "argmax_match": 1.0,
8
+ "max_abs_diff": 3.62396240234375e-05,
9
+ "max_rel_diff": 4.5247681555338204e-05,
10
+ "samples": 8
11
+ },
12
+ "source_checkpoint": "/models/guardrail_code_models/programming-language-identification-100plus-lite/model.pt"
13
+ }
training_history.json ADDED
@@ -0,0 +1,272 @@
1
+ [
2
+ {
3
+ "epoch": 0,
4
+ "train_loss": 2.9708972894443297,
5
+ "lr_end": 0.0020011764705882354,
6
+ "elapsed_seconds": 66.09606695175171,
7
+ "accuracy": 0.6856240126382307,
8
+ "macro_f1": 0.612607947403309,
9
+ "num_eval": 9495
10
+ },
11
+ {
12
+ "epoch": 1,
13
+ "train_loss": 1.3010274481744786,
14
+ "lr_end": 0.002997738005457924,
15
+ "elapsed_seconds": 67.71248269081116,
16
+ "accuracy": 0.7769352290679304,
17
+ "macro_f1": 0.7317482364510279,
18
+ "num_eval": 9495
19
+ },
20
+ {
21
+ "epoch": 2,
22
+ "train_loss": 1.0224097316606127,
23
+ "lr_end": 0.0029796353183136315,
24
+ "elapsed_seconds": 67.818186044693,
25
+ "accuracy": 0.8169562927856767,
26
+ "macro_f1": 0.78981588471773,
27
+ "num_eval": 9495
28
+ },
29
+ {
30
+ "epoch": 3,
31
+ "train_loss": 0.8672642597971545,
32
+ "lr_end": 0.0029436336625379826,
33
+ "elapsed_seconds": 67.77718615531921,
34
+ "accuracy": 0.8457082675092154,
35
+ "macro_f1": 0.8202063248482179,
36
+ "num_eval": 9495
37
+ },
38
+ {
39
+ "epoch": 4,
40
+ "train_loss": 0.7868876520952358,
41
+ "lr_end": 0.002890170022447983,
42
+ "elapsed_seconds": 67.75592398643494,
43
+ "accuracy": 0.8608741442864666,
44
+ "macro_f1": 0.8353007183490015,
45
+ "num_eval": 9495
46
+ },
47
+ {
48
+ "epoch": 5,
49
+ "train_loss": 0.7165160898856732,
50
+ "lr_end": 0.0028198933340924342,
51
+ "elapsed_seconds": 67.73916673660278,
52
+ "accuracy": 0.8655081621906267,
53
+ "macro_f1": 0.8460978781253898,
54
+ "num_eval": 9495
55
+ },
56
+ {
57
+ "epoch": 6,
58
+ "train_loss": 0.6669132913845994,
59
+ "lr_end": 0.0027336566085343216,
60
+ "elapsed_seconds": 67.72217750549316,
61
+ "accuracy": 0.8682464454976303,
62
+ "macro_f1": 0.845950703682244,
63
+ "num_eval": 9495
64
+ },
65
+ {
66
+ "epoch": 7,
67
+ "train_loss": 0.61862961949253,
68
+ "lr_end": 0.002632506578092115,
69
+ "elapsed_seconds": 67.68294095993042,
70
+ "accuracy": 0.8798314902580305,
71
+ "macro_f1": 0.8561684647466593,
72
+ "num_eval": 9495
73
+ },
74
+ {
75
+ "epoch": 8,
76
+ "train_loss": 0.5789330196325734,
77
+ "lr_end": 0.0025176709912128107,
78
+ "elapsed_seconds": 67.72237920761108,
79
+ "accuracy": 0.8833070036861506,
80
+ "macro_f1": 0.8626534044519957,
81
+ "num_eval": 9495
82
+ },
83
+ {
84
+ "epoch": 9,
85
+ "train_loss": 0.5419822454815652,
86
+ "lr_end": 0.002390543710190218,
87
+ "elapsed_seconds": 67.71956896781921,
88
+ "accuracy": 0.8906793048973144,
89
+ "macro_f1": 0.8723699895551144,
90
+ "num_eval": 9495
91
+ },
92
+ {
93
+ "epoch": 10,
94
+ "train_loss": 0.5017477142915819,
95
+ "lr_end": 0.0022526677926108145,
96
+ "elapsed_seconds": 67.73478865623474,
97
+ "accuracy": 0.893417588204318,
98
+ "macro_f1": 0.876241729959754,
99
+ "num_eval": 9495
100
+ },
101
+ {
102
+ "epoch": 11,
103
+ "train_loss": 0.47520807102780815,
104
+ "lr_end": 0.002105716761882813,
105
+ "elapsed_seconds": 67.72090244293213,
106
+ "accuracy": 0.8967877830437072,
107
+ "macro_f1": 0.877042583233766,
108
+ "num_eval": 9495
109
+ },
110
+ {
111
+ "epoch": 12,
112
+ "train_loss": 0.4500465152393418,
113
+ "lr_end": 0.0019514742941847767,
114
+ "elapsed_seconds": 67.71593427658081,
115
+ "accuracy": 0.9001579778830964,
116
+ "macro_f1": 0.8804974841902045,
117
+ "num_eval": 9495
118
+ },
119
+ {
120
+ "epoch": 13,
121
+ "train_loss": 0.42171269541808853,
122
+ "lr_end": 0.0017918125683914858,
123
+ "elapsed_seconds": 67.82398700714111,
124
+ "accuracy": 0.9034228541337546,
125
+ "macro_f1": 0.8822687844188983,
126
+ "num_eval": 9495
127
+ },
128
+ {
129
+ "epoch": 14,
130
+ "train_loss": 0.3915723686496021,
131
+ "lr_end": 0.0016286695417633624,
132
+ "elapsed_seconds": 67.73259091377258,
133
+ "accuracy": 0.9016324381253291,
134
+ "macro_f1": 0.8837984553247075,
135
+ "num_eval": 9495
136
+ },
137
+ {
138
+ "epoch": 15,
139
+ "train_loss": 0.3607438381742673,
140
+ "lr_end": 0.0014640254272247667,
141
+ "elapsed_seconds": 67.74051308631897,
142
+ "accuracy": 0.9083728278041074,
143
+ "macro_f1": 0.8927869860007333,
144
+ "num_eval": 9495
145
+ },
146
+ {
147
+ "epoch": 16,
148
+ "train_loss": 0.3410214547847712,
149
+ "lr_end": 0.0012998786577474743,
150
+ "elapsed_seconds": 67.72065258026123,
151
+ "accuracy": 0.905423907319642,
152
+ "macro_f1": 0.8922696411524683,
153
+ "num_eval": 9495
154
+ },
155
+ {
156
+ "epoch": 17,
157
+ "train_loss": 0.320683361050666,
158
+ "lr_end": 0.0011382216295811381,
159
+ "elapsed_seconds": 67.73088240623474,
160
+ "accuracy": 0.9070036861506056,
161
+ "macro_f1": 0.8910737901237671,
162
+ "num_eval": 9495
163
+ },
164
+ {
165
+ "epoch": 18,
166
+ "train_loss": 0.3039343846906733,
167
+ "lr_end": 0.0009810165187568425,
168
+ "elapsed_seconds": 67.75304675102234,
169
+ "accuracy": 0.9130068457082675,
170
+ "macro_f1": 0.8995491401951339,
171
+ "num_eval": 9495
172
+ },
173
+ {
174
+ "epoch": 19,
175
+ "train_loss": 0.27867962774641303,
176
+ "lr_end": 0.0008301714644005056,
177
+ "elapsed_seconds": 67.74108362197876,
178
+ "accuracy": 0.9141653501843076,
179
+ "macro_f1": 0.9006247702483327,
180
+ "num_eval": 9495
181
+ },
182
+ {
183
+ "epoch": 20,
184
+ "train_loss": 0.26235123887701456,
185
+ "lr_end": 0.0006875174079405514,
186
+ "elapsed_seconds": 67.73433208465576,
187
+ "accuracy": 0.9152185360716166,
188
+ "macro_f1": 0.8962481088006313,
189
+ "num_eval": 9495
190
+ },
191
+ {
192
+ "epoch": 21,
193
+ "train_loss": 0.25165350653420004,
194
+ "lr_end": 0.0005547858693331366,
195
+ "elapsed_seconds": 67.7448239326477,
196
+ "accuracy": 0.9149025803054239,
197
+ "macro_f1": 0.9016399778230075,
198
+ "num_eval": 9495
199
+ },
200
+ {
201
+ "epoch": 22,
202
+ "train_loss": 0.23295196383896977,
203
+ "lr_end": 0.00043358793005475964,
204
+ "elapsed_seconds": 67.78508257865906,
205
+ "accuracy": 0.9162717219589257,
206
+ "macro_f1": 0.902575877603001,
207
+ "num_eval": 9495
208
+ },
209
+ {
210
+ "epoch": 23,
211
+ "train_loss": 0.22248560638001763,
212
+ "lr_end": 0.0003253946779644913,
213
+ "elapsed_seconds": 67.751882314682,
214
+ "accuracy": 0.9191153238546603,
215
+ "macro_f1": 0.904615785832685,
216
+ "num_eval": 9495
217
+ },
218
+ {
219
+ "epoch": 24,
220
+ "train_loss": 0.21600844006860187,
221
+ "lr_end": 0.00023151935139403203,
222
+ "elapsed_seconds": 67.75333857536316,
223
+ "accuracy": 0.9215376513954713,
224
+ "macro_f1": 0.9077659376985585,
225
+ "num_eval": 9495
226
+ },
227
+ {
228
+ "epoch": 25,
229
+ "train_loss": 0.20916091654575555,
230
+ "lr_end": 0.0001531013991987532,
231
+ "elapsed_seconds": 68.03508472442627,
232
+ "accuracy": 0.921853607161664,
233
+ "macro_f1": 0.90760546460601,
234
+ "num_eval": 9495
235
+ },
236
+ {
237
+ "epoch": 26,
238
+ "train_loss": 0.1965290139273287,
239
+ "lr_end": 9.109265024715332e-05,
240
+ "elapsed_seconds": 67.76995801925659,
241
+ "accuracy": 0.9228014744602422,
242
+ "macro_f1": 0.9085411885232777,
243
+ "num_eval": 9495
244
+ },
245
+ {
246
+ "epoch": 27,
247
+ "train_loss": 0.19573741489643806,
248
+ "lr_end": 4.6245760222010575e-05,
249
+ "elapsed_seconds": 67.75984287261963,
250
+ "accuracy": 0.921853607161664,
251
+ "macro_f1": 0.9077462312429695,
252
+ "num_eval": 9495
253
+ },
254
+ {
255
+ "epoch": 28,
256
+ "train_loss": 0.19350992440770687,
257
+ "lr_end": 1.910507596474794e-05,
258
+ "elapsed_seconds": 67.75555872917175,
259
+ "accuracy": 0.921853607161664,
260
+ "macro_f1": 0.9073942641770526,
261
+ "num_eval": 9495
262
+ },
263
+ {
264
+ "epoch": 29,
265
+ "train_loss": 0.19182746196034614,
266
+ "lr_end": 1.0000028250635854e-05,
267
+ "elapsed_seconds": 67.74632787704468,
268
+ "accuracy": 0.9216429699842023,
269
+ "macro_f1": 0.9066374838199934,
270
+ "num_eval": 9495
271
+ }
272
+ ]
training_metadata.json ADDED
@@ -0,0 +1,257 @@
1
+ {
2
+ "autocast_dtype": "bf16",
3
+ "config": {
4
+ "conv_kernel": 15,
5
+ "d_model": 256,
6
+ "n_attn": 1,
7
+ "n_conv": 3,
8
+ "n_heads": 4,
9
+ "ngram_buckets": 4096,
10
+ "ngram_dim": 64
11
+ },
12
+ "config_name": "base_ngram",
13
+ "device": "cuda",
14
+ "early_stopping_patience": 4,
15
+ "early_stopping_threshold": 0.0,
16
+ "eval_batch_size": 256,
17
+ "eval_rows": 9495,
18
+ "id2label": {
19
+ "0": "Julia",
20
+ "1": "Go",
21
+ "2": "Raku",
22
+ "3": "Pascal",
23
+ "4": "Perl",
24
+ "5": "Python",
25
+ "6": "Nim",
26
+ "7": "J",
27
+ "8": "C",
28
+ "9": "Haskell",
29
+ "10": "Mathematica/Wolfram Language",
30
+ "11": "Kotlin",
31
+ "12": "Java",
32
+ "13": "Ruby",
33
+ "14": "Racket",
34
+ "15": "C++",
35
+ "16": "Tcl",
36
+ "17": "C#",
37
+ "18": "Scala",
38
+ "19": "Ada",
39
+ "20": "Factor",
40
+ "21": "Rust",
41
+ "22": "Lua",
42
+ "23": "D",
43
+ "24": "Common Lisp",
44
+ "25": "AutoHotkey",
45
+ "26": "JavaScript",
46
+ "27": "F#",
47
+ "28": "Wren",
48
+ "29": "Clojure",
49
+ "30": "Fortran",
50
+ "31": "PureBasic",
51
+ "32": "OCaml",
52
+ "33": "Awk",
53
+ "34": "R",
54
+ "35": "Swift",
55
+ "36": "Forth",
56
+ "37": "Elixir",
57
+ "38": "Erlang",
58
+ "39": "PowerShell",
59
+ "40": "Groovy",
60
+ "41": "REXX",
61
+ "42": "Visual Basic .NET",
62
+ "43": "PHP",
63
+ "44": "Prolog",
64
+ "45": "Scheme",
65
+ "46": "AppleScript",
66
+ "47": "COBOL",
67
+ "48": "FreeBASIC",
68
+ "49": "Smalltalk",
69
+ "50": "PicoLisp",
70
+ "51": "MATLAB",
71
+ "52": "Oz",
72
+ "53": "Standard ML",
73
+ "54": "Lasso",
74
+ "55": "Crystal",
75
+ "56": "APL",
76
+ "57": "Stata",
77
+ "58": "CoffeeScript",
78
+ "59": "Batchfile",
79
+ "60": "jq",
80
+ "61": "GAP",
81
+ "62": "Red",
82
+ "63": "Emacs Lisp",
83
+ "64": "Eiffel",
84
+ "65": "Fantom",
85
+ "66": "Ring",
86
+ "67": "Pike",
87
+ "68": "Rebol",
88
+ "69": "ActionScript",
89
+ "70": "MAXScript",
90
+ "71": "Vala",
91
+ "72": "Dart",
92
+ "73": "Processing",
93
+ "74": "Ceylon",
94
+ "75": "AutoIt",
95
+ "76": "Io",
96
+ "77": "M4",
97
+ "78": "Zig",
98
+ "79": "LFE",
99
+ "80": "ATS",
100
+ "81": "Logtalk",
101
+ "82": "ABAP",
102
+ "83": "Scilab",
103
+ "84": "IDL",
104
+ "85": "LabVIEW",
105
+ "86": "ColdFusion",
106
+ "87": "SAS",
107
+ "88": "Haxe",
108
+ "89": "VBA",
109
+ "90": "BASIC",
110
+ "91": "V",
111
+ "92": "VBScript",
112
+ "93": "E",
113
+ "94": "Euphoria",
114
+ "95": "ARM Assembly",
115
+ "96": "Component Pascal",
116
+ "97": "M",
117
+ "98": "Modula-2",
118
+ "99": "BQN",
119
+ "100": "Befunge",
120
+ "101": "Objective-C",
121
+ "102": "NewLisp",
122
+ "103": "Nemerle",
123
+ "104": "Modula-3",
124
+ "105": "Mercury",
125
+ "106": "QuickBASIC"
126
+ },
127
+ "label2id": {
128
+ "ABAP": 82,
129
+ "APL": 56,
130
+ "ARM Assembly": 95,
131
+ "ATS": 80,
132
+ "ActionScript": 69,
133
+ "Ada": 19,
134
+ "AppleScript": 46,
135
+ "AutoHotkey": 25,
136
+ "AutoIt": 75,
137
+ "Awk": 33,
138
+ "BASIC": 90,
139
+ "BQN": 99,
140
+ "Batchfile": 59,
141
+ "Befunge": 100,
142
+ "C": 8,
143
+ "C#": 17,
144
+ "C++": 15,
145
+ "COBOL": 47,
146
+ "Ceylon": 74,
147
+ "Clojure": 29,
148
+ "CoffeeScript": 58,
149
+ "ColdFusion": 86,
150
+ "Common Lisp": 24,
151
+ "Component Pascal": 96,
152
+ "Crystal": 55,
153
+ "D": 23,
154
+ "Dart": 72,
155
+ "E": 93,
156
+ "Eiffel": 64,
157
+ "Elixir": 37,
158
+ "Emacs Lisp": 63,
159
+ "Erlang": 38,
160
+ "Euphoria": 94,
161
+ "F#": 27,
162
+ "Factor": 20,
163
+ "Fantom": 65,
164
+ "Forth": 36,
165
+ "Fortran": 30,
166
+ "FreeBASIC": 48,
167
+ "GAP": 61,
168
+ "Go": 1,
169
+ "Groovy": 40,
170
+ "Haskell": 9,
171
+ "Haxe": 88,
172
+ "IDL": 84,
173
+ "Io": 76,
174
+ "J": 7,
175
+ "Java": 12,
176
+ "JavaScript": 26,
177
+ "Julia": 0,
178
+ "Kotlin": 11,
179
+ "LFE": 79,
180
+ "LabVIEW": 85,
181
+ "Lasso": 54,
182
+ "Logtalk": 81,
183
+ "Lua": 22,
184
+ "M": 97,
185
+ "M4": 77,
186
+ "MATLAB": 51,
187
+ "MAXScript": 70,
188
+ "Mathematica/Wolfram Language": 10,
189
+ "Mercury": 105,
190
+ "Modula-2": 98,
191
+ "Modula-3": 104,
192
+ "Nemerle": 103,
193
+ "NewLisp": 102,
194
+ "Nim": 6,
195
+ "OCaml": 32,
196
+ "Objective-C": 101,
197
+ "Oz": 52,
198
+ "PHP": 43,
199
+ "Pascal": 3,
200
+ "Perl": 4,
201
+ "PicoLisp": 50,
202
+ "Pike": 67,
203
+ "PowerShell": 39,
204
+ "Processing": 73,
205
+ "Prolog": 44,
206
+ "PureBasic": 31,
207
+ "Python": 5,
208
+ "QuickBASIC": 106,
209
+ "R": 34,
210
+ "REXX": 41,
211
+ "Racket": 14,
212
+ "Raku": 2,
213
+ "Rebol": 68,
214
+ "Red": 62,
215
+ "Ring": 66,
216
+ "Ruby": 13,
217
+ "Rust": 21,
218
+ "SAS": 87,
219
+ "Scala": 18,
220
+ "Scheme": 45,
221
+ "Scilab": 83,
222
+ "Smalltalk": 49,
223
+ "Standard ML": 53,
224
+ "Stata": 57,
225
+ "Swift": 35,
226
+ "Tcl": 16,
227
+ "V": 91,
228
+ "VBA": 89,
229
+ "VBScript": 92,
230
+ "Vala": 71,
231
+ "Visual Basic .NET": 42,
232
+ "Wren": 28,
233
+ "Zig": 78,
234
+ "jq": 60
235
+ },
236
+ "learning_rate": 0.003,
237
+ "max_len": 1023,
238
+ "min_learning_rate": 1e-05,
239
+ "model_arch": "ByteHybrid",
240
+ "n_params": 2289515,
241
+ "num_classes": 107,
242
+ "num_train_epochs": 30,
243
+ "snippet_config": {
244
+ "eval_strategy": "head",
245
+ "max_chars": 1023,
246
+ "min_chars": 64,
247
+ "seed": 20260420,
248
+ "short_chars": 128,
249
+ "train_strategies": [
250
+ "variable_window"
251
+ ]
252
+ },
253
+ "train_batch_size": 128,
254
+ "train_rows": 72549,
255
+ "warmup_ratio": 0.05,
256
+ "weight_decay": 0.01
257
+ }
training_summary.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "best_epoch": 26,
3
+ "best_macro_f1": 0.9085411885232777,
4
+ "epochs_run": 30,
5
+ "history_path": "training_history.json"
6
+ }