niduank committed on
Commit e50280a · verified · 1 Parent(s): f84d944

Promote run8 EN-only to main; remove bilingual run7 artifacts

README.md CHANGED
@@ -1,7 +1,6 @@
  ---
  language:
  - en
- - fr
  license: apache-2.0
  tags:
  - keyboard
@@ -9,127 +8,63 @@ tags:
  - mobile
  - ios
  - coreml
- - bilingual
  library_name: pytorch
  pipeline_tag: text-generation
  ---
 
- # Onit Keyboard LM
-
- A **41M parameter** bilingual (English + French) language model designed for **mobile keyboard prediction** on iOS.
-
- ## Model Description
-
- Onit Keyboard LM is a compact causal language model optimized for next-word prediction in a mobile keyboard context. It supports both English and French, including code-switching between the two languages.
-
- ### Architecture
-
- | Component | Value |
- |-----------|-------|
- | Type | Causal LM (decoder-only) |
- | Parameters | ~41M |
- | Vocabulary | 16,384 BPE tokens |
- | Embedding dim | 512 |
- | Layers | 10 |
- | Attention heads | 8 |
- | FFN dim | 1408 (SwiGLU) |
- | Max sequence length | 256 |
- | Positional encoding | RoPE |
- | Normalization | RMSNorm + QK-Norm |
- | Embeddings | Tied (input = output) |
-
- ### Key Design Choices
-
- - **SwiGLU FFN** for better parameter efficiency at small scale
- - **QK-Norm** for stable training without careful LR tuning
- - **RoPE** for length generalization
- - **Tied embeddings** to reduce parameter count (critical for mobile)
- - **BPE tokenizer** (16K vocab) trained on the bilingual data mix
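
The run7 design-choice list above names SwiGLU and QK-Norm without showing them. The following is a minimal, illustrative PyTorch sketch of both (dimensions taken from the architecture table above), not the actual `keyboard_lm` implementation.

```python
# Illustrative only — not the keyboard_lm repo's code. Dimensions follow the
# architecture table above (d_model=512, d_ff=1408).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU FFN: down(silu(gate(x)) * up(x)); roughly 3*d_model*d_ff weights."""
    def __init__(self, d_model: int = 512, d_ff: int = 1408):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    """QK-Norm: RMS-normalize queries and keys per head before the dot
    product, which bounds attention logits and stabilizes training."""
    q = q * torch.rsqrt(q.pow(2).mean(dim=-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(dim=-1, keepdim=True) + eps)
    return q, k
```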
 
  ## Training

- ### Dataset (Phase 2)
-
- The model was trained on a diverse bilingual mix:
-
- | Source | Language | Share |
- |--------|----------|-------|
- | OpenSubtitles | FR + EN | ~40% |
- | Wikipedia | FR + EN | ~30% |
- | C4 (web) | FR + EN | ~30% |
-
- Total: ~13.6M sentences, ~2.7 GB of clean text.
-
- ### Hyperparameters
-
- | Parameter | Value |
- |-----------|-------|
- | Training steps | 30,000 |
- | Effective batch size | 64 (32 x 2 grad accum) |
- | Learning rate | 6e-5 (cosine decay) |
- | Warmup steps | 1,000 |
- | Precision | bf16 mixed |
- | Optimizer | AdamW |
-
- ### Results

- | Metric | Value |
- |--------|-------|
- | Training loss (final) | 2.01 |
- | Validation PPL | 58.8 |
- | Tokens seen | 491M |
-
- ## Usage
-
- ### PyTorch
-
- ```python
- import torch
- from tokenizers import Tokenizer
- from keyboard_lm.model import JointUniLM, ModelConfig
-
- # Load
- ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
- config = ModelConfig(**ckpt["model_config"])
- model = JointUniLM(config)
- model.load_state_dict(ckpt["model_state_dict"])
- model.eval()
-
- tokenizer = Tokenizer.from_file("tokenizer.json")
-
- # Predict next token
- prompt = "I'm going to the"
- ids = [config.bos_token_id] + tokenizer.encode(prompt).ids
- input_ids = torch.tensor([ids])
-
- with torch.no_grad():
-     logits, _ = model(input_ids)
- probs = torch.softmax(logits[0, -1], dim=-1)
- top5 = torch.topk(probs, 5)
-
- for prob, idx in zip(top5.values, top5.indices):
-     print(f" {tokenizer.decode([idx.item()]):>10} ({prob:.1%})")
- ```
-
- ### CoreML (iOS)
-
- See `scripts/export_coreml.py` in the [GitHub repo](https://github.com/synth-inc/onit-keyboard-lm) for CoreML conversion.
 
  ## Files

- | File | Description |
- |------|-------------|
- | `model.pt` | Model weights + config (no optimizer) |
- | `checkpoint_full.pt` | Full training checkpoint (with optimizer, for resume) |
- | `config.json` | Model configuration |
- | `tokenizer.json` | BPE tokenizer (v2, trained on Phase 2 mix) |
-
- ## Limitations
-
- - Optimized for short text (keyboard input), not long-form generation
- - May produce grammatical errors in French (e.g., double negatives)
- - 256-token context window limits long-range coherence
- - Not suitable for factual Q&A or instruction following

  ## License

  ---
  language:
  - en
  license: apache-2.0
  tags:
  - keyboard
  - mobile
  - ios
  - coreml
+ - english
  library_name: pytorch
  pipeline_tag: text-generation
  ---
 
+ # Onit Keyboard LM (EN-only, run8)
+
+ A 40M-parameter English-only language model designed for next-word prediction
+ in the Onit iOS keyboard. It replaces the previous bilingual run7 model, which
+ surfaced French tokens (`même`, `soit`, `des`, `présent`) in English keyboard
+ contexts.
+
+ ## Architecture
+
+ | Component | Value |
+ |-----------------------|-----------------------------|
+ | Type | Causal LM (decoder-only) |
+ | Parameters | ~40M |
+ | Vocabulary | 16,384 BPE tokens (EN-only) |
+ | Embedding dim | 512 |
+ | Layers | 10 |
+ | Attention heads | 8 |
+ | FFN dim | 1408 (SwiGLU) |
+ | Max sequence length | 256 |
+ | Positional encoding | RoPE |
+ | Normalization | RMSNorm + QK-Norm |
+ | Embeddings | Tied (input = output) |
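
As a quick consistency check, the table above implies roughly 40M weights. A rough estimate, ignoring norm parameters and assuming no biases (the exact figure depends on the implementation):

```python
# Rough parameter estimate from the table above; norm weights and any biases
# are ignored, so treat the result as an approximation.
vocab, d_model, n_layers, d_ff = 16_384, 512, 10, 1_408

embed = vocab * d_model                 # tied input/output embedding
attn = 4 * d_model * d_model            # Wq, Wk, Wv, Wo per layer
ffn = 3 * d_model * d_ff                # SwiGLU gate, up, down per layer
total = embed + n_layers * (attn + ffn)
print(f"{total / 1e6:.1f}M")            # ~40.5M, consistent with "~40M"
```
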
  ## Training

+ EN-only corpus (43M lines, ~445M tokens):
+ - `clean_en` (Tim's curated corpus)
+ - `opensubtitles_en` (filtered to drop mislabeled French lines; deduplicated)

+ Training run (run8):
+ - 30,000 steps, lr 6e-5 with cosine decay, 1,000 warmup steps, effective batch size 64
+ - Validation PPL: **38.08** on the held-out EN val split
+ - Test PPL: **37.88** on the held-out EN test split (no contamination)
+ - 100% argmax parity between PyTorch and the exported CoreML model
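
The PPL figures above are the exponential of the mean per-token cross-entropy on the held-out splits. A hedged sketch of that computation, assuming a `model` that returns `(logits, _)` as in the earlier PyTorch usage example and an iterable `val_loader` of token-id batches (both names are illustrative, not the repo's evaluation harness):

```python
# Sketch of a held-out perplexity evaluation; `model` and `val_loader` are
# assumed to exist and do not refer to the repo's actual code.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, val_loader, device="cpu"):
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for input_ids in val_loader:                      # [batch, seq_len] int64
        input_ids = input_ids.to(device)
        logits, _ = model(input_ids)                  # [batch, seq_len, vocab]
        nll = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),             # next-token targets
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += input_ids[:, 1:].numel()
    return math.exp(total_nll / total_tokens)         # e.g. ~38 on the val split
```
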

  ## Files

+ | File | Description |
+ |---------------------------------------|-------------------------------------------------|
+ | `keyboard_lm_seq128_fp16.mlpackage` | CoreML mlprogram fp16, seq_len=128 (iOS) |
+ | `tokenizer_en.json` | BPE 16K tokenizer trained on the EN-only corpus |
+ | `config.json` | Model configuration |
+
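
For a quick sanity check of the exported package from Python (macOS, coremltools), something like the sketch below should work. The feature names `input_ids` and `logits`, the zero-padding to the fixed export length, and whether a BOS id must be prepended are assumptions; inspect `mlmodel.get_spec()` for the real interface.

```python
# Hedged sketch: load the shipped .mlpackage with coremltools and ask for the
# top-5 next-word candidates. Feature names and padding scheme are assumptions.
import numpy as np
import coremltools as ct
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer_en.json")
mlmodel = ct.models.MLModel("keyboard_lm_seq128_fp16.mlpackage")

prompt = "I'm going to the"
ids = tokenizer.encode(prompt).ids           # a BOS id may need to be prepended
seq_len = 128                                # fixed length of the export
input_ids = np.zeros((1, seq_len), dtype=np.int32)
input_ids[0, : len(ids)] = ids

out = mlmodel.predict({"input_ids": input_ids})    # assumed input name
logits = out["logits"][0, len(ids) - 1]            # assumed output name/shape
top5 = np.argsort(logits)[::-1][:5]
print([tokenizer.decode([int(i)]) for i in top5])
```
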
+ ## iOS usage notes
+
+ - **Strip trailing whitespace from prompts before tokenization.** The model
+   was trained on clean sentences and produces noisy subword fragments on
+   inputs like `"Hey guys "` (with a trailing space). Use
+   `prompt.trimmingCharacters(in: .whitespacesAndNewlines)` before encoding.
+ - 100% argmax agreement between PyTorch and the exported CoreML model on
+   validation prompts. Predictions on iOS match the PyTorch reference
+   token-for-token; raw logits differ only by fp16 quantization noise
+   (max abs diff ≈ 0.011).
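
To see the effect the first note describes from Python, encode the same prompt with and without the trailing space and compare the ids before feeding them to the model. Whether and how the ids differ depends on the tokenizer's pre-tokenizer, so treat this as a check to run rather than a guarantee.

```python
# Compare token ids for a prompt with and without trailing whitespace; the
# trimmed form is what the iOS keyboard should send to the model.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer_en.json")

raw = "Hey guys "          # trailing space, as typed mid-sentence
trimmed = raw.strip()      # Python analogue of trimmingCharacters(in:)

print(tokenizer.encode(raw).ids)
print(tokenizer.encode(trimmed).ids)
```
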

  ## License
 
checkpoint_full.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:1815019697c1ba8ac9c770097bd0234d9ead3a92c8aa74f40c567aef220eab7c
- size 486296722

{keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Data/com.apple.CoreML/model.mlmodel RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:09276d45c04924a72551c43ae21c04bcf5ca638f41d7bc68316e856f35b4e503
- size 207932

  version https://git-lfs.github.com/spec/v1
+ oid sha256:46d2b3346546533188d9f97e78ff9c29779b6e53483dc4aefe8b664106ca0e4f
+ size 209808
{keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Data/com.apple.CoreML/weights/weight.bin RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ac3538eb72a12390b660e8852a26e83f7c8a85a69d2200f9080b327657be9679
  size 81119936

  version https://git-lfs.github.com/spec/v1
+ oid sha256:b8da8eb7e6e8fb4dd2ef16ca34c557cd739d87ec9d451fd3b7930f2e1c58fdd1
  size 81119936
{keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Manifest.json RENAMED
@@ -1,18 +1,18 @@
  {
    "fileFormatVersion": "1.0.0",
    "itemInfoEntries": {
-     "41ADBBC4-04DD-4893-B15F-744D900611C1": {
-       "author": "com.apple.CoreML",
-       "description": "CoreML Model Weights",
-       "name": "weights",
-       "path": "com.apple.CoreML/weights"
-     },
-     "F99D3F93-9202-4625-8B4E-B13FCDE915D3": {
        "author": "com.apple.CoreML",
        "description": "CoreML Model Specification",
        "name": "model.mlmodel",
        "path": "com.apple.CoreML/model.mlmodel"
      }
    },
-   "rootModelIdentifier": "F99D3F93-9202-4625-8B4E-B13FCDE915D3"
  }
 
  {
    "fileFormatVersion": "1.0.0",
    "itemInfoEntries": {
+     "5049EDD7-0417-455F-8CB7-9D7D525CCF43": {
        "author": "com.apple.CoreML",
        "description": "CoreML Model Specification",
        "name": "model.mlmodel",
        "path": "com.apple.CoreML/model.mlmodel"
+     },
+     "EC107ED8-DED7-459A-A9DE-998EF22B3FB0": {
+       "author": "com.apple.CoreML",
+       "description": "CoreML Model Weights",
+       "name": "weights",
+       "path": "com.apple.CoreML/weights"
      }
    },
+   "rootModelIdentifier": "5049EDD7-0417-455F-8CB7-9D7D525CCF43"
  }
model.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:d8648eaa471f0baf1229546dfd29bb4b5532b9b37a1338c82bb50ce6be074049
- size 162087274

tokenizer_en.json CHANGED
The diff for this file is too large to render. See raw diff