Promote run8 EN-only to main; remove bilingual run7 artifacts
- README.md +46 -111
- checkpoint_full.pt +0 -3
- {keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Data/com.apple.CoreML/model.mlmodel +2 -2
- {keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Data/com.apple.CoreML/weights/weight.bin +1 -1
- {keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Manifest.json +8 -8
- model.pt +0 -3
- tokenizer_en.json +0 -0
README.md
CHANGED
|
@@ -1,7 +1,6 @@
|
|
| 1 |
---
|
| 2 |
language:
|
| 3 |
- en
|
| 4 |
-
- fr
|
| 5 |
license: apache-2.0
|
| 6 |
tags:
|
| 7 |
- keyboard
|
|
@@ -9,127 +8,63 @@ tags:
|
|
| 9 |
- mobile
|
| 10 |
- ios
|
| 11 |
- coreml
|
| 12 |
-
-
|
| 13 |
library_name: pytorch
|
| 14 |
pipeline_tag: text-generation
|
| 15 |
---
|
| 16 |
|
| 17 |
-
# Onit Keyboard LM
|
| 18 |
-
|
| 19 |
-
A
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
|
| 28 |
-
|
|
| 29 |
-
|
|
| 30 |
-
|
|
| 31 |
-
|
|
| 32 |
-
|
|
| 33 |
-
|
|
| 34 |
-
|
|
| 35 |
-
|
|
| 36 |
-
|
|
| 37 |
-
|
|
| 38 |
-
|
|
| 39 |
-
| Embeddings | Tied (input = output) |
|
| 40 |
-
|
| 41 |
-
### Key Design Choices
|
| 42 |
-
|
| 43 |
-
- **SwiGLU FFN** for better parameter efficiency at small scale
|
| 44 |
-
- **QK-Norm** for stable training without careful LR tuning
|
| 45 |
-
- **RoPE** for length generalization
|
| 46 |
-
- **Tied embeddings** to reduce parameter count (critical for mobile)
|
| 47 |
-
- **BPE tokenizer** (16K vocab) trained on the bilingual data mix
|
| 48 |
|
| 49 |
## Training
|
| 50 |
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
| Source | Language | Share |
|
| 56 |
-
|--------|----------|-------|
|
| 57 |
-
| OpenSubtitles | FR + EN | ~40% |
|
| 58 |
-
| Wikipedia | FR + EN | ~30% |
|
| 59 |
-
| C4 (web) | FR + EN | ~30% |
|
| 60 |
-
|
| 61 |
-
Total: ~13.6M sentences, ~2.7 GB of clean text.
|
| 62 |
-
|
| 63 |
-
### Hyperparameters
|
| 64 |
-
|
| 65 |
-
| Parameter | Value |
|
| 66 |
-
|-----------|-------|
|
| 67 |
-
| Training steps | 30,000 |
|
| 68 |
-
| Effective batch size | 64 (32 x 2 grad accum) |
|
| 69 |
-
| Learning rate | 6e-5 (cosine decay) |
|
| 70 |
-
| Warmup steps | 1,000 |
|
| 71 |
-
| Precision | bf16 mixed |
|
| 72 |
-
| Optimizer | AdamW |
|
| 73 |
-
|
| 74 |
-
### Results
|
| 75 |
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
## Usage
|
| 83 |
-
|
| 84 |
-
### PyTorch
|
| 85 |
-
|
| 86 |
-
```python
|
| 87 |
-
import torch
|
| 88 |
-
from tokenizers import Tokenizer
|
| 89 |
-
from keyboard_lm.model import JointUniLM, ModelConfig
|
| 90 |
-
|
| 91 |
-
# Load
|
| 92 |
-
ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
|
| 93 |
-
config = ModelConfig(**ckpt["model_config"])
|
| 94 |
-
model = JointUniLM(config)
|
| 95 |
-
model.load_state_dict(ckpt["model_state_dict"])
|
| 96 |
-
model.eval()
|
| 97 |
-
|
| 98 |
-
tokenizer = Tokenizer.from_file("tokenizer.json")
|
| 99 |
-
|
| 100 |
-
# Predict next token
|
| 101 |
-
prompt = "I'm going to the"
|
| 102 |
-
ids = [config.bos_token_id] + tokenizer.encode(prompt).ids
|
| 103 |
-
input_ids = torch.tensor([ids])
|
| 104 |
-
|
| 105 |
-
with torch.no_grad():
|
| 106 |
-
logits, _ = model(input_ids)
|
| 107 |
-
probs = torch.softmax(logits[0, -1], dim=-1)
|
| 108 |
-
top5 = torch.topk(probs, 5)
|
| 109 |
-
|
| 110 |
-
for prob, idx in zip(top5.values, top5.indices):
|
| 111 |
-
print(f" {tokenizer.decode([idx.item()]):>10} ({prob:.1%})")
|
| 112 |
-
```
|
| 113 |
-
|
| 114 |
-
### CoreML (iOS)
|
| 115 |
-
|
| 116 |
-
See `scripts/export_coreml.py` in the [GitHub repo](https://github.com/synth-inc/onit-keyboard-lm) for CoreML conversion.
|
| 117 |
|
| 118 |
## Files
|
| 119 |
|
| 120 |
-
| File
|
| 121 |
-
|------|-------------|
|
| 122 |
-
| `
|
| 123 |
-
| `
|
| 124 |
-
| `config.json`
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
-
|
| 133 |
|
| 134 |
## License
|
| 135 |
|
|
|
|
| 1 |
---
|
| 2 |
language:
|
| 3 |
- en
|
|
|
|
| 4 |
license: apache-2.0
|
| 5 |
tags:
|
| 6 |
- keyboard
|
|
|
|
| 8 |
- mobile
|
| 9 |
- ios
|
| 10 |
- coreml
|
| 11 |
+
- english
|
| 12 |
library_name: pytorch
|
| 13 |
pipeline_tag: text-generation
|
| 14 |
---
|
| 15 |
|
| 16 |
+
# Onit Keyboard LM (EN-only, run8)
|
| 17 |
+
|
| 18 |
+
A 40M parameter English-only language model designed for next-word prediction
|
| 19 |
+
in the Onit iOS keyboard. Replaces the previous bilingual run7 model that
|
| 20 |
+
surfaced French tokens (`même`, `soit`, `des`, `présent`) in EN keyboard
|
| 21 |
+
contexts.
|
| 22 |
+
|
| 23 |
+
## Architecture
|
| 24 |
+
|
| 25 |
+
| Component | Value |
|
| 26 |
+
|-----------------------|-----------------------------|
|
| 27 |
+
| Type | Causal LM (decoder-only) |
|
| 28 |
+
| Parameters | ~40M |
|
| 29 |
+
| Vocabulary | 16,384 BPE tokens (EN-only) |
|
| 30 |
+
| Embedding dim | 512 |
|
| 31 |
+
| Layers | 10 |
|
| 32 |
+
| Attention heads | 8 |
|
| 33 |
+
| FFN dim | 1408 (SwiGLU) |
|
| 34 |
+
| Max sequence length | 256 |
|
| 35 |
+
| Positional encoding | RoPE |
|
| 36 |
+
| Normalization | RMSNorm + QK-Norm |
|
| 37 |
+
| Embeddings | Tied (input = output) |
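As a sanity check on the "~40M" figure, the architecture table can be turned into a rough parameter count. This is only a sketch under assumptions that the README does not state: no bias terms, four `d_model x d_model` attention projections (Q, K, V, O), three SwiGLU matrices per layer, and norm weights ignored as negligible. The function name `estimate_params` is hypothetical.

```python
# Rough parameter count derived from the architecture table.
# Assumptions (not confirmed by the source): bias-free linear layers,
# Q/K/V/O projections of size d_model x d_model, three SwiGLU matrices,
# and RMSNorm / QK-Norm weights left out because they are tiny.

VOCAB = 16_384
D_MODEL = 512
N_LAYERS = 10
D_FFN = 1_408


def estimate_params(vocab=VOCAB, d_model=D_MODEL, n_layers=N_LAYERS, d_ffn=D_FFN):
    embedding = vocab * d_model          # tied, so counted once for input and output
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    ffn = 3 * d_model * d_ffn            # gate, up and down projections of SwiGLU
    return embedding + n_layers * (attention + ffn)


total = estimate_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
print(f"{2 * total:,} bytes at 2 bytes per fp16 weight")
```

Under these assumptions the estimate is 40,501,248 parameters, or about 81.0 MB at fp16, which is close to the 81,119,936-byte `weight.bin` listed in this commit.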
|
| 38 |
|
| 39 |
## Training
|
| 40 |
|
| 41 |
+
EN-only corpus (43M lines, ~445M tokens):
|
| 42 |
+
- `clean_en` (Tim's curated corpus)
|
| 43 |
+
- `opensubtitles_en` (filtered for mislabeled French; deduplicated)
|
| 44 |
|
| 45 |
+
Training run (run8):
|
| 46 |
+
- 30,000 steps, lr 6e-5 cosine schedule, warmup 1000, effective batch 64
|
| 47 |
+
- Validation PPL: **38.08** on the held-out EN val split
|
| 48 |
+
- Test PPL: **37.88** on the held-out EN test split (no contamination)
|
| 49 |
+
- 100 % argmax parity between PyTorch and the exported CoreML model
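The learning-rate settings listed for run8 (peak 6e-5, 1,000 warmup steps, 30,000 total steps, cosine schedule) can be sketched as a plain function of the step index. The linear warmup shape and the decay floor of zero are assumptions, since the README does not specify them; `lr_at` is a hypothetical helper, not the training script's code.

```python
import math

PEAK_LR = 6e-5
WARMUP_STEPS = 1_000
TOTAL_STEPS = 30_000


def lr_at(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup:
        # Assumed linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 999, 1_000, 15_500, 30_000):
    print(step, f"{lr_at(step):.3e}")
```

If the actual run decayed to a non-zero floor, passing a different `min_lr` would model that.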
|
| 50 |
|
| 51 |
## Files
|
| 52 |
|
| 53 |
+
| File | Description |
|
| 54 |
+
|---------------------------------------|------------------------------------------------|
|
| 55 |
+
| `keyboard_lm_seq128_fp16.mlpackage` | CoreML mlprogram fp16, seq_len=128 (iOS) |
|
| 56 |
+
| `tokenizer_en.json`                   | BPE 16K tokenizer trained on the EN-only corpus |
|
| 57 |
+
| `config.json` | Model configuration |
|
| 58 |
+
|
| 59 |
+
## iOS usage notes
|
| 60 |
+
|
| 61 |
+
- **Strip trailing whitespace from prompts before tokenization.** The model
|
| 62 |
+
was trained on clean sentences and produces noisy subword fragments on
|
| 63 |
+
inputs like `"Hey guys "` (with trailing space). Use
|
| 64 |
+
`prompt.trimmingCharacters(in: .whitespacesAndNewlines)` before encoding.
|
| 65 |
+
- 100 % argmax agreement between PyTorch and the exported CoreML model on
|
| 66 |
+
validation prompts. Top-1 predictions on iOS match the PyTorch reference;
|
| 67 |
+
outputs differ only by fp16 rounding noise (max abs diff ≈ 0.011).
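The two notes above can be mirrored on the Python side when building a reference for the iOS app. The sketch below is illustrative only: `normalize_prompt` and `compare_outputs` are hypothetical helpers, and the numbers in the demo are made up, not model outputs. `str.strip()` removes leading and trailing whitespace, which matches the Swift `trimmingCharacters(in: .whitespacesAndNewlines)` call.

```python
def normalize_prompt(prompt: str) -> str:
    """Trim surrounding whitespace before the prompt is tokenized."""
    return prompt.strip()


def argmax(row):
    """Index of the largest value in a list of floats."""
    return max(range(len(row)), key=row.__getitem__)


def compare_outputs(reference, exported):
    """Compare per-prompt output rows from two backends.

    Returns (argmax agreement rate, max absolute elementwise difference).
    """
    pairs = list(zip(reference, exported))
    matches = sum(argmax(r) == argmax(e) for r, e in pairs)
    max_abs_diff = max(abs(a - b) for r, e in pairs for a, b in zip(r, e))
    return matches / len(pairs), max_abs_diff


# Toy values standing in for PyTorch and CoreML outputs on two prompts.
pytorch_rows = [[0.10, 2.00, 0.30], [1.50, 0.20, 0.10]]
coreml_rows = [[0.11, 1.99, 0.30], [1.50, 0.21, 0.10]]

print(repr(normalize_prompt("Hey guys ")))
print(compare_outputs(pytorch_rows, coreml_rows))
```

An agreement rate of 1.0 with a small maximum difference is the pattern the note describes: the top prediction is unchanged even though individual values shift slightly at fp16.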
|
| 68 |
|
| 69 |
## License
|
| 70 |
|
checkpoint_full.pt
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:1815019697c1ba8ac9c770097bd0234d9ead3a92c8aa74f40c567aef220eab7c
|
| 3 |
-
size 486296722
|
{keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Data/com.apple.CoreML/model.mlmodel
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:46d2b3346546533188d9f97e78ff9c29779b6e53483dc4aefe8b664106ca0e4f
|
| 3 |
+
size 209808
|
{keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Data/com.apple.CoreML/weights/weight.bin
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 81119936
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b8da8eb7e6e8fb4dd2ef16ca34c557cd739d87ec9d451fd3b7930f2e1c58fdd1
|
| 3 |
size 81119936
|
{keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Manifest.json
RENAMED
|
@@ -1,18 +1,18 @@
|
|
| 1 |
{
|
| 2 |
"fileFormatVersion": "1.0.0",
|
| 3 |
"itemInfoEntries": {
|
| 4 |
-
"
|
| 5 |
-
"author": "com.apple.CoreML",
|
| 6 |
-
"description": "CoreML Model Weights",
|
| 7 |
-
"name": "weights",
|
| 8 |
-
"path": "com.apple.CoreML/weights"
|
| 9 |
-
},
|
| 10 |
-
"F99D3F93-9202-4625-8B4E-B13FCDE915D3": {
|
| 11 |
"author": "com.apple.CoreML",
|
| 12 |
"description": "CoreML Model Specification",
|
| 13 |
"name": "model.mlmodel",
|
| 14 |
"path": "com.apple.CoreML/model.mlmodel"
|
| 15 |
}
|
| 16 |
},
|
| 17 |
-
"rootModelIdentifier": "
|
| 18 |
}
|
|
|
|
| 1 |
{
|
| 2 |
"fileFormatVersion": "1.0.0",
|
| 3 |
"itemInfoEntries": {
|
| 4 |
+
"5049EDD7-0417-455F-8CB7-9D7D525CCF43": {
|
| 5 |
"author": "com.apple.CoreML",
|
| 6 |
"description": "CoreML Model Specification",
|
| 7 |
"name": "model.mlmodel",
|
| 8 |
"path": "com.apple.CoreML/model.mlmodel"
|
| 9 |
+
},
|
| 10 |
+
"EC107ED8-DED7-459A-A9DE-998EF22B3FB0": {
|
| 11 |
+
"author": "com.apple.CoreML",
|
| 12 |
+
"description": "CoreML Model Weights",
|
| 13 |
+
"name": "weights",
|
| 14 |
+
"path": "com.apple.CoreML/weights"
|
| 15 |
}
|
| 16 |
},
|
| 17 |
+
"rootModelIdentifier": "5049EDD7-0417-455F-8CB7-9D7D525CCF43"
|
| 18 |
}
|
model.pt
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:d8648eaa471f0baf1229546dfd29bb4b5532b9b37a1338c82bb50ce6be074049
|
| 3 |
-
size 162087274
|
tokenizer_en.json
CHANGED
|
The diff for this file is too large to render.