niduank committed on
Commit e50280a · verified · 1 Parent(s): f84d944

Promote run8 EN-only to main; remove bilingual run7 artifacts

README.md CHANGED
@@ -1,7 +1,6 @@
  ---
  language:
  - en
- - fr
  license: apache-2.0
  tags:
  - keyboard
@@ -9,127 +8,63 @@ tags:
  - mobile
  - ios
  - coreml
- - bilingual
  library_name: pytorch
  pipeline_tag: text-generation
  ---
 
- # Onit Keyboard LM
-
- A **41M parameter** bilingual (English + French) language model designed for **mobile keyboard prediction** on iOS.
-
- ## Model Description
-
- Onit Keyboard LM is a compact causal language model optimized for next-word prediction in a mobile keyboard context. It supports both English and French, including code-switching between the two languages.
-
- ### Architecture
-
- | Component | Value |
- |-----------|-------|
- | Type | Causal LM (decoder-only) |
- | Parameters | ~41M |
- | Vocabulary | 16,384 BPE tokens |
- | Embedding dim | 512 |
- | Layers | 10 |
- | Attention heads | 8 |
- | FFN dim | 1408 (SwiGLU) |
- | Max sequence length | 256 |
- | Positional encoding | RoPE |
- | Normalization | RMSNorm + QK-Norm |
- | Embeddings | Tied (input = output) |
-
- ### Key Design Choices
-
- - **SwiGLU FFN** for better parameter efficiency at small scale
- - **QK-Norm** for stable training without careful LR tuning
- - **RoPE** for length generalization
- - **Tied embeddings** to reduce parameter count (critical for mobile)
- - **BPE tokenizer** (16K vocab) trained on the bilingual data mix
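
The run7 design-choice list above names SwiGLU and QK-Norm without showing them. The following is a minimal, illustrative PyTorch sketch of both (dimensions taken from the architecture table above), not the actual `keyboard_lm` implementation.

```python
# Illustrative only — not the keyboard_lm repo's code. Dimensions follow the
# architecture table above (d_model=512, d_ff=1408).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU FFN: down(silu(gate(x)) * up(x)); roughly 3*d_model*d_ff weights."""
    def __init__(self, d_model: int = 512, d_ff: int = 1408):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    """QK-Norm: RMS-normalize queries and keys per head before the dot
    product, which bounds attention logits and stabilizes training."""
    q = q * torch.rsqrt(q.pow(2).mean(dim=-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(dim=-1, keepdim=True) + eps)
    return q, k
```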
 
  ## Training

- ### Dataset (Phase 2)
-
- The model was trained on a diverse bilingual mix:
-
- | Source | Language | Share |
- |--------|----------|-------|
- | OpenSubtitles | FR + EN | ~40% |
- | Wikipedia | FR + EN | ~30% |
- | C4 (web) | FR + EN | ~30% |
-
- Total: ~13.6M sentences, ~2.7 GB of clean text.
-
- ### Hyperparameters
-
- | Parameter | Value |
- |-----------|-------|
- | Training steps | 30,000 |
- | Effective batch size | 64 (32 x 2 grad accum) |
- | Learning rate | 6e-5 (cosine decay) |
- | Warmup steps | 1,000 |
- | Precision | bf16 mixed |
- | Optimizer | AdamW |
-
- ### Results

- | Metric | Value |
- |--------|-------|
- | Training loss (final) | 2.01 |
- | Validation PPL | 58.8 |
- | Tokens seen | 491M |
-
- ## Usage
-
- ### PyTorch
-
- ```python
- import torch
- from tokenizers import Tokenizer
- from keyboard_lm.model import JointUniLM, ModelConfig
-
- # Load
- ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
- config = ModelConfig(**ckpt["model_config"])
- model = JointUniLM(config)
- model.load_state_dict(ckpt["model_state_dict"])
- model.eval()
-
- tokenizer = Tokenizer.from_file("tokenizer.json")
-
- # Predict next token
- prompt = "I'm going to the"
- ids = [config.bos_token_id] + tokenizer.encode(prompt).ids
- input_ids = torch.tensor([ids])
-
- with torch.no_grad():
-     logits, _ = model(input_ids)
- probs = torch.softmax(logits[0, -1], dim=-1)
- top5 = torch.topk(probs, 5)
-
- for prob, idx in zip(top5.values, top5.indices):
-     print(f" {tokenizer.decode([idx.item()]):>10} ({prob:.1%})")
- ```
-
- ### CoreML (iOS)
-
- See `scripts/export_coreml.py` in the [GitHub repo](https://github.com/synth-inc/onit-keyboard-lm) for CoreML conversion.
 
  ## Files

- | File | Description |
- |------|-------------|
- | `model.pt` | Model weights + config (no optimizer) |
- | `checkpoint_full.pt` | Full training checkpoint (with optimizer, for resume) |
- | `config.json` | Model configuration |
- | `tokenizer.json` | BPE tokenizer (v2, trained on Phase 2 mix) |
-
- ## Limitations
-
- - Optimized for short text (keyboard input), not long-form generation
- - May produce grammatical errors in French (e.g., double negatives)
- - 256-token context window limits long-range coherence
- - Not suitable for factual Q&A or instruction following

  ## License

  ---
  language:
  - en
  license: apache-2.0
  tags:
  - keyboard
  - mobile
  - ios
  - coreml
+ - english
  library_name: pytorch
  pipeline_tag: text-generation
  ---
 
+ # Onit Keyboard LM (EN-only, run8)
+
+ A 40M-parameter English-only language model designed for next-word prediction
+ in the Onit iOS keyboard. It replaces the previous bilingual run7 model, which
+ surfaced French tokens (`même`, `soit`, `des`, `présent`) in English keyboard
+ contexts.
+
+ ## Architecture
+
+ | Component | Value |
+ |-----------------------|-----------------------------|
+ | Type | Causal LM (decoder-only) |
+ | Parameters | ~40M |
+ | Vocabulary | 16,384 BPE tokens (EN-only) |
+ | Embedding dim | 512 |
+ | Layers | 10 |
+ | Attention heads | 8 |
+ | FFN dim | 1408 (SwiGLU) |
+ | Max sequence length | 256 |
+ | Positional encoding | RoPE |
+ | Normalization | RMSNorm + QK-Norm |
+ | Embeddings | Tied (input = output) |
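
As a quick consistency check, the table above implies roughly 40M weights. A rough estimate, ignoring norm parameters and assuming no biases (the exact figure depends on the implementation):

```python
# Rough parameter estimate from the table above; norm weights and any biases
# are ignored, so treat the result as an approximation.
vocab, d_model, n_layers, d_ff = 16_384, 512, 10, 1_408

embed = vocab * d_model                 # tied input/output embedding
attn = 4 * d_model * d_model            # Wq, Wk, Wv, Wo per layer
ffn = 3 * d_model * d_ff                # SwiGLU gate, up, down per layer
total = embed + n_layers * (attn + ffn)
print(f"{total / 1e6:.1f}M")            # ~40.5M, consistent with "~40M"
```
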
  ## Training

+ EN-only corpus (43M lines, ~445M tokens):
+ - `clean_en` (Tim's curated corpus)
+ - `opensubtitles_en` (filtered to drop mislabeled French lines; deduplicated)

+ Training run (run8):
+ - 30,000 steps, lr 6e-5 with cosine decay, 1,000 warmup steps, effective batch size 64
+ - Validation PPL: **38.08** on the held-out EN val split
+ - Test PPL: **37.88** on the held-out EN test split (no contamination)
+ - 100% argmax parity between PyTorch and the exported CoreML model
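
The PPL figures above are the exponential of the mean per-token cross-entropy on the held-out splits. A hedged sketch of that computation, assuming a `model` that returns `(logits, _)` as in the earlier PyTorch usage example and an iterable `val_loader` of token-id batches (both names are illustrative, not the repo's evaluation harness):

```python
# Sketch of a held-out perplexity evaluation; `model` and `val_loader` are
# assumed to exist and do not refer to the repo's actual code.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, val_loader, device="cpu"):
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for input_ids in val_loader:                      # [batch, seq_len] int64
        input_ids = input_ids.to(device)
        logits, _ = model(input_ids)                  # [batch, seq_len, vocab]
        nll = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),             # next-token targets
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += input_ids[:, 1:].numel()
    return math.exp(total_nll / total_tokens)         # e.g. ~38 on the val split
```
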

  ## Files

+ | File | Description |
+ |---------------------------------------|-------------------------------------------------|
+ | `keyboard_lm_seq128_fp16.mlpackage` | CoreML mlprogram fp16, seq_len=128 (iOS) |
+ | `tokenizer_en.json` | BPE 16K tokenizer trained on the EN-only corpus |
+ | `config.json` | Model configuration |
+
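
For a quick sanity check of the exported package from Python (macOS, coremltools), something like the sketch below should work. The feature names `input_ids` and `logits`, the zero-padding to the fixed export length, and whether a BOS id must be prepended are assumptions; inspect `mlmodel.get_spec()` for the real interface.

```python
# Hedged sketch: load the shipped .mlpackage with coremltools and ask for the
# top-5 next-word candidates. Feature names and padding scheme are assumptions.
import numpy as np
import coremltools as ct
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer_en.json")
mlmodel = ct.models.MLModel("keyboard_lm_seq128_fp16.mlpackage")

prompt = "I'm going to the"
ids = tokenizer.encode(prompt).ids           # a BOS id may need to be prepended
seq_len = 128                                # fixed length of the export
input_ids = np.zeros((1, seq_len), dtype=np.int32)
input_ids[0, : len(ids)] = ids

out = mlmodel.predict({"input_ids": input_ids})    # assumed input name
logits = out["logits"][0, len(ids) - 1]            # assumed output name/shape
top5 = np.argsort(logits)[::-1][:5]
print([tokenizer.decode([int(i)]) for i in top5])
```
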
+ ## iOS usage notes
+
+ - **Strip trailing whitespace from prompts before tokenization.** The model
+   was trained on clean sentences and produces noisy subword fragments on
+   inputs like `"Hey guys "` (with a trailing space). Use
+   `prompt.trimmingCharacters(in: .whitespacesAndNewlines)` before encoding.
+ - 100% argmax agreement between PyTorch and the exported CoreML model on
+   validation prompts. Predictions on iOS match the PyTorch reference
+   token-for-token; raw logits differ only by fp16 quantization noise
+   (max abs diff ≈ 0.011).
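
To see the effect the first note describes from Python, encode the same prompt with and without the trailing space and compare the ids before feeding them to the model. Whether and how the ids differ depends on the tokenizer's pre-tokenizer, so treat this as a check to run rather than a guarantee.

```python
# Compare token ids for a prompt with and without trailing whitespace; the
# trimmed form is what the iOS keyboard should send to the model.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer_en.json")

raw = "Hey guys "          # trailing space, as typed mid-sentence
trimmed = raw.strip()      # Python analogue of trimmingCharacters(in:)

print(tokenizer.encode(raw).ids)
print(tokenizer.encode(trimmed).ids)
```
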

  ## License
 
checkpoint_full.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:1815019697c1ba8ac9c770097bd0234d9ead3a92c8aa74f40c567aef220eab7c
- size 486296722

{keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Data/com.apple.CoreML/model.mlmodel RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:09276d45c04924a72551c43ae21c04bcf5ca638f41d7bc68316e856f35b4e503
- size 207932

  version https://git-lfs.github.com/spec/v1
+ oid sha256:46d2b3346546533188d9f97e78ff9c29779b6e53483dc4aefe8b664106ca0e4f
+ size 209808
{keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Data/com.apple.CoreML/weights/weight.bin RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ac3538eb72a12390b660e8852a26e83f7c8a85a69d2200f9080b327657be9679
  size 81119936

  version https://git-lfs.github.com/spec/v1
+ oid sha256:b8da8eb7e6e8fb4dd2ef16ca34c557cd739d87ec9d451fd3b7930f2e1c58fdd1
  size 81119936
{keyboard_lm_run8_seq128_fp16.mlpackage → keyboard_lm_seq128_fp16.mlpackage}/Manifest.json RENAMED
@@ -1,18 +1,18 @@
  {
    "fileFormatVersion": "1.0.0",
    "itemInfoEntries": {
-     "41ADBBC4-04DD-4893-B15F-744D900611C1": {
-       "author": "com.apple.CoreML",
-       "description": "CoreML Model Weights",
-       "name": "weights",
-       "path": "com.apple.CoreML/weights"
-     },
-     "F99D3F93-9202-4625-8B4E-B13FCDE915D3": {
        "author": "com.apple.CoreML",
        "description": "CoreML Model Specification",
        "name": "model.mlmodel",
        "path": "com.apple.CoreML/model.mlmodel"
      }
    },
-   "rootModelIdentifier": "F99D3F93-9202-4625-8B4E-B13FCDE915D3"
  }
 
  {
    "fileFormatVersion": "1.0.0",
    "itemInfoEntries": {
+     "5049EDD7-0417-455F-8CB7-9D7D525CCF43": {
        "author": "com.apple.CoreML",
        "description": "CoreML Model Specification",
        "name": "model.mlmodel",
        "path": "com.apple.CoreML/model.mlmodel"
+     },
+     "EC107ED8-DED7-459A-A9DE-998EF22B3FB0": {
+       "author": "com.apple.CoreML",
+       "description": "CoreML Model Weights",
+       "name": "weights",
+       "path": "com.apple.CoreML/weights"
      }
    },
+   "rootModelIdentifier": "5049EDD7-0417-455F-8CB7-9D7D525CCF43"
  }
model.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:d8648eaa471f0baf1229546dfd29bb4b5532b9b37a1338c82bb50ce6be074049
- size 162087274

tokenizer_en.json CHANGED
The diff for this file is too large to render. See raw diff