Chisel-7B-v0.1
Chisel-7B is a vocabulary-expanded variant of Mistral-7B-v0.1 with targeted Urdu script support. The model is a pre-finetuning checkpoint: it does not yet understand Urdu semantics, but it tokenizes Urdu text with 36.3% lower fertility (tokens per word) than the base model, making it a stronger foundation for Urdu continued pretraining or instruction tuning.
Results
Fertility vs. subwords added
| N | Vocab size | Tokens/word | Reduction | Note |
|---|---|---|---|---|
| 0 | 32,000 | 4.59 | n/a | |
| 30 | 32,030 | 4.55 | 0.9% | |
| 50 | 32,077 | 3.69 | 19.5% | |
| 100 | 32,125 | 3.47 | 24.5% | |
| 200 | 32,214 | 3.22 | 29.9% | |
| 300 | 32,295 | 3.09 | 32.6% | |
| 400 | 32,377 | 3.00 | 34.7% | |
| 500 | 32,467 | 2.92 | 36.3% | recommended |
| 700 | 32,648 | 2.82 | 38.6% | |
| 1000 | 32,924 | 2.70 | 41.2% | |
| 1500 | 33,372 | 2.57 | 44.1% | |
| 2000 | 33,593 | 2.52 | 45.1% | maximum |
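N=500 recovers 80.5% of the maximum fertility reduction reached at N=2000 (36.3% out of 45.1%) with only a 1.46% vocabulary increase (467 new tokens on a base of 32,000). The marginal gain per 100 additional subwords drops below 1.5 percentage points after N=500.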
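The summary figures can be re-derived from the table rows. The following is a minimal sketch that uses only the reported values; the variable names are illustrative, and the marginal gain is computed between adjacent rows in percentage points per 100 units of N.

```python
# (N, vocab size, reported fertility reduction in %) copied from the table above
rows = [
    (0, 32000, 0.0), (30, 32030, 0.9), (50, 32077, 19.5), (100, 32125, 24.5),
    (200, 32214, 29.9), (300, 32295, 32.6), (400, 32377, 34.7), (500, 32467, 36.3),
    (700, 32648, 38.6), (1000, 32924, 41.2), (1500, 33372, 44.1), (2000, 33593, 45.1),
]
by_n = {n: (vocab, red) for n, vocab, red in rows}

base_vocab = by_n[0][0]
vocab_500, red_500 = by_n[500]
red_max = by_n[2000][1]

# Share of the N=2000 reduction that N=500 already achieves
share_of_max = 100 * red_500 / red_max
# Relative vocabulary growth at N=500
vocab_increase = 100 * (vocab_500 - base_vocab) / base_vocab

# Marginal gain between adjacent rows, in percentage points per 100 units of N
marginal = {}
for (n0, _, r0), (n1, _, r1) in zip(rows, rows[1:]):
    marginal[(n0, n1)] = 100 * (r1 - r0) / (n1 - n0)

print(f"{share_of_max:.1f}% of the maximum reduction")  # 80.5%
print(f"{vocab_increase:.2f}% vocabulary increase")      # 1.46%
for (n0, n1), gain in marginal.items():
    print(f"N {n0:>4} -> {n1:>4}: {gain:.2f} points per 100")
```

Every interval starting at N=500 or later comes out below 1.5 points per 100, while the 400 to 500 interval is still above it, which is what motivates the N=500 recommendation.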
Per-sentence tokenization
| Sentence | Words | Base tokens | Chisel tokens | Reduction |
|---|---|---|---|---|
| پاکستان ایک خوبصورت ملک ہے | 5 | 26 | 13 | 50.0% |
| اردو زبان بہت میٹھی ہے | 5 | 22 | 14 | 36.4% |
| کراچی پاکستان کا سب سے بڑا شہر ہے | 8 | 35 | 20 | 42.9% |
| علم حاصل کرنا ہر مسلمان پر فرض ہے | 8 | 33 | 18 | 45.5% |
| آج موسم بہت اچھا ہے | 5 | 19 | 12 | 36.8% |
| Average | 6.2 | 27.0 | 15.4 | 43.0% |
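Fertility here means tokens per whitespace-delimited word. The sketch below shows one way to measure it for any tokenizer; the `fertility` helper and its `encode` argument are illustrative names rather than part of this repository, and the pooled figures are computed only from the counts in the table above.

```python
from typing import Callable, Iterable, Sequence


def fertility(encode: Callable[[str], Sequence], texts: Iterable[str]) -> float:
    """Total tokens divided by total whitespace-delimited words."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("input contains no words")
    return n_tokens / n_words


# (words, base tokens, Chisel tokens) for the five sentences in the table
counts = [(5, 26, 13), (5, 22, 14), (8, 35, 20), (8, 33, 18), (5, 19, 12)]
total_words = sum(c[0] for c in counts)
base_fertility = sum(c[1] for c in counts) / total_words
chisel_fertility = sum(c[2] for c in counts) / total_words

print(f"base:   {base_fertility:.2f} tokens/word")    # 4.35
print(f"chisel: {chisel_fertility:.2f} tokens/word")  # 2.48
```

With a Hugging Face tokenizer loaded, `encode` could be `lambda t: tokenizer.encode(t, add_special_tokens=False)`, so the same function can be run on both the base and the expanded tokenizer over a larger sample.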
English perplexity (Wikitext-2)
| Model | Perplexity |
|---|---|
| Mistral-7B-v0.1 (4-bit) | 9.04 |
| Chisel-7B-v0.1 (4-bit) | 10.82 |
| Delta | +1.78 |
The +1.78 perplexity increase is attributable to softmax redistribution over the expanded output vocabulary. This degradation is expected to recover after continued pretraining on Urdu data, consistent with prior vocabulary expansion work (Hewitt 2021; Cui et al. 2023).
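For reference, perplexity is the exponential of the mean per-token negative log-likelihood, so the reported delta can also be read as a relative increase or as a gap in average loss. The sketch below shows only that definition and arithmetic; the card does not state the evaluation protocol (context length, stride), so none is assumed here.

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Values from the table above
base_ppl, chisel_ppl = 9.04, 10.82

delta = chisel_ppl - base_ppl               # absolute increase
relative = delta / base_ppl                 # increase relative to the base model
nll_gap = math.log(chisel_ppl / base_ppl)   # gap in mean loss, nats per token

print(f"delta:    +{delta:.2f}")
print(f"relative: +{100 * relative:.1f}%")
print(f"NLL gap:  +{nll_gap:.3f} nats/token")
```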
Vocabulary breakdown
| Category | Count | Examples |
|---|---|---|
| Already in Mistral vocab | 19 | ٹ ڈ ں ھ ی ے ، ؟ |
| Characters added (missing) | 30 | ڑ ۔ ۰–۹ ؛ ٪ ۓ ۍ ٖ ٗ |
| Subwords added (N=500) | 437 | frequency-ranked BPE units |
| Total new tokens | 467 | |
| Final vocab size | 32,467 | |
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the expanded model and its matching tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "mahwizzzz/Chisel-7B-v0.1",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mahwizzzz/Chisel-7B-v0.1")

# Verify the fertility improvement on a sample sentence
text = "پاکستان ایک خوبصورت ملک ہے"
tokens = tokenizer.encode(text, add_special_tokens=False)
print(f"{len(text.split())} words → {len(tokens)} tokens")
# expected: 5 words → 13 tokens
```
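If the 4-bit settings need to be passed explicitly at load time, a configuration mirroring the Quantization row of Technical specs might look like the sketch below. Whether the hosted checkpoint requires this is an assumption, `load_chisel` is a hypothetical helper name, and the `bitsandbytes` package plus a supported GPU are needed to actually load the weights.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4, double quantization, bfloat16 compute (per Technical specs)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)


def load_chisel(repo_id="mahwizzzz/Chisel-7B-v0.1"):
    """Hypothetical helper: load the checkpoint with the config above."""
    return AutoModelForCausalLM.from_pretrained(
        repo_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
```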
Limitations
- No Urdu semantic understanding. Responses will be in English until the model is finetuned on Urdu data.
- English perplexity degraded by +1.78 vs base model.
- Fertility corpus was Wikipedia Urdu (500 articles). Domain-specific subwords (medical, legal, conversational) are underrepresented.
- Not evaluated on any Urdu downstream benchmark.
Intended use
This checkpoint is intended as a starting point for:
- Urdu continued pretraining
- Urdu instruction tuning
- Urdu translation, QA, and text generation research
- Tokenization efficiency studies for low-resource Perso-Arabic script languages
Do not use in production Urdu applications without finetuning.
Technical specs
| Property | Value |
|---|---|
| Base model | Mistral-7B-v0.1 |
| Architecture | Transformer decoder, 32 layers, 4096 hidden |
| Original vocab | 32,000 |
| Expanded vocab | 32,467 |
| Quantization | 4-bit NF4, double quant, bfloat16 compute |
| Embedding init | Multivariate normal (Hewitt 2021) |
| Expansion corpus | Wikipedia Urdu, 500 articles |
| BPE model | SentencePiece, vocab=2000 |
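The "Multivariate normal (Hewitt 2021)" entry refers to drawing new embedding rows from a Gaussian fitted to the existing embeddings. The following is a minimal NumPy sketch of that idea on a toy matrix, not the code used to build this checkpoint: the `sample_new_embeddings` name, the `scale` shrink factor on the covariance, and the toy dimensions are all assumptions.

```python
import numpy as np


def sample_new_embeddings(old, n_new, scale=1e-5, seed=0):
    """Draw n_new rows from N(mean(old), scale * cov(old)).

    `scale` shrinks the covariance so new rows stay close to the mean
    embedding; its value here is an assumption, not taken from the card.
    """
    mu = old.mean(axis=0)
    centered = old - mu
    sigma = centered.T @ centered / old.shape[0]  # empirical covariance
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, scale * sigma, size=n_new)


# Toy stand-in for the real 32,000 x 4096 embedding matrix
old = np.random.default_rng(1).normal(size=(1000, 16))
new = sample_new_embeddings(old, n_new=467)

# Append the new rows below the existing ones, as a vocabulary expansion would
expanded = np.vstack([old, new])
print(expanded.shape)  # (1467, 16)
```

In a real model the same procedure would presumably be applied to both the input embedding matrix and the output projection after resizing them to the expanded vocabulary.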
Citation
```bibtex
@misc{khalil2025chisel7b,
  author    = {Mahwiz Khalil},
  title     = {Chisel-7B: Selective Vocabulary Expansion for Urdu Adaptation of Mistral-7B},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mahwizzzz/Chisel-7B-v0.1}
}

@article{jiang2023mistral,
  title   = {Mistral 7B},
  author  = {Jiang, Albert Q and others},
  journal = {arXiv preprint arXiv:2310.06825},
  year    = {2023}
}

@misc{hewitt2021initializing,
  author = {Hewitt, John},
  title  = {Initializing New Word Embeddings for Pretrained Language Models},
  year   = {2021},
  url    = {https://nlp.stanford.edu/~johnhew/vocab-expansion.html}
}
```