Nematus Transformer · English → Turkish (BPE)
A Transformer ("base") neural machine translation model that translates English → Turkish, trained with the Nematus toolkit on a ~142k-sentence news-domain parallel corpus using a joint 32k BPE subword vocabulary.
This is the EN→TR model. The reverse direction is atahanuz/transformers-translator-tr-en-75M.
Both belong to the machine translation models collection. The two models were trained on the
same data and the same joint BPE model; only the source/target direction and vocab caps differ
(BPE is direction-agnostic).
TL;DR
- BLEU 42.04 on a 1,000-sentence held-out test set (beam 12, length-normalized).
- ~75M parameters · single A100-80GB · ~75 min training.
Model details
| Field | Value |
|---|---|
| Toolkit | Nematus (TensorFlow), commit 49d050863bc9644b8c0a9d9ab6e54ccd30f927dd |
| Architecture | Transformer "base" |
| Direction | English (en) → Turkish (tr) |
| Embedding / model dim | 512 / 512 |
| Encoder / decoder layers | 6 / 6 |
| Attention heads | 8 |
| FFN hidden size | 2048 |
| Dropout | 0.1 (embeddings / residual / relu / attention) |
| Embedding tying | none (untied) |
| Subword model | joint 32k BPE (subword-nmt) |
| Vocab caps | source 18,000 / target 24,000 |
| Parameters | ~75M |
Training data
- Corpus: mt_datasets_vol2, 144,065 English–Turkish sentence pairs (news / current affairs).
- Filtering: drop any pair where either side is empty or longer than 60 whitespace tokens → 143,926 pairs.
- Split (shuffled, seed 42): 141,926 train / 1,000 dev / 1,000 test.
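The filtering and split rules above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for this model: the function names, the use of `random.Random`, and the order in which dev and test are sliced off are all assumptions, so it will not reproduce the exact split.

```python
import random


def filter_pairs(pairs, max_tokens=60):
    """Keep pairs where both sides are non-empty and at most max_tokens whitespace tokens."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if 0 < n_src <= max_tokens and 0 < n_tgt <= max_tokens:
            kept.append((src, tgt))
    return kept


def split_pairs(pairs, n_dev=1000, n_test=1000, seed=42):
    """Shuffle with a fixed seed, then slice off dev and test; the rest is train.

    Assumption: the slicing order (dev first, then test) is illustrative only.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Toy data: 10 valid pairs, one with an empty side, one that is too long.
toy = [("a b", "c d")] * 10 + [("", "x"), (" ".join(["w"] * 61), "y")]
clean = filter_pairs(toy)
train, dev, test = split_pairs(clean, n_dev=2, n_test=2)
```

With the corpus sizes reported above, the same arithmetic gives 143,926 − 1,000 − 1,000 = 141,926 training pairs.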
Preprocessing
- Tokenization: Moses tokenizer.perl -l <lang> -no-escape.
- Joint BPE: 32,000 merge operations learned on the training split only (EN+TR concatenated) with subword-nmt learn-joint-bpe-and-vocab.
- Apply BPE without a vocabulary-frequency threshold: on a corpus this size the usual --vocabulary-threshold 50 over-fragments. Resulting subword vocab: TR 23,211 / EN 17,659 types.
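A per-language type count like the one reported above can be checked by counting distinct whitespace-separated tokens in a BPE-segmented file. The sketch below is only an illustration: `count_subword_types` is a hypothetical helper, and the toy lines stand in for the segmented training files, which are not shipped in this repo. Counts taken from the Nematus JSON vocabularies may differ slightly if those include special symbols.

```python
from collections import Counter


def count_subword_types(lines):
    """Count occurrences of each subword token across BPE-segmented lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Toy stand-in for a BPE-segmented Turkish file (one sentence per line).
toy_lines = ["gön@@ derme teklif@@ ini", "gön@@ der"]
counts = count_subword_types(toy_lines)
n_types = len(counts)  # number of distinct subword types
```

For a real file, passing an open file handle (for example `open(path, encoding='utf-8')`, with `path` pointing at your own segmented data) works the same way, since the function only iterates over lines.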
Training configuration
- Optimizer: Adam (β1 0.9, β2 0.999, ε 1e-8), learning rate 1e-4 constant, gradient-norm clip 1.0.
- Batch 480 sentences, maxlen 120, label smoothing 0.0.
- Validation + checkpoint every 400 updates; early stopping with patience 10 on dev cross-entropy.
- Hardware: 1× NVIDIA A100-80GB (~66 GB peak memory).
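The validation and early-stopping schedule above can be illustrated with a small, generic sketch. It is not Nematus's own implementation: the function name and the exact counter convention (stop once `patience` consecutive validations fail to improve) are assumptions, and toolkits differ by a validation step or so in how they count.

```python
def early_stop_update(dev_losses, valid_every=400, patience=10):
    """Return (best_update, stop_update) for a sequence of dev losses.

    dev_losses[i] is the loss at update (i + 1) * valid_every.
    stop_update is None if patience is never exhausted.
    """
    best_loss = float("inf")
    best_update = None
    bad_validations = 0
    for i, loss in enumerate(dev_losses, start=1):
        update = i * valid_every
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_update, bad_validations = loss, update, 0
        else:
            bad_validations += 1
            if bad_validations >= patience:
                return best_update, update
    return best_update, None


# Toy curve: improves for three validations, then stays worse for ten.
toy_losses = [5.0, 4.0, 3.0] + [3.5] * 10
best, stopped = early_stop_update(toy_losses)
```

The checkpoint kept is the one at `best`, not the one at `stopped`, which matches the practice of shipping the best-validation model.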
Training run
- Best validation at update 10,800 (dev cross-entropy 38.69) — the checkpoint in this repo is that best model.
- Early-stopped at update 15,000 (patience reached as dev CE rose while train loss kept falling).
- Wall-clock ≈ 75 minutes.
Evaluation
BLEU via multi-bleu.perl on the merged-BPE (Moses-tokenized) hypothesis vs the tokenized reference; beam 12, length-normalized.
| Direction | BLEU | 1/2/3/4-gram precision | BP |
|---|---|---|---|
| EN→TR (this model) | 42.04 | 69.7 / 49.1 / 36.7 / 27.8 | 0.97 |
| TR→EN (sibling model) | 42.78 | 72.9 / 49.8 / 36.4 / 27.3 | 0.98 |
Both directions are scored on the same 1,000 mirrored test pairs, so the comparison is fair. Generating Turkish is the harder direction by dev cross-entropy (agglutinative morphology), but on this formulaic news domain the BLEU gap is small (−0.74).
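As a sanity check on the table, BLEU is the brevity penalty times the geometric mean of the 1- to 4-gram precisions. The sketch below recomputes it from the rounded values in the table; `bleu_from_components` is a hypothetical helper, not part of multi-bleu.perl. Because the precisions and BP are rounded, the result only matches the reported scores to within a few tenths of a BLEU point.

```python
import math


def bleu_from_components(precisions_pct, brevity_penalty):
    """BLEU (0-100) from n-gram precisions given in percent and a brevity penalty."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty * math.exp(log_mean)


# Values copied from the evaluation table (already rounded there).
en_tr = bleu_from_components([69.7, 49.1, 36.7, 27.8], 0.97)
tr_en = bleu_from_components([72.9, 49.8, 36.4, 27.3], 0.98)
print(f"EN->TR ~ {en_tr:.2f}, TR->EN ~ {tr_en:.2f}")
```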
Example
| English (input) | Model output | Reference |
|---|---|---|
| The US Embassy in Bosnia and Herzegovina welcomed the offer to send soldiers to Iraq. | ABD'nin Bosna-Hersek Büyükelçiliği Irak'a asker gönderme teklifini memnuniyetle karşıladı. | ABD'nin BH Büyükelçiliği Irak'a asker gönderme teklifini memnuniyetle karşıladı. |
Files in this repo
- model.npz.*: Nematus/TensorFlow checkpoint (best-validation, update 10,800).
- train.bpe.en.json, train.bpe.tr.json: Nematus source/target vocabularies (referenced by model.npz.json).
- tr-en.bpe.codes, vocab.tr, vocab.en: the joint BPE model, used to segment new input.
- nematus_tf220_compat.patch: makes Nematus run on TensorFlow ≥ 2.16 / NumPy ≥ 2 / Python 3.12.
Installation
# Nematus (pinned) + the compatibility patch shipped in this repo
git clone https://github.com/EdinburghNLP/nematus.git
cd nematus && git checkout 49d050863bc9644b8c0a9d9ab6e54ccd30f927dd
git apply /path/to/nematus_tf220_compat.patch
pip install "tensorflow>=2.16" "numpy>=2" subword-nmt sacremoses huggingface_hub
cd ..
The patch fixes graph-mode tf.debugging.assert_shapes, NumPy-2 dtype=object, a numpy-float
JSON encoder, the softmax-projection EmbeddingLayer args, and monkey-patches
tf.keras.layers.Dropout so Keras 3 accepts the symbolic training flag.
Usage
# 0. download this model
python3 -c "from huggingface_hub import snapshot_download; \
snapshot_download('atahanuz/transformers-translator-en-tr-75M', local_dir='en_tr_model')"
# 1. preprocess an English input file (one sentence per line)
perl nematus/data/tokenizer.perl -l en -no-escape < input.en > input.tok.en
subword-nmt apply-bpe -c en_tr_model/tr-en.bpe.codes < input.tok.en > input.bpe.en
# 2. translate — run from the model dir so the dict paths in model.npz.json resolve
cd en_tr_model
python3 ../nematus/nematus/translate.py -m model.npz -i ../input.bpe.en -o ../out.bpe.tr -k 12 -n -b 50
cd ..
# 3. postprocess: undo BPE, then detokenize (Turkish)
sed -E 's/(@@ )|(@@ ?$)//g' out.bpe.tr > out.tok.tr
python3 - <<'EOF'
import re
from sacremoses import MosesDetokenizer
d = MosesDetokenizer(lang='tr')
for line in open('out.tok.tr', encoding='utf-8'):
    t = d.detokenize(line.split())
    print(re.sub(r"\s*'\s*", "'", t))  # join Turkish suffix apostrophes: Irak ' a -> Irak'a
EOF
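The two text-level steps of the postprocessing (undoing BPE and joining suffix apostrophes) can also be written as plain Python functions, which makes them easy to test in isolation. This is a sketch that mirrors the regular expressions in the commands above; the function names are hypothetical, and Moses detokenization itself is left out. Note that the apostrophe rule removes whitespace around every apostrophe, so it also closes the gap around genuine single quotation marks.

```python
import re


def undo_bpe(line):
    """Remove BPE continuation markers: '@@ ' inside the line and a trailing '@@'."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


def join_apostrophes(text):
    """Drop whitespace around apostrophes, e.g. Irak ' a -> Irak'a."""
    return re.sub(r"\s*'\s*", "'", text)


segmented = "Irak ' a asker gön@@ derme tekl@@ if@@"
merged = undo_bpe(segmented)
joined = join_apostrophes(merged)
print(joined)
```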
Intended use & limitations
- Domain: news / current affairs (SETimes-style). Quality drops on very different domains.
- Trained on sentences ≤ 60 tokens; very long inputs may degrade or truncate.
- The reported BLEU is a tokenized multi-bleu.perl score, useful for internal comparison but not directly comparable across tokenizers/papers. For citable numbers use sacreBLEU on detokenized output.
- No safety/bias auditing; the model can reflect biases present in the training data and may hallucinate fluent-but-wrong content on out-of-distribution input.
License
cc-by-4.0 is a placeholder — set this to whatever matches your training-data terms.
Acknowledgements
Built with Nematus and subword-nmt. BPE: Sennrich, Haddow & Birch (2016); Transformer: Vaswani et al. (2017).