Nematus Transformer · English → Turkish (BPE)

A Transformer ("base") neural machine translation model that translates English → Turkish, trained with the Nematus toolkit on a ~142k-sentence news-domain parallel corpus using a joint 32k BPE subword vocabulary.

This is the EN→TR model. The reverse direction is atahanuz/transformers-translator-tr-en-75M. Both belong to the machine translation models collection. The two models were trained on the same data with the same joint BPE model; only the source/target direction and the vocabulary caps differ (BPE is direction-agnostic).

TL;DR

  • BLEU 42.04 on a 1,000-sentence held-out test set (beam 12, length-normalized).
  • ~75M parameters · single A100-80GB · ~75 min training.

Model details

Field | Value
Toolkit | Nematus (TensorFlow), commit 49d050863bc9644b8c0a9d9ab6e54ccd30f927dd
Architecture | Transformer "base"
Direction | English (en) → Turkish (tr)
Embedding / model dim | 512 / 512
Encoder / decoder layers | 6 / 6
Attention heads | 8
FFN hidden size | 2048
Dropout | 0.1 (embeddings / residual / relu / attention)
Embedding tying | none (untied)
Subword model | joint 32k BPE (subword-nmt)
Vocab caps | source 18,000 / target 24,000
Parameters | ~75M
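
As a rough sanity check on the ~75M figure, the weight matrices implied by the table can be counted directly. This is a back-of-the-envelope sketch, not Nematus's own accounting: it assumes the standard Transformer layer layout (four d×d projections per attention block, a two-matrix FFN), ignores biases, layer-norm parameters and special tokens, and uses the subword vocabulary sizes reported under Preprocessing (17,659 EN / 23,211 TR).

```python
# Rough weight count for an untied Transformer "base" (assumed standard layout).
D_MODEL, D_FFN, LAYERS = 512, 2048, 6
SRC_VOCAB, TGT_VOCAB = 17_659, 23_211  # subword types reported for EN / TR

attention = 4 * D_MODEL * D_MODEL          # Q, K, V and output projections
ffn = 2 * D_MODEL * D_FFN                  # two feed-forward matrices

encoder = LAYERS * (attention + ffn)       # self-attention + FFN per layer
decoder = LAYERS * (2 * attention + ffn)   # self-attention + cross-attention + FFN
# Untied embeddings: source table, target table, and a separate softmax projection.
embeddings = (SRC_VOCAB + 2 * TGT_VOCAB) * D_MODEL

total = encoder + decoder + embeddings
print(f"{total / 1e6:.1f}M weights")
```

Under these assumptions the count comes out at roughly 77M, in the same ballpark as the ~75M quoted above; about 33M of that sits in the three untied embedding matrices.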

Training data

  • Corpus: mt_datasets_vol2 — 144,065 English–Turkish sentence pairs (news / current affairs).
  • Filtering: drop any pair where either side is empty or longer than 60 whitespace tokens → 143,926 pairs.
  • Split (shuffled, seed 42): 141,926 train / 1,000 dev / 1,000 test.
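
The filtering and split described above can be sketched in a few lines. This is an illustration only: the function name filter_and_split is hypothetical, and the use of random.Random(seed).shuffle is an assumption, since the card states the seed but not the shuffling routine, so this sketch will not necessarily reproduce the exact same split.

```python
import random

def filter_and_split(pairs, max_len=60, n_dev=1000, n_test=1000, seed=42):
    """Drop empty or over-long pairs, shuffle, then carve out dev and test sets."""
    kept = [
        (en, tr)
        for en, tr in pairs
        # Length is measured in whitespace tokens on both sides.
        if 0 < len(en.split()) <= max_len and 0 < len(tr.split()) <= max_len
    ]
    random.Random(seed).shuffle(kept)  # assumed shuffling method
    dev = kept[:n_dev]
    test = kept[n_dev:n_dev + n_test]
    train = kept[n_dev + n_test:]
    return train, dev, test
```

With 143,926 pairs surviving the filter, holding out 1,000 + 1,000 leaves the 141,926 training pairs listed above.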

Preprocessing

  1. Tokenization: Moses tokenizer.perl -l <lang> -no-escape.
  2. Joint BPE: 32,000 merge operations learned with subword-nmt learn-joint-bpe-and-vocab on the training split only (EN and TR sides concatenated).
  3. Apply BPE without a vocabulary-frequency threshold: on a corpus this small, the commonly used --vocabulary-threshold 50 over-fragments words. Resulting subword vocabulary: TR 23,211 / EN 17,659 types.
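
To make "joint BPE" concrete, here is a toy version of the merge-learning loop from Sennrich et al. (2016). It is a simplified sketch, not subword-nmt's implementation: the function learn_bpe, the explicit "</w>" end-of-word symbol, and the tie-breaking rule are choices made for this illustration. "Joint" simply means the word list passed in pools the English and Turkish training text.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Greedily learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Deterministic tie-break (illustrative): highest count, then largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Joint learning: pool (toy) English and Turkish word lists before learning merges.
en_words = ["newest"] * 6 + ["widest"] * 3
tr_words = ["geldi"] * 4 + ["gitti"] * 2
joint_merges = learn_bpe(en_words + tr_words, num_merges=5)
```

Because the merges are learned once over both languages, the same codes file can segment either side, which is why the EN→TR and TR→EN models can share it.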

Training configuration

  • Optimizer: Adam (β1 0.9, β2 0.999, ε 1e-8), learning rate 1e-4 constant, gradient-norm clip 1.0.
  • Batch 480 sentences, maxlen 120, label smoothing 0.0.
  • Validation + checkpoint every 400 updates; early stopping with patience 10 on dev cross-entropy.
  • Hardware: 1× NVIDIA A100-80GB (~66 GB peak memory).
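
The patience rule can be illustrated with a small, generic helper. This is a sketch of patience-based early stopping in general; the function early_stop_index is hypothetical, and Nematus's own counter may differ in detail (for example, in whether the comparison is strict or when the counter is checked).

```python
def early_stop_index(dev_losses, patience=10):
    """Return (best_idx, stop_idx) for patience-based early stopping.

    dev_losses holds one dev cross-entropy value per validation run.
    stop_idx is None if training would not have stopped within dev_losses.
    """
    best_idx, best_loss, bad = 0, float("inf"), 0
    for i, loss in enumerate(dev_losses):
        if loss < best_loss:
            # New best: remember it and reset the bad-validation counter.
            best_idx, best_loss, bad = i, loss, 0
        else:
            bad += 1
            if bad >= patience:
                return best_idx, i
    return best_idx, None

# Toy losses: improvement stops after the third validation.
print(early_stop_index([5.0, 4.0, 3.0, 3.5, 3.6, 3.7], patience=3))
```

With validation every 400 updates, an index i in this sketch would correspond to update (i + 1) × 400, and the checkpoint kept is the one at best_idx.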

Training run

  • Best validation at update 10,800 (dev cross-entropy 38.69) — the checkpoint in this repo is that best model.
  • Early-stopped at update 15,000 (patience reached as dev CE rose while train loss kept falling).
  • Wall-clock ≈ 75 minutes.

Evaluation

BLEU is computed with multi-bleu.perl on the Moses-tokenized hypothesis (after undoing BPE) against the tokenized reference; decoding uses beam size 12 with length normalization.

Direction | BLEU | 1/2/3/4-gram precision | BP
EN→TR (this model) | 42.04 | 69.7 / 49.1 / 36.7 / 27.8 | 0.97
TR→EN (sibling model) | 42.78 | 72.9 / 49.8 / 36.4 / 27.3 | 0.98

Both directions are scored on the same 1,000 test pairs (with source and reference swapped), so the comparison is like-for-like. Judged by dev cross-entropy, generating Turkish is the harder direction, plausibly because of its agglutinative morphology, but on this formulaic news domain the BLEU gap is small (42.04 vs 42.78, a difference of 0.74).
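
The BLEU scores can be cross-checked against the n-gram precisions and brevity penalty in the table, since BLEU is the brevity penalty times the geometric mean of the four precisions. The helper below is a minimal sketch (the name bleu_from_precisions is hypothetical); because it starts from the rounded table values, it can only approximate the reported scores.

```python
import math

def bleu_from_precisions(precisions, brevity_penalty):
    """BLEU = BP * geometric mean of the n-gram precisions (given in percent)."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)

en_tr = bleu_from_precisions([69.7, 49.1, 36.7, 27.8], 0.97)
tr_en = bleu_from_precisions([72.9, 49.8, 36.4, 27.3], 0.98)
print(f"EN->TR ~ {en_tr:.1f}, TR->EN ~ {tr_en:.1f}")
```

Both reconstructed values land within a few tenths of a BLEU point of the reported 42.04 and 42.78; the remainder is consistent with rounding of the precisions and BP in the table.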

Example

English (input) | Model output | Reference
The US Embassy in Bosnia and Herzegovina welcomed the offer to send soldiers to Iraq. | ABD'nin Bosna-Hersek Büyükelçiliği Irak'a asker gönderme teklifini memnuniyetle karşıladı. | ABD'nin BH Büyükelçiliği Irak'a asker gönderme teklifini memnuniyetle karşıladı.

Files in this repo

  • model.npz.* — Nematus/TensorFlow checkpoint (best-validation, update 10,800).
  • train.bpe.en.json, train.bpe.tr.json — Nematus source/target vocabularies (referenced by model.npz.json).
  • tr-en.bpe.codes, vocab.tr, vocab.en — the joint BPE model, used to segment new input.
  • nematus_tf220_compat.patch — makes Nematus run on TensorFlow ≥ 2.16 / NumPy ≥ 2 / Python 3.12.

Installation

# Nematus (pinned) + the compatibility patch shipped in this repo
git clone https://github.com/EdinburghNLP/nematus.git
cd nematus && git checkout 49d050863bc9644b8c0a9d9ab6e54ccd30f927dd
git apply /path/to/nematus_tf220_compat.patch
pip install "tensorflow>=2.16" "numpy>=2" subword-nmt sacremoses huggingface_hub
cd ..

The patch fixes graph-mode tf.debugging.assert_shapes calls, NumPy 2 dtype=object handling, JSON encoding of NumPy floats, and the softmax-projection EmbeddingLayer arguments. It also monkey-patches tf.keras.layers.Dropout so that Keras 3 accepts the symbolic training flag.

Usage

# 0. download this model
python3 -c "from huggingface_hub import snapshot_download; \
snapshot_download('atahanuz/transformers-translator-en-tr-75M', local_dir='en_tr_model')"

# 1. preprocess an English input file (one sentence per line)
perl nematus/data/tokenizer.perl -l en -no-escape < input.en > input.tok.en
subword-nmt apply-bpe -c en_tr_model/tr-en.bpe.codes < input.tok.en > input.bpe.en

# 2. translate — run from the model dir so the dict paths in model.npz.json resolve
cd en_tr_model
python3 ../nematus/nematus/translate.py -m model.npz -i ../input.bpe.en -o ../out.bpe.tr -k 12 -n -b 50
cd ..

# 3. postprocess: undo BPE, then detokenize (Turkish)
sed -E 's/(@@ )|(@@ ?$)//g' out.bpe.tr > out.tok.tr
python3 - <<'EOF'
import re
from sacremoses import MosesDetokenizer
d = MosesDetokenizer(lang='tr')
for line in open('out.tok.tr', encoding='utf-8'):
    t = d.detokenize(line.split())
    print(re.sub(r"\s*'\s*", "'", t))   # join Turkish suffix apostrophes: Irak ' a -> Irak'a
EOF
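
For readers who prefer to keep post-processing in one language, the two regex steps above (the sed call and the apostrophe join) can be expressed as plain Python functions. This is a sketch that mirrors the patterns used in step 3; the function names and the example segmentation are made up for illustration, and Moses detokenization itself is not reproduced here.

```python
import re

def undo_bpe(line):
    """Remove subword-nmt '@@ ' continuation markers (same pattern as the sed call)."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

def join_apostrophes(text):
    """Attach suffixes split off at an apostrophe: Irak ' a -> Irak'a."""
    return re.sub(r"\s*'\s*", "'", text)

# Hypothetical BPE-segmented model output.
segmented = "Irak ' a asker gön@@ der@@ me tek@@ lif@@ ini"
print(join_apostrophes(undo_bpe(segmented)))
```

Note that the apostrophe rule is a heuristic: it closes up whitespace around every apostrophe, so text that uses single quotes as quotation marks would be joined as well.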

Intended use & limitations

  • Domain: news / current affairs (SETimes-style). Quality drops on very different domains.
  • Trained on sentences ≤ 60 tokens; translations of much longer inputs may degrade or be truncated.
  • The reported BLEU is tokenized multi-bleu.perl — useful for internal comparison but not directly comparable across tokenizers/papers. For citable numbers use sacreBLEU on detokenized output.
  • No safety/bias auditing; the model can reflect biases present in the training data and may hallucinate fluent-but-wrong content on out-of-distribution input.

License

The cc-by-4.0 tag is a placeholder; the actual license should match the terms of the training data.

Acknowledgements

Built with Nematus and subword-nmt. BPE: Sennrich, Haddow & Birch (2016); Transformer: Vaswani et al. (2017).
