opus-tatoeba-en-ja-ct2

CTranslate2 int8 quantized conversion of Helsinki-NLP/opus-tatoeba-en-ja, packaged for offline use in Playto.

Purpose

Lightweight English → Japanese NMT for the Playto desktop game-language-learning tool, used as a low-spec / no-GPU fallback when local LLM inference is not available.

Note: This is the Tatoeba Challenge variant, not the older opus-mt-en-jap (which is trained primarily on biblical parallel corpora and produces archaic Japanese unsuitable for general/game text). Always prefer this Tatoeba variant for production en→ja use.

Source model

Upstream: Helsinki-NLP/opus-tatoeba-en-ja — MarianMT, transformer-align architecture, Tatoeba Challenge dataset
License: CC-BY 4.0
No fine-tuning, no weight modification — purely a format conversion + int8 quantization of the upstream weights

Conversion

pip install ctranslate2 transformers sentencepiece
ct2-transformers-converter \
  --model Helsinki-NLP/opus-tatoeba-en-ja \
  --output_dir opus-tatoeba-en-ja-ct2 \
  --quantization int8 \
  --copy_files source.spm target.spm
tar czf opus-tatoeba-en-ja-ct2.tar.gz opus-tatoeba-en-ja-ct2

File layout (inside `opus-tatoeba-en-ja-ct2.tar.gz`)

File	Size	Purpose
`model.bin`	~75 MB	CTranslate2 int8 quantized weights
`shared_vocabulary.json`	~1.4 MB	CTranslate2 vocab
`source.spm`	~810 KB	SentencePiece source tokenizer
`target.spm`	~830 KB	SentencePiece target tokenizer
`config.json`	~250 B	CTranslate2 config
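Before loading the model offline, it can help to confirm that the archive unpacked completely. The following is a minimal sketch using only the Python standard library; the helper names `unpack_model` and `missing_files` are hypothetical and not part of Playto or CTranslate2, and the expected file list is simply copied from the table above.

```python
import tarfile
from pathlib import Path

# File names copied from the layout table; adjust if the archive changes.
EXPECTED_FILES = (
    "model.bin",
    "shared_vocabulary.json",
    "source.spm",
    "target.spm",
    "config.json",
)


def unpack_model(archive_path, dest_dir):
    """Extract the .tar.gz archive into dest_dir and return the model directory.

    Hypothetical helper. Only extract archives from a source you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    # The conversion step packs a single top-level directory with this name.
    return dest / "opus-tatoeba-en-ja-ct2"


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]
```

An empty list from `missing_files(...)` suggests the directory is ready to pass to `ctranslate2.Translator`; it does not validate file contents.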

Usage

With `ctranslate2` (Python)

import ctranslate2
import sentencepiece

# Load the int8 model on CPU, plus the SentencePiece tokenizers shipped with it.
translator = ctranslate2.Translator("opus-tatoeba-en-ja-ct2", device="cpu", compute_type="int8")
sp_source = sentencepiece.SentencePieceProcessor("opus-tatoeba-en-ja-ct2/source.spm")
sp_target = sentencepiece.SentencePieceProcessor("opus-tatoeba-en-ja-ct2/target.spm")

# Tokenize into string pieces and append the </s> marker that MarianMT expects.
source_tokens = sp_source.encode("Hello, how are you?", out_type=str) + ["</s>"]
results = translator.translate_batch([source_tokens])
# Detokenize the best hypothesis back into plain text.
print(sp_target.decode(results[0].hypotheses[0]))
# → "こんにちは、元気?"

With `ct2rs` (Rust)

use ct2rs::{Translator, Tokenizer};

let tokenizer = Tokenizer::new("opus-tatoeba-en-ja-ct2")?;
let translator = Translator::with_tokenizer("opus-tatoeba-en-ja-ct2", tokenizer, /* config */)?;
let result = translator.translate_batch(&["Hello, how are you?".to_string()], /* options */)?;

Important: MarianMT models require `</s>` appended to every source token sequence. The `ct2rs::Tokenizer` wrapper handles this automatically; raw SentencePiece calls must add it manually.
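For the raw SentencePiece path in Python, the `</s>` requirement is easy to forget when translating several lines at once. The sketch below wraps it in a small helper; `with_eos` and `translate_lines` are hypothetical names introduced here for illustration, and the function assumes it is given an already-constructed `ctranslate2.Translator` and two `SentencePieceProcessor` objects as in the Python usage above.

```python
EOS = "</s>"


def with_eos(tokens):
    """Return a copy of tokens that ends with exactly one </s> marker."""
    tokens = list(tokens)
    if not tokens or tokens[-1] != EOS:
        tokens.append(EOS)
    return tokens


def translate_lines(translator, sp_source, sp_target, lines):
    """Translate a list of English strings in a single batch.

    translator: a ctranslate2.Translator (or any object with translate_batch)
    sp_source, sp_target: SentencePiece processors for source.spm / target.spm
    """
    # Tokenize each line into string pieces and make sure </s> is present.
    batch = [with_eos(sp_source.encode(line, out_type=str)) for line in lines]
    results = translator.translate_batch(batch)
    # Keep only the top hypothesis for each input line.
    return [sp_target.decode(result.hypotheses[0]) for result in results]
```

Passing the objects in as arguments keeps the helper free of model-loading side effects, so it can be exercised with stand-in objects in tests.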

Quality

Evaluated on Playto's fixture corpus (26 en→ja game text samples across 5 games):

~80 % rated Good in merged translation mode (Playto-internal eyeball judgment)
For en→ja, merged mode generally outperforms per-line mode, because honorifics, particles (助詞), and sentence-final expressions (文末表現) benefit from longer context
Significantly outperforms the older Helsinki-NLP/opus-mt-en-jap (biblical Japanese, ~95 % broken on game text)
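The difference between the two modes can be illustrated with a small sketch. How Playto actually builds its merged input is not described here, so the following assumes the simplest possible reading: per-line mode sends each non-empty line separately, while merged mode joins the lines of a text block into one input string. The function name `build_inputs` and the `joiner` parameter are hypothetical.

```python
def build_inputs(lines, mode="merged", joiner=" "):
    """Turn a block of source lines into the strings sent to the translator.

    Assumption: "merged" means joining the block into one string so the model
    sees more context; "per-line" means translating each line on its own.
    """
    # Drop blank lines and surrounding whitespace in both modes.
    cleaned = [line.strip() for line in lines if line.strip()]
    if mode == "per-line":
        return cleaned
    if mode == "merged":
        return [joiner.join(cleaned)] if cleaned else []
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `build_inputs(["You found", "a key."], mode="merged")` yields one input, `"You found a key."`, whereas per-line mode yields two fragments that the model must translate without seeing each other.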

Attribution

This is a derivative work of Helsinki-NLP/opus-tatoeba-en-ja. The CC-BY 4.0 license is inherited from upstream.

Helsinki-NLP. OPUS-MT — Open Machine Translation Models.
Tatoeba Challenge.
https://github.com/Helsinki-NLP/Tatoeba-Challenge
https://github.com/Helsinki-NLP/Opus-MT

Disclaimer

NMT output quality is significantly lower than that of modern LLM-based translation. This model is intended as a lightweight fallback for environments where LLM inference is not viable (low VRAM, mobile, slow connection). For higher-quality translation, Playto's LLM-based translation path is recommended.
