English to Thai Transliteration Model

This model transliterates English text into Thai script, converting the sounds of English words into Thai characters.

Model Description

This model is based on ByT5, a token-free sequence-to-sequence model that operates directly on UTF-8 bytes. It has been fine-tuned specifically for English to Thai transliteration using multiple data sources to improve accuracy and coverage.

  • Developed by: yacht
  • Model type: ByT5 (Sequence-to-Sequence)
  • Model size: 582M parameters
  • Language(s): English → Thai
  • License: MIT (free for commercial use)
  • FP16 Support: Yes (model supports half-precision inference)
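
Because ByT5 is token-free, the tokenizer maps each UTF-8 byte of the input directly to an ID (the byte value plus an offset of 3, since IDs 0-2 are reserved for padding, end-of-sequence, and unknown). A quick way to see this with the tokenizer from this repository:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yacht/byt5-base-en2th-transliterator")

# Each ASCII character becomes one token: its byte value shifted by 3.
ids = tokenizer("hello").input_ids
print(ids)  # [107, 104, 111, 111, 114, 1]: the bytes of "hello" plus 3, then the </s> token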

Intended Uses & Limitations

Intended Uses

  • Converting English names, places, and terms into Thai script
  • Assisting with the transliteration of foreign words into Thai
  • Educational purposes for learning Thai script
  • Improving accessibility of English content for Thai speakers

Limitations

  • The model may struggle with uncommon or complex English words
  • Transliteration quality depends on the training data coverage
  • The model focuses on phonetic conversion, not translation

Training and Evaluation

Training Data

The model was trained on a combined dataset of English-Thai transliteration pairs from multiple sources. The dataset includes:

  • Common English words and their Thai transliterations
  • Names of people, places, and organizations
  • Technical terms and other domain-specific vocabulary
  • Geological and scientific terminology

Training Procedure

  • Training framework: Hugging Face Transformers
  • Base model: google/byt5-base
  • Training hyperparameters:
    • Learning rate: 2e-4
    • Batch size: 8
    • Number of epochs: 10
    • Optimizer: AdamW
    • Mixed precision: FP16
    • Gradient clipping: Yes (max_grad_norm=1.0)
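
The original training script is not published; the sketch below shows how the hyperparameters above map onto Hugging Face's Seq2SeqTrainer. The toy dataset is a stand-in for the real training pairs, and AdamW is the Trainer's default optimizer, matching the listed choice:

from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-base")

# Toy stand-in for the real (unpublished) transliteration pairs.
pairs = Dataset.from_dict({"en": ["hello", "graph"], "th": ["เฮลโล", "กราฟ"]})

def preprocess(batch):
    enc = tokenizer(batch["en"], truncation=True)
    enc["labels"] = tokenizer(batch["th"], truncation=True).input_ids
    return enc

train_dataset = pairs.map(preprocess, batched=True, remove_columns=["en", "th"])

args = Seq2SeqTrainingArguments(
    output_dir="byt5-base-en2th-transliterator",
    learning_rate=2e-4,             # as listed above
    per_device_train_batch_size=8,  # batch size 8
    num_train_epochs=10,            # 10 epochs
    fp16=True,                      # mixed-precision training
    max_grad_norm=1.0,              # gradient clipping
)

# Trainer uses AdamW by default, matching the listed optimizer.
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()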

Evaluation Results

  • Accuracy: 0.7831
  • Character Error Rate: 0.0591
  • Mean Levenshtein Distance: 0.4654
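
Character Error Rate here is the character-level Levenshtein edit distance between prediction and reference, normalized by the reference length. The exact evaluation script is not published, but a minimal pure-Python sketch of the metric looks like this:

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    return levenshtein(prediction, reference) / len(reference)

# Example using an error pair from the "Common Errors" section below.
print(cer("กรูป", "กรุ๊ป"))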

How to Use

Standard Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Transliterate English to Thai
english_text = "hello"
inputs = tokenizer(english_text, return_tensors="pt")
# Allow enough new tokens: ByT5 generates bytes, and Thai characters take 3 bytes each in UTF-8
outputs = model.generate(**inputs, max_new_tokens=64)
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")

Using with FP16 for Faster Inference

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model in fp16 for faster inference (requires a CUDA-capable GPU; falls back to fp32 on CPU)
model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)  # the model must be moved to the same device as the inputs

# Transliterate English to Thai with fp16
english_text = "hello"
inputs = tokenizer(english_text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")

Examples

English        Thai
hello          เฮลโล
computer       คอมพิวเตอร์
thailand       ไทยแลนด์
bangkok        แบงค็อก
graph          กราฟ
grossular      กรอสซูลาร์
grossularite   กรอสซูลาไรต์
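
The examples above can be reproduced in a single batched call (a sketch; exact outputs can vary slightly across transformers versions):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

words = ["hello", "computer", "thailand", "bangkok", "graph"]
# Pad to the longest word in the batch so input_ids form a rectangular tensor.
inputs = tokenizer(words, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=64)
for word, out in zip(words, outputs):
    print(word, "→", tokenizer.decode(out, skip_special_tokens=True))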

Performance Benefits of FP16

Using FP16 (half-precision) can provide significant performance benefits:

  • Up to 2x faster inference on compatible GPUs
  • Reduced memory usage (approximately half compared to FP32)
  • Minimal impact on transliteration quality
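
A rough way to check the speedup on your own hardware, assuming a CUDA GPU is available (timings are illustrative, not guaranteed):

import time

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(["computer"] * 32, return_tensors="pt", padding=True).to("cuda")

for dtype in (torch.float32, torch.float16):
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=dtype).to("cuda")
    model.generate(**inputs, max_new_tokens=32)  # warm-up pass before timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=32)
    torch.cuda.synchronize()
    print(f"{dtype}: {time.perf_counter() - start:.2f}s for a batch of 32")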

Multi-Dataset Training

This model was trained on multiple datasets combined together, which provides several advantages:

  • Broader vocabulary coverage across different domains
  • Improved handling of edge cases and uncommon words
  • More consistent transliteration patterns
  • Better generalization to new, unseen words
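
In Hugging Face datasets terms, combining sources looks like the following sketch (the dataset contents here are hypothetical placeholders, not the actual training sources):

from datasets import Dataset, concatenate_datasets

# Hypothetical sources standing in for the actual (unpublished) training data.
common_words = Dataset.from_dict({"en": ["hello"], "th": ["เฮลโล"]})
named_entities = Dataset.from_dict({"en": ["bangkok"], "th": ["แบงค็อก"]})
technical_terms = Dataset.from_dict({"en": ["graph"], "th": ["กราฟ"]})

# All sources share the same (en, th) schema, so they can be concatenated
# and shuffled into a single training set.
combined = concatenate_datasets([common_words, named_entities, technical_terms]).shuffle(seed=42)
print(combined)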

Limitations and Bias

This model is designed specifically for transliteration, not translation. It attempts to convert the sounds of English words into Thai script, not to provide their Thai translations.

The model's performance may vary based on:

  • The phonetic complexity of the input
  • Whether the input contains sounds that are difficult to represent in Thai
  • The coverage of similar words in the training data

Common Errors

Some common error patterns observed:

  • group → กรูป (should be: กรุ๊ป)
  • golf → โกล์ฟ (should be: กอล์ฟ)
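
A quick regression check against these known-problematic words (the gold spellings are taken from the list above; the loop itself is an illustrative sketch):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Gold spellings from the error list above.
gold = {"group": "กรุ๊ป", "golf": "กอล์ฟ"}

for word, expected in gold.items():
    ids = tokenizer(word, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=64)
    predicted = tokenizer.decode(out[0], skip_special_tokens=True)
    status = "OK" if predicted == expected else f"MISMATCH (got {predicted})"
    print(f"{word}: expected {expected}, {status}")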

License

MIT License

Copyright (c) 2025 yacht

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
