metadata
library_name: transformers
license: llama3.1
language:
  - ko
  - vi
  - id
  - km
  - th
metrics:
  - bleu
  - rouge
base_model:
  - meta-llama/Llama-3.1-8B-Instruct

Model Card for llama-3.1-Asian-Bllossom-8B-Translator

This model is a multilingual translation model fine-tuned from the Llama 3.1 8B Instruct base model. It translates in either direction between any two of the following Asian languages:

  • Korean
  • Vietnamese
  • Indonesian
  • Cambodian (Khmer)
  • Thai

Acknowledgements

AICA

Model Details

The model is designed for translating short text segments between any pair of the supported languages.

Supported language pairs:

  • Korean ↔ Vietnamese
  • Korean ↔ Indonesian
  • Korean ↔ Cambodian
  • Korean ↔ Thai
  • Vietnamese ↔ Indonesian
  • Vietnamese ↔ Cambodian
  • Vietnamese ↔ Thai
  • Indonesian ↔ Cambodian
  • Indonesian ↔ Thai
  • Cambodian ↔ Thai
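The ten pairs above are exactly the unordered combinations of the five supported languages, and each pair covers two translation directions. A minimal sketch that enumerates them with the standard library (the variable names are illustrative and not part of the model's API):

```python
from itertools import combinations, permutations

# The five languages listed in this model card.
LANGUAGES = ["Korean", "Vietnamese", "Indonesian", "Cambodian", "Thai"]

# Unordered pairs (A <-> B): one entry per bidirectional pair.
pairs = list(combinations(LANGUAGES, 2))

# Ordered pairs (A -> B): one entry per translation direction.
directions = list(permutations(LANGUAGES, 2))

print(len(pairs))       # 10
print(len(directions))  # 20
for source, target in pairs:
    print(f"{source} <-> {target}")
```

The 20 directions correspond to the 20 rows of the evaluation table and to the 1M-examples-per-direction split of the 20M training examples.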

Model Description

This model is optimized for translation among Korean and four Southeast Asian languages (Vietnamese, Indonesian, Cambodian, and Thai), with the aim of supporting communication between these language communities.

The training data consists of 20M examples, 1M for each of the 20 translation directions, and gives the model broad coverage of common expressions and basic conversations in these languages.

Model Architecture

Base Model: meta-llama/Llama-3.1-8B-Instruct

Bias, Risks, and Limitations

  • Performance is limited to short sentences and phrases
  • May not handle complex or lengthy text effectively
  • Translation quality may vary depending on language pair and content complexity
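Because the model works best on short inputs, longer text can be split into sentences and translated one sentence at a time. The following is a rough sketch and not part of the released model: `split_into_sentences` is a hypothetical helper, and the punctuation set is an assumption (Latin terminators plus the Khmer khan sign `។`). Thai usually has no sentence-final punctuation, so Thai text would need a dedicated segmenter, and abbreviations such as "e.g." will be split incorrectly.

```python
import re

# Assumed sentence terminators: ., !, ? and the Khmer khan sign.
_SENTENCE_END = re.compile(r"(?<=[.!?។])\s+")

def split_into_sentences(text):
    """Split text after a terminator that is followed by whitespace."""
    parts = _SENTENCE_END.split(text.strip())
    # Drop empty strings, which appear when the input itself is empty.
    return [part for part in parts if part]

sentences = split_into_sentences("Xin chào. Bạn khỏe không? Tôi khỏe!")
print(sentences)
```

Each resulting sentence can then be translated separately and the outputs joined in order.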

Evaluation results

| Source Language | Target Language | BLEU Score | ROUGE-1 | ROUGE-L |
|---|---|---|---|---|
| Korean | Vietnamese | 56.70 | 81.64 | 76.66 |
| Korean | Cambodian | 71.69 | 89.26 | 88.20 |
| Korean | Indonesian | 58.32 | 80.39 | 76.63 |
| Korean | Thai | 63.26 | 78.88 | 72.29 |
| Vietnamese | Korean | 49.01 | 75.57 | 72.74 |
| Vietnamese | Cambodian | 78.26 | 90.74 | 90.32 |
| Vietnamese | Indonesian | 65.96 | 83.08 | 81.46 |
| Vietnamese | Thai | 65.93 | 81.09 | 76.57 |
| Cambodian | Korean | 49.10 | 72.67 | 69.75 |
| Cambodian | Vietnamese | 63.42 | 81.56 | 79.09 |
| Cambodian | Indonesian | 61.41 | 79.67 | 77.75 |
| Cambodian | Thai | 70.91 | 81.85 | 77.66 |
| Indonesian | Korean | 53.61 | 77.14 | 74.29 |
| Indonesian | Vietnamese | 68.21 | 85.41 | 83.10 |
| Indonesian | Cambodian | 78.84 | 90.81 | 90.35 |
| Indonesian | Thai | 67.12 | 81.54 | 77.19 |
| Thai | Korean | 45.59 | 72.48 | 69.46 |
| Thai | Vietnamese | 61.55 | 81.01 | 78.24 |
| Thai | Cambodian | 78.52 | 91.47 | 91.16 |
| Thai | Indonesian | 58.99 | 78.56 | 76.40 |
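To see how quality varies by direction, the BLEU scores in the table can be averaged per target language. The sketch below copies the BLEU column as-is; the grouping and the variable names are illustrative only.

```python
from collections import defaultdict

# (source, target, BLEU) copied from the evaluation table above.
BLEU_SCORES = [
    ("Korean", "Vietnamese", 56.70), ("Korean", "Cambodian", 71.69),
    ("Korean", "Indonesian", 58.32), ("Korean", "Thai", 63.26),
    ("Vietnamese", "Korean", 49.01), ("Vietnamese", "Cambodian", 78.26),
    ("Vietnamese", "Indonesian", 65.96), ("Vietnamese", "Thai", 65.93),
    ("Cambodian", "Korean", 49.10), ("Cambodian", "Vietnamese", 63.42),
    ("Cambodian", "Indonesian", 61.41), ("Cambodian", "Thai", 70.91),
    ("Indonesian", "Korean", 53.61), ("Indonesian", "Vietnamese", 68.21),
    ("Indonesian", "Cambodian", 78.84), ("Indonesian", "Thai", 67.12),
    ("Thai", "Korean", 45.59), ("Thai", "Vietnamese", 61.55),
    ("Thai", "Cambodian", 78.52), ("Thai", "Indonesian", 58.99),
]

# Group the scores by the language being translated into.
by_target = defaultdict(list)
for source, target, bleu in BLEU_SCORES:
    by_target[target].append(bleu)

# Mean BLEU per target language, printed from lowest to highest.
mean_bleu = {target: sum(v) / len(v) for target, v in by_target.items()}
for target, score in sorted(mean_bleu.items(), key=lambda kv: kv[1]):
    print(f"into {target}: {score:.2f}")
```

On these numbers, translation into Korean has the lowest mean BLEU (about 49.3) and translation into Cambodian the highest (about 76.8). BLEU is sensitive to tokenization, so comparisons across target languages should be read with caution.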

Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "MLP-KTLim/llama-3.1-Asian-Bllossom-8B-Translator",
    torch_dtype="auto",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    "MLP-KTLim/llama-3.1-Asian-Bllossom-8B-Translator",
)

input_text = "안녕하세요? 아시아 언어 번역 모델 입니다."

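def get_input_ids(source_lang, target_lang, message):
    # Only the five languages the model was fine-tuned on are accepted.
    assert source_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]
    assert target_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]

    # The system prompt names the translation direction; the user turn carries the text.
    input_ids = tokenizer.apply_chat_template(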
        conversation=[
            {"role": "system", "content": f"You are a useful translation AI. Please translate the sentence given in {source_lang} into {target_lang}."},
            {"role": "user", "content": message},
        ],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True,
    )
    return input_ids

input_ids = get_input_ids(
    source_lang="Korean",
    target_lang="Vietnamese",
    message=input_text,
)

output = model.generate(
    input_ids.to(model.device),
    max_new_tokens=128,
)

# Decode only the newly generated tokens, skipping the prompt tokens.
print(tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True))
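The prompt format used in the example can be checked without loading the 8B model by building the chat messages separately from tokenization. A minimal sketch: `build_messages` and `SUPPORTED_LANGUAGES` are hypothetical names, and rejecting identical source and target languages is an added assumption, not something the original example does.

```python
SUPPORTED_LANGUAGES = ("Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian")

def build_messages(source_lang, target_lang, message):
    """Return the chat messages for one translation request."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {lang!r}")
    if source_lang == target_lang:
        # Assumption: translating a language into itself is treated as an error.
        raise ValueError("source_lang and target_lang must differ")
    # Same system prompt wording as the example above.
    system_prompt = (
        "You are a useful translation AI. "
        f"Please translate the sentence given in {source_lang} into {target_lang}."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": message},
    ]

messages = build_messages("Korean", "Thai", "안녕하세요?")
print(messages[0]["content"])
```

The returned list can be passed as the `conversation` argument of `tokenizer.apply_chat_template`, as in the example above.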

Contributor