---
language:
- zh
- en
tags:
- translation
- game
- cultivation
license: cc-by-nc-4.0
datasets:
- Custom
metrics:
- BLEU
---

This is a finetuned version of Facebook/M2M100. 
It's a project born from the activity of [Amateur Modding Avenue](https://discord.gg/agFA6xa6un), a Discord-based modding community.
Special thanks to the Path of Wuxia modding team for kindly sharing their translations to help build the dataset.

It was trained on a 46k-line parallel corpus built from translations of several Chinese video games. All of them are human/fan translations.

It's not perfect but it's the best I could do. 
It should be sitting somewhere between Google Translate and DeepL, I guess.
So... Before you go any further, lower your expectations.
No, lower.
Just a bit lower... and... here we are.

That being said, it has upsides as a first MT pass in a game translation context:

1) It should not mess up tags
2) It has basic cultivation/martial arts vocabulary
3) Nothing is locked behind a paywall \o/
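
If you rely on point 1, it may still be worth checking each line after the MT pass. The snippet below is only a rough sketch: `TAG_PATTERN`, `extract_tags` and `tags_preserved` are hypothetical names, and the pattern assumes rich-text tags like `<color=#ff0000>` and placeholders like `{0}`, which may not match the markup your game uses.

```python
import re
from collections import Counter

# Assumed tag syntax (not from the model card): <tag ...>, </tag> and {placeholder}.
# Adjust the pattern to whatever markup your game files actually contain.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>|\{[^{}]*\}")


def extract_tags(text):
    """Count every tag or placeholder found in the text."""
    return Counter(TAG_PATTERN.findall(text))


def tags_preserved(source, translation):
    """Return True when both strings contain the same tags, the same number of times."""
    return extract_tags(source) == extract_tags(translation)


source = "<color=#ff0000>飞剑</color>造成{0}点伤害"
translation = "<color=#ff0000>Flying Sword</color> deals {0} damage"
print(tags_preserved(source, translation))
```

Lines that fail the check can then be flagged for manual review instead of being trusted as-is.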

Sample generation script:

```python
from transformers import AutoModelForSeq2SeqLM, M2M100Tokenizer
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
tokenizer = M2M100Tokenizer.from_pretrained("CadenzaBaron/M2M100-418M-for-GameTranslation-Finetuned-Zh-En")
model = AutoModelForSeq2SeqLM.from_pretrained("CadenzaBaron/M2M100-418M-for-GameTranslation-Finetuned-Zh-En")
model.to(device)
# Translate from Chinese to English
tokenizer.src_lang = "zh"
tokenizer.tgt_lang = "en"
test_string = "地阶上品遁术,施展后便可立于所持之剑上,以极快的速度自由飞行。"

# Tokenize, generate (beam search with sampling), then decode the first result
inputs = tokenizer(test_string, return_tensors="pt").to(device)
translated_tokens = model.generate(**inputs, num_beams=10, do_sample=True)
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

print("CH : ", test_string, " // EN : ", translation)
```
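
Game files usually hold many lines, so you will probably want to translate them in batches rather than one by one. Here is a minimal sketch of how that could look, assuming `tokenizer`, `model` and `device` are set up as in the script above. The helper name `translate_in_batches` and its default `batch_size` and `num_beams` values are made up for illustration and have not been tuned.

```python
def translate_in_batches(lines, tokenizer, model, device, batch_size=16, num_beams=5):
    """Translate a list of strings batch_size lines at a time, keeping their order."""
    translations = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        # padding=True pads every line of the batch to the length of the longest one
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        tokens = model.generate(**inputs, num_beams=num_beams)
        translations.extend(tokenizer.batch_decode(tokens, skip_special_tokens=True))
    return translations
```

Calling `translate_in_batches(my_lines, tokenizer, model, device)` should return one English string per input line. Lower `batch_size` if you run out of memory.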

Translation samples and comparison with Google Translate and DeepL: [Link to Spreadsheet](https://docs.google.com/spreadsheets/d/1J1i9P0nyI9q5-m2iZGSUatt3ZdHSxU8NOp9tJH7wxsk/edit?usp=sharing)