---
language:
- zh
- en
tags:
- translation
- game
- cultivation
license: cc-by-nc-4.0
datasets:
- Custom
metrics:
- BLEU
pipeline_tag: translation
inference:
  parameters:
    src_lang: "zh"
    tgt_lang: "en"
widget:
- text: "地阶上品遁术,施展后便可立于所持之剑上,以极快的速度自由飞行。"
---

# Note: The model has been bumped to V2, which seems to provide better translations. The legacy model is still available on the 'V1' branch.

This is a fine-tuned version of Facebook/M2M100.
It's a project born from the activity of [Amateur Modding Avenue](discord.gg/agFA6xa6un), a Discord-based modding community.
Special thanks to the Path of Wuxia modding team for kindly sharing their translations to help build the dataset.

It has been trained on a 46k-line parallel corpus built from several Chinese video game translations, all of them human/fan translations.

It's not perfect, but it's the best I could do.
It should sit somewhere between Google Translate and DeepL, I guess.
So... before you go any further, lower your expectations.
No, lower.
Just a bit lower... and... here we are.

That being said, it has upsides as a first MT pass in a game translation context:

1) It should not mess up tags (see the tag sanity check after the sample script below)
2) It has basic cultivation/martial arts vocabulary
3) Nothing is locked behind a paywall \o/

Note: Since the dataset is built from the work of modding groups (AMA and the PoW Translation Team), who may not want their work reused for further AI training, it will not be made public or shared.

Sample generation script:

```python
from transformers import AutoModelForSeq2SeqLM, M2M100Tokenizer
import torch

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

tokenizer = M2M100Tokenizer.from_pretrained("CadenzaBaron/M2M100-418M-for-GameTranslation-Finetuned-Zh-En")
model = AutoModelForSeq2SeqLM.from_pretrained("CadenzaBaron/M2M100-418M-for-GameTranslation-Finetuned-Zh-En")
model.to(device)

tokenizer.src_lang = "zh"
tokenizer.tgt_lang = "en"

test_string = "地阶上品遁术,施展后便可立于所持之剑上,以极快的速度自由飞行。"

inputs = tokenizer(test_string, return_tensors="pt").to(device)
# M2M100 is a many-to-many model, so generate() needs the target
# language forced explicitly via its BOS token.
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("en"), num_beams=10, do_sample=True)
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

print("CH : ", test_string, " // EN : ", translation)
```
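As a quick illustration of point 1 above, here is a minimal tag sanity check that continues from the script above (a sketch: the `{npc_name}` / `<color=...>` markup is a hypothetical example of game tags, substitute your engine's own syntax):

```python
import re

# Hypothetical game-style markup; swap in your engine's own tag syntax.
tagged_line = "{npc_name}说:<color=#FFD700>你终于来了。</color>"

inputs = tokenizer(tagged_line, return_tensors="pt").to(device)
tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("en"), num_beams=5)
translation = tokenizer.batch_decode(tokens, skip_special_tokens=True)[0]

# Check that every tag from the source line survives in the output.
tags = re.findall(r"\{[^}]+\}|</?color[^>]*>", tagged_line)
missing = [tag for tag in tags if tag not in translation]
print(translation)
print("Missing tags:", missing or "none")
```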

Translation sample and comparison with Google Translate and DeepL : [Link to Spreadsheet](https://docs.google.com/spreadsheets/d/1J1i9P0nyI9q5-m2iZGSUatt3ZdHSxU8NOp9tJH7wxsk/edit?usp=sharing)
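For an actual first MT pass you usually want to translate a whole file rather than one string. Here is a minimal batching sketch, reusing the `model`, `tokenizer`, and `device` from the sample script above (the file names and batch size are placeholders, adjust them to your project and hardware):

```python
# Translate a UTF-8 text file with one Chinese line per line, in small batches.
with open("dialogue_zh.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

batch_size = 16  # placeholder; tune to your GPU memory
translations = []
for i in range(0, len(lines), batch_size):
    batch = tokenizer(lines[i:i + batch_size], return_tensors="pt", padding=True).to(device)
    tokens = model.generate(**batch, forced_bos_token_id=tokenizer.get_lang_id("en"), num_beams=5)
    translations.extend(tokenizer.batch_decode(tokens, skip_special_tokens=True))

with open("dialogue_en.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(translations))
```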