Release Notes

  • This model is fine-tuned from larryvrh/mt5-translation-ja_zh; great appreciation to the creator.

  • Reason for making this model
    I was testing various models for translating Japanese game text into Chinese. Game text has very rigid correspondence requirements: some tokens must be translated, some tokens must stay exactly as they are, and the number of lines in the text must be preserved (a small validation sketch of this constraint follows these notes). mT5 does comparatively well here because its pretraining covers this kind of correspondence, and since larryvrh had already pretrained it for translation, I fine-tuned directly on top of that work. Game text rarely exceeds 100 characters, so larryvrh's model already matched the requirements almost completely; this "supervised" fine-tune fixes some positioning issues in the aligned translations and trains in some vocabulary the project needed.

  • Known limitations
    Only the mt5-large variant exists so far; it needs roughly 8 GB or more of VRAM, which is considerably more than the task requires. For convenience it is set up to push large batches through in one go, which makes full use of the GPU but means it never looks at surrounding context, and I consider that a major drawback. The dataset does not contain enough fixed translation vocabulary, so for many terms it will answer in another language it knows (usually English). After some corrective effort, it will now zero-shot you a phonetic mishearing instead (when this zero-shot behavior first appeared, nobody in our translation group could keep a straight face).
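To make that correspondence constraint concrete, here is a minimal validation sketch. It is entirely my own illustration rather than anything shipped with the model, and the placeholder pattern is an assumption about typical game-script markup:

import re

# Assumed placeholder formats (e.g. %name, {0}); adjust to your game engine.
PLACEHOLDER = re.compile(r'%\w+|\{\d+\}')

def keeps_structure(src: str, dst: str) -> bool:
    # The translation must keep the same number of lines...
    if src.count('\n') != dst.count('\n'):
        return False
    # ...and every placeholder token must survive verbatim.
    return sorted(PLACEHOLDER.findall(src)) == sorted(PLACEHOLDER.findall(dst))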

Simple Backend Application

Not yet stably debugged, so use with caution. Change the model name in the settings to this model's name and launch.

Usage Guide

from transformers import pipeline

model_name = "iryneko571/mt5-translation-ja_zh-game-large"

# The tokenizer is loaded from the same checkpoint automatically.
pipe = pipeline(
    "translation",
    model=model_name,
    repetition_penalty=1.4,
    batch_size=1,
    max_length=256,
)

def translate_batch(batch, language='<-ja2zh->'):  # batch is a list of strings
    # Prepend the language tag the model expects to every line.
    prompts = [f'{language} {line}' for line in batch]
    translated = pipe(prompts)
    return [item['translation_text'] for item in translated]

inputs = []  # fill with the lines you want translated

print(translate_batch(inputs))
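A quick smoke test with a couple of sample lines (the inputs and the expected output here are my own illustrations, not from the model card):

inputs = [
    'こんにちは、世界',
    'セーブデータが見つかりません',
]
print(translate_batch(inputs))
# Expected shape: a list of two Chinese strings, e.g. ['你好,世界', ...]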

Simple Web UI

I mean, nobody is stopping you from hooking up a Gradio front end yourself; if you ask for one in the community tab, I will make one. I am currently working on a more enterprise-grade approach, which will take a while since the pages need to be coded.

  • integrate with XUnity.AutoTranslator
    • connect to Redis to block massive request floods (and harvest data)
    • handle the different linebreak styles such as \\n, \n and \r\n (see the preprocessing sketch after this list)
  • add support for translating whole JSON data files
    • also filter out the non-Japanese text
      • and hope the model leaves embedded code intact
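As an illustration of the linebreak handling and Japanese-text filtering mentioned above, here is a minimal preprocessing sketch; the detection heuristics are my own assumptions, not the planned implementation, and it reuses the translate_batch helper from the usage guide:

import re

def contains_japanese(text: str) -> bool:
    # Kana is a strong signal for Japanese; bare CJK ideographs could
    # already be Chinese, so only the kana ranges count here.
    return re.search(r'[\u3040-\u30ff]', text) is not None

def split_lines(text: str) -> tuple[list[str], str]:
    # Game scripts mix the literal two-character sequence "\n" with real
    # \r\n and \n breaks; detect which one is used so it can be restored.
    for sep in ('\\n', '\r\n', '\n'):
        if sep in text:
            return text.split(sep), sep
    return [text], '\n'

def translate_block(text: str) -> str:
    lines, sep = split_lines(text)
    out = [line if not contains_japanese(line) else translate_batch([line])[0]
           for line in lines]
    return sep.join(out)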

Roadmap

train mt5-small and RWKV (RWKV can read context)
make a LoRA training script and UI, so the training rig is set up and convenient to use
create an algorithm that saves low-confidence translations into a database for manual correction (a sketch follows below)
search the manual-translation database with sentencepiece-based retrieval to surface similar "previous translations", greatly improving the consistency of the model's word choices
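One plausible way to flag low-confidence translations, sketched here under my own assumptions rather than as the planned algorithm (the threshold in particular is arbitrary), is to average the per-token log-probabilities that transformers exposes through compute_transition_scores:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "iryneko571/mt5-translation-ja_zh-game-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_with_confidence(text, threshold=-1.0):
    inputs = tokenizer(f'<-ja2zh-> {text}', return_tensors='pt')
    out = model.generate(**inputs, max_length=256,
                         return_dict_in_generate=True, output_scores=True)
    # Log-probability of each generated token under the model.
    scores = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True)
    confidence = scores[0].mean().item()  # average log-prob per token
    translation = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    # Anything below the (assumed) threshold would be queued for manual review.
    return translation, confidence, confidence < threshold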

How to Find Me

Discord Server:
https://discord.gg/JmjPmJjA
Join if you need help, want to try the latest test builds, or just want to chat and see what I am up to (is posting this allowed here?).
