---
language:
- en
- ja
tags:
- nllb
license: cc-by-nc-4.0
---

# NLLB 1.3B fine-tuned on Japanese to English Light Novel translation

This model was fine-tuned on light novels and web novels for Japanese-to-English translation.

It can translate sentences and paragraphs of up to 512 tokens.
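
For text longer than the 512-token limit, one option is to split it into smaller chunks and translate each chunk separately. The sketch below is only an illustration, not part of the model: `split_into_chunks` and its `count_tokens` parameter are hypothetical names, and it assumes sentences end with the Japanese full stop `。`.

```python
import re
from typing import Callable, List


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    count_tokens is supplied by the caller (hypothetical parameter), so the
    helper does not depend on a specific tokenizer. A single sentence that is
    longer than max_tokens is kept as its own oversized chunk.
    """
    # Split after each Japanese full stop, keeping the full stop attached.
    sentences = [s for s in re.split(r"(?<=。)", text) if s]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = current + sentence
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the tokenizer from the Usage section, `count_tokens` could be `lambda s: len(tokenizer(s).input_ids)`.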


## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("thefrigidliquidation/nllb-jaen-1.3B-lightnovels")
model = AutoModelForSeq2SeqLM.from_pretrained("thefrigidliquidation/nllb-jaen-1.3B-lightnovels")

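# Tokenize the Japanese source text (example sentence; replace with your own).
text = "マイン、ルッツが迎えに来たよ"
inputs = tokenizer(text, return_tensors="pt")

generated_tokens = model.generate(
    **inputs,
    # Force the first generated token to be the target-language code.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tokenizer.tgt_lang),
    max_new_tokens=1024,
    no_repeat_ngram_size=6,
).cpu()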

translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
```

Generating with diverse beam search seems to work best. Add the following arguments to the `model.generate` call:
```python
num_beams=8,
num_beam_groups=4,
do_sample=False,
```


## Glossary
You can provide up to 10 custom translations for nouns and character names at runtime. To do so, wrap each Japanese term in term tokens: prefix it with one of `<t0>, <t1>, ..., <t9>` and suffix it with `</t>`. The model outputs the prefix token in place of the term, and you can then replace that token with your preferred translation using a plain string replacement.

For example, in `マイン、ルッツが迎えに来たよ`, if you wish to have `マイン` translated as `Myne`, you would replace `マイン` with `<t0>マイン</t>`. The model will translate `<t0>マイン</t>、ルッツが迎えに来たよ` as `<t0>, Lutz is here to pick you up.` Then simply do a string replacement on the output, replacing `<t0>` with `Myne`.
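
The tagging and replacement steps above can be sketched as two small helpers. This is a minimal illustration rather than code shipped with the model: `tag_terms` and `restore_terms` are hypothetical names, and the sketch assumes glossary terms do not overlap or contain one another.

```python
from typing import Dict, Tuple


def tag_terms(text: str, glossary: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Wrap each glossary term found in text with <tN>...</t> term tokens.

    Returns the tagged text and a mapping from term token (e.g. "<t0>")
    to the desired English translation.
    """
    if len(glossary) > 10:
        raise ValueError("At most 10 glossary terms are supported (<t0> to <t9>).")

    replacements: Dict[str, str] = {}
    for index, (source, target) in enumerate(glossary.items()):
        token = f"<t{index}>"
        if source in text:
            text = text.replace(source, f"{token}{source}</t>")
            replacements[token] = target
    return text, replacements


def restore_terms(translation: str, replacements: Dict[str, str]) -> str:
    """Replace term tokens in the model output with the glossary translations."""
    for token, target in replacements.items():
        translation = translation.replace(token, target)
    return translation


# Example using the sentence from the text above.
tagged, replacements = tag_terms("マイン、ルッツが迎えに来たよ", {"マイン": "Myne"})
# tagged is passed to the tokenizer and model; the model output is then restored.
final = restore_terms("<t0>, Lutz is here to pick you up.", replacements)
```

Pass `tagged` to the tokenizer in place of the raw sentence, and call `restore_terms` on the decoded output.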


## Honorifics
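You can encourage the model to generate or omit honorifics by setting `tokenizer.tgt_lang` before generating. The values below act as control codes rather than naming the output language.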

```python
# default, the model decides whether to use honorifics
tokenizer.tgt_lang = "jpn_Jpan"
# no honorifics, the model is discouraged from using honorifics
tokenizer.tgt_lang = "zsm_Latn"
# honorifics, the model is encouraged to use honorifics
tokenizer.tgt_lang = "zul_Latn"
```