---
language:
- es
- zh

tags:
- translation

license: apache-2.0
---

# HelsinkiNLP-FineTuned-Legal-es-zh

This model is a fine-tuned version of [Helsinki-NLP/opus-tatoeba-es-zh](https://huggingface.co/Helsinki-NLP/opus-tatoeba-es-zh) on a legal-domain dataset constructed by the author.
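
For experimentation, the model can be loaded with the 🤗 Transformers `Auto*` classes. This is a minimal sketch: the Hub repo ID below is an assumption, and the underlying OPUS model may expect a target-language token (e.g. `>>cmn_Hans<<`) prepended to the source sentence.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "guocheng98/HelsinkiNLP-FineTuned-Legal-es-zh"  # assumed repo ID, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate a short Spanish legal sentence into Chinese.
src = "El contrato se perfecciona por el mero consentimiento."
batch = tokenizer([src], return_tensors="pt")
generated = model.generate(**batch, max_length=128)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```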

## Intended uses & limitations

This model is the result of the master's thesis for the Tradumatics: Translation Technologies program at the Autonomous University of Barcelona. Please refer to the GitHub repository created for this thesis for the full text and related open-source materials: https://github.com/guocheng98/MUTTT2020_TFM_ZGC

The thesis explains various theories and selected algorithmic details of neural machine translation; this fine-tuned model therefore serves only as a hands-on practice example for that objective and is not intended for production use.

## Training and evaluation data

The dataset is constructed from the Chinese translations of the Spanish Civil Code, the Spanish Constitution, and many other laws and regulations found in the China Law Info database (北大法宝 Beida Fabao), together with their source texts found in the Boletín Oficial del Estado and on EUR-Lex.

The dataset contains 9,972 sentence pairs; 1,000 are used for evaluation and the rest for training.
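
The held-out evaluation set can be reproduced with a fixed-seed split, for example with the 🤗 `datasets` library. This is only a sketch: the file name and column layout of the parallel corpus are assumptions, not part of the released materials.

```python
from datasets import load_dataset

# Assumed: a CSV with one Spanish/Chinese sentence pair per row ("es", "zh" columns).
dataset = load_dataset("csv", data_files="legal_es_zh.csv")["train"]

# Hold out 1,000 pairs for evaluation, keeping the remaining 8,972 for training.
split = dataset.train_test_split(test_size=1000, seed=42)
train_data, eval_data = split["train"], split["test"]
print(len(train_data), len(eval_data))  # 8972 1000
```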

## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 2000
- num_epochs: 10
- mixed_precision_training: Native AMP
- weight_decay: 0.01
- early_stopping_patience: 8
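
For reference, the list above maps onto `Seq2SeqTrainingArguments` roughly as sketched below. The output directory, evaluation cadence, and best-model metric are assumptions (the card only shows evaluation every 400 steps in the results table below); the Adam betas and epsilon match the library defaults.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="./finetuned-legal-es-zh",  # assumed
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=2000,
    num_train_epochs=10,
    fp16=True,                   # Native AMP mixed-precision training
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=400,              # assumed from the results table below
    save_steps=400,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Early stopping with patience 8 is supplied to the Trainer as a callback:
# Seq2SeqTrainer(..., args=training_args, callbacks=[EarlyStoppingCallback(early_stopping_patience=8)])
```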

## Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 2.9584        | 0.36  | 400  | 2.6800          |
| 2.6402        | 0.71  | 800  | 2.5017          |
| 2.5038        | 1.07  | 1200 | 2.3907          |
| 2.3279        | 1.43  | 1600 | 2.2999          |
| 2.2258        | 1.78  | 2000 | 2.2343          |
| 2.1061        | 2.14  | 2400 | 2.1961          |
| 1.9279        | 2.5   | 2800 | 2.1569          |
| 1.9059        | 2.85  | 3200 | 2.1245          |
| 1.7491        | 3.21  | 3600 | 2.1227          |
| 1.6301        | 3.57  | 4000 | 2.1169          |
| 1.6871        | 3.92  | 4400 | 2.0979          |
| 1.5203        | 4.28  | 4800 | 2.1074          |
| 1.4646        | 4.63  | 5200 | 2.1024          |
| 1.4739        | 4.99  | 5600 | 2.0905          |
| 1.338         | 5.35  | 6000 | 2.0946          |
| 1.3152        | 5.7   | 6400 | 2.0974          |
| 1.306         | 6.06  | 6800 | 2.0985          |
| 1.1991        | 6.42  | 7200 | 2.0962          |
| 1.2113        | 6.77  | 7600 | 2.1092          |
| 1.1983        | 7.13  | 8000 | 2.1060          |
| 1.1238        | 7.49  | 8400 | 2.1102          |
| 1.1417        | 7.84  | 8800 | 2.1078          |

## Framework versions

- Transformers 4.7.0
- Pytorch 1.8.1+cu101
- Datasets 1.8.0
- Tokenizers 0.10.3