Mainak Manna committed on
Commit • 5ed79a1
1 Parent(s): 9b32f6c
First version of the model
Browse files:
- README.md +69 -0
- config.json +36 -0
- pytorch_model.bin +3 -0
- spiece.model +3 -0
README.md
ADDED
@@ -0,0 +1,69 @@
---
language: Italian Swedish
tags:
- translation Italian Swedish model
datasets:
- dcep europarl jrc-acquis
widget:
- text: "Inoltre, come è emerso da un discorso pronunciato dal direttore del Centro europeo per la prevenzione e il controllo delle malattie (ECDC) in occasione della riunione dell'EPSCO svoltasi il 6 giugno 2011, gli Stati membri dell'UE sono i paesi di maggiore diffusione del morbillo nel mondo sviluppato."

---

# legal_t5_small_trans_it_sv model

Model for translating legal text from Italian to Swedish. It was first released in
[this repository](https://github.com/agemagician/LegalTrans). The model was trained on three parallel corpora: JRC-ACQUIS, Europarl and DCEP.


## Model description

legal_t5_small_trans_it_sv is based on the `t5-small` model and was trained on a large corpus of parallel text. It scales the baseline t5 model down by using `dmodel = 512`, `dff = 2,048`, 8-headed attention, and only 6 layers each in the encoder and decoder. This variant has about 60 million parameters.
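That parameter count can be checked with a back-of-the-envelope calculation from the dimensions above (a rough sketch that ignores layer norms and the relative-attention bias tables, which contribute comparatively little):

```python
# Rough parameter count for this t5-small variant, using the dimensions
# from the model description: d_model=512, d_ff=2048, 8 heads,
# 6 encoder + 6 decoder layers, and a 32,128-token vocabulary.
d_model, d_ff, vocab = 512, 2048, 32128
n_enc = n_dec = 6

embedding = vocab * d_model      # shared input/output embedding matrix
attn = 4 * d_model * d_model     # Q, K, V, and output projections
ffn = 2 * d_model * d_ff         # the two feed-forward matrices
enc_layer = attn + ffn           # self-attention + FFN
dec_layer = 2 * attn + ffn       # self-attention + cross-attention + FFN

total = embedding + n_enc * enc_layer + n_dec * dec_layer
print(f"{total / 1e6:.1f}M parameters")  # roughly 60M, matching the card
```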
## Intended uses & limitations

The model can be used for translating legal texts from Italian to Swedish.

### How to use

Here is how to use this model to translate legal text from Italian to Swedish in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelWithLMHead, TranslationPipeline

pipeline = TranslationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/legal_t5_small_trans_it_sv"),
    tokenizer=AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path="SEBIS/legal_t5_small_trans_it_sv",
        do_lower_case=False,
        skip_special_tokens=True,
    ),
    device=0,  # GPU index; use device=-1 to run on CPU
)

it_text = "Inoltre, come è emerso da un discorso pronunciato dal direttore del Centro europeo per la prevenzione e il controllo delle malattie (ECDC) in occasione della riunione dell'EPSCO svoltasi il 6 giugno 2011, gli Stati membri dell'UE sono i paesi di maggiore diffusione del morbillo nel mondo sviluppato."

pipeline([it_text], max_length=512)
```

## Training data

The legal_t5_small_trans_it_sv model was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), [EUROPARL](https://www.statmt.org/europarl/), and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, consisting of 5 million parallel texts.

## Training procedure

### Preprocessing

### Pretraining

A unigram model with 88M parameters is trained over the complete parallel corpus to obtain the vocabulary (with byte pair encoding), which is used with this model.

## Evaluation results

When the model is used on the translation test dataset, it achieves the following results:

Test results:

| Model | BLEU score |
|:-----:|:-----:|
| legal_t5_small_trans_it_sv | 39.17 |
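For context, BLEU combines clipped n-gram precisions with a brevity penalty. A minimal self-contained sketch of corpus-level BLEU is shown below; scores reported in model cards are typically computed with a standard tool such as sacreBLEU, so exact numbers may differ from this simplified version:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: clipped n-gram precisions + brevity penalty."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter intersection implements the "clipping" of matches.
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0 or min(totals) == 0:
        return 0.0  # some n-gram order has no overlap at all
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_prec)
```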

### BibTeX entry and citation info
config.json
ADDED
@@ -0,0 +1,36 @@
{
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "translation_it_to_sv": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 512,
      "min_length": 1,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "translate Italian to Swedish : "
    }
  },
  "use_cache": true,
  "vocab_size": 32128
}
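The `task_specific_params` block supplies beam-search defaults and the task prefix that the translation pipeline prepends to each input. A minimal stdlib sketch of reading those values (the `build_model_input` helper is illustrative, not part of the transformers library):

```python
import json

# Generation defaults copied from the task_specific_params section
# of config.json above.
config = json.loads("""
{
  "task_specific_params": {
    "translation_it_to_sv": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 512,
      "min_length": 1,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "translate Italian to Swedish : "
    }
  }
}
""")

params = config["task_specific_params"]["translation_it_to_sv"]

def build_model_input(text: str) -> str:
    """Prepend the task prefix, as the pipeline does before tokenization."""
    return params["prefix"] + text
```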
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:00f7be5d7ae76e13a2a926706e100df4f5ad7abb64773ed8114ce4b66d44238b
size 242087498
spiece.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:53b3c9b1becca02342bbf2c8b00abe9154fb0fc8dbe8c71ad506537b2222523a
size 840425