Mainak Manna committed on
Commit
5ed79a1
1 Parent(s): 9b32f6c

First version of the model

Files changed (4)
  1. README.md +69 -0
  2. config.json +36 -0
  3. pytorch_model.bin +3 -0
  4. spiece.model +3 -0
README.md ADDED
@@ -0,0 +1,69 @@
+
+ ---
+ language: Italian Swedish
+ tags:
+ - translation Italian Swedish model
+ datasets:
+ - dcep europarl jrc-acquis
+ widget:
+ - text: "Inoltre, come è emerso da un discorso pronunciato dal direttore del Centro europeo per la prevenzione e il controllo delle malattie (ECDC) in occasione della riunione dell'EPSCO svoltasi il 6 giugno 2011, gli Stati membri dell'UE sono i paesi di maggiore diffusione del morbillo nel mondo sviluppato."
+
+ ---
+
+ # legal_t5_small_trans_it_sv model
+
+ Model for translating legal text from Italian to Swedish. It was first released in
+ [this repository](https://github.com/agemagician/LegalTrans). The model was trained on three parallel corpora: JRC-Acquis, Europarl and DCEP.
+
+
+ ## Model description
+
+ legal_t5_small_trans_it_sv is based on the `t5-small` model and was trained on a large corpus of parallel text. This smaller variant scales the baseline t5 model down to `d_model = 512`, `d_ff = 2,048`, 8-headed attention, and only 6 layers each in the encoder and decoder, giving about 60 million parameters.
+
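As a rough sanity check, the ~60 million figure can be reproduced from these hyperparameters. The sketch below is a back-of-the-envelope estimate that ignores layer norms and relative-position bias tables, which contribute comparatively few parameters:

```python
# Rough parameter count for the t5-small architecture described above.
# Assumes the standard T5 layout: a shared input/output embedding,
# 6 encoder layers (self-attention + FFN) and 6 decoder layers
# (self-attention + cross-attention + FFN).
d_model, d_ff, n_heads, d_kv = 512, 2048, 8, 64
n_layers, vocab = 6, 32128

embedding = vocab * d_model                # shared embedding matrix
attn = 4 * d_model * (n_heads * d_kv)      # Q, K, V, O projections
ffn = 2 * d_model * d_ff                   # wi and wo (ReLU variant)

encoder = n_layers * (attn + ffn)          # one attention block per layer
decoder = n_layers * (2 * attn + ffn)      # self- plus cross-attention

total = embedding + encoder + decoder
print(f"{total / 1e6:.1f}M parameters")    # prints "60.5M parameters"
```

The estimate lands close enough to the stated ~60M to confirm the configuration is consistent with `t5-small`.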
+ ## Intended uses & limitations
+
+ The model can be used to translate legal texts from Italian to Swedish.
+
+ ### How to use
+
+ Here is how to use this model to translate legal text from Italian to Swedish in PyTorch:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TranslationPipeline
+
+ pipeline = TranslationPipeline(
+     model=AutoModelForSeq2SeqLM.from_pretrained("SEBIS/legal_t5_small_trans_it_sv"),
+     tokenizer=AutoTokenizer.from_pretrained("SEBIS/legal_t5_small_trans_it_sv", do_lower_case=False),
+     device=0,  # GPU 0; use device=-1 to run on CPU
+ )
+
+ it_text = "Inoltre, come è emerso da un discorso pronunciato dal direttore del Centro europeo per la prevenzione e il controllo delle malattie (ECDC) in occasione della riunione dell'EPSCO svoltasi il 6 giugno 2011, gli Stati membri dell'UE sono i paesi di maggiore diffusione del morbillo nel mondo sviluppato."
+
+ pipeline([it_text], max_length=512)
+ ```
+
+ ## Training data
+
+ The legal_t5_small_trans_it_sv model was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), [EUROPARL](https://www.statmt.org/europarl/), and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, consisting of 5 million parallel texts.
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ ### Pretraining
+ A unigram model with 88M parameters was trained over the complete parallel corpus to obtain the vocabulary (with byte pair encoding), which is used with this model.
+
+
+ ## Evaluation results
+
+ When used on the translation test dataset, the model achieves the following results:
+
+ Test results:
+
+ | Model | BLEU score |
+ |:-----:|:-----:|
+ | legal_t5_small_trans_it_sv | 39.17 |
+
+
+ ### BibTeX entry and citation info
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "architectures": [
+     "T5ForConditionalGeneration"
+   ],
+   "d_ff": 2048,
+   "d_kv": 64,
+   "d_model": 512,
+   "decoder_start_token_id": 0,
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "relu",
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "t5",
+   "n_positions": 512,
+   "num_decoder_layers": 6,
+   "num_heads": 8,
+   "num_layers": 6,
+   "output_past": true,
+   "pad_token_id": 0,
+   "relative_attention_num_buckets": 32,
+   "task_specific_params": {
+     "translation_it_to_sv": {
+       "early_stopping": true,
+       "length_penalty": 2.0,
+       "max_length": 512,
+       "min_length": 1,
+       "no_repeat_ngram_size": 3,
+       "num_beams": 4,
+       "prefix": "translate Italian to Swedish : "
+     }
+   },
+   "use_cache": true,
+   "vocab_size": 32128
+ }
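The `task_specific_params` block above is what supplies the translation prefix and beam-search defaults when the model is used in a pipeline. A minimal sketch of reading these settings with the standard `json` module, with the relevant fragment of the config embedded inline for illustration:

```python
import json

# The task-specific generation settings, copied from config.json above.
config_text = """
{
  "task_specific_params": {
    "translation_it_to_sv": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 512,
      "min_length": 1,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "translate Italian to Swedish : "
    }
  }
}
"""
config = json.loads(config_text)
params = config["task_specific_params"]["translation_it_to_sv"]

# T5 is a text-to-text model: the prefix is prepended to every source
# sentence so the model knows which translation direction to perform.
model_input = params["prefix"] + "Gli Stati membri dell'UE ..."
print(params["num_beams"])        # prints 4
print(model_input.split(" : ")[0])  # prints "translate Italian to Swedish"
```

Beam search with `num_beams=4`, `length_penalty=2.0`, and `no_repeat_ngram_size=3` is a fairly standard decoding setup for translation-sized T5 models.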
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:00f7be5d7ae76e13a2a926706e100df4f5ad7abb64773ed8114ce4b66d44238b
+ size 242087498
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:53b3c9b1becca02342bbf2c8b00abe9154fb0fc8dbe8c71ad506537b2222523a
+ size 840425
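The two binary files above (`pytorch_model.bin` and `spiece.model`) are stored as Git LFS pointers: the repository holds only a three-line `key value` text stub, while the actual blob lives in LFS storage keyed by its SHA-256. A minimal sketch of parsing such a pointer, using a hypothetical helper that is not part of this repo:

```python
# Parse a Git LFS pointer file of the form shown above:
#   version <spec-url>
#   oid sha256:<hex digest of the real blob>
#   size <byte count of the real blob>
def parse_lfs_pointer(text: str) -> dict:
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:53b3c9b1becca02342bbf2c8b00abe9154fb0fc8dbe8c71ad506537b2222523a
size 840425
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # prints 840425 (the size of the real spiece.model blob)
```

The `size` fields indicate roughly a 242 MB model checkpoint and an 840 KB SentencePiece vocabulary, consistent with a t5-small-sized model.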