|
|
|
--- |
|
language: Czech Swedish
|
tags: |
|
- translation Czech Swedish model
|
datasets: |
|
- dcep europarl jrc-acquis |
|
--- |
|
|
|
# legal_t5_small_trans_cs_sv model |
|
|
|
Model for translating legal text from Czech to Swedish. It was first released in

[this repository](https://github.com/agemagician/LegalTrans). This model was trained on three parallel corpora: JRC-Acquis, Europarl and DCEP.
|
|
|
|
|
## Model description |
|
|
|
legal_t5_small_trans_cs_sv is based on the `t5-small` model and was trained on a large corpus of parallel text. This is a smaller model, which scales the baseline t5 model down by using `d_model = 512`, `d_ff = 2,048`, 8-headed attention, and only 6 layers each in the encoder and decoder. This variant has about 60 million parameters.
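
As a quick sanity check, these hyper-parameters can be read off the published checkpoint's configuration (a minimal sketch, assuming the checkpoint exposes the standard T5 config fields):

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Inspect the architecture hyper-parameters described above.
config = AutoConfig.from_pretrained("SEBIS/legal_t5_small_trans_cs_sv")
print(config.d_model, config.d_ff, config.num_heads, config.num_layers)  # expected: 512 2048 8 6

# Count parameters; this should come out at roughly 60 million.
model = AutoModelForSeq2SeqLM.from_pretrained("SEBIS/legal_t5_small_trans_cs_sv")
print(sum(p.numel() for p in model.parameters()))
```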
|
|
|
## Intended uses & limitations |
|
|
|
The model can be used to translate legal text from Czech to Swedish.
|
|
|
### How to use |
|
|
|
Here is how to use this model to translate legal text from Czech to Swedish in PyTorch:
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TranslationPipeline

pipeline = TranslationPipeline(
    model=AutoModelForSeq2SeqLM.from_pretrained("SEBIS/legal_t5_small_trans_cs_sv"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/legal_t5_small_trans_cs_sv", do_lower_case=False),
    device=0,  # GPU device index; use device=-1 to run on CPU
)

cs_text = "Sådana utmaningar inkluderar: medborgarnas fria rörlighet och idrottarnas nationalitet, spelarövergångar (problem avseende handlingarnas legalitet och transparens i finansflöden), idrottstävlingars integritet och en europeisk dialog inom idrottssektorn."

pipeline([cs_text], max_length=512)
```
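
The pipeline wrapper is optional; the same translation can be produced by calling the tokenizer and `generate()` directly (a minimal sketch using the `cs_text` string defined above):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("SEBIS/legal_t5_small_trans_cs_sv")
model = AutoModelForSeq2SeqLM.from_pretrained("SEBIS/legal_t5_small_trans_cs_sv")

# Encode the source text, generate the translation, and decode it back to a string.
inputs = tokenizer(cs_text, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```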
|
|
|
## Training data |
|
|
|
The legal_t5_small_trans_cs_sv model was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), [EUROPARL](https://www.statmt.org/europarl/), and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, which together comprise 5 million parallel texts.
|
|
|
## Training procedure |
|
|
|
### Preprocessing

A unigram model trained on 88M lines of text from the parallel corpus (of all possible language pairs) was used to get the vocabulary (with byte pair encoding), which is used with this model.
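
The vocabulary-building script is not included in this card; as an illustration only, a unigram vocabulary of this kind is typically built with SentencePiece (the file name and vocabulary size below are placeholders, not values from the card):

```python
import sentencepiece as spm

# Train a unigram SentencePiece model over the parallel corpus.
# "parallel_corpus.txt" and vocab_size=32000 are placeholders, not values from the card.
spm.SentencePieceTrainer.train(
    input="parallel_corpus.txt",
    model_prefix="legal_t5_vocab",
    model_type="unigram",
    vocab_size=32000,
)
```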
|
|
|
|
|
## Evaluation results |
|
|
|
When the model is used on the translation test dataset, it achieves the following results:



Test results:
|
|
|
| Model | BLEU score |
|:-----:|:-----:|
| legal_t5_small_trans_cs_sv | 47.9 |
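
The held-out test set behind this score is not distributed with the model; purely as an illustration, a corpus-level BLEU score of this kind can be computed with `sacrebleu` (the `predictions` and `references` lists here are hypothetical):

```python
import sacrebleu

# Hypothetical model outputs and reference translations for a test set.
predictions = ["en europeisk dialog inom idrottssektorn"]   # model translations
references = [["en europeisk dialog inom idrottssektorn"]]  # one list of references per reference set

bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU: {bleu.score:.1f}")
```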
|
|
|
|
|
### BibTeX entry and citation info |
|
|