Mainak Manna
First version of the model
---
language: Czech Spanish
tags:
- translation Czech Spanish model
datasets:
- dcep europarl jrc-acquis
---
# legal_t5_small_trans_cs_es model
Model for translating legal text from Czech to Spanish. It was first released in
[this repository](https://github.com/agemagician/LegalTrans). This model is trained on three parallel corpora: jrc-acquis, europarl and dcep.
## Model description
legal_t5_small_trans_cs_es is based on the `t5-small` model and was trained on a large corpus of parallel text. It is a smaller model that scales the T5 baseline down by using `dmodel = 512`, `dff = 2,048`, 8-headed attention, and only 6 layers each in the encoder and decoder. This variant has about 60 million parameters.
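As a rough sanity check on the "about 60 million parameters" figure, the count can be estimated from the dimensions above. The sketch below is only an approximation under stated assumptions: the vocabulary size (32,128), the per-head key/value size (64), and the 32 relative-position buckets are taken from the standard `t5-small` configuration and are not given in this card, so the vocabulary of this particular model may differ. The function name `estimate_t5_params` is hypothetical.

```python
def estimate_t5_params(
    d_model=512,       # from the model description
    d_ff=2048,         # from the model description
    n_heads=8,         # from the model description
    n_layers=6,        # encoder and decoder each, from the model description
    d_kv=64,           # ASSUMPTION: standard t5-small value
    vocab_size=32128,  # ASSUMPTION: standard t5-small value
    rel_buckets=32,    # ASSUMPTION: standard t5-small value
):
    """Estimate the parameter count of a T5-style encoder-decoder."""
    inner = n_heads * d_kv
    attention = 4 * d_model * inner      # q, k, v, o projections, no biases
    feed_forward = 2 * d_model * d_ff    # two linear layers, no biases
    layer_norm = d_model                 # T5 layer norm has a scale vector only

    # Encoder layer: self-attention + feed-forward, each with a layer norm.
    encoder_layer = attention + feed_forward + 2 * layer_norm
    # Decoder layer: self-attention + cross-attention + feed-forward.
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

    rel_bias = rel_buckets * n_heads     # one relative-position table per stack
    encoder = n_layers * encoder_layer + layer_norm + rel_bias
    decoder = n_layers * decoder_layer + layer_norm + rel_bias
    embeddings = vocab_size * d_model    # shared input/output embedding

    return embeddings + encoder + decoder


if __name__ == "__main__":
    print(f"{estimate_t5_params():,} parameters")
```

Under these assumptions the estimate comes to roughly 60.5 million, which is consistent with the figure quoted above.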
## Intended uses & limitations
The model can be used to translate legal texts from Czech to Spanish.
### How to use
Here is how to use this model to translate legal text from Czech to Spanish in PyTorch:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TranslationPipeline

model_name = "SEBIS/legal_t5_small_trans_cs_es"

pipeline = TranslationPipeline(
    model=AutoModelForSeq2SeqLM.from_pretrained(model_name),
    tokenizer=AutoTokenizer.from_pretrained(model_name, do_lower_case=False),
    device=0,  # first GPU; use device=-1 to run on CPU
)

# NOTE: the sample sentence shipped with this card is Spanish, not Czech.
# Replace it with Czech source text before translating.
cs_text = "Constata que entre 2004 y 2007 se pusieron a disposición de Bulgaria 650 000 000 EUR en el fondo PHARE, 226 000 000 EUR en el fondo SAPARD, y 440 500 000 EUR en el fondo ISPA; que entre 2004 y 2007 se pusieron a disposición de Rumanía unos 1 346 500 000 EUR en el fondo PHARE, 526 300 000 EUR en el fondo SAPARD, y 1 040 500 000 EUR en el fondo ISPA;"

print(pipeline([cs_text], max_length=512))
```
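For more than a handful of sentences, it may help to feed the pipeline in batches and to fall back to CPU when no GPU is present, since `device=0` assumes one. The following is a minimal sketch, not part of the original release: `translate_in_batches` and `build_translator` are hypothetical helper names, and the batch size of 8 is an arbitrary choice. It relies only on the pipeline returning a list of dictionaries with a `translation_text` key.

```python
from typing import Callable, List, Sequence


def translate_in_batches(
    texts: Sequence[str],
    translator: Callable,
    batch_size: int = 8,   # arbitrary illustrative default
    max_length: int = 512,
) -> List[str]:
    """Translate texts in fixed-size batches and return plain strings."""
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        outputs = translator(batch, max_length=max_length)
        # Each output item is a dict holding the translated string.
        results.extend(item["translation_text"] for item in outputs)
    return results


def build_translator():
    """Build the translation pipeline, using the GPU only if one is available."""
    # Imported lazily so the batching helper can be used without these packages.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TranslationPipeline

    name = "SEBIS/legal_t5_small_trans_cs_es"
    device = 0 if torch.cuda.is_available() else -1  # -1 selects the CPU
    return TranslationPipeline(
        model=AutoModelForSeq2SeqLM.from_pretrained(name),
        tokenizer=AutoTokenizer.from_pretrained(name),
        device=device,
    )
```

A typical call would be `translate_in_batches(czech_sentences, build_translator())`, where `czech_sentences` is a list of Czech source strings.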
## Training data
The legal_t5_small_trans_cs_es model was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), [EUROPARL](https://www.statmt.org/europarl/), and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, which together consist of 5 million parallel texts.
## Training procedure
### Preprocessing
### Pretraining
A unigram model with 88M parameters was trained over the complete parallel corpus to build the vocabulary (with byte pair encoding) that this model uses.
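To illustrate how a unigram vocabulary is applied once it exists, here is a toy sketch of unigram segmentation: each candidate piece has a probability, and the most probable split of a word is found with a Viterbi-style search. This is not the tokenizer shipped with the model. The function `unigram_segment` and the tiny vocabulary `TOY_PROBS` are hypothetical and exist only for illustration.

```python
import math

# ASSUMPTION: a made-up vocabulary with made-up probabilities.
TOY_PROBS = {"zákon": 0.04, "zá": 0.05, "kon": 0.05, "y": 0.10}
TOY_LOG_PROBS = {piece: math.log(p) for piece, p in TOY_PROBS.items()}


def unigram_segment(text, log_probs):
    """Return the most probable split of text into vocabulary pieces."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best[start] + log_probs[piece]
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


if __name__ == "__main__":
    print(unigram_segment("zákony", TOY_LOG_PROBS))
```

With this toy vocabulary, "zákony" is split into "zákon" and "y", because that two-piece split has a higher probability than "zá", "kon", "y".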
## Evaluation results
When evaluated on the translation test dataset, the model achieves the following results:
Test results:
| Model | BLEU score |
|:-----:|:-----:|
| legal_t5_small_trans_cs_es | 50.77 |
### BibTeX entry and citation info