---
language: Czech Swedish
tags:
- translation Czech Swedish model
datasets:
- dcep europarl jrc-acquis
---

# legal_t5_small_trans_cs_sv model

Pretrained model for translating legal text from Czech to Swedish. It was first released in
[this repository](https://github.com/agemagician/LegalTrans). The model was trained on three parallel corpora: jrc-acquis, europarl and dcep.

## Model description

legal_t5_small_trans_cs_sv is based on the `t5-small` model and was trained on a large corpus of parallel text. This is a smaller model, which scales down the baseline T5 model by using `d_model = 512`, `d_ff = 2,048`, 8-headed attention, and only 6 layers each in the encoder and decoder. This variant has about 60 million parameters.
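These hyperparameters match the standard `t5-small` configuration in Hugging Face `transformers`. A minimal sketch of an equivalent configuration, assuming the `transformers` library is available (the released checkpoint itself is loaded with `from_pretrained` in the usage example below):

```python
from transformers import T5Config, T5ForConditionalGeneration

# Illustrative only: a T5 configuration with the sizes described above
# (d_model=512, d_ff=2048, 8 attention heads, 6 encoder and 6 decoder layers).
config = T5Config(
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_layers=6,          # encoder layers
    num_decoder_layers=6,  # decoder layers
)
model = T5ForConditionalGeneration(config)
print(f"~{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```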

## Intended uses & limitations

The model can be used for translation of legal texts from Czech to Swedish.

### How to use

Here is how to use this model to translate legal text from Czech to Swedish in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelWithLMHead, TranslationPipeline

# Load the translation model and its tokenizer, and wrap them in a pipeline on GPU 0.
pipeline = TranslationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/legal_t5_small_trans_cs_sv"),
    tokenizer=AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path="SEBIS/legal_t5_small_trans_cs_sv",
        do_lower_case=False,
        skip_special_tokens=True,
    ),
    device=0,
)

cs_text = "Slutomröstning: närvarande ledamöter"

pipeline([cs_text], max_length=512)
```

## Training data

The legal_t5_small_trans_cs_sv model was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), [EUROPARL](https://www.statmt.org/europarl/), and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, consisting of 5 million parallel texts.

## Training procedure

### Preprocessing

A unigram model with 88M parameters is trained over the complete parallel corpus to get the vocabulary (with byte pair encoding), which is used with this model.
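A minimal sketch of how such a vocabulary could be built; the use of SentencePiece, the input file name, and the vocabulary size below are assumptions for illustration, not settings taken from this card:

```python
import sentencepiece as spm

# Hypothetical illustration: train a unigram subword model over the parallel corpus.
# "parallel_corpus.txt" and vocab_size=32000 are placeholders, not the actual settings.
spm.SentencePieceTrainer.train(
    input="parallel_corpus.txt",
    model_prefix="legal_t5_spm",
    model_type="unigram",
    vocab_size=32000,
)

# Tokenize a sample sentence with the resulting vocabulary.
sp = spm.SentencePieceProcessor(model_file="legal_t5_spm.model")
print(sp.encode("Slutomröstning: närvarande ledamöter", out_type=str))
```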

### Pretraining

## Evaluation results

When the model is used on the translation test dataset, it achieves the following results:

Test results:

| Model | BLEU score |
|:-----:|:-----:|
| legal_t5_small_trans_cs_sv | 47.9 |
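For reference, a score like this can be reproduced with a standard BLEU implementation; a minimal sketch using `sacrebleu` (the choice of `sacrebleu` and the example sentences are assumptions, not the card's original evaluation script):

```python
import sacrebleu

# Hypothetical illustration: hypotheses are the pipeline outputs, references the
# gold Swedish translations from the held-out test set.
hypotheses = ["Slutomröstning: närvarande ledamöter"]
references = [["Slutomröstning: närvarande ledamöter"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```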

### BibTeX entry and citation info