SEBIS
/

legal_t5_small_trans_it_cs


legal_t5_small_trans_it_cs / README.md

Mainak Manna

First version of the model

363175c over 3 years ago


	---
	language: Italian Czech
	tags:
	- translation Italian Czech model
	datasets:
	- dcep europarl jrc-acquis
	widget:
	- text: "k udělení absolutoria za plnění rozpočtu Evropské agentury pro chemické látky na rozpočtový rok 2009
	"
	---

	# legal_t5_small_trans_it_cs model

	A model for translating legal text from Italian to Czech. It was first released in
	[this repository](https://github.com/agemagician/LegalTrans). The model is trained on three parallel corpora: JRC-Acquis, Europarl, and DCEP.


	## Model description

	legal_t5_small_trans_it_cs is based on the `t5-small` model and was trained on a large corpus of parallel text. It is a smaller model that scales the T5 baseline down by using `d_model = 512`, `d_ff = 2048`, 8-headed attention, and only 6 layers each in the encoder and decoder. This variant has about 60 million parameters.
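
As a rough check of the parameter count quoted above, the following sketch tallies the weights of a T5-style encoder-decoder from the stated dimensions. The vocabulary size (32,128), the per-head dimension (64), the 32 relative-position buckets, and the shared embedding matrix are assumptions taken from the stock `t5-small` configuration, not from this card, and this model's own vocabulary may differ. The function name `t5_param_count` is hypothetical.

```python
def t5_param_count(d_model=512, d_ff=2048, n_heads=8, n_layers=6,
                   d_kv=64, vocab_size=32128, n_buckets=32):
    """Estimate the weight count of a T5-style encoder-decoder (no bias terms)."""
    inner = n_heads * d_kv                 # width of the attention projections
    attention = 4 * d_model * inner        # query, key, value, and output matrices
    feed_forward = 2 * d_model * d_ff      # two linear maps around the activation
    layer_norm = d_model                   # T5 layer norm has a scale vector only

    encoder_layer = attention + feed_forward + 2 * layer_norm
    # A decoder layer adds a cross-attention block and a third layer norm.
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

    embedding = vocab_size * d_model       # assumed shared with the output head
    rel_bias = 2 * n_buckets * n_heads     # one relative-position table per stack
    final_norms = 2 * layer_norm           # one final layer norm per stack

    return (embedding + rel_bias + final_norms
            + n_layers * (encoder_layer + decoder_layer))


total = t5_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

With these assumed defaults the tally comes to 60,506,624, in line with the roughly 60 million stated above.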

	## Intended uses & limitations

	The model can be used to translate legal texts from Italian to Czech.

	### How to use

	Here is how to use this model to translate legal text from Italian to Czech in PyTorch:

	```python
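	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TranslationPipeline

	model_name = "SEBIS/legal_t5_small_trans_it_cs"

	# Load the model and tokenizer from the Hugging Face Hub.
	# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead for T5 models.
	pipeline = TranslationPipeline(
	    model=AutoModelForSeq2SeqLM.from_pretrained(model_name),
	    tokenizer=AutoTokenizer.from_pretrained(
	        model_name, do_lower_case=False, skip_special_tokens=True
	    ),
	    device=0,  # first GPU; use device=-1 to run on CPU
	)

	# Sample text as given in the original card. Note that it is Czech;
	# replace it with the Italian text you want to translate.
	it_text = "k udělení absolutoria za plnění rozpočtu Evropské agentury pro chemické látky na rozpočtový rok 2009"

	# Returns a list of dicts, one per input, each with a "translation_text" key.
	pipeline([it_text], max_length=512)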
	```

	## Training data

	The legal_t5_small_trans_it_cs model was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), [EUROPARL](https://www.statmt.org/europarl/), and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, consisting of 5 million parallel texts.

	## Training procedure

	### Preprocessing

	### Pretraining
	A unigram model with 88M parameters was trained over the complete parallel corpus to build the vocabulary (with byte pair encoding) used by this model.


	## Evaluation results

	On the translation test dataset, the model achieves the following results:

	Test results:

	| Model | BLEU score |
	|:-----:|:-----:|
	| legal_t5_small_trans_it_cs | 43.3 |
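
For readers unfamiliar with the metric, the sketch below computes a simplified corpus-level BLEU: the geometric mean of clipped 1- to 4-gram precisions multiplied by a brevity penalty, scaled to 0-100. It assumes whitespace tokenization, a single reference per sentence, and no smoothing, so it is not necessarily the scorer or tokenization behind the figure above. The helper names (`ngrams`, `corpus_bleu`) and the toy sentences are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100): one reference per sentence, whitespace tokens."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any order without a match gives zero

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


refs = ["the committee adopted the annual budget report"]
print(corpus_bleu(refs, refs))                                         # identical text
print(corpus_bleu(["the committee adopted the budget report"], refs))  # one word dropped
```

An exact match scores 100, and dropping a single word already lowers the score noticeably, because both the higher-order n-gram precisions and the brevity penalty are affected.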


	### BibTeX entry and citation info