	---
	datasets:
	- unicamp-dl/mmarco
	language:
	- pt
	pipeline_tag: text2text-generation
	base_model: unicamp-dl/ptt5-v2-large
	---

	## Introduction
	MonoPTT5 models are T5 rerankers for the Portuguese language. Starting from [ptt5-v2 checkpoints](https://huggingface.co/collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0), they were trained for 100k steps on a mixture of Portuguese and English data from the mMARCO dataset.
	For further information on the training and evaluation of these models, please refer to our paper, [ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language](https://arxiv.org/abs/2406.10806).

	## Usage
	The easiest way to use our models is through the `rerankers` package. After installing it with `pip install "rerankers[transformers]"` (the quotes keep shells such as zsh from interpreting the brackets), the following code serves as a minimal working example:

	```python
	from rerankers import Reranker
	import torch

	query = "O futebol é uma paixão nacional"
	docs = [
	    "O futebol é superestimado e não deveria receber tanta atenção.",
	    "O futebol é uma parte essencial da cultura brasileira e une as pessoas.",
	]

	# Load the reranker with the Portuguese prompt template used by MonoPTT5.
	ranker = Reranker(
	    "unicamp-dl/monoptt5-large",
	    inputs_template="Pergunta: {query} Documento: {text} Relevante:",
	    dtype=torch.float32,  # or bfloat16 if supported by your GPU
	)

	# Score every document against the query; results come back sorted by rank.
	results = ranker.rank(query, docs)

	print("Classification results:")
	for result in results:
	    print(result)

	# Loading T5Ranker model unicamp-dl/monoptt5-large
	# No device set
	# Using device cuda
	# Using dtype torch.float32
	# Loading model unicamp-dl/monoptt5-large, this might take a while...
	# Using device cuda.
	# Using dtype torch.float32.
	# T5 true token set to ▁Sim
	# T5 false token set to ▁Não
	# Returning normalised scores...
	# Inputs template set to Pergunta: {query} Documento: {text} Relevante:

	# Classification results:
	# document=Document(text='O futebol é uma parte essencial da cultura brasileira e une as pessoas.', doc_id=1, metadata={}) score=0.923164963722229 rank=1
	# document=Document(text='O futebol é superestimado e não deveria receber tanta atenção.', doc_id=0, metadata={}) score=0.08710747957229614 rank=2
	```

	For additional configurations and more advanced usage, consult the `rerankers` [GitHub repository](https://github.com/AnswerDotAI/rerankers).
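The log lines in the example above mention an inputs template, a "true" token (`▁Sim`), a "false" token (`▁Não`), and normalised scores. The sketch below illustrates how these pieces plausibly fit together, assuming the usual monoT5-style recipe: the template is filled once per query-document pair, and the score is a two-way softmax over the logits of the two tokens. The helper names `build_prompt` and `normalised_score` are hypothetical and are not part of the `rerankers` API. The logits are made-up numbers, not model outputs.

```python
import math

# Template copied from the usage example above.
INPUTS_TEMPLATE = "Pergunta: {query} Documento: {text} Relevante:"


def build_prompt(query: str, text: str, template: str = INPUTS_TEMPLATE) -> str:
    """Fill the template for one query-document pair (hypothetical helper)."""
    return template.format(query=query, text=text)


def normalised_score(true_logit: float, false_logit: float) -> float:
    """Two-way softmax over the "true"/"false" token logits (assumed scoring rule)."""
    # Subtract the larger logit before exponentiating, for numerical stability.
    m = max(true_logit, false_logit)
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


prompt = build_prompt(
    "O futebol é uma paixão nacional",
    "O futebol é uma parte essencial da cultura brasileira e une as pessoas.",
)
print(prompt)

# Illustrative logits only: a larger "true" logit yields a score above 0.5.
print(normalised_score(2.0, -0.5))
```

Under this assumption, scores lie between 0 and 1, and the scores obtained by swapping the two logits sum to 1, which is consistent with the values near 0.92 and 0.09 in the example output.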

	## Citation
	If you use our models, please cite:
	```
	@misc{piau2024ptt5v2,
	    title={ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language},
	    author={Marcos Piau and Roberto Lotufo and Rodrigo Nogueira},
	    year={2024},
	    eprint={2406.10806},
	    archivePrefix={arXiv},
	    primaryClass={cs.CL}
	}
	```