	---
	language:
	- vi
	license: apache-2.0
	library_name: transformers
	tags:
	- transformers
	- cross-encoder
	- rerank
	datasets:
	- unicamp-dl/mmarco
	pipeline_tag: text-classification
	widget:
	- text: tỉnh nào có diện tích lớn nhất việt nam
	  output:
	  - label: nghệ an có diện tích lớn nhất việt nam
	    score: 0.9999
	  - label: bắc ninh có diện tích nhỏ nhất việt nam
	    score: 0.1705
	---

	# Reranker

	* [Usage](#usage)
	    * [Using FlagEmbedding](#using-flagembedding)
	    * [Using Huggingface transformers](#using-huggingface-transformers)
	* [Fine-tune](#fine-tune)
	    * [Data Format](#data-format)
	* [Performance](#performance)
	* [Citation](#citation)

	Unlike an embedding model, a reranker takes a query and a document as input and directly outputs a similarity score
	instead of an embedding.
	You get a relevance score by passing a query and a passage to the reranker.
	The score can then be mapped to a float value in [0, 1] with the sigmoid function.
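As a quick illustration of the sigmoid mapping, here is a minimal sketch that converts raw relevance scores into values between 0 and 1 using only the standard library. The helper name `sigmoid` is ours for illustration and is not part of FlagEmbedding or transformers; the raw scores are the ones shown in the FlagEmbedding example in the Usage section.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw relevance score (logit) to a value between 0 and 1."""
    # Numerically stable form: never calls exp() on a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Raw scores copied from the FlagEmbedding example in this README.
raw_scores = [11.140625, -1.58203125]
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)  # roughly [0.99999, 0.1705]
```

The mapping is monotonic, so it changes the scale of the scores but not the order of the passages.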

	## Usage

	### Using FlagEmbedding

	```
	pip install -U FlagEmbedding
	```

	Get relevance scores (higher scores indicate more relevance):

	```python
	from FlagEmbedding import FlagReranker

	reranker = FlagReranker(
	    'namdp-ptit/ViRanker',
	    use_fp16=True,  # Setting use_fp16 to True speeds up computation with a slight performance degradation
	)

	# Score a single [query, passage] pair
	score = reranker.compute_score(['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'])
	print(score) # 11.140625

	# Set normalize=True to map the score into [0, 1] by applying the sigmoid function
	score = reranker.compute_score(
	    ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
	    normalize=True,
	)
	print(score) # 0.9999854895214452

	# Score several [query, passage] pairs at once
	scores = reranker.compute_score(
	    [
	        ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
	        ['tỉnh nào có diện tích lớn nhất việt nam', 'bắc ninh có diện tích nhỏ nhất việt nam'],
	    ]
	)
	print(scores) # [11.140625, -1.58203125]

	# Set normalize=True to map the scores into [0, 1] by applying the sigmoid function
	scores = reranker.compute_score(
	    [
	        ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
	        ['tỉnh nào có diện tích lớn nhất việt nam', 'bắc ninh có diện tích nhỏ nhất việt nam'],
	    ],
	    normalize=True,
	)
	print(scores) # [0.99998548952144523, 0.17050799982688053]
	```

	### Using Huggingface transformers

	```
	pip install -U transformers
	```

	Get relevance scores (higher scores indicate more relevance):

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained('namdp-ptit/ViRanker')
	model = AutoModelForSequenceClassification.from_pretrained('namdp-ptit/ViRanker')
	model.eval()

	# Each pair is [query, passage]
	pairs = [
	    ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
	    ['tỉnh nào có diện tích lớn nhất việt nam', 'bắc ninh có diện tích nhỏ nhất việt nam'],
	]
	with torch.no_grad():
	    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
	    # One raw logit per pair; higher means more relevant
	    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
	    print(scores)
	```
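Once each candidate passage has a score, reranking is just a sort by that score. The sketch below shows this step in plain Python. The `rerank` helper and its `top_k` parameter are hypothetical names introduced for illustration, and the hard-coded scores are the raw values from the FlagEmbedding example rather than fresh model output.

```python
from typing import List, Tuple


def rerank(passages: List[str], scores: List[float], top_k: int = 10) -> List[Tuple[str, float]]:
    """Return up to top_k (passage, score) tuples, most relevant first."""
    if len(passages) != len(scores):
        raise ValueError('passages and scores must have the same length')
    # Sort by score in descending order: higher score means more relevant.
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Candidates in arbitrary order, with scores taken from the example above.
passages = [
    'bắc ninh có diện tích nhỏ nhất việt nam',
    'nghệ an có diện tích lớn nhất việt nam',
]
scores = [-1.58203125, 11.140625]

for passage, score in rerank(passages, scores, top_k=2):
    print(f'{score:.4f}\t{passage}')
```

In practice the `scores` list would come from `reranker.compute_score(...)` or from the logits computed with transformers, converted to a Python list.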

	## Fine-tune

	### Data Format

	Training data should be a JSON file in which each line is a separate JSON object like this:

	```
	{"query": str, "pos": List[str], "neg": List[str]}
	```

	`query` is the query string, `pos` is a list of positive (relevant) texts, and `neg` is a list of negative
	(irrelevant) texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus
	to use as negatives.
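The following sketch shows one way to build a training line in this format when no negatives are available, by sampling them at random from the corpus. The `build_record` helper, its `num_neg` and `seed` parameters, and the tiny corpus are all assumptions made for illustration; only the `query`/`pos`/`neg` keys come from the format above.

```python
import json
import random
from typing import Dict, List


def build_record(query: str, pos: List[str], corpus: List[str],
                 num_neg: int = 2, seed: int = 0) -> Dict[str, object]:
    """Build one training record, sampling negatives from the corpus."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives = set(pos)
    # Never sample a known positive as a negative.
    candidates = [text for text in corpus if text not in positives]
    neg = rng.sample(candidates, k=min(num_neg, len(candidates)))
    return {"query": query, "pos": pos, "neg": neg}


# Hypothetical mini-corpus, for illustration only.
corpus = [
    'nghệ an có diện tích lớn nhất việt nam',
    'bắc ninh có diện tích nhỏ nhất việt nam',
    'hà nội là thủ đô của việt nam',
    'sông hồng chảy qua hà nội',
]
record = build_record(
    query='tỉnh nào có diện tích lớn nhất việt nam',
    pos=['nghệ an có diện tích lớn nhất việt nam'],
    corpus=corpus,
)

# ensure_ascii=False keeps Vietnamese characters readable in the file.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one such line per query to a file yields training data in the expected layout. Random negatives are usually easy for the model to reject, so this is a fallback rather than a substitute for curated negatives.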

	## Performance

	The table below compares our results with those of other pre-trained cross-encoders on
	the [MS MMarco Passage Reranking - Vi - Dev](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

	| Model-Name | NDCG@3 | MRR@3 | NDCG@5 | MRR@5 | NDCG@10 | MRR@10 | Docs / Sec |
	|-----------------------------------------------------------------------------------------------------------------------------------------|:-----------|:-----------|:-----------|:-----------|:-----------|:-----------|:-----------|
	| [namdp-ptit/ViRanker](https://huggingface.co/namdp-ptit/ViRanker) | 0.6685 | 0.6564 | 0.6842 | 0.6811 | 0.7278 | 0.6985 | 2.02 |
	| [itdainb/PhoRanker](https://huggingface.co/itdainb/PhoRanker) | 0.6625 | 0.6458 | 0.7147 | 0.6731 | 0.7422 | 0.6830 | 15 |
	| [kien-vu-uet/finetuned-phobert-passage-rerank-best-eval](https://huggingface.co/kien-vu-uet/finetuned-phobert-passage-rerank-best-eval) | 0.0963 | 0.0883 | 0.1396 | 0.1131 | 0.1681 | 0.1246 | 15 |
	| [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) | 0.6087 | 0.5841 | 0.6513 | 0.6062 | 0.6872 | 0.62091 | 3.51 |
	| [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) | 0.6088 | 0.5908 | 0.6446 | 0.6108 | 0.6785 | 0.6249 | 1.29 |
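
For readers unfamiliar with the metrics, the sketch below gives the common textbook definitions of MRR@k and NDCG@k for a single query. This README does not state how the reported numbers were computed, so treat this as an assumption: it uses binary relevance labels and the `log2` discount, and the function names are ours. Benchmark figures are typically the average of these per-query values over all queries.

```python
import math
from typing import List


def mrr_at_k(relevance: List[int], k: int) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevance: List[int], k: int) -> float:
    """DCG of the top k divided by the DCG of the ideal ordering."""
    def dcg(rels: List[int]) -> float:
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0


# Relevance labels of the passages in the order the reranker returned them:
# here the only relevant passage was ranked second.
relevance = [0, 1, 0, 0]
print(mrr_at_k(relevance, k=3))   # 0.5
print(ndcg_at_k(relevance, k=3))  # 1 / log2(3), about 0.631
```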

	## Citation

	Please cite this model as:

	```Plaintext
	@misc{ViRanker,
	title={ViRanker: A Cross-encoder Model for Vietnamese Text Ranking},
	author={Nam Dang Phuong},
	year={2024},
	publisher={Huggingface},
	}
	```