	---
	license: cc-by-sa-4.0
	datasets:
	- unicamp-dl/mmarco
	- bclavie/mmarco-japanese-hard-negatives
	language:
	- ja
	---



## Evaluation on JQaRA

| Model | NDCG@10 | MRR@10 | NDCG@100 | MRR@100 |
| ------------------- | --------- | --------- | --------- | --------- |
| splade-japanese-v3 | 0.505 | 0.772 | 0.7 | 0.775 |
| JaColBERTv2 | 0.585 | 0.836 | 0.753 | 0.838 |
| JaColBERT | 0.549 | 0.811 | 0.730 | 0.814 |
| bge-m3+all | 0.576 | 0.818 | 0.745 | 0.820 |
| bge-m3+dense | 0.539 | 0.785 | 0.721 | 0.788 |
| m-e5-large | 0.554 | 0.799 | 0.731 | 0.801 |
| m-e5-base | 0.471 | 0.727 | 0.673 | 0.731 |
| m-e5-small | 0.492 | 0.729 | 0.689 | 0.733 |
| GLuCoSE | 0.308 | 0.518 | 0.564 | 0.527 |
| sup-simcse-ja-base | 0.324 | 0.541 | 0.572 | 0.550 |
| sup-simcse-ja-large | 0.356 | 0.575 | 0.596 | 0.583 |
| fio-base-v0.1 | 0.372 | 0.616 | 0.608 | 0.622 |
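
For readers unfamiliar with the metrics in the table above, here is a minimal sketch of how NDCG@k and MRR@k can be computed for a single query. It is not the evaluation script used for these numbers. The function names `ndcg_at_k` and `mrr_at_k` are hypothetical, the gain is assumed to be linear in the relevance label, and the ideal ranking is assumed to be built only from the labels of the candidates being ranked.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for one query; `relevances` are labels in ranked order."""
    def dcg(rels):
        # Linear gain (assumption); rank 0 is discounted by log2(2) = 1.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    # Ideal ordering built from the same candidate labels (assumption).
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def mrr_at_k(relevances, k):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

# Made-up labels for one query: relevant items at ranks 2 and 4.
ranked_labels = [0, 1, 0, 1, 0]
print(ndcg_at_k(ranked_labels, k=10))
print(mrr_at_k(ranked_labels, k=10))
```

The reported scores would then be the mean of these per-query values over all evaluation queries.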

## Evaluation on [MIRACL Japanese](https://huggingface.co/datasets/miracl/miracl)
These models were not trained on the MIRACL training data.

| Model | nDCG@10 | Recall@1000 | Recall@5 | Recall@30 |
|------------------|---------|-------------|----------|-----------|
| BM25 | 0.369 | 0.931 | - | - |
| splade-japanese | 0.405 | 0.931 | 0.406 | 0.663 |
| splade-japanese-efficient | 0.408 | 0.954 | 0.419 | 0.718 |
| splade-japanese-v2 | 0.580 | 0.967 | 0.629 | 0.844 |
| splade-japanese-v2-doc | 0.478 | 0.930 | 0.514 | 0.759 |
| splade-japanese-v3 | 0.604 | 0.979 | 0.647 | 0.877 |


\* The `splade-japanese-v2-doc` model does not require a query encoder at inference time.
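
As a rough illustration of the Recall@k columns in the MIRACL table, the sketch below computes Recall@k for a single query. It is an assumption about the usual definition (the fraction of a query's relevant documents found in the top k), not the script used to produce the numbers above. The function name `recall_at_k` and the document IDs are hypothetical.

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    # Assumes `ranked_doc_ids` contains no duplicate IDs.
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

# Made-up ranking and relevance judgments for one query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = ["d1", "d4", "d8"]
print(recall_at_k(ranked, relevant, k=3))
print(recall_at_k(ranked, relevant, k=5))
```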



To try it out, run the code below to inspect how the model expands a query or document and how it weights each term.

First, install the tokenizer dependencies:

	```
	!pip install fugashi ipadic unidic-lite
	```

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")
# Map token IDs back to token strings.
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}

def encode_query(query):  # max length: 32 tokens for queries, 180 for passages
    query = tokenizer(query, return_tensors="pt")
    output = model(**query, return_dict=True).logits
    # log(1 + ReLU(logits)), masked by attention, then max over token positions.
    output, _ = torch.max(torch.log(1 + torch.relu(output)) * query['attention_mask'].unsqueeze(-1), dim=1)
    return output

with torch.no_grad():
    model_output = encode_query(query="筑波大学では何の研究が行われているか？")

reps = model_output
# Indices of vocabulary entries with non-zero weight.
idx = torch.nonzero(reps[0], as_tuple=False)

dict_splade = {}
for i in idx:
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        token = vocab_dict[int(i[0])]
        dict_splade[token] = float(token_value)

# Print tokens from highest to lowest weight.
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
```
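
Sparse representations like `dict_splade` above are typically compared with a dot product over the tokens that the query and the document share. The following is a minimal sketch of that scoring step, assuming each representation is a `{token: weight}` dict. The helper names `sparse_dot` and `rank_documents` are hypothetical, and the weights are made up for illustration rather than produced by the model.

```python
def sparse_dot(query_rep, doc_rep):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict to reduce lookups.
    if len(query_rep) > len(doc_rep):
        query_rep, doc_rep = doc_rep, query_rep
    return sum(w * doc_rep[t] for t, w in query_rep.items() if t in doc_rep)

def rank_documents(query_rep, doc_reps):
    """Return (doc_id, score) pairs sorted by descending score."""
    scored = [(doc_id, sparse_dot(query_rep, rep)) for doc_id, rep in doc_reps.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up weights; real ones would come from encoding text with the model.
query_rep = {"筑波": 2.0, "研究": 1.5, "大学": 1.0}
doc_reps = {
    "doc_a": {"筑波": 1.0, "大学": 0.5, "キャンパス": 0.5},
    "doc_b": {"研究": 1.0, "論文": 1.0},
    "doc_c": {"天気": 2.0},
}

ranking = rank_documents(query_rep, doc_reps)
for doc_id, score in ranking:
    print(doc_id, score)
```

In practice, document representations are usually precomputed and stored in an inverted index so that only documents sharing at least one token with the query need to be scored.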