---
license: cc-by-sa-4.0
datasets:
- unicamp-dl/mmarco
- bclavie/mmarco-japanese-hard-negatives
language:
- ja
---

## Evaluation on [MIRACL Japanese](https://huggingface.co/datasets/miracl/miracl)

None of these models were trained on the MIRACL training data.

| Model | nDCG@10 | Recall@1000 | Recall@5 | Recall@30 |
|------------------|---------|-------------|----------|-----------|
| BM25 | 0.369 | 0.931 | - | - |
| splade-japanese | 0.405 | 0.931 | 0.406 | 0.663 |
| splade-japanese-efficient | 0.408 | 0.954 | 0.419 | 0.718 |
| splade-japanese-v2 | 0.580 | 0.967 | 0.629 | 0.844 |
| splade-japanese-v2-doc | 0.478 | 0.930 | 0.514 | 0.759 |
| splade-japanese-v3 | **0.604** | **0.979** | **0.647** | **0.877** |

\*The `splade-japanese-v2-doc` model does not require a query encoder during inference.

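As a rough illustration of what scoring without a query encoder can look like, the sketch below assumes a setup that is common for document-only SPLADE variants but is not confirmed here for this model: the document side provides a token-to-weight mapping, the query is reduced to its set of tokens, and the score is the sum of the document weights of those tokens. The function name `score_doc_only`, the tokens, and the weights are all hypothetical.

```python
def score_doc_only(query_tokens, doc_weights):
    """Sum the document-side weights of the distinct tokens found in the query."""
    return sum(doc_weights.get(token, 0.0) for token in set(query_tokens))


# Hypothetical document expansion weights (illustrative numbers, not model output).
doc_weights = {"筑波": 2.0, "大学": 1.5, "研究": 1.0}

# Hypothetical pre-tokenized query; no neural encoding is applied to it.
query_tokens = ["筑波", "大学", "場所"]

print(score_doc_only(query_tokens, doc_weights))
```

Tokens that the document representation does not contain contribute nothing to the score, so only the overlap between the query tokens and the expanded document terms matters.
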
To try it out, run the code below to inspect the term expansion and term weights that the model produces for a query or a document.

In a notebook, install the Japanese tokenizer dependencies first (drop the leading `!` when running in a shell):

```
!pip install fugashi ipadic unidic-lite
```

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")
# Map token ids back to token strings.
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}


def encode_query(query):
    """Return one weight per vocabulary term for the given text."""
    inputs = tokenizer(query, return_tensors="pt")
    logits = model(**inputs, return_dict=True).logits
    # log(1 + ReLU(logits)), masked by the attention mask,
    # then max-pooled over the sequence dimension.
    weights = torch.log(1 + torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
    output, _ = torch.max(weights, dim=1)
    return output


with torch.no_grad():
    model_output = encode_query(query="筑波大学では何の研究が行われているか?")

reps = model_output
# Indices of the vocabulary terms with a non-zero weight.
idx = torch.nonzero(reps[0], as_tuple=False)

dict_splade = {}
for i in idx:
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        token = vocab_dict[int(i[0])]
        dict_splade[token] = float(token_value)

# Print the expanded terms, highest weight first.
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
```
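
The token and weight pairs printed by the snippet above form a sparse vector over the vocabulary. SPLADE-style retrieval typically ranks documents by the dot product between the query and document sparse vectors. The following is a minimal sketch of that scoring step on plain dictionaries; the helper `sparse_dot`, the document ids, and all weights are hypothetical and chosen only for illustration, not taken from the model.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so fewer lookups are needed.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)


# Hypothetical sparse representations (illustrative numbers, not model output).
query_weights = {"筑波": 2.0, "大学": 1.5, "研究": 1.0}
docs = {
    "doc_a": {"筑波": 1.0, "研究": 2.0, "宇宙": 0.5},
    "doc_b": {"大学": 1.0, "入試": 1.5},
}

# Rank document ids by descending dot-product score.
ranking = sorted(docs, key=lambda d: sparse_dot(query_weights, docs[d]), reverse=True)
print(ranking)
```

In practice a dictionary such as `dict_splade` could play the role of `query_weights`, with document vectors produced the same way and usually stored in an inverted index rather than scanned one by one.
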