aken12 committed
Commit
f1a787a
1 Parent(s): a98ce1e

Create README.md

Files changed (1)
  1. README.md +66 -0
README.md ADDED
@@ -0,0 +1,66 @@
---
license: cc-by-sa-4.0
datasets:
- unicamp-dl/mmarco
- bclavie/mmarco-japanese-hard-negatives
language:
- ja
---

## Evaluation on [MIRACL Japanese](https://huggingface.co/datasets/miracl/miracl)

These models were not trained on the MIRACL training data.

| Model                     | nDCG@10 | Recall@1000 | Recall@5 | Recall@30 |
|---------------------------|---------|-------------|----------|-----------|
| BM25                      | 0.369   | 0.931       | -        | -         |
| splade-japanese           | 0.405   | 0.931       | 0.406    | 0.663     |
| splade-japanese-efficient | 0.408   | 0.954       | 0.419    | 0.718     |
| splade-japanese-v2        | 0.580   | 0.967       | 0.629    | 0.844     |
| splade-japanese-v2-doc    | 0.478   | 0.930       | 0.514    | 0.759     |
| splade-japanese-v3        | 0.604   | 0.979       | 0.647    | 0.877     |

\*The 'splade-japanese-v2-doc' model does not require a query encoder during inference.

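To illustrate what "no query encoder" can mean here, below is a minimal sketch, not taken from this card, that assumes the doc-only variant follows the common SPLADE-doc recipe: documents are expanded and weighted by the model, while a query is represented only by its token ids, so scoring reduces to summing the document weights at those ids. The repository id `aken12/splade-japanese-v2-doc`, the helper names, and the example texts are assumptions for illustration (the tokenizer dependencies listed below are still needed).

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Assumed repository id for the doc-only variant; adjust to the actual model id.
doc_model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v2-doc")
doc_tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v2-doc")

def encode_document(text):
    # Document side: the usual SPLADE pooling over the MLM logits.
    enc = doc_tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = doc_model(**enc, return_dict=True).logits
    weights, _ = torch.max(torch.log(1 + torch.relu(logits)) * enc["attention_mask"].unsqueeze(-1), dim=1)
    return weights[0]  # 1D tensor of vocabulary size

def score(query, doc_weights):
    # Query side: no encoder, just sum the document's weights at the query's token ids.
    query_ids = doc_tokenizer(query, add_special_tokens=False)["input_ids"]
    return sum(doc_weights[i].item() for i in query_ids)

doc_weights = encode_document("筑波大学では人工知能や材料科学などの研究が行われている。")  # example document
print(score("筑波大学では何の研究が行われているか?", doc_weights))
```

Because the query side involves no forward pass, query processing in this setup is essentially a term lookup.
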
If you'd like to try it out, you can inspect the term expansion and weighting of queries or documents by running the code below.

You first need to install:

```
!pip install fugashi ipadic unidic-lite
```

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v2")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v2")
# Map token ids back to token strings so the expanded terms can be printed.
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}

def encode_query(query):
    # SPLADE pooling: log(1 + ReLU(logits)) weighted by the attention mask, max-pooled over the sequence.
    query = tokenizer(query, return_tensors="pt")
    output = model(**query, return_dict=True).logits
    output, _ = torch.max(torch.log(1 + torch.relu(output)) * query['attention_mask'].unsqueeze(-1), dim=1)
    return output

with torch.no_grad():
    model_output = encode_query(query="筑波大学では何の研究が行われているか?")

reps = model_output
idx = torch.nonzero(reps[0], as_tuple=False)

# Collect the non-zero vocabulary weights, i.e. the expansion terms and their importance.
dict_splade = {}
for i in idx:
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        token = vocab_dict[int(i[0])]
        dict_splade[token] = float(token_value)

# Print the expansion terms sorted by weight, highest first.
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
```
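
Beyond inspecting the expansion, the sparse vectors can be used for retrieval by taking a dot product between a query vector and a document vector. The sketch below is not part of the original card: it reuses `encode_query` and `model_output` from the block above and assumes documents are encoded with the same recipe as queries; the document text is a made-up example.

```python
# Hedged sketch: score an example document against the query encoded above.
# Assumes documents are encoded with the same recipe as queries for this model.
def encode_text(text):
    with torch.no_grad():
        return encode_query(text)  # same log(1 + ReLU) max pooling as above

doc_rep = encode_text("筑波大学は情報学や生命科学などの研究が盛んな国立大学である。")  # example document
query_rep = model_output  # query representation computed in the previous block

# SPLADE relevance: dot product of the two sparse, vocabulary-sized vectors.
score = torch.matmul(query_rep, doc_rep.T).item()
print("relevance score:", score)
```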