File size: 5,276 Bytes
9221208 d372cfb 9221208 d372cfb 9221208 6c122a4 d372cfb 9221208 eef3369 9221208 eef3369 0a05607 9221208 eef3369 9221208 eef3369 9221208 eef3369 9221208 0a05607 eef3369 9221208 d372cfb 9221208 d372cfb f9c4a7f 9221208 d372cfb 6c122a4 d372cfb 6c122a4 9221208 d372cfb 6c122a4 d372cfb 9221208 d372cfb 6c122a4 d372cfb 6c122a4 9221208 d372cfb 6c122a4 9221208 d372cfb |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 |
---
language:
- ja
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
metrics:
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
datasets:
- hpprc/emb
- hpprc/mqa-ja
- google-research-datasets/paws-x
---
## Model Details
This is a text embedding model based on RoFormer with a maximum input sequence length of 1024.
The model is pre-trained with Wikipedia and cc100 and fine-tuned as a sentence embedding model.
Fine-tuning begins with weakly supervised learning using mc4 and MQA.
After that, we perform the same 3-stage learning process as [GLuCoSE v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2).
### Model Description
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 1024 tokens
- **Output Dimensionality:** 768 tokens
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: RetrievaBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F
# Download from the 🤗 Hub
# The argument "trust_remote_code=True" is required to load the model
model = SentenceTransformer("pkshatech/RoSEtta-base-ja",trust_remote_code=True)
# Don't forget to add the prefix "query: " for query-side or "passage: " for passage-side texts.
sentences = [
'query: PKSHAはどんな会社ですか?',
'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
'query: 日本で一番高い山は?',
'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences,convert_to_tensor=True)
print(embeddings.shape)
# [4, 768]
# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# tensor([[1.0000, 0.5910, 0.4332, 0.5421],
# [0.5910, 1.0000, 0.4977, 0.6969],
# [0.4332, 0.4977, 1.0000, 0.7475],
# [0.5421, 0.6969, 0.7475, 1.0000]])
```
<!--
### Direct Usage (Transformers)
<details><summary>Click to see the direct usage in Transformers</summary>
</details>
-->
<!--
### Downstream Usage (Sentence Transformers)
You can finetune this model on your own dataset.
<details><summary>Click to expand</summary>
</details>
-->
<!--
### Out-of-Scope Use
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->
<!--
## Bias, Risks and Limitations
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->
<!--
### Recommendations
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->
## Benchmarks
### Retieval
Evaluated with [MIRACL-ja](https://huggingface.co/datasets/miracl/miracl), [JQARA](https://huggingface.co/datasets/hotchpotch/JQaRA) and [MLDR-ja](https://huggingface.co/datasets/Shitao/MLDR).
| model | size | MIRACL<br>Recall@5 | JQaRA<br>nDCG@10 | MLDR<br>nDCG@10 |
|:--:|:--:|:--:|:--:|:----:|
| me5-base | 0.3B | **84.2** | 47.2 | 25.4 |
| GLuCoSE | 0.1B | 53.3 | 30.8 | 25.2 |
| RoSEtta | 0.2B | 79.3 | **57.7** | **32.3** |
### JMTEB
Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
* The time-consuming datasets ['amazon_review_classification', 'mrtydi', 'jaqket', 'esci'] were excluded, and the evaluation was conducted on the other 12 datasets.
* The average is a macro-average per task.
| model | size | Class. | Ret. | STS. | Clus. | Pair. | Avg. |
|:--:|:--:|:--:|:--:|:----:|:-------:|:-------:|:------:|
| me5-base | 0.3B | 75.1 | 80.6 | 80.5 | 52.6 | 62.4 | 70.2 |
| GLuCoSE | 0.1B | **82.6** | 69.8 | 78.2 | 51.5 | **66.2** | 69.7 |
| RoSEtta | 0.2B | 79.0 | **84.3** | **81.4** | **53.2** | 61.7 | **71.9** |
## Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe
## License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). |