---
pipeline_tag: sentence-similarity
language: ja
license: cc-by-sa-4.0
tags:
- transformers
- sentence-similarity
- feature-extraction
- sentence-transformers
inference: false 
widget:
- source_sentence: "This widget can't work correctly now."
  sentences:
    - "Sorry :("
    - "Try this model in your local environment!"
  example_title: "notification"
---

# Japanese SimCSE (BERT-base)
[日本語のREADME/Japanese README](https://huggingface.co/pkshatech/simcse-ja-bert-base-clcmlp/blob/main/README_JA.md)  

## Summary
Model name: `pkshatech/simcse-ja-bert-base-clcmlp`  


This is a Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model that lets you easily extract sentence embeddings from Japanese text. It is based on [`cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) and trained on the [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88) dataset, a Japanese natural language inference dataset.


## Usage (Sentence-Transformers)
You can use this model easily with [sentence-transformers](https://www.SBERT.net).

You need [fugashi](https://github.com/polm/fugashi) and [unidic-lite](https://pypi.org/project/unidic-lite/) for tokenization.

Please install sentence-transformers, fugashi, and unidic-lite with pip as follows:
```
pip install -U fugashi[unidic-lite] sentence-transformers
```

You can load the model and convert sentences to dense vectors as follows:

```python
from sentence_transformers import SentenceTransformer
sentences = [
    "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
    "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
    "広目天は、仏教における四天王の一尊であり、サンスクリット語の「種々の眼をした者」を名前の由来とする。",
]

model = SentenceTransformer('pkshatech/simcse-ja-bert-base-clcmlp')
embeddings = model.encode(sentences)
print(embeddings)
```

Since the model was trained with a contrastive loss based on cosine similarity, we recommend using cosine similarity to compare the embeddings in downstream tasks.
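For example, you can compare two embeddings with `sentence_transformers.util.cos_sim`; this is just one convenient option, and any cosine-similarity implementation works equally well:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('pkshatech/simcse-ja-bert-base-clcmlp')
embeddings = model.encode([
    "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
    "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
])

# Pairwise cosine similarity; higher values indicate more similar sentences.
print(util.cos_sim(embeddings[0], embeddings[1]))
```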

## Model Details

### Tokenization
We use the same tokenizer as `cl-tohoku/bert-base-japanese-v2`. Please see the [README of `cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) for details.
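If you want to inspect the tokenization yourself, you can load the tokenizer with `transformers`. This sketch assumes the tokenizer files are available in this model repository, as is usual for sentence-transformers models; otherwise load the tokenizer from `cl-tohoku/bert-base-japanese-v2` directly:

```python
from transformers import AutoTokenizer

# MeCab-based word segmentation followed by WordPiece (requires fugashi and unidic-lite).
tokenizer = AutoTokenizer.from_pretrained('pkshatech/simcse-ja-bert-base-clcmlp')
print(tokenizer.tokenize("日本語の文をサブワードに分割します。"))
```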

### Training
We initialized the model with `cl-tohoku/bert-base-japanese-v2` and trained it on the train set of [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88). We trained for 20 epochs and published the checkpoint with the highest Spearman's correlation coefficient on our validation set[^1], which was held out from the train set of [JSTS](https://github.com/yahoojapan/JGLUE).

### Training Parameters

| Parameter | Value |
| --- | --- |
| pooling_strategy | [CLS] -> single fully-connected layer |
| max_seq_length | 128 |
| with hard negative | true |
| temperature of contrastive loss |  0.05 |
| Batch size | 200 |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Warmup steps | 2012 |
| Scheduler | WarmupLinear |
| Epochs | 20 |
| Evaluation steps | 250 |
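The training script itself is not included in this card. As an illustration only, the setup described above (contrastive loss over JSNLI triplets with hard negatives, temperature 0.05, [CLS] pooling followed by a fully-connected layer) could be reproduced roughly with sentence-transformers' `MultipleNegativesRankingLoss`, where `scale = 1 / temperature = 20`. The example sentences and the exact module wiring below are assumptions, not the actual training code:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# [CLS] pooling followed by a single fully-connected layer, as listed in the table above.
word_embedding = models.Transformer("cl-tohoku/bert-base-japanese-v2", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="cls")
dense = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=pooling.get_sentence_embedding_dimension(),
)
model = SentenceTransformer(modules=[word_embedding, pooling, dense])

# Hypothetical triplets: (premise, entailment hypothesis, contradiction hypothesis) from JSNLI.
train_examples = [
    InputExample(texts=["前提文です。", "含意される文です。", "矛盾する文です。"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=200)

# Contrastive loss with in-batch and hard negatives; scale = 1 / temperature = 1 / 0.05 = 20.
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=20,
    scheduler="WarmupLinear",
    warmup_steps=2012,
    weight_decay=0.01,
    max_grad_norm=1.0,
    optimizer_params={"lr": 1e-5},
    evaluation_steps=250,
)
```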


## License
This model is distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.


[^1]: When we trained this model, the test data of JGLUE had not been released, so we used the dev set of JGLUE as private evaluation data. Therefore, we selected the checkpoint on the train set of JGLUE instead of its dev set.