|
--- |
|
pipeline_tag: sentence-similarity |
|
language: ja |
|
license: cc-by-sa-4.0 |
|
tags: |
|
- transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- sentence-transformers |
|
inference: false |
|
widget: |
|
- source_sentence: "This widget can't work correctly now." |
|
sentences: |
|
- "Sorry :(" |
|
- "Try this model in your local environment!" |
|
example_title: "notification" |
|
--- |
|
|
|
# Japanese SimCSE (BERT-base) |
|
[日本語のREADME/Japanese README](https://huggingface.co/pkshatech/simcse-ja-bert-base-clcmlp/blob/main/README_JA.md) |
|
|
|
## Summary

Model name: `pkshatech/simcse-ja-bert-base-clcmlp`
|
|
|
|
|
This is a Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model. You can easily extract sentence embedding representations from Japanese sentences. This model is based on [`cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) and trained on the [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88) dataset, which is a Japanese natural language inference dataset.
|
|
|
|
|
## Usage (Sentence-Transformers) |
|
You can use this model easily with [sentence-transformers](https://www.SBERT.net). |
|
|
|
You need [fugashi](https://github.com/polm/fugashi) and [unidic-lite](https://pypi.org/project/unidic-lite/) for tokenization. |
|
|
|
Please install sentence-transformers, fugashi, and unidic-lite with pip as follows: |
|
``` |
|
pip install -U fugashi[unidic-lite] sentence-transformers |
|
``` |
|
|
|
You can load the model and convert sentences to dense vectors as follows: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
sentences = [ |
|
"PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。", |
|
"この深層学習モデルはPKSHA Technologyによって学習され、公開された。", |
|
"広目天は、仏教における四天王の一尊であり、サンスクリット語の「種々の眼をした者」を名前の由来とする。", |
|
] |
|
|
|
model = SentenceTransformer('pkshatech/simcse-ja-bert-base-clcmlp') |
|
embeddings = model.encode(sentences) |
|
print(embeddings) |
|
``` |
|
|
|
Since the training loss is computed from cosine similarity, we recommend using cosine similarity to compare the embeddings in downstream tasks.
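As a minimal sketch of this recommendation, the embeddings from the snippet above can be scored with `sentence_transformers.util.cos_sim` (the sentence list is repeated so the block is self-contained):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/simcse-ja-bert-base-clcmlp")

sentences = [
    "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
    "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
    "広目天は、仏教における四天王の一尊であり、サンスクリット語の「種々の眼をした者」を名前の由来とする。",
]

# Encode the sentences and compute all pairwise cosine similarities.
embeddings = model.encode(sentences)
cosine_scores = util.cos_sim(embeddings, embeddings)

# The two PKSHA-related sentences should score higher against each other
# than against the unrelated third sentence.
print(cosine_scores)
```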
|
|
|
## Model Details
|
|
|
### Tokenization |
|
We use the same tokenizer as `cl-tohoku/bert-base-japanese-v2`. Please see the [README of `cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) for details.
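For reference, the tokenizer can also be loaded on its own with `transformers` (a minimal sketch; it requires fugashi and unidic-lite, as described in the Usage section):

```python
from transformers import AutoTokenizer

# Same tokenization as cl-tohoku/bert-base-japanese-v2:
# MeCab-based word segmentation (fugashi + unidic-lite) followed by WordPiece.
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")

print(tokenizer.tokenize("この深層学習モデルはPKSHA Technologyによって学習され、公開された。"))
```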
|
|
|
### Training |
|
We initialized the model with `cl-tohoku/bert-base-japanese-v2` and trained it on the train set of [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88). We trained for 20 epochs and published the checkpoint with the highest Spearman's rank correlation coefficient on a validation set[^1] drawn from the train set of [JSTS](https://github.com/yahoojapan/JGLUE).
|
|
|
### Training Parameters |
|
|
|
| Parameter | Value | |
|
| --- | --- | |
|
| pooling_strategy | [CLS] -> single fully-connected layer |
|
| max_seq_length | 128 | |
|
| with hard negative | true | |
|
| temperature of contrastive loss | 0.05 | |
|
| Batch size | 200 | |
|
| Learning rate | 1e-5 | |
|
| Weight decay | 0.01 | |
|
| Max gradient norm | 1.0 | |
|
| Warmup steps | 2012 | |
|
| Scheduler | WarmupLinear | |
|
| Epochs | 20 | |
|
| Evaluation steps | 250 | |
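The training script is not reproduced here, but the parameters above roughly correspond to the following sentence-transformers setup. This is a hypothetical sketch, not the authors' exact code: the `(premise, entailment, contradiction)` triplets are placeholders for JSNLI rows, and `MultipleNegativesRankingLoss` serves as the in-batch contrastive loss, with `scale = 1 / temperature = 20`.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# [CLS] pooling followed by a single fully-connected layer on top of the base encoder.
word_embedding = models.Transformer("cl-tohoku/bert-base-japanese-v2", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="cls")
dense = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=pooling.get_sentence_embedding_dimension(),
)
model = SentenceTransformer(modules=[word_embedding, pooling, dense])

# Supervised SimCSE with hard negatives: (premise, entailment, contradiction) triplets.
train_examples = [
    InputExample(texts=["premise", "entailment hypothesis", "contradiction hypothesis"]),  # placeholder data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=200)

# In-batch contrastive loss over cosine similarity; scale = 1 / 0.05 = 20.
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=20,
    warmup_steps=2012,
    scheduler="WarmupLinear",
    optimizer_params={"lr": 1e-5},
    weight_decay=0.01,
    max_grad_norm=1.0,
    evaluation_steps=250,
    # In practice, an evaluator on the JSTS-based validation set would be
    # passed via `evaluator=` so that checkpoints can be compared.
)
```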
|
|
|
|
|
# License

This model is distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
|
|
|
|
|
[^1]: When we trained this model, the test data of JGLUE had not been released, so we used the dev set of JGLUE as private evaluation data. Therefore, we selected the checkpoint using the train set of JGLUE instead of its dev set.
|
|
|
|