	---
	language:
	- ja
	license_name: sarahina-non-commercial-license
	license_link: LICENSE
	tags:
	- transformers
	- sentence-similarity
	- feature-extraction
	- sentence-transformers
	inference: false
	datasets:
	- hpprc/emb
	- cl-nagoya/auto-wiki-qa
	- cl-nagoya/ruri-dataset-ft
	- hpprc/mqa-ja
	- izumi-lab/llm-japanese-dataset
	- sentence-transformers/NQ-retrieval
	- sbintuitions/JSQuAD
	- SkelterLabsInc/JaQuAD
	- wikimedia/wikipedia
	- cl-nagoya/nu-mnli
	- castorini/mr-tydi
	---

	# Sarashina-Embedding-v1-1B

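	"Sarashina-Embedding-v1-1B" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b).

	The model was trained with multi-stage contrastive learning. As of 2024/12/1, it achieved a state-of-the-art average score across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).

	The model maps sentences and paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.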

	## Model Details

	### Model Description

	- Model type: Sentence Transformer
	- Base model: [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
	- Maximum sequence length: 8,192 tokens
	- Output dimensionality: 1,792 dimensions
	- Similarity function: cosine similarity
	- Language: Japanese
	- License: [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)

	### Model Architecture

	```
	SentenceTransformer(
	(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
	(1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
	)
	```

	## Usage

	### Using Sentence Transformers

	First, install the Sentence Transformers library.

	```bash
	pip install -U sentence-transformers
	```

	Then load the model and run inference.

	```python
	from sentence_transformers import SentenceTransformer

	# Download the model from the 🤗 Hub
	model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")
	# Run inference
	sentences = [
	    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
	    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
	    '更科蕎麦とはなんですか?'
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 1792)

	# Get the similarity scores between the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# torch.Size([3, 3])
	```

	Note

	- You do not need to add prefixes such as "Query: " or "Document: " to the beginning of the input text.
	- This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE), which restricts commercial use. If you are interested in using this model for your business, please feel free to reach out through our [contact page](https://www.sbintuitions.co.jp/#contact).
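Because no prefixes are required, queries and documents can be encoded the same way and compared directly by cosine similarity. The sketch below illustrates a simple semantic search over precomputed embeddings. The helper `rank_documents` is hypothetical, and small hand-written vectors stand in for the 1792-dimensional output of `model.encode` so the example runs without downloading the model.

```python
import numpy as np


def rank_documents(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 3):
    """Return (document index, cosine similarity) pairs, best match first."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                        # cosine similarity of every document to the query
    order = np.argsort(-scores)[:top_k]   # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]


# Stand-in vectors; with the real model these would come from model.encode(...)
docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
query = np.array([2.0, 0.1])
results = rank_documents(query, docs, top_k=2)
for idx, score in results:
    print(idx, round(score, 3))
```

For large collections, an approximate nearest-neighbor index would normally replace the brute-force matrix product used here.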

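	## Training

	"Sarashina-Embedding-v1-1B" was trained in the following two stages.

	### Stage 1: Weakly-supervised Learning

	To build an embedding model that is general-purpose and performs well across a wide range of domains, we ran contrastive learning on weakly supervised data consisting of our own web-crawled data and open data.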
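This README does not specify the exact contrastive objective. As an illustration only, the sketch below shows one common form, an InfoNCE loss with in-batch negatives: each query is pulled toward its paired document and pushed away from the other documents in the batch. The function name `info_nce_loss` and the temperature of 0.05 are assumptions, not details of the actual training recipe.

```python
import numpy as np


def info_nce_loss(query_embs: np.ndarray, doc_embs: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives; row i of each array is a positive pair."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                      # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy against the diagonal


rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))
# Queries that almost match their own document give a low loss ...
aligned = info_nce_loss(docs + 0.01 * rng.normal(size=(4, 8)), docs)
# ... while queries paired with the wrong document give a high loss.
shuffled = info_nce_loss(docs[::-1].copy(), docs)
print(aligned < shuffled)
```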

	#### Datasets

|dataset|counts|
|:-:|:-:|
|[AutoWikiQA](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa)|50,521,135|
|web-crawled data (ours)|47,370,649|
|[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
|[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
|[Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
|Quiz dataset (ours)|988,478|
|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
|[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
|[SNOW(T15+T23)](https://aclanthology.org/L18-1185)|62,758|
|[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
|[MKQA](https://aclanthology.org/2021.tacl-1.82)|3,318|
|||
|total|126,744,763|

### Stage 2: Fine-tuning

	To teach the model more accurate query-document similarity, we fine-tuned it on the following datasets.

	#### Datasets

|dataset|counts|
|:-:|:-:|
|[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388|
|[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
|[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)|3,697|
|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)|20,000|
|||
|total|233,072|

	## Evaluation Results on [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)

| Model | Max Tokens | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai] | 8191 | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| [cl-nagoya/ruri-large](https://arxiv.org/abs/2409.07737) | 512 | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 1024 | 72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
|||
| [Sarashina-Embedding-v1-1B](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b) (This model) | 8192 | 75.50 | 77.61 | 82.71 | 78.37 | 93.74 | 53.86 | 62.00 |
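
The Avg. column is not the plain mean of the six category scores. It appears to be an average over the 16 individual datasets, so categories with more datasets weigh more. The sketch below reproduces the Avg. value for this model under that assumption. The per-category dataset counts are not stated in this README and are an assumption here, as is the helper name `dataset_weighted_average`.

```python
# Assumed number of JMTEB datasets per task category (16 in total).
# These counts are not given in the table above.
DATASETS_PER_CATEGORY = {
    "Retrieval": 6,
    "STS": 2,
    "Classification": 4,
    "Reranking": 1,
    "Clustering": 2,
    "PairClassification": 1,
}

# Category scores of Sarashina-Embedding-v1-1B, copied from the table
SARASHINA_SCORES = {
    "Retrieval": 77.61,
    "STS": 82.71,
    "Classification": 78.37,
    "Reranking": 93.74,
    "Clustering": 53.86,
    "PairClassification": 62.00,
}


def dataset_weighted_average(scores: dict, counts: dict) -> float:
    """Average category scores, weighting each category by its dataset count."""
    total = sum(counts.values())
    return sum(scores[name] * counts[name] for name in counts) / total


avg = dataset_weighted_average(SARASHINA_SCORES, DATASETS_PER_CATEGORY)
print(f"{avg:.2f}")  # 75.50
```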

	## License

	This model is released under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).

	If you are interested in using this model commercially, please feel free to reach out through our [contact page](https://www.sbintuitions.co.jp/#contact).

	[^oai]: Benchmarked on April 23, 2024.