metadata

language:
  - ja
  - en
license_name: sarahina-non-commercial-license
license_link: LICENSE
tags:
  - transformers
  - sentence-similarity
  - feature-extraction
  - sentence-transformers
pipeline_tag: sentence-similarity
inference: false
datasets:
  - hpprc/emb
  - cl-nagoya/auto-wiki-qa
  - cl-nagoya/ruri-dataset-ft
  - hpprc/mqa-ja
  - izumi-lab/llm-japanese-dataset
  - sentence-transformers/NQ-retrieval
  - sbintuitions/JSQuAD
  - SkelterLabsInc/JaQuAD
  - wikimedia/wikipedia
  - cl-nagoya/nu-mnli
  - castorini/mr-tydi

Sarashina-Embedding-v1-1B

Japanese README

"Sarashina-Embedding-v1-1B" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "Sarashina2.1-1B". We trained this model with multi-stage contrastive learning. It achieves the state-of-the-art average score across the 16 datasets in JMTEB (Japanese Massive Text Embedding Benchmark).

This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: Sarashina2.1-1B
Maximum Sequence Length: 8,192 tokens
Output Dimensionality: 1,792 dimensions
Similarity Function: Cosine Similarity
Language: Japanese
License: Sarashina Model NonCommercial License Agreement

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel 
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
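The Pooling module above enables only `pooling_mode_lasttoken`, so each sentence vector is taken from the hidden state of the final non-padding token. The following is a minimal pure-Python sketch of that idea on toy data. It is not the library's implementation; the function name `last_token_pool` and the 2-dimensional toy hidden states are assumptions made for illustration (the real model produces 1,792-dimensional states).

```python
from typing import List

Vector = List[float]


def last_token_pool(token_embeddings: List[List[Vector]],
                    attention_mask: List[List[int]]) -> List[Vector]:
    """Return, for each sequence, the hidden state of its last real token.

    Assumes attention_mask holds 1 for real tokens and 0 for padding.
    Works for both left-padded and right-padded batches.
    """
    pooled = []
    for states, mask in zip(token_embeddings, attention_mask):
        # Positions that hold real (non-padding) tokens.
        real_positions = [i for i, m in enumerate(mask) if m == 1]
        if not real_positions:
            raise ValueError("sequence has no non-padding tokens")
        pooled.append(states[real_positions[-1]])
    return pooled


# Toy batch (hypothetical values): 2 sequences, 3 positions, 2-dim states.
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],  # right-padded, 2 real tokens
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],  # no padding
]
mask = [[1, 1, 0], [1, 1, 1]]
sentence_vectors = last_token_pool(hidden, mask)
```

Last-token pooling suits a decoder-only base model such as LlamaModel, because with causal attention only the final position has attended to the whole input.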

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")
# Run inference
sentences = [
    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    '更科蕎麦とはなんですか?'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1792)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
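For semantic search, the similarity scores are typically used to rank documents against a query. The sketch below shows that ranking step with cosine similarity, the similarity function listed for this model. To keep it runnable without downloading the model, it uses toy 3-dimensional vectors in place of real embeddings; the helper names `cosine_similarity` and `rank_documents` are hypothetical and not part of Sentence Transformers.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: Sequence[float],
                   doc_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned
]
ranking = rank_documents(query, docs)
for index, score in ranking:
    print(index, round(score, 3))
```

With the real model, rows of `model.encode(...)` would take the place of the toy vectors, and the query and documents can be encoded as plain text without any prefix.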

Note

You do not need to add prefixes such as "Query: " and "Document: " at the beginning of the input sentence.
This model is licensed under the Sarashina Model NonCommercial License Agreement, which has restrictions on commercial use. If you are interested in utilizing this model for your business, please feel free to contact us through our contact page.

Training

"Sarashina-Embedding-v1-1B" is created through the following two-stage learning process:

Stage 1: Weakly-supervised Learning

To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.
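The exact contrastive objective is not specified here. As a rough illustration only, the sketch below computes an InfoNCE loss with in-batch negatives, a common choice for this kind of training. The function name `info_nce_loss`, the temperature of 0.05, and the toy similarity matrices are all assumptions, not the authors' settings.

```python
import math
from typing import List


def info_nce_loss(similarity: List[List[float]], temperature: float = 0.05) -> float:
    """Mean InfoNCE loss over a batch.

    similarity[i][j] is the similarity between query i and passage j.
    Passage i is the positive for query i; all other passages in the
    batch serve as negatives.
    """
    losses = []
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over the row.
        peak = max(logits)
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the diagonal entry as the target class.
        losses.append(log_partition - logits[i])
    return sum(losses) / len(losses)


# Toy similarity matrices (hypothetical values).
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each query is closest to its own passage
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each query is closest to the wrong passage
aligned_loss = info_nce_loss(aligned)
swapped_loss = info_nce_loss(swapped)
```

The loss is near zero when each query scores highest against its paired passage and grows when a negative scores higher, which is what pushes paired texts together in the embedding space.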

Datasets

dataset	counts
AutoWikiQA	50,521,135
web-crawled data (ours)	47,370,649
MQA	12,941,472
llm-japanese-dataset	9,074,340
Wikipedia	5,555,212
Quiz dataset (ours)	988,478
Natural Questions	132,796
JSQuAD	62,859
SNOW(T15+T23)	62,758
JaQuAD	31,746
MKQA	3,318

total	126,744,763

Stage 2: Supervised Fine-tuning

To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following datasets.

Datasets

dataset	counts
JSNLI	141,388
NU-MNLI	67,987
Mr. TyDi (only Japanese subset)	3,697
Natural Questions (sampled)	20,000

total	233,072

Evaluation Results with JMTEB

Model	Max Tokens	Avg.	Retrieval	STS	Classification	Reranking	Clustering	PairClassification
OpenAI/text-embedding-3-large[^oai]	8191	74.05	74.48	82.52	77.58	93.58	53.32	62.35
cl-nagoya/ruri-large	512	73.31	73.02	83.13	77.43	92.99	51.82	62.29
pkshatech/GLuCoSE-base-ja-v2	512	72.23	73.36	82.96	74.21	93.01	48.65	62.37
pkshatech/RoSEtta-base-ja	1024	72.04	73.21	81.39	72.41	92.69	53.23	61.74
intfloat/multilingual-e5-large	512	70.90	70.98	79.70	72.89	92.96	51.24	62.15

Sarashina-Embedding-v1-1B (This model)	8192	75.50	77.61	82.71	78.37	93.74	53.86	62.00
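The Avg. column is not the plain mean of the six category columns; it is consistent with an average over the 16 individual datasets. The sketch below reproduces this model's Avg. from its category scores, assuming the per-category dataset counts shown in `DATASETS_PER_CATEGORY`. Those counts are an assumption about the JMTEB composition and are not stated in the table.

```python
from typing import Dict

# Assumed number of JMTEB datasets in each category (16 in total).
DATASETS_PER_CATEGORY: Dict[str, int] = {
    "Retrieval": 6,
    "STS": 2,
    "Classification": 4,
    "Reranking": 1,
    "Clustering": 2,
    "PairClassification": 1,
}


def dataset_weighted_average(category_scores: Dict[str, float]) -> float:
    """Average category scores, weighting each by its dataset count."""
    total_datasets = sum(DATASETS_PER_CATEGORY.values())
    weighted_sum = sum(
        category_scores[name] * count
        for name, count in DATASETS_PER_CATEGORY.items()
    )
    return weighted_sum / total_datasets


# Category scores for Sarashina-Embedding-v1-1B, copied from the table above.
this_model = {
    "Retrieval": 77.61,
    "STS": 82.71,
    "Classification": 78.37,
    "Reranking": 93.74,
    "Clustering": 53.86,
    "PairClassification": 62.00,
}
average = dataset_weighted_average(this_model)
print(f"{average:.2f}")
```

Because the category scores in the table are already rounded, recomputed averages can differ from the reported Avg. in the last digit.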

License

This model is licensed under the Sarashina Model NonCommercial License Agreement.

If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.