Ruri: Japanese General Text Embeddings

Usage

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-small", trust_remote_code=True)

# Don't forget to add the prefix "クエリ: " for query-side or "文章: " for passage-side texts.
sentences = [
    "クエリ: 瑠璃色はどんな色?",
    "文章: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# torch.Size([4, 768])

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
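
For retrieval, it can help to keep queries and passages in separate lists and to apply the prefixes programmatically. The following is a minimal sketch, not part of the official usage: the helper names `add_prefix` and `best_passage_per_query` are hypothetical, and the score matrix is a made-up placeholder rather than model output.

```python
QUERY_PREFIX = "クエリ: "
PASSAGE_PREFIX = "文章: "


def add_prefix(texts, prefix):
    """Prepend the role prefix unless the text already starts with it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def best_passage_per_query(scores):
    """For each row of a (queries x passages) score matrix,
    return the index of the highest-scoring passage."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


queries = add_prefix(["瑠璃色はどんな色?"], QUERY_PREFIX)
passages = add_prefix(
    ["瑠璃色は、紫みを帯びた濃い青。", "ワシ、タカは猛禽類である。"],
    PASSAGE_PREFIX,
)

# With the model loaded as in the snippet above, the scores could be computed as:
#   q = model.encode(queries, convert_to_tensor=True)
#   p = model.encode(passages, convert_to_tensor=True)
#   scores = F.cosine_similarity(q.unsqueeze(1), p.unsqueeze(0), dim=2).tolist()
# Placeholder scores (invented for illustration, NOT model output):
example_scores = [[0.9, 0.3]]
print(best_passage_per_query(example_scores))
```

Keeping the prefix logic in one helper avoids silently encoding a text without its prefix, which the comment in the usage snippet warns against.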

Benchmarks

JMTEB

Evaluated with JMTEB.

Model                            #Param.  Avg.   Retrieval  STS    Classification  Reranking  Clustering  PairClassification
cl-nagoya/sup-simcse-ja-base     111M     68.56  49.64      82.05  73.47           91.83      51.79       62.57
cl-nagoya/sup-simcse-ja-large    337M     66.51  37.62      83.18  73.73           91.48      50.56       62.51
cl-nagoya/unsup-simcse-ja-base   111M     65.07  40.23      78.72  73.07           91.16      44.77       62.44
cl-nagoya/unsup-simcse-ja-large  337M     66.27  40.53      80.56  74.66           90.95      48.41       62.49
pkshatech/GLuCoSE-base-ja        133M     70.44  59.02      78.71  76.82           91.90      49.78       66.39
sentence-transformers/LaBSE      472M     64.70  40.12      76.56  72.66           91.63      44.88       62.33
intfloat/multilingual-e5-small   118M     69.52  67.27      80.07  67.62           93.03      46.91       62.19
intfloat/multilingual-e5-base    278M     70.12  68.21      79.84  69.30           92.85      48.26       62.26
intfloat/multilingual-e5-large   560M     71.65  70.98      79.70  72.89           92.96      51.24       62.15
OpenAI/text-embedding-ada-002    -        69.48  64.38      79.02  69.75           93.04      48.30       62.40
OpenAI/text-embedding-3-small    -        70.86  66.39      79.46  73.06           92.92      51.06       62.27
OpenAI/text-embedding-3-large    -        73.97  74.48      82.52  77.58           93.58      53.32       62.35
Ruri-Small                       68M      71.53  69.41      82.79  76.22           93.00      51.19       62.11
Ruri-Base                        111M     71.91  69.82      82.87  75.58           92.91      54.16       62.38
Ruri-Large                       337M     73.31  73.02      83.13  77.43           92.99      51.82       62.29

Model Details

Model Description

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
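
The Pooling module above has only `pooling_mode_mean_tokens` enabled, so the sentence embedding is the average of the token embeddings. The sketch below illustrates that idea in framework-free Python, assuming padding tokens are excluded through an attention mask. The function name `mean_pool` and the toy vectors are hypothetical, and this is not the library's actual implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1.

    Padding tokens (mask 0) are ignored, so they do not dilute the mean.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


# Toy example: three 2-dimensional token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

In the real model each token vector has 768 dimensions, matching `word_embedding_dimension` in the configuration above.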

Framework Versions

  • Python: 3.10.13
  • Sentence Transformers: 3.0.0
  • Transformers: 4.41.2
  • PyTorch: 2.3.1+cu118
  • Accelerate: 0.30.1
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

License

This model is published under the Apache License, Version 2.0.
