metadata
language:
  - ja
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
widget: []
pipeline_tag: sentence-similarity
datasets:
  - hpprc/emb
  - hpprc/mqa-ja
  - google-research-datasets/paws-x
base_model: pkshatech/GLuCoSE-base-ja
license: apache-2.0

SentenceTransformer

This is a sentence-transformers model for Japanese text. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

The model is based on GLuCoSE and was further fine-tuned in the following steps.

Step 1: Ensemble distillation

  • Embedding representations were distilled using E5-mistral, gte-Qwen2, and mE5-large as teacher models.
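
The card does not state the distillation objective, so the following is only a minimal sketch of one common way to distill from several teachers. It assumes the student is trained to match the average of the teachers' pairwise cosine-similarity matrices, which sidesteps differences in embedding width. The function names, the MSE objective, and the toy tensor sizes are all assumptions for illustration, not the recipe used for this model.

```python
import torch
import torch.nn.functional as F


def cosine_matrix(emb: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between all rows of `emb`."""
    emb = F.normalize(emb, dim=-1)
    return emb @ emb.T


def ensemble_distillation_loss(student_emb, teacher_embs):
    """MSE between the student's similarity matrix and the teachers' average.

    Comparing similarity matrices (rather than raw vectors) lets teachers
    with different embedding widths be combined. Assumed objective,
    shown for illustration only.
    """
    target = torch.stack([cosine_matrix(t) for t in teacher_embs]).mean(dim=0)
    return F.mse_loss(cosine_matrix(student_emb), target.detach())


# Toy batch: 4 texts, a 768-d student and two teachers of other widths.
# Random tensors stand in for real model outputs.
torch.manual_seed(0)
student = torch.randn(4, 768, requires_grad=True)
teachers = [torch.randn(4, 4096), torch.randn(4, 1024)]
loss = ensemble_distillation_loss(student, teachers)
loss.backward()  # gradients flow only into the student embeddings
```

In a real training loop the student tensor would come from the model being fine-tuned and the teacher tensors would be precomputed or produced under `torch.no_grad()`.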

Step 2: Contrastive learning

  • Triples were created from JSNLI, MNLI, PAWS-X, JSeM and Mr.TyDi and used for training.
  • This training aimed to improve the overall performance as a sentence embedding model.
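
The exact loss used with these triples is not given here. As a rough illustration, the sketch below shows an InfoNCE-style objective over (anchor, positive, hard negative) triples, which is a common choice for this kind of training. The function name `triplet_info_nce` and the temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F


def triplet_info_nce(anchor, positive, negative, temperature=0.05):
    """InfoNCE over (anchor, positive, hard negative) triples.

    For each anchor, the candidates are every positive and every hard
    negative in the batch; the anchor's own positive is the correct class.
    The temperature is an assumed value, not one taken from the model card.
    """
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(torch.cat([positive, negative]), dim=-1)
    logits = anchor @ candidates.T / temperature  # shape [B, 2B]
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)


# Toy check with random vectors standing in for sentence embeddings.
torch.manual_seed(0)
a = torch.randn(8, 768)
near = a + 0.01 * torch.randn(8, 768)  # almost identical to the anchors
far = torch.randn(8, 768)              # unrelated vectors

good = triplet_info_nce(a, positive=near, negative=far)  # low loss
bad = triplet_info_nce(a, positive=far, negative=near)   # high loss
```

The loss is small when each anchor is closest to its own positive and large when a hard negative is closer, which is the behavior the training step relies on.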

Step 3: Search-specific contrastive learning

  • To make the model more robust on retrieval tasks, additional two-stage training on question-answer (QA) data was conducted.
  • In the first stage, the synthetic dataset auto-wiki was used for training, while in the second stage, Japanese Wikipedia Human Retrieval, Mr.TyDi, MIRACL, JQaRA, MQA, Quiz Works and Quiz No Mori were used.

Model Description

  • Model Type: Sentence Transformer
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Usage

Direct Usage (Sentence Transformers)

You can perform inference using SentenceTransformers with the following code:

from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# Download from the 🤗 Hub
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([4, 768])

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
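
For retrieval, only the query-to-passage entries of this matrix matter. The following sketch reuses the similarity values printed above (so it runs without downloading the model) and ranks the two passages for each query. The helper `rank_passages` is a hypothetical name introduced for illustration.

```python
# Similarity matrix copied from the printed output above.
similarities = [
    [1.0000, 0.6050, 0.4341, 0.5537],
    [0.6050, 1.0000, 0.5018, 0.6815],
    [0.4341, 0.5018, 1.0000, 0.7534],
    [0.5537, 0.6815, 0.7534, 1.0000],
]
query_rows = [0, 2]    # inputs that carried the "query: " prefix
passage_rows = [1, 3]  # inputs that carried the "passage: " prefix


def rank_passages(sim_matrix, query_row, passage_rows):
    """Return passage row indices sorted by descending similarity to one query."""
    return sorted(passage_rows, key=lambda p: sim_matrix[query_row][p], reverse=True)


for q in query_rows:
    print(q, rank_passages(similarities, q, passage_rows))
# 0 [1, 3]
# 2 [3, 1]
```

Each query ranks its matching passage first: the PKSHA question prefers the passage about deploying algorithms (row 1), and the mountain question prefers the Mt. Fuji passage (row 3).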

Direct Usage (Transformers)

You can perform inference using Transformers with the following code:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only.
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb

# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
model = AutoModel.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# torch.Size([4, 768])

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
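
The attention mask matters because `padding=True` pads shorter inputs up to the longest one in the batch. As a small illustration with made-up hidden states (not real model outputs), the sketch below compares the masked mean pooling used above with a plain mean over all positions.

```python
import torch


def mean_pooling(last_hidden_states, attention_mask):
    # Same computation as the mean_pooling function above.
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    return emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)


# One sequence with two real tokens followed by one padding position.
# The large values at the padding position are arbitrary toy numbers.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

masked = mean_pooling(hidden, mask)  # averages only the two real tokens
naive = hidden.mean(dim=1)           # also averages the padding position

print(masked)  # tensor([[2., 3.]])
```

The masked result depends only on the real tokens, while the plain mean is pulled toward whatever values sit at the padding position.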

Benchmarks

Retrieval

Evaluated with MIRACL-ja, JQaRA, JaCWIR, and MLDR-ja.

| Model | Size | MIRACL Recall@5 | JQaRA nDCG@10 | JaCWIR MAP@10 | MLDR nDCG@10 |
|---|---|---|---|---|---|
| OpenAI/text-embedding-3-small | - | processing... | 38.8 | 81.6 | processing... |
| OpenAI/text-embedding-3-large | - | processing... | processing... | processing... | processing... |
| intfloat/multilingual-e5-large | 0.6B | 89.2 | 55.4 | 87.6 | 29.8 |
| cl-nagoya/ruri-large | 0.3B | 78.7 | 62.4 | 85.0 | 37.5 |
| intfloat/multilingual-e5-base | 0.3B | 84.2 | 47.2 | 85.3 | 25.4 |
| cl-nagoya/ruri-base | 0.1B | 74.3 | 58.1 | 84.6 | 35.3 |
| pkshatech/GLuCoSE-base-ja | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
| GLuCoSE v2 | 0.1B | 85.5 | 60.6 | 85.3 | 33.8 |
Note: Results for OpenAI/text-embedding-3-small on JQaRA and JaCWIR are quoted from the results published by JQaRA and JaCWIR.
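
For readers unfamiliar with the metrics in the column headers, the following is a minimal sketch of Recall@k, nDCG@k, and AP@k for a single query with binary relevance. The scores in the table appear to be on a 0 to 100 scale, so these values would be multiplied by 100 and averaged over queries. Each benchmark's official evaluation script may differ in details (for example graded relevance or the normalization of AP), and the document IDs below are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary relevance (1 if relevant, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k for one query; MAP@k is the mean of this value over all queries."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant_ids), k)


ranked = ["d3", "d1", "d7", "d2", "d9"]  # hypothetical system output
relevant = {"d1", "d2"}                   # hypothetical gold labels

print(recall_at_k(ranked, relevant, 5))              # 1.0
print(average_precision_at_k(ranked, relevant, 10))  # 0.5
print(round(ndcg_at_k(ranked, relevant, 10), 4))     # 0.6509
```

Recall@k only asks whether relevant documents are retrieved at all, while nDCG@k and MAP@k also reward placing them near the top of the ranking.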

JMTEB

Evaluated with JMTEB.

| Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| OpenAI/text-embedding-3-small | - | 70.86 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 73.97 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| intfloat/multilingual-e5-large | 0.6B | 71.65 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| cl-nagoya/ruri-large | 0.3B | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| intfloat/multilingual-e5-base | 0.3B | 70.12 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| cl-nagoya/ruri-base | 0.1B | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| pkshatech/GLuCoSE-base-ja | 0.1B | 70.44 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| GLuCoSE v2 | 0.1B | 72.22 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the JMTEB leaderboard. Results for ruri are quoted from the cl-nagoya/ruri-base model card.

Authors

Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe

License

This model is published under the Apache License, Version 2.0.