Shuu12121/CodeSearch-ModernBERT-Snake

SentenceTransformer based on Shuu12121/CodeModernBERT-Snake🐍

This is a sentence-transformers model fine-tuned from Shuu12121/CodeModernBERT-Snake, a ModernBERT model for code that I pre-trained from scratch.
It is designed for code search and computes semantic similarity between code snippets and documentation.
Its maximum sequence length is 8192 tokens, so it can handle very long code snippets and documentation.
Although the model is small, at about 75 million parameters, it performs well on code search tasks.

Model Evaluation

CoIR Evaluation Results

This model scores 72.12 on the CodeSearchNet benchmark despite its small size.
This is comparable to Salesforce/SFR-Embedding-Code-400M_R (72.53), which has 400 million parameters.
Because the model focuses on code search, it does not support other tasks, and no scores are reported for them.
The following table compares it with well-known models on CodeSearchNet and shows that it scores competitively for its compact size.

Model	CodeSearchNet score
Shuu12121/CodeSearch-ModernBERT-Snake	72.12
Salesforce/SFR-Embedding-Code-2B_R	73.5
CodeSage-large-v2	94.26
Salesforce/SFR-Embedding-Code-400M_R	72.53
CodeSage-large	90.58
Voyage-Code-002	81.79
E5-Mistral	54.25
E5-Base-v2	67.99
OpenAI-Ada-002	74.21
BGE-Base-en-v1.5	69.6
BGE-M3	43.23
UniXcoder	60.2
GTE-Base-en-v1.5	43.35
Contriever	34.72
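
The card does not name the metric behind these scores. CoIR results are commonly reported as NDCG@10 in percent, so the sketch below assumes that metric. It uses binary relevance (1 for the matching snippet, 0 otherwise) and assumes the ranked list already contains every relevant item; the function names are illustrative and not taken from any benchmark code.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    Assumption: ranked_relevances lists the relevance of every retrieved item
    in ranked order and includes all relevant items, so the ideal ordering is
    just this list sorted in descending order.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# One relevant snippet per query, as a toy example.
print(ndcg_at_k([1, 0, 0]))  # relevant snippet ranked first -> 1.0
print(ndcg_at_k([0, 0, 1]))  # relevant snippet ranked third -> 0.5
```

Under this assumption, a benchmark score would be the mean of the per-query values, multiplied by 100.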

Model Details

Model Type: Sentence Transformer
Base Model: Shuu12121/CodeModernBERT-Snake
Maximum Sequence Length: 8192 tokens
Output Dimensions: 512
Similarity Function: Cosine Similarity
License: Apache-2.0
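
As a minimal sketch of the similarity function listed above, cosine similarity can be written in plain Python as follows. The 3-dimensional vectors are made-up stand-ins for the model's 512-dimensional embeddings, and the function name is illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values, not real embeddings).
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # parallel to query, different length
orthogonal = [0.0, 3.0, 0.0]       # shares no direction with query

print(cosine_similarity(query, same_direction))  # close to 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

Because the score depends only on direction, two embeddings of different magnitude can still be scored as near-identical.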

Usage

Installation

To install Sentence Transformers, run the following command:

pip install -U sentence-transformers

Model Loading and Inference

from sentence_transformers import SentenceTransformer

# Download and load the model
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Snake")

# Example inputs: one natural-language query and two code snippets
sentences = [
    'Encrypts the zip file',
    'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n    \n    pgp_keys = grok_keys(config)\n    icefile_prefix = "aomi-%s" % \\\n                     os.path.basename(os.path.dirname(opt.secretfile))\n    if opt.icefile_prefix:\n        icefile_prefix = opt.icefile_prefix\n\n    timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n                              datetime.datetime.now().timetuple())\n    ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n    if not encrypt(zip_filename, ice_file, pgp_keys):\n        raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n    return ice_file',
    'def transform(self, sents):\n        \n\n        def convert(tokens):\n            return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n        if self.vocab is None:\n            raise Exception(\n                "Must run .fit() for .fit_transform() before " "calling .transform()."\n            )\n\n        seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n        X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n        return X',
]

# Generate embeddings (returned as a NumPy array by default)
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 512)

# Calculate pairwise similarity scores (returned as a torch.Tensor)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # torch.Size([3, 3])
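
For code search, the similarity scores are typically used to rank candidate snippets against a query. The following is a rough sketch of that step, not the model's own retrieval code: it normalizes each vector so that a dot product equals cosine similarity, then keeps the top k. The function names and the toy vectors are assumptions for illustration; in practice the vectors would be rows of the array returned by `model.encode`.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def top_k(query_vec, corpus_vecs, k=2):
    """Return up to k (index, score) pairs, most similar first."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), index)
        for index, vec in enumerate(corpus_vecs)
    )
    return [(index, score) for score, index in heapq.nlargest(k, scored)]

# Made-up 3-dimensional vectors standing in for real embeddings.
query_vec = [0.9, 0.1, 0.0]
corpus_vecs = [
    [0.0, 1.0, 0.0],  # hypothetical snippet 0
    [1.0, 0.0, 0.0],  # hypothetical snippet 1
    [0.5, 0.5, 0.0],  # hypothetical snippet 2
]

results = top_k(query_vec, corpus_vecs, k=2)
for index, score in results:
    print(f"snippet {index}: {score:.3f}")
```

With these toy values, snippet 1 ranks first and snippet 2 second. For a large corpus, normalizing the corpus vectors once ahead of time avoids repeating that work for every query.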

Library Versions

Python: 3.11.11
Sentence Transformers: 3.4.1
Transformers: 4.50.0
PyTorch: 2.6.0+cu124
Accelerate: 1.5.2
Datasets: 3.4.1
Tokenizers: 0.21.1

Citation

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
