kor-static-embedding-64

An ultra-lightweight Static Embedding model specialized for Korean: 9MB, 64 dimensions.

This variant is derived from kekeappa/kor-static-embedding-512, which was trained with Matryoshka learning, by truncating it to 64 dimensions. The same model family comes in four dimension sizes; choose the one that fits your use case:

Dimensions  Size  Use
64          9MB   🌐 Browser, mobile, edge
128         17MB  ⚡ Lightweight search and classification
256         34MB  ⚖️ Best value for its size
512         68MB  🎯 Highest accuracy
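
Since the 64-dimensional variant is described as a truncation of the Matryoshka-trained 512-dimensional model, the operation can be sketched as keeping the leading components of each vector and re-normalizing. The snippet below is only an illustration: it uses random NumPy vectors as stand-ins for real embeddings, and the helper name `truncate_embeddings` is hypothetical, not part of the model's API.

```python
import numpy as np

def truncate_embeddings(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and re-normalize to unit length."""
    cut = emb[:, :dim]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
full = rng.normal(size=(3, 512))  # stand-in for 512-dimensional embeddings
small = truncate_embeddings(full, 64)
print(small.shape)  # (3, 64)
```

At F32 precision a stored 64-dimensional vector takes 256 bytes, against 2048 bytes for a 512-dimensional one, which is the main reason to pick the smaller variants for browser or edge use.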

Performance (KorSTS / KLUE-STS)

Benchmark     Pearson  Spearman
KorSTS-test   0.7382   0.7337
KorSTS-valid  -        0.7885
KLUE-STS-val  -        0.6582
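
For readers unfamiliar with the two columns: Pearson measures linear agreement between predicted similarities and human labels, while Spearman measures agreement of their rank order. The following is a minimal NumPy sketch with made-up numbers, not the evaluation script used for the table; the rank helper does not average ties, which a full implementation would.

```python
import numpy as np

def pearson(x, y) -> float:
    """Linear correlation between two score lists."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def rank(values) -> np.ndarray:
    """Ranks 0..n-1 by ascending value (ties are not averaged in this sketch)."""
    values = np.asarray(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[np.argsort(values)] = np.arange(len(values))
    return ranks

def spearman(x, y) -> float:
    """Pearson correlation computed on ranks instead of raw values."""
    return pearson(rank(x), rank(y))

gold = [0.0, 1.2, 2.5, 3.8, 5.0]        # made-up human similarity labels
pred = [0.05, 0.10, 0.30, 0.60, 0.95]   # made-up model cosine similarities
print(pearson(gold, pred), spearman(gold, pred))
```

In this toy data the predictions are in the same order as the labels, so the Spearman score is perfect even though the relationship is not exactly linear.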

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kekeappa/kor-static-embedding-64")
# Sample Korean inputs: "Korean sentence" and "embedding test"
emb = model.encode(["한국어 문장", "임베딩 테스트"], normalize_embeddings=True)
print(emb.shape)  # (2, 64)
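
Because `normalize_embeddings=True` returns unit-length rows, cosine similarity reduces to a plain dot product. The sketch below shows that step on a random stand-in array so it runs without downloading the model; `cosine_matrix` is a hypothetical helper name, and in practice `emb` would be the array returned by `model.encode` above.

```python
import numpy as np

def cosine_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for rows that are already L2-normalized."""
    return emb @ emb.T

# Stand-in for the (2, 64) array returned by model.encode(..., normalize_embeddings=True).
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

sim = cosine_matrix(emb)
print(sim.shape)  # (2, 2)
```

The off-diagonal entry `sim[0, 1]` is the similarity between the two sentences; the diagonal is 1 up to floating-point error.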

Features

  • Architecture: StaticEmbedding (model2vec family), with no transformer attention
  • Inference: optimized for CPU, no GPU required
  • Speed: under 1 ms per single query (fast even in the browser)
  • Korean-English compatibility: trained cross-lingually, so Korean queries can retrieve English documents
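
The architecture point above explains the speed: a static embedding model is essentially a lookup table followed by pooling. The toy sketch below illustrates that idea under stated assumptions. The four-word vocabulary, the random table, and the `encode` helper are all made up for illustration; the real model ships its own tokenizer and trained weights.

```python
import numpy as np

# Toy vocabulary and embedding table (hypothetical, for illustration only).
vocab = {"한국어": 0, "문장": 1, "임베딩": 2, "테스트": 3}
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), 64)).astype(np.float32)

def encode(tokens: list[str]) -> np.ndarray:
    """Look up each known token's vector and average them; no attention, no context."""
    ids = [vocab[t] for t in tokens if t in vocab]
    if not ids:
        return np.zeros(table.shape[1], dtype=np.float32)
    vec = table[ids].mean(axis=0)
    return vec / np.linalg.norm(vec)

print(encode(["한국어", "문장"]).shape)  # (64,)
```

One consequence of mean pooling is that word order does not affect the result: `encode(["문장", "한국어"])` yields the same vector. That is the trade-off for skipping attention.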

Training method

4-stage training:

  1. Distillation initialization: vocab embeddings of the BM-K/KoSimCSE-roberta-multitask teacher → PCA + Zipf weighting
  2. KorNLI MNRL (MultipleNegativesRankingLoss): kakaobrain/kor_nli (multi_nli + snli), 277K triplets
  3. Cross-lingual MNRL: OPUS-100 ko-en parallel data, 200K pairs
  4. Matryoshka regression: KorSTS + KLUE-STS + English STS-B translated with NLLB
    • 64/128/256/512 dimensions optimized jointly (MatryoshkaLoss)
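
The joint optimization in stage 4 can be pictured as applying the same regression loss to several truncated prefixes of each embedding and summing the results. The NumPy sketch below is one possible reading, not the authors' training code: it assumes a mean-squared error between cosine similarity and a gold score scaled to [0, 1], equal weights per dimension, and random vectors in place of model outputs.

```python
import numpy as np

DIMS = (64, 128, 256, 512)

def cosine_rows(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two equally shaped matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def matryoshka_regression_loss(emb_a, emb_b, gold, dims=DIMS) -> float:
    """Sum of MSE(cosine, gold) over each truncated prefix of the embeddings."""
    total = 0.0
    for d in dims:
        pred = cosine_rows(emb_a[:, :d], emb_b[:, :d])
        total += float(np.mean((pred - gold) ** 2))
    return total

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(4, 512))        # stand-in for sentence A embeddings
emb_b = rng.normal(size=(4, 512))        # stand-in for sentence B embeddings
gold = np.array([0.0, 0.4, 0.8, 1.0])    # made-up similarity labels in [0, 1]
print(matryoshka_regression_loss(emb_a, emb_b, gold))
```

Because every prefix contributes to the loss, the leading 64 components are pushed to be useful on their own, which is what makes the truncated variants workable.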

Training code: https://github.com/johunsang/kor-static-embedding-512
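
The repository is the authoritative source for the pipeline. As a rough illustration of what stage 1 (PCA + Zipf weighting) might look like, the sketch below reduces a matrix of teacher token embeddings with PCA and then scales each row by a rank-based weight. Several things here are assumptions rather than details from the model card: the vocabulary is taken to be ordered from most to least frequent, the weight is log(1 + rank), the teacher width of 768 is a placeholder, and random data stands in for the teacher's embeddings.

```python
import numpy as np

def pca_reduce(x: np.ndarray, dim: int) -> np.ndarray:
    """Project rows of x onto their top `dim` principal components."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

def zipf_weight(x: np.ndarray) -> np.ndarray:
    """Scale row i by log(1 + rank), rank starting at 1 (row 0 = assumed most frequent token)."""
    ranks = np.arange(1, len(x) + 1)
    return x * np.log(1 + ranks)[:, None]

rng = np.random.default_rng(0)
teacher_vocab = rng.normal(size=(1000, 768))  # stand-in for teacher token embeddings
static_table = zipf_weight(pca_reduce(teacher_vocab, 512))
print(static_table.shape)  # (1000, 512)
```

The weighting down-scales very frequent tokens relative to rare ones, so that function words contribute less when token vectors are averaged into a sentence embedding.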

License

Apache 2.0

Model size: 2.05M params (Safetensors, tensor type F32)