Model Card for st-polish-kartonberta-base-alpha-v1
This sentence-transformer model maps text into a 768-dimensional dense vector space. It is intended for tasks involving sentence and document similarity.
The model is released as an alpha version. Numerous enhancements could still boost its performance, such as tuning the training hyperparameters or extending the training duration (currently limited to a single epoch); the main constraint has been limited GPU resources.
Model Description
- Developed by: Bartłomiej Orlik, https://www.linkedin.com/in/bartłomiej-orlik/
- Model type: RoBERTa Sentence Transformer
- Language: Polish
- License: LGPL-3.0
- Trained from model: sdadas/polish-roberta-base-v2 (https://huggingface.co/sdadas/polish-roberta-base-v2)
How to Get Started with the Model
Use the code below to get started with the model.
Using Sentence-Transformers
You can use the model with sentence-transformers:
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('OrlikB/st-polish-kartonberta-base-alpha-v1')

text_1 = 'Jestem wielkim fanem opakowań tekturowych'
text_2 = 'Bardzo podobają mi się kartony'

# With normalize_embeddings=True, the dot product equals cosine similarity
embeddings_1 = model.encode(text_1, normalize_embeddings=True)
embeddings_2 = model.encode(text_2, normalize_embeddings=True)

similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
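Because the embeddings are L2-normalized, the same pattern extends naturally to ranking several documents against a query with a single matrix product. A minimal sketch (the Polish query and document strings here are illustrative, not from the model card):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('OrlikB/st-polish-kartonberta-base-alpha-v1')

# Illustrative query and documents
query = 'opakowania z tektury'
documents = [
    'Bardzo podobają mi się kartony',
    'Wczoraj padał deszcz',
    'Tektura falista jest lekka i wytrzymała',
]

query_emb = model.encode(query, normalize_embeddings=True)
doc_embs = model.encode(documents, normalize_embeddings=True)

# Dot products of normalized vectors are cosine similarities
scores = doc_embs @ query_emb
for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    print(f'{score:.3f}  {doc}')
```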
Using HuggingFace Transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('OrlikB/st-polish-kartonberta-base-alpha-v1')
model = AutoModel.from_pretrained('OrlikB/st-polish-kartonberta-base-alpha-v1')
model.eval()


def encode_text(text):
    encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Pool by taking the [CLS] token embedding, then L2-normalize
    sentence_embeddings = model_output[0][:, 0]
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.squeeze().numpy()


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


text_1 = 'Jestem wielkim fanem opakowań tekturowych'
text_2 = 'Bardzo podobają mi się kartony'

embeddings_1 = encode_text(text_1)
embeddings_2 = encode_text(text_2)

print(cosine_similarity(embeddings_1, embeddings_2))
```
*Note: The encode_text function above is for demonstration purposes. For the best performance, process texts in batches.
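With sentence-transformers, batching is built in: pass a list of texts to `model.encode(texts, batch_size=32, normalize_embeddings=True)`. For the plain transformers path, one option is a small wrapper that reuses the `tokenizer` and `model` loaded above; a rough sketch, where the `encode_batch` helper and the `batch_size` default are arbitrary choices rather than part of the model card:

```python
import torch


def encode_batch(texts, batch_size=32):
    """Encode a list of texts in batches; returns a (len(texts), 768) numpy array."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        encoded = tokenizer(batch, padding=True, truncation=True, return_tensors='pt', max_length=512)
        with torch.no_grad():
            output = model(**encoded)
        # [CLS] pooling + L2 normalization, as in encode_text above
        embeddings = torch.nn.functional.normalize(output[0][:, 0], p=2, dim=1)
        all_embeddings.append(embeddings)
    return torch.cat(all_embeddings).numpy()
```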
Evaluation
MTEB for Polish Language
| Rank | Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (26 datasets) | Classification Average (7 datasets) | Clustering Average (1 dataset) | Pair Classification Average (4 datasets) | Retrieval Average (11 datasets) | STS Average (3 datasets) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | multilingual-e5-large | 2.24 | 1024 | 514 | 58.25 | 60.51 | 24.06 | 84.58 | 47.82 | 67.52 |
| 2 | st-polish-kartonberta-base-alpha-v1 | 0.50 | 768 | 514 | 56.92 | 60.44 | 32.85 | 87.92 | 42.19 | 69.47 |
| 3 | multilingual-e5-base | 1.11 | 768 | 514 | 54.18 | 57.01 | 18.62 | 82.08 | 42.50 | 65.07 |
| 4 | multilingual-e5-small | 0.47 | 384 | 512 | 53.15 | 54.35 | 19.64 | 81.67 | 41.52 | 66.08 |
| 5 | st-polish-paraphrase-from-mpnet | 0.50 | 768 | 514 | 53.06 | 57.49 | 25.09 | 87.04 | 36.53 | 67.39 |
| 6 | st-polish-paraphrase-from-distilroberta | 0.50 | 768 | 514 | 52.65 | 58.55 | 31.11 | 87.00 | 33.96 | 68.78 |
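These scores can in principle be reproduced with the mteb package; a minimal sketch, assuming `pip install mteb` and that the task names below match the registry of your installed mteb version:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('OrlikB/st-polish-kartonberta-base-alpha-v1')

# Task names are assumptions based on the datasets reported below;
# check the task registry of your installed mteb version.
evaluation = MTEB(tasks=['CDSC-R', 'CDSC-E', 'AllegroReviews', '8TagsClustering'])
evaluation.run(model, output_folder='results/st-polish-kartonberta-base-alpha-v1')
```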
More Information
I developed this model as a personal scientific initiative.
I plan to start development of a new ST model. However, due to limited computational resources, I have suspended further work on a larger or enhanced version of the current model.
Evaluation results
All values below are self-reported scores on MTEB test sets.

Classification and clustering

| Dataset | Metric | Value |
|---|---|---|
| 8TagsClustering | v_measure | 32.852 |
| AllegroReviews | accuracy | 40.189 |
| AllegroReviews | f1 | 34.711 |
| CBD | accuracy | 67.690 |
| CBD | ap | 21.079 |
| CBD | f1 | 56.801 |

Pair classification: CDSC-E

| Similarity function | Accuracy | AP | F1 | Precision | Recall |
|---|---|---|---|---|---|
| cos_sim | 89.200 | 79.117 | 68.835 | 70.950 | 66.842 |
| dot | 89.200 | 79.117 | 68.835 | 70.950 | 66.842 |
| euclidean | 89.200 | 79.117 | 68.835 | 70.950 | 66.842 |
| manhattan | 89.100 | 79.122 | 69.022 | 71.348 | 66.842 |
| max | 89.200 | 79.122 | 69.022 | n/a | n/a |

STS: CDSC-R

| Similarity function | Pearson | Spearman |
|---|---|---|
| cos_sim | 91.415 | 92.127 |
| euclidean | 91.744 | 92.127 |
| manhattan | 91.667 | 92.058 |

Retrieval: ArguAna-PL

| Metric | @1 | @3 | @5 | @10 | @100 | @1000 |
|---|---|---|---|---|---|---|
| map | 30.939 | 43.220 | 45.616 | 47.468 | 48.303 | 48.308 |
| mrr | 31.863 | 43.492 | 46.006 | 47.829 | 48.664 | 48.670 |
| ndcg | 30.939 | 47.260 | 51.587 | 56.058 | 59.562 | 59.698 |
| precision | 30.939 | 19.654 | 13.898 | 8.329 | 0.984 | 0.100 |
| recall | 30.939 | 58.962 | 69.488 | 83.286 | 98.435 | 99.502 |

Retrieval: DBPedia-PL

| Metric | @1 | @3 | @5 | @10 | @100 | @1000 |
|---|---|---|---|---|---|---|
| map | 5.871 | 8.958 | 10.570 | 12.486 | 16.897 | 18.056 |
| mrr | 44.000 | 51.875 | 53.113 | 53.831 | 54.540 | 54.568 |
| ndcg | 34.625 | 29.471 | 28.364 | 26.996 | 31.053 | 38.208 |
| precision | 44.000 | 32.333 | 27.800 | 21.450 | 6.837 | 1.602 |
| recall | 5.871 | 10.214 | 13.364 | 17.318 | 36.854 | 60.469 |

Retrieval: FiQA-PL

| Metric | @1 | @3 | @5 | @10 | @100 | @1000 |
|---|---|---|---|---|---|---|
| map | 10.289 | 15.193 | 16.962 | 18.286 | 19.743 | 19.964 |
| mrr | 21.914 | 27.855 | 29.514 | 30.654 | 31.623 | 31.701 |
| ndcg | 21.914 | 20.962 | 22.553 | 24.733 | 31.254 | 35.617 |
| precision | 21.914 | 14.352 | 11.420 | 7.346 | 1.389 | 0.214 |
| recall | 10.289 | 19.457 | 24.767 | 31.459 | 56.854 | 83.722 |

Retrieval: HotpotQA-PL

| Metric | @1 | @3 | @5 | @10 | @100 | @1000 |
|---|---|---|---|---|---|---|
| map | 29.669 | 38.938 | 40.541 | 41.615 | 42.572 | 42.662 |
| mrr | 59.338 | 65.384 | 66.345 | 66.939 | 67.361 | 67.385 |
| ndcg | 59.338 | 46.289 | 48.581 | 50.607 | 54.343 | 56.286 |
| precision | 59.338 | 28.877 | 19.133 | 10.585 | 1.353 | 0.161 |
| recall | 29.669 | 43.315 | 47.833 | 52.924 | 67.657 | 80.628 |

Retrieval: MSMARCO-PL

| Metric | @1 | @3 | @5 | @10 | @100 | @1000 |
|---|---|---|---|---|---|---|
| map | 0.997 | 3.055 | 4.853 | 7.482 | 20.208 | 25.601 |
| mrr | 55.814 | 62.403 | 64.031 | 64.651 | 65.003 | 65.052 |
| ndcg | 44.186 | 45.829 | 46.477 | 43.250 | 40.515 | 48.345 |
| precision | 55.814 | 58.140 | 57.674 | 50.465 | 25.419 | 5.084 |
| recall | 0.997 | 3.472 | 5.545 | 8.986 | 33.221 | 58.837 |