---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- mteb
model-index:
- name: mmlw-e5-large
results:
- task:
type: Clustering
dataset:
type: PL-MTEB/8tags-clustering
name: MTEB 8TagsClustering
config: default
split: test
revision: None
metrics:
- type: v_measure
value: 30.623921415441725
- task:
type: Classification
dataset:
type: PL-MTEB/allegro-reviews
name: MTEB AllegroReviews
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 37.683896620278325
- type: f1
value: 34.19193027014284
- task:
type: Retrieval
dataset:
type: arguana-pl
name: MTEB ArguAna-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 38.407000000000004
- type: map_at_10
value: 55.147
- type: map_at_100
value: 55.757
- type: map_at_1000
value: 55.761
- type: map_at_3
value: 51.268
- type: map_at_5
value: 53.696999999999996
- type: mrr_at_1
value: 40.043
- type: mrr_at_10
value: 55.840999999999994
- type: mrr_at_100
value: 56.459
- type: mrr_at_1000
value: 56.462999999999994
- type: mrr_at_3
value: 52.074
- type: mrr_at_5
value: 54.364999999999995
- type: ndcg_at_1
value: 38.407000000000004
- type: ndcg_at_10
value: 63.248000000000005
- type: ndcg_at_100
value: 65.717
- type: ndcg_at_1000
value: 65.79
- type: ndcg_at_3
value: 55.403999999999996
- type: ndcg_at_5
value: 59.760000000000005
- type: precision_at_1
value: 38.407000000000004
- type: precision_at_10
value: 8.862
- type: precision_at_100
value: 0.991
- type: precision_at_1000
value: 0.1
- type: precision_at_3
value: 22.451
- type: precision_at_5
value: 15.576
- type: recall_at_1
value: 38.407000000000004
- type: recall_at_10
value: 88.62
- type: recall_at_100
value: 99.075
- type: recall_at_1000
value: 99.57300000000001
- type: recall_at_3
value: 67.354
- type: recall_at_5
value: 77.881
- task:
type: Classification
dataset:
type: PL-MTEB/cbd
name: MTEB CBD
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 66.14999999999999
- type: ap
value: 21.69513674684204
- type: f1
value: 56.48142830893528
- task:
type: PairClassification
dataset:
type: PL-MTEB/cdsce-pairclassification
name: MTEB CDSC-E
config: default
split: test
revision: None
metrics:
- type: cos_sim_accuracy
value: 89.4
- type: cos_sim_ap
value: 76.83228768203222
- type: cos_sim_f1
value: 65.3658536585366
- type: cos_sim_precision
value: 60.909090909090914
- type: cos_sim_recall
value: 70.52631578947368
- type: dot_accuracy
value: 84.1
- type: dot_ap
value: 57.26072201751864
- type: dot_f1
value: 62.75395033860045
- type: dot_precision
value: 54.9407114624506
- type: dot_recall
value: 73.15789473684211
- type: euclidean_accuracy
value: 89.4
- type: euclidean_ap
value: 76.59095263388942
- type: euclidean_f1
value: 65.21739130434783
- type: euclidean_precision
value: 60.26785714285714
- type: euclidean_recall
value: 71.05263157894737
- type: manhattan_accuracy
value: 89.4
- type: manhattan_ap
value: 76.58825999753456
- type: manhattan_f1
value: 64.72019464720195
- type: manhattan_precision
value: 60.18099547511312
- type: manhattan_recall
value: 70.0
- type: max_accuracy
value: 89.4
- type: max_ap
value: 76.83228768203222
- type: max_f1
value: 65.3658536585366
- task:
type: STS
dataset:
type: PL-MTEB/cdscr-sts
name: MTEB CDSC-R
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 93.73949495291659
- type: cos_sim_spearman
value: 93.50397366192922
- type: euclidean_pearson
value: 92.47498888987636
- type: euclidean_spearman
value: 93.39315936230747
- type: manhattan_pearson
value: 92.47250250777654
- type: manhattan_spearman
value: 93.36739690549109
- task:
type: Retrieval
dataset:
type: dbpedia-pl
name: MTEB DBPedia-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 8.434
- type: map_at_10
value: 18.424
- type: map_at_100
value: 26.428
- type: map_at_1000
value: 28.002
- type: map_at_3
value: 13.502
- type: map_at_5
value: 15.577
- type: mrr_at_1
value: 63.0
- type: mrr_at_10
value: 72.714
- type: mrr_at_100
value: 73.021
- type: mrr_at_1000
value: 73.028
- type: mrr_at_3
value: 70.75
- type: mrr_at_5
value: 72.3
- type: ndcg_at_1
value: 52.75
- type: ndcg_at_10
value: 39.839999999999996
- type: ndcg_at_100
value: 44.989000000000004
- type: ndcg_at_1000
value: 52.532999999999994
- type: ndcg_at_3
value: 45.198
- type: ndcg_at_5
value: 42.015
- type: precision_at_1
value: 63.0
- type: precision_at_10
value: 31.05
- type: precision_at_100
value: 10.26
- type: precision_at_1000
value: 1.9879999999999998
- type: precision_at_3
value: 48.25
- type: precision_at_5
value: 40.45
- type: recall_at_1
value: 8.434
- type: recall_at_10
value: 24.004
- type: recall_at_100
value: 51.428
- type: recall_at_1000
value: 75.712
- type: recall_at_3
value: 15.015
- type: recall_at_5
value: 18.282999999999998
- task:
type: Retrieval
dataset:
type: fiqa-pl
name: MTEB FiQA-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 19.088
- type: map_at_10
value: 31.818
- type: map_at_100
value: 33.689
- type: map_at_1000
value: 33.86
- type: map_at_3
value: 27.399
- type: map_at_5
value: 29.945
- type: mrr_at_1
value: 38.117000000000004
- type: mrr_at_10
value: 47.668
- type: mrr_at_100
value: 48.428
- type: mrr_at_1000
value: 48.475
- type: mrr_at_3
value: 45.242
- type: mrr_at_5
value: 46.716
- type: ndcg_at_1
value: 38.272
- type: ndcg_at_10
value: 39.903
- type: ndcg_at_100
value: 46.661
- type: ndcg_at_1000
value: 49.625
- type: ndcg_at_3
value: 35.921
- type: ndcg_at_5
value: 37.558
- type: precision_at_1
value: 38.272
- type: precision_at_10
value: 11.358
- type: precision_at_100
value: 1.8190000000000002
- type: precision_at_1000
value: 0.23500000000000001
- type: precision_at_3
value: 24.434
- type: precision_at_5
value: 18.395
- type: recall_at_1
value: 19.088
- type: recall_at_10
value: 47.355999999999995
- type: recall_at_100
value: 72.451
- type: recall_at_1000
value: 90.257
- type: recall_at_3
value: 32.931
- type: recall_at_5
value: 39.878
- task:
type: Retrieval
dataset:
type: hotpotqa-pl
name: MTEB HotpotQA-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 39.095
- type: map_at_10
value: 62.529
- type: map_at_100
value: 63.425
- type: map_at_1000
value: 63.483000000000004
- type: map_at_3
value: 58.887
- type: map_at_5
value: 61.18599999999999
- type: mrr_at_1
value: 78.123
- type: mrr_at_10
value: 84.231
- type: mrr_at_100
value: 84.408
- type: mrr_at_1000
value: 84.414
- type: mrr_at_3
value: 83.286
- type: mrr_at_5
value: 83.94
- type: ndcg_at_1
value: 78.19
- type: ndcg_at_10
value: 70.938
- type: ndcg_at_100
value: 73.992
- type: ndcg_at_1000
value: 75.1
- type: ndcg_at_3
value: 65.863
- type: ndcg_at_5
value: 68.755
- type: precision_at_1
value: 78.19
- type: precision_at_10
value: 14.949000000000002
- type: precision_at_100
value: 1.733
- type: precision_at_1000
value: 0.188
- type: precision_at_3
value: 42.381
- type: precision_at_5
value: 27.711000000000002
- type: recall_at_1
value: 39.095
- type: recall_at_10
value: 74.747
- type: recall_at_100
value: 86.631
- type: recall_at_1000
value: 93.923
- type: recall_at_3
value: 63.571999999999996
- type: recall_at_5
value: 69.27799999999999
- task:
type: Retrieval
dataset:
type: msmarco-pl
name: MTEB MSMARCO-PL
config: default
split: validation
revision: None
metrics:
- type: map_at_1
value: 19.439999999999998
- type: map_at_10
value: 30.264000000000003
- type: map_at_100
value: 31.438
- type: map_at_1000
value: 31.495
- type: map_at_3
value: 26.735
- type: map_at_5
value: 28.716
- type: mrr_at_1
value: 19.914
- type: mrr_at_10
value: 30.753999999999998
- type: mrr_at_100
value: 31.877
- type: mrr_at_1000
value: 31.929000000000002
- type: mrr_at_3
value: 27.299
- type: mrr_at_5
value: 29.254
- type: ndcg_at_1
value: 20.014000000000003
- type: ndcg_at_10
value: 36.472
- type: ndcg_at_100
value: 42.231
- type: ndcg_at_1000
value: 43.744
- type: ndcg_at_3
value: 29.268
- type: ndcg_at_5
value: 32.79
- type: precision_at_1
value: 20.014000000000003
- type: precision_at_10
value: 5.814
- type: precision_at_100
value: 0.8710000000000001
- type: precision_at_1000
value: 0.1
- type: precision_at_3
value: 12.426
- type: precision_at_5
value: 9.238
- type: recall_at_1
value: 19.439999999999998
- type: recall_at_10
value: 55.535000000000004
- type: recall_at_100
value: 82.44399999999999
- type: recall_at_1000
value: 94.217
- type: recall_at_3
value: 35.963
- type: recall_at_5
value: 44.367000000000004
- task:
type: Classification
dataset:
type: mteb/amazon_massive_intent
name: MTEB MassiveIntentClassification (pl)
config: pl
split: test
revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
metrics:
- type: accuracy
value: 72.01412239408205
- type: f1
value: 70.04544187503352
- task:
type: Classification
dataset:
type: mteb/amazon_massive_scenario
name: MTEB MassiveScenarioClassification (pl)
config: pl
split: test
revision: 7d571f92784cd94a019292a1f45445077d0ef634
metrics:
- type: accuracy
value: 75.26899798251513
- type: f1
value: 75.55876166863844
- task:
type: Retrieval
dataset:
type: nfcorpus-pl
name: MTEB NFCorpus-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 5.772
- type: map_at_10
value: 12.708
- type: map_at_100
value: 16.194
- type: map_at_1000
value: 17.630000000000003
- type: map_at_3
value: 9.34
- type: map_at_5
value: 10.741
- type: mrr_at_1
value: 43.344
- type: mrr_at_10
value: 53.429
- type: mrr_at_100
value: 53.88699999999999
- type: mrr_at_1000
value: 53.925
- type: mrr_at_3
value: 51.342
- type: mrr_at_5
value: 52.456
- type: ndcg_at_1
value: 41.641
- type: ndcg_at_10
value: 34.028000000000006
- type: ndcg_at_100
value: 31.613000000000003
- type: ndcg_at_1000
value: 40.428
- type: ndcg_at_3
value: 38.991
- type: ndcg_at_5
value: 36.704
- type: precision_at_1
value: 43.034
- type: precision_at_10
value: 25.324999999999996
- type: precision_at_100
value: 7.889
- type: precision_at_1000
value: 2.069
- type: precision_at_3
value: 36.739
- type: precision_at_5
value: 32.074000000000005
- type: recall_at_1
value: 5.772
- type: recall_at_10
value: 16.827
- type: recall_at_100
value: 32.346000000000004
- type: recall_at_1000
value: 62.739
- type: recall_at_3
value: 10.56
- type: recall_at_5
value: 12.655
- task:
type: Retrieval
dataset:
type: nq-pl
name: MTEB NQ-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 26.101000000000003
- type: map_at_10
value: 39.912
- type: map_at_100
value: 41.037
- type: map_at_1000
value: 41.077000000000005
- type: map_at_3
value: 35.691
- type: map_at_5
value: 38.155
- type: mrr_at_1
value: 29.403000000000002
- type: mrr_at_10
value: 42.376999999999995
- type: mrr_at_100
value: 43.248999999999995
- type: mrr_at_1000
value: 43.277
- type: mrr_at_3
value: 38.794000000000004
- type: mrr_at_5
value: 40.933
- type: ndcg_at_1
value: 29.519000000000002
- type: ndcg_at_10
value: 47.33
- type: ndcg_at_100
value: 52.171
- type: ndcg_at_1000
value: 53.125
- type: ndcg_at_3
value: 39.316
- type: ndcg_at_5
value: 43.457
- type: precision_at_1
value: 29.519000000000002
- type: precision_at_10
value: 8.03
- type: precision_at_100
value: 1.075
- type: precision_at_1000
value: 0.117
- type: precision_at_3
value: 18.009
- type: precision_at_5
value: 13.221
- type: recall_at_1
value: 26.101000000000003
- type: recall_at_10
value: 67.50399999999999
- type: recall_at_100
value: 88.64699999999999
- type: recall_at_1000
value: 95.771
- type: recall_at_3
value: 46.669
- type: recall_at_5
value: 56.24
- task:
type: Classification
dataset:
type: laugustyniak/abusive-clauses-pl
name: MTEB PAC
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 63.76773819866782
- type: ap
value: 74.87896817642536
- type: f1
value: 61.420506092721425
- task:
type: PairClassification
dataset:
type: PL-MTEB/ppc-pairclassification
name: MTEB PPC
config: default
split: test
revision: None
metrics:
- type: cos_sim_accuracy
value: 82.1
- type: cos_sim_ap
value: 91.09417013497443
- type: cos_sim_f1
value: 84.78437754271766
- type: cos_sim_precision
value: 83.36
- type: cos_sim_recall
value: 86.25827814569537
- type: dot_accuracy
value: 75.9
- type: dot_ap
value: 86.82680649789796
- type: dot_f1
value: 80.5379746835443
- type: dot_precision
value: 77.12121212121212
- type: dot_recall
value: 84.27152317880795
- type: euclidean_accuracy
value: 81.6
- type: euclidean_ap
value: 90.81248760600693
- type: euclidean_f1
value: 84.35374149659863
- type: euclidean_precision
value: 86.7132867132867
- type: euclidean_recall
value: 82.11920529801324
- type: manhattan_accuracy
value: 81.6
- type: manhattan_ap
value: 90.81272803548767
- type: manhattan_f1
value: 84.33530906011855
- type: manhattan_precision
value: 86.30849220103987
- type: manhattan_recall
value: 82.45033112582782
- type: max_accuracy
value: 82.1
- type: max_ap
value: 91.09417013497443
- type: max_f1
value: 84.78437754271766
- task:
type: PairClassification
dataset:
type: PL-MTEB/psc-pairclassification
name: MTEB PSC
config: default
split: test
revision: None
metrics:
- type: cos_sim_accuracy
value: 98.05194805194806
- type: cos_sim_ap
value: 99.52709687103496
- type: cos_sim_f1
value: 96.83257918552036
- type: cos_sim_precision
value: 95.82089552238806
- type: cos_sim_recall
value: 97.86585365853658
- type: dot_accuracy
value: 92.30055658627087
- type: dot_ap
value: 94.12759311032353
- type: dot_f1
value: 87.00906344410878
- type: dot_precision
value: 86.22754491017965
- type: dot_recall
value: 87.8048780487805
- type: euclidean_accuracy
value: 98.05194805194806
- type: euclidean_ap
value: 99.49402675624125
- type: euclidean_f1
value: 96.8133535660091
- type: euclidean_precision
value: 96.37462235649546
- type: euclidean_recall
value: 97.2560975609756
- type: manhattan_accuracy
value: 98.05194805194806
- type: manhattan_ap
value: 99.50120505935962
- type: manhattan_f1
value: 96.8133535660091
- type: manhattan_precision
value: 96.37462235649546
- type: manhattan_recall
value: 97.2560975609756
- type: max_accuracy
value: 98.05194805194806
- type: max_ap
value: 99.52709687103496
- type: max_f1
value: 96.83257918552036
- task:
type: Classification
dataset:
type: PL-MTEB/polemo2_in
name: MTEB PolEmo2.0-IN
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 69.45983379501385
- type: f1
value: 68.60917948426784
- task:
type: Classification
dataset:
type: PL-MTEB/polemo2_out
name: MTEB PolEmo2.0-OUT
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 43.13765182186235
- type: f1
value: 36.15557441785656
- task:
type: Retrieval
dataset:
type: quora-pl
name: MTEB Quora-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 67.448
- type: map_at_10
value: 81.566
- type: map_at_100
value: 82.284
- type: map_at_1000
value: 82.301
- type: map_at_3
value: 78.425
- type: map_at_5
value: 80.43400000000001
- type: mrr_at_1
value: 77.61
- type: mrr_at_10
value: 84.467
- type: mrr_at_100
value: 84.63199999999999
- type: mrr_at_1000
value: 84.634
- type: mrr_at_3
value: 83.288
- type: mrr_at_5
value: 84.095
- type: ndcg_at_1
value: 77.66
- type: ndcg_at_10
value: 85.63199999999999
- type: ndcg_at_100
value: 87.166
- type: ndcg_at_1000
value: 87.306
- type: ndcg_at_3
value: 82.32300000000001
- type: ndcg_at_5
value: 84.22
- type: precision_at_1
value: 77.66
- type: precision_at_10
value: 13.136000000000001
- type: precision_at_100
value: 1.522
- type: precision_at_1000
value: 0.156
- type: precision_at_3
value: 36.153
- type: precision_at_5
value: 23.982
- type: recall_at_1
value: 67.448
- type: recall_at_10
value: 93.83200000000001
- type: recall_at_100
value: 99.212
- type: recall_at_1000
value: 99.94
- type: recall_at_3
value: 84.539
- type: recall_at_5
value: 89.71000000000001
- task:
type: Retrieval
dataset:
type: scidocs-pl
name: MTEB SCIDOCS-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 4.393
- type: map_at_10
value: 11.472
- type: map_at_100
value: 13.584999999999999
- type: map_at_1000
value: 13.918
- type: map_at_3
value: 8.212
- type: map_at_5
value: 9.864
- type: mrr_at_1
value: 21.7
- type: mrr_at_10
value: 32.268
- type: mrr_at_100
value: 33.495000000000005
- type: mrr_at_1000
value: 33.548
- type: mrr_at_3
value: 29.15
- type: mrr_at_5
value: 30.91
- type: ndcg_at_1
value: 21.6
- type: ndcg_at_10
value: 19.126
- type: ndcg_at_100
value: 27.496
- type: ndcg_at_1000
value: 33.274
- type: ndcg_at_3
value: 18.196
- type: ndcg_at_5
value: 15.945
- type: precision_at_1
value: 21.6
- type: precision_at_10
value: 9.94
- type: precision_at_100
value: 2.1999999999999997
- type: precision_at_1000
value: 0.359
- type: precision_at_3
value: 17.2
- type: precision_at_5
value: 14.12
- type: recall_at_1
value: 4.393
- type: recall_at_10
value: 20.166999999999998
- type: recall_at_100
value: 44.678000000000004
- type: recall_at_1000
value: 72.868
- type: recall_at_3
value: 10.473
- type: recall_at_5
value: 14.313
- task:
type: PairClassification
dataset:
type: PL-MTEB/sicke-pl-pairclassification
name: MTEB SICK-E-PL
config: default
split: test
revision: None
metrics:
- type: cos_sim_accuracy
value: 82.65389319200979
- type: cos_sim_ap
value: 76.13749398520014
- type: cos_sim_f1
value: 66.64355062413314
- type: cos_sim_precision
value: 64.93243243243244
- type: cos_sim_recall
value: 68.44729344729345
- type: dot_accuracy
value: 76.0905014268243
- type: dot_ap
value: 58.058968583382494
- type: dot_f1
value: 61.181080324657145
- type: dot_precision
value: 50.391885661595204
- type: dot_recall
value: 77.84900284900284
- type: euclidean_accuracy
value: 82.61312678353036
- type: euclidean_ap
value: 76.10290283033221
- type: euclidean_f1
value: 66.50782845473111
- type: euclidean_precision
value: 63.6897001303781
- type: euclidean_recall
value: 69.58689458689459
- type: manhattan_accuracy
value: 82.6742763962495
- type: manhattan_ap
value: 76.12712309700966
- type: manhattan_f1
value: 66.59700452803902
- type: manhattan_precision
value: 65.16700749829583
- type: manhattan_recall
value: 68.09116809116809
- type: max_accuracy
value: 82.6742763962495
- type: max_ap
value: 76.13749398520014
- type: max_f1
value: 66.64355062413314
- task:
type: STS
dataset:
type: PL-MTEB/sickr-pl-sts
name: MTEB SICK-R-PL
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 81.23898481255246
- type: cos_sim_spearman
value: 76.0416957474899
- type: euclidean_pearson
value: 78.96475496102107
- type: euclidean_spearman
value: 76.07208683063504
- type: manhattan_pearson
value: 78.92666424673251
- type: manhattan_spearman
value: 76.04968227583831
- task:
type: STS
dataset:
type: mteb/sts22-crosslingual-sts
name: MTEB STS22 (pl)
config: pl
split: test
revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
metrics:
- type: cos_sim_pearson
value: 39.13987124398541
- type: cos_sim_spearman
value: 40.40194528288759
- type: euclidean_pearson
value: 29.14566247168167
- type: euclidean_spearman
value: 39.97389932591777
- type: manhattan_pearson
value: 29.172993134388935
- type: manhattan_spearman
value: 39.85681935287037
- task:
type: Retrieval
dataset:
type: scifact-pl
name: MTEB SciFact-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 57.260999999999996
- type: map_at_10
value: 66.92399999999999
- type: map_at_100
value: 67.443
- type: map_at_1000
value: 67.47800000000001
- type: map_at_3
value: 64.859
- type: map_at_5
value: 65.71900000000001
- type: mrr_at_1
value: 60.333000000000006
- type: mrr_at_10
value: 67.95400000000001
- type: mrr_at_100
value: 68.42
- type: mrr_at_1000
value: 68.45
- type: mrr_at_3
value: 66.444
- type: mrr_at_5
value: 67.128
- type: ndcg_at_1
value: 60.333000000000006
- type: ndcg_at_10
value: 71.209
- type: ndcg_at_100
value: 73.37
- type: ndcg_at_1000
value: 74.287
- type: ndcg_at_3
value: 67.66799999999999
- type: ndcg_at_5
value: 68.644
- type: precision_at_1
value: 60.333000000000006
- type: precision_at_10
value: 9.467
- type: precision_at_100
value: 1.053
- type: precision_at_1000
value: 0.11299999999999999
- type: precision_at_3
value: 26.778000000000002
- type: precision_at_5
value: 16.933
- type: recall_at_1
value: 57.260999999999996
- type: recall_at_10
value: 83.256
- type: recall_at_100
value: 92.767
- type: recall_at_1000
value: 100.0
- type: recall_at_3
value: 72.933
- type: recall_at_5
value: 75.744
- task:
type: Retrieval
dataset:
type: trec-covid-pl
name: MTEB TRECCOVID-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 0.22
- type: map_at_10
value: 1.693
- type: map_at_100
value: 9.281
- type: map_at_1000
value: 21.462999999999997
- type: map_at_3
value: 0.609
- type: map_at_5
value: 0.9570000000000001
- type: mrr_at_1
value: 80.0
- type: mrr_at_10
value: 88.73299999999999
- type: mrr_at_100
value: 88.73299999999999
- type: mrr_at_1000
value: 88.73299999999999
- type: mrr_at_3
value: 88.333
- type: mrr_at_5
value: 88.73299999999999
- type: ndcg_at_1
value: 79.0
- type: ndcg_at_10
value: 71.177
- type: ndcg_at_100
value: 52.479
- type: ndcg_at_1000
value: 45.333
- type: ndcg_at_3
value: 77.48
- type: ndcg_at_5
value: 76.137
- type: precision_at_1
value: 82.0
- type: precision_at_10
value: 74.0
- type: precision_at_100
value: 53.68000000000001
- type: precision_at_1000
value: 19.954
- type: precision_at_3
value: 80.667
- type: precision_at_5
value: 80.80000000000001
- type: recall_at_1
value: 0.22
- type: recall_at_10
value: 1.934
- type: recall_at_100
value: 12.728
- type: recall_at_1000
value: 41.869
- type: recall_at_3
value: 0.637
- type: recall_at_5
value: 1.042
language: pl
license: apache-2.0
widget:
- source_sentence: "query: Jak dożyć 100 lat?"
sentences:
- "passage: Trzeba zdrowo się odżywiać i uprawiać sport."
- "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
- "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
---
<h1 align="center">MMLW-e5-large</h1>
MMLW (muszę mieć lepszą wiadomość, Polish for "I must have a better message") is a family of neural text encoders for Polish.
This is a distilled model that generates embeddings for tasks such as semantic similarity, clustering, and information retrieval. The model can also serve as a base for further fine-tuning.
It transforms texts into 1024-dimensional vectors.
The model was initialized from a multilingual E5 checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We used [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-base-en) as teacher models for distillation.
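As a rough illustration of the distillation objective described in the linked paper, the student is pushed to reproduce the teacher's embedding of the English sentence for both the English sentence and its Polish translation, using a mean squared error loss. The sketch below uses random NumPy arrays in place of real encoder outputs; the function name `distillation_loss` and the toy dimensions are assumptions for illustration, not the training code used for this model.

```python
import numpy as np


def distillation_loss(teacher_en, student_en, student_pl):
    """MSE between the teacher's English embeddings and the student's
    embeddings of the English and the Polish side of each pair."""
    loss_en = np.mean((student_en - teacher_en) ** 2)
    loss_pl = np.mean((student_pl - teacher_en) ** 2)
    return float(loss_en + loss_pl)


# Toy batch: 4 sentence pairs, 8-dimensional embeddings (hypothetical values).
rng = np.random.default_rng(0)
teacher_en = rng.normal(size=(4, 8))
student_en = teacher_en + 0.1 * rng.normal(size=(4, 8))
student_pl = teacher_en + 0.5 * rng.normal(size=(4, 8))

print(distillation_loss(teacher_en, student_en, student_pl))
# A student that matches the teacher exactly has zero loss.
print(distillation_loss(teacher_en, teacher_en, teacher_en))
```

In a real setup the three arrays would come from the teacher and student encoders, and the loss would be minimized with respect to the student's weights only.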
## Usage (Sentence-Transformers)
⚠️ Our embedding models require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "** ⚠️
You can use the model like this with [sentence-transformers](https://www.SBERT.net):
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Prefixes the model expects on every input text.
query_prefix = "query: "
answer_prefix = "passage: "

queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model = SentenceTransformer("sdadas/mmlw-e5-large")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Pick the passage with the highest cosine similarity to the query.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.
```
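The example above returns only the best match. If a full ranking is needed, the passages can be sorted by cosine similarity instead. The sketch below assumes the embeddings are already available as NumPy arrays and uses small toy vectors so that it runs without downloading the model; `with_prefix` and `rank_passages` are hypothetical helper names, not part of sentence-transformers.

```python
import numpy as np

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(texts, prefix):
    """Add the required prefix to each text unless it is already present."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def rank_passages(query_emb, passage_embs):
    """Return (passage index, cosine similarity) pairs, best match first."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]


# Toy 3-dimensional vectors standing in for real 1024-dimensional embeddings.
query_emb = np.array([1.0, 0.0, 0.0])
passage_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
])

ranking = rank_passages(query_emb, passage_embs)
print(with_prefix(["Jak dożyć 100 lat?"], QUERY_PREFIX))
print([i for i, _ in ranking])
# [0, 2, 1]
```

With the real model, `query_emb` and `passage_embs` would be the outputs of `model.encode(...)` on the prefixed texts, converted to NumPy arrays.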
## Evaluation Results
- The model achieves an **Average Score** of **61.17** on the Polish Massive Text Embedding Benchmark (MTEB). See [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for detailed results.
- The model achieves **NDCG@10** of **56.09** on the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
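For readers unfamiliar with the retrieval metric, the following is a minimal sketch of how NDCG@k can be computed for a single query. It assumes linear gain and that the input list holds the relevance labels of all judged documents in ranked order, so the ideal ordering can be derived from the same list. Benchmark toolkits may differ in details, and the function names here are illustrative. Scores in this card appear to be reported on a 0-100 scale, whereas the sketch returns values between 0 and 1.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevant documents at ranks 1 and 3, an irrelevant one at rank 2.
print(round(ndcg_at_k([1, 0, 1]), 4))
# A perfect ranking scores 1.0.
print(ndcg_at_k([1, 1, 0]))
```

A benchmark score is then the mean of this per-query value over all queries in the dataset.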
## Acknowledgements
This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.
## Citation
```bibtex
@article{dadas2024pirb,
title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
year={2024},
eprint={2402.13350},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```