---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- semantic-search
- chinese
- mteb
model-index:
- name: sbert-chinese-general-v1
results:
- task:
type: STS
dataset:
type: C-MTEB/AFQMC
name: MTEB AFQMC
config: default
split: validation
revision: None
metrics:
- type: cos_sim_pearson
value: 22.293919432958074
- type: cos_sim_spearman
value: 22.56718923553609
- type: euclidean_pearson
value: 22.525656322797026
- type: euclidean_spearman
value: 22.56718923553609
- type: manhattan_pearson
value: 22.501773028824065
- type: manhattan_spearman
value: 22.536992587828397
- task:
type: STS
dataset:
type: C-MTEB/ATEC
name: MTEB ATEC
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 30.33575274463879
- type: cos_sim_spearman
value: 30.298708742167772
- type: euclidean_pearson
value: 32.33094743729218
- type: euclidean_spearman
value: 30.298710993858734
- type: manhattan_pearson
value: 32.31155376195945
- type: manhattan_spearman
value: 30.267669681690744
- task:
type: Classification
dataset:
type: mteb/amazon_reviews_multi
name: MTEB AmazonReviewsClassification (zh)
config: zh
split: test
revision: 1399c76144fd37290681b995c656ef9b2e06e26d
metrics:
- type: accuracy
value: 37.507999999999996
- type: f1
value: 36.436808400753286
- task:
type: STS
dataset:
type: C-MTEB/BQ
name: MTEB BQ
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 41.493256724214255
- type: cos_sim_spearman
value: 40.98395961967895
- type: euclidean_pearson
value: 41.12345737966565
- type: euclidean_spearman
value: 40.983959619555996
- type: manhattan_pearson
value: 41.02584539471014
- type: manhattan_spearman
value: 40.87549513383032
- task:
type: BitextMining
dataset:
type: mteb/bucc-bitext-mining
name: MTEB BUCC (zh-en)
config: zh-en
split: test
revision: d51519689f32196a32af33b075a01d0e7c51e252
metrics:
- type: accuracy
value: 9.794628751974724
- type: f1
value: 9.350535369492716
- type: precision
value: 9.179392662804986
- type: recall
value: 9.794628751974724
- task:
type: Clustering
dataset:
type: C-MTEB/CLSClusteringP2P
name: MTEB CLSClusteringP2P
config: default
split: test
revision: None
metrics:
- type: v_measure
value: 34.984726547788284
- task:
type: Clustering
dataset:
type: C-MTEB/CLSClusteringS2S
name: MTEB CLSClusteringS2S
config: default
split: test
revision: None
metrics:
- type: v_measure
value: 27.81945732281589
- task:
type: Reranking
dataset:
type: C-MTEB/CMedQAv1-reranking
name: MTEB CMedQAv1
config: default
split: test
revision: None
metrics:
- type: map
value: 53.06586280826805
- type: mrr
value: 59.58781746031746
- task:
type: Reranking
dataset:
type: C-MTEB/CMedQAv2-reranking
name: MTEB CMedQAv2
config: default
split: test
revision: None
metrics:
- type: map
value: 52.83635946154306
- type: mrr
value: 59.315079365079356
- task:
type: Retrieval
dataset:
type: C-MTEB/CmedqaRetrieval
name: MTEB CmedqaRetrieval
config: default
split: dev
revision: None
metrics:
- type: map_at_1
value: 5.721
- type: map_at_10
value: 8.645
- type: map_at_100
value: 9.434
- type: map_at_1000
value: 9.586
- type: map_at_3
value: 7.413
- type: map_at_5
value: 8.05
- type: mrr_at_1
value: 9.626999999999999
- type: mrr_at_10
value: 13.094
- type: mrr_at_100
value: 13.854
- type: mrr_at_1000
value: 13.958
- type: mrr_at_3
value: 11.724
- type: mrr_at_5
value: 12.409
- type: ndcg_at_1
value: 9.626999999999999
- type: ndcg_at_10
value: 11.35
- type: ndcg_at_100
value: 15.593000000000002
- type: ndcg_at_1000
value: 19.619
- type: ndcg_at_3
value: 9.317
- type: ndcg_at_5
value: 10.049
- type: precision_at_1
value: 9.626999999999999
- type: precision_at_10
value: 2.796
- type: precision_at_100
value: 0.629
- type: precision_at_1000
value: 0.11800000000000001
- type: precision_at_3
value: 5.476
- type: precision_at_5
value: 4.1209999999999996
- type: recall_at_1
value: 5.721
- type: recall_at_10
value: 15.190000000000001
- type: recall_at_100
value: 33.633
- type: recall_at_1000
value: 62.019999999999996
- type: recall_at_3
value: 9.099
- type: recall_at_5
value: 11.423
- task:
type: PairClassification
dataset:
type: C-MTEB/CMNLI
name: MTEB Cmnli
config: default
split: validation
revision: None
metrics:
- type: cos_sim_accuracy
value: 77.36620565243535
- type: cos_sim_ap
value: 85.92291866877001
- type: cos_sim_f1
value: 78.19390231037029
- type: cos_sim_precision
value: 71.24183006535948
- type: cos_sim_recall
value: 86.64952069207388
- type: dot_accuracy
value: 77.36620565243535
- type: dot_ap
value: 85.94113738490068
- type: dot_f1
value: 78.19390231037029
- type: dot_precision
value: 71.24183006535948
- type: dot_recall
value: 86.64952069207388
- type: euclidean_accuracy
value: 77.36620565243535
- type: euclidean_ap
value: 85.92291893444687
- type: euclidean_f1
value: 78.19390231037029
- type: euclidean_precision
value: 71.24183006535948
- type: euclidean_recall
value: 86.64952069207388
- type: manhattan_accuracy
value: 77.29404690318701
- type: manhattan_ap
value: 85.88284362100919
- type: manhattan_f1
value: 78.17836812144213
- type: manhattan_precision
value: 71.18448838548666
- type: manhattan_recall
value: 86.69628244096329
- type: max_accuracy
value: 77.36620565243535
- type: max_ap
value: 85.94113738490068
- type: max_f1
value: 78.19390231037029
- task:
type: Retrieval
dataset:
type: C-MTEB/CovidRetrieval
name: MTEB CovidRetrieval
config: default
split: dev
revision: None
metrics:
- type: map_at_1
value: 26.976
- type: map_at_10
value: 35.18
- type: map_at_100
value: 35.921
- type: map_at_1000
value: 35.998999999999995
- type: map_at_3
value: 32.763
- type: map_at_5
value: 34.165
- type: mrr_at_1
value: 26.976
- type: mrr_at_10
value: 35.234
- type: mrr_at_100
value: 35.939
- type: mrr_at_1000
value: 36.016
- type: mrr_at_3
value: 32.771
- type: mrr_at_5
value: 34.172999999999995
- type: ndcg_at_1
value: 26.976
- type: ndcg_at_10
value: 39.635
- type: ndcg_at_100
value: 43.54
- type: ndcg_at_1000
value: 45.723
- type: ndcg_at_3
value: 34.652
- type: ndcg_at_5
value: 37.186
- type: precision_at_1
value: 26.976
- type: precision_at_10
value: 5.406
- type: precision_at_100
value: 0.736
- type: precision_at_1000
value: 0.091
- type: precision_at_3
value: 13.418
- type: precision_at_5
value: 9.293999999999999
- type: recall_at_1
value: 26.976
- type: recall_at_10
value: 53.766999999999996
- type: recall_at_100
value: 72.761
- type: recall_at_1000
value: 90.148
- type: recall_at_3
value: 40.095
- type: recall_at_5
value: 46.233000000000004
- task:
type: Retrieval
dataset:
type: C-MTEB/DuRetrieval
name: MTEB DuRetrieval
config: default
split: dev
revision: None
metrics:
- type: map_at_1
value: 11.285
- type: map_at_10
value: 30.259000000000004
- type: map_at_100
value: 33.772000000000006
- type: map_at_1000
value: 34.037
- type: map_at_3
value: 21.038999999999998
- type: map_at_5
value: 25.939
- type: mrr_at_1
value: 45.1
- type: mrr_at_10
value: 55.803999999999995
- type: mrr_at_100
value: 56.301
- type: mrr_at_1000
value: 56.330999999999996
- type: mrr_at_3
value: 53.333
- type: mrr_at_5
value: 54.798
- type: ndcg_at_1
value: 45.1
- type: ndcg_at_10
value: 41.156
- type: ndcg_at_100
value: 49.518
- type: ndcg_at_1000
value: 52.947
- type: ndcg_at_3
value: 39.708
- type: ndcg_at_5
value: 38.704
- type: precision_at_1
value: 45.1
- type: precision_at_10
value: 20.75
- type: precision_at_100
value: 3.424
- type: precision_at_1000
value: 0.42700000000000005
- type: precision_at_3
value: 35.632999999999996
- type: precision_at_5
value: 30.080000000000002
- type: recall_at_1
value: 11.285
- type: recall_at_10
value: 43.242000000000004
- type: recall_at_100
value: 68.604
- type: recall_at_1000
value: 85.904
- type: recall_at_3
value: 24.404
- type: recall_at_5
value: 32.757
- task:
type: Retrieval
dataset:
type: C-MTEB/EcomRetrieval
name: MTEB EcomRetrieval
config: default
split: dev
revision: None
metrics:
- type: map_at_1
value: 21
- type: map_at_10
value: 28.364
- type: map_at_100
value: 29.199
- type: map_at_1000
value: 29.265
- type: map_at_3
value: 25.717000000000002
- type: map_at_5
value: 27.311999999999998
- type: mrr_at_1
value: 21
- type: mrr_at_10
value: 28.364
- type: mrr_at_100
value: 29.199
- type: mrr_at_1000
value: 29.265
- type: mrr_at_3
value: 25.717000000000002
- type: mrr_at_5
value: 27.311999999999998
- type: ndcg_at_1
value: 21
- type: ndcg_at_10
value: 32.708
- type: ndcg_at_100
value: 37.184
- type: ndcg_at_1000
value: 39.273
- type: ndcg_at_3
value: 27.372000000000003
- type: ndcg_at_5
value: 30.23
- type: precision_at_1
value: 21
- type: precision_at_10
value: 4.66
- type: precision_at_100
value: 0.685
- type: precision_at_1000
value: 0.086
- type: precision_at_3
value: 10.732999999999999
- type: precision_at_5
value: 7.82
- type: recall_at_1
value: 21
- type: recall_at_10
value: 46.6
- type: recall_at_100
value: 68.5
- type: recall_at_1000
value: 85.6
- type: recall_at_3
value: 32.2
- type: recall_at_5
value: 39.1
- task:
type: Classification
dataset:
type: C-MTEB/IFlyTek-classification
name: MTEB IFlyTek
config: default
split: validation
revision: None
metrics:
- type: accuracy
value: 44.878799538283964
- type: f1
value: 33.84678310261366
- task:
type: Classification
dataset:
type: C-MTEB/JDReview-classification
name: MTEB JDReview
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 82.1951219512195
- type: ap
value: 46.78292030042397
- type: f1
value: 76.20482468514128
- task:
type: STS
dataset:
type: C-MTEB/LCQMC
name: MTEB LCQMC
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 62.84331627244547
- type: cos_sim_spearman
value: 68.39990265073726
- type: euclidean_pearson
value: 66.87431827169324
- type: euclidean_spearman
value: 68.39990264979167
- type: manhattan_pearson
value: 66.89702078900328
- type: manhattan_spearman
value: 68.42107302159141
- task:
type: Reranking
dataset:
type: C-MTEB/Mmarco-reranking
name: MTEB MMarcoReranking
config: default
split: dev
revision: None
metrics:
- type: map
value: 9.28600891904827
- type: mrr
value: 8.057936507936509
- task:
type: Retrieval
dataset:
type: C-MTEB/MMarcoRetrieval
name: MTEB MMarcoRetrieval
config: default
split: dev
revision: None
metrics:
- type: map_at_1
value: 22.820999999999998
- type: map_at_10
value: 30.44
- type: map_at_100
value: 31.35
- type: map_at_1000
value: 31.419000000000004
- type: map_at_3
value: 28.134999999999998
- type: map_at_5
value: 29.482000000000003
- type: mrr_at_1
value: 23.782
- type: mrr_at_10
value: 31.141999999999996
- type: mrr_at_100
value: 32.004
- type: mrr_at_1000
value: 32.068000000000005
- type: mrr_at_3
value: 28.904000000000003
- type: mrr_at_5
value: 30.214999999999996
- type: ndcg_at_1
value: 23.782
- type: ndcg_at_10
value: 34.625
- type: ndcg_at_100
value: 39.226
- type: ndcg_at_1000
value: 41.128
- type: ndcg_at_3
value: 29.968
- type: ndcg_at_5
value: 32.35
- type: precision_at_1
value: 23.782
- type: precision_at_10
value: 4.994
- type: precision_at_100
value: 0.736
- type: precision_at_1000
value: 0.09
- type: precision_at_3
value: 12.13
- type: precision_at_5
value: 8.495999999999999
- type: recall_at_1
value: 22.820999999999998
- type: recall_at_10
value: 47.141
- type: recall_at_100
value: 68.952
- type: recall_at_1000
value: 83.985
- type: recall_at_3
value: 34.508
- type: recall_at_5
value: 40.232
- task:
type: Classification
dataset:
type: mteb/amazon_massive_intent
name: MTEB MassiveIntentClassification (zh-CN)
config: zh-CN
split: test
revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
metrics:
- type: accuracy
value: 57.343644922663074
- type: f1
value: 56.744802953803486
- task:
type: Classification
dataset:
type: mteb/amazon_massive_scenario
name: MTEB MassiveScenarioClassification (zh-CN)
config: zh-CN
split: test
revision: 7d571f92784cd94a019292a1f45445077d0ef634
metrics:
- type: accuracy
value: 62.363819771351714
- type: f1
value: 62.15920863434656
- task:
type: Retrieval
dataset:
type: C-MTEB/MedicalRetrieval
name: MTEB MedicalRetrieval
config: default
split: dev
revision: None
metrics:
- type: map_at_1
value: 14.6
- type: map_at_10
value: 18.231
- type: map_at_100
value: 18.744
- type: map_at_1000
value: 18.811
- type: map_at_3
value: 17.133000000000003
- type: map_at_5
value: 17.663
- type: mrr_at_1
value: 14.6
- type: mrr_at_10
value: 18.231
- type: mrr_at_100
value: 18.744
- type: mrr_at_1000
value: 18.811
- type: mrr_at_3
value: 17.133000000000003
- type: mrr_at_5
value: 17.663
- type: ndcg_at_1
value: 14.6
- type: ndcg_at_10
value: 20.349
- type: ndcg_at_100
value: 23.204
- type: ndcg_at_1000
value: 25.44
- type: ndcg_at_3
value: 17.995
- type: ndcg_at_5
value: 18.945999999999998
- type: precision_at_1
value: 14.6
- type: precision_at_10
value: 2.7199999999999998
- type: precision_at_100
value: 0.414
- type: precision_at_1000
value: 0.06
- type: precision_at_3
value: 6.833
- type: precision_at_5
value: 4.5600000000000005
- type: recall_at_1
value: 14.6
- type: recall_at_10
value: 27.200000000000003
- type: recall_at_100
value: 41.4
- type: recall_at_1000
value: 60
- type: recall_at_3
value: 20.5
- type: recall_at_5
value: 22.8
- task:
type: Classification
dataset:
type: C-MTEB/MultilingualSentiment-classification
name: MTEB MultilingualSentiment
config: default
split: validation
revision: None
metrics:
- type: accuracy
value: 66.58333333333333
- type: f1
value: 66.26700927460007
- task:
type: PairClassification
dataset:
type: C-MTEB/OCNLI
name: MTEB Ocnli
config: default
split: validation
revision: None
metrics:
- type: cos_sim_accuracy
value: 72.00866269626421
- type: cos_sim_ap
value: 77.00520104243304
- type: cos_sim_f1
value: 74.39303710490151
- type: cos_sim_precision
value: 65.69579288025889
- type: cos_sim_recall
value: 85.74445617740233
- type: dot_accuracy
value: 72.00866269626421
- type: dot_ap
value: 77.00520104243304
- type: dot_f1
value: 74.39303710490151
- type: dot_precision
value: 65.69579288025889
- type: dot_recall
value: 85.74445617740233
- type: euclidean_accuracy
value: 72.00866269626421
- type: euclidean_ap
value: 77.00520104243304
- type: euclidean_f1
value: 74.39303710490151
- type: euclidean_precision
value: 65.69579288025889
- type: euclidean_recall
value: 85.74445617740233
- type: manhattan_accuracy
value: 72.1710882512182
- type: manhattan_ap
value: 77.00551017913976
- type: manhattan_f1
value: 74.23423423423424
- type: manhattan_precision
value: 64.72898664571878
- type: manhattan_recall
value: 87.0116156282999
- type: max_accuracy
value: 72.1710882512182
- type: max_ap
value: 77.00551017913976
- type: max_f1
value: 74.39303710490151
- task:
type: Classification
dataset:
type: C-MTEB/OnlineShopping-classification
name: MTEB OnlineShopping
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 88.19000000000001
- type: ap
value: 85.13415594781077
- type: f1
value: 88.17344156114062
- task:
type: STS
dataset:
type: C-MTEB/PAWSX
name: MTEB PAWSX
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 13.70522140998517
- type: cos_sim_spearman
value: 15.07546667334743
- type: euclidean_pearson
value: 17.49511420225285
- type: euclidean_spearman
value: 15.093970931789618
- type: manhattan_pearson
value: 17.44069961390521
- type: manhattan_spearman
value: 15.076029291596962
- task:
type: STS
dataset:
type: C-MTEB/QBQTC
name: MTEB QBQTC
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 26.835294224547155
- type: cos_sim_spearman
value: 27.920204597498856
- type: euclidean_pearson
value: 26.153796707702803
- type: euclidean_spearman
value: 27.920971379720548
- type: manhattan_pearson
value: 26.21954147857523
- type: manhattan_spearman
value: 27.996860049937478
- task:
type: STS
dataset:
type: mteb/sts22-crosslingual-sts
name: MTEB STS22 (zh)
config: zh
split: test
revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
metrics:
- type: cos_sim_pearson
value: 55.15901259718581
- type: cos_sim_spearman
value: 61.57967880874167
- type: euclidean_pearson
value: 53.83523291596683
- type: euclidean_spearman
value: 61.57967880874167
- type: manhattan_pearson
value: 54.99971428907956
- type: manhattan_spearman
value: 61.61229543613867
- task:
type: STS
dataset:
type: mteb/sts22-crosslingual-sts
name: MTEB STS22 (zh-en)
config: zh-en
split: test
revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
metrics:
- type: cos_sim_pearson
value: 34.20930208460845
- type: cos_sim_spearman
value: 33.879011104224524
- type: euclidean_pearson
value: 35.08526425284862
- type: euclidean_spearman
value: 33.879011104224524
- type: manhattan_pearson
value: 35.509419089701275
- type: manhattan_spearman
value: 33.30035487147621
- task:
type: STS
dataset:
type: C-MTEB/STSB
name: MTEB STSB
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 82.30068282185835
- type: cos_sim_spearman
value: 82.16763221361724
- type: euclidean_pearson
value: 80.52772752433374
- type: euclidean_spearman
value: 82.16797037220333
- type: manhattan_pearson
value: 80.51093859500105
- type: manhattan_spearman
value: 82.17643310049654
- task:
type: Reranking
dataset:
type: C-MTEB/T2Reranking
name: MTEB T2Reranking
config: default
split: dev
revision: None
metrics:
- type: map
value: 65.14113035189213
- type: mrr
value: 74.9589270937443
- task:
type: Retrieval
dataset:
type: C-MTEB/T2Retrieval
name: MTEB T2Retrieval
config: default
split: dev
revision: None
metrics:
- type: map_at_1
value: 12.013
- type: map_at_10
value: 30.885
- type: map_at_100
value: 34.643
- type: map_at_1000
value: 34.927
- type: map_at_3
value: 21.901
- type: map_at_5
value: 26.467000000000002
- type: mrr_at_1
value: 49.623
- type: mrr_at_10
value: 58.05200000000001
- type: mrr_at_100
value: 58.61300000000001
- type: mrr_at_1000
value: 58.643
- type: mrr_at_3
value: 55.947
- type: mrr_at_5
value: 57.229
- type: ndcg_at_1
value: 49.623
- type: ndcg_at_10
value: 41.802
- type: ndcg_at_100
value: 49.975
- type: ndcg_at_1000
value: 53.504
- type: ndcg_at_3
value: 43.515
- type: ndcg_at_5
value: 41.576
- type: precision_at_1
value: 49.623
- type: precision_at_10
value: 22.052
- type: precision_at_100
value: 3.6450000000000005
- type: precision_at_1000
value: 0.45399999999999996
- type: precision_at_3
value: 38.616
- type: precision_at_5
value: 31.966
- type: recall_at_1
value: 12.013
- type: recall_at_10
value: 41.891
- type: recall_at_100
value: 67.096
- type: recall_at_1000
value: 84.756
- type: recall_at_3
value: 24.695
- type: recall_at_5
value: 32.09
- task:
type: Classification
dataset:
type: C-MTEB/TNews-classification
name: MTEB TNews
config: default
split: validation
revision: None
metrics:
- type: accuracy
value: 39.800999999999995
- type: f1
value: 38.5345899934575
- task:
type: Clustering
dataset:
type: C-MTEB/ThuNewsClusteringP2P
name: MTEB ThuNewsClusteringP2P
config: default
split: test
revision: None
metrics:
- type: v_measure
value: 40.16574242797479
- task:
type: Clustering
dataset:
type: C-MTEB/ThuNewsClusteringS2S
name: MTEB ThuNewsClusteringS2S
config: default
split: test
revision: None
metrics:
- type: v_measure
value: 24.232617974671754
- task:
type: Retrieval
dataset:
type: C-MTEB/VideoRetrieval
name: MTEB VideoRetrieval
config: default
split: dev
revision: None
metrics:
- type: map_at_1
value: 24.6
- type: map_at_10
value: 31.328
- type: map_at_100
value: 32.088
- type: map_at_1000
value: 32.164
- type: map_at_3
value: 29.133
- type: map_at_5
value: 30.358
- type: mrr_at_1
value: 24.6
- type: mrr_at_10
value: 31.328
- type: mrr_at_100
value: 32.088
- type: mrr_at_1000
value: 32.164
- type: mrr_at_3
value: 29.133
- type: mrr_at_5
value: 30.358
- type: ndcg_at_1
value: 24.6
- type: ndcg_at_10
value: 35.150999999999996
- type: ndcg_at_100
value: 39.024
- type: ndcg_at_1000
value: 41.157
- type: ndcg_at_3
value: 30.637999999999998
- type: ndcg_at_5
value: 32.833
- type: precision_at_1
value: 24.6
- type: precision_at_10
value: 4.74
- type: precision_at_100
value: 0.66
- type: precision_at_1000
value: 0.083
- type: precision_at_3
value: 11.667
- type: precision_at_5
value: 8.06
- type: recall_at_1
value: 24.6
- type: recall_at_10
value: 47.4
- type: recall_at_100
value: 66
- type: recall_at_1000
value: 83
- type: recall_at_3
value: 35
- type: recall_at_5
value: 40.300000000000004
- task:
type: Classification
dataset:
type: C-MTEB/waimai-classification
name: MTEB Waimai
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 83.96000000000001
- type: ap
value: 65.11027167433211
- type: f1
value: 82.03549710974653
license: apache-2.0
language:
- zh
---
# DMetaSoul/sbert-chinese-general-v1
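This model is based on the [bert-base-chinese](https://huggingface.co/bert-base-chinese) BERT model and was trained on semantic-similarity datasets including NLI, PAWS-X, PKU-Paraphrase-Bank, and STS. It targets **general-purpose semantic matching** scenarios such as text feature extraction, text embedding clustering, and semantic text search. Note that the model performs well on Chinese-STS tasks but is not the best choice on other tasks, so there is some risk of overfitting.
Note: a [lightweight (distilled) version](https://huggingface.co/DMetaSoul/sbert-chinese-general-v1-distill) of this model has also been open-sourced.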
# Usage
## 1. Sentence-Transformers
To use this model through the [sentence-transformers](https://www.SBERT.net) framework, first install it:
```
pip install -U sentence-transformers
```
Then use the following code to load the model and extract sentence embeddings:
```python
from sentence_transformers import SentenceTransformer
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]
model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1')
embeddings = model.encode(sentences)
print(embeddings)
```
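For semantic matching or semantic search, the extracted embeddings are usually compared with cosine similarity. The following is a minimal sketch of that step, not part of the model's own API: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the output of `model.encode`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors (hypothetical values); in practice these would be
# embeddings produced by the model for a query and a set of candidates.
query = [1.0, 0.0, 1.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

ranking = rank_by_similarity(query, corpus)
for index, score in ranking:
    print(f"candidate {index}: cosine similarity = {score:.4f}")
```

Cosine similarity ignores vector length, so the candidate pointing in the same direction as the query ranks first regardless of its magnitude.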
## 2. HuggingFace Transformers
If you prefer not to use [sentence-transformers](https://www.SBERT.net), you can also load the model with HuggingFace Transformers and extract sentence embeddings:
```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v1')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
## Evaluation
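The model was evaluated on several public semantic-matching datasets by computing the correlation coefficient between embedding similarity and the ground-truth labels: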
| | **csts_dev** | **csts_test** | **afqmc** | **lcqmc** | **bqcorpus** | **pawsx** | **xiaobu** |
| ------------ | ------------ | ------------- | --------- | --------- | ------------ | --------- | ---------- |
| **spearman** | 84.54% | 82.17% | 23.80% | 65.94% | 45.52% | 11.52% | 48.51% |
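
The Spearman figures above measure how well the ranking of predicted similarities agrees with the ranking of the gold labels. As a rough illustration of the metric only (this is not the evaluation script behind the table), the sketch below computes Spearman's rank correlation in plain Python. The function names and the five score/label pairs are hypothetical; in a real evaluation `predicted` would hold the cosine similarity of each sentence pair's embeddings.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every element tied with the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: predicted similarities and human-annotated labels.
predicted = [0.91, 0.15, 0.62, 0.40, 0.78]
gold = [5, 0, 3, 2, 3]

rho = spearman(predicted, gold)
print(f"spearman = {rho:.2%}")
```

Because only ranks matter, the metric does not change if the predicted similarities are rescaled monotonically, which is why it suits comparing cosine scores against labels on a different scale.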
## Citing & Authors
E-mail: xiaowenbin@dmetasoul.com