German_Semantic_V3b / README.md
metadata
language:
  - de
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dataset_size:10K<n<100K
  - loss:MatryoshkaLoss
  - loss:ContrastiveLoss
base_model: aari1995/gbert-large-alibi
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
widget:
  - source_sentence: Das Tor ist gelb.
    sentences:
      - Das Tor ist blau.
      - Ein Mann mit seinem Hund am Strand.
      - Die Menschen sitzen auf Bänken.
  - source_sentence: Das Tor ist blau.
    sentences:
      - Ein blaues Moped parkt auf dem Bürgersteig.
      - Drei Hunde spielen im weißen Schnee.
      - Bombenanschläge töten 19 Menschen im Irak
  - source_sentence: Ein Mann übt Boxen
    sentences:
      - Ein Fußballspieler versucht ein Tackling.
      - 1 Getötet bei Protest in Bangladesch
      - Das Mädchen sang in ein Mikrofon.
  - source_sentence: Drei Männer tanzen.
    sentences:
      - Ein Mann tanzt.
      - Ein Mann arbeitet an seinem Laptop.
      - Das Mädchen sang in ein Mikrofon.
  - source_sentence: Eine Flagge weht.
    sentences:
      - Die Flagge bewegte sich in der Luft.
      - Zwei Personen beobachten das Wasser.
      - Zwei Frauen sitzen in einem Cafe.
pipeline_tag: sentence-similarity
model-index:
  - name: SentenceTransformer based on aari1995/gbert-large-nli_mix
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts test 1024
          type: sts-test-1024
        metrics:
          - type: pearson_cosine
            value: 0.8538749625112824
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8622934726599119
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.8554617861095041
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8632850500504865
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8554205957277228
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8630779166725503
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.8170146846171837
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8149857685956332
            name: Spearman Dot
          - type: pearson_max
            value: 0.8554617861095041
            name: Pearson Max
          - type: spearman_max
            value: 0.8632850500504865
            name: Spearman Max
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts test 768
          type: sts-test-768
        metrics:
          - type: pearson_cosine
            value: 0.853820621972726
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.863198271488271
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.8558709278385018
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8637532036004547
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8558597695346744
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8634247094122574
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.8169163431962185
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8156867907361973
            name: Spearman Dot
          - type: pearson_max
            value: 0.8558709278385018
            name: Pearson Max
          - type: spearman_max
            value: 0.8637532036004547
            name: Spearman Max
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts test 512
          type: sts-test-512
        metrics:
          - type: pearson_cosine
            value: 0.8502336569709972
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8623838162450902
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.8547121881183612
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8628698143219098
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8546114371189246
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8625109910600326
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.8108392647310044
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8103261097232485
            name: Spearman Dot
          - type: pearson_max
            value: 0.8547121881183612
            name: Pearson Max
          - type: spearman_max
            value: 0.8628698143219098
            name: Spearman Max
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts test 256
          type: sts-test-256
        metrics:
          - type: pearson_cosine
            value: 0.8441242786553879
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8582717489671877
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.8517415030362573
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8591688553092182
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8516965854845419
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8591770194196562
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.7901870400809775
            name: Pearson Dot
          - type: spearman_dot
            value: 0.7891397281321177
            name: Spearman Dot
          - type: pearson_max
            value: 0.8517415030362573
            name: Pearson Max
          - type: spearman_max
            value: 0.8591770194196562
            name: Spearman Max
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts test 128
          type: sts-test-128
        metrics:
          - type: pearson_cosine
            value: 0.8369352495821198
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8545806562301762
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.8474289413580527
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8546935424655524
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8478267316251253
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8550464936365929
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.7732663297266509
            name: Pearson Dot
          - type: spearman_dot
            value: 0.7720532782903432
            name: Spearman Dot
          - type: pearson_max
            value: 0.8478267316251253
            name: Pearson Max
          - type: spearman_max
            value: 0.8550464936365929
            name: Spearman Max
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts test 64
          type: sts-test-64
        metrics:
          - type: pearson_cosine
            value: 0.8282288301025145
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8507215646125454
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.8404915813802649
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8482910175231816
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8425986040609018
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8498681513437906
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.7518854418344252
            name: Pearson Dot
          - type: spearman_dot
            value: 0.7518133373839283
            name: Spearman Dot
          - type: pearson_max
            value: 0.8425986040609018
            name: Pearson Max
          - type: spearman_max
            value: 0.8507215646125454
            name: Spearman Max
license: apache-2.0

German Semantic V3

The successor to German_Semantic_STS_V2 is here!

Major updates and USPs:

  • Sequence length: 8192 tokens (16 times the 512-token limit of V2 and many comparable models), thanks to the Jina team's ALiBi implementation.
  • Matryoshka Embeddings: The model is trained for embedding sizes from 1024 down to 64, allowing you to store much smaller embeddings with little quality loss.
  • License: Apache 2.0
  • German only: The model is trained on German only, which lets it learn more efficiently and handle shorter queries better.
  • Flexibility: The model is trained with flexible sequence lengths and embedding truncation, so flexibility is a core feature, while performance improves on V2.
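
To make the Matryoshka point concrete, here is a minimal sketch of what truncating an embedding means and how much storage it saves. It assumes float32 NumPy arrays; `truncate_embeddings` is a hypothetical helper (not part of sentence-transformers), and random vectors stand in for real model output, so the numbers say nothing about quality.

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize each row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Random stand-ins for three 1024-dimensional sentence embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024)).astype(np.float32)

for dim in (1024, 768, 512, 256, 128, 64):
    small = truncate_embeddings(full, dim)
    bytes_per_vector = dim * 4  # float32 uses 4 bytes per component
    print(dim, small.shape, bytes_per_vector)
```

The re-normalization step is an assumption of this sketch. It only matters if vectors are later compared with a dot product; cosine similarity is unaffected by vector length.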

Usage:

from sentence_transformers import SentenceTransformer


matryoshka_dim = 1024 # How big your embeddings should be, choose from: 64, 128, 256, 512, 768, 1024
model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True, truncate_dim=matryoshka_dim)

# model.truncate_dim = 64 # truncation dimensions can also be changed after loading
# model.max_seq_length = 512 # optionally, lower the maximum sequence length if your hardware is limited

# Run inference
sentences = [
    'Eine Flagge weht.',
    'Die Flagge bewegte sich in der Luft.',
    'Zwei Personen beobachten das Wasser.',
]
embeddings = model.encode(sentences)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: gbert-large (ALiBi applied)
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • multiple German datasets
  • Languages: de
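
As a reminder of what the listed similarity function computes, the following is a small NumPy sketch of cosine similarity between two sets of vectors. `cosine_similarity_matrix` is a hypothetical name used for illustration, and the toy vectors are not model output.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the matrix of cosine similarities between rows of `a` and rows of `b`."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Rows 0 and 1 point in the same direction with different lengths; row 2 is orthogonal.
vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sims = cosine_similarity_matrix(vectors, vectors)
print(np.round(sims, 4))
```

Because each row is scaled to unit length first, rows 0 and 1 score 1.0 against each other despite their different magnitudes, which is the main difference from a raw dot product.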

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: JinaBertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
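
The pooling module above has `pooling_mode_mean_tokens` enabled, meaning the sentence embedding is the average of the token embeddings. A rough NumPy sketch of that idea follows, assuming padded positions are excluded through an attention mask; `mean_pool` is a hypothetical helper, not the library's own code.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sequence with two real tokens and one padding token (toy hidden size of 2).
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled)  # [[2. 3.]]
```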

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)
# Run inference
sentences = [
    'Eine Flagge weht.',
    'Die Flagge bewegte sich in der Luft.',
    'Zwei Personen beobachten das Wasser.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
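
Once a similarity matrix is available, a typical next step is to find the closest other sentence for each input. The sketch below assumes the matrix has been converted to a NumPy array; `most_similar` is a hypothetical helper and the scores are made up for illustration.

```python
import numpy as np

def most_similar(similarities: np.ndarray) -> np.ndarray:
    """For each row, return the index of the most similar *other* item."""
    scores = np.array(similarities, dtype=float)  # copy so the input is untouched
    np.fill_diagonal(scores, -np.inf)             # exclude self-matches
    return scores.argmax(axis=1)

# Made-up scores for illustration only, not real model output.
example = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
print(most_similar(example))  # [1 0 1]
```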

Evaluation

Metrics

Semantic Similarity (dataset: sts-test-1024)

| Metric | Value |
|:--|--:|
| pearson_cosine | 0.8539 |
| spearman_cosine | 0.8623 |
| pearson_manhattan | 0.8555 |
| spearman_manhattan | 0.8633 |
| pearson_euclidean | 0.8554 |
| spearman_euclidean | 0.8631 |
| pearson_dot | 0.8170 |
| spearman_dot | 0.8150 |
| pearson_max | 0.8555 |
| spearman_max | 0.8633 |

Semantic Similarity (dataset: sts-test-768)

| Metric | Value |
|:--|--:|
| pearson_cosine | 0.8538 |
| spearman_cosine | 0.8632 |
| pearson_manhattan | 0.8559 |
| spearman_manhattan | 0.8638 |
| pearson_euclidean | 0.8559 |
| spearman_euclidean | 0.8634 |
| pearson_dot | 0.8169 |
| spearman_dot | 0.8157 |
| pearson_max | 0.8559 |
| spearman_max | 0.8638 |

Semantic Similarity (dataset: sts-test-512)

| Metric | Value |
|:--|--:|
| pearson_cosine | 0.8502 |
| spearman_cosine | 0.8624 |
| pearson_manhattan | 0.8547 |
| spearman_manhattan | 0.8629 |
| pearson_euclidean | 0.8546 |
| spearman_euclidean | 0.8625 |
| pearson_dot | 0.8108 |
| spearman_dot | 0.8103 |
| pearson_max | 0.8547 |
| spearman_max | 0.8629 |

Semantic Similarity (dataset: sts-test-256)

| Metric | Value |
|:--|--:|
| pearson_cosine | 0.8441 |
| spearman_cosine | 0.8583 |
| pearson_manhattan | 0.8517 |
| spearman_manhattan | 0.8592 |
| pearson_euclidean | 0.8517 |
| spearman_euclidean | 0.8592 |
| pearson_dot | 0.7902 |
| spearman_dot | 0.7891 |
| pearson_max | 0.8517 |
| spearman_max | 0.8592 |

Semantic Similarity (dataset: sts-test-128)

| Metric | Value |
|:--|--:|
| pearson_cosine | 0.8369 |
| spearman_cosine | 0.8546 |
| pearson_manhattan | 0.8474 |
| spearman_manhattan | 0.8547 |
| pearson_euclidean | 0.8478 |
| spearman_euclidean | 0.8550 |
| pearson_dot | 0.7733 |
| spearman_dot | 0.7721 |
| pearson_max | 0.8478 |
| spearman_max | 0.8550 |

Semantic Similarity (dataset: sts-test-64)

| Metric | Value |
|:--|--:|
| pearson_cosine | 0.8282 |
| spearman_cosine | 0.8507 |
| pearson_manhattan | 0.8405 |
| spearman_manhattan | 0.8483 |
| pearson_euclidean | 0.8426 |
| spearman_euclidean | 0.8499 |
| pearson_dot | 0.7519 |
| spearman_dot | 0.7518 |
| pearson_max | 0.8426 |
| spearman_max | 0.8507 |
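
For readers unfamiliar with the metric names, the Pearson and Spearman scores above measure how well predicted similarities track gold similarity labels. The following is a simplified NumPy sketch of both correlations; the helper names are hypothetical, the data is made up, and the Spearman version assumes there are no tied values.

```python
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation: linear agreement between two score lists."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman correlation: Pearson correlation of the ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

# Made-up gold labels and predicted similarities, for illustration only.
gold = np.array([0.0, 1.0, 2.5, 4.0, 5.0])
predicted = np.array([0.10, 0.30, 0.35, 0.80, 0.95])
print(round(pearson(gold, predicted), 4), round(spearman(gold, predicted), 4))
```

In this toy data the predictions preserve the gold ordering exactly, so the Spearman score is 1.0 even though the relationship is not perfectly linear.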

Training Details

  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "ContrastiveLoss",
        "matryoshka_dims": [
            1024,
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
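
The configuration above can be read as: compute the contrastive loss once per listed dimension on the truncated embeddings, then add the results with equal weights. Below is a simplified NumPy sketch of that idea, not the library's implementation. The cosine-distance formulation, the margin of 0.5, and the re-normalization after truncation are assumptions of this sketch, and random vectors stand in for real sentence pairs.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

def contrastive_loss(a, b, labels, margin=0.5):
    """Contrastive loss on cosine distance; labels are 1 (similar) or 0 (dissimilar)."""
    distance = 1.0 - np.sum(normalize(a) * normalize(b), axis=1)
    similar_term = labels * distance**2                                   # pull similar pairs together
    dissimilar_term = (1 - labels) * np.maximum(margin - distance, 0.0) ** 2  # push others past the margin
    return float(np.mean(0.5 * (similar_term + dissimilar_term)))

def matryoshka_loss(a, b, labels, dims, weights):
    """Weighted sum of the base loss over embeddings truncated to each dimension."""
    return sum(w * contrastive_loss(a[:, :d], b[:, :d], labels) for d, w in zip(dims, weights))

dims = [1024, 768, 512, 256, 128, 64]
weights = [1, 1, 1, 1, 1, 1]
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 1024))  # stand-in embeddings for the first sentences
b = rng.normal(size=(4, 1024))  # stand-in embeddings for the second sentences
labels = np.array([1, 0, 1, 0])
print(matryoshka_loss(a, b, labels, dims, weights))
```

Because every prefix of the embedding contributes its own loss term, the leading dimensions are pushed to carry most of the information, which is what makes truncation to 64 dimensions viable.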
    

License / Credits and special thanks to:

  • Jina AI for the model architecture, especially their ALiBi implementation
  • deepset for gbert-large, which is, in my opinion, still the greatest German model

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)}, 
    title={Dimensionality Reduction by Learning an Invariant Mapping}, 
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}