andersborges committed (verified)
Commit ecd1b60 · 1 parent: c93c7e4

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +47 -20
README.md CHANGED
@@ -1,20 +1,23 @@
  ---
- base_model:
- - jealk/TTC-L2V-supervised-2
- language:
- - da
  library_name: model2vec
  license: mit
- model_name: andersborges/model2vecdk
  tags:
  - embeddings
  - static-embeddings
  - sentence-transformers
  ---

- # andersborges/model2vecdk Model Card

- This [Model2Vec](https://github.com/MinishLab/model2vec) model is a distilled version of the [jealk/TTC-L2V-supervised-2](https://huggingface.co/jealk/TTC-L2V-supervised-2) Sentence Transformer. It uses static embeddings, allowing text embeddings to be computed orders of magnitude faster on both GPU and CPU. It is designed for applications where computational resources are limited or where real-time performance is critical. Model2Vec models are the smallest, fastest, and most performant static embedders available. The distilled models are up to 50 times smaller and 500 times faster than traditional Sentence Transformers.


  ## Installation
@@ -38,7 +41,7 @@ from model2vec import StaticModel
  model = StaticModel.from_pretrained("andersborges/model2vecdk")

  # Compute text embeddings
- embeddings = model.encode(["Example sentence"])
  ```

  ### Using Sentence Transformers
@@ -52,31 +55,55 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("andersborges/model2vecdk")

  # Compute text embeddings
- embeddings = model.encode(["Example sentence"])
  ```

- ### Distilling a Model2Vec model

- You can distill a Model2Vec model from a Sentence Transformer model using the `distill` method. First, install the `distill` extra with `pip install model2vec[distill]`. Then, run the following code:

- ```python
- from model2vec.distill import distill

- # Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model
- m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

- # Save the model
- m2v_model.save_pretrained("m2v_model")
  ```

- ## How it works

- Model2Vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Best of all, you don't need any data to distill a model using Model2Vec.

- It works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using [SIF weighting](https://openreview.net/pdf?id=SyK00v5xx). During inference, we simply take the mean of all token embeddings occurring in a sentence.

  ## Additional Resources

  - [Model2Vec Repo](https://github.com/MinishLab/model2vec)
  - [Model2Vec Base Models](https://huggingface.co/collections/minishlab/model2vec-base-models-66fd9dd9b7c3b3c0f25ca90e)
  - [Model2Vec Results](https://github.com/MinishLab/model2vec/tree/main/results)
 
README.md (after change)

  ---
  library_name: model2vec
  license: mit
+ model_name: model2vecdk
  tags:
  - embeddings
  - static-embeddings
  - sentence-transformers
+ base_model:
+ - jealk/TTC-L2V-supervised-2
+ language:
+ - da
+ datasets:
+ - DDSC/nordic-embedding-training-data
+ repo_url: https://github.com/andersborges/dkmodel2vec
  ---

+ # dkmodel2vec Model Card

+ This [Model2Vec](https://github.com/MinishLab/model2vec) model is a distilled version of an [LLM2Vec](https://github.com/McGill-NLP/llm2vec) model. It uses static embeddings, allowing text embeddings to be computed orders of magnitude faster on both GPU and CPU. It is designed for applications where computational resources are limited or where real-time performance is critical. Model2Vec models are the smallest, fastest, and most performant static embedders available. The distilled models are up to 50 times smaller and 500 times faster than traditional Sentence Transformers.


  ## Installation
 
@@ -38,7 +41,7 @@ from model2vec import StaticModel
  model = StaticModel.from_pretrained("andersborges/model2vecdk")

  # Compute text embeddings
+ embeddings = model.encode(["Jeg elsker kage"])  # Danish for "I love cake"
  ```

  ### Using Sentence Transformers

@@ -52,31 +55,55 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("andersborges/model2vecdk")

  # Compute text embeddings
+ embeddings = model.encode(["Jeg elsker kage"])  # Danish for "I love cake"
  ```

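To sanity-check the model, you can compare two sentences directly with cosine similarity. A minimal sketch (the sentences and the similarity computation are illustrative additions, not part of the training or evaluation setup):

```python
import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("andersborges/model2vecdk")

# Encode two Danish sentences: "I love cake" / "Cake is my favourite dish".
a, b = model.encode(["Jeg elsker kage", "Kage er min livret"])

# Cosine similarity between the two static sentence embeddings.
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")
```
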
+ ## How it works

+ Model2Vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Best of all, you don't need any data to distill a model using Model2Vec.

+ It works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using [SIF weighting](https://openreview.net/pdf?id=SyK00v5xx). During inference, we simply take the mean of all token embeddings occurring in a sentence.
+
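A toy sketch of that recipe (the helper names, vocabulary, and token probabilities below are made-up illustrations, not the actual model2vec internals):

```python
import numpy as np

def sif_weight(token_prob: float, alpha: float = 1e-3) -> float:
    # SIF weighting (Arora et al.): w(t) = a / (a + p(t)),
    # so frequent tokens contribute less to the sentence embedding.
    return alpha / (alpha + token_prob)

# Assume these are PCA-reduced token embeddings and unigram probabilities
# for a tiny toy vocabulary.
token_vectors = {
    "jeg": np.array([0.1, 0.3]),
    "elsker": np.array([0.7, 0.2]),
    "kage": np.array([0.4, 0.9]),
}
token_probs = {"jeg": 0.05, "elsker": 0.001, "kage": 0.002}

# Each static token vector is re-weighted once, offline.
weighted = {t: sif_weight(token_probs[t]) * v for t, v in token_vectors.items()}

def embed(tokens: list[str]) -> np.ndarray:
    # Inference: the mean of all token embeddings occurring in the sentence.
    return np.mean([weighted[t] for t in tokens if t in weighted], axis=0)

print(embed(["jeg", "elsker", "kage"]))
```
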
+ ## Training

+ See the [repo](https://github.com/andersborges/dkmodel2vec) for details. The model was trained with the following commands:

+ ```bash
+ # distill model
+ python scripts/hyperparams.py --output-dim 256 --sif-coefficient 0.0005 --strip-upper-case --strip-exotic --focus-pca --normalize-embeddings --vocab-size 150000
+
+ # dump features
+ python scripts/featurize.py --max-means 100000 --max-length 800
+
+ # fine-tune
+ python scripts/finetune.py --model2vec-model-name scripts/models/dk-llm2vec-model2vec-dim256_sif0.0005_strip_upper_case_strip_exotic_focus_pca_normalize_embeddings --data-path features/features_100000_max_length_800 --lr 0.0001
  ```

+ ## Evaluation

+ The model was evaluated on the 10% of DDSC/nordic-embedding-training-data held out from training, which contains triplets of a query, a positive (relevant) document, and a negative (not relevant) document. The model achieved the following accuracy:
+
+ | Model | Accuracy |
+ |--------------------------------|----------|
+ | model2vecdk | 0.867 |
+ | BM25 | 0.882 |
+ | multilingual-e5-large-instruct | 0.963 |
+
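Accuracy here is triplet accuracy: the share of triplets for which the query embedding is more similar to the positive document than to the negative one. A minimal sketch of such a check (assuming cosine similarity; this is not the evaluation code from the repo):

```python
import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("andersborges/model2vecdk")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_accuracy(triplets: list[tuple[str, str, str]]) -> float:
    # A triplet counts as correct when the query is closer to the
    # positive document than to the negative one.
    hits = 0
    for query, positive, negative in triplets:
        q, p, n = model.encode([query, positive, negative])
        hits += cosine(q, p) > cosine(q, n)
    return hits / len(triplets)

# Toy triplet: "What is a cake?" / "A cake is a baked good." / "Dogs are mammals."
print(triplet_accuracy([("Hvad er en kage?", "En kage er et bagværk.", "Hunde er pattedyr.")]))
```
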
+ The model was also evaluated using the [Scandinavian Embedding Benchmark](https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/) and achieved the following performance:
+
+ | Rank | Model | Average Score | Average Rank | Angry Tweets | Bornholm Parallel | DKHate | Da Political Comments | DanFEVER | LCC | Language Identification | Massive Intent | Massive Scenario | ScaLA | TV2Nord Retrieval | Twitterhjerne |
+ |------|--------------------------------|---------------|--------------|--------------|--------------------|--------|------------------------|----------|-------|--------------------------|----------------|------------------|--------|---------------------|----------------|
+ | 1 | TTC-L2V-supervised-2 | 0.68 | 4.75 | 67.09 | 54.59 | 69.00 | 45.84 | 38.31 | 73.67 | 88.61 | 74.80 | 78.35 | 53.04 | 92.79 | 85.02 |
+ | 2 | multilingual-e5-large-instruct | 0.66 | 7.75 | 64.57 | 55.02 | 67.14 | 45.33 | 39.52 | 70.60 | 82.48 | 71.89 | 77.51 | 50.18 | 93.69 | 77.23 |
+ | 3 | text-embedding-3-large | 0.64 | 8.92 | 57.80 | 43.34 | 70.21 | 43.41 | 39.61 | 58.07 | 79.74 | 69.27 | 75.92 | 50.69 | 95.20 | 81.08 |
+ | 42 | dfm-encoder-small-v1 (SimCSE) | 0.42 | 33.54 | 51.92 | 40.82 | 60.00 | 35.25 | 16.99 | 58.53 | 50.50 | 47.92 | 52.95 | 51.36 | 22.28 | 20.02 |
+ | 43 | **model2vecdk** | 0.42 | 36.62 | 48.19 | 7.83 | 59.73 | 32.40 | 26.04 | 47.67 | 63.97 | 51.23 | 60.87 | 50.18 | 55.47 | 20.19 |
+ | 44 | xlm-roberta-large | 0.40 | 35.92 | 51.74 | 4.34 | 60.21 | 31.85 | 10.62 | 48.73 | 81.29 | 47.26 | 49.55 | 60.29 | 6.11 | 20.39 |


  ## Additional Resources

+ - [Repo used to fine-tune](https://github.com/andersborges/dkmodel2vec)
  - [Model2Vec Repo](https://github.com/MinishLab/model2vec)
  - [Model2Vec Base Models](https://huggingface.co/collections/minishlab/model2vec-base-models-66fd9dd9b7c3b3c0f25ca90e)
  - [Model2Vec Results](https://github.com/MinishLab/model2vec/tree/main/results)