Update README.md
README.md
CHANGED
@@ -1,130 +1,130 @@
----
-base_model: baai/bge-base-en-v1.5
-language:
-- en
-library_name: model2vec
-license: mit
-model_name: brown-fairy-base-v0
-tags:
-- embeddings
-- static-embeddings
-- sentence-transformers
----
-#
-
-<div align="center">
-<img width="50%" alt="Fairy logo" src="./assets/fairy_logo.png">
-</div>
-
-> [!TIP]
-> Fairies are among the most enchanting and magical beings in folklore and mythology. They appear across countless cultures and stories, from ancient forests to modern gardens. They are celebrated for their ability to bridge the mundane and magical realms, known for their ethereal grace and transformative powers. Fairies are tiny, higher-dimensional beings that can interact with the world in ways that are beyond our understanding.
-
-The fairy series is an attempt to tune the beetle series of models to be more suitable for downstream tasks. These models are meant to be fully open experiments in building state-of-the-art static embeddings.
-
-The brown-fairy-base-v0 model is a distillation of the `baai/bge-base-en-v1.5` model into the `brown-beetle-base-v0` model. No PCA or Zipf weighting was applied to this model.
-
-## Installation
-
-Install model2vec using pip:
-
-```bash
-pip install model2vec
-```
-
-## Usage
-
-Load this model using the `from_pretrained` method:
-
-```python
-from model2vec import StaticModel
-
-# Load a pretrained Model2Vec model
-model = StaticModel.from_pretrained("bhavnicksm/brown-fairy-base-v0")
-
-# Compute text embeddings
-embeddings = model.encode(["Example sentence"])
-```
-
-Read more about the Model2Vec library [here](https://github.com/MinishLab/model2vec).
-
-## Reproduce this model
-
-This model was trained on a subset of 2 million texts from the [FineWeb-Edu](https://huggingface.co/datasets/mixedbread-ai/fineweb-edu) dataset, labeled by the `baai/bge-base-en-v1.5` model.
-
-<details>
-<summary>Training Code</summary>
-
-Note: The datasets need to be made separately and loaded with the `datasets` library.
-
-```python
-from sentence_transformers import (
-    SentenceTransformer,
-    SentenceTransformerTrainer,
-    SentenceTransformerTrainingArguments,
-)
-from sentence_transformers.evaluation import NanoBEIREvaluator
-from sentence_transformers.losses import MSELoss
-from sentence_transformers.models import StaticEmbedding
-from sentence_transformers.training_args import BatchSamplers
-
-# train_dataset and eval_dataset are not defined here: build them separately
-# and load them with the `datasets` library (see the note above).
-
-# Start from the brown-beetle-base-v0 static embeddings
-static_embedding = StaticEmbedding.from_model2vec("bhavnicksm/brown-beetle-base-v0")
-model = SentenceTransformer(
-    modules=[static_embedding]
-)
-
-loss = MSELoss(model)
-
-run_name = "brown-fairy-base-v0"
-args = SentenceTransformerTrainingArguments(
-    # Required parameter:
-    output_dir=f"output/{run_name}",
-    # Optional training parameters:
-    num_train_epochs=1,
-    per_device_train_batch_size=2048,
-    per_device_eval_batch_size=2048,
-    learning_rate=1e-1,
-    warmup_ratio=0.1,
-    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
-    bf16=True,  # Set to True if you have a GPU that supports BF16
-    batch_sampler=BatchSamplers.NO_DUPLICATES,
-    # Optional tracking/debugging parameters:
-    eval_strategy="steps",
-    eval_steps=50,
-    save_strategy="steps",
-    save_steps=50,
-    save_total_limit=5,
-    logging_steps=50,
-    logging_first_step=True,
-    run_name=run_name,
-)
-
-# Baseline NanoBEIR scores before training
-evaluator = NanoBEIREvaluator()
-evaluator(model)
-
-trainer = SentenceTransformerTrainer(
-    model=model,
-    args=args,
-    train_dataset=train_dataset,
-    eval_dataset=eval_dataset,
-    loss=loss,
-    evaluator=evaluator,
-)
-trainer.train()
-
-# NanoBEIR scores after training
-evaluator(model)
-
-model.save_pretrained(f"output/{run_name}")
-```
-
-</details>
-
-## Comparison with other models
-
-Coming soon...
-
-## Acknowledgements
-
-This model is based on the [Model2Vec](https://github.com/MinishLab/model2vec) library. Credit goes to the [Minish Lab](https://github.com/MinishLab) team for developing this library.
-
-## Citation
-
-This model builds on work done by Minish Lab. Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) if you use this model in your work.
-
-```bibtex
-@software{minishlab2024model2vec,
-  author = {Stephan Tulkens and Thomas van Dongen},
-  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
-  year = {2024},
-  url = {https://github.com/MinishLab/model2vec},
-}
-```
+---
+base_model: baai/bge-base-en-v1.5
+language:
+- en
+library_name: model2vec
+license: mit
+model_name: brown-fairy-base-v0
+tags:
+- embeddings
+- static-embeddings
+- sentence-transformers
+---
+# 🧚🏻♀️ brown-fairy-base-v0 Model Card
+
+<div align="center">
+<img width="50%" alt="Fairy logo" src="./assets/fairy_logo.png">
+</div>
+
+> [!TIP]
+> Fairies are among the most enchanting and magical beings in folklore and mythology. They appear across countless cultures and stories, from ancient forests to modern gardens. They are celebrated for their ability to bridge the mundane and magical realms, known for their ethereal grace and transformative powers. Fairies are tiny, higher-dimensional beings that can interact with the world in ways that are beyond our understanding.
+
+The fairy series is an attempt to tune the beetle series of models to be more suitable for downstream tasks. These models are meant to be fully open experiments in building state-of-the-art static embeddings.
+
+The brown-fairy-base-v0 model is a distillation of the `baai/bge-base-en-v1.5` model into the `brown-beetle-base-v0` model. No PCA or Zipf weighting was applied to this model.
+
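As background for the term "static embeddings": a static model keeps one fixed vector per token and, as the Model2Vec library describes it, builds a sentence embedding by averaging the vectors of the tokens in the text. The sketch below illustrates only that idea. The toy vocabulary, the 3-dimensional vectors, and whitespace tokenization are assumptions made for illustration; they are not this model's actual tokenizer or weights.

```python
from typing import Dict, List

# Hypothetical toy vocabulary: one fixed vector per token (illustration only).
TOY_VECTORS: Dict[str, List[float]] = {
    "static": [1.0, 0.0, 0.0],
    "embeddings": [0.0, 1.0, 0.0],
    "are": [0.0, 0.0, 1.0],
    "fast": [1.0, 1.0, 0.0],
}


def embed(text: str, vectors: Dict[str, List[float]] = TOY_VECTORS) -> List[float]:
    """Mean-pool the vectors of known tokens; unknown tokens are skipped."""
    dim = len(next(iter(vectors.values())))
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    found = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not found:
        return [0.0] * dim
    # Average each dimension across the token vectors that were found.
    return [sum(column) / len(found) for column in zip(*found)]


print(embed("static embeddings are fast"))
```

Because no contextual layers run at inference time, encoding reduces to a table lookup plus an average, which is where the speed of static models comes from.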
+## Installation
+
+Install model2vec using pip:
+
+```bash
+pip install model2vec
+```
+
+## Usage
+
+Load this model using the `from_pretrained` method:
+
+```python
+from model2vec import StaticModel
+
+# Load a pretrained Model2Vec model
+model = StaticModel.from_pretrained("bhavnicksm/brown-fairy-base-v0")
+
+# Compute text embeddings
+embeddings = model.encode(["Example sentence"])
+```
+
+Read more about the Model2Vec library [here](https://github.com/MinishLab/model2vec).
+
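A common next step after `model.encode` is comparing embeddings. The following is a minimal sketch, assuming each embedding is a sequence of floats (rows returned by `model.encode` can be passed in the same way). The helper names `cosine_similarity` and `rank_by_similarity` and the toy vectors are hypothetical and are not part of Model2Vec.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
for index, score in rank_by_similarity(query, candidates):
    print(index, round(score, 3))
```

With real embeddings, `query` would be one row of the `model.encode` output and `candidates` the remaining rows.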
+## Reproduce this model
+
+This model was trained on a subset of 2 million texts from the [FineWeb-Edu](https://huggingface.co/datasets/mixedbread-ai/fineweb-edu) dataset, labeled by the `baai/bge-base-en-v1.5` model.
+
+<details>
+<summary>Training Code</summary>
+
+Note: The datasets need to be made separately and loaded with the `datasets` library.
+
+```python
+from sentence_transformers import (
+    SentenceTransformer,
+    SentenceTransformerTrainer,
+    SentenceTransformerTrainingArguments,
+)
+from sentence_transformers.evaluation import NanoBEIREvaluator
+from sentence_transformers.losses import MSELoss
+from sentence_transformers.models import StaticEmbedding
+from sentence_transformers.training_args import BatchSamplers
+
+# train_dataset and eval_dataset are not defined here: build them separately
+# and load them with the `datasets` library (see the note above).
+
+# Start from the brown-beetle-base-v0 static embeddings
+static_embedding = StaticEmbedding.from_model2vec("bhavnicksm/brown-beetle-base-v0")
+model = SentenceTransformer(
+    modules=[static_embedding]
+)
+
+loss = MSELoss(model)
+
+run_name = "brown-fairy-base-v0"
+args = SentenceTransformerTrainingArguments(
+    # Required parameter:
+    output_dir=f"output/{run_name}",
+    # Optional training parameters:
+    num_train_epochs=1,
+    per_device_train_batch_size=2048,
+    per_device_eval_batch_size=2048,
+    learning_rate=1e-1,
+    warmup_ratio=0.1,
+    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
+    bf16=True,  # Set to True if you have a GPU that supports BF16
+    batch_sampler=BatchSamplers.NO_DUPLICATES,
+    # Optional tracking/debugging parameters:
+    eval_strategy="steps",
+    eval_steps=50,
+    save_strategy="steps",
+    save_steps=50,
+    save_total_limit=5,
+    logging_steps=50,
+    logging_first_step=True,
+    run_name=run_name,
+)
+
+# Baseline NanoBEIR scores before training
+evaluator = NanoBEIREvaluator()
+evaluator(model)
+
+trainer = SentenceTransformerTrainer(
+    model=model,
+    args=args,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset,
+    loss=loss,
+    evaluator=evaluator,
+)
+trainer.train()
+
+# NanoBEIR scores after training
+evaluator(model)
+
+model.save_pretrained(f"output/{run_name}")
+```
+
+</details>
+
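The training setup above is a form of embedding distillation: each text is paired with the teacher's embedding, and an MSE loss penalizes the squared difference between the student's embedding and that target. The sketch below illustrates the idea in plain Python under stated assumptions. `teacher_encode` is a hypothetical stand-in for encoding with `baai/bge-base-en-v1.5`, `fake_teacher` is a toy replacement for it, and the `sentence`/`label` field names are an assumed layout, not necessarily the exact columns used for this model.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def build_distillation_records(
    texts: Sequence[str], teacher_encode: Callable[[str], Sequence[float]]
) -> List[Dict[str, object]]:
    """Pair every text with the teacher embedding that serves as its target."""
    return [{"sentence": text, "label": list(teacher_encode(text))} for text in texts]


def mse(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between a student embedding and its teacher target."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher embeddings must have the same size")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def fake_teacher(text: str) -> Vector:
    # Hypothetical stand-in for a real teacher model (illustration only):
    # it encodes a text as [character count, space count].
    return [float(len(text)), float(text.count(" "))]


records = build_distillation_records(["a b", "hello"], fake_teacher)
print(records[0])
print(mse([1.0, 2.0], [1.0, 4.0]))
```

The student and teacher embeddings must share the same dimensionality for this objective to be defined, which is why the size check is included.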
+## Comparison with other models
+
+Coming soon...
+
+## Acknowledgements
+
+This model is based on the [Model2Vec](https://github.com/MinishLab/model2vec) library. Credit goes to the [Minish Lab](https://github.com/MinishLab) team for developing this library.
+
+## Citation
+
+This model builds on work done by Minish Lab. Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) if you use this model in your work.
+
+```bibtex
+@software{minishlab2024model2vec,
+  author = {Stephan Tulkens and Thomas van Dongen},
+  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
+  year = {2024},
+  url = {https://github.com/MinishLab/model2vec},
+}
+```