bwang0911 committed
Commit 1ee3980
1 Parent(s): e65f2ba

Update README.md

Files changed (1): README.md +73 -0
README.md CHANGED
@@ -1,3 +1,76 @@
  ---
  license: apache-2.0
+ language:
+ - en
+ inference: false
  ---
+
+ <br><br>
+
+ <p align="center">
+ <img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
+ </p>
+
+
+ <p align="center">
+ <b>The text embedding suite trained by Jina AI's Finetuner team.</b>
+ </p>
+
+
+ ## Intended Usage & Model Info
+
+ `jina-embedding-b-en-v1` is a language model trained on Jina AI's Linnaeus-Clean dataset.
+ This dataset consists of 380 million sentence pairs, including query-document pairs.
+ These pairs were drawn from a variety of domains and carefully selected through a thorough cleaning process.
+ The Linnaeus-Full dataset, from which Linnaeus-Clean is derived, originally contained 1.6 billion sentence pairs.
+
+ The model supports a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
+
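+ As a quick illustration of the semantic textual similarity use case, the sketch below encodes two sentences and scores them with cosine similarity. It assumes the `finetuner.get_model` / `encode` interface shown in the Usage section below; the example sentences and the `numpy` cosine computation are illustrative only.
+
+ ```python
+ import numpy as np
+ import finetuner
+
+ # Load the embedding model (same call as in the Usage section below).
+ model = finetuner.get_model('jinaai/jina-embedding-b-en-v1')
+
+ # Encode two sentences into one dense vector each.
+ a, b = model.encode(['A man is eating food.', 'A man is eating a piece of bread.'])
+
+ # Cosine similarity: values closer to 1 indicate closer meanings.
+ print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+ ```
+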
+ With a standard size of 110 million parameters,
+ the model enables fast inference while delivering better performance than our small model.
+ We recommend using a single GPU for inference.
+ Additionally, we provide the following sizes:
+
+ - `jina-embedding-s-en-v1`: 35 million parameters.
+ - `jina-embedding-l-en-v1`: 800 million parameters.
+ - `jina-embedding-xl-en-v1`: 3 billion parameters (coming soon).
+ - `jina-embedding-xxl-en-v1`: 11 billion parameters (coming soon).
+
+ ## Data & Parameters
+
+ More details will be released together with the technical report.
+
+ ## Metrics
+
+ We compared the model against `all-minilm-l6-v2` / `all-mpnet-base-v2` from sbert and `text-embedding-ada-002` from OpenAI:
+
+ | Name | Parameters | Context length |
+ |------------------------|---------------------|------|
+ | all-minilm-l6-v2 | 33m | 128 |
+ | all-mpnet-base-v2 | 110m | 128 |
+ | text-embedding-ada-002 | unknown (API-based) | 8192 |
+ | jina-embedding-s-en-v1 | 35m | 512 |
+ | jina-embedding-b-en-v1 | 110m | 512 |
+
+ | Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECCOVID | Quora | SciFact |
+ |------------------------|-------|-------|-------|-------|-------|-------|-----------|-------|---------|
+ | all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.790 | 0.876 | 0.473 | 0.876 | 0.645 |
+ | all-mpnet-base-v2 | 0.726 | 0.835 | 0.780 | 0.857 | 0.800 | 0.906 | 0.513 | 0.875 | 0.656 |
+ | text-embedding-ada-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.860 | 0.903 | 0.685 | 0.876 | 0.726 |
+ | jina-embedding-b-en-v1 | 0.736 | 0.804 | 0.745 | 0.844 | 0.793 | 0.873 | 0.481 | 0.870 | 0.616 |
+
+ For more tasks and metrics, please check out the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) benchmark.
+
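+ As a rough sketch of how such scores can be reproduced, the snippet below runs a single MTEB task. It assumes the `mteb` Python package and that the loaded model exposes the `encode` method shown in the Usage section; the task choice and output folder are illustrative.
+
+ ```python
+ # pip install mteb
+ from mteb import MTEB
+ import finetuner
+
+ # MTEB accepts any model object that provides an `encode(List[str])` method.
+ model = finetuner.get_model('jinaai/jina-embedding-b-en-v1')
+
+ # Evaluate on one STS task; add more task names to cover more of the table above.
+ evaluation = MTEB(tasks=['STS12'])
+ evaluation.run(model, output_folder='results/jina-embedding-b-en-v1')
+ ```
+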
+ ## Usage [WIP]
+
+ ```python
+ # Install the package first: pip install "finetuner[text]"
+ import finetuner
+
+ # Load the embedding model from the Hugging Face Hub.
+ model = finetuner.get_model('jinaai/jina-embedding-b-en-v1')
+ # Encode sentences into embedding vectors, one per input sentence.
+ embeddings = model.encode(['sentence 1', 'sentence 2'])
+ ```
+
+ ## Fine-tuning [WIP]
+
+ Please consider using [Finetuner](https://github.com/jina-ai/finetuner) to fine-tune the model on your own data, as sketched below.
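+
+ Purely as a hypothetical sketch of what a run could look like with Finetuner's cloud `fit` workflow: every argument below is an assumption for illustration, so please consult the Finetuner documentation for the actual API.
+
+ ```python
+ import finetuner
+
+ # Log in to Jina AI Cloud, where Finetuner executes runs.
+ finetuner.login()
+
+ # Illustrative assumptions only, not a verified recipe.
+ run = finetuner.fit(
+     model='jinaai/jina-embedding-b-en-v1',
+     train_data='path/to/train.csv',  # hypothetical CSV of text pairs
+     loss='TripletMarginLoss',
+     epochs=3,
+ )
+ print(run.name, run.status())
+ ```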