JPBianchi committed on
Commit
0cc727c
2 Parent(s): 30ffb9e 676f649

Merge branch 'main' of hf.co:spaces/JPBianchi/vectorsearch

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. data/.DS_Store +0 -0
  2. data/golden_100.json +0 -0
  3. data/impact_theory_data.json +3 -0
  4. data/training_data_300.json +0 -0
  5. data/validation_data_100.json +0 -0
  6. models/.DS_Store +0 -0
  7. models/finetuned-all-MiniLM-L6-v2-300/.DS_Store +0 -0
  8. models/finetuned-all-MiniLM-L6-v2-300/1_Pooling/config.json +7 -0
  9. models/finetuned-all-MiniLM-L6-v2-300/README.md +91 -0
  10. models/finetuned-all-MiniLM-L6-v2-300/config.json +26 -0
  11. models/finetuned-all-MiniLM-L6-v2-300/config_sentence_transformers.json +7 -0
  12. models/finetuned-all-MiniLM-L6-v2-300/eval/Information-Retrieval_evaluation_results.csv +11 -0
  13. models/finetuned-all-MiniLM-L6-v2-300/modules.json +20 -0
  14. models/finetuned-all-MiniLM-L6-v2-300/pytorch_model.bin +3 -0
  15. models/finetuned-all-MiniLM-L6-v2-300/sentence_bert_config.json +4 -0
  16. models/finetuned-all-MiniLM-L6-v2-300/special_tokens_map.json +7 -0
  17. models/finetuned-all-MiniLM-L6-v2-300/tokenizer.json +0 -0
  18. models/finetuned-all-MiniLM-L6-v2-300/tokenizer_config.json +22 -0
  19. models/finetuned-all-MiniLM-L6-v2-300/vocab.txt +0 -0
  20. models/local.txt +1 -0
  21. models/models/.DS_Store +0 -0
  22. models/models/all-MiniLM-L6-v2/.DS_Store +0 -0
  23. models/models/all-MiniLM-L6-v2/1_Pooling/config.json +7 -0
  24. models/models/all-MiniLM-L6-v2/README.md +176 -0
  25. models/models/all-MiniLM-L6-v2/config.json +26 -0
  26. models/models/all-MiniLM-L6-v2/config_sentence_transformers.json +7 -0
  27. models/models/all-MiniLM-L6-v2/modules.json +20 -0
  28. models/models/all-MiniLM-L6-v2/pytorch_model.bin +3 -0
  29. models/models/all-MiniLM-L6-v2/sentence_bert_config.json +4 -0
  30. models/models/all-MiniLM-L6-v2/special_tokens_map.json +7 -0
  31. models/models/all-MiniLM-L6-v2/tokenizer.json +0 -0
  32. models/models/all-MiniLM-L6-v2/tokenizer_config.json +22 -0
  33. models/models/all-MiniLM-L6-v2/vocab.txt +0 -0
  34. models/models/all-mpnet-base-v2/.DS_Store +0 -0
  35. models/models/all-mpnet-base-v2/1_Pooling/config.json +7 -0
  36. models/models/all-mpnet-base-v2/README.md +176 -0
  37. models/models/all-mpnet-base-v2/config.json +24 -0
  38. models/models/all-mpnet-base-v2/config_sentence_transformers.json +7 -0
  39. models/models/all-mpnet-base-v2/modules.json +20 -0
  40. models/models/all-mpnet-base-v2/pytorch_model.bin +3 -0
  41. models/models/all-mpnet-base-v2/sentence_bert_config.json +4 -0
  42. models/models/all-mpnet-base-v2/special_tokens_map.json +15 -0
  43. models/models/all-mpnet-base-v2/tokenizer.json +0 -0
  44. models/models/all-mpnet-base-v2/tokenizer_config.json +22 -0
  45. models/models/all-mpnet-base-v2/vocab.txt +0 -0
  46. models/models/finetuned-all-mpnet-base-v2-300/.DS_Store +0 -0
  47. models/models/finetuned-all-mpnet-base-v2-300/1_Pooling/config.json +7 -0
  48. models/models/finetuned-all-mpnet-base-v2-300/README.md +91 -0
  49. models/models/finetuned-all-mpnet-base-v2-300/config.json +24 -0
  50. models/models/finetuned-all-mpnet-base-v2-300/config_sentence_transformers.json +7 -0
data/.DS_Store ADDED
Binary file (6.15 kB).
 
data/golden_100.json ADDED
The diff for this file is too large to render.
 
data/impact_theory_data.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f9a71cfb9065a90784b6f69da8ec606a4c03fc2bb1a2cae6a7d3182bb7819a22
3
+ size 26933365
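The three added lines are a Git LFS pointer, not the JSON data itself: the real file is stored out of band and identified by its SHA-256 digest and byte size. As a minimal sketch, assuming the pointer is plain `key value` lines as shown above, it could be read like this (`parse_lfs_pointer` is a hypothetical helper, not part of this repository):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, oid and size fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each pointer line is "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is written as "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


# Pointer text copied from the hunk above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:f9a71cfb9065a90784b6f69da8ec606a4c03fc2bb1a2cae6a7d3182bb7819a22
size 26933365
"""
info = parse_lfs_pointer(pointer)
print(f"{info['size'] / 1e6:.1f} MB, {info['oid_algo']} {info['oid'][:12]}...")
```

The size field shows that `data/impact_theory_data.json` is roughly 27 MB, which is why it is tracked through LFS rather than committed inline.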
data/training_data_300.json ADDED
The diff for this file is too large to render.
 
data/validation_data_100.json ADDED
The diff for this file is too large to render.
 
models/.DS_Store ADDED
Binary file (6.15 kB).
 
models/finetuned-all-MiniLM-L6-v2-300/.DS_Store ADDED
Binary file (6.15 kB).
 
models/finetuned-all-MiniLM-L6-v2-300/1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "word_embedding_dimension": 384,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false
7
+ }
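This pooling config enables only `pooling_mode_mean_tokens`, so a sentence vector is the average of its token vectors. A minimal pure-Python sketch of that idea follows; the function name and toy 2-dimensional vectors are illustrative assumptions (the real model uses 384 dimensions and tensor operations):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    count = max(count, 1)  # guard against an all-padding input
    return [t / count for t in total]


# Toy input: two real tokens and one padding position (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

The padding vector does not affect the result because its mask entry is 0, which is the point of taking the attention mask into account when averaging.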
models/finetuned-all-MiniLM-L6-v2-300/README.md ADDED
@@ -0,0 +1,91 @@
1
+ ---
2
+ pipeline_tag: sentence-similarity
3
+ tags:
4
+ - sentence-transformers
5
+ - feature-extraction
6
+ - sentence-similarity
7
+
8
+ ---
9
+
10
+ # {MODEL_NAME}
11
+
12
+ This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
13
+
14
+ <!--- Describe your model here -->
15
+
16
+ ## Usage (Sentence-Transformers)
17
+
18
+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
19
+
20
+ ```
21
+ pip install -U sentence-transformers
22
+ ```
23
+
24
+ Then you can use the model like this:
25
+
26
+ ```python
27
+ from sentence_transformers import SentenceTransformer
28
+ sentences = ["This is an example sentence", "Each sentence is converted"]
29
+
30
+ model = SentenceTransformer('{MODEL_NAME}')
31
+ embeddings = model.encode(sentences)
32
+ print(embeddings)
33
+ ```
34
+
35
+
36
+
37
+ ## Evaluation Results
38
+
39
+ <!--- Describe how your model was evaluated -->
40
+
41
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
42
+
43
+
44
+ ## Training
45
+ The model was trained with the parameters:
46
+
47
+ **DataLoader**:
48
+
49
+ `torch.utils.data.dataloader.DataLoader` of length 10 with parameters:
50
+ ```
51
+ {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
52
+ ```
53
+
54
+ **Loss**:
55
+
56
+ `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
57
+ ```
58
+ {'scale': 20.0, 'similarity_fct': 'cos_sim'}
59
+ ```
60
+
61
+ Parameters of the fit()-Method:
62
+ ```
63
+ {
64
+ "epochs": 10,
65
+ "evaluation_steps": 50,
66
+ "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
67
+ "max_grad_norm": 1,
68
+ "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
69
+ "optimizer_params": {
70
+ "lr": 2e-05
71
+ },
72
+ "scheduler": "WarmupLinear",
73
+ "steps_per_epoch": null,
74
+ "warmup_steps": 10,
75
+ "weight_decay": 0.01
76
+ }
77
+ ```
78
+
79
+
80
+ ## Full Model Architecture
81
+ ```
82
+ SentenceTransformer(
83
+ (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
84
+ (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
85
+ (2): Normalize()
86
+ )
87
+ ```
88
+
89
+ ## Citing & Authors
90
+
91
+ <!--- Describe where people can find more information -->
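The model card above lists a `WarmupLinear` scheduler with `warmup_steps` 10, a learning rate of 2e-05, 10 epochs, and a DataLoader of length 10. As a hedged sketch, assuming the schedule is a linear ramp up followed by a linear decay to zero, and that the total step count is epochs times DataLoader length (100 here, since `steps_per_epoch` is null), the learning rate per step would look like this. The function name is hypothetical:

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=10, total_steps=100):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 10 * 10  # assumed: epochs * batches per epoch
for step in (0, 5, 10, 55, 100):
    print(step, warmup_linear_lr(step, total_steps=total_steps))
```

Under these assumptions the rate peaks at step 10 and is back to zero at step 100, so most of the 10 epochs run at a decaying rate.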
models/finetuned-all-MiniLM-L6-v2-300/config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "/Users/jpb2/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/",
3
+ "architectures": [
4
+ "BertModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 384,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 1536,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 512,
16
+ "model_type": "bert",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 6,
19
+ "pad_token_id": 0,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.33.1",
23
+ "type_vocab_size": 2,
24
+ "use_cache": true,
25
+ "vocab_size": 30522
26
+ }
models/finetuned-all-MiniLM-L6-v2-300/config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "2.0.0",
4
+ "transformers": "4.6.1",
5
+ "pytorch": "1.8.1"
6
+ }
7
+ }
models/finetuned-all-MiniLM-L6-v2-300/eval/Information-Retrieval_evaluation_results.csv ADDED
@@ -0,0 +1,11 @@
1
+ epoch,steps,cos_sim-Accuracy@1,cos_sim-Accuracy@3,cos_sim-Accuracy@5,cos_sim-Accuracy@10,cos_sim-Precision@1,cos_sim-Recall@1,cos_sim-Precision@3,cos_sim-Recall@3,cos_sim-Precision@5,cos_sim-Recall@5,cos_sim-Precision@10,cos_sim-Recall@10,cos_sim-MRR@10,cos_sim-NDCG@10,cos_sim-MAP@100,dot_score-Accuracy@1,dot_score-Accuracy@3,dot_score-Accuracy@5,dot_score-Accuracy@10,dot_score-Precision@1,dot_score-Recall@1,dot_score-Precision@3,dot_score-Recall@3,dot_score-Precision@5,dot_score-Recall@5,dot_score-Precision@10,dot_score-Recall@10,dot_score-MRR@10,dot_score-NDCG@10,dot_score-MAP@100
2
+ 0,-1,0.9,0.95,0.96,0.98,0.9,0.9,0.31666666666666665,0.95,0.19199999999999995,0.96,0.09799999999999998,0.98,0.9282619047619047,0.9407679373201044,0.9287259199134199,0.9,0.95,0.96,0.98,0.9,0.9,0.31666666666666665,0.95,0.19199999999999995,0.96,0.09799999999999998,0.98,0.9282619047619047,0.9407679373201044,0.9287259199134199
3
+ 1,-1,0.92,0.94,0.98,0.98,0.92,0.92,0.3133333333333333,0.94,0.19599999999999995,0.98,0.09799999999999998,0.98,0.94,0.9498456573943649,0.9404494949494949,0.92,0.94,0.98,0.98,0.92,0.92,0.3133333333333333,0.94,0.19599999999999995,0.98,0.09799999999999998,0.98,0.94,0.9498456573943649,0.9404494949494949
4
+ 2,-1,0.92,0.96,0.98,0.98,0.92,0.92,0.31999999999999995,0.96,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9428333333333333,0.952103186260223,0.9433273809523809,0.92,0.96,0.98,0.98,0.92,0.92,0.31999999999999995,0.96,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9428333333333333,0.952103186260223,0.9433273809523809
5
+ 3,-1,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.9441648659463786,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.9441648659463786
6
+ 4,-1,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.944189247311828,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.944189247311828
7
+ 5,-1,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.94415837621498,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.94415837621498
8
+ 6,-1,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.9441678187403995,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.9441678187403995
9
+ 7,-1,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.944174432497013,0.92,0.97,0.98,0.98,0.92,0.92,0.3233333333333333,0.97,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9436666666666667,0.9527964206794891,0.944174432497013
10
+ 8,-1,0.92,0.96,0.98,0.98,0.92,0.92,0.31999999999999995,0.96,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9428333333333333,0.952103186260223,0.9433410991636799,0.92,0.96,0.98,0.98,0.92,0.92,0.31999999999999995,0.96,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9428333333333333,0.952103186260223,0.9433410991636799
11
+ 9,-1,0.92,0.96,0.98,0.98,0.92,0.92,0.31999999999999995,0.96,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9428333333333333,0.952103186260223,0.943351851851852,0.92,0.96,0.98,0.98,0.92,0.92,0.31999999999999995,0.96,0.19599999999999995,0.98,0.09799999999999998,0.98,0.9428333333333333,0.952103186260223,0.943351851851852
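In every row of this CSV, Recall@k equals Accuracy@k and Precision@k equals Accuracy@k divided by k (for example 0.95 / 3 ≈ 0.3167 at epoch 0). That pattern is what you would expect if each query has exactly one relevant document. The sketch below computes the same family of metrics under that assumption; the function names and the three toy queries are hypothetical and are not the evaluator's actual code or data:

```python
def ir_metrics(ranked_ids, relevant_id, ks=(1, 3, 5, 10)):
    """Metrics for one query that has exactly one relevant document."""
    out = {}
    for k in ks:
        hit = 1.0 if relevant_id in ranked_ids[:k] else 0.0
        out[f"accuracy@{k}"] = hit
        out[f"recall@{k}"] = hit        # one relevant doc: recall equals hit
        out[f"precision@{k}"] = hit / k  # at most one hit among k results
    top10 = ranked_ids[:10]
    # Reciprocal rank of the relevant document, or 0 if it is not in the top 10.
    out["mrr@10"] = 1.0 / (top10.index(relevant_id) + 1) if relevant_id in top10 else 0.0
    return out


def average(per_query):
    """Average each metric over all queries."""
    keys = per_query[0].keys()
    return {k: sum(q[k] for q in per_query) / len(per_query) for k in keys}


# Toy data: (ranked result ids, id of the single relevant document).
queries = [
    (["d1", "d2", "d3"], "d1"),  # relevant doc at rank 1
    (["d4", "d5", "d6"], "d5"),  # relevant doc at rank 2
    (["d7", "d8", "d9"], "dX"),  # relevant doc not retrieved
]
scores = average([ir_metrics(ranked, rel) for ranked, rel in queries])
print(scores["accuracy@1"], scores["precision@3"], scores["mrr@10"])
```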
models/finetuned-all-MiniLM-L6-v2-300/modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Normalize",
18
+ "type": "sentence_transformers.models.Normalize"
19
+ }
20
+ ]
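The last module in this pipeline is `Normalize`, which presumably rescales each sentence embedding to unit length. One consequence is that a dot product between two normalized embeddings equals their cosine similarity, which would be consistent with the `cos_sim-*` and `dot_score-*` columns in the evaluation CSV being identical. A minimal sketch with hypothetical helper names and toy vectors:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
ua, ub = l2_normalize(a), l2_normalize(b)
# After normalization the plain dot product matches the cosine similarity.
print(dot(ua, ub), cos_sim(a, b))
```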
models/finetuned-all-MiniLM-L6-v2-300/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:19042a66f72a393f7ac7c494c22ea0e8fa32c4108d0f6f3bca94be5de46d5ad9
3
+ size 90885737
models/finetuned-all-MiniLM-L6-v2-300/sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 256,
3
+ "do_lower_case": false
4
+ }
models/finetuned-all-MiniLM-L6-v2-300/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
models/finetuned-all-MiniLM-L6-v2-300/tokenizer.json ADDED
The diff for this file is too large to render.
 
models/finetuned-all-MiniLM-L6-v2-300/tokenizer_config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "clean_up_tokenization_spaces": true,
3
+ "cls_token": "[CLS]",
4
+ "do_basic_tokenize": true,
5
+ "do_lower_case": true,
6
+ "mask_token": "[MASK]",
7
+ "max_length": 128,
8
+ "model_max_length": 512,
9
+ "never_split": null,
10
+ "pad_to_multiple_of": null,
11
+ "pad_token": "[PAD]",
12
+ "pad_token_type_id": 0,
13
+ "padding_side": "right",
14
+ "sep_token": "[SEP]",
15
+ "stride": 0,
16
+ "strip_accents": null,
17
+ "tokenize_chinese_chars": true,
18
+ "tokenizer_class": "BertTokenizer",
19
+ "truncation_side": "right",
20
+ "truncation_strategy": "longest_first",
21
+ "unk_token": "[UNK]"
22
+ }
models/finetuned-all-MiniLM-L6-v2-300/vocab.txt ADDED
The diff for this file is too large to render.
 
models/local.txt ADDED
@@ -0,0 +1 @@
1
+ Used to let streamlit know if it's running locally or online
models/models/.DS_Store ADDED
Binary file (6.15 kB).
 
models/models/all-MiniLM-L6-v2/.DS_Store ADDED
Binary file (6.15 kB).
 
models/models/all-MiniLM-L6-v2/1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "word_embedding_dimension": 384,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false
7
+ }
models/models/all-MiniLM-L6-v2/README.md ADDED
@@ -0,0 +1,176 @@
1
+ ---
2
+ pipeline_tag: sentence-similarity
3
+ tags:
4
+ - sentence-transformers
5
+ - feature-extraction
6
+ - sentence-similarity
7
+ language: en
8
+ license: apache-2.0
9
+ datasets:
10
+ - s2orc
11
+ - flax-sentence-embeddings/stackexchange_xml
12
+ - ms_marco
13
+ - gooaq
14
+ - yahoo_answers_topics
15
+ - code_search_net
16
+ - search_qa
17
+ - eli5
18
+ - snli
19
+ - multi_nli
20
+ - wikihow
21
+ - natural_questions
22
+ - trivia_qa
23
+ - embedding-data/sentence-compression
24
+ - embedding-data/flickr30k-captions
25
+ - embedding-data/altlex
26
+ - embedding-data/simple-wiki
27
+ - embedding-data/QQP
28
+ - embedding-data/SPECTER
29
+ - embedding-data/PAQ_pairs
30
+ - embedding-data/WikiAnswers
31
+
32
+ ---
33
+
34
+
35
+ # all-MiniLM-L6-v2
36
+ This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
37
+
38
+ ## Usage (Sentence-Transformers)
39
+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
40
+
41
+ ```
42
+ pip install -U sentence-transformers
43
+ ```
44
+
45
+ Then you can use the model like this:
46
+ ```python
47
+ from sentence_transformers import SentenceTransformer
48
+ sentences = ["This is an example sentence", "Each sentence is converted"]
49
+
50
+ model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
51
+ embeddings = model.encode(sentences)
52
+ print(embeddings)
53
+ ```
54
+
55
+ ## Usage (HuggingFace Transformers)
56
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, you pass your input through the transformer model, then you apply the right pooling operation on top of the contextualized word embeddings.
57
+
58
+ ```python
59
+ from transformers import AutoTokenizer, AutoModel
60
+ import torch
61
+ import torch.nn.functional as F
62
+
63
+ #Mean Pooling - Take attention mask into account for correct averaging
64
+ def mean_pooling(model_output, attention_mask):
65
+     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
66
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
67
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
68
+
69
+
70
+ # Sentences we want sentence embeddings for
71
+ sentences = ['This is an example sentence', 'Each sentence is converted']
72
+
73
+ # Load model from HuggingFace Hub
74
+ tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
75
+ model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
76
+
77
+ # Tokenize sentences
78
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
79
+
80
+ # Compute token embeddings
81
+ with torch.no_grad():
82
+     model_output = model(**encoded_input)
83
+
84
+ # Perform pooling
85
+ sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
86
+
87
+ # Normalize embeddings
88
+ sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
89
+
90
+ print("Sentence embeddings:")
91
+ print(sentence_embeddings)
92
+ ```
93
+
94
+ ## Evaluation Results
95
+
96
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/all-MiniLM-L6-v2)
97
+
98
+ ------
99
+
100
+ ## Background
101
+
102
+ The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
103
+ contrastive learning objective. We used the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model and fine-tuned it on a
104
+ 1B sentence-pair dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in our dataset.
105
+
106
+ We developed this model during the
107
+ [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
108
+ organized by Hugging Face. We developed this model as part of the project:
109
+ [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance from members of Google's Flax, JAX, and Cloud teams on efficient deep learning frameworks.
110
+
111
+ ## Intended uses
112
+
113
+ Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
114
+ the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
115
+
116
+ By default, input text longer than 256 word pieces is truncated.
117
+
118
+
119
+ ## Training procedure
120
+
121
+ ### Pre-training
122
+
123
+ We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model. Please refer to the model card for more detailed information about the pre-training procedure.
124
+
125
+ ### Fine-tuning
126
+
127
+ We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch.
128
+ We then apply the cross entropy loss by comparing with true pairs.
129
+
130
+ #### Hyper parameters
131
+
132
+ We trained our model on a TPU v3-8. We trained the model for 100k steps using a batch size of 1024 (128 per TPU core).
133
+ We use a learning rate warm up of 500. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
134
+ a 2e-5 learning rate. The full training script is accessible in this current repository: `train_script.py`.
135
+
136
+ #### Training data
137
+
138
+ We use a concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion.
139
+ We sampled each dataset with a weighted probability; the configuration is detailed in the `data_config.json` file.
140
+
141
+
142
+ | Dataset | Paper | Number of training tuples |
143
+ |--------------------------------------------------------|:----------------------------------------:|:--------------------------:|
144
+ | [Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit) | [paper](https://arxiv.org/abs/1904.06472) | 726,484,430 |
145
+ | [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Abstracts) | [paper](https://aclanthology.org/2020.acl-main.447/) | 116,288,806 |
146
+ | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs | [paper](https://doi.org/10.1145/2623330.2623677) | 77,427,422 |
147
+ | [PAQ](https://github.com/facebookresearch/PAQ) (Question, Answer) pairs | [paper](https://arxiv.org/abs/2102.07033) | 64,371,441 |
148
+ | [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Titles) | [paper](https://aclanthology.org/2020.acl-main.447/) | 52,603,982 |
149
+ | [S2ORC](https://github.com/allenai/s2orc) (Title, Abstract) | [paper](https://aclanthology.org/2020.acl-main.447/) | 41,769,185 |
150
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs | - | 25,316,456 |
151
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs | - | 21,396,559 |
152
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs | - | 21,396,559 |
153
+ | [MS MARCO](https://microsoft.github.io/msmarco/) triplets | [paper](https://doi.org/10.1145/3404835.3462804) | 9,144,553 |
154
+ | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) | [paper](https://arxiv.org/pdf/2104.08727.pdf) | 3,012,496 |
155
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 1,198,260 |
156
+ | [Code Search](https://huggingface.co/datasets/code_search_net) | - | 1,151,414 |
157
+ | [COCO](https://cocodataset.org/#home) Image captions | [paper](https://link.springer.com/chapter/10.1007%2F978-3-319-10602-1_48) | 828,395 |
158
+ | [SPECTER](https://github.com/allenai/specter) citation triplets | [paper](https://doi.org/10.18653/v1/2020.acl-main.207) | 684,100 |
159
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 681,164 |
160
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 659,896 |
161
+ | [SearchQA](https://huggingface.co/datasets/search_qa) | [paper](https://arxiv.org/abs/1704.05179) | 582,261 |
162
+ | [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
163
+ | [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
164
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | | 304,525 |
165
+ | AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
166
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | | 250,519 |
167
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | | 250,460 |
168
+ | [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
169
+ | [Wikihow](https://github.com/pvl/wikihow_pairs_dataset) | [paper](https://arxiv.org/abs/1810.09305) | 128,542 |
170
+ | [Altlex](https://github.com/chridey/altlex/) | [paper](https://aclanthology.org/P16-1135.pdf) | 112,696 |
171
+ | [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | - | 103,663 |
172
+ | [Simple Wikipedia](https://cs.pomona.edu/~dkauchak/simplification/) | [paper](https://www.aclweb.org/anthology/P11-2117/) | 102,225 |
173
+ | [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
174
+ | [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
175
+ | [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
176
+ | **Total** | | **1,170,060,424** |
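The fine-tuning section of this README describes the objective as cosine similarity over every sentence pair in a batch, followed by a cross-entropy loss against the true pairs. A minimal pure-Python sketch of that in-batch contrastive loss is shown below. The function names and toy vectors are hypothetical, and the `scale` of 20.0 is borrowed from the `MultipleNegativesRankingLoss` parameters in the fine-tuned model card of this commit; whether the base model used the same value is an assumption:

```python
import math


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities.

    For anchor i, positives[i] is the true pair and every other
    positive in the batch acts as a negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs in two dimensions.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]   # each positive is close to its anchor
swapped = [aligned[1], aligned[0]]   # positives paired with the wrong anchor
print(in_batch_contrastive_loss(anchors, aligned))
print(in_batch_contrastive_loss(anchors, swapped))
```

The loss is small when each anchor is most similar to its own pair and large when the pairs are mismatched, which is the behavior the training objective rewards.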
models/models/all-MiniLM-L6-v2/config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "/Users/jpb2/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/",
3
+ "architectures": [
4
+ "BertModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 384,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 1536,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 512,
16
+ "model_type": "bert",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 6,
19
+ "pad_token_id": 0,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.33.1",
23
+ "type_vocab_size": 2,
24
+ "use_cache": true,
25
+ "vocab_size": 30522
26
+ }
models/models/all-MiniLM-L6-v2/config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "2.0.0",
4
+ "transformers": "4.6.1",
5
+ "pytorch": "1.8.1"
6
+ }
7
+ }
models/models/all-MiniLM-L6-v2/modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Normalize",
18
+ "type": "sentence_transformers.models.Normalize"
19
+ }
20
+ ]
models/models/all-MiniLM-L6-v2/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:72ea817e757ec2f5aea799d9be2f38ea29fadbeadcc63952feacc79524ccd8c5
3
+ size 90885737
models/models/all-MiniLM-L6-v2/sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 256,
3
+ "do_lower_case": false
4
+ }
models/models/all-MiniLM-L6-v2/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
models/models/all-MiniLM-L6-v2/tokenizer.json ADDED
The diff for this file is too large to render.
 
models/models/all-MiniLM-L6-v2/tokenizer_config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "clean_up_tokenization_spaces": true,
3
+ "cls_token": "[CLS]",
4
+ "do_basic_tokenize": true,
5
+ "do_lower_case": true,
6
+ "mask_token": "[MASK]",
7
+ "max_length": 128,
8
+ "model_max_length": 512,
9
+ "never_split": null,
10
+ "pad_to_multiple_of": null,
11
+ "pad_token": "[PAD]",
12
+ "pad_token_type_id": 0,
13
+ "padding_side": "right",
14
+ "sep_token": "[SEP]",
15
+ "stride": 0,
16
+ "strip_accents": null,
17
+ "tokenize_chinese_chars": true,
18
+ "tokenizer_class": "BertTokenizer",
19
+ "truncation_side": "right",
20
+ "truncation_strategy": "longest_first",
21
+ "unk_token": "[UNK]"
22
+ }
models/models/all-MiniLM-L6-v2/vocab.txt ADDED
The diff for this file is too large to render.
 
models/models/all-mpnet-base-v2/.DS_Store ADDED
Binary file (6.15 kB).
 
models/models/all-mpnet-base-v2/1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false
7
+ }
models/models/all-mpnet-base-v2/README.md ADDED
@@ -0,0 +1,176 @@
1
+ ---
2
+ pipeline_tag: sentence-similarity
3
+ tags:
4
+ - sentence-transformers
5
+ - feature-extraction
6
+ - sentence-similarity
7
+ language: en
8
+ license: apache-2.0
9
+ datasets:
10
+ - s2orc
11
+ - flax-sentence-embeddings/stackexchange_xml
12
+ - ms_marco
13
+ - gooaq
14
+ - yahoo_answers_topics
15
+ - code_search_net
16
+ - search_qa
17
+ - eli5
18
+ - snli
19
+ - multi_nli
20
+ - wikihow
21
+ - natural_questions
22
+ - trivia_qa
23
+ - embedding-data/sentence-compression
24
+ - embedding-data/flickr30k-captions
25
+ - embedding-data/altlex
26
+ - embedding-data/simple-wiki
27
+ - embedding-data/QQP
28
+ - embedding-data/SPECTER
29
+ - embedding-data/PAQ_pairs
30
+ - embedding-data/WikiAnswers
31
+
32
+ ---
33
+
34
+
35
+ # all-mpnet-base-v2
36
+ This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
37
+
38
+ ## Usage (Sentence-Transformers)
39
+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
40
+
41
+ ```
42
+ pip install -U sentence-transformers
43
+ ```
44
+
45
+ Then you can use the model like this:
46
+ ```python
47
+ from sentence_transformers import SentenceTransformer
48
+ sentences = ["This is an example sentence", "Each sentence is converted"]
49
+
50
+ model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
51
+ embeddings = model.encode(sentences)
52
+ print(embeddings)
53
+ ```
54
+
55
+ ## Usage (HuggingFace Transformers)
56
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
57
+
58
+ ```python
59
+ from transformers import AutoTokenizer, AutoModel
60
+ import torch
61
+ import torch.nn.functional as F
62
+
63
+ #Mean Pooling - Take attention mask into account for correct averaging
64
+ def mean_pooling(model_output, attention_mask):
65
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
66
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
67
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
68
+
69
+
70
+ # Sentences we want sentence embeddings for
71
+ sentences = ['This is an example sentence', 'Each sentence is converted']
72
+
73
+ # Load model from HuggingFace Hub
74
+ tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
75
+ model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')
76
+
77
+ # Tokenize sentences
78
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
79
+
80
+ # Compute token embeddings
81
+ with torch.no_grad():
82
+     model_output = model(**encoded_input)
83
+
84
+ # Perform pooling
85
+ sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
86
+
87
+ # Normalize embeddings
88
+ sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
89
+
90
+ print("Sentence embeddings:")
91
+ print(sentence_embeddings)
92
+ ```
93
+
94
+ ## Evaluation Results
95
+
96
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/all-mpnet-base-v2)
97
+
98
+ ------
99
+
100
+ ## Background
101
+
102
+ The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
103
+ contrastive learning objective. We used the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model and fine-tuned it on a
104
+ 1B sentence pairs dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
105
+
106
+ We developed this model during the
107
+ [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
108
+ organized by Hugging Face. We developed this model as part of the project:
109
+ [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: seven TPU v3-8s, as well as guidance from members of Google's Flax, JAX, and Cloud teams on efficient deep learning frameworks.
110
+
111
+ ## Intended uses
112
+
113
+ Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
114
+ the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
115
+
116
+ By default, input text longer than 384 word pieces is truncated.
117
+
118
+
119
+ ## Training procedure
120
+
121
+ ### Pre-training
122
+
123
+ We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to the model card for more detailed information about the pre-training procedure.
124
+
125
+ ### Fine-tuning
126
+
127
+ We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch.
128
+ We then apply the cross-entropy loss by comparing with the true pairs.
129
+
130
+ #### Hyper parameters
131
+
132
+ We trained our model on a TPU v3-8 for 100k steps using a batch size of 1024 (128 per TPU core).
133
+ We use a learning rate warm-up of 500 steps. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
134
+ a 2e-5 learning rate. The full training script is available in this repository: `train_script.py`.
135
+
136
+ #### Training data
137
+
138
+ We use a concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion.
139
+ We sampled each dataset with a weighted probability; the configuration is detailed in the `data_config.json` file.
140
+
141
+
142
+ | Dataset | Paper | Number of training tuples |
143
+ |--------------------------------------------------------|:----------------------------------------:|:--------------------------:|
144
+ | [Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit) | [paper](https://arxiv.org/abs/1904.06472) | 726,484,430 |
145
+ | [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Abstracts) | [paper](https://aclanthology.org/2020.acl-main.447/) | 116,288,806 |
146
+ | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs | [paper](https://doi.org/10.1145/2623330.2623677) | 77,427,422 |
147
+ | [PAQ](https://github.com/facebookresearch/PAQ) (Question, Answer) pairs | [paper](https://arxiv.org/abs/2102.07033) | 64,371,441 |
148
+ | [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Titles) | [paper](https://aclanthology.org/2020.acl-main.447/) | 52,603,982 |
149
+ | [S2ORC](https://github.com/allenai/s2orc) (Title, Abstract) | [paper](https://aclanthology.org/2020.acl-main.447/) | 41,769,185 |
150
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs | - | 25,316,456 |
151
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs | - | 21,396,559 |
152
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs | - | 21,396,559 |
153
+ | [MS MARCO](https://microsoft.github.io/msmarco/) triplets | [paper](https://doi.org/10.1145/3404835.3462804) | 9,144,553 |
154
+ | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) | [paper](https://arxiv.org/pdf/2104.08727.pdf) | 3,012,496 |
155
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 1,198,260 |
156
+ | [Code Search](https://huggingface.co/datasets/code_search_net) | - | 1,151,414 |
157
+ | [COCO](https://cocodataset.org/#home) Image captions | [paper](https://link.springer.com/chapter/10.1007%2F978-3-319-10602-1_48) | 828,395 |
158
+ | [SPECTER](https://github.com/allenai/specter) citation triplets | [paper](https://doi.org/10.18653/v1/2020.acl-main.207) | 684,100 |
159
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 681,164 |
160
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 659,896 |
161
+ | [SearchQA](https://huggingface.co/datasets/search_qa) | [paper](https://arxiv.org/abs/1704.05179) | 582,261 |
162
+ | [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
163
+ | [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
164
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | | 304,525 |
165
+ | AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
166
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | | 250,519 |
167
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | | 250,460 |
168
+ | [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
169
+ | [Wikihow](https://github.com/pvl/wikihow_pairs_dataset) | [paper](https://arxiv.org/abs/1810.09305) | 128,542 |
170
+ | [Altlex](https://github.com/chridey/altlex/) | [paper](https://aclanthology.org/P16-1135.pdf) | 112,696 |
171
+ | [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | - | 103,663 |
172
+ | [Simple Wikipedia](https://cs.pomona.edu/~dkauchak/simplification/) | [paper](https://www.aclweb.org/anthology/P11-2117/) | 102,225 |
173
+ | [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
174
+ | [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
175
+ | [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
176
+ | **Total** | | **1,170,060,424** |
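
Note on the README above: its "Fine-tuning" section describes computing the cosine similarity for every sentence pair in a batch and applying cross-entropy against the true pairs. The following is a minimal pure-Python sketch of that idea, not the project's `train_script.py`. The function names, the toy 2-dimensional vectors, and `scale=20.0` are assumptions for illustration (the scale value is borrowed from the loss parameters listed in the fine-tuned model card elsewhere in this diff).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over rows of the scaled similarity matrix.

    Row i holds the similarities between anchor i and every positive in the
    batch; the "correct class" for row i is column i (its true pair).
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch (hypothetical data): each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]  # deliberately mismatched pairs

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

With correctly paired rows the loss is close to zero, while the mismatched batch is penalized heavily, which is the behavior the objective relies on: every other sentence in the batch acts as a negative.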
models/models/all-mpnet-base-v2/config.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "_name_or_path": "/Users/jpb2/.cache/torch/sentence_transformers/sentence-transformers_all-mpnet-base-v2/",
3
+ "architectures": [
4
+ "MPNetModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 0,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-05,
15
+ "max_position_embeddings": 514,
16
+ "model_type": "mpnet",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 1,
20
+ "relative_attention_num_buckets": 32,
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.33.1",
23
+ "vocab_size": 30527
24
+ }
models/models/all-mpnet-base-v2/config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "2.0.0",
4
+ "transformers": "4.6.1",
5
+ "pytorch": "1.8.1"
6
+ }
7
+ }
models/models/all-mpnet-base-v2/modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Normalize",
18
+ "type": "sentence_transformers.models.Normalize"
19
+ }
20
+ ]
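
Note on `modules.json` above: it chains three stages (Transformer, Pooling, Normalize), and `1_Pooling/config.json` enables only `pooling_mode_mean_tokens`. As a rough illustration of what the last two stages do to the transformer's per-token output, here is a pure-Python sketch. The helper names and the toy 2-dimensional token vectors are assumptions; the real model produces 768-dimensional vectors and runs these steps on tensors.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-padding input
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

# Hypothetical transformer output: two real tokens followed by one padding token.
tokens = [[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)        # padding vector is excluded
sentence_vec = l2_normalize(pooled)     # unit-length sentence embedding
print(pooled, sentence_vec)
```

Because the final stage normalizes to unit length, the dot product of two sentence vectors equals their cosine similarity, which is convenient for vector search.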
models/models/all-mpnet-base-v2/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:70be8a5840b262b79cbbad82c69c96f6476bddc373182a012f1bcb251865322b
3
+ size 438009257
models/models/all-mpnet-base-v2/sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 384,
3
+ "do_lower_case": false
4
+ }
models/models/all-mpnet-base-v2/special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "bos_token": "<s>",
3
+ "cls_token": "<s>",
4
+ "eos_token": "</s>",
5
+ "mask_token": {
6
+ "content": "<mask>",
7
+ "lstrip": true,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "pad_token": "<pad>",
13
+ "sep_token": "</s>",
14
+ "unk_token": "[UNK]"
15
+ }
models/models/all-mpnet-base-v2/tokenizer.json ADDED
The diff for this file is too large to render.
 
models/models/all-mpnet-base-v2/tokenizer_config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "bos_token": "<s>",
3
+ "clean_up_tokenization_spaces": true,
4
+ "cls_token": "<s>",
5
+ "do_lower_case": true,
6
+ "eos_token": "</s>",
7
+ "mask_token": "<mask>",
8
+ "max_length": 128,
9
+ "model_max_length": 512,
10
+ "pad_to_multiple_of": null,
11
+ "pad_token": "<pad>",
12
+ "pad_token_type_id": 0,
13
+ "padding_side": "right",
14
+ "sep_token": "</s>",
15
+ "stride": 0,
16
+ "strip_accents": null,
17
+ "tokenize_chinese_chars": true,
18
+ "tokenizer_class": "MPNetTokenizer",
19
+ "truncation_side": "right",
20
+ "truncation_strategy": "longest_first",
21
+ "unk_token": "[UNK]"
22
+ }
models/models/all-mpnet-base-v2/vocab.txt ADDED
The diff for this file is too large to render.
 
models/models/finetuned-all-mpnet-base-v2-300/.DS_Store ADDED
Binary file (6.15 kB).
 
models/models/finetuned-all-mpnet-base-v2-300/1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false
7
+ }
models/models/finetuned-all-mpnet-base-v2-300/README.md ADDED
@@ -0,0 +1,91 @@
1
+ ---
2
+ pipeline_tag: sentence-similarity
3
+ tags:
4
+ - sentence-transformers
5
+ - feature-extraction
6
+ - sentence-similarity
7
+
8
+ ---
9
+
10
+ # {MODEL_NAME}
11
+
12
+ This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
13
+
14
+ <!--- Describe your model here -->
15
+
16
+ ## Usage (Sentence-Transformers)
17
+
18
+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
19
+
20
+ ```
21
+ pip install -U sentence-transformers
22
+ ```
23
+
24
+ Then you can use the model like this:
25
+
26
+ ```python
27
+ from sentence_transformers import SentenceTransformer
28
+ sentences = ["This is an example sentence", "Each sentence is converted"]
29
+
30
+ model = SentenceTransformer('{MODEL_NAME}')
31
+ embeddings = model.encode(sentences)
32
+ print(embeddings)
33
+ ```
34
+
35
+
36
+
37
+ ## Evaluation Results
38
+
39
+ <!--- Describe how your model was evaluated -->
40
+
41
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
42
+
43
+
44
+ ## Training
45
+ The model was trained with the parameters:
46
+
47
+ **DataLoader**:
48
+
49
+ `torch.utils.data.dataloader.DataLoader` of length 10 with parameters:
50
+ ```
51
+ {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
52
+ ```
53
+
54
+ **Loss**:
55
+
56
+ `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
57
+ ```
58
+ {'scale': 20.0, 'similarity_fct': 'cos_sim'}
59
+ ```
60
+
61
+ Parameters of the fit()-Method:
62
+ ```
63
+ {
64
+ "epochs": 10,
65
+ "evaluation_steps": 50,
66
+ "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
67
+ "max_grad_norm": 1,
68
+ "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
69
+ "optimizer_params": {
70
+ "lr": 2e-05
71
+ },
72
+ "scheduler": "WarmupLinear",
73
+ "steps_per_epoch": null,
74
+ "warmup_steps": 10,
75
+ "weight_decay": 0.01
76
+ }
77
+ ```
78
+
79
+
80
+ ## Full Model Architecture
81
+ ```
82
+ SentenceTransformer(
83
+ (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
84
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
85
+ (2): Normalize()
86
+ )
87
+ ```
88
+
89
+ ## Citing & Authors
90
+
91
+ <!--- Describe where people can find more information -->
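
Note on the fine-tuned README above: it lists a `WarmupLinear` scheduler with `warmup_steps` of 10, a learning rate of 2e-05, a DataLoader of length 10, and 10 epochs. The sketch below shows how such a schedule would evolve over training. It assumes that a null `steps_per_epoch` means one epoch equals the DataLoader length (so 100 optimizer steps in total) and that the schedule ramps linearly from zero and then decays linearly to zero; the function name is hypothetical and this is not the library's own implementation.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=10, total_steps=100):
    """Learning rate at a given optimizer step under linear warm-up and linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warm-up steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warm-up to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumption: 10 batches per epoch (the DataLoader length) times 10 epochs.
total_steps = 10 * 10
schedule = [warmup_linear_lr(s, total_steps=total_steps) for s in range(total_steps + 1)]

for s in (0, 5, 10, 55, 100):
    print(s, schedule[s])
```

Under these assumptions the learning rate peaks at step 10, the end of the first epoch, and the remaining nine epochs run on a steadily decaying rate.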
models/models/finetuned-all-mpnet-base-v2-300/config.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "_name_or_path": "/Users/jpb2/.cache/torch/sentence_transformers/sentence-transformers_all-mpnet-base-v2/",
3
+ "architectures": [
4
+ "MPNetModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 0,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-05,
15
+ "max_position_embeddings": 514,
16
+ "model_type": "mpnet",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 1,
20
+ "relative_attention_num_buckets": 32,
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.33.1",
23
+ "vocab_size": 30527
24
+ }
models/models/finetuned-all-mpnet-base-v2-300/config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "2.0.0",
4
+ "transformers": "4.6.1",
5
+ "pytorch": "1.8.1"
6
+ }
7
+ }