shreyansh26 committed on
Commit
af24c8f
1 Parent(s): e52542d

Create README.md

Files changed (1)
  1. README.md +102 -0
README.md ADDED
@@ -0,0 +1,102 @@
---
datasets:
- sentence-transformers/embedding-training-data
- flax-sentence-embeddings/stackexchange_xml
- snli
- eli5
- search_qa
- multi_nli
- wikihow
- natural_questions
- trivia_qa
- ms_marco
- gooaq
- yahoo_answers_topics
language:
- en
inference: false
pipeline_tag: sentence-similarity
task_categories:
- sentence-similarity
- feature-extraction
- text-retrieval
tags:
- information retrieval
- ir
- documents retrieval
- passage retrieval
- beir
- benchmark
- sts
- semantic search
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---

# bert-base-1024-biencoder-64M-pairs

A long-context biencoder based on [MosaicML's BERT pretrained with a sequence length of 1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-1024). This model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
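
To illustrate what semantic search over such vectors involves, here is a minimal sketch that ranks documents by cosine similarity to a query. The three-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and `cosine_similarity` and `rank_documents` are hypothetical helper names, not part of this repository.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for model embeddings (assumption: real ones have 768 dimensions)
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_documents(query, documents)
print(ranking)
```

With real embeddings, the same ranking step would run over the vectors produced by the model rather than hand-written lists.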

## Usage

### Download the model and related scripts
```git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-64M-pairs```

### Inference
```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer
from mosaic_bert import BertModel  # expected to come from the cloned repository

# pip install triton==2.0.0.dev20221202 --no-deps if using PyTorch 2.0

class AutoModelForSentenceEmbedding(nn.Module):
    def __init__(self, model, tokenizer, normalize=True):
        super().__init__()

        self.model = model.to("cuda")
        self.normalize = normalize
        self.tokenizer = tokenizer

    def forward(self, **kwargs):
        model_output = self.model(**kwargs)
        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
        if self.normalize:
            # L2-normalize so that dot products equal cosine similarities
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
        # Broadcast the attention mask over the hidden dimension so padding tokens are ignored
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Create the tokenizer first: the wrapper below takes it as an argument
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True).to("cuda")
model = AutoModelForSentenceEmbedding(model, tokenizer)

sentences = ["This is an example sentence", "Each sentence is converted"]

encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")
embeddings = model(**encoded_input)

print(embeddings)
print(embeddings.shape)
```
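
The pooling step in `mean_pooling` averages token embeddings while ignoring padding positions, and the optional normalization rescales the result to unit length. The following dependency-free sketch reproduces that arithmetic for a single sequence on plain Python lists; the function names and the toy numbers are illustrative assumptions, not part of the repository's scripts.

```python
import math

def masked_mean_pool(token_embeddings, attention_mask, eps=1e-9):
    """Average token vectors where attention_mask is 1; padding (0) is ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        for j in range(dim):
            summed[j] += vec[j] * keep
    count = max(sum(attention_mask), eps)  # mirrors torch.clamp(..., min=1e-9)
    return [s / count for s in summed]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = max(math.sqrt(sum(x * x for x in vec)), eps)
    return [x / norm for x in vec]

# Toy sequence: two real tokens followed by one padding token
tokens = [[1.0, 2.0], [5.0, 2.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)
print(unit)
```

Because the embeddings are normalized, a plain dot product between two of them equals their cosine similarity.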

## Other details

### Training

This model was trained on 64M randomly sampled pairs of sentences/paragraphs from the same training set that the Sentence Transformers models use. Details of the training set are available [here](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#training-data).

The training (including hyperparameters), inference, and data loading scripts can all be found in [this GitHub repository](https://github.com/shreyansh26/Long-Context-Biencoder).

### Evaluations

We ran the model on a few retrieval benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval); the results are [here](https://github.com/shreyansh26/Long-Context-Biencoder/tree/master/models/results/64M_results).