---
datasets:
- sentence-transformers/embedding-training-data
- flax-sentence-embeddings/stackexchange_xml
- snli
- eli5
- search_qa
- multi_nli
- wikihow
- natural_questions
- trivia_qa
- ms_marco
- gooaq
- yahoo_answers_topics
language:
- en
inference: false
pipeline_tag: sentence-similarity
task_categories:
  - sentence-similarity
  - feature-extraction
  - text-retrieval
tags:
  - information retrieval
  - ir
  - documents retrieval
  - passage retrieval
  - beir
  - benchmark
  - sts
  - semantic search
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
---

# bert-base-1024-biencoder-6M-pairs

A long-context bi-encoder based on [MosaicML's BERT pretrained with a sequence length of 1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-1024). This model maps sentences and paragraphs to a 768-dimensional dense vector space
and can be used for tasks like clustering or semantic search.

## Usage

### Download the model and related scripts
```shell
git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-6M-pairs
```

### Inference
```python
import torch
from torch import nn
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline, AutoModel
from mosaic_bert import BertModel

# If using PyTorch 2.0, first run: pip install triton==2.0.0.dev20221202 --no-deps

class AutoModelForSentenceEmbedding(nn.Module):
    def __init__(self, model, tokenizer, normalize=True):
        super(AutoModelForSentenceEmbedding, self).__init__()

        self.model = model.to("cuda")
        self.normalize = normalize
        self.tokenizer = tokenizer

    def forward(self, **kwargs):
        model_output = self.model(**kwargs)
        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
        if self.normalize:
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load the tokenizer first: the wrapper below takes it as an argument
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True).to("cuda")
model = AutoModelForSentenceEmbedding(model, tokenizer)

sentences = ["This is an example sentence", "Each sentence is converted"]

encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")
# Gradients are not needed for inference
with torch.no_grad():
    embeddings = model(**encoded_input)

print(embeddings)
print(embeddings.shape)
```
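
Because the wrapper above L2-normalizes its output by default (`normalize=True`), the dot product of two embeddings equals their cosine similarity, which is what a semantic search typically ranks by. The following is a minimal sketch of that ranking step, not part of the released scripts: the helper names `l2_normalize` and `rank_by_similarity` are hypothetical, and toy 3-dimensional vectors stand in for real 768-dimensional embeddings so the example runs without a GPU or the model weights.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (mirrors normalize=True in the wrapper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending dot product.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption: stand-ins for embeddings produced by the model).
query = l2_normalize([1.0, 0.0, 1.0])
docs = [l2_normalize(v) for v in ([0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.5])]

ranking = rank_by_similarity(query, docs)
for doc_index, score in ranking:
    print(f"doc {doc_index}: cosine similarity {score:.4f}")
```

With real embeddings from the model, the same scores can be computed for all pairs at once as `embeddings @ embeddings.T`.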

## Other details

### Training

This model was trained on 6.4M randomly sampled pairs of sentences/paragraphs from the same training set that Sentence Transformers models use. Details of the
training set are available [here](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#training-data).

The training (including hyperparameters), inference, and data loading scripts are available in [this GitHub repository](https://github.com/shreyansh26/Long-Context-Biencoder).

### Evaluations

We evaluated the model on a few retrieval benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval); the results are available [here](https://github.com/shreyansh26/Long-Context-Biencoder/tree/master/models/results/6M_results).