izhx committed
Commit 23bc6c1
1 Parent(s): 3b7a20f

Update README.md

Files changed (1)
  1. README.md +34 -2
README.md CHANGED
@@ -7722,7 +7722,7 @@ model-index:
 
  `udever-bloom-560m` is finetuned from [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) via [BitFit](https://aclanthology.org/2022.acl-short.1/) on MS MARCO Passage Ranking, SNLI and MultiNLI data.
  It is a universal embedding model across tasks, natural and programming languages.
- (From a technical view, `udever` is merely with some minor improvements to `sgpt-bloom`)
+ (From a technical view, `udever` is essentially `sgpt-bloom` with some minor improvements.)
 
  <div align=center><img width="338" height="259" src="https://user-images.githubusercontent.com/26690193/277643721-cdb7f227-cae5-40e1-b6e1-a201bde00339.png" /></div>
 
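For context on the hunk above: [BitFit](https://aclanthology.org/2022.acl-short.1/) finetunes only the model's bias terms and freezes every weight matrix, which is why `udever-bloom-560m` stays so close to its `bloom-560m` base. A minimal sketch of the idea in PyTorch (illustrative only; the real training code lives in the uni-rep repository):

```python
from transformers import BloomModel

model = BloomModel.from_pretrained('bigscience/bloom-560m')

# BitFit sketch: freeze all weights, leave only the bias vectors trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith('bias')

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f'trainable: {trainable:,} / {total:,} ({100 * trainable / total:.4f}%)')
```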
 
@@ -7742,6 +7742,7 @@ It is a universal embedding model across tasks, natural and programming language
 
  - **Repository:** [github.com/izhx/uni-rep](https://github.com/izhx/uni-rep)
  - **Paper:** [Language Models are Universal Embedders](https://arxiv.org/pdf/2310.08232.pdf)
+ - **Training Date:** 2023-06
 
 
 
@@ -7750,6 +7751,37 @@ It is a universal embedding model across tasks, natural and programming language
  Use the code below to get started with the model.
 
  ```python
+ import torch
+ from transformers import AutoTokenizer, BloomModel
+
+ tokenizer = AutoTokenizer.from_pretrained('izhx/udever-bloom-560m')
+ model = BloomModel.from_pretrained('izhx/udever-bloom-560m')
+
+ boq, eoq, bod, eod = '[BOQ]', '[EOQ]', '[BOD]', '[EOD]'
+ eoq_id, eod_id = tokenizer.convert_tokens_to_ids([eoq, eod])
+
+ if tokenizer.padding_side != 'left':
+     print('!!!', tokenizer.padding_side)
+     tokenizer.padding_side = 'left'
+
+
+ def encode(texts: list, is_query: bool = True, max_length=300):
+     bos = boq if is_query else bod
+     eos_id = eoq_id if is_query else eod_id
+     texts = [bos + t for t in texts]
+     encoding = tokenizer(
+         texts, truncation=True, max_length=max_length - 1, padding=True
+     )
+     for ids, mask in zip(encoding['input_ids'], encoding['attention_mask']):
+         ids.append(eos_id)
+         mask.append(1)
+     inputs = tokenizer.pad(encoding, return_tensors='pt')
+     with torch.inference_mode():
+         outputs = model(**inputs)
+         embeds = outputs.last_hidden_state[:, -1]
+     return embeds
+
+
+ encode(['I am Bert', 'You are Elmo'])
 
  ```
 
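A usage note on the snippet added above: embeddings are read from the last token ([EOQ] for queries, [EOD] for documents), which is why left padding is enforced. Retrieval then reduces to cosine similarity between query and document embeddings. A small follow-up sketch (the strings and the similarity step are illustrative, not part of the commit):

```python
import torch.nn.functional as F

# Hypothetical example built on the encode() helper defined above.
query_embs = encode(['what is bloom-560m'], is_query=True)
doc_embs = encode(
    ['BLOOM-560m is a 560-million-parameter multilingual language model.',
     'SNLI is a natural language inference dataset.'],
    is_query=False,
)
# Cosine similarity of the query against each document; higher = more relevant.
scores = F.cosine_similarity(query_embs, doc_embs)
print(scores)
```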
 
@@ -7767,7 +7799,7 @@ Use the code below to get started with the model.
 
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
- #### Preprocessing [optional]
+ #### Preprocessing
 
  MS MARCO hard negatives are those provided by https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86.
  Negatives for SNLI and MultiNLI are randomly sampled.
 
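On the preprocessing hunk above: each MS MARCO query is paired with mined hard negatives (via the linked `train_bi-encoder_mnrl.py` script), while SNLI/MultiNLI negatives are sampled at random, and training then contrasts the positive passage against these negatives. A minimal sketch of such a contrastive objective (an assumed InfoNCE-style loss for illustration; the actual objective is defined in the uni-rep repository):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, neg, temperature=0.05):
    """The i-th query should score its own positive highest among
    all candidate passages (positives + hard negatives)."""
    q = F.normalize(q, dim=-1)
    cand = F.normalize(torch.cat([pos, neg]), dim=-1)  # (2*batch, hidden)
    logits = q @ cand.T / temperature                  # (batch, 2*batch)
    labels = torch.arange(q.size(0))                   # positive of query i is row i
    return F.cross_entropy(logits, labels)

# Hypothetical shapes: 8 queries with one positive and one hard negative each.
q, pos, neg = (torch.randn(8, 1024) for _ in range(3))
print(contrastive_loss(q, pos, neg))
```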