izhx committed
Commit 23bc6c1
1 Parent(s): 3b7a20f

Update README.md

Files changed (1)
  1. README.md +34 -2
README.md CHANGED
@@ -7722,7 +7722,7 @@ model-index:
 
  `udever-bloom-560m` is finetuned from [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) via [BitFit](https://aclanthology.org/2022.acl-short.1/) on MS MARCO Passage Ranking, SNLI and MultiNLI data.
  It is a universal embedding model across tasks, natural and programming languages.
- (From a technical view, `udever` is merely with some minor improvements to `sgpt-bloom`)
+ (From a technical view, `udever` is essentially `sgpt-bloom` with some minor improvements.)
 
  <div align=center><img width="338" height="259" src="https://user-images.githubusercontent.com/26690193/277643721-cdb7f227-cae5-40e1-b6e1-a201bde00339.png" /></div>
 
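For context on the hunk above: [BitFit](https://aclanthology.org/2022.acl-short.1/) finetunes only the model's bias terms and freezes every weight matrix, which is why `udever-bloom-560m` stays so close to its `bloom-560m` base. A minimal sketch of the idea in PyTorch (illustrative only; the real training code lives in the uni-rep repository):

```python
from transformers import BloomModel

model = BloomModel.from_pretrained('bigscience/bloom-560m')

# BitFit sketch: freeze all weights, leave only the bias vectors trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith('bias')

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f'trainable: {trainable:,} / {total:,} ({100 * trainable / total:.4f}%)')
```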
 
@@ -7742,6 +7742,7 @@ It is a universal embedding model across tasks, natural and programming language
 
  - **Repository:** [github.com/izhx/uni-rep](https://github.com/izhx/uni-rep)
  - **Paper:** [Language Models are Universal Embedders](https://arxiv.org/pdf/2310.08232.pdf)
+ - **Training Date:** 2023-06
 
 
 
@@ -7750,6 +7751,37 @@ It is a universal embedding model across tasks, natural and programming language
  Use the code below to get started with the model.
 
  ```python
+ import torch
+ from transformers import AutoTokenizer, BloomModel
+
+ tokenizer = AutoTokenizer.from_pretrained('izhx/udever-bloom-560m')
+ model = BloomModel.from_pretrained('izhx/udever-bloom-560m')
+
+ boq, eoq, bod, eod = '[BOQ]', '[EOQ]', '[BOD]', '[EOD]'
+ eoq_id, eod_id = tokenizer.convert_tokens_to_ids([eoq, eod])
+
+ if tokenizer.padding_side != 'left':
+     print('!!!', tokenizer.padding_side)
+     tokenizer.padding_side = 'left'
+
+
+ def encode(texts: list, is_query: bool = True, max_length=300):
+     bos = boq if is_query else bod
+     eos_id = eoq_id if is_query else eod_id
+     texts = [bos + t for t in texts]
+     encoding = tokenizer(
+         texts, truncation=True, max_length=max_length - 1, padding=True
+     )
+     for ids, mask in zip(encoding['input_ids'], encoding['attention_mask']):
+         ids.append(eos_id)
+         mask.append(1)
+     inputs = tokenizer.pad(encoding, return_tensors='pt')
+     with torch.inference_mode():
+         outputs = model(**inputs)
+         embeds = outputs.last_hidden_state[:, -1]
+     return embeds
+
+
+ encode(['I am Bert', 'You are Elmo'])
 
  ```
 
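A usage note on the snippet added above: embeddings are read from the last token ([EOQ] for queries, [EOD] for documents), which is why left padding is enforced. Retrieval then reduces to cosine similarity between query and document embeddings. A small follow-up sketch (the strings and the similarity step are illustrative, not part of the commit):

```python
import torch.nn.functional as F

# Hypothetical example built on the encode() helper defined above.
query_embs = encode(['what is bloom-560m'], is_query=True)
doc_embs = encode(
    ['BLOOM-560m is a 560-million-parameter multilingual language model.',
     'SNLI is a natural language inference dataset.'],
    is_query=False,
)
# Cosine similarity of the query against each document; higher = more relevant.
scores = F.cosine_similarity(query_embs, doc_embs)
print(scores)
```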
 
@@ -7767,7 +7799,7 @@ Use the code below to get started with the model.
 
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
- #### Preprocessing [optional]
+ #### Preprocessing
 
  MS MARCO hard negatives are those provided by https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86.
  Negatives for SNLI and MultiNLI are randomly sampled.
 
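On the preprocessing hunk above: each MS MARCO query is paired with mined hard negatives (via the linked `train_bi-encoder_mnrl.py` script), while SNLI/MultiNLI negatives are sampled at random, and training then contrasts the positive passage against these negatives. A minimal sketch of such a contrastive objective (an assumed InfoNCE-style loss for illustration; the actual objective is defined in the uni-rep repository):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, neg, temperature=0.05):
    """The i-th query should score its own positive highest among
    all candidate passages (positives + hard negatives)."""
    q = F.normalize(q, dim=-1)
    cand = F.normalize(torch.cat([pos, neg]), dim=-1)  # (2*batch, hidden)
    logits = q @ cand.T / temperature                  # (batch, 2*batch)
    labels = torch.arange(q.size(0))                   # positive of query i is row i
    return F.cross_entropy(logits, labels)

# Hypothetical shapes: 8 queries with one positive and one hard negative each.
q, pos, neg = (torch.randn(8, 1024) for _ in range(3))
print(contrastive_loss(q, pos, neg))
```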