trapoom555 committed
Commit ce7960f (parent: a1069b5)

Update README.md

Files changed (1): README.md (+124, -124)
README.md CHANGED

---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# MiniCPM-2B-Text-Embedding-cft

## Description

This model is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) for text-embedding tasks. It is fine-tuned on NLI datasets using contrastive fine-tuning with LoRA.

## Base Model

[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

## Usage

1. Clone the MiniCPM-2B-dpo-bf16 repository.

```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```

2. In the cloned `tokenizer_config.json`, enable appending of the EOS token, since the sentence embedding is read from the hidden state of the last token.

```json
"add_eos_token": true
```

3. Use the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            # Use the last layer's hidden state of the final (EOS) token as the sentence embedding
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>',
                                                      'trapoom555/MiniCPM-2B-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
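
For sentence-similarity use, the returned embeddings are typically compared with cosine similarity. Below is a minimal sketch that reuses `encoded_sentences` from the example above; the helper `cosine_similarity` is illustrative and not part of this repository.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity between "I don't like apples" and "I like apples"
print(cosine_similarity(encoded_sentences[0], encoded_sentences[1]))
```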

## Training Details

| **Setting**             | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 5e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epoch               | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |
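
The loss above is InfoNCE with temperature 0.05. The exact objective lives in the training repository linked below; the snippet here is only a minimal sketch of InfoNCE with in-batch negatives, and the tensor names `anchor_emb` / `positive_emb` are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # anchor_emb, positive_emb: (batch_size, hidden_dim) embeddings of paired sentences
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # Cosine-similarity matrix scaled by temperature; entry (i, j) compares anchor i with positive j
    logits = anchor @ positive.T / temperature
    # The matching pair sits on the diagonal; every other in-batch pair serves as a negative
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```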

## Training Scripts

The training script for this model is available in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).
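
As a rough illustration of how the LoRA settings in the table above (rank 8, alpha 32, dropout 0.1) map onto a `peft` configuration, consider the sketch below. The `target_modules` list is an assumption made for illustration and is not taken from this repository or the training scripts.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM-2B-dpo-bf16",
                                                  torch_dtype=torch.bfloat16,
                                                  trust_remote_code=True)

# LoRA hyperparameters from the Training Details table; target_modules is illustrative only
lora_config = LoraConfig(r=8,
                         lora_alpha=32,
                         lora_dropout=0.1,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                         task_type="CAUSAL_LM")

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```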

## Checkpoints

We provide checkpoints every 500 training steps; they can be found [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft-checkpoints).

## Evaluation Results

**_(coming soon...)_**

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This project is the topic-free final project of the Tsinghua University NLP course for Spring 2024.