trapoom555 committed on
Commit aa8ab46
1 Parent(s): 238be30

Modify readme

Files changed (1)
  1. README.md +122 -0
README.md CHANGED
---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# MiniCPM-2B-Text-Embedding-cft

## Description

This is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) for text embedding tasks. The model is fine-tuned using contrastive fine-tuning with LoRA on NLI datasets.
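
Concretely, contrastive fine-tuning needs (anchor, positive) pairs, and NLI data is a common source for them. The sketch below shows one typical pairing scheme; the example sentences and the scheme itself are illustrative assumptions, since the training scripts have not been released yet:

```python
# Illustrative NLI example (made-up sentences). A common contrastive setup
# treats the premise as the anchor, the entailed hypothesis as the positive,
# and the contradicted hypothesis as a hard negative.
nli_example = {
    "premise": "A man is playing a guitar on stage.",
    "entailment": "A person is playing music.",
    "contradiction": "Nobody is playing an instrument.",
}

anchor = nli_example["premise"]               # anchor sentence
positive = nli_example["entailment"]          # pulled toward the anchor
hard_negative = nli_example["contradiction"]  # pushed away from the anchor
```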

## Base Model

[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

## Usage

1. Clone the MiniCPM-2B-dpo-bf16 repository

```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```

2. Enable `add_eos_token` in the cloned repository's `tokenizer_config.json`, so that an EOS token is appended to every input (the embedding below is taken from the hidden state at this final position)

```json
"add_eos_token": true
```
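
A quick optional check that the tokenizer now appends the EOS token (a minimal sketch; `./MiniCPM-2B-dpo-bf16` is assumed to be the path of your clone):

```python
from transformers import AutoTokenizer

# Path to the local clone edited in step 2 (adjust to your setup)
tokenizer = AutoTokenizer.from_pretrained("./MiniCPM-2B-dpo-bf16")

# With "add_eos_token": true, the last input id should be the EOS token id
ids = tokenizer("hello world").input_ids
print(ids[-1] == tokenizer.eos_token_id)  # expected: True
```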

3. Use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            # Last-layer hidden state of the final (EOS) token
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/MiniCPM-2B-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
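
The model card's pipeline tag is sentence-similarity, so a typical next step is to compare the embeddings with cosine similarity. A minimal sketch continuing from the example above (the `cosine_similarity` helper is illustrative, not part of this repository):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Continuing from the snippet above
emb_a, emb_b = encoded_sentences
print(cosine_similarity(emb_a, emb_b))
```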

## Training Details

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 1e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epoch               | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |

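For reference, the sketch below shows an InfoNCE objective with in-batch negatives at the listed temperature. It illustrates the loss family named in the table and is not the project's training script, which has not been released yet:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    # anchor_emb, positive_emb: (batch_size, hidden_dim); row i of positive_emb is
    # the positive for row i of anchor_emb, the other rows act as in-batch negatives
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature   # scaled cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```
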
## Training Scripts

**_(coming soon...)_**

## Checkpoints

We provide checkpoints every 500 training steps; they can be found [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft-checkpoints).

## Evaluation Results

**_(coming soon...)_**

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This project is the open-topic final project of the Spring 2024 Tsinghua University NLP course.