trapoom555 committed
Commit
5ab02d7
1 Parent(s): b08a7ee

initial commit

Files changed (4):
  1. README.md +120 -0
  2. adapter_config.json +29 -0
  3. adapter_model.safetensors +3 -0
  4. training_args.bin +3 -0
README.md CHANGED
---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# MiniCPM-2B-Text-Embedding-cft

## Description

This is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) for text-embedding tasks. The model is trained with contrastive fine-tuning and LoRA on NLI datasets. ⚠️ **The training process ignores hard-negative samples and treats the other in-batch samples and their entailments as in-batch negatives.** If you want the version that uses hard-negative examples during training, please refer to the model [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft).

## Base Model

[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

## Usage

1. Clone the MiniCPM-2B-dpo-bf16 repository (a `huggingface_hub` alternative is sketched right after this step).

```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```
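
If cloning with `git` is inconvenient, the same snapshot can also be pulled with the `huggingface_hub` library. This is a minimal sketch, assuming `huggingface_hub` is installed and using an illustrative `local_dir`; step 2 below still applies to the downloaded copy.

```python
# Alternative to `git clone`, assuming the huggingface_hub package is installed.
from huggingface_hub import snapshot_download

# Download the base model into ./MiniCPM-2B-dpo-bf16 (illustrative local path).
local_path = snapshot_download(
    repo_id="openbmb/MiniCPM-2B-dpo-bf16",
    local_dir="MiniCPM-2B-dpo-bf16",
)
print(local_path)
```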

2. Change a tokenizer setting in `tokenizer_config.json` so that an EOS token is appended to every input (the usage code in step 3 reads the sentence embedding from the last token's hidden state); a quick check is sketched after this step.

```json
"add_eos_token": true
```
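
A minimal check that the EOS token is now appended, assuming the repository was cloned or downloaded to `./MiniCPM-2B-dpo-bf16`:

```python
from transformers import AutoTokenizer

# Assumption: the edited clone lives at ./MiniCPM-2B-dpo-bf16
tokenizer = AutoTokenizer.from_pretrained("./MiniCPM-2B-dpo-bf16")

ids = tokenizer("hello world").input_ids
# With "add_eos_token": true, every encoded sequence should end with the EOS id,
# which is the token whose hidden state becomes the sentence embedding.
assert ids[-1] == tokenizer.eos_token_id, "EOS token is not being appended"
```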

3. Use the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter on top of the base model
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            # Last layer, last token (the appended EOS token)
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>',
                                                      'trapoom555/MiniCPM-2B-Text-Embedding-cft-pos')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
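
Since the embeddings are intended for sentence-similarity use, a natural follow-up is to compare the two example sentences; here is a minimal sketch that reuses `encoded_sentences` from the block above:

```python
import numpy as np

# Cosine similarity between the two example embeddings
a, b = encoded_sentences
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {cosine:.4f}")
```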

## Training Details

⚠️ **The training process ignores hard-negative samples and treats the other in-batch samples and their entailments as in-batch negatives.**

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 1e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epochs              | 1                 |
| GPU                     | RTX 3090          |
| Num GPUs                | 4                 |
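
The training scripts have not been released yet (see below), so the loss can only be sketched from this table: an InfoNCE objective over in-batch negatives with temperature 0.05 typically looks roughly like the snippet below. The exact batch construction (which in-batch sentences count as negatives) may differ in the actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(premise_emb: torch.Tensor,
                      entailment_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives only (no hard negatives).

    premise_emb, entailment_emb: (batch, dim) embeddings of premises and their
    entailment sentences. For row i, entailment_emb[i] is the positive and every
    other entailment j != i serves as an in-batch negative.
    """
    premise_emb = F.normalize(premise_emb, dim=-1)
    entailment_emb = F.normalize(entailment_emb, dim=-1)
    # (batch, batch) cosine-similarity matrix, scaled by the temperature
    logits = premise_emb @ entailment_emb.T / temperature
    # The correct "class" for row i is column i
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```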

## Training Scripts

**_(coming soon...)_**

## Evaluation Results

**_(coming soon...)_**

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This project is the topic-free final project of the Tsinghua University NLP course for Spring 2024.
adapter_config.json ADDED
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "../pretrained/MiniCPM-2B-dpo-bf16/",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": "gaussian",
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "q_proj",
    "v_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
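
For readers who want to set up a comparable adapter, the configuration above maps roughly onto the following `peft` setup; this is a sketch based on the JSON values, not the released training code, and the base-model path is only a placeholder.

```python
# Sketch: mirror adapter_config.json with the peft library (assumed installed).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=8,                                   # "r": 8
    lora_alpha=32,                         # "lora_alpha": 32
    lora_dropout=0.1,                      # "lora_dropout": 0.1
    bias="none",                           # "bias": "none"
    init_lora_weights="gaussian",          # "init_lora_weights": "gaussian"
    target_modules=["q_proj", "v_proj"],   # "target_modules"
    task_type="CAUSAL_LM",                 # "task_type": "CAUSAL_LM"
)

base = AutoModelForCausalLM.from_pretrained(
    "openbmb/MiniCPM-2B-dpo-bf16",         # or your local clone path
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```
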
adapter_model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6aa43ee2910254ba201a8450312824a4d2bb44a9ee616c820e50322036b9526e
size 5919456
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f4122a97d1616648f1c4c98ecf5e213bfcd54b55d4af110f51aea582f4c439d1
size 4984