trapoom555 committed
Commit
5ab02d7
1 Parent(s): b08a7ee

initial commit

Files changed (4):
  1. README.md +120 -0
  2. adapter_config.json +29 -0
  3. adapter_model.safetensors +3 -0
  4. training_args.bin +3 -0
README.md CHANGED
---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# MiniCPM-2B-Text-Embedding-cft

## Description

This is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) for text-embedding tasks. The model is trained with contrastive fine-tuning and LoRA on NLI datasets. ⚠️ **The training process ignores hard-negative samples and treats the other in-batch samples and their entailments as in-batch negatives.** If you want the version that uses hard-negative examples during training, please refer to the model [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft).

## Base Model

[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

## Usage

1. Clone the MiniCPM-2B-dpo-bf16 repository (a `huggingface_hub` alternative is sketched right after this step).

```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```
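
If cloning with `git` is inconvenient, the same snapshot can also be pulled with the `huggingface_hub` library. This is a minimal sketch, assuming `huggingface_hub` is installed and using an illustrative `local_dir`; step 2 below still applies to the downloaded copy.

```python
# Alternative to `git clone`, assuming the huggingface_hub package is installed.
from huggingface_hub import snapshot_download

# Download the base model into ./MiniCPM-2B-dpo-bf16 (illustrative local path).
local_path = snapshot_download(
    repo_id="openbmb/MiniCPM-2B-dpo-bf16",
    local_dir="MiniCPM-2B-dpo-bf16",
)
print(local_path)
```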

2. Change a tokenizer setting in `tokenizer_config.json` so that an EOS token is appended to every input (the usage code in step 3 reads the sentence embedding from the last token's hidden state); a quick check is sketched after this step.

```json
"add_eos_token": true
```
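
A minimal check that the EOS token is now appended, assuming the repository was cloned or downloaded to `./MiniCPM-2B-dpo-bf16`:

```python
from transformers import AutoTokenizer

# Assumption: the edited clone lives at ./MiniCPM-2B-dpo-bf16
tokenizer = AutoTokenizer.from_pretrained("./MiniCPM-2B-dpo-bf16")

ids = tokenizer("hello world").input_ids
# With "add_eos_token": true, every encoded sequence should end with the EOS id,
# which is the token whose hidden state becomes the sentence embedding.
assert ids[-1] == tokenizer.eos_token_id, "EOS token is not being appended"
```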

3. Use the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter on top of the base model
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            # Last layer, last token (the appended EOS token)
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>',
                                                      'trapoom555/MiniCPM-2B-Text-Embedding-cft-pos')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
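
Since the embeddings are intended for sentence-similarity use, a natural follow-up is to compare the two example sentences; here is a minimal sketch that reuses `encoded_sentences` from the block above:

```python
import numpy as np

# Cosine similarity between the two example embeddings
a, b = encoded_sentences
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {cosine:.4f}")
```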

## Training Details

⚠️ **The training process ignores hard-negative samples and treats the other in-batch samples and their entailments as in-batch negatives.**

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 1e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epochs              | 1                 |
| GPU                     | RTX 3090          |
| Num GPUs                | 4                 |
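
The training scripts have not been released yet (see below), so the loss can only be sketched from this table: an InfoNCE objective over in-batch negatives with temperature 0.05 typically looks roughly like the snippet below. The exact batch construction (which in-batch sentences count as negatives) may differ in the actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(premise_emb: torch.Tensor,
                      entailment_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives only (no hard negatives).

    premise_emb, entailment_emb: (batch, dim) embeddings of premises and their
    entailment sentences. For row i, entailment_emb[i] is the positive and every
    other entailment j != i serves as an in-batch negative.
    """
    premise_emb = F.normalize(premise_emb, dim=-1)
    entailment_emb = F.normalize(entailment_emb, dim=-1)
    # (batch, batch) cosine-similarity matrix, scaled by the temperature
    logits = premise_emb @ entailment_emb.T / temperature
    # The correct "class" for row i is column i
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```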

## Training Scripts

**_(coming soon...)_**

## Evaluation Results

**_(coming soon...)_**

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This project is the topic-free final project of the Tsinghua University NLP course for Spring 2024.
adapter_config.json ADDED
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "../pretrained/MiniCPM-2B-dpo-bf16/",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": "gaussian",
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "q_proj",
    "v_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
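
For readers who want to set up a comparable adapter, the configuration above maps roughly onto the following `peft` setup; this is a sketch based on the JSON values, not the released training code, and the base-model path is only a placeholder.

```python
# Sketch: mirror adapter_config.json with the peft library (assumed installed).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=8,                                   # "r": 8
    lora_alpha=32,                         # "lora_alpha": 32
    lora_dropout=0.1,                      # "lora_dropout": 0.1
    bias="none",                           # "bias": "none"
    init_lora_weights="gaussian",          # "init_lora_weights": "gaussian"
    target_modules=["q_proj", "v_proj"],   # "target_modules"
    task_type="CAUSAL_LM",                 # "task_type": "CAUSAL_LM"
)

base = AutoModelForCausalLM.from_pretrained(
    "openbmb/MiniCPM-2B-dpo-bf16",         # or your local clone path
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```
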
adapter_model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6aa43ee2910254ba201a8450312824a4d2bb44a9ee616c820e50322036b9526e
size 5919456
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f4122a97d1616648f1c4c98ecf5e213bfcd54b55d4af110f51aea582f4c439d1
size 4984