trapoom555 committed on
Commit aa8ab46
1 Parent(s): 238be30

Modify readme

Files changed (1)
  1. README.md +122 -0
README.md CHANGED
---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# MiniCPM-2B-Text-Embedding-cft

## Description

This is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) for text embedding tasks. The model is fine-tuned using contrastive fine-tuning with LoRA on NLI datasets.
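
Concretely, contrastive fine-tuning needs (anchor, positive) pairs, and NLI data is a common source for them. The sketch below shows one typical pairing scheme; the example sentences and the scheme itself are illustrative assumptions, since the training scripts have not been released yet:

```python
# Illustrative NLI example (made-up sentences). A common contrastive setup
# treats the premise as the anchor, the entailed hypothesis as the positive,
# and the contradicted hypothesis as a hard negative.
nli_example = {
    "premise": "A man is playing a guitar on stage.",
    "entailment": "A person is playing music.",
    "contradiction": "Nobody is playing an instrument.",
}

anchor = nli_example["premise"]               # anchor sentence
positive = nli_example["entailment"]          # pulled toward the anchor
hard_negative = nli_example["contradiction"]  # pushed away from the anchor
```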

## Base Model

[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

## Usage

1. Clone the MiniCPM-2B-dpo-bf16 repository

```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```

2. Enable `add_eos_token` in the cloned repository's `tokenizer_config.json`, so that an EOS token is appended to every input (the embedding below is taken from the hidden state at this final position)

```json
"add_eos_token": true
```
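
A quick optional check that the tokenizer now appends the EOS token (a minimal sketch; `./MiniCPM-2B-dpo-bf16` is assumed to be the path of your clone):

```python
from transformers import AutoTokenizer

# Path to the local clone edited in step 2 (adjust to your setup)
tokenizer = AutoTokenizer.from_pretrained("./MiniCPM-2B-dpo-bf16")

# With "add_eos_token": true, the last input id should be the EOS token id
ids = tokenizer("hello world").input_ids
print(ids[-1] == tokenizer.eos_token_id)  # expected: True
```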

3. Use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            # Last-layer hidden state of the final (EOS) token
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/MiniCPM-2B-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
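
The model card's pipeline tag is sentence-similarity, so a typical next step is to compare the embeddings with cosine similarity. A minimal sketch continuing from the example above (the `cosine_similarity` helper is illustrative, not part of this repository):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Continuing from the snippet above
emb_a, emb_b = encoded_sentences
print(cosine_similarity(emb_a, emb_b))
```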

## Training Details

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 1e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epoch               | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |

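For reference, the sketch below shows an InfoNCE objective with in-batch negatives at the listed temperature. It illustrates the loss family named in the table and is not the project's training script, which has not been released yet:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    # anchor_emb, positive_emb: (batch_size, hidden_dim); row i of positive_emb is
    # the positive for row i of anchor_emb, the other rows act as in-batch negatives
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature   # scaled cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```
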
## Training Scripts

**_(coming soon...)_**

## Checkpoints

We provide checkpoints every 500 training steps; they can be found [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft-checkpoints).

## Evaluation Results

**_(coming soon...)_**

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This project is the open-topic final project of the Spring 2024 Tsinghua University NLP course.