---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# MiniCPM-2B-Text-Embedding-cft

## Description

This is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) for text-embedding tasks. The model is fine-tuned with contrastive fine-tuning (cft) and LoRA on NLI datasets.

## Base Model

[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

## Usage

1. Clone the MiniCPM-2B-dpo-bf16 repository

```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```

2. Set `add_eos_token` to `true` in the cloned `tokenizer_config.json` (by hand, or with the sketch below this step), so an EOS token is appended to every input:

```json
"add_eos_token": true
```
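
If you prefer to apply the change programmatically, a minimal sketch (assuming the repository was cloned into `./MiniCPM-2B-dpo-bf16`) could look like this:

```python
import json
from pathlib import Path

# Path to the local clone of MiniCPM-2B-dpo-bf16 (adjust to your setup).
config_path = Path("MiniCPM-2B-dpo-bf16") / "tokenizer_config.json"

config = json.loads(config_path.read_text())
config["add_eos_token"] = True  # append an EOS token to every tokenized input
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
```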

3. Use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, 
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load fine-tuned LoRA
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        # Embed a single sentence: take the last-layer hidden state of the
        # final (EOS) token as the sentence embedding.
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.
        
        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """

        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/MiniCPM-2B-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences) 

```
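
Since the model targets sentence similarity, the returned embeddings are typically compared with cosine similarity. A minimal follow-up to the snippet above (the helper below is not part of the model card's code):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Continuing from the snippet above:
print(cosine_similarity(encoded_sentences[0], encoded_sentences[1]))
```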

## Training Details

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 5e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epoch               | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |
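
The exact training code is in the repository linked below. As a rough illustration of the settings above, the LoRA configuration and the InfoNCE objective (temperature 0.05, in-batch negatives) could be sketched as follows; this is an assumption-laden sketch, not the authors' script, and the LoRA target modules are left to PEFT's defaults:

```python
import torch
import torch.nn.functional as F
from peft import LoraConfig

# LoRA settings from the table above (rank 8, alpha 32, dropout 0.1).
lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: each anchor should match its own positive."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature  # (batch, batch) scaled cosine similarities
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, labels)
```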

## Training Scripts

The training script for this model is available in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).

## Checkpoints

We provide checkpoints saved every 500 training steps; they can be found [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft-checkpoints).

## Evaluation Results

| **Benchmarks** | **Before cft** | **After cft** |
|----------------|----------------|---------------|
| STS12          | 7.27           | 76.38         |
| STS13          | 18.38          | 87.61         |
| STS14          | 15.04          | 81.55         |
| STS15          | 32.24          | 87.33         |
| STS16          | 39.79          | 85.25         |
| STS17          | 33.63          | 89.96         |
| STSBenchmark   | 33.91          | 86.51         |
| BIOSSES        | 18.03          | 80.05         |
| SICK-R         | 49.30          | 79.87         |
| **Overall**    | **27.51**      | **83.84**     |
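
The `encode` method in the Usage section is the kind of interface the MTEB harness expects from a model, so scores of this kind can in principle be reproduced along these lines (a sketch assuming the `mteb` package; the task list and output folder are illustrative):

```python
from mteb import MTEB

# Evaluate the wrapper defined in the Usage section on a few STS tasks.
evaluation = MTEB(tasks=["STS12", "STSBenchmark", "BIOSSES", "SICK-R"])
results = evaluation.run(minicpm_sentence_embedding,
                         output_folder="results/MiniCPM-2B-Text-Embedding-cft")
```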

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!