---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# MiniCPM-2B-Text-Embedding-cft

## Description

This is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) for text-embedding tasks. The model is fine-tuned with contrastive fine-tuning (cft) and LoRA on NLI datasets.

## Base Model

[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

## Usage

1. Clone the MiniCPM-2B-dpo-bf16 repository

```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```

2. Set `add_eos_token` to `true` in the cloned `tokenizer_config.json` (by hand, or with the sketch below this step), so an EOS token is appended to every input:

```json
"add_eos_token": true
```
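
If you prefer to apply the change programmatically, a minimal sketch (assuming the repository was cloned into `./MiniCPM-2B-dpo-bf16`) could look like this:

```python
import json
from pathlib import Path

# Path to the local clone of MiniCPM-2B-dpo-bf16 (adjust to your setup).
config_path = Path("MiniCPM-2B-dpo-bf16") / "tokenizer_config.json"

config = json.loads(config_path.read_text())
config["add_eos_token"] = True  # append an EOS token to every tokenized input
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
```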

3. Use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, 
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load fine-tuned LoRA
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        # Embed a single sentence: take the last-layer hidden state of the
        # final (EOS) token as the sentence embedding.
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.
        
        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """

        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/MiniCPM-2B-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences) 

```
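
Since the model targets sentence similarity, the returned embeddings are typically compared with cosine similarity. A minimal follow-up to the snippet above (the helper below is not part of the model card's code):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Continuing from the snippet above:
print(cosine_similarity(encoded_sentences[0], encoded_sentences[1]))
```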

## Training Details

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 5e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epoch               | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |
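
The exact training code is in the repository linked below. As a rough illustration of the settings above, the LoRA configuration and the InfoNCE objective (temperature 0.05, in-batch negatives) could be sketched as follows; this is an assumption-laden sketch, not the authors' script, and the LoRA target modules are left to PEFT's defaults:

```python
import torch
import torch.nn.functional as F
from peft import LoraConfig

# LoRA settings from the table above (rank 8, alpha 32, dropout 0.1).
lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: each anchor should match its own positive."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature  # (batch, batch) scaled cosine similarities
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, labels)
```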

## Training Scripts

The training script for this model is available in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).

## Checkpoints

We provide checkpoints saved every 500 training steps; they can be found [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft-checkpoints).

## Evaluation Results

| **Benchmarks** | **Before cft** | **After cft** |
|----------------|----------------|---------------|
| STS12          | 7.27           | 76.38         |
| STS13          | 18.38          | 87.61         |
| STS14          | 15.04          | 81.55         |
| STS15          | 32.24          | 87.33         |
| STS16          | 39.79          | 85.25         |
| STS17          | 33.63          | 89.96         |
| STSBenchmark   | 33.91          | 86.51         |
| BIOSSES        | 18.03          | 80.05         |
| SICK-R         | 49.30          | 79.87         |
| **Overall**    | **27.51**      | **83.84**     |
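
The `encode` method in the Usage section is the kind of interface the MTEB harness expects from a model, so scores of this kind can in principle be reproduced along these lines (a sketch assuming the `mteb` package; the task list and output folder are illustrative):

```python
from mteb import MTEB

# Evaluate the wrapper defined in the Usage section on a few STS tasks.
evaluation = MTEB(tasks=["STS12", "STSBenchmark", "BIOSSES", "SICK-R"])
results = evaluation.run(minicpm_sentence_embedding,
                         output_folder="results/MiniCPM-2B-Text-Embedding-cft")
```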

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!