metadata
license: mit
language:
  - en
tags:
  - sentence-embedding
  - sentence-similarity
  - transformers
  - feature-extraction
pipeline_tag: sentence-similarity

MiniCPM-2B-Text-Embedding-cft

Description

This model is a fine-tuned version of MiniCPM-2B-dpo-bf16 for text embedding tasks. It was fine-tuned on NLI datasets using contrastive fine-tuning (cft) with LoRA.

Base Model

MiniCPM-2B-dpo-bf16

Usage

  1. Clone the MiniCPM-2B-dpo-bf16 repository
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
  2. Change a tokenizer setting in tokenizer_config.json
"add_eos_token": true
  3. Use the model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, 
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load fine-tuned LoRA
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
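            # Sentence embedding = final-layer hidden state of the last token
            # (the EOS token once "add_eos_token": true is set in step 2).
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]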
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.
        
        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """

        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

# Replace the placeholder with the path to the base model cloned in step 1.
minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/MiniCPM-2B-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences) 
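
Step 2 above can also be applied with a short script instead of editing the file by hand. The sketch below assumes the base model was cloned to a local directory in step 1; the function name `enable_eos_token` is illustrative and not part of any library. The setting matters because the usage code takes the hidden state at the last token position, which is the appended EOS token once this flag is enabled.

```python
import json
from pathlib import Path


def enable_eos_token(model_dir: str) -> dict:
    """Set "add_eos_token": true in <model_dir>/tokenizer_config.json.

    `model_dir` is assumed to be the local clone of the base model (step 1).
    Returns the updated configuration so it can be inspected.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["add_eos_token"] = True  # the setting required by step 2
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config
```

Calling `enable_eos_token('<your-cloned-base-model-path>')` once after cloning is enough; all other keys in the file are left unchanged.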

Training Details

| Training Details | Value |
| --- | --- |
| Loss | InfoNCE |
| Batch Size | 60 |
| InfoNCE Temperature | 0.05 |
| Learning Rate | 5e-05 |
| Warmup Steps | 100 |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank | 8 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| Training Precision | bf16 |
| Max Epoch | 1 |
| GPU | RTX3090 |
| Num GPUs | 4 |
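
To illustrate the loss listed above, here is a minimal dependency-free sketch of InfoNCE with the temperature of 0.05. The training details name only the loss and its temperature; the use of cosine similarity and in-batch negatives is an assumption made for this example, and the actual training script may differ (for instance by adding hard negatives from the NLI data). The function names and toy vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed setup).

    anchors[i] and positives[i] form a positive pair; every positives[j]
    with j != i acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the matching positive as the target class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D "embeddings": the loss is small when pairs line up, large when swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce(anchors, [[0.1, 0.9], [0.9, 0.1]])
print(aligned, swapped)
```

A low temperature such as 0.05 sharpens the softmax, so even small differences in similarity between the positive and the negatives produce a strong training signal.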

Training Scripts

The training script for this model is available in this GitHub repository.

Checkpoints

Checkpoints saved every 500 training steps can be found here.

Evaluation Results

| Benchmarks | Before cft | After cft |
| --- | --- | --- |
| STS12 | 7.27 | 76.38 |
| STS13 | 18.38 | 87.61 |
| STS14 | 15.04 | 81.55 |
| STS15 | 32.24 | 87.33 |
| STS16 | 39.79 | 85.25 |
| STS17 | 33.63 | 89.96 |
| STSBenchmark | 33.91 | 86.51 |
| BIOSSES | 18.03 | 80.05 |
| SICK-R | 49.30 | 79.87 |
| Overall | 27.51 | 83.84 |
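
The table does not state the metric. Assuming the scores follow the common STS protocol (Spearman correlation between the cosine similarity of each sentence pair and its human rating, scaled to 0-100, as in MTEB), a score for one benchmark could be computed roughly as sketched below. The helper names and the toy data are hypothetical and are not taken from the evaluation code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sts_score(embedding_pairs, gold_scores):
    """Spearman correlation (Pearson on ranks) of cosine similarities, times 100."""
    predicted = [cosine_similarity(a, b) for a, b in embedding_pairs]
    return 100 * pearson(ranks(predicted), ranks(gold_scores))


# Made-up example: three sentence pairs as 2-D embeddings with human ratings.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
toy_gold = [5.0, 3.0, 0.5]
print(sts_score(toy_pairs, toy_gold))
```

In practice the embedding pairs would come from `encode` in the usage section, and the gold scores from the benchmark's annotations.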

Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

Footnotes

This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to the course!