---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# MiniCPM-2B-Text-Embedding-cft

## Description

This is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) for text-embedding tasks. The model is fine-tuned with contrastive fine-tuning (cft) and LoRA on NLI datasets.

## Base Model

[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

## Usage

1. Clone the MiniCPM-2B-dpo-bf16 repository

```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```

2. Change a tokenizer setting in `tokenizer_config.json`

```json
"add_eos_token": true
```

3. Use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np


class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        # Embed a sentence as the final-layer hidden state of its last token
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []
        for s in sentences:
            out.append(self.get_last_hidden_state(s))
        return out


# model_path points to the local clone from step 1 (with the modified tokenizer_config.json)
minicpm_sentence_embedding = MiniCPMSentenceEmbedding('MiniCPM-2B-dpo-bf16',
                                                      'trapoom555/MiniCPM-2B-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```

## Training Details

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 5e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epoch               | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |

## Training Scripts

The training script for this model is available in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).

## Checkpoints

We provide checkpoints every 500 training steps, which can be found [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft-checkpoints).

## Evaluation Results

| **Benchmarks** | **Before cft** | **After cft** |
|----------------|----------------|---------------|
| STS12          | 7.27           | 76.38         |
| STS13          | 18.38          | 87.61         |
| STS14          | 15.04          | 81.55         |
| STS15          | 32.24          | 87.33         |
| STS16          | 39.79          | 85.25         |
| STS17          | 33.63          | 89.96         |
| STSBenchmark   | 33.91          | 86.51         |
| BIOSSES        | 18.03          | 80.05         |
| SICK-R         | 49.30          | 79.87         |
| **Overall**    | **27.51**      | **83.84**     |

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Foot Notes

This work is the final project of the Natural Language Processing course (Spring 2024) at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!
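
## Similarity Example

The embeddings returned by `encode` in the usage snippet above can be compared with cosine similarity for sentence-similarity tasks. The snippet below is a minimal sketch that reuses `encoded_sentences` from the usage example; the `cosine_similarity` helper is not part of this repository and is shown only for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two 1-D embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# `encoded_sentences` comes from the usage example above
emb_a, emb_b = encoded_sentences
print(f"Cosine similarity: {cosine_similarity(emb_a, emb_b):.4f}")
```

A higher score indicates that the two sentences are closer in the learned embedding space.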