	---
	license: mit
	language:
	- en
	tags:
	- sentence-embedding
	- sentence-similarity
	- transformers
	- feature-extraction
	pipeline_tag: sentence-similarity
	---

	# MiniCPM-2B-Text-Embedding-cft

	## Description

	This model is a version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) fine-tuned for text embedding tasks. It was trained with contrastive fine-tuning (cft) and LoRA on NLI datasets.

	## Base Model

	[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

	## Usage

	1. Clone the MiniCPM-2B-dpo-bf16 repository

	```bash
	git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
	```

	2. In the cloned repository, set `add_eos_token` to `true` in `tokenizer_config.json`

	```json
	"add_eos_token": true
	```

	3. Use the model

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch
	import numpy as np

	class MiniCPMSentenceEmbedding:
	    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
	        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
	        self.model = AutoModelForCausalLM.from_pretrained(model_path,
	                                                          torch_dtype=torch.bfloat16,
	                                                          device_map='cuda',
	                                                          trust_remote_code=True)
	        if adapter_path is not None:
	            # Load the fine-tuned LoRA adapter (requires the `peft` package)
	            self.model.load_adapter(adapter_path)

	    def get_last_hidden_state(self, text):
	        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
	        with torch.no_grad():
	            # Use the final layer's hidden state at the last token as the embedding
	            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
	        return out.squeeze().float().cpu().numpy()

	    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
	        """
	        Returns a list of embeddings for the given sentences.

	        Args:
	            sentences: List of sentences to encode

	        Returns:
	            List of embeddings for the given sentences
	        """
	        # Sentences are encoded one at a time, so no padding is needed
	        return [self.get_last_hidden_state(s) for s in sentences]

	# Replace the placeholder with the path of the base model cloned in step 1
	minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/MiniCPM-2B-Text-Embedding-cft')

	example_sentences = ["I don't like apples", "I like apples"]

	encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

	print(encoded_sentences)
	```
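
The snippet above stops at raw embeddings. To turn two embeddings into a similarity score, one common choice is cosine similarity. The sketch below is an illustration rather than part of the original usage: `cosine_similarity` is a hypothetical helper, and the short toy vectors stand in for the arrays returned by `encode`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists or 1-D arrays)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for two embeddings returned by `encode`
# (real embeddings are much longer).
emb_a = [0.2, 0.1, -0.4]
emb_b = [0.1, 0.3, -0.5]

score = cosine_similarity(emb_a, emb_b)
print(f"cosine similarity: {score:.4f}")
```

With the real model you would pass `encoded_sentences[0]` and `encoded_sentences[1]` instead of the toy vectors. Scores closer to 1 indicate more similar sentences.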

	## Training Details

	| Setting                 | Value             |
	|-------------------------|-------------------|
	| Loss                    | InfoNCE           |
	| Batch Size              | 60                |
	| InfoNCE Temperature     | 0.05              |
	| Learning Rate           | 5e-05             |
	| Warmup Steps            | 100               |
	| Learning Rate Scheduler | CosineAnnealingLR |
	| LoRA Rank               | 8                 |
	| LoRA Alpha              | 32                |
	| LoRA Dropout            | 0.1               |
	| Training Precision      | bf16              |
	| Max Epochs              | 1                 |
	| GPU                     | RTX 3090          |
	| Num GPUs                | 4                 |
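
To make the loss and temperature settings above concrete, here is a minimal pure-Python sketch of an InfoNCE loss with a temperature of 0.05. It assumes cosine similarity as the scoring function and treats the other positives in the batch as negatives. Both are common choices but are assumptions here, as are the names `cosine` and `info_nce`. The actual training script may differ, for example by using explicit hard negatives.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss where positives[i] is the match for anchors[i] and
    all other positives in the batch act as negatives (an assumption)."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the correct positive
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D "embeddings": each anchor is close to its own positive
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

loss = info_nce(anchors, positives)
print(f"InfoNCE loss: {loss:.6f}")
```

A small temperature such as 0.05 sharpens the softmax, so the loss is near zero when each anchor is clearly closest to its own positive and grows quickly when the pairs are mismatched.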

	## Training Scripts

	The training scripts for this model are available in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).

	## Checkpoints

	Checkpoints saved every 500 training steps are available [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft-checkpoints).

	## Evaluation Results

	| Benchmark    | Before cft | After cft |
	|--------------|------------|-----------|
	| STS12        | 7.27       | 76.38     |
	| STS13        | 18.38      | 87.61     |
	| STS14        | 15.04      | 81.55     |
	| STS15        | 32.24      | 87.33     |
	| STS16        | 39.79      | 85.25     |
	| STS17        | 33.63      | 89.96     |
	| STSBenchmark | 33.91      | 86.51     |
	| BIOSSES      | 18.03      | 80.05     |
	| SICK-R       | 49.30      | 79.87     |
	| Overall      | 27.51      | 83.84     |
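
The table does not state which metric is reported. STS benchmarks are conventionally scored with the Spearman rank correlation between the model's cosine similarities and human-annotated similarity scores, often multiplied by 100. Assuming that convention applies here, the sketch below shows how such a score can be computed. The helper names and the toy numbers are made up for illustration and are not taken from the evaluation.

```python
import math

def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up example: model cosine similarities vs. human scores for 4 sentence pairs
predicted = [0.91, 0.35, 0.62, 0.10]
gold = [4.8, 1.5, 3.0, 2.0]

rho = spearman(predicted, gold)
print(f"Spearman x 100: {100 * rho:.2f}")
```

Because only the ranking matters, the scale of the human scores (for example 0 to 5) does not need to match the range of the cosine similarities.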

	## Contributors

	Trapoom Ukarapol, Zhicheng Lee, Amy Xin

	## Footnotes

	This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!