---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# Phi-2-Text-Embedding-cft

## Description

This is a fine-tuned version of [Phi-2](https://huggingface.co/microsoft/phi-2) for text embedding tasks. The model is fine-tuned on NLI datasets using contrastive fine-tuning (cft) with LoRA.

## Base Model

[Phi-2](https://huggingface.co/microsoft/phi-2)

## Usage

1. Clone the Phi-2 repository.

   ```bash
   git clone https://huggingface.co/microsoft/phi-2
   ```

2. Change the tokenizer setting in `tokenizer_config.json` so that an EOS token is appended to every input.

   ```json
   "add_eos_token": true
   ```

3. Use the model (a sketch of turning the resulting embeddings into similarity scores is given at the end of this card).

   ```python
   from transformers import AutoModelForCausalLM, AutoTokenizer
   import torch
   import numpy as np

   class PhiSentenceEmbedding:
       def __init__(self, model_path='microsoft/phi-2', adapter_path=None):
           self.tokenizer = AutoTokenizer.from_pretrained(model_path)
           self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                             torch_dtype=torch.bfloat16,
                                                             device_map='cuda',
                                                             trust_remote_code=True)
           if adapter_path is not None:
               # Load the fine-tuned LoRA adapter
               self.model.load_adapter(adapter_path)

       def get_last_hidden_state(self, text):
           inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
           with torch.no_grad():
               # Use the last hidden state of the final (EOS) token as the sentence embedding
               out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
           return out.squeeze().float().cpu().numpy()

       def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
           """
           Returns a list of embeddings for the given sentences.

           Args:
               sentences: List of sentences to encode

           Returns:
               List of embeddings for the given sentences
           """
           out = []
           for s in sentences:
               out.append(self.get_last_hidden_state(s))
           return out

   phi_sentence_embedding = PhiSentenceEmbedding(adapter_path='trapoom555/Phi-2-Text-Embedding-cft')

   example_sentences = ["I don't like apples", "I like apples"]

   encoded_sentences = phi_sentence_embedding.encode(example_sentences)

   print(encoded_sentences)
   ```

## Training Details

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 5e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epoch               | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |

## Training Scripts

The training script for this model is available in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).

## Checkpoints

We provide checkpoints every 500 training steps, which can be found [here](https://huggingface.co/trapoom555/Phi-2-Text-Embedding-cft-checkpoints).

## Evaluation Results

| **Benchmarks** | **Before cft** | **After cft** |
|----------------|----------------|---------------|
| STS12          | 23.04          | 61.62         |
| STS13          | 20.79          | 71.87         |
| STS14          | 17.06          | 60.46         |
| STS15          | 24.56          | 71.18         |
| STS16          | 48.68          | 74.77         |
| STS17          | 41.43          | 80.20         |
| STSBenchmark   | 37.87          | 79.46         |
| BIOSSES        | 28.04          | 64.06         |
| SICK-R         | 48.40          | 74.37         |
| **Overall**    | **32.21**      | **70.89**     |

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Foot Notes

This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!
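
## Appendix: Scoring Similarity (sketch)

The `encode` method in the Usage section returns raw sentence embeddings. The snippet below is a minimal, illustrative sketch of turning those embeddings into a similarity score with cosine similarity; the `cosine_similarity` helper is not part of the released code, and the snippet assumes the `encoded_sentences` list produced by the Usage example above.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the two vectors divided by the
    # product of their L2 norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# `encoded_sentences` comes from the Usage example above.
score = cosine_similarity(encoded_sentences[0], encoded_sentences[1])
print(f"Cosine similarity between the two example sentences: {score:.4f}")
```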