Fine-tuning the Model with a Custom Dataset
#88, opened by eneSadi
I am trying to fine-tune this model with SentenceTransformerTrainer to update the weights of the retrieval.passage adapter. Is there a tutorial notebook or a similar resource for this? I am getting different errors from different parts of the pipeline, such as:
RuntimeError: FlashAttention is not installed. To proceed with training, please install FlashAttention. For inference, you have two options: either install FlashAttention or disable it by setting use_flash_attn=False when loading the model.
(This happens even though I disabled the use_flash_attn option; I am not sure whether I am doing it correctly.)
RuntimeError: Index put requires the source and destination dtypes match, got Half for the destination and Float for the source.
NameError: name 'IterableDataset' is not defined
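For reference, the dtype error above can be reproduced outside the model. The following is a minimal sketch that assumes only PyTorch; the tensor names (`dest`, `index`, `src`) are hypothetical, and it does not show where the mismatch actually occurs inside jina-embeddings-v3. It only illustrates that indexed assignment does not convert dtypes on its own.

```python
import torch

# Hypothetical tensors: a half-precision destination and a float32 source,
# similar to what can happen when fp16 weights meet float32 activations.
dest = torch.zeros(4, dtype=torch.float16)
index = torch.tensor([0, 2])
src = torch.ones(2, dtype=torch.float32)

try:
    # Indexed assignment with a tensor index does not cast the source.
    dest[index] = src
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = str(err)

print(mismatch_error)

# Casting the source to the destination dtype avoids the mismatch.
dest[index] = src.to(dest.dtype)
print(dest)
```

If the cause in this case is the combination of float16 weights and fp16=True, loading the model in float32 and leaving the casting to mixed-precision training may be worth trying, but that is an assumption and not something confirmed in this thread.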
Here is my implementation. I am running it in a Colab session with an A100 GPU:
!pip install --upgrade torch transformers sentence-transformers
!pip install flash-attn --no-build-isolation
!pip install datasets
!pip install einops
!pip install 'numpy<2'
from sentence_transformers import SentenceTransformer
import torch
# I am not sure about the kwargs so I put them in both model and config kwargs
model = SentenceTransformer("jinaai/jina-embeddings-v3",
                            trust_remote_code=True,
                            model_kwargs={'use_flash_attn': False,
                                          "use_cache": False,
                                          'lora_main_params_trainable': False,
                                          "default_task": "retrieval.passage",
                                          "torch_dtype": torch.float16},
                            config_kwargs={'use_flash_attn': False,
                                           "use_cache": False,
                                           'lora_main_params_trainable': False,
                                           "default_task": "retrieval.passage",
                                           "torch_dtype": torch.float16})
dataset = torch.load('/content/drive/MyDrive/matrag/finetuning_dataset/msmarco_tr_sample.pt')
train_dataset = dataset['train']
eval_dataset = dataset['eval']
from sentence_transformers.losses import CoSENTLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments, BatchSamplers
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction
from sentence_transformers import SentenceTransformerTrainer
loss = CoSENTLoss(model)
args = SentenceTransformerTrainingArguments(
    output_dir='jina-embeddings-v3',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    lr_scheduler_type='cosine',
    warmup_ratio=0.1,
    bf16=False,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    eval_strategy='steps',
    eval_steps=1000,
    save_strategy='steps',
    save_steps=1000,
    save_total_limit=2,
    logging_steps=1000,
    load_best_model_at_end=True,
    metric_for_best_model='cosine_accuracy',
)
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["doc"],
    sentences2=eval_dataset["candidate"],
    scores=eval_dataset["label"],
    main_similarity=SimilarityFunction.COSINE,
    name="example-dev",
)
print(evaluator(model))
trainer = SentenceTransformerTrainer(
    model=model,
    loss=loss,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    evaluator=evaluator
)
trainer.train()
You can use this example dataset if you want to reproduce the issue:
from datasets import Dataset
train_dataset = Dataset.from_dict({
    "doc": ["doc1", "doc2", "doc3"],
    "candidate": ["candidate1", "candidate2", "candidate3"],
    "label": [1, 0, 1],
})
eval_dataset = Dataset.from_dict({
    "doc": ["doc1", "doc2", "doc3"],
    "candidate": ["candidate1", "candidate2", "candidate3"],
    "label": [1, 0, 1],
})
Hi @eneSadi, it looks like flash-attention is not installed. You need flash-attention to train jina-embeddings-v3.
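One way to confirm this before starting a training run is to check whether the package is visible to the Python import system. The snippet below is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and not part of sentence-transformers or the model code.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate `package_name`."""
    return importlib.util.find_spec(package_name) is not None


# Pre-flight check before building the trainer.
if is_installed("flash_attn"):
    print("flash_attn found.")
else:
    print("flash_attn not found: install it before training.")
```

Note that this only tells you the package can be located. It does not guarantee that the installed build matches the current torch and CUDA versions, so an actual `import flash_attn` in the same session is still a useful second check.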