Nomic-ai embedding fine-tuning with SentenceTransformersFinetuneEngine

#18
by Miheer29 - opened

Hi

I'm trying to fine-tune the nomic-ai embedding model using SentenceTransformersFinetuneEngine and am running into an issue:

from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,  # dataset to be trained on
    model_id="nomic-ai/nomic-embed-text-v1.5",  # Hugging Face reference to the base embeddings model
    model_output_path="llama_model_v1",  # output directory for the fine-tuned embeddings model
    val_dataset=test_dataset,  # dataset to validate on
    epochs=2,  # number of epochs to train for
)

Error: (screenshot attached as image.png, not reproduced here)

Nomic AI org

I would reach out to the SentenceTransformers package maintainers, as I don't have as deep a knowledge of what's going on there.

zpn changed discussion status to closed

Hello!

I'm afraid that this is not currently conveniently possible, because this SentenceTransformer instance must be initialized here with trust_remote_code=True as the model must pull code from Hugging Face. I would recommend opening an issue in LlamaIndex for it.
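For context, loading this model straight from the Hub with Sentence Transformers needs that flag, roughly like this (a minimal sketch; the finetune engine exposes no way to pass it through):

from sentence_transformers import SentenceTransformer

# The custom NomicBert code on the Hub only runs when trust_remote_code=True is passed.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)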

That said, I think you should be able to solve your problem. You can first download the model to a local directory. Then, you can also download the two remote-code files referenced in the auto_map below (configuration_hf_nomic_bert.py and modeling_hf_nomic_bert.py) and place them in that same local directory.
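As a rough sketch of those download steps (the local path here is made up, and the repo ids are simply read off the auto_map entries shown below):

from huggingface_hub import snapshot_download, hf_hub_download

local_dir = "nomic-embed-local"  # hypothetical local directory

# Download the model weights, config.json, tokenizer files, etc.
snapshot_download(repo_id="nomic-ai/nomic-embed-text-v1.5", local_dir=local_dir)

# Fetch the two code modules referenced by auto_map and place them next to config.json.
for filename in ("configuration_hf_nomic_bert.py", "modeling_hf_nomic_bert.py"):
    hf_hub_download(
        repo_id="nomic-ai/nomic-embed-text-v1",  # repo the AutoConfig/AutoModel entries point to
        filename=filename,
        local_dir=local_dir,
    )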

Then, you must update your local config.json to no longer say:

  "auto_map": {
    "AutoConfig": "nomic-ai/nomic-embed-text-v1--configuration_hf_nomic_bert.NomicBertConfig",
    "AutoModel": "nomic-ai/nomic-embed-text-v1--modeling_hf_nomic_bert.NomicBertModel",
    "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining"
  },

but instead to say:

  "auto_map": {
    "AutoConfig": "configuration_hf_nomic_bert.NomicBertConfig",
    "AutoModel": "modeling_hf_nomic_bert.NomicBertModel",
  },
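If you prefer to script that config.json edit rather than doing it by hand, here is a small sketch (assuming the hypothetical local directory from the download step above):

import json
from pathlib import Path

cfg_path = Path("nomic-embed-local") / "config.json"
cfg = json.loads(cfg_path.read_text())

# Point auto_map at the local modules instead of the Hub repositories.
cfg["auto_map"] = {
    "AutoConfig": "configuration_hf_nomic_bert.NomicBertConfig",
    "AutoModel": "modeling_hf_nomic_bert.NomicBertModel",
}
cfg_path.write_text(json.dumps(cfg, indent=2))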

Now these files are local, and we don't need to download them from Hugging Face. As a result, you should now be able to initialize the SentenceTransformersFinetuneEngine with the path to your local directory. It should then no longer complain about the lack of trust_remote_code=True.
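A quick way to verify this is to load the local directory directly with Sentence Transformers (again assuming the hypothetical local path from above):

from sentence_transformers import SentenceTransformer

# If auto_map now points at the local modules, this should load
# without passing trust_remote_code=True.
model = SentenceTransformer("./nomic-embed-local")
print(model.encode(["hello world"]).shape)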

@Miheer29

  • Tom Aarsen

Thank you, Tom!

Do I need just the model tensors and config.json, or would I need to clone the entire repo?

You should probably just clone the entire repo.

Thank you!

Also, how do I use the model with SentenceTransformersFinetuneEngine? Because there is only a model_id parameter in SentenceTransformersFinetuneEngine, there is no way to pass the actual model object.

Would you recommend cloning the repo, making the changes, and uploading the model to Hugging Face? If so, would I need to make any other changes to the files?

How do I use the model with SentenceTransformersFinetuneEngine?

model_id can also be a path to a local model; you should use that instead.
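So, as a sketch, the original call with the local directory swapped in for the Hub id (the path is the hypothetical folder from the earlier steps):

from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="./nomic-embed-local",  # local path instead of "nomic-ai/nomic-embed-text-v1.5"
    model_output_path="llama_model_v1",
    val_dataset=test_dataset,
    epochs=2,
)
finetune_engine.finetune()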

And no, I wouldn't upload it to Hugging Face for this, because then it still has to pull code from Hugging Face and it'll still need trust_remote_code=True.


Hi @tomaarsen, is there anything else I can do to solve my issue?
