Request for Assistance with Fine-Tuning the nomic-embed-text-v1 Model for Spanish
I hope this message finds you well. My name is Wilfredo, and I am currently working on a project that involves fine-tuning the nomic-ai/nomic-embed-text-v1 model for a specific application in Spanish text processing.
I am reaching out to you to request your assistance in understanding the steps required to fine-tune this model effectively. Specifically, I am looking for guidance on:
Dataset Preparation: What are the recommended practices for preparing the dataset for fine-tuning? Are there any specific data formats or preprocessing steps that should be followed?
Fine-Tuning Process: Could you provide detailed instructions or a framework for fine-tuning the model, including any specific hyperparameters or training configurations that are crucial for achieving optimal performance?
Thank you very much for your time and consideration. I look forward to your response.
Best regards,
hi, Sentence Transformers 3 might be a good place to start! https://x.com/tomaarsen/status/1795425797408235708
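For reference, a minimal sketch of what a Sentence Transformers v3-style fine-tuning run could look like, assuming (anchor, positive) text pairs and `MultipleNegativesRankingLoss`; the toy examples, output directory, and hyperparameters below are placeholders rather than tuned values:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Toy (anchor, positive) pairs in Spanish; replace with your curated dataset.
# nomic-embed-text-v1 was trained with task prefixes, so they are kept here.
train_dataset = Dataset.from_dict({
    "anchor": ["search_query: ¿Cuál es la capital de Francia?"],
    "positive": ["search_document: La capital de Francia es París."],
})

# trust_remote_code is required because the model ships custom modeling code.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# In-batch negatives loss; tends to benefit from larger batch sizes.
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="nomic-embed-es",      # placeholder output directory
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=2e-5,               # common starting point, not a tuned value
    warmup_ratio=0.1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.save("nomic-embed-es/final")
```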
as far as data, I would curate a sizeable dataset of at least 10k examples to fine-tune on, although I'm not sure how well the model will do on Spanish since the tokenizer is optimized solely for English.
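On the data side, here is one possible way to get curated pairs into the format the trainer above expects. The `spanish_pairs.csv` file and its `query`/`document` columns are hypothetical stand-ins for your own data, and the nomic task prefixes are applied on the assumption that matching the pre-training input format helps:

```python
from datasets import load_dataset

# Hypothetical CSV of curated Spanish pairs with "query" and "document" columns.
raw = load_dataset("csv", data_files="spanish_pairs.csv", split="train")

# nomic-embed-text-v1 expects task prefixes (see its model card), so apply
# them so fine-tuning inputs match the pre-training input format.
def to_pair(row):
    return {
        "anchor": "search_query: " + row["query"],
        "positive": "search_document: " + row["document"],
    }

train_dataset = raw.map(to_pair, remove_columns=raw.column_names)
print(train_dataset[0])  # {'anchor': 'search_query: ...', 'positive': 'search_document: ...'}
```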
Thank you @zpn. Also, I would like to study the code of nomic-embed.