Nicola Procopio

nickprock

AI & ML interests

Digital Health Semantic Search - Vector Search

nickprock's activity

New activity in nickprock/mmarco-bert-base-italian-uncased about 1 month ago
updated a dataset about 1 month ago
New activity in nickprock/keras-emotions-classificator 4 months ago
Reacted to anakin87's post with 🔥🤗 5 months ago
🌌 Creating adventures with local LLMs

What if 🤔... Homer Simpson met Spider-Man and they went on a quest for donuts? 🍩
Or if Fred Astaire and Corporal Hicks teamed up to fight xenomorphs? 👾

In the words of Karpathy, LLMs are dream machines...
they seem specially made to simulate these wild scenarios!

π„π±π©πžπ«π’π¦πžπ§π­π’π§π  𝐰𝐒𝐭𝐑 𝐭𝐑𝐒𝐬 𝐒𝐝𝐞𝐚 πŸ‘‡
Nous Research / @teknium recently released NousResearch/CharacterCodex:
a massive dataset with information on 16k characters, both fictional and real.
I couldn't wait to play with it...

After a few attempts, I found that combining the information in this dataset with a good model (like meta-llama/Meta-Llama-3-8B-Instruct) opens the door to a myriad of chat adventures.

🛠️ Stack:
🔹Haystack for orchestration 🏗️
🔹llamafile 🦙🗂️ to run our model locally.

📓 Check out the notebook: https://t.ly/y6jrZ
(includes a bonus 🕵️ Mystery Character Quiz)
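
For a feel of how these pieces fit together, here is a minimal sketch (not the linked notebook): two CharacterCodex entries plus a local model served through llamafile's OpenAI-compatible endpoint, orchestrated with Haystack 2.x. The dataset field names (character_name, description) and the localhost URL are assumptions.

```python
# Minimal sketch: two CharacterCodex characters + a local LLM behind llamafile,
# orchestrated with Haystack 2.x. Not the author's notebook; dataset field
# names (character_name, description) are assumptions.
from datasets import load_dataset
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

codex = load_dataset("NousResearch/CharacterCodex", split="train")
hero, sidekick = codex[0], codex[42]

# llamafile exposes an OpenAI-compatible API on http://localhost:8080/v1 by default.
generator = OpenAIChatGenerator(
    api_key=Secret.from_token("sk-no-key-required"),  # llamafile ignores the key
    model="local-model",                              # not used to select the model here
    api_base_url="http://localhost:8080/v1",
)

system_prompt = (
    f"You are the narrator of a short adventure starring "
    f"{hero['character_name']} ({hero['description']}) and "
    f"{sidekick['character_name']} ({sidekick['description']}). "
    "Stay in character and keep it playful."
)
result = generator.run(messages=[
    ChatMessage.from_system(system_prompt),
    ChatMessage.from_user("Begin the adventure."),
])
print(result["replies"][0])
```
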
upvoted an article 6 months ago
Training and Finetuning Embedding Models with Sentence Transformers v3

Reacted to davanstrien's post with 🧠 6 months ago
Do you need a high-quality dataset to train a custom sentence transformer model? Look no further! I've developed a pipeline that leverages an LLM to create a synthetic dataset of negative and positive sentence pairs based on domain-specific anchors.

Here's what the pipeline offers:
- **Dataset Generation**: Automatically create synthetic sentence pairs
- **Hard Negative Mining**: Use an existing embedding model to mine hard negatives
- **Model Training**: Train a model using the latest release of Sentence Transformers.

Check out this collection (davanstrien/sentence-transformers-from-synthetic-data-66571a6133480d1b70066b70) to see an example of what you can achieve with this pipeline. It features a sentence transformer model to detect coding prompt similarities in a @bigcode dataset.

Excited to get started? Find a tutorial here: https://github.com/davanstrien/awesome-synthetic-datasets/tree/main/examples/embedding-datasets.
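
To illustrate just the hard-negative mining step, here is a simplified sketch (not the pipeline in the repository above): given synthetic (anchor, positive) pairs, an existing embedding model scores all candidates, and the closest non-matching sentence becomes the hard negative.

```python
# Simplified sketch of hard-negative mining with an existing embedding model.
# The pairs below are toy placeholders for LLM-generated synthetic data.
from sentence_transformers import SentenceTransformer, util

pairs = [
    ("How do I sort a list in Python?", "Use sorted() or list.sort()."),
    ("How do I read a CSV file?", "Use the csv module or pandas.read_csv."),
    ("How do I reverse a string?", "Use slicing: s[::-1]."),
]
anchors = [a for a, _ in pairs]
positives = [p for _, p in pairs]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
anchor_emb = model.encode(anchors, convert_to_tensor=True)
positive_emb = model.encode(positives, convert_to_tensor=True)

# For each anchor, the most similar *other* positive becomes its hard negative.
scores = util.cos_sim(anchor_emb, positive_emb)
triplets = []
for i in range(len(anchors)):
    scores[i][i] = -1.0  # mask out the true positive
    hard_negative = positives[int(scores[i].argmax())]
    triplets.append({"anchor": anchors[i], "positive": positives[i], "negative": hard_negative})
```
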
Reacted to tomaarsen's post with 🔥 6 months ago
‼️Sentence Transformers v3.0 is out! You can now train and finetune embedding models with multi-GPU training, bf16 support, loss logging, callbacks & much more. I've also released 50+ datasets to train on.

1️⃣ Training Refactor
Embedding models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Improved model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!
Read my detailed blog post to learn about the components that make up this new training approach: https://huggingface.co/blog/train-sentence-transformers
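
To give a feel for the new flow, here is a condensed training sketch in the spirit of the blog post; the model, dataset, and hyperparameters are illustrative choices, not recommendations.

```python
# Condensed sketch of the v3.0 training flow; model, dataset, and
# hyperparameters are illustrative, not prescriptive.
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")

# (anchor, positive) pairs from one of the released training datasets.
dataset = load_dataset("sentence-transformers/all-nli", "pair")
train_dataset = dataset["train"].select(range(10_000))
eval_dataset = dataset["dev"]

args = SentenceTransformerTrainingArguments(
    output_dir="models/mpnet-base-all-nli",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    bf16=True,           # bf16 training support
    logging_steps=100,   # loss logging
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # evaluation dataset + evaluation loss
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```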

2️⃣ Similarity Score
Not sure how to compare embeddings? Don't worry, you can now use model.similarity(embeddings1, embeddings2) and you'll get your similarity scores immediately. Model authors can specify their desired similarity score, so you don't have to worry about it anymore!
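
For example (model name illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings1 = model.encode(["The cat sits on the mat.", "A dog barks loudly."])
embeddings2 = model.encode(["A feline rests on a rug.", "It is raining outside."])

# Uses the similarity function declared by the model author (cosine by default).
scores = model.similarity(embeddings1, embeddings2)
print(scores)  # 2x2 tensor of pairwise similarity scores
```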

3️⃣ Additional Kwargs
Sentence Transformers relies on various Transformers instances (AutoModel, AutoTokenizer, AutoConfig), but it was hard to provide valuable keyword arguments to these (like 'torch_dtype=torch.bfloat16' to load a model at lower precision for a 2x inference speedup). This is now easy!
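
A small sketch of what that looks like (model name illustrative; model_kwargs is forwarded to the underlying from_pretrained call):

```python
import torch
from sentence_transformers import SentenceTransformer

# model_kwargs is passed down to AutoModel.from_pretrained.
model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"torch_dtype": torch.bfloat16},
)
embeddings = model.encode(["Loaded in bfloat16 for faster inference."])
```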

4️⃣ Hyperparameter Optimization
Sentence Transformers now ships with HPO, allowing you to effectively choose your hyperparameters for your data and task.
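
A hedged sketch of what the search can look like, building on the training sketch above: SentenceTransformerTrainer inherits hyperparameter_search from the Hugging Face Trainer (Optuna backend assumed here); the trainer must be built with a model_init callable so each trial starts from a fresh model, which is omitted below.

```python
# Hedged sketch of HPO via the inherited Trainer API (Optuna backend assumed).
# Assumes `trainer` was constructed with a model_init callable so every trial
# re-creates the model; the search space below is purely illustrative.
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32, 64]
        ),
    }

best_run = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend="optuna",
    n_trials=10,
    direction="minimize",  # minimize the evaluation loss
)
print(best_run.hyperparameters)
```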

5️⃣ Dataset Release
To help you out with finetuning models, I've released 50+ ready-to-go datasets that can be used with training or finetuning embedding models: sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552

Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.0