Do you need a high-quality dataset to train a custom sentence transformer model? Look no further! I've developed a pipeline that leverages an LLM to create a synthetic dataset of negative and positive sentence pairs based on domain-specific anchors.
Here's what the pipeline offers:

- **Dataset Generation**: Automatically create synthetic positive/negative sentence pairs from domain-specific anchors
- **Hard Negative Mining**: Use an existing embedding model to mine hard negatives
- **Model Training**: Train a model using the latest release of Sentence Transformers
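The dataset-generation step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the prompt wording, the JSON reply format, and the `parse_pair` helper are all assumptions, and the LLM call is stubbed out with a hard-coded reply.

```python
import json

# Hypothetical prompt template -- the real pipeline's prompt is not shown here.
PROMPT_TEMPLATE = (
    "Given the anchor sentence below, write one sentence with the same meaning "
    "(positive) and one unrelated sentence (negative). "
    'Respond as JSON: {{"positive": "...", "negative": "..."}}\n'
    "Anchor: {anchor}"
)

def build_prompt(anchor: str) -> str:
    return PROMPT_TEMPLATE.format(anchor=anchor)

def parse_pair(anchor: str, llm_output: str) -> dict:
    """Turn the LLM's JSON reply into an (anchor, positive, negative) triplet."""
    reply = json.loads(llm_output)
    return {"anchor": anchor, "positive": reply["positive"], "negative": reply["negative"]}

# Stubbed LLM reply for illustration; in practice this would come from the model.
anchor = "The patient was prescribed antibiotics for the infection."
fake_reply = (
    '{"positive": "Antibiotics were given to treat the patient\'s infection.", '
    '"negative": "The stock market closed higher on Friday."}'
)
triplet = parse_pair(anchor, fake_reply)
```

Collecting such triplets over many domain-specific anchors yields a dataset in the (anchor, positive, negative) shape that Sentence Transformers losses expect.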
How can we use open LLMs to create data for training sentence similarity models?
One of the most exciting use cases for LLMs is generating synthetic datasets that can be used to train non-LLM models. In the past, gathering enough data was one of the most significant barriers to training task-specific models. LLMs can potentially help in this area.
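The hard-negative-mining step mentioned above can be sketched with toy vectors. In the real pipeline an embedding model produces the vectors; here they are hand-written 3-d stand-ins, and the selection rule (most similar non-positive candidate) is one common, simplified strategy rather than the pipeline's exact method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_hard_negative(anchor_vec, candidate_vecs, positive_idx):
    """Pick the candidate most similar to the anchor that is NOT the positive:
    a 'hard' negative -- close in embedding space, yet semantically wrong."""
    scored = [
        (i, cosine(anchor_vec, vec))
        for i, vec in enumerate(candidate_vecs)
        if i != positive_idx
    ]
    return max(scored, key=lambda t: t[1])[0]

# Toy embeddings standing in for a real embedding model's output.
anchor_vec = [1.0, 0.0, 0.0]
candidates = [
    [0.99, 0.1, 0.0],  # 0: the true positive
    [0.8, 0.6, 0.0],   # 1: similar but wrong -> the hard negative
    [0.0, 0.0, 1.0],   # 2: an easy negative
]
hard_idx = mine_hard_negative(anchor_vec, candidates, positive_idx=0)  # -> 1
```

Training on hard negatives like candidate 1, rather than obvious mismatches like candidate 2, gives the model the contrastive signal it needs to learn fine-grained distinctions.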