@davanstrien on Hugging Face: "Do you need a high-quality dataset to train a custom sentence transformer…"

Do you need a high-quality dataset to train a custom sentence transformer model? Look no further! I've developed a pipeline that leverages an LLM to create a synthetic dataset of negative and positive sentence pairs based on domain-specific anchors.

Here's what the pipeline offers:
- **Dataset Generation**: Automatically create synthetic sentence pairs
- **Hard Negative Mining**: Use an existing embedding model to mine hard negatives
- **Model Training**: Train a model using the latest release of Sentence Transformers
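
To make the hard negative step concrete, here is a minimal sketch of the idea, not the pipeline's actual code. It assumes you already have (anchor, positive) pairs, and for each anchor it picks the most similar positive belonging to a different pair as the hard negative. `toy_embed` is a hypothetical bag-of-words stand-in so the example runs without downloading anything; in practice you would swap in a real embedding model.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_hard_negatives(pairs, embed=toy_embed):
    """Turn (anchor, positive) pairs into (anchor, positive, negative) triplets.

    The negative is the positive from another pair that is most similar
    to the anchor, which makes it a "hard" example to tell apart.
    """
    triplets = []
    for i, (anchor, positive) in enumerate(pairs):
        anchor_vec = embed(anchor)
        # Candidates are the positives of every other pair.
        candidates = [p for j, (_, p) in enumerate(pairs) if j != i]
        negative = max(candidates, key=lambda c: cosine(anchor_vec, embed(c)))
        triplets.append((anchor, positive, negative))
    return triplets


# Made-up example data, in the spirit of coding-prompt similarity.
pairs = [
    ("write a python function to reverse a string", "python code that reverses a string"),
    ("write a python function to sort a list", "python code that sorts a list"),
    ("explain how tcp handshakes work", "describe the tcp three-way handshake"),
]
triplets = mine_hard_negatives(pairs)
for anchor, positive, negative in triplets:
    print(f"{anchor!r} -> negative: {negative!r}")
```

The resulting triplets are the kind of data a triplet or contrastive loss expects. One caveat with this simple approach: a candidate that is very close to the anchor may really be a paraphrase, so it is worth filtering out false negatives before training.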

Check out this collection (davanstrien/sentence-transformers-from-synthetic-data-66571a6133480d1b70066b70) to see an example of what you can achieve with this pipeline. It features a sentence transformer model to detect coding prompt similarities in a @bigcode dataset.

Excited to get started? Find a tutorial here: https://github.com/davanstrien/awesome-synthetic-datasets/tree/main/examples/embedding-datasets.