Do you need a high-quality dataset to train a custom sentence transformer model? Look no further! I've developed a pipeline that leverages an LLM to create a synthetic dataset of negative and positive sentence pairs based on domain-specific anchors.
Here's what the pipeline offers:

- **Dataset Generation**: Automatically create synthetic positive/negative sentence pairs from domain-specific anchors
- **Hard Negative Mining**: Use an existing embedding model to mine hard negatives
- **Model Training**: Train a model using the latest release of Sentence Transformers
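The dataset-generation step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the prompt wording, the JSON reply format, and the `parse_pair` helper are all assumptions, and the LLM call is stubbed out with a hard-coded reply.

```python
import json

# Hypothetical prompt template -- the real pipeline's prompt is not shown here.
PROMPT_TEMPLATE = (
    "Given the anchor sentence below, write one sentence with the same meaning "
    "(positive) and one unrelated sentence (negative). "
    'Respond as JSON: {{"positive": "...", "negative": "..."}}\n'
    "Anchor: {anchor}"
)

def build_prompt(anchor: str) -> str:
    return PROMPT_TEMPLATE.format(anchor=anchor)

def parse_pair(anchor: str, llm_output: str) -> dict:
    """Turn the LLM's JSON reply into an (anchor, positive, negative) triplet."""
    reply = json.loads(llm_output)
    return {"anchor": anchor, "positive": reply["positive"], "negative": reply["negative"]}

# Stubbed LLM reply for illustration; in practice this would come from the model.
anchor = "The patient was prescribed antibiotics for the infection."
fake_reply = (
    '{"positive": "Antibiotics were given to treat the patient\'s infection.", '
    '"negative": "The stock market closed higher on Friday."}'
)
triplet = parse_pair(anchor, fake_reply)
```

Collecting such triplets over many domain-specific anchors yields a dataset in the (anchor, positive, negative) shape that Sentence Transformers losses expect.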
How can we use open LLMs to create data for training sentence similarity models?
One of the most exciting use cases for LLMs is generating synthetic datasets that can be used to train non-LLM models. In the past, gathering enough data was one of the most significant barriers to training task-specific models. LLMs can potentially help in this area.
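The hard-negative-mining step mentioned above can be sketched with toy vectors. In the real pipeline an embedding model produces the vectors; here they are hand-written 3-d stand-ins, and the selection rule (most similar non-positive candidate) is one common, simplified strategy rather than the pipeline's exact method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_hard_negative(anchor_vec, candidate_vecs, positive_idx):
    """Pick the candidate most similar to the anchor that is NOT the positive:
    a 'hard' negative -- close in embedding space, yet semantically wrong."""
    scored = [
        (i, cosine(anchor_vec, vec))
        for i, vec in enumerate(candidate_vecs)
        if i != positive_idx
    ]
    return max(scored, key=lambda t: t[1])[0]

# Toy embeddings standing in for a real embedding model's output.
anchor_vec = [1.0, 0.0, 0.0]
candidates = [
    [0.99, 0.1, 0.0],  # 0: the true positive
    [0.8, 0.6, 0.0],   # 1: similar but wrong -> the hard negative
    [0.0, 0.0, 1.0],   # 2: an easy negative
]
hard_idx = mine_hard_negative(anchor_vec, candidates, positive_idx=0)  # -> 1
```

Training on hard negatives like candidate 1, rather than obvious mismatches like candidate 2, gives the model the contrastive signal it needs to learn fine-grained distinctions.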