Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

vanstriendaniel

AI & ML interests

Machine Learning Librarian

Articles

Synthetic dataset generation techniques: generating custom sentence similarity data

Synthetic dataset generation techniques: Self-Instruct

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Creating open machine learning datasets? Share them on the Hugging Face Hub!

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Introducing BERTopic Integration with Hugging Face Hub

Jupyter X Hugging Face

Image search with 🤗 datasets

Organizations

Posts 16

Post

How can we use open LLMs to create data for training sentence similarity models?

One of the most exciting use cases for LLMs is generating synthetic datasets that can be used to train non-LLM models. In the past, gathering enough data was one of the most significant barriers to training task-specific models. LLMs can potentially help in this area.

I've just written a new blog post on using meta-llama/Meta-Llama-3-70B-Instruct to generate synthetic similarity data based on the approach from Retrieving Texts based on Abstract Descriptions (2305.12517).

https://huggingface.co/blog/davanstrien/synthetic-similarity-datasets

Post

I've begun adding valuable blog posts on using/creating synthetic datasets to my curated list.

I am starting with a great post by @MoritzLaurer on using an open LLM to generate data for training a specialized RoBERTa model.

Read the blog post: https://huggingface.co/blog/synthetic-data-save-costs
See the rest of the list: https://github.com/davanstrien/awesome-synthetic-datasets

Collections 12

Papers 4

arxiv:2211.10086

arxiv:2211.05100

arxiv:2205.04738

arxiv:2204.05211

Spaces 37

Dataset Column Search

Gradio Url Params Request

Chat Viewer

Domain Specific Seed

DIBT haiku preferences

Argilla Space Template

Models 112

davanstrien/query-gen

Updated 7 days ago • 6

davanstrien/LLama-3-dataset-tldr

Text Generation • Updated 28 days ago • 6

davanstrien/LLama-3-dataset-tldr-gguf

Updated 28 days ago • 129

davanstrien/dataset-tldr

Text Generation • Updated 29 days ago • 4

davanstrien/testdpo

Text Generation • Updated Apr 11 • 1

davanstrien/prompt_rater

Text Classification • Updated Mar 13 • 1

davanstrien/prompt_ranker

Text Classification • Updated Feb 29 • 2

davanstrien/faketransformer

Text Classification • Updated Feb 14

davanstrien/flair

Updated Jan 30 • 1

davanstrien/HaikuHermes-0.1-7B

Text Generation • Updated Jan 16 • 9 • 4

Datasets 222

davanstrien/similarity-dataset-sc2-8b

Viewer • Updated about 1 hour ago • 1

davanstrien/similarity-dataset-test-code2

Viewer • Updated about 2 hours ago

davanstrien/similarity-dataset-test-code

Viewer • Updated about 19 hours ago • 1

davanstrien/self-oss-instruct-sc2-exec-filter-50k-short

Viewer • Updated about 20 hours ago • 36

davanstrien/similarity-dataset-test-sql

Viewer • Updated 4 days ago

davanstrien/similarity-dataset-test

Preview • Updated 5 days ago

davanstrien/query-gen

Viewer • Updated 7 days ago • 2

davanstrien/dataset-tldr-preference

Viewer • Updated 12 days ago • 3

davanstrien/notebooks_by_repo_type

Viewer • Updated 13 days ago • 103 • 1

davanstrien/notebooks_by_user

Viewer • Updated 13 days ago • 1 • 2