Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Community article · Published June 20, 2024

Overview

I'm excited to share Synthetic Data Workshop, a Space designed to help people with limited GPU resources generate synthetic datasets without extensive setup. The Space is an experiment in using Hugging Face Spaces to offer a ready-to-go compute environment for synthetic data creation.

Screenshot of the Space

What is Synthetic Data?

Synthetic data refers to artificially generated data that aims to mimic real-world data. Created algorithmically using models or simulations, synthetic data is a valuable resource in machine learning. With the advent of large language models (LLMs), there's been a surge in using synthetic data for both creating and training these models. Over the past year, interest in synthetic data has grown significantly.

Synthetic dataset growth on the Hub

Why Synthetic Data?

The increasing quality of open LLMs has made creating synthetic data using open models and libraries more feasible than ever. Previously, generating datasets for many tasks was time-consuming and expensive, and modifying a dataset drastically once it was created was difficult. Synthetic data empowers a single person to create large datasets for training and fine-tuning models for various tasks.

While much of the interest in synthetic data focuses on improving LLMs, there are numerous reasons to use synthetic data in other machine learning pipelines, including training smaller, task-specific models.

Barriers to Creating Synthetic Data

Despite these advances, getting a suitable environment for generating synthetic data up and running can be challenging, especially with limited GPU resources. Setting one up from scratch is often time-consuming. This is where the Synthetic Data Workshop Space comes in.

Features of Synthetic Data Workshop

Synthetic Data Workshop is a Hugging Face Space template designed to simplify the generation of synthetic data using popular open-source libraries.

Included Libraries and Tools

  • Datasets: for loading, processing, and pushing datasets to the Hugging Face Hub.
  • vLLM: for running the open LLMs used to generate the data.
  • Outlines: for using structured generation to control the output of the LLMs.
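
To give a feel for how these libraries fit together, here is a minimal sketch that generates completions with vLLM and pushes them to the Hub with Datasets. The model name, prompts, and dataset ID are placeholders, not what the Space's notebooks use; they walk through real pipelines in more detail.

```python
# Minimal sketch: generate synthetic text with vLLM and push it to the Hub.
# Model name, prompts, and dataset ID are placeholders -- swap in whatever
# instruction-tuned model fits your GPU.
from datasets import Dataset
from vllm import LLM, SamplingParams

llm = LLM(model="HuggingFaceH4/zephyr-7b-beta")  # placeholder model
sampling_params = SamplingParams(temperature=0.8, max_tokens=256)

prompts = [
    "Write a short product review for a pair of running shoes.",
    "Write a short product review for a mechanical keyboard.",
]
outputs = llm.generate(prompts, sampling_params)

# Collect prompt/completion pairs into a datasets.Dataset...
rows = [{"prompt": o.prompt, "completion": o.outputs[0].text} for o in outputs]
ds = Dataset.from_list(rows)

# ...and push it to the Hub (requires `huggingface-cli login` or an HF token).
ds.push_to_hub("<your-username>/my-synthetic-dataset")
```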

In addition to these libraries, the Space provides a set of Jupyter notebooks that guide you through generating synthetic datasets. The notebooks explain each step of the process and can easily be adapted to your own datasets, so you can reuse the approach in future projects.

Ready-to-Use Environment

  • Pre-configured setup with all necessary libraries.
  • No need for local GPU setups.

Comprehensive Tutorials

  • Step-by-step guides.
  • Example notebooks.

Topics Covered

There is an example pipeline for generating synthetic data to train Sentence Transformers models. Additional notebooks may be polished and added to the Space based on user interest 🤗
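
As a taste of what that pipeline involves, here is a hedged sketch of using Outlines' structured generation to produce (query, positive passage) training pairs. The model name, schema, and prompt are illustrative, and the API shown follows Outlines' 0.x interface, which may differ in newer releases; the notebook in the Space is the reference.

```python
# Illustrative sketch: structured generation of (query, positive) pairs
# for Sentence Transformers training. Model, schema, and prompt are
# placeholders; the Outlines API shown here follows its 0.x releases.
from pydantic import BaseModel
import outlines


class TrainingPair(BaseModel):
    query: str      # a plausible search query
    positive: str   # a passage that answers the query


model = outlines.models.transformers("HuggingFaceH4/zephyr-7b-beta")
generator = outlines.generate.json(model, TrainingPair)

# The JSON schema constrains the LLM's output, so the result parses
# directly into a TrainingPair instance.
pair = generator(
    "Write a search query about hiking gear and a short passage that answers it."
)
print(pair.query, pair.positive, sep="\n")
```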

How to Use the Space

Duplicating the Space

  1. Duplicate the Space to create your own instance (a scripted equivalent is sketched after this list).
  2. Opt to keep the Space private or make it public.
  3. Add a JUPYTER_TOKEN secret if you plan to make the Space public, so that only you can log in to the Jupyter server.
  4. Select the desired hardware; most of the notebooks require a GPU.
  5. Add a small persistent disk to save your work between sessions.
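
If you prefer scripting over clicking, the same steps can be done with huggingface_hub's duplicate_space helper. This is a sketch assuming a recent version of huggingface_hub; the Space ID, secret value, hardware tier, and storage size are placeholders you'll need to replace with your own.

```python
# Sketch of duplicating the Space programmatically. All values below
# are placeholders -- adjust them for your account and budget.
from huggingface_hub import duplicate_space

duplicate_space(
    "<source-user>/synthetic-data-workshop",  # the Space to duplicate
    private=True,                              # keep your copy private
    hardware="t4-small",                       # GPU tier for the instance
    storage="small",                           # persistent disk between sessions
    secrets=[{"key": "JUPYTER_TOKEN", "value": "<your-token>"}],
)
```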

Duplication Screenshot

Choosing an Appropriate GPU

The Space heavily utilizes vLLM, which requires a GPU. You can start with a smaller GPU and upgrade to a larger one as needed, especially when running bigger models or scaling up the amount of data you generate. You can also run the Space on CPU hardware, but you won't be able to use the LLM-based notebooks.
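
If you start on a smaller card, a few vLLM settings can help a model fit in memory. This is a sketch with illustrative values rather than tuned recommendations; the model name is a placeholder.

```python
# Hedged sketch of vLLM settings for smaller GPUs. The model and the
# numbers are illustrative starting points, not tuned values.
from vllm import LLM

llm = LLM(
    model="HuggingFaceH4/zephyr-7b-beta",  # placeholder; pick a model that fits your card
    dtype="half",                 # half precision reduces memory use
    max_model_len=2048,           # cap context length to shrink the KV cache
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
)
```

Smaller 7B models in half precision are a reasonable starting point on entry-level GPUs; if you hit out-of-memory errors, lowering max_model_len is usually the first knob to turn.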

Next steps

The Synthetic Data Workshop is an experimental Space, and I'm eager to receive feedback from the community. Let me know what works well, what could be improved, and what additional features or tutorials you want to see.

Stay tuned for more updates and tutorials, and happy data generation!