Challenges and Opportunities Associated With Using Synthetic Data

Training machine learning models requires vast amounts of data. Synthetic data can help by addressing privacy issues, augmenting limited data, and correcting imbalances in the real data. We have learned how to generate synthetic data using several different methods. Before using synthetic data to train a model, however, there are several important things that need to be considered.

Overfitting the Model

Overfitting occurs when a machine learning model learns the training data so well that it doesn’t perform well on new, unseen data. It’s akin to memorizing a specific way to solve a problem, then encountering a new situation where that strategy doesn’t work. If the process of generating synthetic data is too simple or produces overly consistent patterns, your model might overfit to the limited variations present in the synthetic data. As a very simple example, suppose you trained a model on a synthetic dataset of 25 red circles and 25 blue squares. The model would probably learn to associate circles with the color red and squares with the color blue, and it would likely fail if presented with a red square.

Be sure to double-check that your dataset doesn't have any of the following patterns! A sketch for avoiding them follows the list.

Overly Consistent Color

Overly Consistent Size

Overly Consistent Background

Overly Consistent Location
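
One way to guard against these patterns is to randomize every attribute you don't want the model to latch onto. Here is a minimal sketch using Pillow for the circles-and-squares example (the color lists and size ranges are illustrative, not from the course):

```python
import random
from PIL import Image, ImageDraw

COLORS = ["red", "blue", "green", "orange", "purple"]
BACKGROUNDS = ["white", "lightgray", "beige"]

def random_shape_image(shape, size=128):
    """Draw a circle or square with randomized color, size, location, and background."""
    img = Image.new("RGB", (size, size), random.choice(BACKGROUNDS))
    draw = ImageDraw.Draw(img)
    side = random.randint(20, 60)           # randomize size
    x = random.randint(0, size - side)      # randomize location
    y = random.randint(0, size - side)
    color = random.choice(COLORS)           # randomize color
    box = [x, y, x + side, y + side]
    if shape == "circle":
        draw.ellipse(box, fill=color)
    else:
        draw.rectangle(box, fill=color)
    return img

# 25 circles and 25 squares, but now with varied attributes
dataset = [(random_shape_image(s), s) for s in ["circle", "square"] * 25]
```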

Are there biases in the synthetic data?

If the process of generating synthetic data has biases or inaccuracies, your model may unintentionally learn and perpetuate those biases. Beware of the following pitfalls:

Limited Diversity

One challenge is that synthetic data may fail to adequately represent the complexity and diversity of real data. The shape example might seem trivial, but there are many situations where failing to account for the wide variety of people, places, animals, or objects will result in a model that doesn’t perform well. For example, suppose you wanted to train a model to monitor the population of an endangered species, such as the aye-aye lemur. If your dataset only contains images of ring-tailed lemurs, the model might struggle to accurately identify aye-ayes in the wild, which could lead to errors in population assessments. The upside is that if you are mindful of imbalances in the underlying dataset, you can potentially de-bias the real data by augmenting it with synthetic examples of the under-represented class (see the sketch below).

Try to make sure your dataset reflects the variety found in the real world!

Nice Variety
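
If you do spot an imbalance, one simple recipe is to top up each under-represented class with synthetic samples until all classes are level. A minimal sketch, where the dataset and the `generate_synthetic` function are hypothetical stand-ins for your own data and generator:

```python
from collections import Counter

# Hypothetical stand-ins: replace with your real dataset and trained generator.
real_dataset = [("image", "ring-tailed")] * 90 + [("image", "aye-aye")] * 10

def generate_synthetic(label):
    # e.g., sample from a class-conditional generative model
    return f"synthetic {label} image"

counts = Counter(label for _, label in real_dataset)
target = max(counts.values())

balanced = list(real_dataset)
for label, count in counts.items():
    # Top up each under-represented class with synthetic samples
    balanced += [(generate_synthetic(label), label) for _ in range(target - count)]

print(Counter(label for _, label in balanced))  # every class now has 90 samples
```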

Copying Existing Biases

If the data you used to create the synthetic images already had biases, your model might unintentionally learn and replicate those biases. It’s like copying a friend’s notes without realizing they made a mistake: your model might end up with the same errors.

Does the benefit of using synthetic data outweigh the computational cost?

Generating high-quality synthetic data can be computationally expensive. This may pose challenges in terms of both time and resources, especially for complex models or large datasets. As a general rule, generating and using a synthetic dataset only makes sense if it ultimately saves resources (money, time, etc.).

What is the perceived quality of the synthetic images?

Let’s consider the lung images we generated using DCGAN. While some of the images looked fairly realistic, others were not so good. A model trained on the low-quality images might fail to detect pneumonia because those images contain noise that isn’t present in real scans. It is also possible that your model gets really good at recognizing patterns in the synthetic data, but those patterns might not exist, or may look different, in the real world.

A good practice is to evaluate your dataset using a metric such as the Fréchet Inception Distance (FID), Inception Score (IS), or Classification Accuracy Score (CAS).

FID:

FID uses a pre-trained neural network, often Inception, that is good at recognizing objects in images. The network is used to extract features from both the real and the generated images. FID then measures how “far” one feature distribution is from the other, taking into account both the means and the covariances of the distributions.

A low FID suggests that the feature distributions of real and generated images are similar and the generated images are more likely to be realistic.
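
As a quick sketch, FID can be computed with the torchmetrics library; random uint8 tensors stand in here for batches of your real and generated images:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Stand-ins for real and generated images: uint8 tensors of shape
# (N, 3, 299, 299), the input size the Inception network expects.
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # 2048-dim Inception features
fid.update(real_images, real=True)    # accumulate features of real images
fid.update(fake_images, real=False)   # accumulate features of generated images
print(fid.compute())                  # lower is better
```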

IS:

IS uses a pre-trained Inception model to evaluate the quality of images produced by generative models, particularly GANs. For each generated image, the Inception model assigns a score based on its confidence in recognizing objects within that image. Higher scores are better and indicate that the Inception model is confident about the content of the images. The score also rewards variety: a strong generator produces confidently recognizable images spread across many different classes, not many copies of the same one.
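
The Inception Score can be sketched with torchmetrics in much the same way (random tensors again stand in for generated images):

```python
import torch
from torchmetrics.image.inception import InceptionScore

# Stand-in for a batch of generated images
generated = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

inception = InceptionScore()
inception.update(generated)
mean, std = inception.compute()  # mean IS across splits; higher is better
print(mean, std)
```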

CAS:

The Classification Accuracy Score measures how well a classifier trained on your synthetic data performs on real data. A higher accuracy indicates that the synthetic images effectively capture the features and patterns of the real images. Low accuracy scores for certain classes may indicate issues with the generation process, such as unrealistic backgrounds, incorrect textures, or inconsistent lighting conditions. You can use CAS to identify and address these problems and improve the overall quality of the synthetic dataset.
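
A minimal CAS sketch in PyTorch: train a small classifier on synthetic images only, then report its accuracy on real images. Random tensors stand in for both datasets here, and the tiny network is purely illustrative:

```python
import torch
from torch import nn

num_classes = 2
# Stand-ins: synthetic training set and a held-out set of *real* images
synthetic_imgs = torch.rand(200, 3, 64, 64)
synthetic_labels = torch.randint(0, num_classes, (200,))
real_imgs = torch.rand(50, 3, 64, 64)
real_labels = torch.randint(0, num_classes, (50,))

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, num_classes),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):  # train on synthetic data only
    optimizer.zero_grad()
    loss = loss_fn(model(synthetic_imgs), synthetic_labels)
    loss.backward()
    optimizer.step()

with torch.no_grad():  # evaluate on real data
    preds = model(real_imgs).argmax(dim=1)
    cas = (preds == real_labels).float().mean().item()
print(f"Classification Accuracy Score: {cas:.2f}")
```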

Conclusion

Even after training your model, it is crucial to continuously monitor its performance in real-world scenarios. If your model encounters new situations or trends that weren’t present in the synthetic data, it might struggle to adapt. Addressing these challenges involves mindful design of the synthetic data generation process and evaluation of the model’s performance on real data. Applying these principles will help to unlock the potential of synthetic data!

Resources and Further Reading
