Introduction
Have you ever tried to get hold of some data for a problem, be it a machine learning task or some other development effort, and you just couldn’t find enough? Either the data is closed-source and unavailable to you, or it is prohibitively costly or time-consuming to acquire. How do we deal with such a situation?
Well, one solution is synthetic data. Synthetic data is generated by a model and used in place of, or alongside, real data. Here, by model, we don’t mean only machine learning or deep learning models; simple mathematical or statistical models work too, like a set of (stochastic) differential equations modeling a physical or economic system. Feeling excited yet? Let’s dive more into the details of synthetic data: what it is, how it is generated, and what its benefits are. You might be able to answer the last one a little by now ;)
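To make the stochastic-differential-equation idea concrete, here is a minimal sketch of generating synthetic price paths from a geometric Brownian motion, a classic toy model of an asset price (the drift and volatility values below are invented purely for illustration):

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, n_steps, n_paths, dt=1 / 252, seed=0):
    """Generate synthetic price paths from the SDE dS = mu*S dt + sigma*S dW,
    using the exact solution for the log-price."""
    rng = np.random.default_rng(seed)
    # Gaussian increments of the underlying Wiener process
    dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    # Exact discretisation of the log-price increments
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * dw
    log_paths = np.cumsum(log_increments, axis=1)
    # Prepend the starting point and exponentiate back to prices
    return s0 * np.exp(np.hstack([np.zeros((n_paths, 1)), log_paths]))

# 1,000 synthetic one-year daily price paths starting at 100
paths = simulate_gbm(s0=100.0, mu=0.05, sigma=0.2, n_steps=252, n_paths=1000)
```

Every path here is "fake" in the sense that no market produced it, yet the ensemble has well-defined, controllable statistics, which is exactly what makes model-generated data useful as a stand-in.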
What is synthetic data?
As the Royal Society defines it, synthetic data is data generated using a purpose-built mathematical model or algorithm to solve a (set of) data science task(s). Keep in mind that synthetic data only mimics real data; it is not generated by real events. Ideally, synthetic data should have the same statistical properties as the real data it supplements. It has many uses, such as improving AI models, protecting sensitive data, and mitigating bias.
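A minimal way to make "same statistical properties" concrete: fit a simple model to real data (a multivariate Gaussian here, chosen only for illustration), sample synthetic records from it, and check that the summary statistics match:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for "real" data: 5,000 records with two correlated columns
real = rng.multivariate_normal(mean=[10.0, 50.0],
                               cov=[[4.0, 3.0], [3.0, 9.0]], size=5000)

# Fit the model: estimate mean and covariance from the real data
mu_hat = real.mean(axis=0)
cov_hat = np.cov(real, rowvar=False)

# Generate synthetic records from the fitted model; no real record
# is copied, but the distribution is (approximately) preserved
synthetic = rng.multivariate_normal(mu_hat, cov_hat, size=5000)
```

Real generators are of course far more sophisticated, but the contract is the same: the synthetic sample should be statistically exchangeable with the real one for the task at hand.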
Why would you use synthetic data?
Before answering this question, let’s talk a little about why real data alone is often not sufficient. A non-exhaustive list of problems with real data:
- It can be messy and very hard to deal with.
- Inter-company data sharing might not be possible due to privacy issues.
- Medical data is confidential and hence cannot be shared openly.
- It can be biased.
- Data collection and annotation can be expensive.
Most of the above-mentioned problems can potentially be solved by synthetic data:
- Synthetic data is generated in a structured form and is therefore easy to work with.
- Companies can train synthetic data generation models that learn the distribution of the original data but reveal nothing about individual data points, thereby maintaining privacy. A similar approach can be taken for medical data.
- We can train the data generator model to generate de-biased data.
- Synthetic data can be augmented with real data to make the models or applications more robust.
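As a toy illustration of the de-biasing point above: a deep-learning version would use a conditional generative model, but the simplest form of the idea is interpolation-based oversampling of an under-represented class (the class sizes and feature values below are invented for illustration):

```python
import numpy as np

def oversample_minority(x_minority, n_new, seed=0):
    """Synthesize new minority-class points by interpolating between
    random pairs of existing ones (SMOTE-style, without neighbour search)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(x_minority), size=n_new)
    j = rng.integers(0, len(x_minority), size=n_new)
    t = rng.uniform(0.0, 1.0, size=(n_new, 1))
    # Each new point lies on the segment between two real minority points
    return x_minority[i] + t * (x_minority[j] - x_minority[i])

rng = np.random.default_rng(1)
majority = rng.normal(0.0, 1.0, size=(900, 4))   # 900 examples of class A
minority = rng.normal(3.0, 1.0, size=(100, 4))   # only 100 examples of class B
new_points = oversample_minority(minority, n_new=800)
balanced_minority = np.vstack([minority, new_points])  # now 900 examples
```

A model trained on the balanced dataset no longer sees class B as a rare event, which is the essence of using generated data to counteract bias in the original collection.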
How to generate synthetic data?
Here, we mention some of the ways to generate synthetic data:
- CAD & Blender: These tools allow the creation of photorealistic image datasets of 3D scenes while controlling the generation parameters. Because those parameters serve as ground truth, metrics can be computed by comparing the synthesized data against them. This approach is very robust but limited in generation quality, diversity, and quantity. Use cases include commercial applications, generating synthetic faces, and monitoring wildlife.
- Deep generative models (Transformers/GANs/Diffusion models): These models make it possible to expand a dataset, tackle data imbalance, and address privacy issues. They are very convenient and powerful but can create datasets with biases, incoherence, and repetitiveness, which poses a significant overfitting risk and restricts the range of predictions a downstream model can make. Use cases include medical image generation, efficient plant disease identification, industrial waste sorting, traffic sign recognition, and detecting emergency vehicles for autonomous driving applications.
In this unit, we will introduce the following methods to generate synthetic data: physically-based rendering, point clouds, and GANs.
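As a tiny preview of the point-cloud idea, here is arguably the simplest possible synthetic point-cloud generator: sampling points on a sphere of known radius and adding sensor-like noise (the noise level is an arbitrary choice for illustration). Because we control every generation parameter, the ground truth comes for free as labels:

```python
import numpy as np

def synthetic_sphere_cloud(n_points, radius=1.0, noise=0.01, seed=0):
    """Sample a synthetic point cloud on a sphere of known radius.
    The generation parameters (centre, radius) double as ground-truth labels."""
    rng = np.random.default_rng(seed)
    # Uniform directions on the sphere: normalise Gaussian samples
    directions = rng.normal(size=(n_points, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scale to the target radius and perturb with sensor-like noise
    return radius * directions + rng.normal(0.0, noise, size=(n_points, 3))

cloud = synthetic_sphere_cloud(2048, radius=1.0, noise=0.01)
```

Real pipelines render far richer scenes, but the principle carries over: a model trained or evaluated on this cloud can be scored exactly, because the true surface is known by construction.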
Challenges with synthetic data
Now that we have seen the power and uses of synthetic data, let’s take a moment to discuss its challenges:
- Synthetic data is not inherently private: Synthetic data can also leak information about the data it was derived from and is vulnerable to privacy attacks. Significant care is required to generate private synthetic data.
- Outliers can be hard to capture privately: Outliers and low-probability events, which are common in real data, are particularly difficult to capture and include privately in a synthetic dataset.
- Empirically evaluating the privacy of a single dataset can be problematic: Rigorous notions of privacy (e.g., differential privacy) are requirements on the mechanism that generated a synthetic dataset, rather than on the dataset itself.
- Black box models can be particularly opaque when it comes to generating synthetic data: Overparameterised generative models excel in producing high-dimensional synthetic data, but the levels of accuracy and privacy of these datasets are hard to estimate and can vary significantly across produced data points.
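To make the differential-privacy point above concrete, here is the classic Laplace mechanism applied to a counting query. This is a hedged sketch of the kind of mechanism-level guarantee being referred to; a full differentially private synthetic-data generator is considerably more involved, and the dataset and predicate below are invented for illustration:

```python
import numpy as np

def laplace_count(data, predicate, epsilon, seed=None):
    """Release a count with epsilon-differential privacy.
    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon is sufficient for the guarantee."""
    rng = np.random.default_rng(seed)
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(0.0, 1.0 / epsilon)

ages = [23, 35, 41, 29, 67, 52, 38]  # toy "sensitive" dataset
noisy = laplace_count(ages, lambda a: a >= 40, epsilon=1.0, seed=0)
```

The guarantee attaches to `laplace_count`, the mechanism, not to any particular noisy answer it emits; this is precisely why inspecting a single released dataset cannot establish that it is private.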
Resources
- Machine Learning for Synthetic Data Generation: A Review
- Synthetic Data — what, why and how?
- One very interesting application of synthetic data: this person does not exist