GAN metrics

In order to track progress 📈 in (un)conditional image generation, a few quantitative metrics have been proposed. Below, we explain the most popular ones. For a more extensive overview, we refer the reader to Borji, 2021, an updated version of Borji, 2018. The TL;DR is that, despite the use of many popular metrics, objective and comprehensive evaluation of generative models is still an open problem 🤷‍♂️.

Quantitative metrics are of course just a proxy for image quality. The most widely used ones (Inception Score and FID) have several known drawbacks (Barratt et al., 2018; Sajjadi et al., 2018; Kynkäänniemi et al., 2019).

Inception score

The Inception score was proposed in Salimans et al., 2016. The authors used a pre-trained Inceptionv3 network to classify the images generated by a GAN, and computed a score based on the class probabilities predicted by the network. The authors claimed that the score correlates well with subjective human evaluation. For an extensive explanation of the metric (as well as an implementation in NumPy and Keras), we refer the reader to this blog post.
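
Concretely, the score is the exponential of the average KL divergence between the conditional class distribution p(y|x) of each generated image and the marginal class distribution p(y). Below is a minimal NumPy sketch, assuming `probs` already contains the Inceptionv3 softmax outputs for the generated images (for simplicity, it skips the common practice of averaging the score over several splits):

```python
import numpy as np

def inception_score(probs, eps=1e-16):
    """probs: (N, num_classes) array of Inceptionv3 softmax outputs p(y|x)."""
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal distribution p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # per-image KL(p(y|x) || p(y))
    return float(np.exp(kl.sum(axis=1).mean()))              # exponential of the mean KL
```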

Fréchet Inception Distance (FID)

The FID metric was proposed in Heusel et al., 2018, and is currently the most widely used metric for evaluating image generation. Rather than only evaluating the generated images (as the Inception score does), the FID metric compares the generated images to real images.

The Fréchet distance measures the distance between two multivariate Gaussian distributions. What does that mean? Concretely, the FID metric uses a pre-trained neural network (the same Inceptionv3 network used for the Inception score), and first forwards both real and generated images through it in order to obtain feature maps. Next, one computes statistics (namely, the mean and covariance) of the feature maps for both distributions (generated and real images). Finally, the distance between both distributions is computed based on these statistics.
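
Written out, the metric is FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_rΣ_g)^(1/2)), where (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of the real and generated feature distributions. Below is a minimal NumPy/SciPy sketch, assuming the Inceptionv3 features have already been extracted into two (N, D) arrays (the function name is ours, for illustration):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """feats_*: (N, D) arrays of Inceptionv3 feature activations."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # matrix square root of the product of the two covariance matrices
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```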

The FID metric assumes that feature maps of a pre-trained neural net extracted on real vs. fake images should be similar (the authors argue that this is a good quantitative metric for assessing image quality, correlating well with human judgement).

An important disadvantage of the FID metric is that it has a generalization issue: a model that simply memorizes the training data can obtain a perfect score on these metrics (Razavi et al., 2019).

Variants have been proposed for other modalities, such as the Fréchet Audio Distance (Kilgour et al., 2018) and the Fréchet Video Distance (Unterthiner et al., 2018).

The official implementation is in TensorFlow and can be found here. A PyTorch implementation can be found here.

Clean FID

In 2021, a paper by Parmar et al. indicated that the FID metric is often poorly computed, due to incorrect implementations of low-level image preprocessing (such as resizing of images) in popular frameworks such as PyTorch and TensorFlow. This can produce widely different values for the FID metric.

The official implementation of the cleaner FID version can be found here.
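
As a minimal usage sketch (assuming the clean-fid package's API at the time of writing; check the repository for the exact current interface), the score can be computed directly from two image folders:

```python
# pip install clean-fid
from cleanfid import fid

# placeholder folder paths; point these at directories of real and generated images
score = fid.compute_fid("path/to/real_images", "path/to/generated_images")
print(score)
```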

Note that FID has many, many other variants, including spatial FID (sFID), class-aware FID (CAFD), conditional FID, Fast FID, Memorization-informed FID (MiFID), Unbiased FID, etc.

Precision and Recall

Despite the FID metric being popular and correlating well with human evaluation, Sajjadi et al., 2018 pointed out that, because the FID score is a single scalar, it cannot distinguish between different failure cases: two generative models could obtain the same FID score while generating images that look entirely different. Hence, the authors proposed a novel approach, defining precision (P) and recall (R) for distributions.

Precision measures how similar the generated instances are to the real ones, while recall measures the ability of the generator to synthesize all instances found in the training set. Hence, precision measures quality and recall measures coverage.

These metrics were then further improved by Kynkäänniemi et al., 2019.
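
To give an idea of how these metrics work in practice, below is a minimal NumPy sketch in the spirit of the k-nearest-neighbour formulation of Kynkäänniemi et al., 2019: each feature set defines a "manifold" as the union of hyperspheres around its points (with radius equal to the distance to the k-th nearest neighbour), precision is the fraction of generated features inside the real manifold, and recall is the fraction of real features inside the generated manifold. The function names and the brute-force distance computation are our own illustration; the official implementation is far more efficient:

```python
import numpy as np

def kth_nn_radii(feats, k=3):
    # distance from each point to its k-th nearest neighbour within the same set
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself (distance 0)

def precision_recall(real_feats, gen_feats, k=3):
    r_real = kth_nn_radii(real_feats, k)
    r_gen = kth_nn_radii(gen_feats, k)
    # pairwise distances between generated and real features: (N_gen, N_real)
    d = np.linalg.norm(gen_feats[:, None] - real_feats[None, :], axis=-1)
    precision = (d <= r_real[None, :]).any(axis=1).mean()  # gen samples on the real manifold
    recall = (d.T <= r_gen[None, :]).any(axis=1).mean()    # real samples on the gen manifold
    return float(precision), float(recall)
```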