Evaluating Generative Models

Generative models are a class of machine learning models that aim to generate new data samples resembling the training data. They have gained significant attention in recent years for their ability to generate realistic images, text, and other data types. Evaluating them is challenging, however: metrics used for discriminative models, such as classification accuracy, compare a prediction against a known correct label, and a generated sample has no single correct answer to compare against. This glossary entry gives an overview of the key concepts and techniques used to evaluate generative models.

Inception Score (IS)

The Inception Score (IS) is a popular evaluation metric for generative models, particularly for image generation tasks. It is based on the idea that a good generative model should produce samples that are both diverse and realistic. The IS is computed by classifying the generated samples with a pre-trained classifier (typically the Inception network) and comparing each sample's predicted class distribution p(y|x) with the marginal class distribution p(y), obtained by averaging the predictions over all samples: IS = exp(E_x[KL(p(y|x) || p(y))]). A high IS indicates that the generated samples are realistic (each p(y|x) is confident, so the conditional entropy is low) and diverse (the marginal p(y) is spread over many classes, so its entropy is high).
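As a minimal sketch of the computation described above, the function below takes an array of predicted class probabilities and returns the score. It assumes the probabilities have already been produced by some classifier; the name `inception_score` and the `eps` smoothing term are illustrative choices, not part of any particular library.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities.

    probs: array of shape (n_samples, n_classes); each row is p(y|x)
           for one generated sample and is assumed to sum to 1.
    eps:   small constant (an assumption of this sketch) to avoid log(0).
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal class distribution p(y): average prediction over all samples.
    marginal = probs.mean(axis=0, keepdims=True)
    # KL(p(y|x) || p(y)) for every sample.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    # Exponentiate the mean KL divergence.
    return float(np.exp(kl.mean()))

# Toy inputs: four samples, four classes.
confident_and_diverse = np.eye(4)        # each sample is a different, certain class
uninformative = np.full((4, 4), 0.25)    # every prediction is uniform

score_high = inception_score(confident_and_diverse)
score_low = inception_score(uninformative)
```

With these toy inputs the first score approaches the number of classes, its upper bound, and the second equals the lower bound of 1.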

Frechet Inception Distance (FID)

The Frechet Inception Distance (FID) is another evaluation metric for generative models that addresses some limitations of the IS, most notably that the IS never compares the generated samples with real data. The FID measures the similarity between the distributions of the generated samples and the real data in the feature space of a pre-trained classifier (again, typically the Inception network). Each set of feature vectors is summarized by a Gaussian with its mean and covariance, and the FID is the Frechet distance between the two Gaussians: FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)), where mu and C denote the mean and covariance of the real (r) and generated (g) features. A lower FID indicates that the generated samples are more similar to the real data.
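The Frechet distance between two Gaussians can be sketched in a few lines of NumPy. The function below assumes the feature vectors have already been extracted by some network and have at least two dimensions; the name `frechet_distance` is hypothetical, and the trace of the matrix square root is obtained from eigenvalues, which is one of several possible ways to compute it.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_real, feats_gen: arrays of shape (n_samples, n_features),
    with n_features >= 2 (an assumption of this sketch).
    """
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)

    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((C_r C_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_r C_g, which are real and non-negative in theory.
    # Clipping removes tiny negative values caused by rounding.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt
    return float(mean_term + cov_term)

# Toy features standing in for network activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
same = frechet_distance(real, real)           # identical sets
shifted = frechet_distance(real, real + 3.0)  # same shape, shifted mean
```

Identical feature sets give a distance of essentially zero, while shifting every feature by 3 leaves the covariance term at zero and contributes only the squared mean difference, 8 x 3^2 = 72.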

Perceptual Path Length (PPL)

Perceptual Path Length (PPL) is an evaluation metric for generative models that focuses on the smoothness of the latent space. The idea is that a good generative model should have a smooth and continuous latent space, where small changes in the latent input result in small changes in the output. The PPL is calculated by interpolating between pairs of latent codes, generating outputs at two points a small step apart along each interpolation path, and averaging the perceptual distance between the two outputs, scaled by the squared step size. A lower PPL indicates a smoother and more continuous latent space.
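The procedure can be sketched as follows. Everything model-specific here is an assumption: `generator` and `distance` are hypothetical callables supplied by the caller, linear interpolation is used between latent codes, and the toy usage substitutes a squared Euclidean distance for a learned perceptual distance so that the example runs with NumPy alone.

```python
import numpy as np

def perceptual_path_length(generator, distance, latent_dim,
                           n_pairs=1000, eps=1e-4, seed=0):
    """Monte Carlo estimate of path length in latent space.

    generator: hypothetical callable mapping latents (n, latent_dim) to outputs.
    distance:  hypothetical callable returning one distance per output pair.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.normal(size=(n_pairs, latent_dim))
    z2 = rng.normal(size=(n_pairs, latent_dim))
    t = rng.uniform(0.0, 1.0, size=(n_pairs, 1))

    # Two nearby points on the straight line between z1 and z2.
    z_a = z1 + t * (z2 - z1)
    z_b = z1 + (t + eps) * (z2 - z1)

    # Distance between the two outputs, scaled by the squared step size.
    d = distance(generator(z_a), generator(z_b))
    return float(np.mean(d / eps ** 2))

# Toy stand-ins: linear "generators" and a squared Euclidean distance.
def squared_l2(a, b):
    return ((a - b) ** 2).sum(axis=1)

ppl_smooth = perceptual_path_length(lambda z: z, squared_l2, latent_dim=4)
ppl_rough = perceptual_path_length(lambda z: 3.0 * z, squared_l2, latent_dim=4)
```

In this toy setup the second generator changes three times as fast along every path, so its estimate is about nine times larger.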

Precision, Recall, and F1 Score

Precision, recall, and F1 score are evaluation metrics borrowed from information retrieval and adapted for generative models. Precision is the fraction of generated samples that lie close to the real data, so high precision means the model avoids unrealistic samples. Recall is the fraction of real samples that lie close to the generated data, so high recall means the model covers the diversity of the real data. The F1 score is the harmonic mean of precision and recall, providing a single number that balances the trade-off between the two. These metrics can be computed using nearest-neighbor matching in the feature space of a pre-trained classifier.
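One way to realize the nearest-neighbor matching is sketched below: a point counts as covered if it falls within the k-th-nearest-neighbor radius of at least one point in the other set. This is only one possible variant; the function names, the choice of Euclidean distance, and the default `k=3` are assumptions of the sketch, and the inputs are taken to be precomputed feature vectors.

```python
import numpy as np

def _pairwise_dist(a, b):
    """Euclidean distances between all rows of a and all rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(sq)

def _coverage(query, ref, k):
    """Fraction of query points inside the k-NN ball of some ref point."""
    # After sorting, column 0 is the zero self-distance,
    # so column k is the distance to the k-th nearest neighbour.
    radii = np.sort(_pairwise_dist(ref, ref), axis=1)[:, k]
    d = _pairwise_dist(query, ref)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def precision_recall_f1(real, gen, k=3):
    """real, gen: feature arrays of shape (n, d), each with more than k rows."""
    real = np.asarray(real, dtype=np.float64)
    gen = np.asarray(gen, dtype=np.float64)
    precision = _coverage(gen, real, k)  # generated samples near real data
    recall = _coverage(real, gen, k)     # real samples near generated data
    total = precision + recall
    f1 = 0.0 if total == 0 else 2.0 * precision * recall / total
    return precision, recall, f1

# Toy features: a perfect copy of the real data and a set far away from it.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))
perfect = precision_recall_f1(real, real.copy())
far_off = precision_recall_f1(real, real + 100.0)
```

The pairwise distance matrices grow quadratically with the number of samples, so larger evaluations would need batching or an approximate neighbor search.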


Log-Likelihood

Log-likelihood is a fundamental evaluation metric for generative models that measures the log-probability the model assigns to real data, usually a held-out test set that was not used for training. A higher log-likelihood indicates that the model assigns a higher probability to the real data, suggesting a better fit. However, log-likelihood can be difficult to compute for some generative models, such as Generative Adversarial Networks (GANs), due to the lack of an explicit likelihood function.
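For a model with an explicit density, the metric is straightforward to compute. The sketch below uses a one-dimensional Gaussian as a stand-in for a generative model and compares two parameter settings on the same held-out data; the function name and the toy data are assumptions made for illustration.

```python
import numpy as np

def gaussian_avg_log_likelihood(x, mu, sigma):
    """Average log-density of the samples x under N(mu, sigma^2)."""
    x = np.asarray(x, dtype=np.float64)
    log_density = (-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2.0 * sigma ** 2))
    return float(log_density.mean())

# Toy held-out data drawn from N(2, 1).
rng = np.random.default_rng(0)
held_out = rng.normal(loc=2.0, scale=1.0, size=1000)

# Two candidate "models": one matches the data, one has the wrong mean.
ll_good = gaussian_avg_log_likelihood(held_out, mu=2.0, sigma=1.0)
ll_bad = gaussian_avg_log_likelihood(held_out, mu=0.0, sigma=1.0)
```

The model whose mean matches the data receives the higher average log-likelihood, which is the comparison the metric is meant to support.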

In conclusion, evaluating generative models is an important part of developing and applying them in various domains. Each metric described here captures a different property: the IS and FID summarize sample quality and diversity, PPL measures the smoothness of the latent space, precision and recall separate realism from coverage, and log-likelihood measures how well the model fits real data. Because each metric has its own strengths and limitations, understanding the principles behind them helps data scientists compare models and select the best one for a specific task.