Stochastic Weight Averaging (SWA)

Stochastic Weight Averaging (SWA) is an optimization technique in machine learning that often improves generalization performance. It was introduced by Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson in 2018.


Overview

SWA is a simple method for improving the performance of a wide range of deep learning models. It works by averaging the weights of the model at multiple points along its training trajectory. Unlike conventional training, which keeps only the final point of the optimization trajectory, SWA combines information from many points along that trajectory. This typically yields a more robust model with better generalization.

How it Works

During training, instead of keeping only the final weights, SWA maintains a running average of the weights encountered during optimization. The average is typically updated at regular intervals (for example, once per epoch or every few iterations), starting after an initial burn-in period. The resulting averaged weights tend to lie in a wider, flatter region of the loss surface, which is associated with better generalization on unseen data.
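The running average described above can be sketched in a few lines of plain Python. This is a toy illustration using scalar "weights"; a real implementation would apply the same incremental-mean update to full parameter tensors:

```python
def update_swa(swa_weight, new_weight, n_averaged):
    """Incrementally update the SWA running average.

    After `n_averaged` checkpoints have already been averaged, folding
    in a new weight gives: avg <- (avg * n + w) / (n + 1).
    """
    return (swa_weight * n_averaged + new_weight) / (n_averaged + 1)

# Average the "weights" visited at three checkpoints: 1.0, 2.0, 3.0.
swa = 1.0
for n, w in enumerate([2.0, 3.0], start=1):
    swa = update_swa(swa, w, n)
# swa is now the mean of all visited weights, 2.0
```

The incremental form avoids storing every checkpoint: only the current average and a counter are kept, which is why SWA adds just one extra copy of the model weights.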


Advantages

  1. Improved Generalization: SWA often produces models that generalize better to unseen data, because the averaged weights tend to sit in a wider, flatter minimum of the loss function, which is associated with better generalization.

  2. Robustness: SWA provides a robust solution by considering the entire trajectory of the weights during training, rather than just the final point.

  3. Easy to Implement: SWA can be easily added to any existing training procedure without requiring significant changes to the code.


Limitations

  1. Increased Computation: SWA requires maintaining and updating an additional copy of the model weights during training, which modestly increases memory and compute requirements.

  2. Not Always Beneficial: While SWA often improves generalization, there are cases where it may not provide significant benefits, particularly for models that are already well-optimized.


Applications

SWA has been successfully applied in various domains, including computer vision and natural language processing, and has been used to improve the performance of models on benchmark datasets such as ImageNet and CIFAR-10.


Example

Here is a simple example of how to implement SWA in PyTorch using the torchcontrib package:

import torch
from torchcontrib.optim import SWA

base_optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = SWA(base_optimizer, swa_start=10, swa_freq=5, swa_lr=0.05)

for _ in range(100):
    optimizer.zero_grad()
    loss_fn(model(input), target).backward()
    optimizer.step()

# Replace the model's weights with the SWA running average
optimizer.swap_swa_sgd()


In this example, swa_start sets the step at which averaging begins, swa_freq sets how often the running average is updated, and swa_lr is the learning rate used during the averaging phase. Note that torchcontrib is no longer actively maintained; in PyTorch 1.6 and later, SWA is available natively via torch.optim.swa_utils (AveragedModel and SWALR).
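The interplay of swa_start and swa_freq can be sketched with a small schedule check. This is a minimal illustration of the intended semantics, not torchcontrib code; the helper name should_average is hypothetical:

```python
def should_average(step, swa_start=10, swa_freq=5):
    """Return True when the SWA running average should be updated.

    Averaging begins at step `swa_start` and then repeats every
    `swa_freq` steps thereafter (a sketch of the torchcontrib-style
    swa_start/swa_freq schedule).
    """
    return step >= swa_start and (step - swa_start) % swa_freq == 0

# With swa_start=10 and swa_freq=5, steps 10, 15, 20, 25, ... trigger
# an update of the running average.
averaged_steps = [s for s in range(30) if should_average(s)]
```

Tuning swa_start matters: averaging too early folds in poorly-trained weights, while starting late averages only points near the final solution.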


References

  1. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging Weights Leads to Wider Optima and Better Generalization. arXiv preprint arXiv:1803.05407.

  2. PyTorch Contrib: SWA. (n.d.).

Note: This glossary entry is intended for a technical audience of data scientists. It assumes familiarity with machine learning concepts and terminology.