How to Use Multiple GPUs in PyTorch

Effectively decrease your model’s training time and handle larger datasets by leveraging the expanded computational power of multiple GPUs in PyTorch.

Introduction

In the world of data science and software engineering, it’s a familiar scenario: you’re knee-deep in data - mountains of it - and you’re trying to build a complex model. You can feel your single GPU struggling under the workload, the clock ticking louder, and your deadlines creeping closer. Times like these call for a lifeline - that’s where harnessing the power of multiple GPUs comes in.

By distributing the workload, you can speed up the training process and ultimately realize your computational goals faster and more efficiently. In this article, we’ll explore the why and how of leveraging multi-GPU architectures in PyTorch, a popular deep learning framework, shedding light on its importance in today’s data-driven landscape.

Moving beyond training, employing multiple GPUs also brings considerable advantages during inference. It allows for higher throughput, simultaneously enabling the swift processing of numerous tasks - critical in today’s world, encompassing traffic management, eCommerce recommendations, and real-time analytics. Moreover, a multi-GPU setup adds redundancy, promoting system robustness by ensuring continued operation even if one GPU encounters issues. Together, these advantages of multi-GPU utilization in both training and inference stages constitute a significant shift in enhancing the efficiency and reliability of machine learning (ML) applications. In light of this, understanding the effective use of multiple GPUs is becoming invaluable knowledge.

What is PyTorch?

PyTorch, an open-source ML library backed by Facebook’s AI Research group, is famous for its balance of simplicity and power in ML. It offers a dynamic computational graph that provides exceptional flexibility when building and modifying complex models at runtime.

One characteristic that sets PyTorch apart is its Pythonic nature, making it user-friendly for novice and experienced developers. It provides excellent compatibility with both CPUs and GPUs. It has built-in support for distributed processing and multi-GPU use, facilitating significant model training or prediction speed-ups.
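
As a quick illustration of that hardware flexibility (a minimal sketch, separate from the later examples), the snippet below checks whether CUDA is available, counts the visible GPUs, and moves a tensor onto one of them; the tensor shape is arbitrary and simply mimics a batch of CIFAR-10 images:

import torch

# Inspect the hardware PyTorch can see
print(torch.cuda.is_available())   # True if at least one CUDA GPU is usable
print(torch.cuda.device_count())   # number of visible GPUs, e.g., 2

# The same code runs on CPU or GPU; only the device string changes
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.randn(64, 3, 32, 32).to(device)   # an arbitrary batch of image-sized tensors
print(x.device)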

At its core, PyTorch contains a high-level API for neural networks, promoting reusability and clean code. In the following sections, we’ll explore how to harness these robust features for practical multi-GPU usage.

Why Use Multiple GPUs?

You might ask, “My model runs fine on my workstation! Why do I need multiple GPUs?” However, as ML models grow in complexity and demand more compute, they can push even high-performing single-GPU setups to their limits. Training and inference wait times also grow with that complexity, adding more downtime for you and your team. As such, one can argue for the “opportunity cost” of the engineering hours spent waiting for these ML tasks to finish.

This need for acceleration leads us to the concept of multiple GPUs. Training is typically one of the longest-running ML tasks, but it can be accelerated significantly by sharing the workload among several GPUs. Developers can split the data across different GPUs for processing, a technique known as data parallelism. However, PyTorch arms us with more than just this; it provides other strategies for leveraging multiple GPUs:

  • Data parallelism: This strategy simultaneously processes data segments on different GPUs, speeding up computations.

  • Model parallelism: Crucial for colossal models surpassing a single GPU’s memory capacity, this method distributes different parts of the same model across multiple GPUs. Each GPU calculates its section of the model independently and then passes its output to the following GPU.

  • Pipeline parallelism: A combination of data and model parallelism in which the model is split into stages across GPUs and different mini-batches of data flow through those stages concurrently. Done well, it keeps more GPUs busy at once and can lead to faster training.

In addition to reduced training time, multiple GPUs can process larger models and datasets and add an element of redundancy. The overall acceleration and redundancy improvement validate the investment in setting up multi-GPU environments.

Pros and Cons of Parallelism Methods

As mentioned, PyTorch supports parallelization through three methods: data, model, and pipeline parallelism. Each provides unique advantages and brings its own complexities. Interestingly, while these strategies are typically employed to speed up operations, using the wrong one in a given context may inadvertently lead to slowdowns. For instance, despite its general benefits, data parallelism can hamper performance on small models or with insufficiently large batch sizes because the communication overhead outweighs the gains.

In light of this, understanding each parallelism method’s subtle nuances, benefits, and drawbacks becomes crucial. In the following subsections, we compare each method’s pros and cons, paving the path for you to choose the best parallelism strategy for your needs.

Data Parallelism

Data parallelism in PyTorch involves using a singular model replicated across multiple GPUs. The training data gets split into numerous batches, each fed into a separate GPU for simultaneous processing. The results from each GPU are then consolidated and synchronized to yield the final output. This method utilizes the processing power of all involved GPUs for faster computations and a robust training process.

Pros

  • Simplicity: Data parallelism is straightforward to implement in PyTorch with the DataParallel() or DistributedDataParallel() class. Putting a model into a parallel setup can be as simple as wrapping it in one of these classes.

  • Scalability: Data parallelism scales well with the number of GPUs since it involves splitting data batches and processing them concurrently.

Cons

  • Communication Overhead: Gradient synchronization must occur across the GPUs during each backpropagation pass, which can become a bottleneck, especially as the number of GPUs increases.

  • GPU Memory Overhead: Every GPU must store a full copy of the model parameters plus its own intermediate activations, increasing total memory usage across the system. With DataParallel in particular, outputs are gathered on the default device, which can also leave GPU 0 carrying a disproportionate share of the memory.

Model Parallelism

Model parallelism spreads the layers of the neural network across multiple GPUs. Unlike data parallelism, each GPU holds a unique segment of the model and operates on the entire input batch. Each batch flows sequentially from one GPU to the next, processed in turn by each segment of the model. In other words, different portions of the model run on different GPUs, each handling a specific part of the computation required to train the model.

Pros

  • Handles Large Models: It’s ideal for deploying large models that go beyond the memory capacity of a single GPU, as parts of the model reside on different GPUs.

  • Lower Communication Overhead: Model parallelism can involve less communication than data parallelism, since only the activations (and their gradients) at the boundaries between model segments cross GPUs, rather than the gradients of every parameter.

Cons

  • Complexity: It introduces complexity in implementation as it requires manual model partitioning across the GPUs.

  • Inefficient GPU Utilization: If the model’s layers do not have equal computational complexities, some GPUs might remain idle while others are processing, leading to inefficient GPU utilization.

Pipeline Parallelism

Pipeline parallelism is a strategy combining elements of both data and model parallelism. It segments the model layers across multiple GPUs, similar to model parallelism. Each GPU then processes a different mini-batch of data through its segment of the model, similar to data parallelism. Mini-batches move through the GPUs in a pipelined fashion, with each GPU passing its output to the following GPU in line for further processing, just as data flows through a pipeline. This results in each GPU working on a different mini-batch at any given time, which can enhance computational efficiency and throughput.

Pros

  • Efficiency: This method offers a combination of data and model parallelism, enabling efficient use of GPU resources by dispatching different parts of the model and mini-batches of data across GPUs.

  • Handles Large Models and Batches: Useful when the model and the batch size are too large to fit into a single GPU memory.

Cons

  • Communication Overhead: Activations and their gradients must be passed between GPUs at each stage boundary, and pipeline stalls (idle periods while a GPU waits for work) add further overhead.

  • Complexity: It requires careful partitioning of both the model and batches across the GPUs, making its implementation more challenging.

Implementing Parallelism Using PyTorch

Once you’ve decided which parallelism method best aligns with your project needs and system configuration, it’s time to implement it. The examples below showcase the differences in implementation across the three methods.

Data Parallelism

Step 1: Import PyTorch and Define the Model. In this example, we will use a simple convolutional neural network (CNN) for image classification on the CIFAR-10 dataset.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import transforms

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(32 * 8 * 8, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.functional.relu(x)
        x = self.pool(x)
        x = self.conv2(x)
        x = nn.functional.relu(x)
        x = self.pool(x)
        x = x.view(-1, 32 * 8 * 8)
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.fc2(x)
        return x

Step 2: Next, initialize the model and define the loss function and optimizer. This example will use the cross-entropy loss function and the stochastic gradient descent (SGD) optimizer.

model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Step 3: Create the data loader and move the model to GPUs. In this example, we will use the CIFAR-10 dataset and split it into batches of 64 images. We will also move the model to GPUs using the nn.DataParallel() module. Essentially, nn.DataParallel() wraps the model, and by doing so, it replicates your model on each GPU, splits the input data, and aggregates the output from each GPU.

train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=64, shuffle=True, num_workers=2, pin_memory=True)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(model, device_ids=[0, 1])
model.to(device)

Step 4: Train the model. The final step is to train the model using multiple GPUs. We will loop over the dataset, compute the gradients using backpropagation, and update the weights using the optimizer. We will also print the training loss and accuracy for each epoch. Note that since we have wrapped our model using nn.DataParallel(), PyTorch will handle data distribution to other GPUs, leaving the code logic in your training loop unchanged.

for epoch in range(10):
    running_loss = 0.0
    running_corrects = 0.0

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        _, preds = torch.max(outputs, 1)
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)

    epoch_loss = running_loss / len(train_loader.dataset)
    epoch_acc = running_corrects.double() / len(train_loader.dataset)
    
    print('Epoch [{}/{}], Loss: {:.4f}, Acc: {:.4f}'.format(epoch+1, 10, epoch_loss, epoch_acc))

Model Parallelism

Step 1: Import PyTorch and define the model. In this case, we will define two parts of the model separately and allocate them to different GPUs.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import transforms

class ModelPart1(nn.Module):
    def __init__(self):
        super(ModelPart1, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.functional.relu(x)
        x = self.pool(x)
        return x

class ModelPart2(nn.Module):
    def __init__(self):
        super(ModelPart2, self).__init__()
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(32 * 8 * 8, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.conv2(x)
        x = nn.functional.relu(x)
        x = x.view(-1, 32 * 8 * 8)
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.fc2(x)
        return x

Step 2: Initialize the model parts and define the loss function and optimizer.

device1 = torch.device("cuda:0")
device2 = torch.device("cuda:1")

model_part1 = ModelPart1().to(device1)
model_part2 = ModelPart2().to(device2)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(
    list(model_part1.parameters()) + list(model_part2.parameters()),
    lr=0.001, momentum=0.9
)

Step 3: Create the data loader.

train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.CIFAR10(root='./data', train=True, download=True, 
                                 transform=transforms.ToTensor()),
    batch_size=64, shuffle=True, num_workers=2, pin_memory=True)

Step 4: Train the model. Notice that we now have to manage the control flow across the model segments ourselves, moving the data and intermediate outputs to the right GPU at each step.

for epoch in range(10):
    running_loss = 0.0
    running_corrects = 0.0
    for inputs, labels in train_loader:
        
        # Inputs start on the first GPU; labels go to the second GPU,
        # where the final outputs and the loss will live
        inputs = inputs.to(device1)
        labels = labels.to(device2)
      
        optimizer.zero_grad()

        # Pass through the first part of the model, then move the
        # intermediate output to the second GPU
        intermediates = model_part1(inputs).to(device2)
        
        # Intermediate outputs are passed as inputs to the second part
        outputs = model_part2(intermediates)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        _, preds = torch.max(outputs, 1)
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)
      
    epoch_loss = running_loss / len(train_loader.dataset)
    epoch_acc = running_corrects.double() / len(train_loader.dataset)
    print('Epoch [{}/{}], Loss: {:.4f}, Acc: {:.4f}'.format(epoch+1, 10, epoch_loss, epoch_acc))

Pipeline Parallelism

Step 1: Import libraries and define the model. Pipe expects an nn.Sequential module whose sub-modules have already been placed on their target devices, so here we define the two pipeline stages directly as nn.Sequential blocks, one per GPU.

import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed.rpc as rpc
import torch.distributed.pipeline.sync as sync
import torchvision
from torchvision import transforms

# First pipeline stage, placed on the first GPU
stage1 = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2)
).to("cuda:0")

# Second pipeline stage, placed on the second GPU
stage2 = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
).to("cuda:1")

Step 2: Initialize the RPC framework and wrap the staged model with sync.Pipe(). Pipe is built on PyTorch’s distributed RPC framework, which must be initialized even for a single-process run.

# The RPC framework must be initialized before a Pipe can be created,
# even for a single-process, single-node run
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
rpc.init_rpc("worker", rank=0, world_size=1)

# chunks is the number of micro-batches each mini-batch is split into
model = sync.Pipe(nn.Sequential(stage1, stage2), chunks=8)

Step 3: Define the loss function and optimizer.

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Step 4: Create the data loader.

train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.CIFAR10(root='./data', train=True, 
    download=True, transform=transforms.ToTensor()),
    batch_size=64, shuffle=True, num_workers=2, pin_memory=True)

Step 5: Train the model. Notice how inputs go to the first GPU (cuda:0) while labels go to the last GPU (cuda:1), where the final outputs and the loss live. During the forward and backward passes, Pipe automatically schedules each stage on its corresponding GPU; because its forward pass returns an RRef, we call .local_value() to get the output tensor. Note that pipeline parallelism does introduce pipeline stalls, which may leave GPUs idle at certain times. It works best when each stage’s computation is heavy enough to be worth the overhead of these stalls.

for epoch in range(10):
    running_loss = 0.0
    running_corrects = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to("cuda:0"), labels.to("cuda:1")
        optimizer.zero_grad()

        outputs = model(inputs).local_value()  # Pipe's forward returns an RRef
        loss = criterion(outputs, labels)
        
        loss.backward()
        optimizer.step()

        _, preds = torch.max(outputs, 1)
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)
        
    epoch_loss = running_loss / len(train_loader.dataset)
    epoch_acc = running_corrects.double() / len(train_loader.dataset)
    print('Epoch [{}/{}], Loss: {:.4f}, Acc: {:.4f}'.format(
        epoch + 1, 10, epoch_loss, epoch_acc))

PyTorch Distributed Computing

What if you need more power beyond what’s in your system? PyTorch has native features promoting distributed computing, allowing you to utilize multi-GPUs from multiple nodes. Its DistributedDataParallel() (DDP) module ensures synchronized, multi-GPU training, while the distributed RPC framework and data samplers help manage computation across numerous nodes. Opting for distributed computing over multiple GPUs can confer benefits such as enhanced scalability, better speed, allowance for larger models and datasets, and better resilience in the face of individual node failures.
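
To make this concrete, below is a minimal single-node DistributedDataParallel sketch, assuming the script is launched with torchrun (for example, torchrun --nproc_per_node=2 train_ddp.py, where train_ddp.py is just a hypothetical file name) and that two GPUs are visible. The small nn.Sequential model is a stand-in; in practice you would use your own model.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torchvision
from torchvision import transforms

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A small CNN stand-in; replace with your own model
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(16 * 16 * 16, 10)
    ).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transforms.ToTensor())
    sampler = DistributedSampler(dataset)  # gives each process its own shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=2, pin_memory=True)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for inputs, labels in loader:
            inputs, labels = inputs.to(local_rank), labels.to(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()  # DDP synchronizes gradients during this backward pass
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Unlike nn.DataParallel, DDP runs one process per GPU, which avoids the single-process bottleneck and extends naturally to multiple nodes.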

However, achieving effective distributed computing involves more than just leveraging a single library or framework. Complementary platforms like Dask, Ray, and Saturn Cloud can provide additional functionality and simplify the process. Dask excels in situations requiring out-of-memory computations and complex data preprocessing tasks. Meanwhile, Ray simplifies implementing distributed ML algorithms with a robust task scheduling system.

Additionally, Saturn Cloud supports the hassle-free setup of distributed environments in the cloud with scalable ML scenarios, including GPU support.
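
As a rough illustration of how such a platform is typically used (a minimal sketch only; the scheduler address and train_model function are hypothetical placeholders, not any specific platform's API), a Dask client can connect to an existing cluster and submit ordinary Python functions, such as a PyTorch training run, to its workers:

from dask.distributed import Client

# Connect to a running Dask scheduler; the address is a placeholder that
# your platform (for example, a Saturn Cloud cluster) would provide
client = Client("tcp://scheduler-address:8786")

def train_model(config):
    # Hypothetical wrapper around your PyTorch training loop
    ...
    return "done"

# Run the function on a cluster worker and wait for the result
future = client.submit(train_model, {"lr": 0.001})
print(future.result())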

That said, it is critical to be mindful of the complexity these platforms can introduce. Effective distributed computing requires a careful balance of scale and complexity: initiatives should always ensure that the benefits in speed, scalability, and resilience demonstrably outweigh the operational complexity that distributed computing introduces.

Conclusion

Harnessing the power of multiple GPUs can dramatically accelerate PyTorch model training while also handling larger datasets and models. To that end, PyTorch offers developers several strategies: data parallelism, which splits batches across GPUs to speed up training on large datasets; model parallelism, useful for models too large to fit in a single GPU’s memory; and pipeline parallelism, which combines the two when both the model and the workload outgrow a single GPU.

However, it’s important to remember that these strategies have potential pitfalls, which might inadvertently result in performance slowdowns or bottlenecks. Hence, assessing the specifics of the task and thoughtfully applying the best-fitting technique can prove crucial.

Venturing into distributed computing could also be a potential solution for oversized models and data. Utilizing platforms like Dask, Ray, and Saturn Cloud might provide additional benefits for an easier and more organized implementation.

As the scale and complexity of ML projects continue to increase, effectively using multiple GPUs becomes even more imperative. The ability to nimbly navigate among various strategies in multi-GPU arrangements using PyTorch is fast becoming a fundamental skill in the deep learning domain. Understanding these tools and techniques and their tradeoffs can offer significant benefits, from improving model performance to efficiently managing resources and cost. To learn more about multi-GPU, distributed computing, and how you and your team can utilize these benefits, visit the links below!

Additional Resources


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.