A Field Guide to Crusoe InfiniBand with Terraform

This guide answers the questions that come up when provisioning InfiniBand-connected GPU clusters on Crusoe with Terraform.
Why Crusoe
GPU clouds like Crusoe offer access to top-tier GPUs without the quota approvals and multi-month waitlists common on hyperscalers. Crusoe has H100, H200, GB200, and AMD MI355X instances available. Pricing runs $3.90/GPU-hour for H100 and $4.29/GPU-hour for H200 on-demand, compared to $6-7/GPU-hour on Azure. Spot pricing drops to $1.60/GPU-hour for H100.
Crusoe’s InfiniBand-enabled instances use NVIDIA Quantum-2 NDR networking at 400 Gb/s per port. The fabric is rail-optimized and non-blocking, with GPUDirect RDMA enabled by default. Instances can be grouped into partitions to isolate traffic between workloads. Unlike most GPU cloud providers, Crusoe exposes IB network capacity through their API and Terraform provider, so you can check availability before provisioning rather than discovering capacity issues mid-deployment.
If you'd rather not manage the distributed training infrastructure yourself, Saturn Cloud is an end-to-end code-first AI platform that installs inside your Crusoe account and handles the mechanics out of the box. Chat with our team to learn more.
Why do I need InfiniBand?
Multi-node training synchronizes gradients between GPUs every iteration. The interconnect becomes a bottleneck:
| Interconnect | Bandwidth | All-reduce time (1GB) |
|---|---|---|
| 100GbE | 100 Gbps | ~80ms |
| InfiniBand HDR | 200 Gbps | ~40ms |
| InfiniBand NDR | 400 Gbps | ~20ms |
These times are simple bandwidth-bound estimates: 1 GB is 8 Gb, and 8 Gb over a 100 Gbps link takes roughly 80 ms. At thousands of iterations, this compounds; the 60 ms per-iteration gap between 100GbE and NDR adds up to about 100 minutes of pure communication time over 100,000 iterations. InfiniBand also enables RDMA and GPUDirect, letting GPUs transfer data directly without CPU involvement.
If you’re training on a single node, you don’t need InfiniBand. If you’re scaling beyond one node, you do.
How do I check capacity before provisioning?
You can check IB network capacity in the Crusoe console and select a network manually. If you do that, the Terraform is simpler: just hardcode the ib_network_id and skip the capacity-checking logic.
The examples in this article (and the InfiniBand example in the Crusoe Terraform provider repo) automate network selection, which adds complexity but lets you fail fast in CI if capacity isn’t available.
The crusoe_ib_networks data source shows available InfiniBand networks and their capacity:
data "crusoe_ib_networks" "available" {}
output "ib_networks" {
value = data.crusoe_ib_networks.available.ib_networks
}
Output looks like:
ib_networks = [
  {
    id       = "ib-net-abc123"
    location = "us-east1-a"
    capacities = [
      {
        slice_type = "h100-80gb-sxm-ib.8x"
        quantity   = 128
      }
    ]
  }
]
The quantity is the number of available slices. Each 8-GPU instance consumes 8 slices, so quantity = 128 means room for 16 instances (128 ÷ 8).
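If you want Terraform to surface that headroom directly, a small helper can derive it from the same data source. This is just a sketch; instance_headroom is an illustrative name, and it assumes the crusoe_ib_networks data source defined above:

locals {
  # Number of 8-GPU instances each network can still fit, per capacity entry
  instance_headroom = {
    for net in data.crusoe_ib_networks.available.ib_networks :
    net.id => [for cap in net.capacities : floor(cap.quantity / 8)]
  }
}

output "instance_headroom" {
  value = local.instance_headroom
}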
How do I make Terraform fail early if there isn’t capacity?
Filter networks by location and capacity, then fail if none match:
variable "vm_count" {
default = 4
}
variable "location" {
default = "us-east1-a"
}
variable "slice_type" {
default = "h100-80gb-sxm-ib.8x"
}
data "crusoe_ib_networks" "available" {}
locals {
required_slices = var.vm_count * 8
suitable_networks = [
for net in data.crusoe_ib_networks.available.ib_networks :
net if net.location == var.location && anytrue([
for cap in net.capacities :
cap.slice_type == var.slice_type && cap.quantity >= local.required_slices
])
]
selected_network = length(local.suitable_networks) > 0 ? local.suitable_networks[0] : null
}
# This will fail at plan time if no suitable network exists
resource "crusoe_ib_partition" "training" {
name = "training-partition"
ib_network_id = local.selected_network.id # Fails if null
}
If you want a clearer error message, add a check block:
check "capacity_available" {
assert {
condition = local.selected_network != null
error_message = "No IB network in ${var.location} has ${local.required_slices} slices of ${var.slice_type}"
}
}
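One caveat: check blocks require Terraform 1.5+, and a failed assertion is reported as a warning rather than an error, so it won't stop a run on its own. If you want a hard failure that carries the same message, a lifecycle precondition on the partition resource is one option. A minimal sketch of the partition resource from above with that added:

resource "crusoe_ib_partition" "training" {
  name          = "training-partition"
  ib_network_id = local.selected_network.id

  lifecycle {
    precondition {
      condition     = local.selected_network != null
      error_message = "No IB network in ${var.location} has ${local.required_slices} slices of ${var.slice_type}"
    }
  }
}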
How do I provision instances with InfiniBand?
The key is the host_channel_adapters block that attaches instances to an IB partition:
resource "crusoe_ib_partition" "training" {
name = "training-partition"
ib_network_id = local.selected_network.id
}
resource "crusoe_compute_instance" "node" {
count = var.vm_count
name = "training-node-${count.index}"
type = var.slice_type
image = "ubuntu22.04-nvidia-sxm-docker:latest"
location = var.location
network_interfaces = [{
subnet = crusoe_vpc_subnet.training.id
}]
host_channel_adapters = [{
ib_partition_id = crusoe_ib_partition.training.id
}]
ssh_key = file("~/.ssh/id_ed25519.pub")
}
All instances in the same partition can communicate over InfiniBand. Use separate partitions to isolate different workloads.
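For example, if two jobs share the same project, giving each its own partition on the network keeps their IB traffic separated (a sketch; the names are illustrative):

resource "crusoe_ib_partition" "job_a" {
  name          = "job-a-partition"
  ib_network_id = local.selected_network.id
}

resource "crusoe_ib_partition" "job_b" {
  name          = "job-b-partition"
  ib_network_id = local.selected_network.id
}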
How do I verify InfiniBand is working?
SSH to a node and check:
# 1. Verify the interface exists
ibstat
You should see:
CA 'mlx5_0'
    CA type: MT4129
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400 Gb/sec
        Link layer: InfiniBand
If ibstat returns nothing, you’re on a non-IB instance type or missing the host_channel_adapters config.
# 2. Test bandwidth between two nodes
# On node 0:
ib_write_bw
# On node 1:
ib_write_bw <node-0-ib-ip>
Expected bandwidth for NDR (400 Gb/s): ~48,000 MB/sec.
# 3. Verify NCCL uses InfiniBand
NCCL_DEBUG=INFO MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 RANK=0 WORLD_SIZE=1 \
  python3 -c "import torch; import torch.distributed as dist; dist.init_process_group('nccl'); dist.all_reduce(torch.ones(1, device='cuda'))"
Look for NCCL INFO Using network IB in the output. On Crusoe’s IB-enabled images, NCCL should detect InfiniBand automatically without additional configuration.
Why is NCCL falling back to Ethernet?
If you see this in NCCL_DEBUG output:
NCCL INFO NET/IB : No device found.
NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
Check these in order:
- Wrong instance type: Must end in -ib (e.g., h100-80gb-sxm-ib.8x)
- Missing host_channel_adapters in the Terraform config
- Nodes in different partitions: All nodes in a training job must share a partition
- IB explicitly disabled: Check that NCCL_IB_DISABLE isn't set to 1 in your environment
Full working example
terraform {
  required_providers {
    crusoe = { source = "crusoecloud/crusoe" }
  }
}

variable "vm_count" { default = 4 }
variable "location" { default = "us-east1-a" }
variable "slice_type" { default = "h100-80gb-sxm-ib.8x" }

data "crusoe_ib_networks" "available" {}
locals {
  required_slices = var.vm_count * 8

  suitable_networks = [
    for net in data.crusoe_ib_networks.available.ib_networks :
    net if net.location == var.location && anytrue([
      for cap in net.capacities :
      cap.slice_type == var.slice_type && cap.quantity >= local.required_slices
    ])
  ]

  selected_network = length(local.suitable_networks) > 0 ? local.suitable_networks[0] : null
}
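# Same check block as shown earlier: gives a clear message when no network
# has enough capacity (requires Terraform 1.5+).
check "capacity_available" {
  assert {
    condition     = local.selected_network != null
    error_message = "No IB network in ${var.location} has ${local.required_slices} slices of ${var.slice_type}"
  }
}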
resource "crusoe_vpc_network" "main" {
name = "training-network"
cidr = "10.0.0.0/8"
}
resource "crusoe_vpc_subnet" "main" {
name = "training-subnet"
cidr = "10.0.0.0/16"
location = var.location
network = crusoe_vpc_network.main.id
}
resource "crusoe_ib_partition" "main" {
name = "training-partition"
ib_network_id = local.selected_network.id
}
resource "crusoe_compute_instance" "node" {
count = var.vm_count
name = "node-${count.index}"
type = var.slice_type
image = "ubuntu22.04-nvidia-sxm-docker:latest"
location = var.location
network_interfaces = [{
subnet = crusoe_vpc_subnet.main.id
}]
host_channel_adapters = [{
ib_partition_id = crusoe_ib_partition.main.id
}]
ssh_key = file("~/.ssh/id_ed25519.pub")
}
output "nodes" {
value = [for n in crusoe_compute_instance.node : {
name = n.name
ip = n.network_interfaces[0].public_ipv4.address
}]
}
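With Crusoe credentials configured for the provider, the standard Terraform workflow stands the cluster up; vm_count, location, and slice_type can be overridden at plan time:

terraform init
terraform plan -var="vm_count=4"
terraform apply -var="vm_count=4"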
Running distributed training on your cluster
Once your InfiniBand cluster is provisioned, the next challenge is coordinating multi-node training: injecting rank and leader information, ensuring all workers land on the same IB partition, configuring NCCL, and aggregating logs across nodes.
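Driven by hand, that coordination roughly amounts to running something like the following on every node. This is a sketch: train.py, the master address, and the per-node rank are placeholders you would inject per node through your own tooling.

# Placeholder values: set differently on each node
export MASTER_ADDR=10.0.0.10   # private IP of node 0 (hypothetical)
export NODE_RANK=0             # 0 on the first node, 1 on the second, ...
export NCCL_DEBUG=INFO         # confirm NCCL picks the IB transport

torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  train.py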
If you'd rather not manage the distributed training infrastructure yourself, Saturn Cloud is an end-to-end code-first AI platform that installs inside your Crusoe account and handles the mechanics out of the box. Chat with our team to learn more.