A Field Guide to Crusoe InfiniBand with Terraform

Practical answers to the questions you’ll have when provisioning InfiniBand-connected GPU clusters on Crusoe.

Why Crusoe

GPU clouds like Crusoe offer access to top-tier GPUs without the quota approvals and multi-month waitlists common on hyperscalers. Crusoe has H100, H200, GB200, and AMD MI355X instances available. Pricing runs $3.90/GPU-hour for H100 and $4.29/GPU-hour for H200 on-demand, compared to $6-7/GPU-hour on Azure. Spot pricing drops to $1.60/GPU-hour for H100.

Crusoe’s InfiniBand-enabled instances use NVIDIA Quantum-2 NDR networking at 400 Gb/s per port. The fabric is rail-optimized and non-blocking, with GPUDirect RDMA enabled by default. Instances can be grouped into partitions to isolate traffic between workloads. Unlike most GPU cloud providers, Crusoe exposes IB network capacity through their API and Terraform provider, so you can check availability before provisioning rather than discovering capacity issues mid-deployment.

Why do I need InfiniBand?

Multi-node training synchronizes gradients between GPUs every iteration. The interconnect becomes a bottleneck:

Interconnect      Bandwidth   All-reduce time (1 GB)
100GbE            100 Gbps    ~80 ms
InfiniBand HDR    200 Gbps    ~40 ms
InfiniBand NDR    400 Gbps    ~20 ms

At thousands of iterations, this compounds: over 100,000 training steps, the ~60 ms gap between 100GbE and NDR adds up to roughly 100 minutes spent purely on communication. InfiniBand also enables RDMA and GPUDirect, letting GPUs transfer data directly without CPU involvement.

If you’re training on a single node, you don’t need InfiniBand. If you’re scaling beyond one node, you do.

How do I check capacity before provisioning?

You can check IB network capacity in the Crusoe console and select a network manually. If you do that, the Terraform is simpler: just hardcode the ib_network_id and skip the capacity-checking logic.

The examples in this article (and the InfiniBand example in the Crusoe Terraform provider repo) automate network selection, which adds complexity but lets you fail fast in CI if capacity isn’t available.

The crusoe_ib_networks data source shows available InfiniBand networks and their capacity:

data "crusoe_ib_networks" "available" {}

output "ib_networks" {
  value = data.crusoe_ib_networks.available.ib_networks
}

Output looks like:

ib_networks = [
  {
    id        = "ib-net-abc123"
    location  = "us-east1-a"
    capacities = [
      {
        slice_type = "h100-80gb-sxm-ib.8x"
        quantity   = 128
      }
    ]
  }
]

The quantity is the number of available slices. Each 8-GPU instance consumes 8 slices. So quantity = 128 means room for 16 instances (128 ÷ 8).
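
If you want that arithmetic surfaced by Terraform itself, a small locals block can compute it per slice type. This is a sketch, not something the provider returns: it reuses the data source above and divides each capacity entry by 8.

locals {
  # Hypothetical helper: how many 8-GPU instances each capacity entry of the
  # first returned network can hold
  max_instances = {
    for cap in data.crusoe_ib_networks.available.ib_networks[0].capacities :
    cap.slice_type => floor(cap.quantity / 8)
  }
}

output "max_instances" {
  value = local.max_instances
}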

How do I make Terraform fail early if there isn’t capacity?

Filter networks by location and capacity, then fail if none match:

variable "vm_count" {
  default = 4
}

variable "location" {
  default = "us-east1-a"
}

variable "slice_type" {
  default = "h100-80gb-sxm-ib.8x"
}

data "crusoe_ib_networks" "available" {}

locals {
  required_slices = var.vm_count * 8

  suitable_networks = [
    for net in data.crusoe_ib_networks.available.ib_networks :
    net if net.location == var.location && anytrue([
      for cap in net.capacities :
      cap.slice_type == var.slice_type && cap.quantity >= local.required_slices
    ])
  ]

  selected_network = length(local.suitable_networks) > 0 ? local.suitable_networks[0] : null
}

# This will fail at plan time if no suitable network exists
resource "crusoe_ib_partition" "training" {
  name          = "training-partition"
  ib_network_id = local.selected_network.id  # Fails if null
}

If you want a clearer message, add a check block, keeping in mind that check assertions surface as warnings rather than plan-stopping errors, so the hard stop still comes from the null reference above. For a plan-blocking error that carries your own message, a lifecycle precondition (sketched after the check block) is the more direct tool:

check "capacity_available" {
  assert {
    condition     = local.selected_network != null
    error_message = "No IB network in ${var.location} has ${local.required_slices} slices of ${var.slice_type}"
  }
}
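
A minimal sketch of the precondition variant, replacing the earlier partition declaration. It uses the same condition and message, but a resource precondition stops the plan with your error_message; try() keeps the null case from tripping the less readable attribute-access error first:

resource "crusoe_ib_partition" "training" {
  name          = "training-partition"
  ib_network_id = try(local.selected_network.id, null)  # null-safe; the precondition supplies the clear error

  lifecycle {
    precondition {
      condition     = local.selected_network != null
      error_message = "No IB network in ${var.location} has ${local.required_slices} slices of ${var.slice_type}"
    }
  }
}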

How do I provision instances with InfiniBand?

The key is the host_channel_adapters block that attaches instances to an IB partition:

resource "crusoe_ib_partition" "training" {
  name          = "training-partition"
  ib_network_id = local.selected_network.id
}

resource "crusoe_compute_instance" "node" {
  count    = var.vm_count
  name     = "training-node-${count.index}"
  type     = var.slice_type
  image    = "ubuntu22.04-nvidia-sxm-docker:latest"
  location = var.location

  network_interfaces = [{
    subnet = crusoe_vpc_subnet.training.id
  }]

  host_channel_adapters = [{
    ib_partition_id = crusoe_ib_partition.training.id
  }]

  ssh_key = file("~/.ssh/id_ed25519.pub")
}

All instances in the same partition can communicate over InfiniBand. Use separate partitions to isolate different workloads.
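
For instance, a sketch of two isolated partitions on the same IB network (names here are illustrative); point each workload’s instances at its own partition via host_channel_adapters:

resource "crusoe_ib_partition" "team_a" {
  name          = "team-a-partition"
  ib_network_id = local.selected_network.id
}

resource "crusoe_ib_partition" "team_b" {
  name          = "team-b-partition"
  ib_network_id = local.selected_network.id
}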

How do I verify InfiniBand is working?

SSH to a node and check:

# 1. Verify the interface exists
ibstat

You should see:

CA 'mlx5_0'
    CA type: MT4129
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400 Gb/sec
        Link layer: InfiniBand

If ibstat returns nothing, you’re on a non-IB instance type or missing the host_channel_adapters config.

# 2. Test bandwidth between two nodes
# On node 0:
ib_write_bw

# On node 1:
ib_write_bw <node-0-ib-ip>

Expected bandwidth for NDR (400 Gb/s): roughly 48,000 MB/sec.
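
If the measured number comes in far lower, it can help to pin the test to the HCA that ibstat reported and switch the units so the result compares directly against the 400 Gb/s line rate. A sketch using standard perftest flags (-d selects the device, --report_gbits reports in Gb/s):

# On node 0 (server):
ib_write_bw -d mlx5_0 --report_gbits

# On node 1 (client):
ib_write_bw -d mlx5_0 --report_gbits <node-0-ib-ip>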

# 3. Verify NCCL uses InfiniBand (run this inside your usual launcher, e.g. torchrun,
#    so MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are set for init_process_group)
NCCL_DEBUG=INFO python3 -c "import torch.distributed as dist; dist.init_process_group('nccl')"

Look for NCCL INFO Using network IB in the output. On Crusoe’s IB-enabled images, NCCL should detect InfiniBand automatically without additional configuration.

Why is NCCL falling back to Ethernet?

If you see this in NCCL_DEBUG output:

NCCL INFO NET/IB : No device found.
NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>

Check these in order:

  1. Wrong instance type: Must be an -ib variant (e.g., h100-80gb-sxm-ib.8x)
  2. Missing host_channel_adapters in the Terraform config
  3. Nodes in different partitions: All nodes in a training job must share a partition
  4. IB explicitly disabled: Check that NCCL_IB_DISABLE isn’t set to 1 in your environment (a quick check is shown below)
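
A quick way to check the last item on each node (NCCL_IB_DISABLE and the other NCCL_IB_* variables are standard NCCL environment variables):

# Print any NCCL InfiniBand overrides in the current environment
env | grep '^NCCL_IB' || echo "no NCCL_IB_* variables set"

# If NCCL_IB_DISABLE=1 shows up, unset it before relaunching training
unset NCCL_IB_DISABLE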

Full working example

terraform {
  required_providers {
    crusoe = { source = "crusoecloud/crusoe" }
  }
}

variable "vm_count"   { default = 4 }
variable "location"   { default = "us-east1-a" }
variable "slice_type" { default = "h100-80gb-sxm-ib.8x" }

data "crusoe_ib_networks" "available" {}

locals {
  required_slices = var.vm_count * 8
  suitable_networks = [
    for net in data.crusoe_ib_networks.available.ib_networks :
    net if net.location == var.location && anytrue([
      for cap in net.capacities :
      cap.slice_type == var.slice_type && cap.quantity >= local.required_slices
    ])
  ]
  selected_network = local.suitable_networks[0] # errors at plan time if no suitable network exists
}

resource "crusoe_vpc_network" "main" {
  name = "training-network"
  cidr = "10.0.0.0/8"
}

resource "crusoe_vpc_subnet" "main" {
  name     = "training-subnet"
  cidr     = "10.0.0.0/16"
  location = var.location
  network  = crusoe_vpc_network.main.id
}

resource "crusoe_ib_partition" "main" {
  name          = "training-partition"
  ib_network_id = local.selected_network.id
}

resource "crusoe_compute_instance" "node" {
  count    = var.vm_count
  name     = "node-${count.index}"
  type     = var.slice_type
  image    = "ubuntu22.04-nvidia-sxm-docker:latest"
  location = var.location

  network_interfaces = [{
    subnet = crusoe_vpc_subnet.main.id
  }]

  host_channel_adapters = [{
    ib_partition_id = crusoe_ib_partition.main.id
  }]

  ssh_key = file("~/.ssh/id_ed25519.pub")
}

output "nodes" {
  value = [for n in crusoe_compute_instance.node : {
    name = n.name
    ip   = n.network_interfaces[0].public_ipv4.address
  }]
}
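
To run it, assuming Crusoe provider credentials are already configured:

terraform init
terraform plan    # fails at this step if no IB network in var.location has enough capacity
terraform apply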

Running distributed training on your cluster

Once your InfiniBand cluster is provisioned, the next challenge is coordinating multi-node training: injecting rank and leader information, ensuring all workers land on the same IB partition, configuring NCCL, and aggregating logs across nodes.
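
As a preview, a minimal sketch of the launch step with torchrun, assuming a 4-node cluster, a hypothetical train.py, node 0’s private IP as the rendezvous leader, and a per-node NODE_RANK (0 through 3) injected by whatever orchestrates the nodes:

torchrun \
  --nnodes=4 \
  --node_rank=$NODE_RANK \
  --nproc_per_node=8 \
  --master_addr=<node-0-private-ip> \
  --master_port=29500 \
  train.py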