Moving Gen AI Workloads from Hyperscalers to Crusoe Cloud: A Practical Migration Guide

A step-by-step guide for migrating production gen AI workloads from AWS, GCP, or Azure to Crusoe Cloud, covering planning, execution, optimization, and common challenges.

If you’re running gen AI workloads on AWS, GCP, or Azure, you’re likely experiencing the GPU availability crunch: multi-month waitlists for H100s, capacity reservations that require long-term commitments, and pricing that can reach $12+ per GPU-hour. Crusoe Cloud offers immediate access to NVIDIA’s latest GPUs (H100, H200, B200, GB200) and AMD’s MI300X/MI355X starting at $3.90/hour per GPU on-demand and $1.60/hour for spot instances, with managed Kubernetes, managed inference, and 99.98% uptime. Crusoe operates in US, Canada, and EU regions.

The Hyperscaler GPU Bottleneck

The challenge with running AI workloads on traditional hyperscalers comes down to access and economics. AWS, GCP, and Azure all offer powerful GPU instances, but getting access to them requires navigating quota systems, capacity reservations, and often multi-year commitments. An 8-GPU H100 instance costs $98.32/hour on Azure, $88.49/hour on GCP, and requires capacity reservations or UltraClusters on AWS just to guarantee availability. Over a month of continuous usage, you’re looking at $64,000-$72,000 per instance.

Crusoe provides immediate on-demand access to H100 instances at $3.90 per GPU-hour without sales calls or quota approvals. An 8-GPU cluster costs $31.20/hour, or about $22,800/month (a savings of roughly $41,000-$49,000 per month compared to hyperscaler on-demand pricing). For fault-tolerant workloads, spot pricing brings H100s down to $1.60/hour per GPU ($12.80/hour for 8 GPUs, or $9,300/month), with 7-day advance notice before interruptions.
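
As a quick sanity check on these figures, here is a minimal back-of-the-envelope sketch in Python (assuming roughly 730 hours per month and the list prices quoted above; actual bills vary with usage and negotiated discounts):

# Rough monthly cost comparison for an 8-GPU H100 node
# (assumes ~730 hours/month and the list prices quoted above)
HOURS_PER_MONTH = 730

prices_per_node_hour = {
    "azure_on_demand": 98.32,
    "gcp_on_demand": 88.49,
    "crusoe_on_demand": 3.90 * 8,   # $31.20/hour
    "crusoe_spot": 1.60 * 8,        # $12.80/hour
}

for name, hourly in prices_per_node_hour.items():
    print(f"{name:>18}: ${hourly * HOURS_PER_MONTH:>10,.0f}/month")

savings = prices_per_node_hour["gcp_on_demand"] - prices_per_node_hour["crusoe_on_demand"]
print(f"On-demand savings vs GCP: ${savings * HOURS_PER_MONTH:,.0f}/month")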

Unlike most GPU-focused providers, Crusoe includes managed Kubernetes, managed inference for model deployment, and a managed Slurm workload manager. Crusoe also offers both NVIDIA and AMD GPUs, with AMD MI300X at $3.45/hour on-demand and $0.95/hour spot, providing high-performance alternatives for teams looking to optimize costs or avoid NVIDIA-only lock-in.

Understanding What Crusoe Offers

GPU Hardware and Availability

Crusoe provides access to both NVIDIA’s latest GPU architectures (H100, H200, B200, GB200) and AMD’s high-performance compute GPUs (MI300X, MI355X). All instances include RDMA networking with InfiniBand standard across GPU nodes. You can provision 8-16 GPU clusters immediately on-demand or via spot pricing. Larger clusters are available through reserved capacity contracts with deeper discounts (up to 81% savings on 3-year terms).

Crusoe operates in US (Virginia, Texas), Canada (Alberta), and EU (Iceland, Norway) regions. Instance configurations range from single GPU instances to multi-node clusters with InfiniBand for low-latency gradient synchronization. Crusoe’s AutoClusters technology provides automatic node recovery for 99.98% uptime.

AMD GPU Advantage: AMD MI300X GPUs offer 192GB of HBM3 memory (versus 80GB on H100), making them particularly well-suited for large language models and memory-intensive workloads. At $3.45/hour on-demand and $0.95/hour spot, MI300X provides compelling price-performance for teams willing to optimize for AMD’s ROCm platform.

Managed Services

Crusoe Managed Kubernetes (CMK) handles cluster provisioning, upgrades, and control plane management. You define GPU node pools through Terraform or the API, and Crusoe manages the Kubernetes control plane, node scaling, and GPU driver installation. Your existing Kubernetes manifests, Helm charts, and operators work without modification. CMK supports autoscaling, integrated monitoring, and topology-aware GPU placement for optimal performance.

Crusoe Managed Inference provides a fully managed deployment solution for model serving that handles infrastructure, maintenance, and security. This service delivers low-latency, high-throughput predictions without requiring you to manage inference infrastructure directly (a direct alternative to SageMaker Endpoints or Vertex AI Prediction).

Slurm workload manager is available as a managed service for teams migrating from HPC environments or needing batch job scheduling with advanced resource allocation capabilities beyond standard Kubernetes batch jobs.

Infrastructure Capabilities

Storage options include:

  • Persistent disks (block storage): Powered by Lightbits Labs technology, optimized for AI/ML workloads with lower and more consistent latency, higher throughput, and linear scalability compared to traditional network block storage.

  • Shared disks: High-performance shared network filesystem powered by VAST Data, designed for concurrent access from multiple VMs. Delivers up to 200 MBps read and 40 MBps write throughput per TiB of provisioned storage. Ideal for training datasets that need to be accessed from multiple GPU nodes simultaneously.

  • Object storage: Crusoe doesn’t currently offer a native managed object storage service. Teams typically deploy self-managed MinIO for S3-compatible object storage on Crusoe compute, use existing hyperscaler object storage during hybrid migration, or leverage third-party S3-compatible services.

RDMA networking with InfiniBand comes standard on all GPU instances, providing the high-bandwidth, low-latency networking essential for distributed training. Crusoe’s topology-aware placement ensures GPU instances are optimally positioned to minimize communication latency.

Automation and Infrastructure-as-Code

Crusoe offers an official Terraform provider with comprehensive resource coverage. If you’re managing hyperscaler infrastructure through Terraform, your migration involves translating resource definitions rather than rewriting automation logic. Crusoe also provides a CLI (available via brew and apt), Go client library, and REST APIs for scripting and automation.

Access Management

Crusoe Cloud uses an organization-based access control model with three predefined roles:

  • Admin: Full control over all organization resources and member management
  • Editor: Can create, modify, and delete resources but cannot manage organization members
  • Reader: View-only access to resources

Organizations can enforce Single Sign-On (SSO) via OIDC-based authentication for secure console access. API access is managed through access keys and secret keys, and audit logs provide 90-day activity history. The simplicity of three roles (versus hundreds of AWS IAM policies) reduces misconfiguration risks and onboarding complexity.

Pre-Migration Assessment

Identify Migration Blockers and Dependencies

Before migrating, assess these potential blockers:

Hyperscaler service dependencies: Catalog which AWS, GCP, or Azure services your workloads use. SageMaker, DynamoDB, Step Functions, and similar services don’t have Crusoe equivalents. You’ll need to replace them with open-source alternatives, use third-party managed services, or maintain hybrid connectivity.

Data residency constraints: Crusoe operates in US (Virginia, Texas), Canada (Alberta), and EU (Iceland, Norway). Check if these regions satisfy your compliance requirements. If you need presence in Asia-Pacific, Middle East, or other regions, Crusoe may not work yet.

Existing cloud contracts: Calculate early termination fees for reserved instances or enterprise agreements. Factor these into migration ROI. Sometimes it makes financial sense to wait until commitments expire or run hybrid temporarily.

Infrastructure management skills: Crusoe doesn’t offer managed databases or object storage. Your team needs Kubernetes experience to deploy and manage these services, or budget for third-party managed providers. Saturn Cloud on Crusoe reduces this burden by providing a managed AI platform layer.

AMD GPU compatibility: If cost optimization through AMD MI300X is part of your strategy, audit your code for CUDA-only libraries. Most PyTorch and TensorFlow code runs on ROCm without changes, but custom CUDA kernels require porting.

Migration Planning

Choosing Your Migration Strategy

Lift-and-shift works if your workloads already run on Kubernetes. Move your containers to Crusoe clusters without rewriting application logic. Best for teams with minimal dependencies on hyperscaler-specific services.

Hybrid lets you use Crusoe GPUs immediately while keeping databases and object storage on your current provider. You’ll pay egress costs, but GPU savings usually offset this. Good for testing Crusoe before full commitment.

Spot-first takes advantage of Crusoe’s spot pricing with 7-day advance notice. Design training jobs to checkpoint frequently and resume automatically. With H100 spot at $1.60/hour versus $3.90 on-demand, this strategy saves 59% on compute for fault-tolerant workloads.

Full migration moves all infrastructure to Crusoe. Requires deploying your own object storage (MinIO) or using third-party services, plus managing databases via Kubernetes operators or external providers. Maximizes cost savings but increases operational complexity.

Setting Up Your Crusoe Environment

Account setup and access management:

  1. Create your Crusoe account: Sign up at crusoe.ai and create your organization. The account creator automatically gets the Admin role.

  2. Install and configure the Crusoe CLI:

# Install CLI on macOS
brew install crusoe

# Install CLI on Ubuntu/Debian
curl -fsSL https://console.crusoecloud.com/downloads/crusoe-cli_latest_amd64.deb -o crusoe-cli.deb
sudo apt install ./crusoe-cli.deb

# Initialize CLI configuration and authenticate
crusoe config init
crusoe auth login

# Verify installation
crusoe --version

  3. Set up user authentication: Add team members to your organization and assign them Admin, Editor, or Reader roles. For enterprise teams, configure SSO through your identity provider (Okta, Azure AD, etc.) for OIDC-based authentication.

  4. Create API keys for automation:

# Generate API access key and secret key via the console
# Store credentials in ~/.crusoe/config

# Verify API access
crusoe projects list

  5. Configure service account credentials for CI/CD: Store Crusoe API keys as secrets in your CI/CD platform:

GitHub Actions example:

- name: Authenticate with Crusoe
  env:
    CRUSOE_ACCESS_KEY: ${{ secrets.CRUSOE_ACCESS_KEY }}
    CRUSOE_SECRET_KEY: ${{ secrets.CRUSOE_SECRET_KEY }}
  run: |
    mkdir -p ~/.crusoe
    echo "[default]" > ~/.crusoe/config
    echo "access_key_id = $CRUSOE_ACCESS_KEY" >> ~/.crusoe/config
    echo "secret_key = $CRUSOE_SECRET_KEY" >> ~/.crusoe/config    

GitLab CI example:

before_script:
  - mkdir -p ~/.crusoe
  - echo "[default]" > ~/.crusoe/config
  - echo "access_key_id = $CRUSOE_ACCESS_KEY" >> ~/.crusoe/config
  - echo "secret_key = $CRUSOE_SECRET_KEY" >> ~/.crusoe/config

Storage configuration: Provision persistent disks for instance storage and shared disks for datasets that need concurrent access from multiple GPU nodes. For object storage, either deploy MinIO on Crusoe compute, keep using hyperscaler object storage during hybrid operation, or use third-party S3-compatible services.

Mapping Hyperscaler Services to Crusoe Equivalents

Hyperscaler Service | Crusoe Equivalent | Notes
EKS, GKE, AKS | Crusoe Managed Kubernetes (CMK) | Your Kubernetes manifests work without changes. Update Terraform for cluster provisioning and GPU node pools.
SageMaker Training, Vertex AI Training, Azure ML Jobs | Kubernetes Jobs or Kubeflow operators | Replace managed training with containerized jobs on CMK. Or use Saturn Cloud for a managed experience.
SageMaker Endpoints, Vertex AI Prediction | Crusoe Managed Inference or self-hosted Triton/TorchServe | Crusoe Managed Inference provides similar fully-managed model serving.
RDS, Cloud SQL, Azure Database | Self-hosted via Kubernetes operators or third-party (Aiven, etc.) | Crusoe doesn’t offer managed databases. Deploy PostgreSQL/MySQL operators or use external managed services.
S3, GCS, Azure Blob Storage | MinIO, third-party S3-compatible services, or hybrid access | No native managed object storage. Deploy MinIO yourself, use Backblaze B2/Wasabi/Cloudflare R2, or access hyperscaler storage during migration.
SQS, Pub/Sub, Azure Queue Storage | Self-hosted RabbitMQ/Kafka on Kubernetes or Confluent Cloud | No managed queue service. Run your own or use cloud-agnostic providers.
Step Functions, Cloud Composer, Data Factory | Airflow/Prefect/Argo Workflows on Kubernetes | No managed workflow orchestration. Deploy open-source tools on CMK.

Step-by-Step Migration Process

Data Migration

For teams using hyperscaler object storage (S3, GCS, Azure Blob), you have several migration strategies:

Strategy 1: Deploy MinIO on Crusoe

# Deploy MinIO on Crusoe Kubernetes for S3-compatible storage
kubectl create namespace minio
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install minio bitnami/minio \
  --namespace minio \
  --set auth.rootUser=admin \
  --set auth.rootPassword=$(openssl rand -base64 32) \
  --set persistence.size=10Ti \
  --set resources.requests.memory=32Gi

# Transfer data using rclone (assumes "s3" and "minio" remotes are already
# configured via `rclone config`)
rclone copy s3:my-aws-bucket/training-data minio:ml-training-data/training-data \
  --progress \
  --transfers 16 \
  --checkers 32

Strategy 2: Hybrid storage approach

Leave data on hyperscaler storage initially and access it from Crusoe GPU instances. You’ll pay egress costs from AWS/GCP/Azure, but this lets you start using Crusoe GPUs immediately. For workloads where compute costs dwarf data transfer costs, this can be net positive.

# Access AWS S3 from Crusoe GPU instances
import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='YOUR_AWS_KEY',
    aws_secret_access_key='YOUR_AWS_SECRET'
)

# Load training data directly from S3; network egress costs apply,
# but GPU savings may offset this
s3.download_file('my-aws-bucket', 'training-data/dataset.tar', '/data/dataset.tar')  # example key/paths
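
Whether the hybrid approach is net positive is easy to estimate: compare monthly GPU savings against egress charges. The sketch below is illustrative; the $0.09/GB egress rate and dataset figures are assumptions, so substitute your provider's actual rates and your own data volumes:

# Break-even estimate: GPU savings vs. egress cost for reading data remotely.
# The $0.09/GB egress rate is an assumed typical rate -- check your own bill.
egress_rate_per_gb = 0.09
dataset_gb = 5_000                      # e.g., a 5 TB training corpus (illustrative)
reads_per_month = 4                     # how many times the full dataset is streamed

hyperscaler_node_hour = 88.49           # 8x H100 on-demand (GCP figure from above)
crusoe_node_hour = 31.20                # 8x H100 on Crusoe on-demand
hours_per_month = 730

gpu_savings = (hyperscaler_node_hour - crusoe_node_hour) * hours_per_month
egress_cost = dataset_gb * reads_per_month * egress_rate_per_gb

print(f"Monthly GPU savings: ${gpu_savings:,.0f}")
print(f"Monthly egress cost: ${egress_cost:,.0f}")
print(f"Net benefit:         ${gpu_savings - egress_cost:,.0f}")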

Strategy 3: Third-party object storage

Use cloud-agnostic S3-compatible services:

# Transfer to Backblaze B2, Wasabi, or Cloudflare R2
rclone copy s3:my-aws-bucket/training-data b2:ml-training-data/training-data \
  --progress \
  --transfers 16 \
  --checkers 32

For large datasets, prioritize migrating frequently accessed data (active training sets, recent checkpoints) first. Cold archival data can stay on hyperscaler storage until needed.

Workload Migration

Containerized training jobs: Update your Kubernetes manifests to point to Crusoe clusters, update data paths to reference Crusoe storage or shared disks, and update image registries if you’re moving those as well.

Example Kubernetes Job manifest adaptation:

apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: your-registry/llm-trainer:latest
        resources:
          limits:
            nvidia.com/gpu: 8  # or amd.com/gpu for MI300X
        volumeMounts:
        - name: training-data
          mountPath: /data
        - name: checkpoints
          mountPath: /checkpoints
      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: shared-disk-pvc  # Crusoe Shared Disk
      - name: checkpoints
        persistentVolumeClaim:
          claimName: checkpoint-pvc  # Crusoe Persistent Disk
      nodeSelector:
        crusoe.ai/gpu-type: h100  # or mi300x for AMD
      restartPolicy: Never

Model serving infrastructure: For inference, you have two main options on Crusoe:

Crusoe Managed Inference: Deploy your models using Crusoe’s fully managed inference platform, which handles infrastructure, scaling, and maintenance. This provides a similar experience to SageMaker Endpoints or Vertex AI Prediction without managing inference infrastructure yourself.

Self-hosted inference: Deploy your model serving framework (Triton Inference Server, TorchServe, TensorFlow Serving, or custom FastAPI services) on Crusoe Kubernetes with Horizontal Pod Autoscaler and ingress controllers.

Example self-hosted Triton deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: grpc
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-repository
          mountPath: /models
      volumes:
      - name: model-repository
        persistentVolumeClaim:
          claimName: models-pvc
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
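
Once the Triton deployment is exposed through a Service or ingress, clients call it over Triton's standard KServe v2 HTTP API. A minimal sketch, assuming a hypothetical model named my_model with a single FP32 input (adjust the endpoint URL, model name, input name, and shape to match your model repository):

import requests

# Hypothetical endpoint and model name -- replace with your ingress/Service URL
TRITON_URL = "http://triton.example.internal:8000"
MODEL_NAME = "my_model"

# KServe v2 inference request; input name/shape/datatype must match config.pbtxt
payload = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}

resp = requests.post(f"{TRITON_URL}/v2/models/{MODEL_NAME}/infer", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["outputs"][0]["data"])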

Leveraging Spot Instances

Crusoe’s spot pricing offers savings of up to roughly 90% compared with hyperscaler on-demand GPU pricing (and 59% versus Crusoe on-demand), with the unique advantage of 7-day advance notice before interruptions. Design your training workloads to take advantage of spot instances:

Checkpointing strategy:

import os
from datetime import datetime

import torch


class SpotTrainer:
    def __init__(self, model, checkpoint_dir='/checkpoints'):
        self.model = model
        self.checkpoint_dir = checkpoint_dir
        self.checkpoint_frequency = 1  # Save every epoch

    def save_checkpoint(self, epoch, optimizer, loss):
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            'timestamp': datetime.now()
        }
        path = os.path.join(self.checkpoint_dir, f'checkpoint_epoch_{epoch}.pt')
        torch.save(checkpoint, path)
        print(f"Checkpoint saved: {path}")

    def load_checkpoint(self, checkpoint_path):
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        self.model.load_state_dict(checkpoint['model_state_dict'])
        return checkpoint['epoch'], checkpoint['optimizer_state_dict']

    def train_step(self, data, target, optimizer):
        # Minimal example step; replace with your actual forward/loss logic
        optimizer.zero_grad()
        output = self.model(data)
        loss = torch.nn.functional.cross_entropy(output, target)
        loss.backward()
        optimizer.step()
        return loss.item()

    def train(self, dataloader, optimizer, num_epochs, resume_from=None):
        start_epoch = 0

        # Resume from checkpoint if available
        if resume_from and os.path.exists(resume_from):
            last_epoch, optimizer_state = self.load_checkpoint(resume_from)
            optimizer.load_state_dict(optimizer_state)
            start_epoch = last_epoch + 1  # continue with the next epoch
            print(f"Resumed from epoch {last_epoch}")

        for epoch in range(start_epoch, num_epochs):
            # Training loop
            for batch_idx, (data, target) in enumerate(dataloader):
                loss = self.train_step(data, target, optimizer)

            # Save checkpoint every epoch
            if epoch % self.checkpoint_frequency == 0:
                self.save_checkpoint(epoch, optimizer, loss)

Handling spot interruptions:

With 7 days of advance notice, you can do one of the following (see the decision sketch after this list):

  1. Complete training runs that are within 7 days of finishing
  2. Migrate to on-demand instances if needed
  3. Checkpoint and resume on new spot instances
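
A simple way to act on the notice is to compare the time left before the interruption with the projected time to finish, then either let the run complete or checkpoint and reschedule. A minimal sketch (the interruption timestamp and epoch timing are illustrative inputs you would supply from your own tooling):

from datetime import datetime, timedelta

def plan_for_interruption(interruption_at: datetime,
                          epochs_remaining: int,
                          avg_epoch_duration: timedelta) -> str:
    """Decide how to respond to a spot interruption notice."""
    time_left = interruption_at - datetime.now()
    time_needed = epochs_remaining * avg_epoch_duration

    if time_needed <= time_left:
        return "finish"        # run will complete before the interruption
    if time_left < timedelta(hours=1):
        return "checkpoint"    # save state now and resume elsewhere
    return "migrate"           # move to on-demand or new spot capacity early

# Example: notice received 7 days ahead, 40 epochs left at ~3 hours each
decision = plan_for_interruption(datetime.now() + timedelta(days=7), 40, timedelta(hours=3))
print(decision)  # 40 * 3h = 120h < 168h -> "finish"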

Cost optimization: Use spot instances for development, experimentation, and fault-tolerant training workloads. Reserve on-demand or reserved capacity for production inference and time-sensitive training.

Optimizing for Crusoe

Taking Advantage of Crusoe-Specific Features

RDMA networking with InfiniBand for distributed training: Crusoe includes high-speed GPU interconnects standard on all GPU instances. For multi-node distributed training, ensure your training framework is configured to use RDMA over InfiniBand rather than falling back to TCP over standard Ethernet.

Verify that NCCL (NVIDIA Collective Communications Library) is using InfiniBand:

# Set environment variables for NCCL to prefer InfiniBand
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_NET_GDR_LEVEL=5
export NCCL_SOCKET_IFNAME=ib0

For large model training where gradient synchronization is a bottleneck, InfiniBand’s lower latency versus standard Ethernet can reduce training time by 20-40%.
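
In PyTorch, multi-node training picks up these settings through NCCL when the process group is initialized. A minimal sketch, assuming the environment variables above are exported on every node and the script is launched with torchrun:

import os

import torch
import torch.distributed as dist

def init_distributed():
    # torchrun supplies RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT;
    # NCCL reads the NCCL_IB_* variables set above and uses the InfiniBand transport
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    # All-reduce a small tensor to confirm cross-node communication works
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: sum across ranks = {x.item()}")
    dist.destroy_process_group()

Launch it on each node with torchrun (for example, torchrun --nnodes=2 --nproc-per-node=8 --rdzv-endpoint=<head-node-ip>:29500 train_distributed.py), and NCCL will route collectives over the InfiniBand fabric.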

Shared Disks for data loading: Crusoe Shared Disks (powered by VAST Data) deliver up to 200 MBps read throughput per TiB, making them well-suited for training data that needs to be accessed from multiple GPU nodes simultaneously. Mount the shared disk on all GPU nodes and load data directly rather than copying to local disk first.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-training-data
spec:
  accessModes:
    - ReadWriteMany  # Multiple nodes can mount simultaneously
  resources:
    requests:
      storage: 10Ti
  storageClassName: crusoe-shared-disk
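
With the PVC above mounted into each training pod (for example at /data), a standard PyTorch Dataset can stream samples straight off the shared filesystem from every node. A minimal sketch, assuming the dataset is stored as individual tensor files each holding a dict with 'features' and 'label' entries:

import os
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset

class SharedDiskDataset(Dataset):
    """Reads pre-serialized samples directly from the mounted shared disk."""

    def __init__(self, root="/data/training"):   # mount path is an assumption
        self.files = sorted(Path(root).glob("*.pt"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample = torch.load(self.files[idx])
        return sample["features"], sample["label"]

# Multiple workers keep the shared filesystem busy enough to hide network latency
loader = DataLoader(SharedDiskDataset(), batch_size=64, num_workers=8, pin_memory=True)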

AMD MI300X for memory-intensive workloads: For large language models that require more GPU memory, AMD MI300X provides 192GB HBM3 (versus 80GB on H100) at $3.45/hour on-demand or $0.95/hour spot. Ensure your framework supports ROCm:

import torch

# PyTorch with ROCm support: the ROCm build exposes the same torch.cuda API,
# so "cuda" device code runs unchanged on AMD GPUs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = LargeLanguageModel().to(device)  # placeholder for your model class
print(f"Using device: {device}")
print(f"GPU memory available: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Cost Optimization Strategies

Spot-first for training: Design training workloads to run on spot instances with checkpointing. With H100 spot at $1.60/hour (versus $3.90 on-demand), you save 59% on compute while the 7-day advance notice minimizes disruption risk.

Reserved capacity for production: For stable production inference workloads with predictable GPU usage, negotiate reserved capacity contracts with Crusoe, which offers discounts of up to 81% for 3-year commitments. Calculate your baseline GPU usage and commit to that capacity, while using on-demand or spot for variable experimental workloads.

AMD for cost-performance optimization: For compatible workloads, AMD MI300X at $3.45/hour on-demand (or $0.95/hour spot) provides compelling performance per dollar, especially for memory-intensive tasks where the 192GB of HBM3 delivers value.

Right-size GPU resources: Use smaller GPU configurations (L40S at $1.00/hour, A40 at $0.90/hour) for development, debugging, and small-scale experiments. Reserve H100/H200 for production training and large-scale experiments.

Performance Tuning

Network throughput optimization: For distributed training using Crusoe’s InfiniBand networking, use NCCL tests to benchmark all-reduce operations across nodes:

# Run NCCL all-reduce benchmark across nodes
mpirun -np 16 --host node1:8,node2:8 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_NET_GDR_LEVEL=5 \
  /usr/local/bin/all_reduce_perf -b 8 -e 8G -f 2 -g 1

If you see lower-than-expected bandwidth, verify InfiniBand configuration or contact Crusoe support for network topology optimization.

Storage throughput optimization: For data-intensive training, use Crusoe Shared Disks with sufficient provisioned capacity to achieve target throughput. At 200 MBps read per TiB, a 10 TiB shared disk delivers 2 GBps read throughput.
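
Since throughput scales linearly with provisioned capacity, you can size a shared disk from an aggregate bandwidth target. A quick sketch of that arithmetic:

import math

# Shared disk read throughput scales at ~200 MBps per provisioned TiB
READ_MBPS_PER_TIB = 200

def required_shared_disk_tib(num_nodes: int, mbps_per_node: int) -> int:
    """Smallest provisioned size (TiB) that sustains the target aggregate read rate."""
    target_mbps = num_nodes * mbps_per_node
    return math.ceil(target_mbps / READ_MBPS_PER_TIB)

# Example: 16 GPU nodes each streaming ~500 MBps of training data
print(required_shared_disk_tib(16, 500))  # -> 40 TiB (8,000 MBps aggregate)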

GPU utilization monitoring: Use Crusoe’s built-in observability tools along with NVIDIA DCGM or AMD ROCm tools to monitor GPU utilization, identify bottlenecks, and optimize batch sizes and data loading:

# Monitor NVIDIA GPU utilization
nvidia-smi dmon -s pucvmet -d 5

# Monitor AMD GPU utilization (if using MI300X)
rocm-smi --showuse

Common Migration Challenges and Solutions

Dealing with Hyperscaler-Specific Dependencies

Problem: Code tightly coupled to AWS SageMaker APIs, GCP Vertex AI, or Azure ML services.

Solution: Abstract hyperscaler-specific APIs behind interfaces. For example, instead of calling SageMaker APIs directly throughout your codebase, create a training orchestration layer that calls SageMaker in your current implementation but can be swapped for Kubernetes Job submission when migrating to Crusoe.

If you’re using SageMaker Processing for data preprocessing, replace it with containerized preprocessing jobs on Kubernetes. SageMaker Training Jobs become Kubernetes Jobs or KubeFlow PyTorchJobs/TFJobs. SageMaker Endpoints become Crusoe Managed Inference or self-hosted Triton/TorchServe deployments.
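
A thin abstraction makes that swap mechanical. The sketch below is illustrative (class and field names are hypothetical): the SageMaker backend wraps boto3's create_training_job, while the Kubernetes backend submits a batch Job through the official Python client, so application code only ever calls backend.submit():

from abc import ABC, abstractmethod

class TrainingBackend(ABC):
    """Hypothetical orchestration interface shared by both environments."""

    @abstractmethod
    def submit(self, job_spec: dict) -> str:
        """Submit a training job and return an identifier."""

class SageMakerBackend(TrainingBackend):
    def submit(self, job_spec: dict) -> str:
        import boto3
        client = boto3.client("sagemaker")
        # job_spec is assumed to already match create_training_job's parameters
        client.create_training_job(**job_spec)
        return job_spec["TrainingJobName"]

class KubernetesJobBackend(TrainingBackend):
    def __init__(self, namespace: str = "training"):
        self.namespace = namespace

    def submit(self, job_spec: dict) -> str:
        from kubernetes import client, config
        config.load_kube_config()                      # or load_incluster_config()
        # job_spec is assumed to be a batch/v1 Job manifest dict
        client.BatchV1Api().create_namespaced_job(self.namespace, job_spec)
        return job_spec["metadata"]["name"]

# Application code stays the same before and after the migration:
# backend = SageMakerBackend()         # today
# backend = KubernetesJobBackend()     # after moving to Crusoe
# backend.submit(job_spec)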

Problem: Code using AWS-specific services (DynamoDB, SQS, Step Functions).

Solution: Three options:

  1. Keep using these services from Crusoe (hybrid architecture) via VPN or site-to-site connectivity.
  2. Replace with open-source equivalents running on Crusoe (PostgreSQL instead of DynamoDB, RabbitMQ/Kafka instead of SQS, Airflow instead of Step Functions).
  3. Use cloud-agnostic managed services (MongoDB Atlas, Confluent Cloud, Astronomer for Airflow).

Handling Object Storage Migration

Problem: Large datasets in S3/GCS/Azure Blob Storage need to be accessible from Crusoe, but Crusoe doesn’t have native managed object storage.

Solution: Choose based on your requirements:

Hybrid approach (fastest to implement): Keep data in hyperscaler object storage and access it from Crusoe. Pay egress costs but avoid migration effort. Works well when compute costs far exceed data transfer costs.

Self-managed MinIO (full control): Deploy MinIO on Crusoe Kubernetes for S3-compatible storage. Requires managing the MinIO deployment but gives you full control and keeps all infrastructure on Crusoe.

Third-party S3-compatible services (managed): Use Backblaze B2, Wasabi, or Cloudflare R2 for S3-compatible storage without managing infrastructure. Typically costs $5-6/TB/month versus $23/TB/month for S3.

Managing Secrets and Credentials

Problem: Secrets stored in AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault need to be accessible from Crusoe.

Solution: Migrate secrets to Kubernetes Secrets or use a cloud-agnostic secret management solution (HashiCorp Vault, Doppler, 1Password). For Kubernetes Secrets, use external-secrets operator to sync secrets from your existing secret manager during transition, then gradually migrate them to your target solution.

# Using external-secrets operator to sync from AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        secretRef:
          accessKeyIDSecretRef:
            name: aws-credentials
            key: access-key-id
          secretAccessKeySecretRef:
            name: aws-credentials
            key: secret-access-key
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ml-training-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets
    kind: SecretStore
  target:
    name: training-credentials
  data:
  - secretKey: api-key
    remoteRef:
      key: ml-training/api-key

Porting CUDA Code to ROCm (for AMD GPUs)

Problem: If you want to use AMD MI300X GPUs for cost savings, existing CUDA code needs to run on ROCm.

Solution: For most PyTorch and TensorFlow code, ROCm compatibility is transparent (no code changes needed). For custom CUDA kernels, use HIP (Heterogeneous-compute Interface for Portability) to port CUDA code:

# Use hipify-perl to automatically convert CUDA code to HIP
hipify-perl my_cuda_kernel.cu > my_hip_kernel.cpp

# Compile for AMD GPUs
hipcc -o my_kernel my_hip_kernel.cpp

Most common CUDA operations have direct HIP equivalents. For complex custom kernels, manual porting may be required. Test thoroughly on AMD hardware before committing to production workloads.
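
A quick way to confirm you are actually running a ROCm build before launching longer jobs (a minimal sketch; ROCm wheels of PyTorch populate torch.version.hip, while CUDA builds leave it None):

import torch

# ROCm builds of PyTorch populate torch.version.hip; CUDA builds leave it None
print("HIP runtime:", getattr(torch.version, "hip", None))
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"Device: {props.name}, memory: {props.total_memory / 1e9:.0f} GB")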

When Crusoe Makes Sense

Strong fit for:

  • Teams spending $50k+/month on GPU compute where Crusoe’s 60%+ savings justify migration effort
  • Workloads already containerized on Kubernetes (minimal refactoring needed)
  • Training jobs that can leverage spot instances (59% cheaper with 7-day notice)
  • Large language models needing 192GB GPU memory (AMD MI300X advantage)
  • Organizations comfortable managing object storage and databases via Kubernetes operators or third-party services
  • Teams needing H100/H200/B200 access without hyperscaler waitlists or multi-year commitments

Consider staying on hyperscalers if:

  • GPU costs are under $10k/month (migration effort may not pay off)
  • Deeply integrated with SageMaker/Vertex AI/Azure ML (significant replatforming required)
  • Need presence in Asia-Pacific, Middle East, or Latin America (Crusoe only in US/Canada/EU)
  • Require managed databases, object storage, and other PaaS services without operational overhead
  • Lack Kubernetes expertise and can’t adopt Saturn Cloud or similar managed platforms
  • Compliance mandates specific certifications Crusoe doesn’t hold yet

Resources