Moving Gen AI Workloads from Hyperscalers to Crusoe Cloud: A Practical Migration Guide

If you’re running gen AI workloads on AWS, GCP, or Azure, you’re likely experiencing the GPU availability crunch: multi-month waitlists for H100s, capacity reservations that require long-term commitments, and pricing that can reach $12+ per GPU-hour. Crusoe Cloud offers immediate access to NVIDIA’s latest GPUs (H100, H200, B200, GB200) and AMD’s MI300X/MI355X starting at $3.90/hour per GPU on-demand and $1.60/hour for spot instances, with managed Kubernetes, managed inference, and 99.98% uptime. Crusoe operates in US, Canada, and EU regions.
The Hyperscaler GPU Bottleneck
The challenge with running AI workloads on traditional hyperscalers comes down to access and economics. AWS, GCP, and Azure all offer powerful GPU instances, but getting access to them requires navigating quota systems, capacity reservations, and often multi-year commitments. An 8-GPU H100 instance costs $98.32/hour on Azure, $88.49/hour on GCP, and requires capacity reservations or UltraClusters on AWS just to guarantee availability. Over a month of continuous usage, you’re looking at $64,000-$72,000 per instance.
Crusoe provides immediate on-demand access to H100 instances at $3.90 per GPU-hour without sales calls or quota approvals. An 8-GPU cluster costs $31.20/hour, or about $22,800/month (a savings of roughly $41,000-$49,000 per month compared to hyperscaler on-demand pricing). For fault-tolerant workloads, spot pricing brings H100s down to $1.60/hour per GPU ($12.80/hour for 8 GPUs, or $9,300/month), with 7-day advance notice before interruptions.
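As a back-of-the-envelope check on those monthly figures (a quick sketch assuming roughly 730 hours in a month and the on-demand rates quoted above):

# Back-of-the-envelope monthly cost for an 8x H100 instance (~730 hours/month)
HOURS_PER_MONTH = 730
rates = {
    "Azure 8x H100 on-demand": 98.32,
    "GCP 8x H100 on-demand": 88.49,
    "Crusoe 8x H100 on-demand": 8 * 3.90,   # $31.20/hour
    "Crusoe 8x H100 spot": 8 * 1.60,        # $12.80/hour
}
for name, hourly in rates.items():
    print(f"{name}: ${hourly * HOURS_PER_MONTH:,.0f}/month")
# Azure ~$71,774, GCP ~$64,598, Crusoe on-demand ~$22,776, Crusoe spot ~$9,344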
Unlike most GPU-focused providers, Crusoe includes managed Kubernetes, managed inference for model deployment, and a managed Slurm workload manager. Crusoe also offers both NVIDIA and AMD GPUs, with AMD MI300X at $3.45/hour on-demand and $0.95/hour spot, providing high-performance alternatives for teams looking to optimize costs or avoid NVIDIA-only lock-in.
Migrating from SageMaker, Vertex AI, or Azure ML?
Saturn Cloud on Crusoe provides a managed AI development platform with development workspaces, distributed training, and model deployment so you can keep a managed experience while getting Crusoe's GPU availability and pricing.
Understanding What Crusoe Offers
GPU Hardware and Availability
Crusoe provides access to both NVIDIA’s latest GPU architectures (H100, H200, B200, GB200) and AMD’s high-performance compute GPUs (MI300X, MI355X). All instances include RDMA networking with InfiniBand standard across GPU nodes. You can provision 8-16 GPU clusters immediately on-demand or via spot pricing. Larger clusters are available through reserved capacity contracts with deeper discounts (up to 81% savings on 3-year terms).
Crusoe operates in US (Virginia, Texas), Canada (Alberta), and EU (Iceland, Norway) regions. Instance configurations range from single GPU instances to multi-node clusters with InfiniBand for low-latency gradient synchronization. Crusoe’s AutoClusters technology provides automatic node recovery for 99.98% uptime.
AMD GPU Advantage: AMD MI300X GPUs offer 192GB of HBM3 memory (versus 80GB on H100), making them particularly well-suited for large language models and memory-intensive workloads. At $3.45/hour on-demand and $0.95/hour spot, MI300X provides compelling price-performance for teams willing to optimize for AMD’s ROCm platform.
Managed Services
Crusoe Managed Kubernetes (CMK) handles cluster provisioning, upgrades, and control plane management. You define GPU node pools through Terraform or the API, and Crusoe manages the Kubernetes control plane, node scaling, and GPU driver installation. Your existing Kubernetes manifests, Helm charts, and operators work without modification. CMK supports autoscaling, integrated monitoring, and topology-aware GPU placement for optimal performance.
Crusoe Managed Inference provides a fully managed deployment solution for model serving that handles infrastructure, maintenance, and security. This service delivers low-latency, high-throughput predictions without requiring you to manage inference infrastructure directly (a direct alternative to SageMaker Endpoints or Vertex AI Prediction).
Slurm workload manager is available as a managed service for teams migrating from HPC environments or needing batch job scheduling with advanced resource allocation capabilities beyond standard Kubernetes batch jobs.
Infrastructure Capabilities
Storage options include:
Persistent disks (block storage): Powered by Lightbits Labs technology, optimized for AI/ML workloads with lower and more consistent latency, higher throughput, and linear scalability compared to traditional network block storage.
Shared disks: High-performance shared network filesystem powered by VAST Data, designed for concurrent access from multiple VMs. Delivers up to 200 MBps read and 40 MBps write throughput per TiB of provisioned storage. Ideal for training datasets that need to be accessed from multiple GPU nodes simultaneously.
Object storage: Crusoe doesn’t currently offer a native managed object storage service. Teams typically deploy self-managed MinIO for S3-compatible object storage on Crusoe compute, use existing hyperscaler object storage during hybrid migration, or leverage third-party S3-compatible services.
RDMA networking with InfiniBand comes standard on all GPU instances, providing the high-bandwidth, low-latency networking essential for distributed training. Crusoe’s topology-aware placement ensures GPU instances are optimally positioned to minimize communication latency.
Automation and Infrastructure-as-Code
Crusoe offers an official Terraform provider with comprehensive resource coverage. If you’re managing hyperscaler infrastructure through Terraform, your migration involves translating resource definitions rather than rewriting automation logic. Crusoe also provides a CLI (available via brew and apt), Go client library, and REST APIs for scripting and automation.
Access Management
Crusoe Cloud uses an organization-based access control model with three predefined roles:
- Admin: Full control over all organization resources and member management
- Editor: Can create, modify, and delete resources but cannot manage organization members
- Reader: View-only access to resources
Organizations can enforce Single Sign-On (SSO) via OIDC-based authentication for secure console access. API access is managed through access keys and secret keys, and audit logs provide 90-day activity history. The simplicity of three roles (versus hundreds of AWS IAM policies) reduces misconfiguration risks and onboarding complexity.
Pre-Migration Assessment
Identify Migration Blockers and Dependencies
Before migrating, assess these potential blockers:
Hyperscaler service dependencies: Catalog which AWS, GCP, or Azure services your workloads use. SageMaker, DynamoDB, Step Functions, and similar services don’t have Crusoe equivalents. You’ll need to replace them with open-source alternatives, use third-party managed services, or maintain hybrid connectivity.
Data residency constraints: Crusoe operates in US (Virginia, Texas), Canada (Alberta), and EU (Iceland, Norway). Check if these regions satisfy your compliance requirements. If you need presence in Asia-Pacific, Middle East, or other regions, Crusoe may not work yet.
Existing cloud contracts: Calculate early termination fees for reserved instances or enterprise agreements. Factor these into migration ROI. Sometimes it makes financial sense to wait until commitments expire or run hybrid temporarily.
Infrastructure management skills: Crusoe doesn’t offer managed databases or object storage. Your team needs Kubernetes experience to deploy and manage these services, or budget for third-party managed providers. Saturn Cloud on Crusoe reduces this burden by providing a managed AI platform layer.
AMD GPU compatibility: If cost optimization through AMD MI300X is part of your strategy, audit your code for CUDA-only libraries. Most PyTorch and TensorFlow code runs on ROCm without changes, but custom CUDA kernels require porting.
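One quick environment-level check during that audit is whether your installed PyTorch build targets ROCm or CUDA. A minimal sketch:

import torch

# torch.version.hip is populated only on ROCm builds; torch.version.cuda on CUDA builds
if torch.version.hip is not None:
    print(f"ROCm/HIP build: {torch.version.hip}")
elif torch.version.cuda is not None:
    print(f"CUDA build: {torch.version.cuda}")
else:
    print("CPU-only build")

# Device discovery uses the same torch.cuda API on both vendors
print(f"GPUs visible: {torch.cuda.device_count()}")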
Migration Planning
Choosing Your Migration Strategy
Lift-and-shift works if your workloads already run on Kubernetes. Move your containers to Crusoe clusters without rewriting application logic. Best for teams with minimal dependencies on hyperscaler-specific services.
Hybrid lets you use Crusoe GPUs immediately while keeping databases and object storage on your current provider. You’ll pay egress costs, but GPU savings usually offset this. Good for testing Crusoe before full commitment.
Spot-first takes advantage of Crusoe’s spot pricing with 7-day advance notice. Design training jobs to checkpoint frequently and resume automatically. With H100 spot at $1.60/hour versus $3.90 on-demand, this strategy saves 59% on compute for fault-tolerant workloads.
Full migration moves all infrastructure to Crusoe. Requires deploying your own object storage (MinIO) or using third-party services, plus managing databases via Kubernetes operators or external providers. Maximizes cost savings but increases operational complexity.
Setting Up Your Crusoe Environment
Account setup and access management:
Create your Crusoe account: Sign up at crusoe.ai and create your organization. The account creator automatically gets the Admin role.
Install and configure the Crusoe CLI:
# Install CLI on macOS
brew install crusoe
# Install CLI on Ubuntu/Debian
curl -fsSL https://console.crusoecloud.com/downloads/crusoe-cli_latest_amd64.deb -o crusoe-cli.deb
sudo apt install ./crusoe-cli.deb
# Initialize CLI configuration and authenticate
crusoe config init
crusoe auth login
# Verify installation
crusoe --version
Set up user authentication: Add team members to your organization and assign them Admin, Editor, or Reader roles. For enterprise teams, configure SSO through your identity provider (Okta, Azure AD, etc.) for OIDC-based authentication.
Create API keys for automation:
# Generate API access key and secret key via the console
# Store credentials in ~/.crusoe/config
# Verify API access
crusoe projects list
Configure service account credentials for CI/CD: Store Crusoe API keys as secrets in your CI/CD platform:
GitHub Actions example:
- name: Authenticate with Crusoe
env:
CRUSOE_ACCESS_KEY: ${{ secrets.CRUSOE_ACCESS_KEY }}
CRUSOE_SECRET_KEY: ${{ secrets.CRUSOE_SECRET_KEY }}
run: |
mkdir -p ~/.crusoe
echo "[default]" > ~/.crusoe/config
echo "access_key_id = $CRUSOE_ACCESS_KEY" >> ~/.crusoe/config
echo "secret_key = $CRUSOE_SECRET_KEY" >> ~/.crusoe/config
GitLab CI example:
before_script:
- mkdir -p ~/.crusoe
- echo "[default]" > ~/.crusoe/config
- echo "access_key_id = $CRUSOE_ACCESS_KEY" >> ~/.crusoe/config
- echo "secret_key = $CRUSOE_SECRET_KEY" >> ~/.crusoe/config
Storage configuration: Provision persistent disks for instance storage and shared disks for datasets that need concurrent access from multiple GPU nodes. For object storage, either deploy MinIO on Crusoe compute, keep using hyperscaler object storage during hybrid operation, or use third-party S3-compatible services.
Mapping Hyperscaler Services to Crusoe Equivalents
| Hyperscaler Service | Crusoe Equivalent | Notes |
|---|---|---|
| EKS, GKE, AKS | Crusoe Managed Kubernetes (CMK) | Your Kubernetes manifests work without changes. Update Terraform for cluster provisioning and GPU node pools. |
| SageMaker Training, Vertex AI Training, Azure ML Jobs | Kubernetes Jobs or Kubeflow operators | Replace managed training with containerized jobs on CMK. Or use Saturn Cloud for a managed experience. |
| SageMaker Endpoints, Vertex AI Prediction | Crusoe Managed Inference or self-hosted Triton/TorchServe | Crusoe Managed Inference provides similar fully-managed model serving. |
| RDS, Cloud SQL, Azure Database | Self-hosted via Kubernetes operators or third-party (Aiven, etc.) | Crusoe doesn’t offer managed databases. Deploy PostgreSQL/MySQL operators or use external managed services. |
| S3, GCS, Azure Blob Storage | MinIO, third-party S3-compatible services, or hybrid access | No native managed object storage. Deploy MinIO yourself, use Backblaze B2/Wasabi/Cloudflare R2, or access hyperscaler storage during migration. |
| SQS, Pub/Sub, Azure Queue Storage | Self-hosted RabbitMQ/Kafka on Kubernetes or Confluent Cloud | No managed queue service. Run your own or use cloud-agnostic providers. |
| Step Functions, Cloud Composer, Data Factory | Airflow/Prefect/Argo Workflows on Kubernetes | No managed workflow orchestration. Deploy open-source tools on CMK. |
Step-by-Step Migration Process
Data Migration
For teams using hyperscaler object storage (S3, GCS, Azure Blob), you have several migration strategies:
Strategy 1: Deploy MinIO on Crusoe
# Deploy MinIO on Crusoe Kubernetes for S3-compatible storage
kubectl create namespace minio
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install minio bitnami/minio \
--namespace minio \
--set auth.rootUser=admin \
--set auth.rootPassword=$(openssl rand -base64 32) \
--set persistence.size=10Ti \
--set resources.requests.memory=32Gi
# Transfer data using rclone (assumes 's3' and 'minio' remotes are configured in rclone.conf)
rclone copy s3:my-aws-bucket/training-data minio:ml-training-data/training-data \
--progress \
--transfers 16 \
--checkers 32
Strategy 2: Hybrid storage approach
Leave data on hyperscaler storage initially and access it from Crusoe GPU instances. You’ll pay egress costs from AWS/GCP/Azure, but this lets you start using Crusoe GPUs immediately. For workloads where compute costs dwarf data transfer costs, this can be net positive.
# Access AWS S3 from Crusoe GPU instances
import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='YOUR_AWS_KEY',
    aws_secret_access_key='YOUR_AWS_SECRET'
)

# Load training data directly from S3 (example object key under the training-data prefix used above)
# Network egress costs apply but GPU savings may offset this
s3.download_file('my-aws-bucket', 'training-data/shard-0000.tar', '/data/shard-0000.tar')
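To judge whether the hybrid approach is net positive for a given run, compare expected egress charges against the GPU savings. A rough break-even sketch, assuming an illustrative $0.09/GB egress rate (check your provider's current pricing) and the 8x H100 hourly rates quoted earlier:

# Break-even for the hybrid approach: egress charges vs. GPU savings (illustrative rates)
EGRESS_PER_GB = 0.09                     # assumed hyperscaler internet egress, $/GB
HYPERSCALER_NODE_HOUR = 88.49            # GCP 8x H100 on-demand rate quoted earlier
CRUSOE_NODE_HOUR = 31.20                 # Crusoe 8x H100 on-demand rate

def breakeven_gb(training_hours: float) -> float:
    """Data volume (GB) at which egress charges equal the GPU savings."""
    savings = (HYPERSCALER_NODE_HOUR - CRUSOE_NODE_HOUR) * training_hours
    return savings / EGRESS_PER_GB

# A 100-hour run saves ~$5,729 in compute, covering roughly 63,656 GB (~63 TB) of egress
print(f"{breakeven_gb(100):,.0f} GB")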
Strategy 3: Third-party object storage
Use cloud-agnostic S3-compatible services:
# Transfer to Backblaze B2, Wasabi, or Cloudflare R2
rclone copy s3:my-aws-bucket/training-data b2:ml-training-data/training-data \
--progress \
--transfers 16 \
--checkers 32
For large datasets, prioritize migrating frequently accessed data (active training sets, recent checkpoints) first. Cold archival data can stay on hyperscaler storage until needed.
Workload Migration
Containerized training jobs: Update your Kubernetes manifests to point to Crusoe clusters, update data paths to reference Crusoe storage or shared disks, and update image registries if you’re moving those as well.
Example Kubernetes Job manifest adaptation:
apiVersion: batch/v1
kind: Job
metadata:
name: llm-training
spec:
template:
spec:
containers:
- name: trainer
image: your-registry/llm-trainer:latest
resources:
limits:
nvidia.com/gpu: 8 # or amd.com/gpu for MI300X
volumeMounts:
- name: training-data
mountPath: /data
- name: checkpoints
mountPath: /checkpoints
volumes:
- name: training-data
persistentVolumeClaim:
claimName: shared-disk-pvc # Crusoe Shared Disk
- name: checkpoints
persistentVolumeClaim:
claimName: checkpoint-pvc # Crusoe Persistent Disk
nodeSelector:
crusoe.ai/gpu-type: h100 # or mi300x for AMD
restartPolicy: Never
Model serving infrastructure: For inference, you have two main options on Crusoe:
Crusoe Managed Inference: Deploy your models using Crusoe’s fully managed inference platform, which handles infrastructure, scaling, and maintenance. This provides a similar experience to SageMaker Endpoints or Vertex AI Prediction without managing inference infrastructure yourself.
Self-hosted inference: Deploy your model serving framework (Triton Inference Server, TorchServe, TensorFlow Serving, or custom FastAPI services) on Crusoe Kubernetes with Horizontal Pod Autoscaler and ingress controllers.
Example self-hosted Triton deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: triton-inference
spec:
replicas: 3
selector:
matchLabels:
app: triton
template:
metadata:
labels:
app: triton
spec:
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.01-py3
ports:
- containerPort: 8000
name: http
- containerPort: 8001
name: grpc
resources:
limits:
nvidia.com/gpu: 1
volumeMounts:
- name: model-repository
mountPath: /models
volumes:
- name: model-repository
persistentVolumeClaim:
claimName: models-pvc
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: triton-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: triton-inference
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Leveraging Spot Instances
Crusoe’s spot pricing offers up to 90% savings relative to typical hyperscaler on-demand rates (and 59% versus Crusoe’s own H100 on-demand pricing), with the unique advantage of 7-day advance notice before interruptions. Design your training workloads to leverage spot instances:
Checkpointing strategy:
import os
from datetime import datetime

import torch

class SpotTrainer:
    def __init__(self, model, checkpoint_dir='/checkpoints'):
        self.model = model
        self.checkpoint_dir = checkpoint_dir
        self.checkpoint_frequency = 1  # Save every epoch

    def save_checkpoint(self, epoch, optimizer, loss):
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            'timestamp': datetime.now(),
        }
        path = os.path.join(self.checkpoint_dir, f'checkpoint_epoch_{epoch}.pt')
        torch.save(checkpoint, path)
        print(f"Checkpoint saved: {path}")

    def load_checkpoint(self, checkpoint_path):
        checkpoint = torch.load(checkpoint_path)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        return checkpoint['epoch'], checkpoint['optimizer_state_dict']

    def train(self, dataloader, optimizer, num_epochs, resume_from=None):
        start_epoch = 0
        # Resume from checkpoint if available
        if resume_from and os.path.exists(resume_from):
            start_epoch, optimizer_state = self.load_checkpoint(resume_from)
            optimizer.load_state_dict(optimizer_state)
            print(f"Resumed from epoch {start_epoch}")
        for epoch in range(start_epoch, num_epochs):
            for batch_idx, (data, target) in enumerate(dataloader):
                # Training step (forward/backward/optimizer step), implemented by your training code
                loss = self.train_step(data, target, optimizer)
            # Save a checkpoint at the end of each epoch so an interruption loses at most one epoch
            if epoch % self.checkpoint_frequency == 0:
                self.save_checkpoint(epoch, optimizer, loss)
Handling spot interruptions:
With 7 days of advance notice, you can (see the sketch after this list):
- Complete training runs that are within 7 days of finishing
- Migrate to on-demand instances if needed
- Checkpoint and resume on new spot instances
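A simple helper (a sketch, not a Crusoe API) for making that call when a notice arrives:

from datetime import timedelta

NOTICE_WINDOW = timedelta(days=7)  # Crusoe spot interruption notice

def plan_for_interruption(remaining_epochs: int, hours_per_epoch: float) -> str:
    """Decide how to respond to a spot interruption notice (illustrative only)."""
    remaining = timedelta(hours=remaining_epochs * hours_per_epoch)
    if remaining <= NOTICE_WINDOW:
        return "finish-on-spot"        # the run completes before the instance is reclaimed
    return "checkpoint-and-migrate"    # checkpoint, then resume on new spot or on-demand capacity

print(plan_for_interruption(remaining_epochs=12, hours_per_epoch=6))  # finish-on-spot (72h left)
print(plan_for_interruption(remaining_epochs=40, hours_per_epoch=6))  # checkpoint-and-migrate (240h left)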
Cost optimization: Use spot instances for development, experimentation, and fault-tolerant training workloads. Reserve on-demand or reserved capacity for production inference and time-sensitive training.
Optimizing for Crusoe
Taking Advantage of Crusoe-Specific Features
RDMA networking with InfiniBand for distributed training: Crusoe includes high-speed GPU interconnects standard on all GPU instances. For multi-node distributed training, ensure your training framework is configured to use RDMA over InfiniBand rather than falling back to TCP over standard Ethernet.
Verify that NCCL (NVIDIA Collective Communications Library) is using InfiniBand:
# Set environment variables for NCCL to prefer InfiniBand
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_NET_GDR_LEVEL=5
export NCCL_SOCKET_IFNAME=ib0
For large model training where gradient synchronization is a bottleneck, InfiniBand’s lower latency versus standard Ethernet can reduce training time by 20-40%.
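With those variables exported, distributed training initializes as usual. A minimal multi-node sketch, assuming the job is launched with torchrun (which sets RANK, WORLD_SIZE, and LOCAL_RANK):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT
dist.init_process_group(backend="nccl")   # NCCL picks up the InfiniBand settings exported above
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model stands in for your real network
model = torch.nn.Linear(1024, 1024).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
# ... run your training loop; NCCL all-reduces gradients over InfiniBand

Launch it with torchrun --nnodes <num_nodes> --nproc_per_node 8 plus your usual rendezvous settings; NCCL then handles inter-node communication over InfiniBand.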
Shared Disks for data loading: Crusoe Shared Disks (powered by VAST Data) deliver up to 200 MBps read throughput per TiB, making them well-suited for training data that needs to be accessed from multiple GPU nodes simultaneously. Mount the shared disk on all GPU nodes and load data directly rather than copying to local disk first.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: shared-training-data
spec:
accessModes:
- ReadWriteMany # Multiple nodes can mount simultaneously
resources:
requests:
storage: 10Ti
storageClassName: crusoe-shared-disk
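Once the claim is mounted into each training pod (at /data, as in the Job manifest above), data loaders can stream batches directly from the shared filesystem. A minimal sketch, assuming torchvision is installed and the dataset follows an ImageFolder-style layout under a hypothetical /data/train directory:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# /data is the shared-disk mount path from the Job manifest above; "train" is a hypothetical subdirectory
dataset = datasets.ImageFolder("/data/train", transform=transforms.ToTensor())

# Multiple workers stream batches straight from the shared filesystem; no per-node dataset copy needed
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=8, pin_memory=True, prefetch_factor=4)

for images, labels in loader:
    pass  # training step goes here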
AMD MI300X for memory-intensive workloads: For large language models that require more GPU memory, AMD MI300X provides 192GB HBM3 (versus 80GB on H100) at $3.45/hour on-demand or $0.95/hour spot. Ensure your framework supports ROCm:
import torch
# PyTorch with ROCm support
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# On AMD GPUs with ROCm, torch.cuda still works
model = LargeLanguageModel().to(device)
print(f"Using device: {device}")
print(f"GPU memory available: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
Cost Optimization Strategies
Spot-first for training: Design training workloads to run on spot instances with checkpointing. With H100 spot at $1.60/hour (versus $3.90 on-demand), you save 59% on compute while the 7-day advance notice minimizes disruption risk.
Reserved capacity for production: For stable production inference workloads with predictable GPU usage, negotiate reserved capacity contracts with Crusoe. They offer discounts up to 81% for 3-year commitments. Calculate your baseline GPU usage and commit to that capacity, while using on-demand or spot for variable experimental workloads.
AMD for cost-performance optimization: For compatible workloads, AMD MI300X at $3.45/hour on-demand (or $0.95/hour spot) provides compelling performance per dollar, especially for memory-intensive tasks where the 192GB of HBM3 delivers value.
Right-size GPU resources: Use smaller GPU configurations (L40S at $1.00/hour, A40 at $0.90/hour) for development, debugging, and small-scale experiments. Reserve H100/H200 for production training and large-scale experiments.
Performance Tuning
Network throughput optimization: For distributed training using Crusoe’s InfiniBand networking, use NCCL tests to benchmark all-reduce operations across nodes:
# Run NCCL all-reduce benchmark across nodes
mpirun -np 16 --host node1:8,node2:8 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_NET_GDR_LEVEL=5 \
/usr/local/bin/all_reduce_perf -b 8 -e 8G -f 2 -g 1
If you see lower-than-expected bandwidth, verify InfiniBand configuration or contact Crusoe support for network topology optimization.
Storage throughput optimization: For data-intensive training, use Crusoe Shared Disks with sufficient provisioned capacity to achieve target throughput. At 200 MBps read per TiB, a 10 TiB shared disk delivers 2 GBps read throughput.
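Because throughput scales linearly with provisioned capacity, you can size the disk from a target read rate. A quick sizing helper based on the 200 MBps-per-TiB figure:

# Size a Crusoe Shared Disk from a target aggregate read throughput
READ_MBPS_PER_TIB = 200  # published read throughput per provisioned TiB

def tib_needed(target_gbps: float) -> float:
    """Provisioned TiB required to sustain a target read throughput in GBps."""
    return target_gbps * 1000 / READ_MBPS_PER_TIB

print(tib_needed(2.0))   # 10.0 TiB for 2 GBps, matching the example above
print(tib_needed(5.0))   # 25.0 TiB for 5 GBps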
GPU utilization monitoring: Use Crusoe’s built-in observability tools along with NVIDIA DCGM or AMD ROCm tools to monitor GPU utilization, identify bottlenecks, and optimize batch sizes and data loading:
# Monitor NVIDIA GPU utilization
nvidia-smi dmon -s pucvmet -d 5
# Monitor AMD GPU utilization (if using MI300X)
rocm-smi --showuse
Common Migration Challenges and Solutions
Dealing with Hyperscaler-Specific Dependencies
Problem: Code tightly coupled to AWS SageMaker APIs, GCP Vertex AI, or Azure ML services.
Solution: Abstract hyperscaler-specific APIs behind interfaces. For example, instead of calling SageMaker APIs directly throughout your codebase, create a training orchestration layer that calls SageMaker in your current implementation but can be swapped for Kubernetes Job submission when migrating to Crusoe.
If you’re using SageMaker Processing for data preprocessing, replace it with containerized preprocessing jobs on Kubernetes. SageMaker Training Jobs become Kubernetes Jobs or KubeFlow PyTorchJobs/TFJobs. SageMaker Endpoints become Crusoe Managed Inference or self-hosted Triton/TorchServe deployments.
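A minimal sketch of that abstraction (the class names are hypothetical; the SageMaker and Kubernetes calls are the standard boto3 and kubernetes-client APIs):

from abc import ABC, abstractmethod

class TrainingBackend(ABC):
    """Thin interface your code calls instead of a specific cloud SDK."""
    @abstractmethod
    def submit(self, job_spec: dict) -> str: ...

class SageMakerBackend(TrainingBackend):
    def submit(self, job_spec: dict) -> str:
        import boto3
        sagemaker = boto3.client("sagemaker")
        sagemaker.create_training_job(**job_spec)      # job_spec uses SageMaker's request format
        return job_spec["TrainingJobName"]

class KubernetesJobBackend(TrainingBackend):
    def submit(self, job_spec: dict) -> str:
        from kubernetes import client, config
        config.load_kube_config()                      # or config.load_incluster_config() inside a pod
        client.BatchV1Api().create_namespaced_job(namespace="default", body=job_spec)
        return job_spec["metadata"]["name"]            # job_spec is a Kubernetes Job manifest dict

# Application code depends only on TrainingBackend, so swapping providers is a one-line change
backend: TrainingBackend = KubernetesJobBackend()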
Problem: Code using AWS-specific services (DynamoDB, SQS, Step Functions).
Solution: Three options:
- Keep using these services from Crusoe (hybrid architecture) via VPN or site-to-site connectivity.
- Replace with open-source equivalents running on Crusoe (PostgreSQL instead of DynamoDB, RabbitMQ/Kafka instead of SQS, Airflow instead of Step Functions).
- Use cloud-agnostic managed services (MongoDB Atlas, Confluent Cloud, Astronomer for Airflow).
Handling Object Storage Migration
Problem: Large datasets in S3/GCS/Azure Blob Storage need to be accessible from Crusoe, but Crusoe doesn’t have native managed object storage.
Solution: Choose based on your requirements:
Hybrid approach (fastest to implement): Keep data in hyperscaler object storage and access from Crusoe. Pay egress costs but avoid migration effort. Works well if compute costs » data transfer costs.
Self-managed MinIO (full control): Deploy MinIO on Crusoe Kubernetes for S3-compatible storage. Requires managing the MinIO deployment but gives you full control and keeps all infrastructure on Crusoe.
Third-party S3-compatible services (managed): Use Backblaze B2, Wasabi, or Cloudflare R2 for S3-compatible storage without managing infrastructure. Typically costs $5-6/TB/month versus $23/TB/month for S3.
Managing Secrets and Credentials
Problem: Secrets stored in AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault need to be accessible from Crusoe.
Solution: Migrate secrets to Kubernetes Secrets or use a cloud-agnostic secret management solution (HashiCorp Vault, Doppler, 1Password). For Kubernetes Secrets, use external-secrets operator to sync secrets from your existing secret manager during transition, then gradually migrate them to your target solution.
# Using external-secrets operator to sync from AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: aws-secrets
spec:
provider:
aws:
service: SecretsManager
region: us-east-1
auth:
secretRef:
accessKeyIDSecretRef:
name: aws-credentials
key: access-key-id
secretAccessKeySecretRef:
name: aws-credentials
key: secret-access-key
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: ml-training-secrets
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets
kind: SecretStore
target:
name: training-credentials
data:
- secretKey: api-key
remoteRef:
key: ml-training/api-key
Porting CUDA Code to ROCm (for AMD GPUs)
Problem: If you want to use AMD MI300X GPUs for cost savings, existing CUDA code needs to run on ROCm.
Solution: For most PyTorch and TensorFlow code, ROCm compatibility is transparent (no code changes needed). For custom CUDA kernels, use HIP (Heterogeneous-compute Interface for Portability) to port CUDA code:
# Use hipify-perl to automatically convert CUDA code to HIP
hipify-perl my_cuda_kernel.cu > my_hip_kernel.cpp
# Compile for AMD GPUs
hipcc -o my_kernel my_hip_kernel.cpp
Most common CUDA operations have direct HIP equivalents. For complex custom kernels, manual porting may be required. Test thoroughly on AMD hardware before committing to production workloads.
Want a managed AI platform on Crusoe?
Saturn Cloud on Crusoe runs directly in your Crusoe account, providing development workspaces, distributed training, and deployment infrastructure. Get the simplicity of SageMaker with Crusoe's GPU availability and cost savings.
When Crusoe Makes Sense
Strong fit for:
- Teams spending $50k+/month on GPU compute where Crusoe’s 60%+ savings justify migration effort
- Workloads already containerized on Kubernetes (minimal refactoring needed)
- Training jobs that can leverage spot instances (59% cheaper with 7-day notice)
- Large language models needing 192GB GPU memory (AMD MI300X advantage)
- Organizations comfortable managing object storage and databases via Kubernetes operators or third-party services
- Teams needing H100/H200/B200 access without hyperscaler waitlists or multi-year commitments
Consider staying on hyperscalers if:
- GPU costs are under $10k/month (migration effort may not pay off)
- Deeply integrated with SageMaker/Vertex AI/Azure ML (significant replatforming required)
- Need presence in Asia-Pacific, Middle East, or Latin America (Crusoe only in US/Canada/EU)
- Require managed databases, object storage, and other PaaS services without operational overhead
- Lack Kubernetes expertise and can’t adopt Saturn Cloud or similar managed platforms
- Compliance mandates specific certifications Crusoe doesn’t hold yet
Resources
- Crusoe Cloud Documentation
- Crusoe Terraform Provider
- Crusoe Managed Kubernetes
- Saturn Cloud on Crusoe
- Crusoe Support
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without having to switch tools.