Release 2025.10.01

Saturn Cloud release notes for 2025.10.01

Saturn Helm Operator

Saturn Cloud components are now managed by a Kubernetes operator instead of direct Helm installs from the Python installer. The operator manages 14 Saturn components as custom resources under charts.saturncloud.io/v1alpha1, reconciling every 2 minutes.

Beyond Helm reconciliation, the operator handles:

  • Secret lifecycle: Generates and rotates Atlas secrets, JWT key pairs, SSH keys, monitoring auth tokens, Docker registry credentials, and TLS secrets
  • Auth token management: Acquires and refreshes Saturn Cloud API tokens, stored as Kubernetes secrets and mounted as files instead of environment variables
  • Docker credential refresh: Runs on an 8-hour loop to refresh ECR tokens and distribute them across namespaces (replaces the previous CronJob approach)
  • DNS monitoring: Watches LoadBalancer services and reports DNS information back to the Saturn Cloud manager API
  • CRD dependency checking: Verifies cert-manager CRDs and webhooks are ready before reconciling Traefik
  • Stuck release cleanup: Detects and resolves Helm releases stuck in a failed deletion state

Migration tooling is available:

  • convert_to_operator.py converts an existing config.yaml to the operator values.yaml format
  • migrate_secrets_to_operator.py migrates existing Kubernetes secrets to the operator-managed format

Resource Quotas

  • Operators can define Kubernetes ResourceQuota limits per user or organization, enforced by Kubernetes itself via PriorityClass scoping
  • Applied to all pod types: workspaces, deployments, jobs, Dask schedulers, and Dask workers
  • Prevents any single user or team from monopolizing cluster GPU/CPU/memory resources
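
As an illustration of PriorityClass scoping, the sketch below builds a plain Kubernetes ResourceQuota manifest whose scopeSelector restricts it to pods in one PriorityClass. The ResourceQuota fields are standard Kubernetes; the quota name, namespace, PriorityClass name, and limits are hypothetical, and the manifests Saturn Cloud actually generates may differ.

```python
import json


def build_resource_quota(name, namespace, priority_class, hard_limits):
    """Return a ResourceQuota manifest that only counts pods in one PriorityClass."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Aggregate limits for every pod matched by the scope selector.
            "hard": dict(hard_limits),
            # Only pods whose priorityClassName is listed here count against the quota.
            "scopeSelector": {
                "matchExpressions": [
                    {
                        "scopeName": "PriorityClass",
                        "operator": "In",
                        "values": [priority_class],
                    }
                ]
            },
        },
    }


# All names and numbers below are hypothetical examples.
quota = build_resource_quota(
    name="quota-user-alice",
    namespace="example-namespace",
    priority_class="example-user-alice",
    hard_limits={"pods": "20", "requests.cpu": "32", "requests.memory": "128Gi"},
)
print(json.dumps(quota, indent=2))
```

Because the quota is tied to a PriorityClass rather than to a whole namespace, several users can share one namespace while each is limited separately.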

Configurable Node Selectors and Affinity

  • Instance sizes now accept custom node_affinity_config for clusters where node labeling doesn’t follow Saturn’s default node_role convention
  • New system_resources_node_selector and system_resources_tolerations settings control where system pods (Dask controllers, operator) run
  • All Helm charts accept configurable nodeSelector and tolerations instead of hardcoding node.saturncloud.io/role: system
  • Required for k0rdent and other non-standard cluster topologies
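
For clusters with their own node labels, a custom affinity typically takes the shape of a standard Kubernetes nodeAffinity block. The sketch below builds one in Python. The exact schema expected by node_affinity_config is not described in these notes, so treat the structure as an assumption; the label key and values are hypothetical.

```python
def node_affinity_for_label(key, values):
    """Build a Kubernetes affinity block that requires a node label to match one of the values."""
    return {
        "nodeAffinity": {
            # Hard requirement: the pod is only scheduled on matching nodes.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": key, "operator": "In", "values": list(values)}
                        ]
                    }
                ]
            }
        }
    }


# Hypothetical label used by a cluster that does not follow the node_role convention.
affinity = node_affinity_for_label("example.com/pool", ["gpu-large"])
print(affinity)
```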

Custom Scheduler Support

  • New custom_scheduler setting applies a custom schedulerName to all Saturn pod specs
  • Required for clusters running Volcano, YuniKorn, KAI scheduler, or other custom schedulers that implement gang scheduling or fair-share policies
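
In Kubernetes terms, selecting a scheduler means setting the schedulerName field on the pod spec. The sketch below shows that effect on a plain pod spec dictionary. The helper name apply_scheduler and the example pod spec are hypothetical; how Saturn Cloud applies the setting internally is not shown here.

```python
import copy


def apply_scheduler(pod_spec, scheduler_name=None):
    """Return a copy of pod_spec with schedulerName set when a custom scheduler is configured."""
    spec = copy.deepcopy(pod_spec)
    if scheduler_name:
        # Kubernetes hands the pod to the scheduler with this name.
        spec["schedulerName"] = scheduler_name
    # Without schedulerName, Kubernetes falls back to its default scheduler.
    return spec


base_spec = {"containers": [{"name": "main", "image": "example/image:latest"}]}
custom_spec = apply_scheduler(base_spec, "volcano")
default_spec = apply_scheduler(base_spec)
print(custom_spec)
```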

k0rdent Support

  • The load balancer provider can now be configured separately from the cloud provider, because in k0rdent deployments k0rdent manages the Kubernetes cluster while a different cloud handles load balancers
  • Traefik and SSH proxy services use loadBalancerProvider for LB annotations and network CIDR selection

Faster Pod Startup

  • All pod security contexts now set fsGroupChangePolicy: OnRootMismatch, skipping the recursive chown on mounted volumes when permissions already match
  • Significantly faster startup for workspaces with large PVCs (model weights, datasets)
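
For reference, the field lives in the pod-level securityContext and only has an effect when fsGroup is also set. The sketch below adds it to a pod spec dictionary. The helper name and the fsGroup value of 100 are hypothetical and not taken from Saturn Cloud's charts.

```python
import copy


def with_on_root_mismatch(pod_spec, fs_group):
    """Return a copy of pod_spec whose security context skips the recursive chown when possible."""
    spec = copy.deepcopy(pod_spec)
    security_context = spec.setdefault("securityContext", {})
    security_context["fsGroup"] = fs_group
    # Ownership is only changed recursively if the volume root does not already match fsGroup.
    security_context["fsGroupChangePolicy"] = "OnRootMismatch"
    return spec


pod_spec = {"containers": [{"name": "workspace", "image": "example/image:latest"}]}
fast_spec = with_on_root_mismatch(pod_spec, fs_group=100)
print(fast_spec["securityContext"])
```

The Kubernetes default for this field is Always, which walks the whole volume on every mount; that walk is what makes large PVCs slow to start.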

Removed Pending Timeout

  • Removed the 10-minute timeout that marked unschedulable pods as errored
  • Pods now stay in PENDING status until Kubernetes actually schedules or fails them
  • Prevents false failure states in clusters with autoscaling or custom schedulers where scale-up can take longer than 10 minutes

Resource Templates

  • New in_gallery parameter decouples gallery visibility from access level
  • Operators can hide the resource creation bar for managed deployments where users should only launch from pre-defined templates

Resource Migration

  • New admin tooling for migrating resource ownership between users and groups
  • Handles dependent objects: secret attachments, external repo attachments, shared folders, and image tag access

Node Metrics

  • kube-state-metrics now collects node-level metrics (instance type, region, zone, node role)
  • Enables abuse monitoring and cost attribution at the node level

Trust Bundle Support

  • Charts communicating with Pandora over mTLS now support custom CA trust bundles via pandora.tls.trustBundleSecretName
  • Required for environments with private PKI

Performance

  • Workspace list endpoint no longer holds database connections while making K8s API calls for IDE info, preventing connection pool exhaustion under load
  • Stopped resources skip unnecessary database transactions for label computation
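
The workspace list change follows a common pattern: read the rows, return the connection to the pool, and only then make slow external calls. The sketch below demonstrates the pattern with a fake pool that counts checked-out connections. Every name in it (FakePool, list_workspaces, fake_ide_info) is hypothetical and stands in for the real database and Kubernetes clients.

```python
from contextlib import contextmanager


class FakePool:
    """Stand-in for a database connection pool that tracks checked-out connections."""

    def __init__(self):
        self.in_use = 0

    @contextmanager
    def connection(self):
        self.in_use += 1
        try:
            yield self
        finally:
            self.in_use -= 1

    def fetch_workspaces(self):
        return [{"id": 1, "name": "train"}, {"id": 2, "name": "eval"}]


def list_workspaces(pool, fetch_ide_info):
    # Hold a connection only long enough to read the rows.
    with pool.connection() as conn:
        rows = conn.fetch_workspaces()
    # The connection is back in the pool before any slow external call is made.
    return [{**row, "ide": fetch_ide_info(row["id"])} for row in rows]


pool = FakePool()
observed_in_use = []


def fake_ide_info(workspace_id):
    # Record how many connections are checked out while the "K8s call" runs.
    observed_in_use.append(pool.in_use)
    return "jupyter"


workspaces = list_workspaces(pool, fake_ide_info)
print(workspaces, observed_in_use)
```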

Bug Fixes

  • Fixed SSH service name conflicts where the deployment subdomain could collide with the HTTP service name
  • Fixed SSH username fallback for deployments and jobs without a subdomain

Infrastructure

  • Database secrets consolidated into a single operator-managed atlas-secrets secret
  • StorageClass creation is now optional (storageClass.create toggle) for operators providing their own
  • Docker secrets script simplified: removed ECR Public-specific handling, added generic Docker registry support
  • Removed cluster-info CronJob (functionality moved to operator)
  • Atlas can load custom environment variables from an optional saturn-user-secrets secret