Release 2025.10.01
Saturn Cloud release notes for 2025.10.01
Saturn Helm Operator
Saturn Cloud components are now managed by a Kubernetes operator instead of direct Helm installs from the Python installer. The operator manages 14 Saturn components as custom resources under charts.saturncloud.io/v1alpha1, reconciling every 2 minutes.
Beyond Helm reconciliation, the operator handles:
- Secret lifecycle: Generates and rotates Atlas secrets, JWT key pairs, SSH keys, monitoring auth tokens, Docker registry credentials, and TLS secrets
- Auth token management: Acquires and refreshes Saturn Cloud API tokens, stored as Kubernetes secrets and mounted as files instead of environment variables
- Docker credential refresh: Runs on an 8-hour loop to refresh ECR tokens and distribute them across namespaces (replaces the previous CronJob approach)
- DNS monitoring: Watches LoadBalancer services and reports DNS information back to the Saturn Cloud manager API
- CRD dependency checking: Verifies cert-manager CRDs and webhooks are ready before reconciling Traefik
- Stuck release cleanup: Detects and resolves Helm releases stuck in a failed deletion state
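To illustrate why mounting tokens as files matters: Kubernetes updates the files of a mounted secret when the secret changes, whereas environment variables are fixed when the container starts. The sketch below is an assumption about how a consumer might read such a token, not Saturn Cloud's code. The function name and file name are hypothetical, and a temporary directory stands in for the secret mount so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def read_token(token_path):
    """Read the token from its file on every call.

    Re-reading (rather than caching at startup) lets a rotated
    secret be picked up without restarting the process.
    """
    return Path(token_path).read_text(encoding="utf-8").strip()


# A temporary directory stands in for the mounted secret volume.
with tempfile.TemporaryDirectory() as mount_dir:
    token_file = Path(mount_dir) / "token"  # hypothetical file name

    token_file.write_text("first-token\n", encoding="utf-8")
    first = read_token(token_file)

    # Simulate a rotation: the file content changes in place.
    token_file.write_text("rotated-token\n", encoding="utf-8")
    second = read_token(token_file)

print(first, second)
```

A process that read the token from an environment variable would keep the first value until it was restarted.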
Migration tooling is available:
- convert_to_operator.py converts an existing config.yaml to the operator values.yaml format
- migrate_secrets_to_operator.py migrates K8s secrets to the operator-managed format
Resource Quotas
- Operators can define Kubernetes ResourceQuota limits per user or org, enforced at the K8s level via PriorityClass scoping
- Applied to all pod types: workspaces, deployments, jobs, Dask schedulers, and Dask workers
- Prevents any single user or team from monopolizing cluster GPU/CPU/memory resources
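As a rough sketch of what PriorityClass scoping looks like at the Kubernetes level, the snippet below builds a ResourceQuota manifest whose scopeSelector matches pods of one PriorityClass. The apiVersion, kind, and scopeSelector fields are standard Kubernetes. The helper name, quota name, namespace, PriorityClass name, and limits are illustrative assumptions, and the manifests Saturn Cloud actually generates may differ.

```python
import json


def build_resource_quota(name, namespace, priority_class, hard_limits):
    """Return a ResourceQuota manifest that only counts pods
    belonging to the given PriorityClass."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hard": dict(hard_limits),
            "scopeSelector": {
                "matchExpressions": [
                    {
                        "scopeName": "PriorityClass",
                        "operator": "In",
                        "values": [priority_class],
                    }
                ]
            },
        },
    }


# All names and limits below are hypothetical.
quota = build_resource_quota(
    name="quota-team-a",
    namespace="example-namespace",
    priority_class="team-a",
    hard_limits={"pods": "50", "requests.cpu": "64", "requests.memory": "256Gi"},
)
print(json.dumps(quota, indent=2))
```

Because the quota is scoped by PriorityClass rather than by namespace alone, pods from different users or orgs can share a namespace while counting against separate limits, provided each pod carries the matching priorityClassName.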
Configurable Node Selectors and Affinity
- Instance sizes now accept custom node_affinity_config for clusters where node labeling doesn't follow Saturn's default node_role convention
- New system_resources_node_selector and system_resources_tolerations settings control where system pods (Dask controllers, operator) run
- All Helm charts accept configurable nodeSelector and tolerations instead of hardcoding node.saturncloud.io/role: system
- Required for k0rdent and other non-standard cluster topologies
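The effect of the node selector and toleration settings on a pod spec can be sketched as follows. This is a minimal illustration, assuming the settings hold a plain label map and a list of standard Kubernetes tolerations. The helper name, the example.com/pool label, and the container image are hypothetical; only nodeSelector and tolerations are real pod spec fields.

```python
import copy


def apply_placement(pod_spec, node_selector, tolerations):
    """Return a copy of pod_spec with nodeSelector and tolerations set."""
    spec = copy.deepcopy(pod_spec)
    spec["nodeSelector"] = dict(node_selector)
    spec["tolerations"] = [dict(t) for t in tolerations]
    return spec


# Hypothetical values for a cluster that labels and taints its
# platform nodes with its own convention.
system_resources_node_selector = {"example.com/pool": "platform"}
system_resources_tolerations = [
    {
        "key": "example.com/pool",
        "operator": "Equal",
        "value": "platform",
        "effect": "NoSchedule",
    }
]

base_spec = {"containers": [{"name": "controller", "image": "example/image:tag"}]}
placed = apply_placement(
    base_spec, system_resources_node_selector, system_resources_tolerations
)
print(placed["nodeSelector"], placed["tolerations"])
```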
Custom Scheduler Support
- New custom_scheduler setting applies a custom schedulerName to all Saturn pod specs
- Required for clusters running Volcano, YuniKorn, KAI scheduler, or other custom schedulers that implement gang scheduling or fair-share policies
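A minimal sketch of what applying the setting amounts to, assuming it is a single string that is copied into each pod spec's schedulerName field. The helper name and container image are hypothetical, and "volcano" is used only as an example scheduler name.

```python
import copy


def with_scheduler(pod_spec, custom_scheduler=None):
    """Return a copy of pod_spec, setting schedulerName only when a
    custom scheduler is configured. Without it, Kubernetes falls back
    to its default scheduler."""
    spec = copy.deepcopy(pod_spec)
    if custom_scheduler:
        spec["schedulerName"] = custom_scheduler
    return spec


base_spec = {"containers": [{"name": "workspace", "image": "example/image:tag"}]}

scheduled = with_scheduler(base_spec, "volcano")
unchanged = with_scheduler(base_spec)
print(scheduled.get("schedulerName"), unchanged.get("schedulerName"))
```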
k0rdent Support
- Load balancer provider can now be configured separately from cloud provider, since k0rdent manages the Kubernetes cluster but a different cloud handles load balancers
- Traefik and SSH proxy services use loadBalancerProvider for LB annotations and network CIDR selection
Faster Pod Startup
- All pod security contexts now set fsGroupChangePolicy: OnRootMismatch, skipping the recursive chown on mounted volumes when permissions already match
- Significantly faster startup for workspaces with large PVCs (model weights, datasets)
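For reference, the relevant pod-level securityContext fields look roughly like this. fsGroup and fsGroupChangePolicy are standard Kubernetes fields; the helper name, the fsGroup value, and the container are illustrative assumptions rather than Saturn Cloud's actual values.

```python
def build_security_context(fs_group):
    """Pod-level securityContext that skips the recursive ownership
    change when the volume root already has the expected owner and
    permissions."""
    return {
        "fsGroup": fs_group,
        "fsGroupChangePolicy": "OnRootMismatch",
    }


pod_spec = {
    "securityContext": build_security_context(fs_group=100),  # hypothetical GID
    "containers": [{"name": "workspace", "image": "example/image:tag"}],
}
print(pod_spec["securityContext"])
```

With the Kubernetes default policy (Always), ownership and permissions are changed recursively on every mount, which is where the startup cost on large volumes comes from.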
Removed Pending Timeout
- Removed the 10-minute timeout that marked unschedulable pods as errored
- Pods now stay in PENDING status until Kubernetes actually schedules or fails them
- Prevents false failure states in clusters with autoscaling or custom schedulers where scale-up can take longer than 10 minutes
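The behavioural change can be pictured with a small status-mapping function. This is purely an illustration under assumed names: the function, the ERROR label, and the mapping of other phases are hypothetical, and only the PENDING status and the 10-minute threshold come from the notes above.

```python
def display_status(pod_phase, seconds_pending, pending_timeout=None):
    """Map a Kubernetes pod phase to a displayed status.

    pending_timeout=600 mimics the old 10-minute rule;
    pending_timeout=None mimics the behaviour after this release.
    """
    if pod_phase == "Pending":
        if pending_timeout is not None and seconds_pending > pending_timeout:
            return "ERROR"
        return "PENDING"
    if pod_phase == "Failed":
        return "ERROR"
    return pod_phase.upper()


# A pod that has waited 15 minutes for the autoscaler to add a node.
old_behaviour = display_status("Pending", 900, pending_timeout=600)
new_behaviour = display_status("Pending", 900)
print(old_behaviour, new_behaviour)
```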
Resource Templates
- New in_gallery parameter decouples gallery visibility from access level
- Operators can hide the resource creation bar for managed deployments where users should only launch from pre-defined templates
Resource Migration
- New admin tooling for migrating resource ownership between users and groups
- Handles dependent objects: secret attachments, external repo attachments, shared folders, and image tag access
Node Metrics
- kube-state-metrics now collects node-level metrics (instance type, region, zone, node role)
- Enables abuse monitoring and cost attribution at the node level
Trust Bundle Support
- Charts communicating with Pandora over mTLS now support custom CA trust bundles via pandora.tls.trustBundleSecretName
- Required for environments with private PKI
Performance
- Workspace list endpoint no longer holds database connections while making K8s API calls for IDE info, preventing connection pool exhaustion under load
- Stopped resources skip unnecessary database transactions for label computation
Bug Fixes
- Fixed SSH service name conflicts where the deployment subdomain could collide with the HTTP service name
- Fixed SSH username fallback for deployments and jobs without a subdomain
Infrastructure
- Database secrets consolidated into a single operator-managed atlas-secrets secret
- StorageClass creation is now optional (storageClass.create toggle) for operators providing their own
- Docker secrets script simplified: removed ECR Public-specific handling, added generic Docker registry support
- Removed cluster-info CronJob (functionality moved to operator)
- Atlas can load custom environment variables from an optional saturn-user-secrets secret