What It Takes to Build an ML Platform on Kubernetes

A use-case-by-use-case comparison of production-grade ML infrastructure on raw Kubernetes vs Saturn Cloud

Most AI teams need the same handful of workload types: dev workspaces, training jobs, scheduled pipelines, and model deployments. All of these can be built on raw Kubernetes. This page documents what that actually takes at production quality, then shows the Saturn Cloud equivalent.

This document is split into two parts:

  1. The Production Baseline: the cross-cutting infrastructure (authentication, RBAC, cost tracking, idle detection, audit logging, image management, etc.) that every workload depends on. This is the foundation you need before you deploy a single workspace or training job.
  2. The Workloads: six specific use cases (dev workspaces, multi-node training, scheduled jobs, Streamlit apps, LLM inference, model APIs), each showing the Kubernetes resources required and the Saturn Cloud equivalent.

Some teams adopt tools like JupyterHub, Kubeflow, or Argo Workflows to address parts of this. Those are valid options with their own operational requirements. This doc focuses on the underlying Kubernetes infrastructure that any solution needs to provide.

Part 1: The Production Baseline

Before you build any individual workload type, you need a production-grade foundation. The sections below cover eight cross-cutting concerns that apply to every use case in this document. None of them are optional if you have more than a handful of users. For each one, we describe what it takes on raw Kubernetes, then how Saturn Cloud handles it.

Authentication and SSO

Kubernetes has no built-in concept of a “user” in the way an ML platform needs one. It has ServiceAccounts for pods and OIDC token validation for API access, but neither gives you what you actually need: a login page where data scientists sign in with their corporate credentials and get access to their workspaces, jobs, and deployments.

To build this, you need to deploy an identity-aware proxy in front of every user-facing service. The most common approach is OAuth2 Proxy, which sits in front of your ingress and handles the OIDC flow with your identity provider (Okta, Azure AD, Google Workspace, etc.). You need to register an OAuth application with your IdP, configure redirect URIs for every service endpoint, manage session cookies, handle token refresh, and decide what happens when a token expires mid-session.

This is not a one-time setup. Every new service you expose (a new Streamlit app, a new API endpoint, a new workspace type) needs its own OAuth2 Proxy configuration or route registration. You need to decide whether to run one shared proxy or one per service, each with different tradeoffs around failure isolation and configuration complexity.

You also need a mapping between IdP identities and Kubernetes RBAC. When a user logs in via Okta, how does the system know which namespace they can access? This requires either a custom admission webhook that maps OIDC claims to Kubernetes RBAC bindings, or a custom controller that watches your IdP (via SCIM or API) and creates/deletes RoleBindings when users join or leave.

In Saturn Cloud, an auth-server component integrates with your IdP (Okta, Azure AD, or Google Workspace) and issues RS256-signed JWTs. Traefik’s ForwardAuth middleware validates the JWT on every request and injects the user identity into headers before the request reaches the workload. All routes go through the same ForwardAuth chain, so adding a new deployment or workspace does not require any additional auth configuration.
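As a sketch, a Traefik ForwardAuth middleware of the kind described looks like the following. The service name, port, path, and header name are illustrative placeholders, not Saturn Cloud's actual configuration:

```yaml
# Hypothetical ForwardAuth middleware: Traefik calls the auth service on
# every request and forwards the listed response headers to the workload.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: forward-auth
  namespace: platform
spec:
  forwardAuth:
    address: http://auth-server.platform.svc.cluster.local:8080/auth  # placeholder
    authResponseHeaders:
      - X-Auth-User   # identity header injected for the workload
```

Every route that references the same middleware shares one auth chain, which is what removes the per-service OAuth2 Proxy configuration described above.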

RBAC and Multi-Tenancy

When multiple users share a Kubernetes cluster, you need to ensure that Alice cannot see Bob’s workspaces, read Bob’s environment variables, or accidentally delete Bob’s training job. Kubernetes provides the primitives for this (Namespaces, Roles, RoleBindings, NetworkPolicies, ResourceQuotas), but assembling them into a working multi-tenant system is a significant engineering project.

The typical pattern is one namespace per user or per team. For each namespace, you need:

  • A RoleBinding granting the user access to their own namespace
  • NetworkPolicies that block traffic from other user namespaces (by default, all pods in a Kubernetes cluster can communicate with all other pods)
  • ResourceQuotas to prevent a single user from consuming all GPUs or memory in the cluster
  • LimitRanges to set default resource requests and limits so users cannot create pods without resource constraints

You also need a controller that creates all of this when a new user is provisioned and cleans it up when they leave. None of this is handled automatically by Kubernetes. If you skip the NetworkPolicies, any user can curl another user’s JupyterLab instance. If you skip the ResourceQuotas, one runaway training job can starve the entire cluster.
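A minimal sketch of the per-namespace isolation pieces (namespace name and quota numbers are illustrative):

```yaml
# Allow ingress only from pods in the same namespace; traffic from other
# user namespaces is dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: user-alice
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
  ingress:
  - from:
    - podSelector: {}
---
# Cap what a single user can request across their namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: user-quota
  namespace: user-alice
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 128Gi
    requests.nvidia.com/gpu: "4"
```

Your provisioning controller has to stamp out both of these (plus the RoleBinding and LimitRange) for every new user.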

Beyond basic RBAC, most organizations need policy enforcement that Kubernetes RBAC cannot express. For example: “users can only select from approved base images,” “GPU workspaces must have idle detection enabled,” or “jobs cannot run for more than 72 hours.” These require a policy engine like OPA/Gatekeeper or Kyverno, which is another component to deploy, configure, and maintain.

Saturn Cloud does not rely on Kubernetes RBAC for isolation. Users never interact with the Kubernetes API directly. Instead, the control plane (Atlas) enforces isolation at the application layer: when a user makes a request, Atlas only returns and acts on resources that belong to that user. At the network level, Saturn Cloud applies Kubernetes NetworkPolicies to every user workload, so pods belonging to one user cannot reach pods belonging to another. Resource quotas (GPU hours, CPU hours, memory) are configured per user and per project in Atlas. Policy enforcement (approved base images, maximum instance sizes, idle detection requirements) is also enforced by Atlas before a workload is created, not by Kubernetes admission controllers.

TLS and Certificate Management

Every user-facing endpoint needs TLS. This means deploying cert-manager, configuring ClusterIssuers for Let’s Encrypt (or your internal CA), and ensuring certificate renewal works reliably. cert-manager is well-maintained software, but it still requires:

  • DNS challenge solvers configured for your DNS provider (Route53, Cloud DNS, Azure DNS, etc.)
  • Monitoring for certificate expiration (cert-manager can fail silently if DNS challenges stop working)
  • Wildcard certificates or per-service certificates (wildcard is simpler but some security policies prohibit it)
  • Handling the edge case where cert-manager’s webhook is unavailable during a cluster upgrade, which blocks all Ingress changes

This is not the hardest problem on this list, but it is another component that needs monitoring, upgrades, and occasional debugging when certificates fail to renew.
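For reference, a Let's Encrypt ClusterIssuer with a Route53 DNS-01 solver looks roughly like this (the account email and region are placeholders, and the solver's IAM permissions are assumed to be configured separately):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com   # placeholder
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
    - dns01:
        route53:
          region: us-east-1   # solver needs IAM access to the hosted zone
```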

Saturn Cloud deploys cert-manager as one of its infrastructure components, with ClusterIssuers and DNS challenge solvers pre-configured for the cluster’s cloud provider. Certificate renewal happens automatically. If renewal fails, the platform’s monitoring surfaces the failure before certificates expire.

Cost Tracking and Chargeback

When GPUs cost $2-30+/hr, “who is using what” is not a nice-to-have. Finance and management need to know which team or project is responsible for each dollar of compute spend. Kubernetes does not track this.

The most common approach is deploying Kubecost or OpenCost, which watch pod resource usage and map it to cost using cloud provider pricing data. This gives you cluster-wide cost visibility, but mapping costs to individual users or projects requires consistent labeling of all pods, namespaces, and nodes, plus custom dashboards (usually Grafana) that aggregate the data in ways your finance team can understand.

The labeling requirement is deceptively hard. Every pod, job, and deployment must be labeled with the user, team, and project that owns it. If any workload is unlabeled (which happens the moment someone creates a pod via kubectl instead of your platform), it shows up as “unallocated” in your cost reports and you lose visibility.

You also need to decide the granularity. Do you track by namespace? By pod label? By node pool? Each choice has different accuracy and implementation complexity. And you need historical retention, because finance wants month-over-month trends, not just current spend.
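The labeling discipline this implies: every workload your platform creates must carry owner metadata, along the lines of the sketch below. The label keys are a convention you would define yourself, not something Kubecost or OpenCost mandates:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-llm-finetune
  namespace: user-alice
  labels:
    owner: alice            # cost rolls up to this user
    team: nlp
    project: llm-finetune   # and to this project
spec:
  containers:
  - name: trainer
    image: your-registry/training:latest
```

Any pod created outside your platform tooling will be missing these labels and land in the "unallocated" bucket.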

Saturn Cloud records hourly usage for every workload: duration, owning user, owning project, and hardware consumed (CPU, GPU type, GPU count, memory). Because Atlas creates all workloads, every record is automatically tagged with the correct user and project. There is no labeling discipline to enforce. This data is available in reports within the platform and can be exported to a custom Snowflake view if you want to integrate it with your existing BI or finance tooling. Usage limits and quotas can be configured per user, team, or project, and Atlas enforces them at workload creation time (you cannot launch a workspace that would exceed your quota).

Idle Resource Detection

A data scientist starts a JupyterLab workspace with a 4xA100 GPU instance on Monday morning, runs some experiments, then goes to lunch. Then goes to a meeting. Then goes home. The workspace is still running on Tuesday morning, burning $100+/hr. This happens constantly.

Kubernetes has no built-in concept of “idle.” The scheduler knows if a pod is running, but not if anyone is actually using it. To detect idle workspaces, you need a custom controller that:

  1. Monitors CPU utilization, network I/O, and (optionally) Jupyter kernel activity per pod
  2. Applies a configurable threshold (e.g., “less than 5% CPU and no network traffic for 30 minutes”)
  3. Sends a warning to the user (via what channel? You need to build that too)
  4. Shuts down the workspace after a grace period
  5. Handles edge cases: a long-running training job has low CPU but is actively using the GPU. A data scientist is reading documentation in their browser but hasn’t touched the kernel. An ETL job has low CPU during an S3 download phase

This controller does not exist as an off-the-shelf solution. You write it yourself, deploy it, monitor it, and handle the support tickets when it incorrectly shuts down someone’s workspace during a long download.

Saturn Cloud’s idle detection works at the proxy layer. All user traffic (HTTP, SSH, WebSocket) passes through a proxy before reaching the workload. If no traffic reaches the proxy for a configurable period, the workspace is shut down. This avoids the edge cases that plague pod-level monitoring: the proxy does not care whether the pod has high or low CPU usage, only whether a user is actively interacting with it.

Secret Management

Users need access to cloud credentials (S3 buckets, database connections, API keys) without those credentials being visible to other users or hardcoded in images. Kubernetes Secrets are the basic primitive, but production requires more:

  • External Secrets Operator or Sealed Secrets to sync secrets from your source of truth (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault) into Kubernetes Secrets
  • KMS encryption for secrets at rest (enabled via the EncryptionConfiguration on the API server, which is already handled by most managed Kubernetes services)
  • Per-pod IAM roles so workloads can access cloud resources without long-lived credentials. On AWS this is IRSA (IAM Roles for Service Accounts), on GCP it is Workload Identity, on Azure it is Managed Identity. Each requires configuring a trust relationship between the Kubernetes service account and the cloud IAM role, per user or per workload
  • Secret rotation: when a credential is rotated, all pods using it need to pick up the new value. Kubernetes does not restart pods when a mounted Secret changes (unless you use a sidecar like Reloader)
  • Audit trail: who accessed which secret, when? Kubernetes audit logging captures API calls but not the actual secret values read by pods
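With External Secrets Operator, for example, syncing a credential from AWS Secrets Manager into a user namespace looks roughly like this (store and key names are placeholders):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: user-alice
spec:
  refreshInterval: 1h          # re-sync periodically so rotations propagate
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager  # placeholder store name
  target:
    name: db-credentials       # resulting Kubernetes Secret
  data:
  - secretKey: password
    remoteRef:
      key: prod/alice/db-password
```

Note that the refresh only updates the Secret object; pods with the value mounted as an environment variable still need a restart (or a tool like Reloader) to pick it up.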

In Saturn Cloud, secrets are stored as Kubernetes Secrets encrypted at rest by the cloud provider’s KMS. Users configure environment variables and secrets through the UI or API, and Atlas injects them into only that user’s pods. For cloud resource access, Saturn Cloud configures per-pod IAM roles (IRSA on AWS, Workload Identity on GCP, Managed Identity on Azure), so workloads authenticate to cloud services without long-lived credentials. Secret rotation is handled by updating the secret in Saturn Cloud and restarting the affected workloads.

Audit Logging

For compliance and security, you need to know: who launched what workload, when, with what resources, and what it did. Kubernetes audit logging captures API server events (pod created, secret read, deployment scaled), but the raw audit log is extremely verbose and not useful without processing.

A production audit system requires:

  • Kubernetes audit policy configured to capture the events you care about at the right verbosity level (too verbose and you drown in noise, too quiet and you miss important events)
  • Log aggregation: an EFK stack (Elasticsearch, Fluent Bit, Kibana) or similar (Grafana Loki, Datadog, Splunk) to collect, store, and query logs from all pods and the audit log
  • Retention policies: audit logs and pod logs need different retention periods (audit logs for compliance may need 1+ years, pod logs for debugging may only need days)
  • Storage management: Elasticsearch indices grow fast. You need index lifecycle management, rollover policies, and monitoring for disk usage
  • Access control on logs: users should be able to see their own pod logs but not other users' logs. Kibana has no concept of Kubernetes RBAC, so you either build a custom log viewer or configure Kibana multi-tenancy (which is its own project)
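A starting-point audit policy that captures workload lifecycle events without logging secret payloads might look like the sketch below. The rules are illustrative, and on some managed Kubernetes services the audit policy is fixed by the provider and cannot be customized:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record who touched Secrets, but never their contents.
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
# Full request bodies for workload creation and deletion.
- level: RequestResponse
  verbs: ["create", "delete"]
  resources:
  - group: "apps"
    resources: ["deployments", "statefulsets"]
  - group: "batch"
    resources: ["jobs", "cronjobs"]
# Drop high-volume system noise.
- level: None
  users: ["system:kube-proxy"]
```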

Saturn Cloud deploys Elasticsearch and Fluent Bit as infrastructure components. Fluent Bit runs as a DaemonSet and forwards all pod logs to Elasticsearch. When a user views logs in the platform UI, Atlas filters the results to only show logs from that user’s workloads, so there is no need for a separate log access control layer. For runtime security auditing, the cluster is compatible with tools like Falco and CrowdStrike, which can be deployed alongside Saturn Cloud’s components.

Image Management

Data scientists need custom environments: specific Python packages, CUDA versions, system libraries, pre-loaded model weights. This means building and managing container images.

On raw Kubernetes, you need:

  • A private container registry (ECR, ACR, GCR, Docker Hub, or self-hosted Harbor)
  • Image pull secrets propagated to every namespace where users run workloads (if a new user namespace does not have the pull secret, their pods fail with ImagePullBackOff)
  • A CI pipeline for building images. Data scientists submit a Dockerfile or requirements.txt, and something needs to build the image, push it to the registry, and make it available. This is typically a Jenkins, GitHub Actions, or GitLab CI pipeline, but GPU-enabled builds (needed for compiling CUDA extensions like Flash Attention) require GPU runners, which most CI systems do not provide
  • Image scanning for vulnerabilities (Trivy, Snyk, or your registry’s built-in scanner)
  • Image garbage collection to prevent the registry from growing unbounded
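A sketch of the CI piece in GitHub Actions syntax (the registry, repo layout, and trigger paths are placeholders). Note the runner is CPU-only, which is exactly where CUDA-extension builds fail:

```yaml
name: build-user-image
on:
  push:
    paths: ["environments/**"]
jobs:
  build:
    runs-on: ubuntu-latest   # CPU-only; compiling Flash Attention needs a GPU runner
    steps:
    - uses: actions/checkout@v4
    - name: Build and push
      env:
        REGISTRY: your-registry.example.com   # placeholder
      run: |
        docker build -t "$REGISTRY/user-env:${{ github.sha }}" environments/
        docker push "$REGISTRY/user-env:${{ github.sha }}"
```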

Saturn Cloud solves this at two levels. Docker-in-Docker is exposed inside workspaces, so users who know Docker can build and push images directly from their dev environment. For users who do not know Docker (which is most data scientists), there is a separate image build tool. Users specify their environment using requirements.txt, environment.yaml, lists of apt packages, and/or arbitrary bash scripts. The build tool assembles these into a Docker image, runs the build (with GPU access for compiling CUDA extensions like Flash Attention), and pushes the result to the configured registry (ECR, ACR, GCR, Docker Hub, or any Docker-compatible registry). Image pull secrets are created and propagated by Atlas, so users never encounter ImagePullBackOff errors from missing credentials.

The Baseline Is the Hard Part

The individual workloads in Part 2 are relatively straightforward Kubernetes resources (Pods, Jobs, Deployments). The production baseline described above is where most of the engineering time goes. A StatefulSet for a JupyterLab workspace is 30 lines of YAML. The auth, RBAC, cost tracking, idle detection, and audit infrastructure around it is months of engineering work.

Maintenance Outpaces the Initial Build

Building the baseline is a one-time project that you can estimate and staff for. What’s harder to anticipate is the ongoing maintenance, which in most organizations ends up consuming more engineering time than the initial build did. Here’s what that looks like in practice.

Kubernetes version upgrades. Kubernetes releases a new minor version roughly every four months, and each version is only supported for about 14 months. Upgrading means testing every custom controller, admission webhook, CRD, and operator against the new version. API versions get deprecated and removed (PodSecurityPolicy was removed in 1.25, for example). If your idle detection controller uses a beta API that graduates or changes, you need to update it. If you defer upgrades, you fall out of support and stop receiving security patches. Most teams find that Kubernetes upgrades alone consume a significant amount of engineering time each quarter.

Component version upgrades. Each tool in your stack has its own release cycle. cert-manager, OAuth2 Proxy, Prometheus, Elasticsearch, Fluent Bit, the NVIDIA device plugin, Kubecost, Volcano, and any operators you run all release independently. Some of these releases contain breaking changes. Prometheus 2.x to 3.x changed the storage format. Elasticsearch major versions require index migrations. The NVIDIA device plugin needs to stay in sync with your GPU driver versions. You can defer these upgrades, but the longer you wait, the harder they get, and you accumulate known vulnerabilities in the meantime.

Security patches and CVEs. When a critical CVE drops in an ingress controller, container runtime, or base image, someone on your team needs to triage it (does it affect us?), patch it (rebuild images, update Helm values), test it (does the patch break anything?), and roll it out (across all environments, with rollback plans). This is not a scheduled activity. It happens when it happens, and critical CVEs do not wait for your next sprint. The more components in your stack, the more CVEs you need to track.

Scaling failures. Infrastructure that works at one scale breaks at another, and the breakpoints are hard to predict. The idle detection controller that worked fine for 20 users starts missing pods at 200 because it is polling the Kubernetes API too aggressively and getting rate-limited. The Elasticsearch cluster that stored a month of logs with headroom runs out of disk at six months because log volume grew with the user base. The OAuth2 Proxy that handled 50 concurrent sessions starts timing out at 500 because the session store was never designed for that load. Each of these is a debugging session that pulls an engineer away from other work for days.

On-call burden. When a GPU node fails at 2am and a multi-node training job is stuck with three nodes running and one in CrashLoopBackOff, someone needs to debug it. When cert-manager stops renewing certificates because a DNS provider API changed and TLS certs expire on a Friday afternoon, someone needs to fix it before every user-facing service goes down. When Elasticsearch runs out of disk and logging stops, someone needs to expand the volume, clean up old indices, and figure out why the retention policy did not trigger. These are not hypothetical scenarios. They are the normal operational reality of running a multi-component platform.

User support. Data scientists will open tickets: “my workspace won’t start,” “my training job is stuck in Pending,” “my image build failed,” “I can’t access my deployment.” Each of these requires someone who understands both the platform internals and Kubernetes to diagnose. A workspace that won’t start could be a ResourceQuota limit, a node scheduling issue, an image pull failure, a PVC binding problem, or a misconfigured network policy. Debugging these is not difficult for an experienced engineer, but it is time-consuming, and it scales linearly with your user count.

The compounding effect. None of these individually is unmanageable. The problem is that they all happen concurrently, they are unpredictable, and they compete for the same engineers' time. A team that planned to spend 20% of their time maintaining the platform often finds that maintenance expands to 50% or more, leaving less time for the infrastructure work that actually differentiates the business.

With Saturn Cloud, Kubernetes version upgrades, component updates, security patches, and scaling are handled by the Saturn Cloud team as part of the deployment. User support tickets about workspaces, jobs, and deployments go through Saturn Cloud’s support, not your on-call rotation.

Part 2: The Workloads

With the production baseline in place (authentication, RBAC, TLS, cost tracking, idle detection, secrets, audit logging, and image management), you can start deploying actual workloads. The six use cases below each follow the same structure: what you need, what it takes on raw Kubernetes (assuming the baseline already exists), and the Saturn Cloud equivalent.

GPU Dev Workspaces

What you need: JupyterLab (or similar) with configurable CPU, memory, and GPU allocation. Persistent home directory that survives restarts. SSH access for VS Code and PyCharm remote development.

On Kubernetes

Each user needs a dedicated pod with a PersistentVolumeClaim for their home directory. A Deployment is a poor fit because it does not manage per-pod PVCs and a rolling update can briefly run two pods contending for the same ReadWriteOnce volume, so you either use a StatefulSet per user or write a custom controller.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: workspace-alice
  namespace: user-alice
spec:
  serviceName: workspace-alice
  replicas: 1
  selector:
    matchLabels:
      app: workspace
      user: alice
  template:
    metadata:
      labels:
        app: workspace
        user: alice
    spec:
      containers:
      - name: jupyter
        image: your-registry/jupyter-gpu:latest
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            cpu: "4"
            memory: 16Gi
        volumeMounts:
        - name: home
          mountPath: /home/alice
  volumeClaimTemplates:
  - metadata:
      name: home
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3
      resources:
        requests:
          storage: 100Gi

Additional infrastructure required:

  • NVIDIA device plugin DaemonSet on every GPU node, plus node labels for GPU type selection
  • Ingress per workspace with auth middleware (OAuth2 Proxy sidecar or annotation-based). Each user’s JupyterLab needs its own route
  • SSH access requires either an SSH bastion with a custom routing layer that maps users to pods, or a NodePort Service per workspace (which does not scale and exposes ports externally)
  • Idle detection requires a custom controller. No Kubernetes primitive monitors whether a JupyterLab session is actively in use. You need to poll CPU and network counters per pod and delete or scale-to-zero the StatefulSet after a configurable timeout. On GPU instances costing $2-30+/hr, idle detection is not optional
  • Workspace lifecycle management: provisioning new workspaces, deleting old ones, letting users pick hardware specs. This is a custom web application or API on top of the Kubernetes API
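For the per-workspace ingress-plus-auth piece, each user needs something like the following (NGINX Ingress annotations shown; the hostname, auth endpoints, and the `workspace-alice` Service fronting the StatefulSet are assumed to exist):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: workspace-alice
  namespace: user-alice
  annotations:
    # Delegate authentication to a shared OAuth2 Proxy instance.
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.example.com/oauth2/start?rd=$escaped_request_uri"
spec:
  rules:
  - host: alice.workspaces.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: workspace-alice   # assumed Service for the StatefulSet
            port:
              number: 8888
```

Note this only gates who can reach the endpoint; without an additional claims check, any authenticated user can reach any workspace route.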

With Saturn Cloud

A user creates a Python Server resource and selects the hardware they need (CPU, memory, GPU type, GPU count) through the UI, API, or a YAML recipe. Atlas creates the pod with the requested resources, attaches a PersistentVolume for the home directory, and creates the Ingress route with ForwardAuth. SSH access goes through a dedicated SSH Proxy component that routes connections to the correct pod through a cloud load balancer. Idle detection works at the proxy layer: if no HTTP, SSH, or WebSocket traffic reaches the workspace for a configurable period, Atlas shuts it down. When the user starts the workspace again, their home directory is still there because the PV persists independently of the pod.

Multi-Node PyTorch Training

What you need: N nodes, each with M GPUs, running torchrun with the correct MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and NODE_RANK environment variables injected into every pod.

On Kubernetes

Kubernetes Indexed Jobs (stable since Kubernetes 1.24) give you a JOB_COMPLETION_INDEX per pod, but you still need to solve the networking and environment-variable problem yourself.

apiVersion: batch/v1
kind: Job
metadata:
  name: training-run-42
spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      subdomain: training-run-42
      containers:
      - name: trainer
        image: your-registry/training:latest
        env:
        - name: MASTER_ADDR
          value: "training-run-42-0.training-run-42.default.svc.cluster.local"
        - name: MASTER_PORT
          value: "29500"
        - name: WORLD_SIZE
          value: "4"
        - name: NODE_RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        resources:
          limits:
            nvidia.com/gpu: 8
      restartPolicy: Never
---
apiVersion: v1
kind: Service
metadata:
  name: training-run-42
spec:
  clusterIP: None
  selector:
    job-name: training-run-42

This gets you the basic structure, but production requires more:

  • Gang scheduling: all N pods must be scheduled simultaneously. If the cluster only has capacity for 3 of 4 nodes, the default scheduler will start 3 and leave 1 pending. The 3 running pods burn GPU time waiting. You need Volcano, Coscheduling, or a similar scheduler plugin for all-or-nothing scheduling
  • NCCL configuration: inter-node GPU communication requires NCCL_SOCKET_IFNAME set to the correct network interface. Network policies must allow traffic on the NCCL port range between pods. Some cloud providers require additional configuration for RDMA/EFA
  • Coordinated failure handling: if one node fails mid-training, the remaining nodes must be terminated and the entire job restarted from the last checkpoint. The native Job controller does not coordinate this. You either write a custom controller or use the Kubeflow Training Operator
  • DNS resolution timing: the headless Service needs time to propagate DNS records for all pods. If rank-0 starts torchrun before other pods are DNS-resolvable, initialization fails. You need an init container or startup script that polls DNS
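With Volcano, for example, gang scheduling means adding a PodGroup and pointing the Job's pods at the Volcano scheduler. A sketch, with resource numbers matching the 4-node, 8-GPU example above:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-run-42
spec:
  minMember: 4            # all-or-nothing: schedule all 4 pods or none
  minResources:
    nvidia.com/gpu: "32"  # 4 nodes x 8 GPUs
```

The Job's pod template then sets `schedulerName: volcano` and the annotation `scheduling.k8s.io/group-name: training-run-42` so the pods are scheduled as a unit.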

With Saturn Cloud

A user creates a Job resource with an instance_count parameter (e.g., 4 for a 4-node training run). Atlas creates all pods with a headless Service for DNS resolution and injects the environment variables that torchrun needs: SATURN_JOB_RANK (the node index, 0 through N-1), SATURN_JOB_LEADER (the DNS address of the rank-0 pod), SATURN_RESOURCE_INSTANCE_COUNT (the number of instances in the job), SATURN_INTERNAL_SUBDOMAIN, and SATURN_NAMESPACE. NetworkPolicies are configured to allow all traffic between pods in the same job. The user’s start command calls torchrun with these variables:

torchrun \
  --nproc_per_node=8 \
  --nnodes=$SATURN_RESOURCE_INSTANCE_COUNT \
  --node_rank=$SATURN_JOB_RANK \
  --master_addr=$SATURN_JOB_LEADER \
  --master_port=29500 \
  train.py

For more detail, see the parallel training guide.

Scheduled Jobs (ETL and Training Pipelines)

What you need: Run containers on a schedule (cron) or triggered via API. Handle failures with retries. Retain logs. Clean up completed pods.

On Kubernetes

The native CronJob resource handles simple scheduling:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 5
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          containers:
          - name: etl
            image: your-registry/etl-pipeline:latest
          restartPolicy: Never

This works for simple cases. Production problems emerge at scale:

  • No centralized job dashboard: Kubernetes has no built-in UI showing job history, success/failure rates, duration trends, or log access across all jobs. You build this yourself or deploy a tool like Argo
  • Pod accumulation: completed and failed Job pods remain in the cluster. failedJobsHistoryLimit and successfulJobsHistoryLimit help, but at high volume (hundreds of jobs per day), you need a cleanup controller
  • Log access for users: data scientists need to read logs from their jobs without kubectl access. This requires an EFK stack plus a log viewer UI, or forwarding logs to a centralized system
  • API-triggered jobs: CronJobs only support schedules. For on-demand execution (triggered by a CI pipeline or API call), you need a service that creates Job resources via the Kubernetes API, which means building an API server with authentication
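One partial mitigation for pod accumulation is built in: the TTL-after-finished controller deletes a finished Job (and its pods) after a delay, complementing rather than replacing the history limits. A sketch (the 24-hour TTL is an example value):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 86400   # delete the Job and its pods 24h after it finishes
      backoffLimit: 3
      template:
        spec:
          containers:
          - name: etl
            image: your-registry/etl-pipeline:latest
          restartPolicy: Never
```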

With Saturn Cloud

A user creates a Job resource with a cron schedule, an API trigger, or both. Atlas creates the Kubernetes Job when the schedule fires or the API is called, and cleans up completed pods automatically. Failed jobs are retried according to a configurable backoff policy. Job logs are collected by Fluent Bit and stored in Elasticsearch, and users view their own job logs through the platform UI without needing kubectl access. Jobs can also be triggered programmatically via the Saturn Cloud HTTP API or the saturn-client CLI, which means they can be integrated into existing CI/CD pipelines or called from other jobs.

Streamlit App Deployments

What you need: A long-running Streamlit application behind authentication, with TLS, accessible to authorized users.

On Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dashboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dashboard
  template:
    metadata:
      labels:
        app: dashboard
    spec:
      containers:
      - name: streamlit
        image: your-registry/dashboard:latest
        command: ["streamlit", "run", "app.py",
                  "--server.port=8501",
                  "--server.headless=true"]
        ports:
        - containerPort: 8501
        livenessProbe:
          tcpSocket:
            port: 8501
          initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: dashboard
spec:
  ports:
  - port: 8501
    targetPort: 8501
  selector:
    app: dashboard

Additional production concerns:

  • WebSocket support: Streamlit uses WebSockets for real-time updates. Nginx Ingress requires proxy-read-timeout and proxy-send-timeout annotations. Traefik handles WebSockets natively but needs configuration for sticky sessions if you run multiple replicas
  • Authentication: Streamlit has no built-in auth. You need an OAuth2 Proxy sidecar or ingress-level auth middleware in front of the app
  • Health checks: Streamlit does not expose a /health endpoint. You are limited to TCP socket probes, which only verify the port is open, not that the application is healthy
  • Code updates: when a developer pushes a new version, you need a CI/CD pipeline that rebuilds the image, pushes it to the registry, and triggers a rollout. This is standard but it is another pipeline to maintain per application
  • Per-app ingress and TLS: each Streamlit app needs its own Ingress resource with a hostname or path, plus a TLS certificate (or a wildcard cert)
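The per-app routing, WebSocket, and TLS pieces combine into an Ingress like the sketch below (NGINX Ingress annotations; hostname, TLS secret, and timeout values are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dashboard
  annotations:
    # Keep Streamlit's long-lived WebSocket connections open.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  tls:
  - hosts: ["dashboard.example.com"]
    secretName: dashboard-tls    # issued by cert-manager
  rules:
  - host: dashboard.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: dashboard
            port:
              number: 8501
```

Authentication still has to be layered on separately (for example with the auth-url annotations used for workspaces), since Streamlit itself has none.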

With Saturn Cloud

A user creates a Deployment resource and sets the start command to streamlit run app.py --server.port 8000. Atlas creates the Kubernetes Deployment, Service, and IngressRoute. Traefik’s ForwardAuth middleware handles authentication using the same SSO chain as the rest of the platform, so there is no per-app OAuth2 Proxy to configure. TLS is handled by the existing cert-manager setup. The deployment can pull code from a git repository, pinned to a specific branch or tag. Production deployments are typically pinned to a release tag, while development deployments track a branch. Updating which branch or tag a deployment uses can be done through the UI, the API, or your existing CI/CD pipeline. No image rebuild is required for code-only changes.

LLM Inference (vLLM, TGI)

What you need: Serve a large language model with GPU acceleration, model weight caching, and the ability to scale based on request load.

On Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model=meta-llama/Llama-3.1-70B-Instruct"
        - "--tensor-parallel-size=4"
        - "--port=8000"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: 256Gi
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
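
The Deployment above mounts a model-cache-pvc that must exist beforehand. A minimal definition might look like the following; the storage class is a placeholder, and ReadWriteMany (e.g., via an NFS-backed class) is only needed if multiple replicas share the cache:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
  - ReadWriteMany            # required if more than one replica mounts the cache
  storageClassName: nfs      # placeholder: use a class available in your cluster
  resources:
    requests:
      storage: 300Gi         # ~140GB of fp16 weights plus headroom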

Production concerns specific to LLM serving:

  • Model weight management: a 70B parameter model is ~140GB in fp16. You have three options: bake weights into the image (very large, slow to pull, hard to update), download at startup from S3 or HuggingFace (adds minutes to startup time, needs an init container), or use a PVC with pre-loaded weights (fastest startup, but you need to manage the PVC lifecycle and pre-populate it). Each approach has tradeoffs around cold start time, storage cost, and operational complexity
  • Autoscaling on request queue depth: Kubernetes HPA only works with CPU/memory by default. Scaling based on pending requests or token generation throughput requires the Prometheus Adapter, a custom metrics API registration, and Prometheus scraping the vLLM /metrics endpoint. This is 3-4 additional components to deploy and configure
  • Graceful shutdown: LLM inference requests can take 30+ seconds (long generation, streaming). You need terminationGracePeriodSeconds set high enough that in-flight requests complete before the pod is killed. The default 30 seconds is often too short
  • GPU memory management: vLLM and TGI pre-allocate GPU memory for the KV cache. If resource limits are misconfigured, the pod OOM-kills or the KV cache is undersized, reducing throughput. This requires benchmarking per model and GPU type

With Saturn Cloud

A user creates a Deployment resource and selects GPU hardware (e.g., 4xA100 or 4xH100). The start command runs vLLM or TGI. Model weights can be downloaded via a startup script, loaded from a shared NFS volume, or baked into a custom image built with the platform’s image builder. For common open-source models (Llama, Mistral, Qwen, DeepSeek, etc.), Saturn Cloud provides templates with pre-configured vLLM flags, quantization settings, and GPU memory allocation, so a user can deploy a working LLM endpoint without tuning these parameters from scratch. Atlas creates the Deployment, Service, and IngressRoute. Authentication goes through the same ForwardAuth chain. Multiple replicas can be configured for scaling.

Model Serving APIs

What you need: A REST API (FastAPI, Flask, or similar) serving model predictions. Load balanced, with health checks, autoscaling, and zero-downtime deployments.

On Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
      - name: api
        image: your-registry/model-api:latest
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 20
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            memory: 8Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

This is the most straightforward use case on raw Kubernetes, but production still requires:

  • Service + Ingress: a ClusterIP Service, an Ingress or IngressRoute with TLS, and DNS configuration. Each API endpoint needs its own route
  • Rolling update tuning: maxUnavailable: 0 ensures zero downtime during rollouts, but only if the readiness probe accurately reflects when a new pod is ready to serve traffic. If your model takes 30 seconds to load into memory, initialDelaySeconds (or a startup probe) must account for this
  • Load testing and capacity planning: you need to benchmark the API under expected load to determine appropriate replica counts and resource requests. Under-provisioned requests cause throttling; over-provisioned requests waste resources
  • Canary deployments (optional): if you want to gradually shift traffic to a new model version, you need a service mesh (Istio, Linkerd) or Traefik's weighted traffic splitting (a TraefikService). Either adds operational complexity
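
As a sketch of the canary option above, Traefik can split traffic with a weighted TraefikService; the v1/v2 Service names and the 90/10 split are hypothetical:

apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: model-api-canary
spec:
  weighted:
    services:
    - name: model-api-v1   # current production Service
      port: 8000
      weight: 90
    - name: model-api-v2   # canary Service for the new model version
      port: 8000
      weight: 10

Your IngressRoute then references this resource (kind: TraefikService) instead of pointing at a plain Service, and you adjust the weights as confidence in the new version grows.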

With Saturn Cloud

A user creates a Deployment resource with a start command that runs their API server on port 8000. Atlas creates the Kubernetes Deployment with a rolling update strategy, a ClusterIP Service, and an IngressRoute through Traefik. Health check endpoints are configurable per deployment. Multiple replicas can be set for high availability, and Atlas configures the Deployment’s rolling update to maintain availability during updates. Code changes propagate through git integration (branch or tag pinning, same as Streamlit deployments) or image rebuilds.

Further Reading