Architecture

Technical architecture of Saturn Cloud Enterprise

Saturn Cloud is a Kubernetes-native MLOps platform that provides the infrastructure layer AI teams need to develop, train, and deploy models. This document covers the system architecture, components, and design patterns.

System Overview

Saturn Cloud runs on any compliant Kubernetes cluster (EKS, AKS, GKE, OCI, or self-managed). The architecture uses a Kubernetes operator pattern to deploy and manage 14 independent components that together provide development workspaces, job orchestration, model deployments, and distributed computing infrastructure.

All container images are hosted in the public registry at public.ecr.aws/saturncloud. The platform source code is available in two primary repositories: saturn-k8s contains the Helm charts for all components, and saturncloud-reference-terraform provides example Terraform configurations for various cloud providers.

Core Architecture Layers

1. Operator Orchestration Layer

The Saturn Helm Operator is the Saturn Cloud installer. It manages the complete platform lifecycle through Kubernetes Custom Resource Definitions (CRDs). Built with the Operator SDK using the Helm-based pattern, it provides declarative configuration and automated reconciliation.

The operator manages 14 distinct components through CRDs: Atlas (application control plane), AuthServer (JWT-based authentication), Traefik (HTTP/HTTPS ingress), SSHProxy (SSH gateway for IDE access), Monitoring (Prometheus-based metrics), Logging (Elasticsearch/Fluent Bit stack), ClusterSetup (GPU drivers and storage classes), NetworkFilesystem (NFS/FSx integration), Calico (network policies), ClusterAutoscaler (node scaling), CoreDNS (custom DNS), Pandora (cost allocation and billing), CertManager (TLS certificate management), and HttpReqWebhook (ACME DNS challenge solver). The last six components are optional and typically disabled on managed Kubernetes services where the cloud provider supplies equivalent functionality.
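As an illustration of the declarative model, a component CRD instance might look like the sketch below. The API group, version, and spec fields here are hypothetical; the authoritative schemas are the CRDs registered by the operator and shipped in the saturn-k8s charts.

```yaml
# Hypothetical component CRD instance -- the actual API group, version,
# and spec fields are defined by the operator's registered CRDs.
apiVersion: install.saturncloud.io/v1alpha1
kind: Traefik
metadata:
  name: traefik
  namespace: saturn-operator
spec:
  enabled: true
  loadBalancerProvider: aws   # decouples LB choice from the cluster provider
```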

2. Application Control Plane (Atlas)

Atlas is the primary orchestrator for all user-facing workloads. It manages the lifecycle of development workspaces (JupyterLab and RStudio servers), batch jobs, and long-running model deployments. The service handles custom Docker image building through Docker-in-Docker with GPU support, maintains workspace snapshots, and integrates with cloud storage systems including S3, Azure Blob, and GCS.

The Atlas application is written in Python and backed by a PostgreSQL database.

Storage is handled through three mechanisms. PostgreSQL data lives on a persistent volume, workspace snapshots are stored in object storage (S3/Azure Blob/GCS), and ephemeral workspace storage uses cloud-native block storage (EBS/Azure Disk/GCE PD).

3. Authentication & Authorization

The authentication layer uses JWT tokens signed with RS256 (RSA 2048-bit keys). The system handles token generation, validation, refresh, and revocation. Signing keys are stored in Kubernetes secrets. The system supports key rotation by retaining the previous key during transition periods. Bootstrap tokens enable initial operator setup, and token revocation is implemented via CRD.
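A minimal sketch of how a signing keypair and its rotation predecessor could be laid out in a Kubernetes secret; the secret name and key layout are assumptions, not the platform's actual format.

```yaml
# Hypothetical layout: the current signing key plus the previous key,
# retained so tokens signed before rotation still validate.
apiVersion: v1
kind: Secret
metadata:
  name: saturn-jwt-signing-keys   # assumed name
  namespace: saturn
type: Opaque
stringData:
  current.pem: |
    -----BEGIN RSA PRIVATE KEY-----
    ... (2048-bit RSA key) ...
    -----END RSA PRIVATE KEY-----
  previous.pem: |
    -----BEGIN RSA PRIVATE KEY-----
    ... (kept during the rotation window) ...
    -----END RSA PRIVATE KEY-----
```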

4. Ingress & Load Balancing

HTTP/HTTPS (Traefik)

Traefik serves as the production-grade reverse proxy handling all HTTP/HTTPS traffic to Saturn Cloud services. It automatically manages TLS certificates via cert-manager, integrating with Let’s Encrypt through both ACME HTTP-01 and DNS-01 challenge methods. Traffic routing is configured through IngressRoute custom resources. The system supports both internal and external load balancers depending on deployment requirements.
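For reference, a minimal IngressRoute of the kind described here might look as follows; the hostname, backend service, and TLS secret are placeholders rather than Saturn Cloud's actual resource names.

```yaml
apiVersion: traefik.io/v1alpha1   # traefik.containo.us/v1alpha1 on older Traefik
kind: IngressRoute
metadata:
  name: atlas                     # placeholder
  namespace: saturn
spec:
  entryPoints:
    - websecure                   # HTTPS entry point
  routes:
    - match: Host(`saturn.example.com`)
      kind: Rule
      services:
        - name: atlas             # backend Service (placeholder)
          port: 8000
  tls:
    secretName: saturn-example-com-tls   # certificate issued by cert-manager
```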

Cloud provider integration varies by platform. On AWS, Traefik works with ALB or NLB paired with ACM certificates. Azure deployments use Azure Load Balancer, while GCP uses Google Cloud Load Balancer. OCI uses the OCI Load Balancer. Nebius has native load balancer support, and bare-metal deployments can use MetalLB. The loadBalancerProvider parameter enables decoupling of the Kubernetes infrastructure provider from the load balancer implementation, which is particularly useful for scenarios like running a k0rdent-managed cluster with a Nebius load balancer.

SSH Access (SSH Proxy)

The SSH proxy provides direct terminal access to workspaces for IDE integration with tools like VS Code Remote SSH and PyCharm. It integrates with cloud provider load balancers for external access.

5. Observability Stack

Monitoring (Prometheus)

The monitoring stack centers on Prometheus for metrics storage, augmented by Kube-State-Metrics for Kubernetes object metrics and a custom kube-stats service for Saturn-specific metrics. Automatic scraping is configured through ServiceMonitor CRDs. The system collects cluster resource utilization, application performance metrics, user workspace consumption, and cost/billing data when Pandora is enabled. Optional Datadog integration is available for organizations with existing Datadog infrastructure.
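A representative ServiceMonitor for scraping a component's metrics endpoint; the labels and port names are illustrative, not the shipped defaults.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: atlas                    # illustrative
  namespace: saturn
  labels:
    release: prometheus          # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: atlas                 # selects the Service exposing metrics
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
```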

Ingress to Prometheus is disabled by default but can be enabled. Storage and compute resources are configurable based on cluster size and retention requirements. The system supports Prometheus federation for multi-cluster deployments.

Logging (EFK Stack)

Logs are collected via Fluent Bit and stored in Elasticsearch, with optional Kibana for visualization and Falco for security audit logs. Log retention is configurable. Fluent Bit runs as a DaemonSet to collect logs from all pods, including Kubernetes system logs, user workspace logs, and infrastructure component logs.

6. Cluster Infrastructure (ClusterSetup)

The ClusterSetup component configures foundational cluster resources required for Saturn Cloud operation. It creates the saturn-default-storage storage class, deploys the NVIDIA GPU device plugin, and handles GPU node labeling. Docker registry secrets are managed centrally and propagated across namespaces. On AWS, the component supports IRSA (IAM Roles for Service Accounts) for secure credential management.

Storage backends vary by cloud provider. AWS uses the EBS CSI driver, Azure uses the Azure Disk CSI driver, GCP uses the GCE Persistent Disk CSI driver, and OCI uses the OCI Block Volume CSI driver. GPU support includes the NVIDIA Device Plugin with multiple CUDA versions and AMD ROCm.
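On AWS, for example, the saturn-default-storage class might resemble the following; the volume parameters are an assumption, and the shipped definition lives in the saturn-k8s charts.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: saturn-default-storage
provisioner: ebs.csi.aws.com            # EBS CSI driver on AWS
parameters:
  type: gp3                             # assumed volume type
  encrypted: "true"
allowVolumeExpansion: true              # volumes can be grown in place
volumeBindingMode: WaitForFirstConsumer # topology-aware scheduling
```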

Docker registry integration supports ECR Private (with IRSA or access keys) and generic registries including Docker Hub, ghcr.io, and custom registries.

7. Certificate Management

Cert-manager handles automated certificate issuance and renewal using the Let’s Encrypt ACME protocol. Both HTTP-01 and DNS-01 challenge methods are supported. A custom HTTP Request Webhook component acts as an ACME DNS solver, communicating with the Let’s Encrypt production API for DNS-based challenges.
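A typical cert-manager ClusterIssuer for the HTTP-01 path looks like this; the issuer name and email are placeholders, and the DNS-01 path would instead reference the HTTP Request Webhook as its solver.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod              # placeholder
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com            # placeholder
    privateKeySecretRef:
      name: letsencrypt-account-key   # ACME account key storage
    solvers:
      - http01:
          ingress:
            class: traefik            # Traefik answers the HTTP-01 challenge
```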

8. Optional Infrastructure Components

Network Filesystem

For workloads requiring shared filesystem access, Saturn Cloud can deploy NFS integration supporting AWS FSx for Lustre, AWS EFS, or generic NFS servers.

Pandora (Cost & Billing)

The Pandora component tracks resource usage by user and project, integrating with the Prometheus metrics stack to export cost data for chargeback or showback purposes. It includes Stripe webhook support for billing integrations.

Other Optional Components

Calico provides advanced network policy capabilities but is typically disabled on managed Kubernetes where native network policies suffice. The Cluster Autoscaler enables automatic node scaling based on workload but is also typically disabled on managed services. CoreDNS allows custom DNS configuration and cloud-specific DNS forwarding, disabled by default on managed Kubernetes.

User-Facing Resource Model

Saturn Cloud abstracts Kubernetes complexity through four primary resource types, all backed by Kubernetes pods:

| Resource Type | Description | Primary Use Case |
|---|---|---|
| Python Server | JupyterLab-based development workspace with SSH integration | Interactive development, exploration, debugging |
| R Server | RStudio-based development workspace with SSH integration | R statistical computing and analysis |
| Deployment | Long-running HTTP service | Model APIs, web services, real-time inference |
| Job | Run-to-completion workload with scheduling | Batch processing, training runs, ETL pipelines |

Uniform Attachment Model

All resource types support the same attachment model, enabling seamless development to production workflows. Custom Docker images can be attached as base images or fully custom environments. Git repositories sync code with credential management. Kubernetes secrets provide environment variables, API keys, and credentials. IAM roles leverage cloud provider service accounts (IRSA, Workload Identity, Managed Identity). Shared NFS folders enable persistent data sharing across workspaces.

A typical workflow starts with a data scientist developing a model in a Python Server. The same Docker image, git repo, and secrets get attached to a Job for the training run. The trained model is then deployed as a Deployment with identical configuration. All three resources share access to the same NFS-mounted datasets, eliminating environment drift between development and production.
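Expressed as configuration, the shared attachment block might look like the sketch below; the field names are hypothetical and do not reflect Saturn Cloud's actual resource schema.

```yaml
# Hypothetical attachment block reused verbatim across a Python Server,
# a Job, and a Deployment -- field names are illustrative only.
image: public.ecr.aws/saturncloud/saturn-python-pytorch:latest
git_repositories:
  - url: git@github.com:example-org/churn-model.git   # placeholder repo
secrets:
  - name: wandb-api-key       # injected as an environment variable
iam_role: saturn-ml-team      # IRSA / Workload Identity / Managed Identity
shared_folders:
  - /datasets                 # NFS mount shared by all three resources
```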

Resource Specifications

Python and R Servers offer configurable CPU, memory, and GPU allocation. Each server has a persistent home directory backed by cloud block storage (EBS/Azure Disk/GCE PD) and automatically shuts down after an idle period. SSH access enables integration with local IDEs like VS Code, PyCharm, and RStudio Desktop.

Deployments auto-restart on failure, integrate with load balancers, and support Horizontal Pod Autoscaling. Health check endpoints enable reliable traffic routing, and replica count can be configured based on load requirements.

Jobs support cron-like scheduling and API-triggered execution. Automatic cleanup runs after completion. Distributed training is supported through torchrun and DeepSpeed. Configurable retry policies handle transient failures.

Multi-Cloud Abstraction

The architecture separates application logic from infrastructure concerns, enabling deployment across cloud providers with minimal configuration changes.

Cloud Provider Integrations

| Provider | Load Balancer | Storage | Authentication | Node Scaling |
|---|---|---|---|---|
| AWS | ALB/NLB with ACM | EBS CSI, EFS, FSx | IRSA (IAM roles per pod) | Auto Scaling Groups |
| Azure | Azure Load Balancer | Azure Disk CSI | Managed Identity | Virtual Machine Scale Sets |
| GCP | Google Cloud LB | GCE Persistent Disk CSI | Workload Identity | Managed Instance Groups |
| OCI | OCI Load Balancer | OCI Block Volume CSI | Instance principals | OCI autoscaler |
| Nebius | Nebius Load Balancer | Provider-specific | Provider-specific | Provider-specific |
| On-Prem | MetalLB | Ceph/NFS | N/A | N/A |

Configuration Management

Cloud-specific configuration is isolated in Terraform modules (see saturncloud-reference-terraform), Helm chart values files, and Operator CRD specifications. Switching from AWS to GCP requires updating the storage class backend from EBS to GCE PD, changing load balancer annotations, modifying the IAM role mechanism from IRSA to Workload Identity, and adjusting VPC/subnet configuration. Application code and user workspaces remain unchanged.
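As an illustration of how narrow those changes are, the AWS and GCP variants of the storage and identity settings might differ only in values like the following. The surrounding keys are an assumed values-file shape, though the CSI provisioners and identity annotations are the providers' real ones.

```yaml
# AWS variant (illustrative values-file shape)
storageClass:
  provisioner: ebs.csi.aws.com
serviceAccountAnnotations:
  eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/saturn-atlas  # IRSA
---
# GCP variant -- same keys, different backends
storageClass:
  provisioner: pd.csi.storage.gke.io
serviceAccountAnnotations:
  iam.gke.io/gcp-service-account: saturn-atlas@my-project.iam.gserviceaccount.com  # Workload Identity
```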

Security Architecture

Authentication & Authorization

Users authenticate via SSO providers (Okta, Azure AD, Google). The Auth Server issues a JWT signed with a 2048-bit RSA private key (RS256). This token is included in subsequent API requests. Services validate the token using the public key. Tokens refresh before expiration and can be revoked via CRD when needed. Bootstrap tokens handle initial operator installation.

Network Security

All HTTP traffic is encrypted via Traefik with cert-manager automation. Let’s Encrypt provides certificates for public-facing services, while internal CA support is available for private services. Automatic certificate renewal prevents expiration issues. Optional Calico deployment enables advanced network segmentation. Namespace-level pod isolation and per-component ingress/egress rules control traffic flow.

Pod Security

Components run with strict security contexts. The operator executes as non-root with a read-only root filesystem where possible. All capabilities are dropped and privilege escalation is disabled. RBAC provides minimal permissions per component through dedicated service accounts per namespace.
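In Kubernetes terms, the security context described above corresponds to settings like these; the pod and container names are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-component        # placeholder
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000              # assumed non-root UID
  containers:
    - name: app
      image: public.ecr.aws/saturncloud/saturn-enterprise:latest
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]          # drop all Linux capabilities
```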

Secrets Management

Kubernetes secrets (encrypted at rest by cloud provider KMS) store JWT signing keys, SSH host keys, Atlas secret keys, database credentials, Docker registry credentials, and cloud provider credentials when not using IAM roles. AWS IRSA, Azure Managed Identity, and GCP Workload Identity enable per-pod IAM roles for fine-grained access control without long-lived credentials.
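With IRSA, for example, a pod obtains its IAM role through an annotation on its service account; the role ARN below is a placeholder.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atlas                    # illustrative
  namespace: saturn
  annotations:
    # Pods using this service account assume the role via the EKS OIDC provider
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/saturn-atlas
```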

Data Persistence

Databases

PostgreSQL backs the Atlas control plane. Data persists to volumes using the saturn-default-storage storage class. Automated snapshots provide point-in-time recovery.

Elasticsearch stores logs with persistent volume storage via saturn-default-storage. Snapshots can be sent to object storage (S3/Azure Blob/GCS) for long-term retention.

Persistent Volumes

The saturn-default-storage storage class enables volume expansion and uses topology-aware scheduling. Cloud-specific provisioners handle the underlying block storage (EBS on AWS, Azure Disk on Azure, Persistent Disk on GCP, Block Volume on OCI).

Volumes back user workspace home directories (persistent across server restarts), database storage for PostgreSQL and Elasticsearch, and optional shared NFS mounts via the Network Filesystem component.

Workspace Snapshots

Workspace snapshots are stored in object storage with configurable retention. The snapshot system supports workspace cloning and sharing across team members.

Deployment Model

Installation Flow

Deployment begins with a Helm install of the Saturn Helm Operator, which registers 14 Custom Resource Definitions. During initialization, secrets are generated for JWT keys, SSH keys, and database passwords. The operator then instantiates component CRDs, reconciling each component via Helm. Health checks monitor component status and reconcile drift from the desired state.

GitOps Compatibility

The operator pattern enables GitOps workflows through declarative CRD specifications committed to version control. Drift detection automatically reconciles to the desired state. Multi-environment management uses separate CRD specs per environment (dev/staging/prod). Tools like ArgoCD, Flux, and Jenkins X integrate naturally with this model.
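For example, an ArgoCD Application pointing at a repository of CRD specs could drive the whole platform; the repo URL and paths below are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: saturn-cloud             # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/saturn-config.git  # placeholder
    targetRevision: main
    path: clusters/prod          # per-environment directory of CRD specs
  destination:
    server: https://kubernetes.default.svc
    namespace: saturn-operator
  syncPolicy:
    automated:
      prune: true                # remove resources deleted from git
      selfHeal: true             # reconcile manual drift back to git
```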

Component Reconciliation

The operator continuously reconciles components to stay in sync with their CRD specifications. CRD changes trigger automated Helm upgrades. Stateless components like Traefik and Auth Server use rolling updates, while stateful components like PostgreSQL and Elasticsearch are handled carefully with PVC retention. Failed upgrades trigger a Helm rollback; component health is verified through liveness/readiness probes, with failures logged as events and retried.

Upgrade Process

Platform upgrades start with updating the operator image version. The operator automatically upgrades component Helm charts, using rolling updates to minimize downtime. Database migrations run automatically in Atlas.

Container Image Distribution

All images are published to public.ecr.aws/saturncloud. Core infrastructure images include saturn-enterprise for the Atlas control plane, auth-server for JWT authentication, ssh-proxy for the SSH gateway, and postgres for the PostgreSQL database.

Platform component images include traefik for HTTP/HTTPS ingress, elasticsearch for log storage, fluent-bit for log collection, prometheus for metrics storage, and kube-state-metrics for Kubernetes object metrics. GPU and monitoring images include nvidia-k8s-device-plugin for NVIDIA GPU support, kube-stats for Saturn-specific metrics, and bandwidth for network monitoring.

Runtime Images (User Workspaces)

Python images include saturnbase-python for CPU workloads and saturnbase-python-gpu-* variants supporting multiple CUDA versions. Development variants with saturnbase-python-gpu-devel-* include CUDA development tools. R images include saturnbase-r for CPU, saturnbase-r-gpu for GPU support, and saturnbase-r-bioconductor with Bioconductor packages pre-installed.

Pre-configured images bundle common libraries: saturn-python includes standard ML packages, saturn-python-pytorch comes with PyTorch pre-installed, saturn-python-rapids includes NVIDIA RAPIDS for GPU-accelerated data science, and saturn-r bundles common R packages.

Development images support custom workflows. The dind image provides Docker-in-Docker for custom image builds, while dind-nvidia enables GPU-enabled Docker builds. The saturn2docker image contains custom build tooling.

Custom Image Building

Saturn Cloud supports custom Docker image building via Docker-in-Docker. Standard DIND handles CPU images, while DIND-NVIDIA supports CUDA during build processes. Built images automatically push to configured registries (ECR, ACR, GCR, Docker Hub).

Key Architectural Patterns

1. Operator Pattern

The Kubernetes operator manages the entire lifecycle through declarative CRDs. This enables GitOps workflows, automated reconciliation to desired state, simplified multi-environment management, and reduced operational complexity.

2. Multi-Cloud Abstraction

A single codebase runs across different cloud providers via configuration. Cloud-specific logic is isolated in Terraform modules, Helm chart values, and Operator CRD parameters.

3. Resource Uniformity

The same attachment model (images, git repos, secrets, IAM roles) across all resource types enables seamless development to production workflow. Developers work in Python Servers, train in Jobs with identical configuration, and deploy as Deployments without environment drift.

4. Namespace Isolation

Multi-user isolation via Kubernetes namespaces and network policies keeps workspaces separated. User workspaces run in dedicated namespaces with network policies preventing cross-namespace access. Resource quotas per namespace are optional, and RBAC controls namespace access.
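A sketch of the kind of policy involved, admitting traffic only from within the workspace's own namespace and from the platform namespace; the labels and namespace names are assumptions.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: workspace-isolation
  namespace: user-alice          # hypothetical per-user namespace
spec:
  podSelector: {}                # applies to every pod in the namespace
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector: {}        # same-namespace traffic
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: saturn   # platform namespace (assumed)
```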

5. Stateless + Stateful Separation

Stateless components (Traefik, Auth Server, SSH Proxy) support horizontal scaling, rolling updates, and carry no persistent state. Stateful components (PostgreSQL, Elasticsearch) use persistent volumes, automated snapshots, and require careful upgrade handling.

6. Cloud-Native Security

TLS encryption runs everywhere via cert-manager automation. Containers run as non-root with read-only filesystems. RBAC enforces minimal permissions. Secrets encrypt at rest using cloud provider KMS. IAM role integration (IRSA, Workload Identity, Managed Identity) eliminates long-lived credentials.

7. Modular Components

The 14 independently managed components allow flexible deployment topologies. Optional components (Calico, Autoscaler, CoreDNS) can be disabled on managed Kubernetes. Pandora enables cost tracking when needed. Network Filesystem adds shared data access. Each component versions and upgrades independently.

Scalability Characteristics

Horizontal Scaling

User workspaces are limited only by Kubernetes cluster capacity. Typical deployments handle 100-1000+ concurrent workspaces. Autoscaling via Cluster Autoscaler adds nodes as needed. Jobs support massively parallel execution (1000+ concurrent jobs) with automatic failure handling and retry. Atlas manages the job queue internally. Deployments support Horizontal Pod Autoscaling with multiple replicas per deployment and load balancing via Kubernetes Services.
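Underneath, a deployment's autoscaling maps to a standard HorizontalPodAutoscaler along these lines; the names and thresholds are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-api                # placeholder deployment name
  namespace: saturn-deployments  # assumed namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```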

Vertical Scaling

Workspace resources are configurable based on available node instance types. Cluster size has been tested from 3 nodes to 100+ nodes. Elasticsearch scales with additional replicas as needed.

Performance Characteristics

Workspace startup time varies by scenario. Existing images start quickly (seconds), while custom image builds take longer depending on complexity. GPU workspaces require additional time for NVIDIA driver loading.

Job execution latency is low, with API-triggered jobs reaching the running state in seconds. Scheduled jobs maintain sub-second scheduling accuracy.

Disaster Recovery

Backup Strategy

PostgreSQL (Atlas Database) backups use persistent volume snapshots via cloud provider snapshot features (EBS/Azure Disk/GCE PD). Point-in-time recovery leverages cloud provider snapshot capabilities. Optional pg_dump to S3/Azure Blob/GCS provides additional protection.

Elasticsearch logs are backed up via snapshots to object storage, and user workspace snapshots are likewise stored in object storage. Persistent home directories reside on cloud block storage, with cloud provider snapshots available for recovery.

Recovery Procedures

Atlas database failure recovery involves restoring the PostgreSQL persistent volume from snapshot, restarting the Atlas pod, and allowing user workspaces to automatically reconnect. Elasticsearch can restore from snapshot if needed. Fluent Bit automatically resumes log forwarding.

Complete cluster failure recovery provisions a new Kubernetes cluster, restores PostgreSQL from snapshot, deploys the Saturn Helm Operator, and lets the operator reconcile all components. User workspaces automatically recreate from workspace definitions stored in the database.

Monitoring & Alerting

Key Metrics

Cluster health monitoring tracks node CPU/memory/disk utilization, pod restart count, PersistentVolumeClaim status, and Kubernetes API server latency. Application health covers Atlas API response time, database connection pool utilization, active user workspace count, job success/failure rate, and deployment uptime. User experience metrics include workspace startup time, job execution latency, deployment response time, and SSH connection success rate.

Alerting Rules

Recommended Prometheus alerts include node disk approaching capacity, PostgreSQL connection pool utilization, Atlas API error rates, pod restart patterns, PersistentVolumeClaim binding failures, and certificate expiration warnings.
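Two of those alerts, written as a PrometheusRule; the thresholds and names are suggestions rather than shipped defaults.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: saturn-recommended-alerts   # illustrative
  namespace: monitoring
spec:
  groups:
    - name: saturn.cluster
      rules:
        - alert: NodeDiskNearlyFull
          expr: node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Node filesystem has less than 10% free space"
        - alert: PodRestartLoop
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Container restarted more than 5 times in the last hour"
```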

Log Aggregation

Fluent Bit collects logs from all pods as a DaemonSet. Elasticsearch indexes logs with timestamp, namespace, pod name, and container metadata. Optional Kibana provides log search and filtering, user workspace log isolation, and alerting on log patterns.

Cost Optimization

Idle Detection

Workspaces automatically shut down after a configurable idle period. Idle detection monitors CPU and network activity. Users can configure the timeout per workspace. Persistent home directories are retained after shutdown, so restarting a workspace is fast.

Resource Quotas

Per-user quotas control max concurrent workspaces, max CPU/memory/GPU per workspace, and max total CPU/memory/GPU across all workspaces. Per-project quotas use Kubernetes ResourceQuota per namespace. Cost allocation via Pandora enables chargeback.
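Per-project quotas translate directly into Kubernetes ResourceQuota objects; a sketch with assumed limits:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota
  namespace: team-data-science    # hypothetical project namespace
spec:
  hard:
    requests.cpu: "64"            # total CPU requests across the namespace
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "4"  # cap on GPU requests
    persistentvolumeclaims: "20"
```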

Cost Allocation

The optional Pandora component tracks resource usage by user and project, integrating with Prometheus metrics to export cost data for chargeback or showback. Stripe integration handles billing. Metrics tracked include CPU-hours, memory-hours, GPU-hours, and storage utilization, all segmented by user and project.

Operational Considerations

Day-to-Day Operations

Common operational tasks include user management (SSO integration handles authentication), resource quota adjustments, monitoring cluster health, reviewing logs for errors, and managing custom Docker images. The platform is designed for low-touch operations with automatic certificate renewal via cert-manager, automatic workspace shutdown through idle detection, automatic job cleanup, and automatic log rotation.

Upgrade Cadence

New platform releases ship monthly, with security patches as needed. The upgrade process updates the operator image, which then handles component upgrades automatically. Stateless components achieve zero downtime through rolling updates. User workspaces see no impact since upgrades only affect the control plane.

Troubleshooting

Common troubleshooting involves checking pod events, verifying node capacity, checking image pull credentials, and confirming persistent volume binding. Database connection errors typically relate to connection pool utilization or network policies. TLS certificate issues often involve DNS propagation or Let’s Encrypt rate limits. SSH connection failures relate to load balancer configuration or workspace SSH key settings.

Integration Points

SSO Integration

Saturn Cloud supports Okta (SAML, OIDC), Azure AD (SAML, OIDC), Google Workspace (OAuth 2.0), and generic SAML 2.0 and OIDC providers. SSO settings are configured in the Atlas database. JWT tokens are issued after successful SSO authentication. Group and role mapping from SSO providers to Saturn Cloud roles enables fine-grained access control.

Version Control

Git repository integration supports GitHub, GitLab, and Bitbucket with credential management via Kubernetes secrets. Repositories automatically sync on workspace start.

Cloud Storage

Object storage integration covers AWS S3 (via IAM role or access keys), Azure Blob Storage (via Managed Identity or connection string), GCP Cloud Storage (via Workload Identity or service account key), and S3-compatible storage systems like MinIO, Wasabi, and DigitalOcean Spaces. Access methods include IAM role per pod (IRSA, Workload Identity, Managed Identity), environment variables with access keys, and mounted credentials files.

Container Registries

Supported registries include AWS ECR (public and private), Azure Container Registry, Google Container Registry and Artifact Registry, Docker Hub, GitHub Container Registry (ghcr.io), and any Docker-compatible custom registry. Authentication for ECR uses IAM role (IRSA) or access keys. ACR uses Managed Identity or service principal. GCR uses Workload Identity or service account key. Generic registries use username/password or token authentication.

Reference Architecture Diagrams

Component Interaction

                                    ┌─────────────────┐
                                    │   End Users     │
                                    └────────┬────────┘
                                             │
                                    ┌────────▼────────┐
                                    │  Load Balancer  │
                                    │ (Cloud Provider)│
                                    └────────┬────────┘
                                             │
                        ┌────────────────────┼────────────────────┐
                        │                    │                    │
                ┌───────▼──────┐    ┌───────▼──────┐    ┌───────▼──────┐
                │   Traefik    │    │  SSH Proxy   │    │              │
                │ (HTTP/HTTPS) │    │    (SSH)     │    │              │
                └───────┬──────┘    └───────┬──────┘    │              │
                        │                    │           │              │
                ┌───────▼────────────────────▼─────┐     │              │
                │         Atlas (Control Plane)    │     │              │
                │  ┌────────────────────────────┐  │     │              │
                │  │    PostgreSQL Database     │  │     │  Kubernetes  │
                │  └────────────────────────────┘  │     │   Cluster    │
                └──────┬───────────────────────────┘     │              │
                       │                                 │              │
           ┌───────────┼───────────────┬─────────┐       │              │
           │           │               │         │       │              │
    ┌──────▼─────┐ ┌──▼───────┐ ┌────▼────┐ ┌──▼────┐  │              │
    │  Python    │ │    R     │ │  Jobs   │ │Deploy │  │              │
    │  Servers   │ │ Servers  │ │         │ │ments  │  │              │
    │ (Jupyter)  │ │(RStudio) │ │         │ │       │  │              │
    └──────┬─────┘ └──────────┘ └─────────┘ └───────┘  │              │
           │                                             │              │
    ┌────────────────────────────────────────────────┐  │              │
    │         Observability Stack                    │  │              │
    │  ┌────────────┐  ┌──────────────────────────┐ │  │              │
    │  │Prometheus  │  │  Elasticsearch/Fluent Bit│ │  │              │
    │  └────────────┘  └──────────────────────────┘ │  │              │
    └────────────────────────────────────────────────┘  └──────────────┘

Data Flow

Development Workflow:

Developer → JupyterLab → Cloud Storage (S3/Azure/GCS)
                ↓
         Git Repository
                ↓
          Custom Image
                ↓
          Job (Training)
                ↓
          Model Artifacts → Cloud Storage
                ↓
          Deployment (API)
                ↓
        Production Traffic

Authentication Flow:

User → SSO Provider (Okta/Azure AD) → Auth Server → JWT Token → Atlas API

Logging Flow:

User Workspace → stdout/stderr → Fluent Bit (DaemonSet) → Elasticsearch → Kibana

Technical Specifications Summary

| Category | Specification |
|---|---|
| Kubernetes Version | 1.24+ |
| Container Runtime | containerd, CRI-O, Docker (deprecated) |
| Operator Pattern | Helm-based (Operator SDK) |
| Database | PostgreSQL |
| Logging | Elasticsearch, Fluent Bit |
| Monitoring | Prometheus, Kube-State-Metrics |
| Ingress | Traefik |
| TLS | Cert-Manager with Let’s Encrypt |
| GPU Support | NVIDIA (multiple CUDA versions), AMD ROCm |
| Storage | EBS, Azure Disk, GCE PD, OCI Block Volume, NFS/FSx |
| Authentication | JWT (RS256), SSO (SAML, OIDC, OAuth 2.0) |
| Registry | public.ecr.aws/saturncloud |

Additional Resources