A node scales up. Kubernetes schedules a pod onto it. The kubelet starts pulling the container image from its upstream registry. Several gigabytes later, the pod starts.
This is fine when it happens once. It is less fine when the same image was already pulled on three sibling nodes in the same cluster, on the same private network, ten seconds earlier. The local copy on a peer node is faster than anything the registry can serve, and every node is going to pull the same image eventually. That is the gap Spegel fills.
We recently turned Spegel on for our Nebius deployment. This post covers what it does, why we picked it, how we wired it into our Helm operator, and the one containerd-configuration step that bit us before it worked.
What Spegel does
Spegel runs as a DaemonSet on every node. Each instance does three things:
- Watches the local containerd content store for images already pulled on this node, and announces them on a peer-to-peer network so other nodes know who has what.
- Runs a local HTTP registry on the node, serving layers from containerd’s content store to other nodes that ask for them.
- Writes containerd registry-mirror config (under /etc/containerd/certs.d) pointing the configured registries at localhost:5000 (Spegel) first, with the upstream as the fallback.
The result: when a node needs an image whose layers are already cached on a sibling, containerd fetches those layers from the sibling instead of the upstream registry. If no peer has the image, the request falls through to the upstream as normal. There is no central server, no separate registry to operate, and nothing to garbage-collect. Spegel does not store any images of its own. It only shares what containerd already has.
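To make the mirror-then-fallback behavior concrete, here is a minimal sketch of what a per-registry hosts.toml might contain. The helper name render_hosts_toml is hypothetical, and the file Spegel actually writes may carry different addresses or extra fields. The layout follows containerd's hosts.toml format: hosts listed under [host] are tried first, and server is the fallback.

```python
def render_hosts_toml(upstream: str, mirror: str = "http://localhost:5000") -> str:
    """Render a containerd hosts.toml that tries `mirror` before `upstream`.

    The mirror address is the one mentioned in this post; treat it as an
    assumption and check what your Spegel version really writes.
    """
    return (
        # `server` is where containerd goes when no listed host can serve the pull.
        f'server = "{upstream}"\n'
        "\n"
        # Hosts in [host."..."] tables are tried before `server`.
        f'[host."{mirror}"]\n'
        '  capabilities = ["pull", "resolve"]\n'
    )


if __name__ == "__main__":
    # This would live at /etc/containerd/certs.d/ghcr.io/hosts.toml
    print(render_hosts_toml("https://ghcr.io"))
```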
The peer-to-peer layer uses libp2p Kademlia DHT for discovery and HTTP for transfer. Each layer is identified by its content digest, which is what containerd uses internally too, so there is no ambiguity about whether a peer has the exact bytes the kubelet asked for.
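Content addressing is what makes it safe to accept bytes from an arbitrary peer. The sketch below is not Spegel's code, and the function names are hypothetical. It only shows the general idea: a blob's digest is a hash of its bytes, so a receiver can verify what it got without trusting the sender.

```python
import hashlib


def digest_of(blob: bytes) -> str:
    """Return the digest string for a blob, in the `sha256:<hex>` form."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def matches_digest(blob: bytes, expected: str) -> bool:
    """Check that `blob` hashes to `expected`; only sha256 is handled here."""
    algorithm, _, hex_value = expected.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    return hashlib.sha256(blob).hexdigest() == hex_value


if __name__ == "__main__":
    layer = b"pretend these are layer bytes"
    wanted = digest_of(layer)
    print(matches_digest(layer, wanted))         # the exact bytes verify
    print(matches_digest(layer + b"x", wanted))  # any change breaks the match
```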
Why we picked it
We looked at the usual options:
| Option | What it is | Trade-off |
|---|---|---|
| Pull-through cache (Harbor, Distribution registry) | A central cache the cluster pulls through | Single bottleneck, separate service to operate, needs persistent storage |
| Dragonfly | Peer-to-peer image distribution with a supernode | More moving parts, needs a manager and seed peer |
| Spegel | Stateless P2P, one DaemonSet, no central service | Slightly less aggressive than a true CDN, no cross-cluster sharing |
For our use case (one cluster, fast intra-cluster network, autoscaled nodes pulling the same handful of large GPU images), Spegel is the simplest thing that solves the problem. The fact that it has no state and no central component means there is nothing to back up, nothing to restart on failure, and nothing to size. If a Spegel pod dies, containerd transparently falls through to the upstream registry.
What it actually requires from containerd
Spegel cannot change containerd’s main configuration itself. It can write mirror files under certs.d at runtime, but the config.toml changes below need a containerd restart to take effect, and Spegel has no way to safely restart the container runtime it is running on top of. The Spegel docs are explicit about this.
You need two settings in /etc/containerd/config.toml:
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
This tells containerd to look in /etc/containerd/certs.d for per-registry mirror configuration. Spegel writes hosts.toml files in there at runtime. Without this line, containerd ignores everything Spegel writes and just keeps pulling from the upstream.
discard_unpacked_layers = false
This keeps the layer blobs in the content store after they are unpacked into snapshots. With discard_unpacked_layers = true (which some distributions ship as the default), containerd throws the blob away as soon as it is unpacked, so Spegel has nothing left to serve to peers. The disk savings are real but they defeat the entire point of running Spegel.
Most managed Kubernetes products do not set config_path for you. GKE and DigitalOcean explicitly do not. Nebius does not. EKS lets you set it via nodeadm. K0S and Talos have their own configuration mechanisms.
In our case, we handle this with a small DaemonSet that runs on every node, patches both settings into the host’s config.toml, and restarts containerd. It is idempotent (it only restarts containerd if the config actually changed), and it ships as part of our Spegel chart so it goes wherever Spegel goes.
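The idempotency is the part worth getting right, so here is a minimal sketch of the patch-only-if-changed logic. It is not the script we ship. The function names are hypothetical, and the table names assume the containerd 1.x (config version 2) layout, where discard_unpacked_layers sits under the CRI plugin's containerd table. The editing is deliberately line-based and does not handle multi-line values or differently quoted table headers.

```python
from typing import Tuple

# Assumed table names for containerd 1.x, config version 2. Adjust for your version.
REGISTRY_TABLE = 'plugins."io.containerd.grpc.v1.cri".registry'
CONTAINERD_TABLE = 'plugins."io.containerd.grpc.v1.cri".containerd'


def set_toml_key(text: str, table: str, key: str, value: str) -> str:
    """Set `key = value` inside `[table]`, appending the table if it is absent."""
    lines = text.splitlines()
    header = f"[{table}]"
    assignment = f"{key} = {value}"

    start = next((i for i, line in enumerate(lines) if line.strip() == header), None)
    if start is None:
        # Table not present: append it at the end of the file.
        if lines and lines[-1].strip():
            lines.append("")
        lines += [header, f"  {assignment}"]
        return "\n".join(lines) + "\n"

    # The table body ends at the next header line or at end of file.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].lstrip().startswith("[")),
        len(lines),
    )
    for i in range(start + 1, end):
        if lines[i].split("=")[0].strip() == key:
            # Keep the existing indentation so an already-correct line is unchanged.
            indent = lines[i][: len(lines[i]) - len(lines[i].lstrip())]
            lines[i] = indent + assignment
            break
    else:
        lines.insert(start + 1, f"  {assignment}")
    return "\n".join(lines) + "\n"


def patch_containerd_config(text: str) -> Tuple[str, bool]:
    """Return the patched config and whether anything changed."""
    patched = set_toml_key(text, REGISTRY_TABLE, "config_path", '"/etc/containerd/certs.d"')
    patched = set_toml_key(patched, CONTAINERD_TABLE, "discard_unpacked_layers", "false")
    return patched, patched != text
```

A caller would write the file and restart containerd only when the returned flag is true. Running the function on its own output returns the same text with the flag false, which is what keeps a restart loop from forming.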
The pause:3.9 mistake
When we first deployed our containerd-prep DaemonSet, we copied a pattern from a Nebius example: the init container runs the patch, then a long-lived pause container keeps the DaemonSet pod scheduled so the patch runs again on node replacement. The original example used:
containers:
  - name: pause
    image: gcr.io/google-containers/pause:3.9
Two days later every node had an ImagePullBackOff on gcr.io/google-containers/pause:3.9. The init container had run fine and containerd was patched, but the pod kept surfacing the same pull error because the keepalive container could not start.
The cause is structural: gcr.io/google-containers is the legacy Google Container Registry path, which Google shut down in March 2025. The replacement path is registry.k8s.io/pause. Anyone copying an old example that hardcodes the legacy GCR path will hit this. It is genuinely confusing: the Spegel pods look healthy, the patch ran, and the cluster is technically working, but kubectl describe on the prep pod is a wall of failed pulls.
The fix in our chart was to drop the upstream pause image entirely and use one of our own already-mirrored images for the keepalive. We had saturn-k8s-utils (an ubuntu:jammy image we already build and mirror to ECR Public) sitting right there. It has the shell and the nsenter binary the init container needs anyway, so it does double duty. One image to pull, no dependency on any external registry path.
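A check along these lines could catch a hardcoded legacy path before it reaches a cluster. This is a rough sketch rather than something from our chart: the function name is hypothetical, the prefix list contains only the path discussed above, and the line-based scan is a stand-in for parsing the manifest properly.

```python
from typing import List

# Only the legacy path named in this post; extend the tuple for your own policy.
LEGACY_PREFIXES = ("gcr.io/google-containers/",)


def find_legacy_images(manifest_text: str) -> List[str]:
    """Return image references in a manifest that start with a legacy prefix."""
    hits = []
    for line in manifest_text.splitlines():
        # Drop indentation and a leading list dash, then split on the first colon.
        key, sep, value = line.strip().lstrip("- ").partition(":")
        if sep and key.strip() == "image":
            image = value.strip().strip("'\"")
            if image.startswith(LEGACY_PREFIXES):
                hits.append(image)
    return hits


if __name__ == "__main__":
    manifest = """
containers:
  - name: pause
    image: gcr.io/google-containers/pause:3.9
"""
    for image in find_legacy_images(manifest):
        print(f"legacy registry path: {image}")
```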
How we wired it into our operator
Our enterprise install is driven by a single Helm operator (saturn-helm-operator) that reconciles CRs into Helm releases. Each installable component (Atlas, logging, monitoring, etc.) is a CR kind. Adding Spegel meant:
- Vendoring the upstream Spegel chart under the operator’s helm-charts/spegel/ directory, the same way we vendor cert-manager.
- Registering a Spegel CR kind in the operator’s watches.yaml and adding the CRD.
- Adding images.spegel to the operator’s central image map. This is the canonical Saturn convention: every infrastructure image referenced by the operator lives under images: as a flat string, and a single imageMirror setting rewrites those paths through a customer’s mirror registry for air-gapped deployments. Spegel picks up that behavior for free.
- Mirroring ghcr.io/spegel-org/spegel:v0.7.1 into our own registry path. Our saturn-mirror repo holds one-line Dockerfiles per third-party image (FROM ghcr.io/spegel-org/spegel:v0.7.1), and our release-images build pipeline builds and pushes them on every release.
- Adding saturnComponents.spegel to the operator values, disabled by default, with a configurable mirroredRegistries list. Enabled by default in our Nebius overrides because that is where it actually helps.
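The imageMirror idea in the list above can be sketched in a few lines. This is an illustration of the general technique, not the operator's actual logic. The function names and the example mirror host are hypothetical, and the rule for spotting a registry host (the first path component contains a dot or a colon, or is localhost) is the usual image-reference convention.

```python
from typing import Dict, Tuple


def split_registry(image: str) -> Tuple[str, str]:
    """Split an image reference into (registry host, remaining path)."""
    first, sep, rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    # No explicit registry host, e.g. "ubuntu:jammy" or "library/ubuntu".
    return "", image


def rewrite_image(image: str, mirror: str) -> str:
    """Point `image` at `mirror`, keeping its repository path and tag."""
    if not mirror:
        return image
    _, path = split_registry(image)
    return f"{mirror.rstrip('/')}/{path}"


def rewrite_image_map(images: Dict[str, str], mirror: str) -> Dict[str, str]:
    """Apply the same mirror to every entry of a flat name-to-image map."""
    return {name: rewrite_image(ref, mirror) for name, ref in images.items()}


if __name__ == "__main__":
    images = {"spegel": "ghcr.io/spegel-org/spegel:v0.7.1"}
    print(rewrite_image_map(images, "mirror.example.com"))
```

Keeping every image as a flat string is what makes this cheap: one function covers the whole map, and a new component inherits mirroring by adding one entry.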
The Spegel CR the operator emits looks like this:
apiVersion: charts.saturncloud.io/v1alpha1
kind: Spegel
metadata:
  name: spegel
  namespace: spegel
spec:
  image:
    repository: <your-mirror>/spegel-org/spegel
    tag: v0.7.1
  containerdPrep:
    enabled: true
    image: <your-mirror>/saturn-k8s-utils:<tag>
  mirroredRegistries:
    - https://<registry-you-pull-from>
    - https://<another-registry-you-pull-from>
If you scope mirroredRegistries (rather than leaving it empty, which would have Spegel try to mirror every registry), Spegel only writes hosts.toml files for the listed registries. Pulls from registries not on the list go straight to the upstream as if Spegel were not installed. Scope the list to the registries your cluster actually pulls images from in volume.
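The scoping rule can be expressed as a small predicate. This sketch only models the behavior described above and is not taken from Spegel: the function names are hypothetical, and treating an image with no registry host as a docker.io image is an assumption based on the usual default.

```python
from typing import List
from urllib.parse import urlparse


def registry_host(image: str) -> str:
    """Return the registry host of an image reference."""
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    # Assumption: references without a host resolve to Docker Hub.
    return "docker.io"


def goes_through_spegel(image: str, mirrored_registries: List[str]) -> bool:
    """True if a pull of `image` would be offered to the mirror first."""
    if not mirrored_registries:
        # An empty list is described above as "mirror every registry".
        return True
    hosts = {urlparse(url).netloc for url in mirrored_registries}
    return registry_host(image) in hosts


if __name__ == "__main__":
    scoped = ["https://ghcr.io"]
    print(goes_through_spegel("ghcr.io/spegel-org/spegel:v0.7.1", scoped))
    print(goes_through_spegel("quay.io/example/tool:1.0", scoped))
```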
What changed in practice
Cold-start pull times on our larger GPU images dropped substantially once two or three nodes had pulled the image. The first node still has to fetch from the upstream, so the first scale-up event is the same as before. Every subsequent node pulls from a peer.
The other change is in failure modes. With a pull-through cache, the cache going down means every pull misses and slows down. With Spegel down, every pull falls through to the upstream registry. This is the same path that existed before Spegel was installed, so the worst case is “no faster than before,” which is a much better failure mode than “everything is slower until the cache is back.”
Enable it yourself
If you run Saturn Cloud Enterprise, see the Spegel docs for the operator values to set and the prerequisites to check.
If you are just looking to run Spegel on your own cluster, the upstream Helm chart is straightforward. The two things to get right are the containerd config_path setting (do not skip this) and scoping mirroredRegistries to registries Spegel can actually help with.


