Spot management

Spot management reschedules interrupted Pods before the node terminates, so more workloads can be migrated automatically and safely from On-Demand to Spot instances without risking service disruption.

The following figure shows an example of how Spot management resumes a hibernated node much faster than a standard node boot:

Comparison of boot times with and without Zesty, highlighting significant time differences.

Prerequisites

  • To protect against downtime, Spot management can be activated only on workloads where a Pod Disruption Budget (PDB) is configured.

    If you try to activate Spot management for a workload that has no PDB, the code for adding a PDB to that workload is displayed for your convenience.
    Alternatively, you can use the following generic manifest. Replace the placeholders as needed (a filled-in example follows this list):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: "{{ $deployment.name }}-pdb"
      namespace: {{ namespace of the deployment }}
    spec:
      minAvailable: 90%
      selector:
        matchLabels:
          service: workloadName
  • Spot management can be activated only on workloads running in clusters where Karpenter is installed.
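
To make the placeholders concrete, the following is a minimal sketch of the generic PDB above, filled in for a hypothetical Deployment named checkout in the payments namespace (both names, and the service label value, are assumptions for illustration):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: checkout-pdb
      namespace: payments
    spec:
      minAvailable: 90%       # at least 90% of the matched Pods must remain available during evictions
      selector:
        matchLabels:
          service: checkout   # must match the labels on the workload's Pods

After applying the manifest, you can confirm that it matches the workload with kubectl get pdb -n payments.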

The magic behind the scenes

When you activate Spot management on a workload, the following steps run automatically to protect it:

  1. Kompass ensures that the workload and Karpenter configurations allow the use of Spot nodes, modifying them if necessary (see the NodePool sketch after this list).

  2. Hibernated nodes start warming up.
    For more information about hibernated nodes, see HiberScale technology.

    Interruption protection begins immediately for all Pods from protected workloads hosted on Spot instance nodes.

  3. When AWS sends a Spot interruption notice, Kompass reactivates pre-baked hibernated nodes to replace the interrupted nodes.

  4. At the same time, Karpenter evicts the Pods running on the current nodes and launches new nodes.

    (Pods are evicted within the limits set by the PDB.)

  5. When the Kompass nodes are ready to host Pods, the Pods are scheduled to those nodes.

    Kompass assigns a negative deletion cost to those Pods (using the controller.kubernetes.io/pod-deletion-cost annotation; see the sketch after this list), and they continue to run until they are deleted.

  6. Five minutes after being reactivated, the Kompass hibernated nodes are cordoned so that no new Pods are scheduled to them.

  7. Every 5 minutes, Kompass checks which Pods have not yet been deleted and duplicates those Pods that are still running on its nodes.

  8. When the Karpenter nodes are ready to host Pods, the duplicated Pods are scheduled to those nodes.

  9. After a few minutes, the Horizontal Pod Autoscaler (HPA) deletes the Pods on the Kompass nodes (preferring them because of their lower deletion cost), then Kompass drains its nodes and returns them to hibernation.
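
Step 1 mentions that the Karpenter configuration must allow Spot nodes. For reference, the following is a minimal sketch of a Karpenter NodePool whose requirements permit both Spot and On-Demand capacity; the names and values are assumptions for illustration, not the exact configuration Kompass applies:

    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        spec:
          requirements:
            # Allow Karpenter to launch both Spot and On-Demand nodes
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
          nodeClassRef:
            group: karpenter.k8s.aws
            kind: EC2NodeClass
            name: default

If the requirement lists only "on-demand", Karpenter never launches Spot capacity for the workload, which is why step 1 modifies such configurations.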
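
The deletion-cost mechanism in step 5 relies on the standard controller.kubernetes.io/pod-deletion-cost annotation: when a ReplicaSet scales down, Pods with a lower cost are removed first. A minimal sketch of the annotation on a Pod follows; the Pod name and cost value are assumptions for illustration:

    apiVersion: v1
    kind: Pod
    metadata:
      name: checkout-6d5f9c7b4-abcde
      annotations:
        # A negative cost marks this Pod as a preferred candidate for
        # deletion when the owning ReplicaSet scales down.
        controller.kubernetes.io/pod-deletion-cost: "-100"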

The following diagram illustrates the process: