Spot management

Kompass Spot management uses HiberScale-enabled launch speed to ensure evicted Pods are rescheduled without any downtime.

Kompass boots HiberScale nodes much faster than an autoscaler can launch new nodes, and these nodes host evicted Pods until replacement nodes are ready.

This allows you to migrate more workloads to Spot with no risk of service disruption.

The following figure shows an example of how Kompass launches a hibernated node much faster than the current boot time:

Prerequisites

  • To protect against downtime, Spot management can be activated only on workloads where a Pod Disruption Budget (PDB) is configured.

    If you try to activate Spot management for a workload without a PDB, Kompass displays the code for adding a PDB to that workload.
    Alternatively, you can use the following generic template. Change the template values as needed:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: "{{ .Values.deployment.name }}-pdb"
    spec:
      maxUnavailable: 20%
      selector:
        matchLabels:
          service: "{{ .Values.deployment.name }}"
  • Spot management can be activated only on workloads running on clusters where Karpenter is running.
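
As an illustration of how the generic PDB template might look once rendered, the following sketch pairs it with a Deployment. The workload name "checkout", the replica count, and the image are hypothetical and not taken from Kompass. The point to note is that the PDB selector only protects Pods that actually carry the matching label.

```yaml
# Hypothetical workload "checkout"; name, replicas, and image are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 10
  selector:
    matchLabels:
      service: checkout
  template:
    metadata:
      labels:
        service: checkout   # the PDB selector below must match this Pod label
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0.0   # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  maxUnavailable: 20%   # with 10 replicas, at most 2 Pods may be disrupted at a time
  selector:
    matchLabels:
      service: checkout
```

If your Pods use a different label key (for example `app` instead of `service`), adjust the selector accordingly. Otherwise the PDB matches no Pods and provides no protection.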

The magic behind the scenes

When you activate Spot management on a workload, the following automatic steps ensure protection:

  1. Kompass ensures that the workload and Karpenter configurations enable using Spot nodes, modifying configurations if necessary.

  2. Hibernated nodes start warming up.
    For more information about hibernated nodes, see HiberScale technology.

    Interruption protection begins immediately for all Pods from protected workloads hosted on Spot instance nodes.

  3. When AWS issues a Spot interruption notice, Kompass reactivates pre-baked hibernated nodes to replace the interrupted nodes.

  4. At the same time, Karpenter evicts the Pods running on the current nodes and launches new nodes.

    (Pods are evicted according to the limits in the PDB.)

  5. When the Kompass nodes are ready to host Pods, Pods are scheduled to those nodes.

  6. Five minutes after reactivation, the Kompass hibernated nodes are cordoned so that no new Pods are scheduled to them.

  7. Kompass nodes are gradually drained until all Pods are hosted by autoscaler nodes.

    To ensure smooth draining, the number of Pods may temporarily exceed the number that was running before the interruption.

  8. When Kompass nodes are empty (no more Pods being hosted), the nodes are terminated.
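
For orientation, step 1 refers to Karpenter configurations that allow Spot nodes. The exact changes Kompass applies are not described here. The following is only a minimal sketch of what a Spot-enabled Karpenter configuration can look like, assuming Karpenter's v1 NodePool API on AWS and an existing EC2NodeClass named "default" (both the NodePool name and the node class name are hypothetical):

```yaml
# Sketch only: assumes Karpenter v1 on AWS; resource names are hypothetical.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-enabled
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # Allow Karpenter to provision Spot capacity, with On-Demand as a fallback.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```

A workload that pins itself to On-Demand capacity, for example through a node selector on `karpenter.sh/capacity-type`, would not be scheduled to Spot nodes even with such a NodePool in place.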

The following diagram illustrates the process: