Testing HiberScale with AWS FIS
You can test HiberScale with the AWS Fault Injection Service (FIS) by simulating Spot interruptions. This validates that Spot protection correctly resumes hibernated nodes.
How Spot interruption simulation works
AWS FIS uses an ExperimentTemplate (configuration) and an Experiment (execution).
Experiments can select Karpenter-managed instances using the karpenter.sh/nodepool tag.
Spot interruptions generated by FIS behave the same as production events, including a two-minute warning before termination.
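Before running an experiment, it can help to preview which instances a tag-based target is likely to match. The following is a minimal sketch, not part of HiberScale or FIS: the function name list_spot_targets is hypothetical, and it assumes the AWS CLI is installed and configured for the account and region that host the cluster.

```shell
# list_spot_targets <nodepool-name>
# Prints the IDs of running Spot instances tagged with the given
# karpenter.sh/nodepool value. The function name is illustrative only.
list_spot_targets() {
  aws ec2 describe-instances \
    --filters \
      "Name=tag:karpenter.sh/nodepool,Values=$1" \
      "Name=instance-lifecycle,Values=spot" \
      "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}
```

Calling list_spot_targets with your nodepool name and comparing the result with the intended selection size is a cheap sanity check before any interruption is sent.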
IAM role for the ExperimentTemplate
The role must include the AWSFaultInjectionSimulatorEC2Access policy and the following trust relationship:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "fis.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
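One way to create such a role from the command line is sketched below. The function name create_fis_role, the role name, and the trust-policy file path are hypothetical placeholders. The managed policy ARN uses the service-role path as an assumption, so confirm it in the IAM console before relying on it.

```shell
# create_fis_role <role-name> <trust-policy-file>
# Creates the role with the trust relationship stored in the given file,
# then attaches the AWSFaultInjectionSimulatorEC2Access managed policy.
# Assumption: the policy lives under the service-role/ path.
create_fis_role() {
  aws iam create-role \
    --role-name "$1" \
    --assume-role-policy-document "file://$2" &&
  aws iam attach-role-policy \
    --role-name "$1" \
    --policy-arn "arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access"
}
```

For example, create_fis_role fis-spot-test trust-policy.json, where trust-policy.json holds the trust relationship above. The ARN of the resulting role is the value to supply as roleArn in the experiment template.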
Experiment template
Create an experiment template that specifies cluster, nodepool, role, and selection parameters.
aws fis create-experiment-template \
  --cli-input-json '{
    "description": "test",
    "targets": {
      "SpotInstances-Target-1": {
        "resourceType": "aws:ec2:spot-instance",
        "resourceTags": {
          "kubernetes.io/cluster/<cluster name>": "owned",
          "karpenter.sh/nodepool": "<nodepool name>"
        },
        "selectionMode": "COUNT(5)"
      }
    },
    "actions": {
      "spot": {
        "actionId": "aws:ec2:send-spot-instance-interruptions",
        "parameters": {
          "durationBeforeInterruption": "PT2M"
        },
        "targets": {
          "SpotInstances": "SpotInstances-Target-1"
        }
      }
    },
    "stopConditions": [
      {
        "source": "none"
      }
    ],
    "roleArn": "<Role ARN>",
    "tags": {},
    "experimentOptions": {
      "accountTargeting": "single-account",
      "emptyTargetResolutionMode": "fail"
    }
  }'
"COUNT(5)" targets five instances.
"ALL" targets all Spot instances in the nodepool.
Experiment execution
Start experiments by referencing the template ID:
aws fis start-experiment --experiment-template-id <Experiment Template ID>
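To script the run end to end, the template ID can be captured when the template is created and the experiment polled until it reaches a terminal state. This is a sketch under stated assumptions: the function names create_template and run_experiment are hypothetical, the template JSON is assumed to be saved in a local file, and the poll interval defaults to 10 seconds.

```shell
# create_template <template-json-file>
# Creates the experiment template from a file and prints its ID.
create_template() {
  aws fis create-experiment-template \
    --cli-input-json "file://$1" \
    --query 'experimentTemplate.id' \
    --output text
}

# run_experiment <template-id> [poll-seconds]
# Starts an experiment, prints its status on each poll, and returns 0
# if it completes or 1 if it stops, fails, or is cancelled.
run_experiment() {
  exp_id="$(aws fis start-experiment \
    --experiment-template-id "$1" \
    --query 'experiment.id' --output text)" || return 1
  while :; do
    status="$(aws fis get-experiment --id "$exp_id" \
      --query 'experiment.state.status' --output text)" || return 1
    echo "experiment ${exp_id}: ${status}"
    case "$status" in
      completed) return 0 ;;
      stopped|failed|cancelled) return 1 ;;
    esac
    sleep "${2:-10}"
  done
}
```

A typical sequence would be template_id="$(create_template template.json)" followed by run_experiment "$template_id". Because the template sets emptyTargetResolutionMode to fail, an experiment that matches no instances is expected to end in the failed state, which this loop reports as a non-zero return.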
Expected results
When the experiment runs:
AWS FIS issues a Spot interruption notice with a two-minute warning.
Kompass resumes hibernated nodes allocated for the workload.
Pods move to the resumed HiberScale nodes and continue running without downtime.
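These outcomes can be spot-checked from the cluster side while the experiment runs. The sketch below assumes kubectl access to the cluster; the function names are hypothetical, and whether resumed HiberScale nodes carry the karpenter.sh/nodepool label is an assumption to verify in your environment.

```shell
# nodes_in_pool <nodepool-name>
# Lists nodes carrying the given Karpenter nodepool label.
# Assumption: resumed HiberScale nodes carry this label too.
nodes_in_pool() {
  kubectl get nodes -l "karpenter.sh/nodepool=$1" -o wide
}

# pods_not_running <namespace>
# Lists pods whose phase is anything other than Running. Pods that have
# finished (phase Succeeded) also appear here.
pods_not_running() {
  kubectl get pods -n "$1" --field-selector=status.phase!=Running
}
```

Running nodes_in_pool before and after the interruption notice shows node turnover, and an empty pods_not_running result for the workload namespace is consistent with pods continuing to run.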