Spot Instances 101: Safely Slash Compute Costs up to 90%
What Spot Instances Are and When They Make Sense
Spot instances are spare capacity that cloud providers sell at deep discounts (often 70-90%). The trade-off is that the provider can reclaim the capacity with little warning. They are ideal for workloads that are:
- Stateless or easily restartable (batch jobs, CI pipelines, data processing).
- Tolerant of interruption (machine learning training with checkpointing, image rendering farms).
- Behind a resilient architecture (auto-scaling groups, Kubernetes deployments, or queue-driven services).
If your application meets any of these criteria, you can start moving a portion of its baseline compute to Spot without sacrificing availability.
AWS Spot: From Request to Production
1. Choose the Right Launch Method
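- Spot Fleet – lets you define a target capacity and lets the service choose among instance types and Availability Zones within a region according to an allocation strategy (for example, lowest price).
- EC2 Auto Scaling with Mixed Instances Policy – combines On-Demand and Spot capacity in a single ASG. Savings Plans are a billing discount rather than a launch option; they apply to the On-Demand portion automatically.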
- EC2 Spot Instances via the CLI – quick for ad‑hoc jobs.
2. Set a Maximum Price (or use the default)
aws ec2 request-spot-instances \
--instance-count 4 \
--type "one-time" \
--launch-specification file://spec.json \
--spot-price "0.03"
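The --spot-price flag is optional. It sets the maximum you are willing to pay, not the price you are charged: you always pay the current Spot price, and if the flag is omitted the maximum defaults to the On-Demand price. A lower ceiling caps your hourly rate but makes interruption more likely whenever the Spot price rises above it.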
3. Handle Interruption Notices
AWS sends a two‑minute warning via the instance metadata URL http://169.254.169.254/latest/meta-data/spot/termination-time. A simple daemon can poll this endpoint and gracefully stop services or checkpoint state.
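# -f makes curl exit non-zero on the 404 returned while no interruption is scheduled;
# without it the loop would break on the first iteration.
# If the instance enforces IMDSv2, fetch a session token first and pass it in the
# X-aws-ec2-metadata-token header.
while true; do
  curl -sf http://169.254.169.254/latest/meta-data/spot/termination-time && break
  sleep 5
done
# trigger graceful shutdown here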
Integrate this script with your init system or container entrypoint.
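When the shutdown logic lives in application code, the same polling loop can be sketched in Python. This is a minimal sketch rather than an AWS reference implementation: the function names are hypothetical, the fetch function is injectable so the loop can be exercised without the metadata service, and it assumes the endpoint returns a UTC timestamp such as 2015-01-05T18:02:00Z once an interruption is scheduled.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

# Endpoint from the text above; it answers 404 until an interruption is scheduled.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_time(url=TERMINATION_URL, timeout=2):
    """Return the termination timestamp string, or None if none is scheduled.

    Assumes IMDSv1-style access; with IMDSv2 enforced, a session token
    header would have to be added to the request.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip()
    except (urllib.error.URLError, OSError):
        return None  # 404 or unreachable: treat as "no notice yet"


def seconds_until_termination(timestamp, now):
    """Seconds between `now` (timezone-aware) and a stamp like 2015-01-05T18:02:00Z."""
    when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (when - now).total_seconds()


def wait_for_interruption(fetch=fetch_termination_time, interval=5, sleep=time.sleep):
    """Poll until a termination time appears, then return it."""
    while True:
        timestamp = fetch()
        if timestamp:
            return timestamp
        sleep(interval)
```

Running wait_for_interruption() in a background thread and passing its result to seconds_until_termination() tells the application how much of the two-minute window is left for checkpointing.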
4. Use Capacity Rebalancing (newer feature)
Enable Capacity Rebalancing on the ASG. When AWS issues a rebalance recommendation for a Spot Instance at elevated risk of interruption, the ASG launches a replacement proactively, often before the two-minute notice arrives. The --instance-interruption-behavior terminate setting belongs to direct Spot requests (where terminate is already the default), not to this command.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --capacity-rebalance
5. Monitor Spot Utilization
- The CloudWatch metric FulfilledCapacity (AWS/EC2Spot namespace, published for Spot Fleet requests) shows how much Spot capacity is currently fulfilled.
- Set an alarm if the metric drops below a threshold, then automatically scale up On-Demand capacity.
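The alarm-then-scale-up rule comes down to a small calculation. The sketch below uses a hypothetical helper, on_demand_topup, and assumes the fulfilled Spot capacity has already been read from CloudWatch; the parameter names and the threshold semantics are illustrative choices, not part of any AWS API.

```python
def on_demand_topup(target_capacity, fulfilled_spot, current_on_demand, spot_alarm_threshold):
    """Return how many On-Demand instances to add when the Spot alarm fires.

    The alarm is considered to fire when fulfilled Spot capacity drops
    below `spot_alarm_threshold`. The top-up is the gap between the target
    and the capacity currently running, never less than zero.
    """
    if fulfilled_spot >= spot_alarm_threshold:
        return 0  # alarm not firing, nothing to add
    shortfall = target_capacity - fulfilled_spot - current_on_demand
    return max(0, shortfall)


# Target of 10 instances, only 4 Spot fulfilled, 2 On-Demand already running.
print(on_demand_topup(10, fulfilled_spot=4, current_on_demand=2, spot_alarm_threshold=6))
```

Keeping the rule in one pure function makes it easy to reuse from whatever the alarm triggers, such as a Lambda function or a scheduled job.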
GCP Preemptible VMs: Simplicity with Limits
GCP’s preemptible VMs are the equivalent of Spot, but they have a fixed maximum lifetime of 24 hours and a 30‑second termination notice.
1. Create with gcloud
gcloud compute instances create preemptible-worker \
  --machine-type n1-standard-4 \
  --preemptible \
  --image-family debian-11 \
  --image-project debian-cloud \
  --metadata shutdown-script-url=gs://my-scripts/handle-preempt.sh
The --preemptible flag applies the discount automatically.
2. Use Managed Instance Groups (MIG) for Auto‑Scaling
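# Add --zone or --region if no default is configured.
gcloud compute instance-groups managed create preempt-mig \
  --template preemptible-template \
  --size 5
# Autoscaling is configured with a separate command.
gcloud compute instance-groups managed set-autoscaling preempt-mig \
  --max-num-replicas 20 \
  --target-cpu-utilization 0.6
The MIG recreates preempted VMs automatically when capacity is available, working to keep the group at its target size.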
3. Capture the 30‑second Notice
GCP writes a preempted flag to the instance metadata service. Poll it in a background process:
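# The endpoint always answers 200 with TRUE or FALSE, so compare the value
# instead of relying on curl's exit status.
while true; do
  state=$(curl -s -H "Metadata-Flavor: Google" \
    http://metadata.google.internal/computeMetadata/v1/instance/preempted)
  [ "$state" = "TRUE" ] && break
  sleep 5
done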
# graceful shutdown actions
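With only 30 seconds of notice, it helps to decide in advance which shutdown steps fit. The following sketch is illustrative: is_preempted assumes the metadata value is the string TRUE or FALSE, and plan_shutdown, the step names, and the time estimates are hypothetical.

```python
NOTICE_SECONDS = 30  # termination notice stated in the text above


def is_preempted(metadata_body):
    """Interpret the body returned by the instance/preempted metadata key."""
    return metadata_body.strip() == "TRUE"


def plan_shutdown(steps, already_elapsed=0.0, budget=NOTICE_SECONDS):
    """Pick the steps that fit in the remaining notice window.

    `steps` is a list of (name, estimated_seconds) in priority order.
    A step that does not fit is skipped so later, cheaper steps can still run.
    """
    remaining = budget - already_elapsed
    planned = []
    for name, estimate in steps:
        if estimate <= remaining:
            planned.append(name)
            remaining -= estimate
    return planned


steps = [
    ("flush_queue", 5),
    ("write_checkpoint", 15),
    ("upload_logs", 20),
    ("deregister", 2),
]
# Polling every 5 seconds can use up to 5 seconds of the notice before detection.
print(plan_shutdown(steps, already_elapsed=5))
```

Ordering the steps by priority means the work that matters most, such as writing the checkpoint, claims the time budget first.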
4. Cost Visibility
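Enable Billing Export to BigQuery and sum cost where service.description = "Compute Engine" AND sku.description LIKE "%Preemptible%". This gives you the exact amount spent on preemptible capacity, which you can then compare against On-Demand rates for the same usage.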
Azure Spot VMs: Leveraging Eviction Policies
Azure Spot VMs let you choose an eviction policy that determines what happens when capacity is reclaimed.
1. Deploy via Azure CLI
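# Recent Azure CLI versions use the Ubuntu2204 alias; the older UbuntuLTS alias has been retired.
az vm create \
  --resource-group rg-prod \
  --name spot-worker \
  --image Ubuntu2204 \
  --size Standard_D4s_v3 \
  --priority Spot \
  --eviction-policy Deallocate \
  --max-price -1 # -1 means pay the current Spot price, capped at the pay-as-you-go rate
Deallocate keeps the disks, allowing a quick restart once capacity returns, but you continue to pay for disk storage while the VM is deallocated; Delete removes the VM and its disks entirely.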
2. Use Scale Sets for Resilience
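az vmss create \
  --resource-group rg-prod \
  --name spot-ss \
  --image Ubuntu2204 \
  --upgrade-policy-mode Automatic \
  --instance-count 3 \
  --priority Spot \
  --eviction-policy Deallocate \
  --max-price -1
Replacement is not automatic in every configuration. With Deallocate, evicted instances stay in the scale set in a stopped state until capacity returns, so pair the scale set with autoscale rules or the Try & restore option if you need capacity restored without manual action.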
3. React to Eviction Events
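Azure records evictions in the Activity Log, and inside the VM the Scheduled Events metadata endpoint reports a Preempt event at least 30 seconds before eviction. To react centrally, set up an Event Grid subscription that triggers an Azure Function to log the event and optionally spin up a fallback regular-priority (on-demand) VM. The payload below is an abbreviated illustration of the shape, not a captured event: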
{
"subject": "/subscriptions/<sub>/resourceGroups/rg-prod/providers/Microsoft.Compute/virtualMachineScaleSets/spot-ss/eviction",
"eventType": "Microsoft.Resources.ResourceWriteSuccess",
"data": { "status": "Succeeded" }
}
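The function behind the subscription mostly needs to recognize the event and extract the affected scale set. The sketch below parses the illustrative payload above; parse_eviction_event is a hypothetical helper, and real Event Grid events carry more fields and may use a different subject and eventType, so treat the matching rules as assumptions to adapt.

```python
import json


def parse_eviction_event(raw_event):
    """Return the scale set name if the event describes an eviction, else None.

    Accepts either a JSON string or an already-decoded dict shaped like the
    illustrative payload: a subject ending in
    .../virtualMachineScaleSets/<name>/eviction and data.status == "Succeeded".
    """
    event = json.loads(raw_event) if isinstance(raw_event, str) else raw_event
    parts = [p for p in event.get("subject", "").split("/") if p]
    if len(parts) < 3 or parts[-1] != "eviction":
        return None
    if event.get("data", {}).get("status") != "Succeeded":
        return None
    if parts[-3] != "virtualMachineScaleSets":
        return None
    return parts[-2]


sample = {
    "subject": "/subscriptions/<sub>/resourceGroups/rg-prod/providers/"
               "Microsoft.Compute/virtualMachineScaleSets/spot-ss/eviction",
    "eventType": "Microsoft.Resources.ResourceWriteSuccess",
    "data": {"status": "Succeeded"},
}
print(parse_eviction_event(sample))
```

Returning None for anything that does not match keeps the function safe to attach to a subscription that also delivers unrelated resource events.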
4. Budget Alerts
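Create a budget that alerts when spend crosses a percentage of the budgeted amount. As written, the budget covers all costs in its scope, so scope it to the resource group or tag that holds your Spot resources if you want a Spot-only alert:
az consumption budget create \
  --budget-name SpotAlert \
  --amount 500 \
  --time-grain Monthly \
  --category Cost \
  --start-date <YYYY-MM-DD> \
  --end-date <YYYY-MM-DD> \
  --notifications "{\"operator\":\"GreaterThan\",\"threshold\":80,\"contactEmails\":[\"finops@example.com\"]}"
The az consumption command group is in preview and its flags have changed between CLI versions. Check az consumption budget create --help before relying on --notifications; if your version does not accept it, configure the 80% notification in the portal or through an ARM template.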
Best‑Practice Checklist Across All Clouds
- Tag Spot resources (Environment=prod, CostCategory=Spot) so you can filter them in cost reports.
- Automate checkpointing (e.g., TensorFlow checkpoints, Spark checkpoints) to survive preemptions.
- Mix Spot with On‑Demand: aim for 70‑80% Spot capacity, keep a small On‑Demand buffer for burst.
- Use region‑wide instance pools: Spot price varies by AZ; a broader pool reduces the chance of out‑of‑capacity.
- Set explicit max‑price only when you have a hard ceiling; otherwise let the provider manage pricing.
- Monitor eviction rates: high eviction frequency (>20% per hour) indicates you’re over‑committing; dial back Spot proportion.
- Leverage provider‑specific tools: AWS EC2 Fleet, GCP MIG, Azure Scale Sets – they handle replacement automatically.
- Validate with a pilot: start with a non‑critical workload, measure cost vs. interruption, then expand.
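Two items in the checklist, the Spot and On-Demand mix and the eviction-rate guard, can be expressed as small calculations. This is a sketch under stated assumptions: the function names, the flat discount, the 10-point step, and the prices in the example are hypothetical, and the 20% per hour limit is the figure from the checklist.

```python
def blended_hourly_cost(instances, spot_fraction, on_demand_price, spot_discount):
    """Hourly cost of a fleet split between Spot and On-Demand.

    `spot_discount` is the fraction taken off the On-Demand price (0.7 = 70%).
    """
    spot_count = round(instances * spot_fraction)
    on_demand_count = instances - spot_count
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_count * on_demand_price + spot_count * spot_price


def recommend_spot_fraction(current_fraction, evictions_last_hour, spot_instances,
                            max_rate=0.20, step=0.10):
    """Dial the Spot share back by `step` when the hourly eviction rate exceeds `max_rate`."""
    if spot_instances == 0:
        return current_fraction
    rate = evictions_last_hour / spot_instances
    if rate > max_rate:
        return round(max(0.0, current_fraction - step), 2)
    return current_fraction


# 10 instances at a $0.20 On-Demand price, 70% on Spot at a 70% discount.
print(f"{blended_hourly_cost(10, 0.7, 0.20, 0.70):.2f} vs {10 * 0.20:.2f} all On-Demand")
print(recommend_spot_fraction(0.8, evictions_last_hour=3, spot_instances=10))
```

Running the numbers during the pilot makes the trade-off concrete: the cost function shows what a given mix saves, and the guard shows when interruptions suggest pulling back.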
Calculating the Dollar Impact
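- Export the last 30 days of compute spend to a CSV or BigQuery table.
- Filter rows where sku.description contains Spot, Preemptible, or Spot VM.
- Sum the cost column for those rows. That is what you actually paid for Spot capacity.
- Price the same usage at On-Demand rates and subtract the Spot spend; the difference is the amount you saved.
- Divide the savings by what total compute would have cost without Spot (actual total plus savings) to get the percentage reduction.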
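The calculation can be sketched over an exported CSV. The column names here are assumptions: sku_description and cost stand in for the export's fields, and on_demand_cost is a hypothetical column you would have to compute yourself by pricing each row's usage at the On-Demand rate, since billing exports do not normally include it. Savings are measured as the On-Demand equivalent minus the actual Spot spend.

```python
import csv
import io

# "Spot VM" contains "spot", so two markers cover all three SKU labels.
SPOT_MARKERS = ("spot", "preemptible")


def spot_savings(rows):
    """Return (spot_spend, savings, pct_reduction) for billing rows.

    Each row needs `sku_description`, `cost`, and `on_demand_cost`
    (what the same usage would have cost at On-Demand rates).
    """
    total_cost = spot_spend = savings = 0.0
    for row in rows:
        cost = float(row["cost"])
        total_cost += cost
        if any(marker in row["sku_description"].lower() for marker in SPOT_MARKERS):
            spot_spend += cost
            savings += float(row["on_demand_cost"]) - cost
    baseline = total_cost + savings  # total compute cost had everything run On-Demand
    pct_reduction = savings / baseline if baseline else 0.0
    return spot_spend, savings, pct_reduction


SAMPLE_CSV = """sku_description,cost,on_demand_cost
Spot Preemptible N1 Instance Core,30.00,100.00
N1 Predefined Instance Core,70.00,70.00
"""

spot_spend, savings, pct = spot_savings(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(f"Spot spend: ${spot_spend:.2f}, saved: ${savings:.2f} ({pct:.1%} reduction)")
```

Dividing by the all-On-Demand baseline rather than by the actual bill keeps the percentage comparable to the advertised discount range.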
How CloudBudgetMaster Helps
CloudBudgetMaster continuously scans your AWS, GCP, and Azure accounts, automatically flags Spot‑eligible workloads, and shows the exact dollar impact of moving them to Spot. It surfaces interruption‑risk metrics and recommends the optimal Spot‑On‑Demand mix, letting you act on savings without manual digging.