Detecting Zombie Infrastructure: Find Forgotten Cloud Resources
What is a "zombie" resource?
A zombie (or orphan) resource is any cloud asset that is still running, allocated, or reserved but provides no business value. It often lives in a project or account that no longer has an owner, and because it is billable, it silently inflates the monthly spend.
Typical signs:
- No recent CloudWatch/Stackdriver metrics.
- No tags, or tags that belong to a decommissioned team.
- Stale creation dates (6+ months old).
- No attached workloads (e.g., a load balancer with zero targets).
Detecting these items requires systematic inventory, filtering by activity, and then validation with the responsible team.
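As a rough illustration of the filtering step, the signs listed above can be checked against a single inventory record. This is only a sketch: the record layout (`tags`, `created`, `metric_datapoints`, `attached_workloads`) and the `active_teams` set are hypothetical names chosen for the example, and the 180-day default stands in for "6+ months".

```python
from datetime import datetime, timedelta, timezone

def zombie_signals(resource, now, active_teams, max_age_days=180):
    """Return the warning signs that apply to one (hypothetical) inventory record."""
    signals = []
    if not resource.get("metric_datapoints"):
        signals.append("no recent metrics")
    owner = resource.get("tags", {}).get("owner")
    if owner is None:
        signals.append("no owner tag")
    elif owner not in active_teams:
        signals.append("owner team decommissioned")
    if now - resource["created"] > timedelta(days=max_age_days):
        signals.append("stale creation date")
    if resource.get("attached_workloads", 0) == 0:
        signals.append("no attached workloads")
    return signals

# Made-up record: a load balancer with zero targets owned by a retired team.
example = {
    "id": "lb-demo",
    "tags": {"owner": "team-apollo"},
    "created": datetime(2023, 1, 1, tzinfo=timezone.utc),
    "metric_datapoints": [],
    "attached_workloads": 0,
}
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(zombie_signals(example, now, active_teams={"team-zeus"}))
```

A resource that trips several signals at once is a stronger candidate, but the list is only input for the validation step with the responsible team.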
1. Scan AWS for common zombies
a) EC2 instances without traffic
aws ec2 describe-instances \
--filters Name=instance-state-name,Values=running \
--query 'Reservations[].Instances[].{ID:InstanceId,Launch:LaunchTime,AZ:Placement.AvailabilityZone}' \
--output table
Cross-reference the output with VPC Flow Logs or the CloudWatch NetworkIn metric. Any instance whose total NetworkIn stays below 1 KB over the last 30 days is a candidate.
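The 1 KB check can be sketched in Python against the JSON that `aws cloudwatch get-metric-statistics --statistics Sum --output json` returns for NetworkIn. The function names are illustrative, and the choice to treat an empty `Datapoints` list as "unknown" rather than "idle" is an assumption of this sketch.

```python
def total_bytes(metric_response):
    """Sum the 'Sum' statistic over all datapoints; None when there are no datapoints."""
    points = metric_response.get("Datapoints", [])
    if not points:
        return None
    return sum(p["Sum"] for p in points)

def is_traffic_candidate(metric_response, threshold_bytes=1024):
    """True only when data exists and the 30-day total is below the threshold."""
    total = total_bytes(metric_response)
    return total is not None and total < threshold_bytes

# Made-up response with two daily datapoints.
quiet = {
    "Label": "NetworkIn",
    "Datapoints": [
        {"Timestamp": "2024-01-01T00:00:00+00:00", "Sum": 300.0, "Unit": "Bytes"},
        {"Timestamp": "2024-01-02T00:00:00+00:00", "Sum": 200.0, "Unit": "Bytes"},
    ],
}
print(is_traffic_candidate(quiet))
print(is_traffic_candidate({"Label": "NetworkIn", "Datapoints": []}))
```

Keeping "no data" separate from "zero traffic" avoids flagging instances whose metrics were simply not returned for the queried window.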
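b) Unattached EBS volumes older than 30 days
# GNU date syntax; compute the cutoff first so the shell actually expands it
CUTOFF=$(date -u -d "30 days ago" +%Y-%m-%dT%H:%M:%SZ)
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query "Volumes[?CreateTime<'${CUTOFF}'].{ID:VolumeId,Size:Size,AZ:AvailabilityZone,Created:CreateTime}" \
--output table
CreateTime is the creation date, not the detach date, so treat the result as a list of candidates. If the volume size is >10 GiB, consider snapshotting it before deleting.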
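The same age filter can also be applied client-side, which sidesteps shell quoting entirely. The sketch below assumes the input is the `Volumes` list from `aws ec2 describe-volumes --output json`; the function name, the sample volumes, and the `snapshot_first` flag (derived from the 10 GiB rule of thumb) are illustrative.

```python
from datetime import datetime, timedelta, timezone

def old_available_volumes(volumes, now, min_age_days=30, snapshot_over_gib=10):
    """Return unattached volumes created before the cutoff."""
    cutoff = now - timedelta(days=min_age_days)
    candidates = []
    for vol in volumes:
        # Normalise a trailing "Z" so datetime.fromisoformat accepts the timestamp.
        created = datetime.fromisoformat(vol["CreateTime"].replace("Z", "+00:00"))
        if vol["State"] == "available" and created < cutoff:
            candidates.append({
                "id": vol["VolumeId"],
                "snapshot_first": vol["Size"] > snapshot_over_gib,
            })
    return candidates

# Made-up sample data in the shape of the CLI output.
sample = [
    {"VolumeId": "vol-aaa", "State": "available", "Size": 100,
     "CreateTime": "2023-11-02T09:15:00+00:00"},
    {"VolumeId": "vol-bbb", "State": "in-use", "Size": 8,
     "CreateTime": "2023-01-10T12:00:00+00:00"},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 8,
     "CreateTime": "2024-02-25T12:00:00+00:00"},
]
print(old_available_volumes(sample, datetime(2024, 3, 1, tzinfo=timezone.utc)))
```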
c) Elastic IPs (EIPs) that are allocated but not associated
aws ec2 describe-addresses \
--query 'Addresses[?AssociationId==null].{PublicIP:PublicIp,AllocationId:AllocationId}' \
--output table
Each unattached EIP costs $0.005 / hour. Multiply by 720 hours to see the monthly impact.
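The monthly figure is simple arithmetic; a small sketch using the rate and the 720-hour month quoted above (the function name is illustrative, and `Decimal` is used only to keep the cents exact):

```python
from decimal import Decimal

HOURLY_RATE = Decimal("0.005")  # USD per unattached EIP per hour, as quoted in the text
HOURS_PER_MONTH = 720           # 30-day month, as in the text

def monthly_cost(idle_eip_count):
    """Monthly cost in USD for a number of unattached Elastic IPs."""
    return HOURLY_RATE * HOURS_PER_MONTH * idle_eip_count

print(monthly_cost(1))   # 3.600
print(monthly_cost(12))  # 43.200
```

A single address is cheap, so the number matters mostly when the scan turns up dozens of them across accounts and regions.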
d) NAT Gateways with zero bytes processed
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name BytesOutToDestination \
--statistics Sum \
--period 86400 \
--start-time $(date -d "30 days ago" -u +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--dimensions Name=NatGatewayId,Value=<gw-id>
If the sum is zero for the whole period, the NAT Gateway is idle and is a candidate for deletion once you confirm that no route table still points at it.
2. Scan GCP for orphaned assets
a) Compute Engine VMs with no CPU usage
gcloud compute instances list --format="json" | jq '[.[] | select(.status=="RUNNING") | {name, zone, creationTimestamp}]' > running.json
gcloud monitoring time-series list \
--filter='metric.type="compute.googleapis.com/instance/cpu/utilization" AND metric.labels.instance_name=~".*"' \
--format=json > cpu.json
Join the two JSON files on instance name; any instance whose average cpu/utilization stays below 0.01 (the metric is a fraction, so 1%) over the last 30 days is a zombie candidate.
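One possible way to do the join in Python is sketched below. It assumes `running` is a list of objects with a `name` field and that `time_series` follows the Cloud Monitoring `timeSeries` layout (`metric.labels.instance_name` plus `points[].value.doubleValue`); the function name and sample values are made up. Joining on name alone also assumes instance names are unique within the project.

```python
from statistics import mean

def idle_instances(running, time_series, threshold=0.01):
    """Names of running instances whose mean CPU utilization is below the threshold."""
    cpu_by_name = {}
    for series in time_series:
        name = series["metric"]["labels"]["instance_name"]
        values = [p["value"]["doubleValue"] for p in series.get("points", [])]
        cpu_by_name.setdefault(name, []).extend(values)

    idle = []
    for inst in running:
        values = cpu_by_name.get(inst["name"])
        # Instances without any datapoints are skipped rather than flagged.
        if values and mean(values) < threshold:
            idle.append(inst["name"])
    return idle

# Made-up sample data.
running = [{"name": "vm-idle"}, {"name": "vm-busy"}, {"name": "vm-nodata"}]
time_series = [
    {"metric": {"labels": {"instance_name": "vm-idle"}},
     "points": [{"value": {"doubleValue": 0.002}}, {"value": {"doubleValue": 0.004}}]},
    {"metric": {"labels": {"instance_name": "vm-busy"}},
     "points": [{"value": {"doubleValue": 0.35}}, {"value": {"doubleValue": 0.6}}]},
]
print(idle_instances(running, time_series))
```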
b) Persistent disks not attached
gcloud compute disks list --filter='-users:*' --format='table(name,zone,sizeGb,creationTimestamp)'
If the disk is >100 GB, snapshot it before deletion.
c) Static external IPs that are not in use
gcloud compute addresses list --filter='status=RESERVED' --format='table(name,region,address,creationTimestamp)'
Each reserved IP costs $0.004 / hour. Identify those older than 90 days.
3. Scan Azure for forgotten resources
a) Virtual machines with low network I/O
Get-AzVM -Status | Where-Object {$_.PowerState -eq "VM running"} | ForEach-Object {
$metrics = Get-AzMetric -ResourceId $_.Id -MetricName "Network In Total" -AggregationType Average -TimeGrain 01:00:00 -StartTime (Get-Date).AddDays(-30) -EndTime (Get-Date)
$avg = ($metrics.Data | Measure-Object -Property Average -Average).Average
if ($avg -lt 1KB) { $_ }
}
b) Unattached managed disks
Get-AzDisk | Where-Object { $null -eq $_.ManagedBy } | Select-Object Name, ResourceGroupName, DiskSizeGB, TimeCreated
c) Public IPs not associated with a NIC or Load Balancer
Get-AzPublicIpAddress | Where-Object {$_.IpConfiguration -eq $null} | Select-Object Name, IpAddress, Location, AllocationMethod
Each idle Standard Public IP costs $0.005 / hour.
4. Validate before deletion
- Tag check – Ensure the resource has a tag like owner or cost-center. If it is missing, ping the Slack channel #cloud-costs with the resource ID and ask for ownership.
- Snapshot/backup – For storage (EBS, Persistent Disk, Managed Disk), create a snapshot:
  - AWS: aws ec2 create-snapshot --volume-id vol-12345678 --description "pre-cleanup snapshot"
  - GCP: gcloud compute disks snapshot my-disk --snapshot-names my-disk-snap
  - Azure: az snapshot create --resource-group rg-prod --source /subscriptions/.../myDisk --name myDiskSnap
- Dry-run delete – Many AWS EC2 commands accept a --dry-run flag, which checks permissions without making the change; where no such flag exists, list the IDs and confirm them manually.
- Document – Record the action in a shared spreadsheet or in your IaC repo (e.g., a comment in the Terraform configuration or the commit message).
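The checklist above could be encoded as a small gate that a cleanup script consults before deleting anything. This is a sketch under stated assumptions: the record fields (`type`, `tags`, `snapshot_id`, `dry_run_ok`, `change_record`) and the type names in `STORAGE_TYPES` are hypothetical.

```python
STORAGE_TYPES = {"ebs_volume", "persistent_disk", "managed_disk"}

def deletion_blockers(resource):
    """Return the reasons a resource must not be deleted yet (empty list = clear)."""
    blockers = []
    tags = resource.get("tags", {})
    if not (tags.get("owner") or tags.get("cost-center")):
        blockers.append("no owner or cost-center tag: ask in #cloud-costs")
    if resource["type"] in STORAGE_TYPES and not resource.get("snapshot_id"):
        blockers.append("storage resource has no snapshot")
    if not resource.get("dry_run_ok"):
        blockers.append("dry run not confirmed")
    if not resource.get("change_record"):
        blockers.append("action not documented")
    return blockers

# Made-up record: an untagged volume with no snapshot yet.
volume = {"id": "vol-12345678", "type": "ebs_volume", "tags": {},
          "dry_run_ok": True, "change_record": "cleanup sheet, row 14"}
print(deletion_blockers(volume))
```

Returning the full list of blockers, instead of stopping at the first one, gives the reviewer everything that still needs attention in a single pass.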
5. Automate the hunt
- Scheduled Lambda / Cloud Function – Run the AWS snippets daily, push results to an S3 bucket, and send a Slack webhook if any candidate exceeds a cost threshold.
- Terraform import guard – Add a lifecycle { prevent_destroy = true } block to critical resources, then use terraform state rm only after the zombie check passes.
- Policy as code – Use the AWS Config rule ec2-instance-no-public-ip combined with a custom rule that flags instances with NetworkIn < 1 KB for 30 days.
- Cross-cloud dashboard – Export all findings to a CSV, ingest into a Grafana panel, and set alerts on "zombie count > 0".
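For the scheduled-function idea, the core of the handler is the threshold filter and the Slack message body. The sketch below covers only that part and leaves out the S3 upload and the HTTP POST to the webhook URL. The finding layout (`id`, `monthly_cost_usd`), the function name, and the sample costs are assumptions; the `{"text": ...}` body is the basic payload Slack incoming webhooks accept.

```python
import json

def build_alert(candidates, cost_threshold_usd):
    """Return a Slack incoming-webhook JSON body, or None if nothing exceeds the threshold."""
    flagged = [c for c in candidates if c["monthly_cost_usd"] > cost_threshold_usd]
    if not flagged:
        return None
    lines = [f"{c['id']}: ${c['monthly_cost_usd']:.2f}/month" for c in flagged]
    return json.dumps({"text": "Zombie candidates:\n" + "\n".join(lines)})

# Made-up findings with illustrative cost figures.
findings = [
    {"id": "nat-0abc", "monthly_cost_usd": 32.40},
    {"id": "eipalloc-0def", "monthly_cost_usd": 3.60},
]
print(build_alert(findings, cost_threshold_usd=10))
```

Returning `None` when nothing crosses the threshold lets the caller skip the webhook call entirely, which keeps the channel quiet on clean days.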
6. Ongoing governance
- Tag enforcement – Require every new resource to have owner, environment, and ttl tags via IAM policies or Service Catalog.
- Quarterly review – Run the detection scripts before each fiscal quarter and retire any lingering zombies.
- Cost allocation reports – Use AWS Cost Explorer, GCP Billing Export, and Azure Cost Management to verify that the monthly spend for the identified resource types drops after cleanup.
- Education – Add a short “Zombie Awareness” slide to onboarding for engineers and product managers.
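The tag requirement from the list above can also be audited after the fact. A minimal sketch, assuming the ttl tag holds an expiry date in YYYY-MM-DD form (the text does not specify a format) and that tags arrive as a plain dictionary:

```python
from datetime import date

REQUIRED_TAGS = ("owner", "environment", "ttl")

def tag_violations(tags, today):
    """List missing required tags and flag a ttl that has passed or cannot be parsed."""
    problems = [f"missing tag: {name}" for name in REQUIRED_TAGS if name not in tags]
    ttl = tags.get("ttl")
    if ttl is not None:
        try:
            if date.fromisoformat(ttl) < today:
                problems.append("ttl expired")
        except ValueError:
            problems.append("ttl is not a YYYY-MM-DD date")
    return problems

# Made-up tag set: no environment tag and an expired ttl.
print(tag_violations({"owner": "team-a", "ttl": "2024-01-31"}, today=date(2024, 3, 1)))
```

Running a check like this as part of the quarterly review turns expired ttl tags into a ready-made candidate list.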
Even with disciplined processes, manual checks slip. CloudBudgetMaster continuously scans AWS, GCP, and Azure accounts, flags zombie resources in real time, and shows the exact dollar impact per item, letting you remediate with a single click.