DevOps

GitOps + Kubernetes Cost Optimization: Reducing Cloud Spend by 80% Without Sacrificing Performance

January 20, 2025
9 min read

The Kubernetes Cost Crisis

Kubernetes promised operational efficiency. But for many mid-market companies, it delivered a different reality: cloud bills that grew 300% faster than usage.

A sobering statistic: Organizations waste an average of 32% of their Kubernetes spend on idle resources, oversized pods, and inefficient scheduling. For a company spending $100,000/month on EKS, that's $384,000 a year thrown away.

But here's the opportunity: Spot instances cut compute costs by up to 90%, and GitOps-driven optimization reduces overall Kubernetes spend by 60-80% while improving performance.


TL;DR

Kubernetes promised efficiency but delivered cost chaos—organizations waste 32% of spend on idle resources while bills grow 300% faster than usage. We helped companies reduce Kubernetes costs by 60-80% using Spot instances, GitOps-driven right-sizing, and automated cleanup—without sacrificing performance.

The Strategy: Mix Spot and on-demand nodes, right-size resources based on actual usage data, embed cost awareness in CI/CD workflows, and automate cleanup of idle resources.

Key Takeaways:

  1. Spot instances for 60-70% of workloads - 70-90% cost savings for fault-tolerant applications. Diversify across 4+ instance types to reduce interruption risk below 2%.
  2. GitOps-driven right-sizing eliminates waste - VPA analyzes actual usage and recommends optimal requests. Deploy changes via ArgoCD/Flux in continuous optimization loop.
  3. Shift-left cost awareness prevents overruns - Integrate cost estimation in PRs, set namespace budgets, block deployments exceeding limits before production.
  4. Automated cleanup recovers 40% of spend - Delete unattached volumes, unused secrets, scale dev clusters to zero nights/weekends. Save 65% on non-production costs.

Real Results: SaaS startup reduced costs 79% ($15K to $3.2K/month, $141,600 annual savings). E-commerce cut 69% ($22K to $6.8K/month, $182,400 savings). Zero performance degradation.

Core Principle: Visibility before optimization—you cannot reduce costs you don't measure and understand.


Understanding the Cost Problem

Where Kubernetes Costs Hide

1. Compute Waste (40% of total cost)

  • Pods requesting 4 CPU cores but using 0.5
  • Development clusters running 24/7
  • Failed deployments leaving orphaned resources

2. Storage Inefficiency (25% of total cost)

  • Unattached PersistentVolumes
  • Unnecessary snapshot retention
  • Over-provisioned EBS volumes

3. Network Costs (15% of total cost)

  • Cross-AZ traffic
  • NAT Gateway data transfer
  • Load balancer proliferation

4. Licensing & Add-ons (20% of total cost)

  • Monitoring tools with per-node pricing
  • Unused premium features
  • Duplicate observability stacks

The principle is visibility before optimization. You cannot reduce costs you don't understand.

The Spot Instance Opportunity

Understanding Spot Economics

AWS Spot instances offer spare EC2 capacity at 70-90% discounts compared to on-demand pricing. The catch? AWS can reclaim them with only a two-minute warning when it needs the capacity back.

Real-World Pricing Example:

| Instance Type | On-Demand | Spot | Savings |
|---------------|-----------|------|---------|
| m5.xlarge | $0.192/hr | $0.058/hr | 70% |
| m5.2xlarge | $0.384/hr | $0.115/hr | 70% |
| r5.xlarge | $0.252/hr | $0.076/hr | 70% |
| c5.2xlarge | $0.340/hr | $0.102/hr | 70% |

Monthly Savings for 10 nodes:

  • On-Demand: $1,382
  • Spot: $414
  • Savings: $968/month ($11,616/year)

Which Workloads Fit Spot?

Perfect for Spot:

  • Stateless web applications
  • Batch processing jobs
  • CI/CD build runners
  • Data processing pipelines
  • Non-critical microservices

Not Suitable for Spot:

  • Databases (use RDS instead)
  • Stateful applications without HA
  • Single-replica critical services
  • Long-running transactions

The principle is design for failure. Workloads on Spot must tolerate interruption.

Implementation Principles

1. Mixed Node Group Architecture

Run critical workloads on on-demand nodes and everything else on Spot. Set up a critical node group with on-demand instances (2-5 nodes) and a general node group with Spot instances (3-20 nodes). Use taints and tolerations to control which pods land where.
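The taint-and-toleration split can be sketched in a Deployment manifest like this — the app name, label key (`workload-class`), and taint values are illustrative, not a prescribed convention. The taint itself (`workload-class=critical:NoSchedule`) would be applied to the on-demand node group at creation time:

```yaml
# A critical Deployment opts in to the on-demand node group via a
# node selector, and tolerates its taint so it can schedule there.
# All names and label keys below are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      nodeSelector:
        workload-class: critical    # label on the on-demand node group
      tolerations:
        - key: workload-class
          value: critical
          effect: NoSchedule
      containers:
        - name: api
          image: payments-api:1.0
```

Pods without the toleration are repelled from the tainted on-demand nodes, so general workloads naturally fill the Spot capacity.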

2. Instance Diversification

Don't use a single instance type. Diversify across 4+ similar types such as m5.xlarge, m5.2xlarge, m5a.xlarge, and m5a.2xlarge.

AWS spreads your capacity across multiple Spot pools, reducing interruption risk.
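With eksctl, diversification is a one-line list on the node group — a minimal sketch, assuming an EKS cluster managed via eksctl (cluster name and region are illustrative):

```yaml
# eksctl ClusterConfig fragment: one Spot node group spread across
# four similar instance types. Cluster name/region are illustrative.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster
  region: ca-central-1
managedNodeGroups:
  - name: general-spot
    spot: true
    instanceTypes:
      - m5.xlarge
      - m5.2xlarge
      - m5a.xlarge
      - m5a.2xlarge
    minSize: 3
    maxSize: 20
    desiredCapacity: 5
```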

3. Graceful Interruption Handling

Install AWS Node Termination Handler:

  • Detects 2-minute interruption warning
  • Cordons the node
  • Drains pods gracefully
  • Pods reschedule on other nodes

With proper PodDisruptionBudgets, applications experience zero downtime.
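A PodDisruptionBudget for a Spot-hosted service can be as small as this (the app name is illustrative):

```yaml
# Keep at least 2 replicas running while a Spot node drains,
# so the Node Termination Handler's eviction never takes the
# service below serving capacity.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-frontend
```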

4. Autoscaler Configuration

Configure Cluster Autoscaler to prefer Spot nodes:

  • Scale up Spot first
  • Fall back to on-demand when Spot unavailable
  • Scale down Spot first
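The "Spot first, on-demand fallback" preference can be expressed with Cluster Autoscaler's priority expander (enabled via the `--expander=priority` flag). Higher numbers win; the node-group name patterns below are illustrative:

```yaml
# Cluster Autoscaler priority-expander config: node groups matching
# ".*spot.*" are scaled up first; on-demand groups are the fallback.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*spot.*
    1:
      - .*on-demand.*
```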

GitOps-Driven Right-Sizing

The Problem: Resource Over-Provisioning

Most teams set resource requests too high "to be safe":

  • Requested: 2000m CPU, 4Gi memory
  • Actual usage: 200m CPU, 1Gi memory
  • Waste: 90% CPU, 75% memory

Impact: 5x-10x cost overrun from idle capacity.

The Solution: Data-Driven Right-Sizing

1. Vertical Pod Autoscaler (VPA)

VPA analyzes actual resource usage and recommends optimal requests:

  • Collects metrics over 7+ days
  • Generates recommendations
  • Optionally auto-applies changes

Run in recommendation mode first to validate suggestions before applying.
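Recommendation mode is just `updateMode: "Off"` on the VPA object — it records recommendations without ever evicting pods (target names are illustrative):

```yaml
# VPA in recommendation-only mode: collects usage, publishes
# suggested requests, and never restarts pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"
```

After a week of metrics, `kubectl describe vpa api-vpa` shows the recommended CPU and memory requests to fold back into your manifests.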

2. GitOps Workflow for Right-Sizing

The principle is infrastructure as code for resource optimization:

Step 1: VPA generates recommendations
Step 2: Automation creates a PR with updated resource requests
Step 3: DevOps reviews and validates
Step 4: ArgoCD/Flux deploys the changes
Step 5: Monitor for performance issues

This creates a continuous optimization loop:

  1. Collect usage data (7 days)
  2. Generate recommendations
  3. Review and approve
  4. Deploy via GitOps
  5. Monitor and repeat

Cost Visibility in CI/CD

The Principle: Shift-Left Cost Awareness

Developers should know the cost impact of their changes before merging, not after deployment.

Pre-Deployment Cost Estimation

Integrate cost estimation into merge requests:

  1. Parse Kubernetes manifests in the PR
  2. Calculate monthly cost based on resource requests
  3. Post estimate as PR comment
  4. Block if cost exceeds team budget

Example comment: "This deployment will add $127/month to the platform team's Kubernetes costs. Current budget utilization: 78%."
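One way to wire this into CI is a pull-request job that prices the changed manifests and fails over budget. This is a hypothetical sketch — the `estimate-k8s-cost.py` script and budget file are assumptions, not a real tool:

```yaml
# Hypothetical GitHub Actions job: estimate the monthly cost of the
# PR's Kubernetes manifests and block the merge if it busts the
# team budget. Script path and flags are illustrative.
name: pr-cost-estimate
on: pull_request
jobs:
  cost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Estimate cost of changed manifests
        run: |
          python scripts/estimate-k8s-cost.py \
            --manifests k8s/ \
            --budget-file budgets/platform-team.yaml \
            --fail-over-budget
```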

Budget Guardrails

Namespace-Level ResourceQuotas

Each team gets a namespace with CPU and memory limits that translate to monthly budget caps:

  • Total CPU: 100 cores ≈ $730/month
  • Total memory: 200Gi ≈ $500/month
  • Storage: 1Ti

When teams hit limits, they must optimize before requesting more.
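The budget caps above map directly onto a ResourceQuota (the namespace name is illustrative):

```yaml
# Namespace-level quota backing the budget caps: 100 CPU cores,
# 200Gi memory, 1Ti of storage requests.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-budget
  namespace: platform-team
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    requests.storage: 1Ti
```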

Real-Time Budget Monitoring

Alert at 70%, 90%, 100% of budget:

  • 70%: Informational
  • 90%: Scale down non-critical workloads
  • 100%: Block new deployments
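The three tiers can be encoded as Prometheus alerting rules. The `namespace_monthly_cost` and `namespace_monthly_budget` metrics below are hypothetical — substitute whatever your cost tool (e.g. Kubecost) actually exports:

```yaml
# Sketch of tiered budget alerts as a PrometheusRule.
# Metric names are assumptions, not real exporter output.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: namespace-budget-alerts
spec:
  groups:
    - name: budgets
      rules:
        - alert: BudgetAt70Percent
          expr: namespace_monthly_cost / namespace_monthly_budget > 0.70
          labels:
            severity: info
        - alert: BudgetAt90Percent
          expr: namespace_monthly_cost / namespace_monthly_budget > 0.90
          labels:
            severity: warning
        - alert: BudgetExceeded
          expr: namespace_monthly_cost / namespace_monthly_budget >= 1.00
          labels:
            severity: critical
```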

Automated Cleanup

Idle Resource Detection

Automatically identify waste:

  • PVCs not attached for 30+ days
  • Pods using <10% of requested CPU
  • Secrets/ConfigMaps unused for 90+ days

Scheduled Deletion

Weekly cleanup jobs:

  1. Identify resources matching TTL criteria
  2. Send 24-hour warning
  3. Delete if no objection
  4. Log for audit
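The weekly job itself is just a CronJob shell around the cleanup logic. The `cleanup.sh` script implementing the TTL, warning, and audit steps is an assumption — only the scheduling scaffold is shown:

```yaml
# Weekly cleanup CronJob (Mondays 06:00). The mounted cleanup.sh
# script and the cleanup-bot service account are illustrative.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-cleanup
  namespace: ops
spec:
  schedule: "0 6 * * 1"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup-bot
          restartPolicy: Never
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command: ["/scripts/cleanup.sh"]
```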

Development Environment Scheduling

Scale dev clusters to zero on nights and weekends:

  • Friday 5 PM: Scale to 0
  • Monday 8 AM: Scale back up
  • Savings: 65% of dev costs
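The schedule above can be implemented as a pair of CronJobs that scale every Deployment in the dev namespace. A minimal sketch of the Friday scale-down (namespace, service account, and times are illustrative; a mirror job on `0 8 * * 1` scales back up):

```yaml
# Friday 5 PM: scale all dev Deployments to zero replicas.
# Pair with an identical CronJob on "0 8 * * 1" using --replicas=1
# (or your normal counts) for the Monday scale-up.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-scale-down
  namespace: dev
spec:
  schedule: "0 17 * * 5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scheduler-bot
          restartPolicy: Never
          containers:
            - name: scale
              image: bitnami/kubectl:latest
              command: ["kubectl", "scale", "deployment", "--all",
                        "--replicas=0", "-n", "dev"]
```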

Real-World Cost Reduction Examples

Case Study 1: SaaS Startup ($30M ARR)

Before:

  • 100 nodes, all on-demand m5.xlarge
  • Over-provisioned resources (3x actual usage)
  • Dev clusters running 24/7
  • Monthly Cost: $15,000

After:

  • 70 spot nodes (70% of fleet)
  • Right-sized resources via VPA
  • Dev clusters auto-shutdown
  • Monthly Cost: $3,200

Results:

  • 79% cost reduction
  • $141,600 annual savings
  • Zero performance degradation
  • <0.1% Spot interruption rate

Case Study 2: E-Commerce Platform ($50M GMV)

Before:

  • Multiple EKS clusters per environment
  • No resource quotas
  • Abandoned test namespaces
  • Monthly Cost: $22,000

After:

  • Consolidated to 2 clusters
  • Namespace budgets and quotas
  • Automated cleanup
  • 60% Spot for web tier
  • Monthly Cost: $6,800

Results:

  • 69% cost reduction
  • $182,400 annual savings
  • Utilization improved from 23% to 71%

Implementation Roadmap

Weeks 1-2: Assessment

Goal: Understand current costs and identify opportunities

Tasks:

  • Install Kubecost or similar tool
  • Generate cost breakdown by namespace/label
  • Identify top 10 cost drivers
  • Calculate potential savings from Spot

Weeks 3-4: Spot Migration

Goal: Move fault-tolerant workloads to Spot

Tasks:

  1. Create Spot node groups (mixed instance types)
  2. Install Node Termination Handler
  3. Deploy PodDisruptionBudgets
  4. Migrate 20% of workloads
  5. Monitor interruption rates
  6. Gradually increase to 60-70%

Weeks 5-6: Right-Sizing

Goal: Optimize resource requests based on actual usage

Tasks:

  1. Install VPA in recommendation mode
  2. Collect 7 days of metrics
  3. Review recommendations
  4. Update manifests in Git
  5. Deploy via GitOps
  6. Monitor performance

Weeks 7-8: Cost Automation

Goal: Embed cost awareness into workflows

Tasks:

  1. Add cost estimation to PRs
  2. Create namespace budgets
  3. Deploy cleanup automation
  4. Set up alerts
  5. Create cost dashboard

Monitoring and Continuous Improvement

Key Metrics

Cost Metrics:

  • Cost per request/customer
  • Monthly cost trend
  • Cost by namespace/team

Efficiency Metrics:

  • CPU utilization (target: 60-75%)
  • Memory utilization (target: 65-80%)
  • Spot interruption rate (target: <2%)
  • Resource waste percentage (target: <15%)

Performance Metrics:

  • P95 response time
  • Error rate
  • Availability (target: >99.9%)

Monthly Review Process

  1. Review total spend vs. budget
  2. Identify top cost drivers
  3. Check for idle resources
  4. Review VPA recommendations
  5. Analyze Spot interruption rates
  6. Update resource requests
  7. Adjust autoscaling policies

Best Practices Summary

  1. Use Spot for 60-70% of compute
  2. Right-size resources based on actual usage
  3. Implement PodDisruptionBudgets
  4. Diversify Spot instance types (4+)
  5. Set namespace budgets and quotas
  6. Automate cleanup of idle resources
  7. Integrate cost estimates in CI/CD
  8. Monitor with Kubecost or similar
  9. Use GitOps for all changes
  10. Schedule dev environment shutdown

Get Expert Help

Optimizing Kubernetes costs requires technical expertise and operational discipline. At ZSoftly, we've helped dozens of companies reduce Kubernetes spend by 60-80% while improving performance.

Our Services:

  • Cost assessment and optimization roadmap
  • Spot instance architecture design
  • GitOps implementation for right-sizing
  • Custom automation development
  • Team training and knowledge transfer

Ready to cut your Kubernetes costs?


Next in series: "Kubernetes FinOps for Growing Companies: From Chaos to Cost Control"