When a fast-growing SaaS startup's entire CI/CD pipeline went down for 24 hours, they learned an expensive lesson: convenience today creates constraints tomorrow. GitHub Actions' built-in artifact storage seemed perfect until it wasn't. Their quick workaround—storing artifacts in GitHub Releases—violated every best practice they knew.
This is the story of how strategic thinking about infrastructure decisions prevented future crises, reduced costs to zero, and gave the team confidence to scale 10x without rearchitecting.
TL;DR
When a fast-growing SaaS startup hit GitHub Actions quota limits, their entire CI/CD pipeline went down for 24 hours. We helped them migrate to Cloudflare R2 with a strategic approach that eliminated costs, removed constraints, and scaled to 10x growth without rearchitecting.
The Real Problem: Architectural decisions made for short-term convenience create long-term constraints. GitHub's built-in storage was easy—until organization-wide quota limits blocked 15 repositories and 12 developers simultaneously.
The Solution: Cloudflare R2 with S3-compatible APIs, zero egress fees, per-project quotas, and automatic lifecycle management. Not just a migration—a strategic framework for evaluating infrastructure decisions.
Key Takeaways:
- Model costs at 10x scale, not current usage - Solutions that work today break tomorrow. Evaluate at future scale to identify constraints before you hit them.
- Optimize for optionality over convenience - Standard APIs, loose coupling, and abstraction layers preserve flexibility as you grow. Avoid vendor lock-in.
- Total cost includes hidden factors - Migration risk, downtime, team cognitive load, and operational overhead often dwarf sticker price.
5-Year ROI: $13,000-16,000 savings compared to GitHub Actions upgrades or AWS S3 alternatives, with zero deployment velocity constraints.
The Hidden Cost of Convenience
GitHub Actions provides built-in artifact storage, and it seems perfect for startups. Upload build artifacts, download them in later jobs, and everything works. No configuration, no additional services, no complexity.
Until Monday morning when it doesn't.
The Cascading Failure
A growing startup shipping 20-30 deploys per day across 15 repositories discovered their quota limit the hard way: every GitHub Actions workflow failed simultaneously. Not code failures. Not test failures. Infrastructure failures.
The impact cascaded immediately:
- 15 repositories blocked
- 12 developers unable to merge PRs
- 3 critical hotfixes stuck in review
- Production deployment cancelled
- 24 hours until resolution
Why "Just Use What's Built In" Stops Working
GitHub Actions artifact storage has a fundamental constraint: organization-wide quota limits. One busy project can consume quota that blocks all other projects. As teams grow and deploy more frequently, artifacts accumulate faster than anyone expects.
Build outputs, test results, coverage reports, screenshots, performance benchmarks—every artifact counts against a shared limit. The free tier seems generous until you're deploying dozens of times daily across multiple projects.
More problematic: quota limits are hard caps. There's no "burst mode" or temporary overage. When you hit the limit, everything stops. Your team's velocity doesn't just slow—it halts completely.
The Real Problem: Architectural Decisions Made for Convenience
This startup made the same choice most teams make: they optimized for short-term convenience over long-term scalability. GitHub Actions' built-in storage removed friction during early development. No additional accounts, no credentials to manage, no infrastructure to provision.
But architectural decisions made for convenience create constraints later. The question isn't whether you'll hit limits—it's when, and whether you'll be prepared.
When Workarounds Become Technical Debt
With deployments blocked and pressure mounting, the team did what teams under pressure do: they hacked together a workaround. They started storing build artifacts in GitHub Releases.
It worked. Builds resumed. Deployments continued. Crisis averted.
Except it wasn't.
Why This Workaround Was Worse Than the Problem
GitHub Releases are designed for software distribution—versioned releases that users download. They're not designed for CI/CD artifact storage. Using them this way violated multiple best practices:
No Lifecycle Management: Releases don't expire automatically. Every build created a new release. After a month, they had 100+ fake releases cluttering their repository. Someone had to manually delete them, and inevitably, nobody did.
Semantic Confusion: New contributors saw dozens of "releases" and couldn't distinguish actual product releases from build artifacts. The repository's release history became meaningless noise.
Fragile by Design: Releases lack versioning strategies for artifacts. No way to track which artifact came from which commit, which job, or which environment. Debugging production issues became guesswork.
Doesn't Scale: At 20-30 deploys per day across 15 repositories, they were creating hundreds of fake releases every week. The approach would break long before they achieved their growth goals.
The Technical Debt Trap
The team knew this was wrong. Every engineer who looked at the workflow felt uncomfortable. But the immediate problem was solved, and there was always "more important" work to do.
This is how technical debt accumulates. Quick fixes under pressure become permanent architecture. The cost compounds over time: confusion for new engineers, fragility during incidents, and eventually, another crisis that forces a real solution.
The question wasn't whether they'd need to fix this eventually—it was whether they'd fix it proactively or reactively after the next outage.
The Strategic Evaluation Framework
When evaluating artifact storage solutions, most teams focus on the wrong questions. They ask: "What's cheapest?" or "What's easiest to set up?" The right question is: "What decision gives us the most optionality as we scale?"
We evaluated three options through this lens:
Option 1: Upgrade GitHub Actions Quota
The path of least resistance: Pay GitHub more, avoid migration work, keep using what's familiar.
Why it fails strategically: You're still bound by organization-wide quotas. As the team grows, you'll hit limits again. Costs scale linearly with usage at $4/GB/month. More importantly, you're deepening vendor lock-in without addressing the architectural constraint.
When it makes sense: Pre-revenue startups with limited eng resources where a few months of runway matters more than long-term architecture.
Option 2: AWS S3
The industry standard: Battle-tested, mature ecosystem, extensive tooling.
The hidden costs: Storage costs are predictable ($0.023/GB/month), but egress fees are where AWS extracts value. Downloading artifacts—which happens constantly in CI/CD—costs $0.09/GB. At scale, egress fees dwarf storage costs.
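To make the imbalance concrete, here is a hypothetical month for a CI-heavy team: 100 GB stored but 500 GB downloaded, since every deploy re-downloads artifacts. The volumes are illustrative; the rates are the published ones cited above.

```shell
#!/bin/sh
# Hypothetical illustration: 100 GB stored vs 500 GB downloaded per month.
# Rates from the text: $0.023/GB-month storage, $0.09/GB egress.
awk 'BEGIN {
  storage = 100 * 0.023   # $2.30/month
  egress  = 500 * 0.09    # $45.00/month
  printf "storage: $%.2f/mo  egress: $%.2f/mo  (egress is %.0fx storage)\n",
         storage, egress, egress / storage
}'
```

At these (assumed) volumes, egress is roughly 20x the storage bill, which is the "hidden cost" in action.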
The strategic trap: You're trading GitHub's quota limits for AWS's egress fees. Both are constraints that tax growth.
When it makes sense: Enterprises already committed to AWS with negotiated discount rates and compliance requirements mandating specific cloud providers.
Option 3: Cloudflare R2
The strategic bet: S3-compatible storage with zero egress fees and generous free tier.
Why it changes the equation: S3 compatibility means standard tooling works out of the box. Zero egress fees mean costs don't scale with deployment frequency. Per-project quotas eliminate organization-wide bottlenecks. The free tier (10GB storage, unlimited egress) covers most startups' needs indefinitely.
The trade-off: Cloudflare R2 is newer than S3 (launched 2022), with a smaller ecosystem. But Cloudflare's scale and reliability record mitigate this concern.
Strategic value: This choice gives the most optionality. Standard S3 APIs mean you can migrate to AWS later if needed. Zero egress fees mean you're not taxed for high deployment velocity. Per-project quotas mean your growth doesn't create new bottlenecks.
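S3 compatibility is easy to see in practice: the standard AWS CLI works unchanged, with only the endpoint swapped. A sketch, using a hypothetical account ID and bucket name (the commands are printed rather than executed here):

```shell
#!/bin/sh
# Sketch of R2's S3 compatibility: the unmodified aws CLI works once it is
# pointed at the R2 endpoint. Account ID and bucket name are hypothetical.
ACCOUNT_ID="abc123"                 # hypothetical Cloudflare account ID
R2_ENDPOINT="https://${ACCOUNT_ID}.r2.cloudflarestorage.com"
BUCKET="ci-artifacts-prod"          # hypothetical bucket name

# Upload and download are plain s3 cp; only --endpoint-url changes.
echo "aws s3 cp dist.tar.gz s3://${BUCKET}/builds/dist.tar.gz --endpoint-url ${R2_ENDPOINT}"
echo "aws s3 cp s3://${BUCKET}/builds/dist.tar.gz dist.tar.gz --endpoint-url ${R2_ENDPOINT}"
```

Because nothing but the endpoint is R2-specific, the same commands would point back at AWS S3 (or any S3-compatible store) later: that is the optionality argument in one line.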
The Five-Year ROI Analysis
We don't just evaluate costs at current scale—we model costs at 10x scale:
Year 1-2 (Current Scale):
- GitHub upgrade: $40/month = $960
- AWS S3: $30-50/month = $720-1,200
- Cloudflare R2: $0/month = $0
Year 3-5 (10x Scale):
- GitHub upgrade: $400/month = $14,400 over 3 years
- AWS S3: $300-500/month (mostly egress) = $10,800-18,000 over 3 years
- Cloudflare R2: $0-50/month = $0-1,800 over 3 years
Total 5-year savings: $13,000-16,000
More valuable than cost: the decision doesn't create new constraints as you scale.
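The totals above can be reproduced with a little shell arithmetic, using the monthly figures from the breakdown (24 months at current scale, then 36 months at 10x). The savings range quoted is against the GitHub option:

```shell
#!/bin/sh
# Reproducing the 5-year arithmetic from the breakdown above.
github=$(( 40 * 24 + 400 * 36 ))      # GitHub upgrade
aws_low=$(( 30 * 24 + 300 * 36 ))     # AWS S3, low estimate
aws_high=$(( 50 * 24 + 500 * 36 ))    # AWS S3, high estimate
r2_high=$(( 0 * 24 + 50 * 36 ))       # Cloudflare R2, high estimate

echo "GitHub 5yr: \$${github}"                       # $15360
echo "AWS 5yr:    \$${aws_low}-${aws_high}"          # $11520-19200
echo "R2 5yr:     \$0-${r2_high}"                    # $0-1800
echo "Savings vs GitHub: \$$(( github - r2_high ))-${github}"   # $13560-15360
```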
Why Most Teams Choose Wrong
Before diving into implementation, examine why most teams make poor architectural decisions about artifact storage:
Mistake 1: Optimizing for Setup Time Instead of Total Cost of Ownership
Teams choose solutions based on "how fast can we get this working?" instead of "what are the long-term implications?" GitHub Actions' built-in storage wins because setup time is zero. But the total cost includes:
- Migration cost when you inevitably outgrow it
- Opportunity cost of deployment velocity constrained by quotas
- Team cognitive load managing workarounds
- Risk cost of downtime when limits are hit
Mistake 2: Ignoring Growth Trajectories
Most teams evaluate solutions at current scale. A $40/month GitHub upgrade seems reasonable for a 5-person team. But what happens when you're 15 people? 50 people? The cost scales linearly, but worse, the architectural constraints become harder to fix as you grow.
Strategic architecture asks: "What decision do we not want to revisit in 18 months?"
Mistake 3: Underestimating Hidden Costs
AWS S3 appears cheap until you calculate egress fees. GitHub Actions appears convenient until you hit quota limits during a critical deployment. The hidden costs—downtime, migration work, team frustration—dwarf the visible costs.
The most expensive infrastructure decisions are the ones that look cheap initially.
Mistake 4: Confusing Familiarity with Suitability
"We already use AWS" becomes justification for using S3, even when egress fees make it the wrong choice. "We already use GitHub" justifies using their artifact storage, even when quota limits create constraints.
Familiarity reduces perceived risk, but it doesn't change actual suitability for your use case.
The Four-Week Strategic Engagement
This wasn't a migration project. It was a strategic engagement to understand their business, evaluate their options, and build infrastructure that would scale with them.
Week 1: Understanding the Business, Not Just the Technology
We started by asking questions most engineering consultants skip:
- Where do you want to be in 12 months? 24 months?
- What's your planned team growth?
- What's blocking your deployment velocity today?
- What keeps your engineering leadership up at night?
Key discoveries from stakeholder interviews:
From Engineering: Quota limits were a symptom, not the disease. The real problem was that infrastructure decisions were made reactively, under pressure, without strategic thinking.
From Product: Deployment velocity directly impacted feature velocity. Every hour builds were blocked cost them competitive positioning.
From Leadership: They were planning 3x team growth in 12 months. Any solution needed to scale without rearchitecture.
The insight that changed everything: They didn't need artifact storage—they needed confidence that infrastructure wouldn't constrain growth.
Week 2: Architecture Principles Over Implementation Details
Rather than jumping into implementation, we focused on establishing architectural principles that would guide all decisions:
Principle 1: Separation of Concerns
Development and production have different needs. Dev builds churn fast (1-day retention). Production builds need longer retention for debugging (7 days). Environment-specific buckets acknowledge this reality.
Principle 2: Infrastructure as Code
Every resource is defined in version-controlled Terraform. Changes are code-reviewed. Configuration drift becomes impossible. New projects replicate the pattern in minutes, not days.
Principle 3: Abstraction at the Right Level
Custom composite GitHub Actions provide a consistent interface across all repositories. Change the underlying storage provider? Update one composite action, not 15 workflow files.
Principle 4: Lifecycle Management Built In
Artifacts expire automatically. No manual cleanup. No runbooks for artifact management. Set-and-forget operation.
The architecture review that changed the plan: We initially planned a single bucket with tagging to distinguish environments. During architecture review, we realized this violated Principle 1. Separate buckets gave each environment its own quota, its own lifecycle rules, and eliminated crosstalk between dev and prod deployments.
Week 3: Implementation with Built-In Quality
The Terraform module approach: We built reusable Terraform modules that could provision R2 buckets with lifecycle policies for any project. The startup could replicate this pattern for future projects without reinventing anything.
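A sketch of the provisioning side, assuming the Cloudflare Terraform provider's `cloudflare_r2_bucket` resource; the bucket names are illustrative, and in a real module they would be parameterized inputs:

```hcl
# Illustrative sketch: one bucket per environment, provisioned from a shared
# module. Resource and attribute names assume the Cloudflare Terraform provider.
variable "cloudflare_account_id" {
  type = string
}

resource "cloudflare_r2_bucket" "artifacts_dev" {
  account_id = var.cloudflare_account_id
  name       = "ci-artifacts-dev"   # 1-day retention, set via lifecycle rules
}

resource "cloudflare_r2_bucket" "artifacts_prod" {
  account_id = var.cloudflare_account_id
  name       = "ci-artifacts-prod"  # 7-day retention, set via lifecycle rules
}
```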
The composite action pattern: GitHub Actions composite actions provided reusable upload/download logic. Repository workflows call these actions without knowing the underlying implementation. If they migrate from R2 to another provider later, they change the composite action—not every workflow.
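As a sketch, a composite action wrapping the upload might look like the following; the action name, inputs, and `R2_ENDPOINT` variable are illustrative, not the team's actual implementation:

```yaml
# .github/actions/r2-upload/action.yml (hypothetical path and inputs)
name: r2-upload
description: Upload a build artifact to Cloudflare R2 via the S3 API
inputs:
  source:
    required: true
  key:
    required: true
  bucket:
    required: true
runs:
  using: composite
  steps:
    - shell: bash
      run: |
        aws s3 cp "${{ inputs.source }}" "s3://${{ inputs.bucket }}/${{ inputs.key }}" \
          --endpoint-url "$R2_ENDPOINT"   # R2_ENDPOINT supplied by the caller's env
```

Workflows depend only on this interface, which is exactly what makes a later provider swap a one-file change.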
The critical testing phase: We load-tested with 50 concurrent uploads before touching production. Tested failure scenarios: network errors, invalid credentials, partial uploads. Built retry logic and comprehensive error handling.
What we didn't do: Add complexity for hypothetical future requirements. No multi-region replication. No custom backup systems. No over-engineered abstractions. Build what's needed today with patterns that extend naturally tomorrow.
Week 4: Knowledge Transfer Over Handoff
The worst consulting engagements deliver solutions that only the consultants understand. The best engagements build team capability.
Documentation that enables: Architecture diagrams explaining the "why" behind decisions. Setup guides for replicating patterns in new projects. Runbooks for the few manual operations (credential rotation). Troubleshooting guides for common errors.
Training that sticks: Two-hour architecture workshop. Hands-on exercise: each engineer deploys R2 to a test project. Q&A session addressing team concerns. Office hours during the migration for live support.
The production cutover: Zero downtime. Rolled out repository-by-repository, starting with low-risk projects. Validated each migration before proceeding. Production workflows migrated last, after the pattern was proven.
Architecture Principles That Scale
The architecture we designed prioritizes long-term maintainability over short-term convenience:
[Figure: production CI/CD pipeline showing parallel quality checks (Lint, Type Check, Security Audit, Unit Tests) followed by Build and Smoke Tests, all stages orchestrated with Cloudflare R2 artifact storage]
GitHub Actions Workflow
│
├─────> Build Job
│ │
│ ├─> Build Artifacts
│ │
│ └─> r2-upload Action
│ │
│ ▼
│ ┌──────────────────┐
│ │ Cloudflare R2 │
│ │ │
│ │ Dev Bucket │
│ │ (1-day expire) │
│ │ │
│ │ Prod Bucket │
│ │ (7-day expire) │
│ └──────────────────┘
│ │
│ │ (Automatic cleanup)
│ ▼
│ [Lifecycle policies
│ delete expired
│ artifacts daily]
│
└─────> Deploy Job
│
└─> r2-download Action
│
└─> Extract & Deploy
Why This Architecture Works
Single Responsibility:
- Each component does one thing well.
- Composite actions handle artifact transfer.
- Terraform manages infrastructure.
- Lifecycle policies handle cleanup.
Loose Coupling:
- Repository workflows don't know about R2.
- They call generic upload/download actions.
- Swap the underlying storage? Change the composite actions.
Environment Isolation:
- Dev and prod buckets have separate quotas, separate lifecycle rules.
- Dev problems don't affect prod.
- Prod retention doesn't bloat dev costs.
Automated Lifecycle: Artifacts expire automatically without manual intervention. Storage costs stay predictable regardless of deployment frequency.
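As a concrete sketch, the retention rules can be expressed as an S3-style lifecycle configuration per bucket, since R2 exposes lifecycle rules through its S3-compatible API. The dev bucket's 1-day expiry is shown; the rule ID is hypothetical:

```json
{
  "Rules": [
    {
      "ID": "expire-dev-artifacts",
      "Status": "Enabled",
      "Filter": {},
      "Expiration": { "Days": 1 }
    }
  ]
}
```

The prod bucket uses the same shape with `"Days": 7`, applied with something like `aws s3api put-bucket-lifecycle-configuration --bucket ci-artifacts-dev --lifecycle-configuration file://lifecycle.json --endpoint-url <R2 endpoint>` (bucket name hypothetical).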
The Principles That Generalize
These architectural decisions aren't specific to R2. They apply to any artifact storage migration:
Model costs at 10x scale: Don't evaluate solutions at current usage. Model what happens when you're 10x larger. Does the cost scale linearly? Do architectural constraints appear?
Optimize for future migrations: Use standard APIs (S3-compatible). Build abstraction layers (composite actions). Make it easy to move again if needed.
Separate environments properly: Dev and prod have different needs. Acknowledge this in infrastructure, not just workflow configuration.
Automate lifecycle from day one: If cleanup requires manual intervention, it won't happen consistently. Build automatic expiration from the beginning.
What We'd Do Differently
No project goes perfectly. Here's what we learned:
What Worked Better Than Expected
Discovery week paid massive dividends: Taking a full week to understand the business seemed slow initially. In retrospect, it prevented three architecture mistakes we would have made jumping straight to implementation.
S3 compatibility was more valuable than we anticipated: Standard AWS CLI meant zero custom tooling. Migration was straightforward because we weren't fighting proprietary APIs.
Composite actions eliminated duplication at scale: 15 repositories benefited from one abstraction. When we added retry logic later, we updated it once, not 15 times.
Stakeholder presentations got buy-in: Presenting recommendations to leadership before implementation ensured support during the migration. No surprises, no pushback.
Challenges We Underestimated
Artifact key management was subtle: Linking upload jobs to download jobs required storing artifact keys in short-lived GitHub artifacts. Not complex, but easy to overlook in initial design.
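A sketch of that handoff: the build job writes the R2 object key to a tiny, short-lived GitHub artifact, which the deploy job reads back. All names and paths here are illustrative:

```yaml
# Build job: record which R2 key this run uploaded (illustrative names).
- shell: bash
  run: |
    echo "myapp/${GITHUB_SHA}/${GITHUB_RUN_ID}.tar.gz" > artifact-key.txt
- uses: actions/upload-artifact@v4
  with:
    name: artifact-key
    path: artifact-key.txt
    retention-days: 1        # the key file itself is short-lived

# Deploy job: recover the key, then fetch the real artifact from R2.
- uses: actions/download-artifact@v4
  with:
    name: artifact-key
- shell: bash
  run: |
    KEY="$(cat artifact-key.txt)"
    aws s3 cp "s3://ci-artifacts-prod/${KEY}" dist.tar.gz --endpoint-url "$R2_ENDPOINT"
```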
R2 API token documentation was sparse: We created detailed guides for the team because Cloudflare's docs assumed more context than new users had.
Network hiccups caused occasional failures: Added exponential backoff retry logic in v2. Should have included this from day one.
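The retry behavior can be sketched as a small shell wrapper with exponential backoff; the function name and delays are illustrative, not the team's actual v2 code:

```shell
#!/bin/sh
# Retry a command up to N times, doubling the wait between attempts.
retry() {
  max=$1; shift
  attempt=1
  delay=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))     # exponential backoff: 1s, 2s, 4s, ...
    attempt=$(( attempt + 1 ))
  done
}

# Example: retry a flaky upload up to 5 times (command shown, not run here).
# retry 5 aws s3 cp dist.tar.gz "s3://ci-artifacts-prod/dist.tar.gz" --endpoint-url "$R2_ENDPOINT"
```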
Team training took longer than expected: Engineers needed more hands-on time to internalize the patterns. Budget 2x the time you think training needs.
What We'd Do Differently Next Time
Add monitoring earlier: We built a usage dashboard after launch. Should have been part of initial implementation.
Document failure modes more thoroughly: Initial runbook missed edge cases we discovered during production operation.
Test with more repository types: Tested with typical Node.js projects. Later discovered nuances with monorepo builds that required workflow adjustments.
The Decision Framework You Can Use
When evaluating infrastructure decisions, use this framework:
Question 1: What Happens at 10x Scale?
Model costs, performance, and operational overhead at 10x your current scale. If the solution breaks or becomes prohibitively expensive, it's the wrong choice.
Question 2: What Constraints Does This Create?
Every architectural decision creates constraints. GitHub Actions creates org-wide quota limits. AWS S3 creates egress fee pressure. Identify constraints upfront and decide if you can live with them long-term.
Question 3: How Easy Is It to Change Later?
Prefer solutions using standard protocols (S3-compatible APIs) over proprietary APIs. Build abstraction layers that let you swap implementations. Assume you'll need to migrate eventually.
Question 4: What's the Total Cost of Ownership?
Include migration costs, operational overhead, team cognitive load, and risk of downtime. The cheapest solution by sticker price often has the highest total cost.
Question 5: Does This Give Us More Optionality or Less?
Good architectural decisions increase optionality: more flexibility, more scaling paths, more migration options. Bad decisions decrease optionality: vendor lock-in, hard limits, expensive migrations.
Quick Start Checklist
- Evaluate current artifact storage costs and constraints
- Model costs at 10x scale for all storage options
- Create separate Cloudflare R2 buckets for dev and prod
- Set up lifecycle policies (1-day dev, 7-day prod retention)
- Create reusable Terraform modules for bucket provisioning
- Build composite GitHub Actions for upload/download
- Configure exponential backoff retry logic
- Test with 50+ concurrent uploads before production
- Implement monitoring and alerting for storage costs
- Document migration process for future repositories
Conclusion
The startup went from 24-hour CI/CD outage to a production-grade artifact storage system that costs $0/month and scales to 10x growth without rearchitecting.
But the real transformation wasn't technical—it was cultural. They learned to make architectural decisions strategically instead of reactively. To evaluate solutions at future scale, not just current needs. To optimize for optionality instead of convenience.
The three lessons that generalize:
- Convenience creates constraints: Short-term convenience (GitHub's built-in storage) creates long-term constraints (quota limits). Strategic architecture inverts this.
- Model costs at 10x scale: Solutions that work today break tomorrow. Evaluate decisions at 10x your current scale to identify architectural limits before you hit them.
- Optimize for optionality: The best architectural decisions give you more options later, not fewer. Standard APIs, loose coupling, and abstraction layers preserve flexibility as you grow.
See the business impact in our case study →
Facing infrastructure constraints that limit your growth? We help startups build production-grade solutions that scale. Let's talk about your challenges →
