After years of "kubectl apply" cowboys and fragile CI/CD pipelines pushing directly to production, we discovered GitOps. It transformed how we deploy to Kubernetes at scale. Here's what GitOps really means in practice, why it works, and the challenges nobody talks about.
What GitOps Actually Is (Without the Hype)
GitOps is simple: your Git repository becomes the single source of truth for what should be running in your Kubernetes clusters. Instead of CI pipelines pushing changes to clusters, specialized operators like Flux CD pull changes from Git and ensure your cluster matches what's declared.
Think of it as Infrastructure as Code, but with continuous enforcement. If someone manually changes something in the cluster, GitOps automatically reverts it to match Git. No more configuration drift, no more "who changed what in production?"
Our GitOps Architecture with Flux CD
Here's how we structure GitOps for our enterprise Kubernetes deployments:
# Application repository (e.g., atlas-resources-api)
.
├── src/                  # Application source code
├── helm/
│   ├── chart/            # Helm chart templates
│   └── values/
│       ├── dev.yaml      # Development values
│       ├── staging.yaml  # Staging values
│       └── prod.yaml     # Production values
└── .github/
    └── workflows/
        └── build.yaml    # CI pipeline
# GitOps repository (e.g., platform-gitops)
.
├── clusters/
│   ├── prod-eu-west/
│   │   ├── flux-system/  # Flux components
│   │   └── apps/         # Application deployments
│   └── staging-eu-west/
│       ├── flux-system/
│       └── apps/
└── infrastructure/
    ├── sources/          # Helm repositories
    └── configs/          # Shared configurations
The Deployment Flow
Here's what happens when a developer pushes code:
- Developer pushes to main branch: Code triggers CI pipeline
- CI builds and pushes container: Image tagged with Git SHA goes to registry
- CI updates GitOps repo: Updates image tag in Helm values or HelmRelease
- Flux detects change: Polls GitOps repo every minute (configurable)
- Flux applies changes: Updates cluster to match desired state
- Flux monitors health: Ensures deployment succeeds, can trigger alerts
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: atlas-resources-api
  namespace: flux-system
spec:
  interval: 5m
  targetNamespace: atlas
  chart:
    spec:
      chart: ./helm/chart
      sourceRef:
        kind: GitRepository
        name: atlas-resources-api
      interval: 1m
  values:
    image:
      repository: harbor.company.io/atlas/resources-api
      tag: ${GIT_SHA}  # Updated by CI
    replicaCount: 3
    ingress:
      enabled: true
      hostname: api.atlas.company.io
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "2Gi"
        cpu: "1000m"
  # Automated rollback on failure
  upgrade:
    remediation:
      retries: 3
      remediateLastFailure: true
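The "CI updates GitOps repo" step above is usually just a small job that bumps the image tag and commits. A minimal sketch with GitHub Actions, assuming yq v4 on the runner and a write-capable token; the file path, token secret, and bot identity are illustrative, not our exact setup:
# Sketch of the CI step that updates the GitOps repository
update-gitops:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        repository: your-org/platform-gitops
        token: ${{ secrets.GITOPS_TOKEN }}   # PAT with write access (assumption)
    - name: Bump image tag to the new Git SHA
      env:
        GIT_SHA: ${{ github.sha }}
      run: |
        yq -i '.spec.values.image.tag = strenv(GIT_SHA)' \
          clusters/prod-eu-west/apps/atlas-resources-api.yaml
        git config user.name "ci-bot"
        git config user.email "ci-bot@company.io"
        git commit -am "atlas-resources-api: deploy ${GIT_SHA}"
        git push
Once this commit lands, Flux picks it up on its next poll and rolls the new image out.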
The Real Benefits We've Experienced
Complete Audit Trail
Every change to production is a Git commit. Need to know who deployed what at 3 AM last Tuesday? It's in the Git history. Need to understand why a service was scaled up? Check the commit message. This has saved us countless hours during incident investigations.
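In practice, incident questions become ordinary Git queries against the GitOps repo; for example (paths follow the layout shown above):
# Who touched production apps in the last 48 hours, and what changed?
git log --since="48 hours ago" --stat -- clusters/prod-eu-west/apps/
# Inspect the exact change behind a given deploy
git show <commit-sha>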
Rollbacks That Actually Work
Rolling back is literally a git revert. No custom scripts, no remembering the previous version, no hoping the rollback procedure still works. We've reduced rollback time from 15-20 minutes to under 2 minutes.
# Instant rollback to previous version
git revert HEAD --no-edit
git push
# Flux automatically applies the revert within minutes
Self-Healing Infrastructure
Someone manually scaled a deployment? Flux scales it back. Accidentally deleted a ConfigMap? Flux recreates it. This drift prevention has eliminated entire categories of production issues.
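The enforcement comes from Flux's reconciliation loop: on every interval it re-applies what Git declares and prunes what Git no longer declares. A minimal sketch of a Kustomization doing this for the apps path; the API version, interval, and path depend on your Flux release and repo layout:
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m                     # re-apply desired state on this cadence
  prune: true                       # remove cluster objects deleted from Git
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./clusters/prod-eu-west/apps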
Developer Experience
Developers don't need kubectl access. They don't need to learn Kubernetes intricacies. They push code, CI builds it, and GitOps deploys it. The abstraction is clean and familiar.
The Challenges Nobody Mentions
Secret Management Complexity
You can't store secrets in Git (obviously). This means integrating tools like Sealed Secrets, SOPS, or external secret operators. We use Sealed Secrets, but it adds complexity:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: database-credentials
  namespace: atlas
spec:
  encryptedData:
    username: AgBvA8kOp5...  # Encrypted value
    password: AgCdX9mRt2...  # Encrypted value
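Producing those encrypted values is an extra step in every secret change. Roughly, using the kubeseal CLI against the cluster's controller (secret name and keys are illustrative):
# Build the Secret locally, encrypt it with the cluster's public key, commit only the sealed file
kubectl create secret generic database-credentials \
  --namespace atlas \
  --from-literal=username=atlas \
  --from-literal=password='s3cr3t' \
  --dry-run=client -o yaml \
  | kubeseal --format yaml > database-credentials-sealed.yaml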
The Git Bottleneck
When your Git repository is down, deployments stop. We've had GitHub outages block deployments for hours. You need contingency plans, like break-glass procedures for emergency changes.
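Our break-glass procedure boils down to "pause Flux, fix the cluster directly, backfill Git later"; a sketch (the resource name is illustrative):
# Stop Flux from reverting manual changes while Git is unreachable
flux suspend helmrelease atlas-resources-api -n flux-system
# ...apply the emergency change with kubectl or helm directly...
# When Git is back, commit the change, then let Flux take over again
flux resume helmrelease atlas-resources-api -n flux-system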
Debugging Becomes Indirect
When something goes wrong, you're debugging Flux logs rather than your deployment directly. The abstraction layer helps until it doesn't. Common issues we've faced (the triage commands after this list are where we usually start):
- Flux gets stuck reconciling due to resource conflicts
- Image pull errors aren't immediately obvious
- Helm chart errors can be cryptic in Flux logs
- Dependency ordering issues with CRDs
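When a release is stuck, the useful error usually lives in the HelmRelease status rather than in the pods. A typical triage sequence (resource names are illustrative):
# Read the reconciliation error Flux recorded on the release
kubectl -n flux-system describe helmrelease atlas-resources-api
# Refresh the Git source and re-run the reconciliation
flux reconcile helmrelease atlas-resources-api -n flux-system --with-source
# Filter controller logs down to the one release
flux logs --kind=HelmRelease --name=atlas-resources-api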
Initial Learning Curve
Teams comfortable with traditional CI/CD need time to adjust. "Why can't I just kubectl apply?" is a common question. The mental model shift from push to pull takes time.
GitOps vs Traditional CI/CD: The Real Comparison
| Aspect | Traditional CI/CD | GitOps |
| --- | --- | --- |
| Deployment Method | CI pushes to cluster | Operator pulls from Git |
| Cluster Credentials | Stored in CI system | Never leave cluster |
| Rollback Speed | 10-30 minutes | 1-2 minutes |
| Audit Trail | CI logs (if retained) | Complete Git history |
| Drift Prevention | Manual or scripted | Automatic |
| Multi-cluster | Complex pipeline logic | Different Git branches/paths |
Practical Flux CD Implementation
Bootstrap Flux in Your Cluster
# Install Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash
# Check prerequisites
flux check --pre
# Bootstrap Flux with GitHub
flux bootstrap github \
  --owner=your-org \
  --repository=platform-gitops \
  --branch=main \
  --path=clusters/prod \
  --personal
Structure Your Helm Releases
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: ingress-nginx
  namespace: flux-system
spec:
  interval: 1h
  url: https://kubernetes.github.io/ingress-nginx
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: flux-system
spec:
  interval: 5m
  chart:
    spec:
      chart: ingress-nginx
      version: '4.x'
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
  values:
    controller:
      service:
        type: LoadBalancer
Monitor Flux Operations
# Check Flux component status
flux get all
# Watch Flux logs
flux logs --follow
# Get detailed reconciliation status
flux get helmreleases -A
# Force reconciliation (useful for testing)
flux reconcile source git flux-system
When GitOps Makes Sense (And When It Doesn't)
Perfect for GitOps
- ✓ Multi-cluster deployments requiring consistency
- ✓ Teams needing strong audit and compliance requirements
- ✓ Environments where configuration drift is problematic
- ✓ Organizations with mature Git workflows
- ✓ Stateless applications and services
Think Twice About GitOps
- ✗ Rapid prototyping or experimental environments
- ✗ Stateful applications requiring complex migrations
- ✗ Teams without Kubernetes expertise
- ✗ Environments requiring sub-minute deployment times
- ✗ Applications with frequently changing secrets
Best Practices from Production
1. Separate Application and Infrastructure Repos
Keep application code separate from Kubernetes manifests. This allows different teams to own different parts and reduces merge conflicts.
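With that split, the GitOps repo simply declares a GitRepository source per application repo, which is what the HelmRelease earlier references. A minimal sketch; the URL, branch, and credentials secret are assumptions:
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: atlas-resources-api
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/your-org/atlas-resources-api
  ref:
    branch: main
  secretRef:
    name: github-credentials   # read-only deploy credentials (assumption)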
2. Use Kustomize or Helm for Templating
Don't store raw YAML for every environment. Use Helm charts with environment-specific values or Kustomize overlays to reduce duplication.
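If you go the Kustomize route instead of Helm values, each environment overlay can stay tiny; a sketch with illustrative paths and patch values:
# clusters/prod-eu-west/apps/atlas-resources-api/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base/atlas-resources-api
patches:
  - target:
      kind: Deployment
      name: atlas-resources-api
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3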
3. Implement Progressive Delivery
Combine GitOps with Flagger for canary deployments. Flux deploys, Flagger gradually shifts traffic:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atlas-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atlas-api
  progressDeadlineSeconds: 60
  service:
    port: 8080
  analysis:
    interval: 30s
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 30s
4. Set Up Alerts
Configure Flux to send alerts to Slack or PagerDuty when reconciliation fails:
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Alert
metadata:
  name: on-call-webapp
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error
  eventSources:
    - kind: HelmRelease
      namespace: default
      name: '*'
    - kind: Kustomization
      namespace: flux-system
      name: '*'
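The Alert refers to a Provider that holds the actual webhook configuration; a minimal Slack sketch, where the channel and secret name are assumptions:
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: on-call-alerts      # assumption
  secretRef:
    name: slack-webhook-url    # Secret whose `address` key holds the incoming webhook URL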
The Verdict: Is GitOps Worth It?
After two years of GitOps in production across multiple clusters and teams, my answer is: absolutely yes, with caveats.
GitOps has eliminated entire categories of problems. No more configuration drift, no more mysterious production changes, no more failed rollbacks. The audit trail alone has justified the investment during compliance audits.
But it's not free. You need to invest in tooling, training, and new processes. Secret management becomes more complex. Debugging requires understanding an additional abstraction layer. And you're adding a dependency on Git availability.
For enterprises running Kubernetes at scale, GitOps is becoming the de facto standard. For smaller teams or simpler deployments, the overhead might not be worth it. Evaluate your specific needs, but don't dismiss GitOps as just another buzzword. It's a fundamental shift in how we think about deployment, and for many organizations, it's the right shift.
If you're considering GitOps for your organization, also check out our article on monorepo architectures, which explores another critical aspect of modern DevOps infrastructure organization.