Open to DevOps Engineer · SRE · Platform Engineer roles - US remote or hybrid
Platform Engineer & AI Infrastructure
Five years building cloud infrastructure, Kubernetes platforms, and ML pipelines on AWS.
I build and operate production infrastructure for engineering and ML teams: Kubernetes clusters on EKS, GitOps delivery with ArgoCD and Terraform, CI/CD pipelines, and the platform layer for running ML training jobs and inference workloads.
5+
Years in Production
40%
Cloud Cost Reduction
Core Expertise
Platform Engineering
Design and operate internal developer platforms on Kubernetes and AWS.
GitOps-first delivery with ArgoCD and Helm. Self-service infrastructure
provisioning via Terraform modules.
AI & ML Infrastructure
Kubernetes-native ML platforms: GPU node pools, Kubeflow and Argo Workflows
for training pipelines, autoscaling inference endpoints, MLflow model registry,
and drift monitoring.
Security-Hardened CI/CD & SRE
Multi-stage pipelines with container scanning (Trivy), OPA policy gates, and
secrets management via HashiCorp Vault. SLO/SLA engineering backed by
Prometheus, Grafana, and OpenTelemetry. Least-privilege IAM defined in
Terraform modules, with Kubernetes RBAC enforcement.
Projects
Multi-Tenant EKS Developer Platform
Problem
40+ engineers sharing one EC2-based environment - no isolation, no cost visibility,
3-4 hour deployment cycles with frequent cross-team conflicts.
What I Built
Migrated to EKS with Karpenter for cost-optimized spot/on-demand node provisioning.
Built a Helm chart library for standardized workload packaging. Deployed ArgoCD for
GitOps delivery with environment promotion gates. OPA Gatekeeper policies enforce
namespace isolation and security guardrails. Terraform modules enable self-service
environment provisioning in under 5 minutes.
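A minimal sketch of the kind of namespace guardrail described above. Gatekeeper policies are normally written in Rego; Python is used here only to illustrate the logic. The required label names (`team`, `cost-center`) and the function name are hypothetical and not taken from the actual platform.

```python
from typing import Any

# Hypothetical label set; the real required labels are not stated in this document.
REQUIRED_LABELS = ("team", "cost-center")


def namespace_violations(manifest: dict[str, Any]) -> list[str]:
    """Return guardrail violations for a Kubernetes Namespace manifest."""
    if manifest.get("kind") != "Namespace":
        return []  # this check only applies to Namespace objects
    labels = manifest.get("metadata", {}).get("labels") or {}
    # A label counts as missing if it is absent or empty.
    return [
        f"missing required label: {key}"
        for key in REQUIRED_LABELS
        if not labels.get(key)
    ]


if __name__ == "__main__":
    example = {
        "kind": "Namespace",
        "metadata": {"name": "payments", "labels": {"team": "payments"}},
    }
    for violation in namespace_violations(example):
        print(violation)
```

An admission policy would reject the example namespace until the missing label is added, which is what makes per-team isolation and cost attribution enforceable rather than advisory.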
22 min
avg deploy time (was 4hr)
32%
cluster cost reduction
ML Training & Inference Platform on Kubernetes
Problem
Data science team deploying models manually via SSH. No versioning, no rollback,
no monitoring - production model degradation going undetected for days.
What I Built
Kubeflow pipelines for reproducible training jobs on GPU node pools with
scale-to-zero. MLflow for experiment tracking and model registry with
staging/canary promotion gates. Automated CI/CD pipeline for model deployment.
Grafana dashboards for inference latency, throughput, and drift detection.
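One common way to quantify the drift mentioned above is the Population Stability Index (PSI), which compares a feature's binned distribution at training time against live traffic. The sketch below is illustrative only: the document does not say which drift metric the platform uses, and the function name, bin count, and smoothing constant are assumptions.

```python
import math
from collections.abc import Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a baseline sample and a live sample, using equal-width bins."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        eps = 1e-6  # assumed smoothing floor so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]
    shifted = [0.5 + i / 200 for i in range(100)]
    print(population_stability_index(baseline, baseline))
    print(population_stability_index(baseline, shifted))
```

A score near zero means the live distribution matches the baseline; larger values indicate drift. The alert threshold is a tuning choice per model and feature.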
85%
faster model deployment
$0
idle training cost
Cloud Cost Engineering Program - $18k/mo Saved
Problem
$45k/month AWS bill growing 20% month-over-month. No per-team cost visibility.
$12k/month in idle and oversized resources identified during initial audit.
What I Built
Integrated Infracost into CI/CD for pre-merge cost estimates - blocking costly
changes before they merge. Deployed Kubecost for per-namespace attribution piped
into Grafana team dashboards. Automated Reserved Instance and Savings Plan
purchasing. Rightsized RDS instances, moved all non-prod environments to a
spot fleet, and applied S3 lifecycle policies across all buckets.
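The pre-merge cost gate described above can be sketched as a simple threshold check on the estimated monthly cost change. This is a hedged illustration: `CostEstimate` is a hypothetical data shape and not Infracost's actual output schema, and the dollar and percentage thresholds are placeholder values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    """Hypothetical pre-merge estimate; not Infracost's real output format."""

    baseline_monthly_usd: float
    proposed_monthly_usd: float


def should_block(
    estimate: CostEstimate,
    max_increase_usd: float = 500.0,  # assumed absolute limit
    max_increase_pct: float = 10.0,  # assumed relative limit
) -> bool:
    """Block a change whose monthly cost increase exceeds either limit."""
    delta = estimate.proposed_monthly_usd - estimate.baseline_monthly_usd
    if delta <= 0:
        return False  # savings and cost-neutral changes always pass
    if delta > max_increase_usd:
        return True
    if estimate.baseline_monthly_usd > 0:
        return delta / estimate.baseline_monthly_usd * 100 > max_increase_pct
    return False  # new resources under the absolute limit pass


if __name__ == "__main__":
    print(should_block(CostEstimate(1000.0, 1050.0)))
    print(should_block(CostEstimate(1000.0, 1200.0)))
```

Checking both an absolute and a relative limit keeps small services from doubling unnoticed while still catching large absolute increases on expensive ones.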
40%
monthly AWS cost reduction
$18k
saved per month
3+
costly PRs blocked/week
Technical Depth
Cloud Platform
EKS, ECS, Lambda, RDS, ElastiCache, VPC, IAM, SageMaker, CloudWatch, Cost Explorer, S3, CloudFront, Route 53, GuardDuty, Organizations
Kubernetes
Helm, ArgoCD, Karpenter, KEDA, Istio, Cilium, OPA/Gatekeeper, Falco, Trivy, RBAC, network policies, multi-tenant cluster design
IaC & CI/CD
Terraform, Terragrunt, Ansible, Pulumi, GitHub Actions, Jenkins, ArgoCD, Infracost
ML Platform
Kubeflow, MLflow, SageMaker Pipelines, Argo Workflows, GPU node scheduling, model registry, drift monitoring
Observability
Prometheus, Grafana, OpenTelemetry, Datadog, PagerDuty, Jaeger, distributed tracing, SLO/SLA design
Security
HashiCorp Vault, SOPS, AWS Secrets Manager, IAM hardening, Trivy, Falco, OPA Gatekeeper, SOC 2 controls