AI infrastructure and ML platform scale-out
Led platform architecture and delivery systems for AI/ML workloads serving global R&D teams, with an emphasis on throughput, reproducibility, and production readiness.
- Scaled a global AI/HPC platform spanning NVIDIA A100/H100 GPUs, EKS, FSx, and S3-backed storage.
- Standardized Terraform-based IaC and automated compute provisioning to improve platform throughput by 23%.
- Operationalized internal models through SageMaker and Bedrock, reducing time-to-production by 26% across 7+ teams.
Stack: AWS · EKS · FSx · S3 · SageMaker · Bedrock · Terraform · HPC