TL;DR
Member of Technical Staff, ML Infra (AI): Design, build, and maintain the compute platform that powers all AI research at the SF AI Lab, with an emphasis on managing large-scale GPU pools and ensuring optimal resource utilization. The role focuses on modernizing infrastructure using Infrastructure as Code (AWS CDK), optimizing system performance across multiple GPU architectures, and building distributed systems for multi-tenant research environments.
Location: This role is based at the SF AI Lab. Compensation reflects the cost of labor across several US geographic markets, and references to Los Angeles County and San Francisco ordinances indicate a US-based position.
Salary: $150,000–$325,000/year
Company
hirify.global is a small, talent-dense team within Amazon's AGI Autonomy organization, focused on advancing Artificial General Intelligence (AGI) systems.
What you will do
- Design, build, and maintain the compute platform for AI research, managing large-scale GPU pools.
- Partner directly with research scientists to develop infrastructure solutions that accelerate research velocity.
- Implement and maintain robust security controls and hardening measures while enabling researcher productivity.
- Modernize and scale existing infrastructure by converting manual deployments into reproducible Infrastructure as Code using AWS CDK.
- Optimize system performance across multiple GPU architectures and ensure maximum computational efficiency.
- Design and implement monitoring, orchestration, and automation solutions for GPU workloads at scale.
- Build distributed systems infrastructure, including Kubernetes-based orchestration, to support multi-tenant research environments.
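For context on the multi-tenant GPU-pool management described above, a minimal sketch of the kind of allocation logic such a platform might involve is shown below. All names (`GpuPool`, tenant labels, pool sizes) are hypothetical illustrations, not part of the role description or any actual system:

```python
from dataclasses import dataclass, field

@dataclass
class GpuPool:
    """A pool of identical GPUs shared by multiple research tenants."""
    name: str
    total_gpus: int
    allocated: dict = field(default_factory=dict)  # tenant -> GPU count

    @property
    def free_gpus(self) -> int:
        # GPUs not yet granted to any tenant
        return self.total_gpus - sum(self.allocated.values())

    def allocate(self, tenant: str, count: int) -> bool:
        """Grant `count` GPUs to `tenant` if capacity allows; report success."""
        if count > self.free_gpus:
            return False
        self.allocated[tenant] = self.allocated.get(tenant, 0) + count
        return True

    def release(self, tenant: str) -> None:
        """Return all of a tenant's GPUs to the pool."""
        self.allocated.pop(tenant, None)

# Hypothetical usage: two research teams sharing an 8-GPU pool
pool = GpuPool(name="a100-pool", total_gpus=8)
assert pool.allocate("team-vision", 6)   # succeeds: 8 free
assert not pool.allocate("team-nlp", 4)  # fails: only 2 free
assert pool.allocate("team-nlp", 2)      # succeeds
print(pool.free_gpus)  # 0
```

A production platform would of course layer quotas, preemption, and scheduling (e.g., via Kubernetes) on top of bookkeeping like this; the sketch only illustrates the core capacity-tracking idea.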
Requirements
- 5+ years of professional experience in systems development, DevOps, or infrastructure engineering.
- Hands-on experience with AWS services and cloud infrastructure (EC2, VPC, S3, IAM, CloudFormation/CDK).
- Programming skills in Python, Go, or similar languages for infrastructure automation.
- Experience building and maintaining production systems at scale.
- Demonstrated ability to troubleshoot complex distributed systems issues.
- Knowledge of security best practices and experience implementing security controls.
- Experience with Infrastructure as Code (IaC) principles and tools.
Nice to have
- Knowledge of AWS CDK and CloudFormation for infrastructure automation.
- Networking experience (VPC design, network security, performance optimization).
- Security hardening experience in cloud environments, including compliance frameworks.
- Experience with Kubernetes and container orchestration at scale.
- Familiarity with GPU computing, CUDA, and ML framework internals (PyTorch, TensorFlow, Ray).
Culture & Benefits
- Work within a small, high-impact foundational infrastructure team at the SF AI Lab.
- Opportunity to work with some of the most advanced AI infrastructure in the world.
- Build skills that define the future of ML systems engineering.
- Collaborate closely with research scientists, translating needs into robust, scalable infrastructure.
- Total compensation package includes equity, sign-on payments, medical, financial, and other benefits.