Staff Infrastructure Engineer, Cluster Infrastructure (AI)
Job description
TL;DR
Staff Infrastructure Engineer (Cluster Infrastructure/AI): Design and scale the full lifecycle of compute clusters across cloud providers and datacenters, with an emphasis on agent-driven automation and high-bandwidth connectivity. The role focuses on establishing technical strategy for cluster scalability, homogeneity, and fault tolerance at hyperscale.
Location: Hybrid. Must be based in, or able to work from, San Francisco, CA; New York City, NY; or Seattle, WA (minimum 25% office presence).
Salary: $320,000 - $4,050,000 USD per year
Company
The company is a public benefit corporation dedicated to creating reliable, interpretable, and steerable AI systems.
What you will do
- Own the technical strategy and roadmap for agent-driven cluster lifecycle management, including provisioning, updates, and decommissioning.
- Collaborate with cloud providers and internal research, inference, and product teams to shape long-term compute and infrastructure strategy.
- Ensure clusters are provisioned secure-by-default and leverage cloud solutions for high-bandwidth inter-cluster connectivity.
- Define and drive strategies for cluster scalability, homogeneity, and fault tolerance.
- Establish operational-excellence practices, including incident response and a healthy on-call culture.
- Provide technical mentorship and coaching to support the growth of surrounding engineers.
Requirements
- Deep expertise in distributed systems, reliability, and cloud platforms (Kubernetes, IaC, AWS/GCP/Azure).
- Strong proficiency in Rust, Go, or Python, and experience with Terraform.
- Proven track record of leading complex, multi-quarter technical initiatives spanning multiple teams.
- Ability to build alignment across senior stakeholders and communicate effectively.
- Must be based in or able to work from one of the designated US offices.
Nice to have
- 8+ years of software engineering experience, including time as a technical lead.
- Experience operating hyperscale compute infrastructure (100+ clusters, 10K+ nodes).
- Depth in Kubernetes internals, cluster orchestration systems (e.g., Mesos, Borg), or cloud networking (VPC, BGP, eBPF).
- Experience with cluster security, pod security standards, RBAC, and container hardening.
- Expertise with workflow orchestration tools like Temporal or Argo Workflows.
Culture & Benefits
- Competitive compensation with optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours and collaborative office spaces.
- Highly collaborative "big science" approach to AI research.
- Visa sponsorship available for eligible candidates.