Site Reliability Engineer (Supercomputing)
Job description
TL;DR
Site Reliability Engineer (Supercomputing): Operate and scale the next generation of compute clusters that power the company's frontier research, with an emphasis on distributed systems engineering and hands-on infrastructure work. The role focuses on managing fast-moving operations, diagnosing issues, and building automation for large-scale systems.
Location: Onsite in San Francisco
Salary: $255K – $490K + equity
Company
An AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.
What you will do
- Spin up and scale large Kubernetes clusters, including automation for provisioning and lifecycle management.
- Build software abstractions that unify multiple clusters for seamless training workloads.
- Own node bring-up from bare metal through firmware upgrades.
- Improve operational metrics such as reducing cluster restart times and accelerating upgrade cycles.
- Integrate networking and hardware health systems for end-to-end reliability.
- Develop monitoring and observability systems to maintain cluster stability.
Requirements
- Experience as an infrastructure or distributed systems engineer in large-scale environments.
- Strong knowledge of Kubernetes internals and containerized workloads.
- Proficiency in cloud infrastructure concepts and automating operations.