Site Reliability Engineer (AI)
Job description
TL;DR
Site Reliability Engineer (AI/DevOps): Shape the reliability, scalability, and performance of the platform and customer-facing applications, with an emphasis on infrastructure automation and high-availability ML workloads. Focus on building a cloud-agnostic platform and optimizing HPC clusters to ensure seamless model training and inference.
Location: New York, NY (hybrid: at least 3 days per week in the office). Open to candidates willing to relocate to the USA.
Company
is a pioneering AI company democratizing high-performance, open-source, and cutting-edge models and solutions.
What you will do
- Design and maintain scalable, fault-tolerant infrastructure for web services and ML workloads.
- Manage production systems, troubleshoot issues, and implement monitoring and alerting.
- Automate infrastructure deployment and orchestration using Kubernetes, Flux, and Terraform.
- Collaborate with AI/ML researchers to enable reproducible model-training experiments.
- Develop a cloud-agnostic platform as an abstraction layer between science and infrastructure.
- Contribute to open-source projects, research publications, and technical documentation.
Requirements
- Master’s degree in Computer Science, Engineering, or a related field.
- 7+ years of experience in a DevOps/SRE role.
- Strong experience with cloud computing, distributed systems, and reliability KPIs.
- Hands-on proficiency with Docker, Kubernetes, Prometheus, Grafana, and ELK Stack.
- Proficiency in Python, Go, or Bash and strong networking/security knowledge.
- Must be based in or willing to relocate to NYC.
Nice to have
- Experience in AI/ML environments.
- Knowledge of High-Performance Computing (HPC) systems and Slurm.
- Experience with AI-oriented cloud providers such as Fluidstack, CoreWeave, or Vast.
Culture & Benefits
- Competitive salary and equity package.
- Comprehensive medical, dental, and vision insurance for employees and families.
- 401(k) with 6% matching.
- 18 days of PTO and visa sponsorship.
- Monthly stipends for meals ($400), gym membership ($120), and transportation.
- Access to BetterUp coaching on a voluntary basis.