Senior Site Reliability Engineer (AI Infrastructure Operations)
Мэтч & Сопровод
Для мэтча с этой вакансией нужен Plus
Описание вакансии
TL;DR
Senior Site Reliability Engineer (AI Infrastructure Operations): Design, build, and operate reliable, scalable infrastructure across our GPU cloud with an accent on hands-on engineering, system reliability, and operational excellence. Focus on improving performance, automating operations, and ensuring platform stability at scale.
Location: US
Salary: $100,000 - $200,000 USD
Company
is the GPU cloud engineered for AI—purpose-built to deliver high-performance, cost-efficient infrastructure for AI-native startups and global enterprises.
What you will do
- Design, build, and improve automation, tooling, and infrastructure systems supporting AI and HPC workloads.
- Contribute to the development of control-plane systems and operational frameworks.
- Define and implement SLOs, SLIs, and monitoring strategies to ensure system reliability.
- Participate in incident response and root cause analysis, driving improvements to reduce recurrence.
- Collaborate with Engineering, Network, and Fleet teams to improve system design and operational processes.
- Drive improvements in availability, scalability, and operational efficiency.
Requirements
- 5–8+ years of experience in SRE, Systems Engineering, or Software Engineering in production environments.
- Strong software engineering skills with experience building automation and infrastructure tooling.
- Solid understanding of Linux systems, networking, and distributed systems.
- Experience troubleshooting issues across infrastructure, OS, networking, and application layers.
- Familiarity with monitoring, alerting, and observability tools.
- Ability to balance reliability, performance, and delivery speed.
Nice to have
- Experience with AI or HPC environments, including GPUs or high-performance systems.
- Exposure to high-speed networking (InfiniBand/RDMA).
- Familiarity with Kubernetes, cloud platforms, or bare-metal environments.
- Experience with observability systems in high-scale environments.
Culture & Benefits
- Culture is defined by ownership, accountability, and rapid innovation.
- Operate with urgency and transparency.
- Competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →