обновлено 3 часа назад

Senior Site Reliability Engineer (AI Infrastructure Operations)

100 000 - 200 000$

Формат работы

remote (только USA)

Тип работы

fulltime

Грейд

senior

Английский

Страна

Вакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:

TL;DR

Senior Site Reliability Engineer (AI Infrastructure Operations): Design, build, and operate reliable, scalable infrastructure across our GPU cloud with an accent on hands-on engineering, system reliability, and operational excellence. Focus on improving performance, automating operations, and ensuring platform stability at scale.

Location: US

Salary: $100,000 - $200,000 USD

Company

Nscale is the GPU cloud engineered for AI—purpose-built to deliver high-performance, cost-efficient infrastructure for AI-native startups and global enterprises.

What you will do

Design, build, and improve automation, tooling, and infrastructure systems supporting AI and HPC workloads.
Contribute to the development of control-plane systems and operational frameworks.
Define and implement SLOs, SLIs, and monitoring strategies to ensure system reliability.
Participate in incident response and root cause analysis, driving improvements to reduce recurrence.
Collaborate with Engineering, Network, and Fleet teams to improve system design and operational processes.
Drive improvements in availability, scalability, and operational efficiency.

Requirements

5–8+ years of experience in SRE, Systems Engineering, or Software Engineering in production environments.
Strong software engineering skills with experience building automation and infrastructure tooling.
Solid understanding of Linux systems, networking, and distributed systems.
Experience troubleshooting issues across infrastructure, OS, networking, and application layers.
Familiarity with monitoring, alerting, and observability tools.
Ability to balance reliability, performance, and delivery speed.

Nice to have

Experience with AI or HPC environments, including GPUs or high-performance systems.
Exposure to high-speed networking (InfiniBand/RDMA).
Familiarity with Kubernetes, cloud platforms, or bare-metal environments.
Experience with observability systems in high-scale environments.

Culture & Benefits

Culture is defined by ownership, accountability, and rapid innovation.
Operate with urgency and transparency.
Competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →

Похожие вакансии