Principal Site Reliability Engineer (AI)
Job description
TL;DR
Principal Site Reliability Engineer (AI): Lead reliability strategy and architectural design for high-performance AI and HPC infrastructure, with an emphasis on scalability, automation, and operational excellence. The role focuses on designing large-scale control-plane systems, defining reliability standards, and driving systemic improvements across GPU and network platforms.
Company
The company provides high-performance, cost-effective GPU cloud infrastructure engineered specifically for AI start-ups and enterprise customers.
What you will do
- Own and evolve the long-term reliability strategy for AI and HPC infrastructure.
- Design and lead the development of large-scale control-plane systems and automation frameworks.
- Define reliability standards, SLO frameworks, and operational best practices.
- Act as a senior technical escalation point during critical incidents to ensure systemic resolution.
- Partner with cross-functional leadership to influence platform design and operational maturity.
- Mentor senior and mid-level engineers to elevate SRE practices across the organization.
Requirements
- 10+ years of experience in SRE, Systems, or Software Engineering operating complex infrastructure.
- Expert-level software engineering skills in building production-grade automation.
- Deep expertise in Linux, networking, and distributed systems design at scale.
- Extensive experience debugging failures across hardware, OS, network, and application layers.
- Proven ability to lead technical initiatives across teams without direct authority.
- Strong systems-thinking mindset balancing reliability, velocity, and cost.
Nice to have
- Hands-on experience with AI/HPC platforms, InfiniBand/RDMA, and workload schedulers like SLURM.
- Experience designing observability systems for high-cardinality and high-throughput environments.
- Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures.
Culture & Benefits
- Competitive base and equity package with annual reviews.
- Remote-first environment with a focus on trust, autonomy, and flexible work.
- Opportunity to work at a fast-growing startup building cutting-edge AI technology.
- Collaborative, supportive environment with a focus on professional growth and progression.