TL;DR
Member of Technical Staff, Hardware Health (AI): Ensure the sustained reliability, performance, and availability of advanced AI training infrastructure featuring multi-gigawatt clusters and high-performance networks, with an emphasis on developing predictive health models, failure detection, and autonomous remediation systems. The role focuses on designing advanced RoCE transport, fabric architecture, network modeling, and AI cluster bring-up.
Location: Hybrid in Zürich, Switzerland (expected to work from office 4 days/week if living within 25 miles).
Company
hirify.global operates one of the world’s most advanced AI training infrastructures, featuring multi-gigawatt clusters and ultra-low-latency networks.
What you will do
- Design and tune RoCE transport, congestion control, and network fabrics.
- Develop telemetry, observability, reliability engineering, and automated troubleshooting systems.
- Deploy novel routing techniques to achieve network reliability in large networks.
- Collaborate with world-class network designers such as NVIDIA and Broadcom, and with in-house silicon/network co-design teams.
- Conduct AI training/inference cluster bring-up, performance benchmarking, and root-cause analysis.
- Gather data and insights to develop the pretraining compute roadmap.
Requirements
- Bachelor’s Degree in Computer Science or a related technical field.
- 6+ years of technical engineering experience.
- Proficiency in coding with languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.
- Expected to work from a designated Microsoft office at least four days a week if living within 25 miles of Zürich.
Culture & Benefits
- Part of hirify.global’s Superintelligence Team, a startup-like team pushing AI boundaries toward Humanist Superintelligence.
- A mission to empower every person and organization, with a culture of innovation and collaboration built on respect, integrity, and accountability.
- Opportunity to partner with product teams, giving models the chance to reach billions of users.
- Work in a fast-paced, design-driven product development cycle.