Operations Engineer (HPC Networking)
Job description
TL;DR
Operations Engineer (HPC Networking): Support the deployment, monitoring, and maintenance of large-scale InfiniBand fabrics, ensuring stability and performance, with an emphasis on network troubleshooting and cluster operations. Focus on investigating connectivity problems, resolving performance bottlenecks, and maintaining HPC control plane components.
Location: Hybrid (NJ, NY, CA, WA); remote may be considered for candidates located more than 30 miles from an office. Must be a U.S. person (citizen, national, lawful permanent resident, refugee, or asylee) to comply with U.S. Government export regulations.
Salary: $110,000 – $179,000
Company
is the 'Essential Cloud for AI', providing a high-performance infrastructure platform for AI labs, startups, and global enterprises.
What you will do
- Monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes.
- Investigate and resolve operational issues such as network connectivity problems and performance bottlenecks.
- Assist with the installation and operational bring-up of large InfiniBand fabrics with onsite personnel and customers.
- Perform routine maintenance and upgrades on InfiniBand switches and control plane components.
- Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise.
Requirements
- At least 1 year of experience with InfiniBand or similar networking technologies.
- Solid understanding of networking architectures, topologies, and operational best practices.
- Experience with Linux system administration and maintenance.
- Proficiency in at least one scripting language.
Nice to have
- Hands-on experience with Nvidia UFM or similar fabric management tools.
- Familiarity with the SLURM job scheduler in HPC environments.
- Experience with monitoring platforms such as Grafana or Prometheus.
- Experience with automation frameworks like Ansible.
- Knowledge of data center operations, including server racks and cabling.
- Python or Bash scripting skills.
Culture & Benefits
- 100% company-paid medical, dental, and vision insurance.
- 401(k) with a generous employer match and Employee Stock Purchase Program (ESPP).
- Flexible PTO and comprehensive family-forming support via Carrot.
- Mental wellness benefits through Spring Health and tuition reimbursement.
- Catered lunch provided daily in office and data center locations.