Operations Engineer (HPC Networking)

$110,000 – $179,000

Work format

remote (USA only)/hybrid

Employment type

full-time

Grade

middle

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Operations Engineer (HPC Networking): Support the deployment, monitoring, and maintenance of large-scale InfiniBand fabrics, ensuring stability and performance, with an emphasis on network troubleshooting and cluster operations. Focus on investigating connectivity problems, resolving performance bottlenecks, and maintaining HPC control plane components.

Location: Hybrid (NJ, NY, CA, WA); remote may be considered for candidates located more than 30 miles from an office. Must be a U.S. person (citizen, national, lawful permanent resident, refugee, or asylee) to comply with U.S. Government export regulations.

Salary: $110,000 – $179,000

Company

hirify.global is the 'Essential Cloud for AI', providing a high-performance infrastructure platform for AI labs, startups, and global enterprises.

What you will do

Monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes.
Investigate and resolve operational issues such as network connectivity problems and performance bottlenecks.
Assist with the installation and operational bring-up of large InfiniBand fabrics with onsite personnel and customers.
Perform routine maintenance and upgrades on InfiniBand switches and control plane components.
Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise.

Requirements

At least 1 year of experience with InfiniBand or similar networking technologies.
Solid understanding of networking architectures, topologies, and operational best practices.
Experience with Linux system administration and maintenance.
Proficiency in at least one scripting language.

Nice to have

Hands-on experience with NVIDIA UFM or similar fabric management tools.
Familiarity with the Slurm job scheduler in HPC environments.
Experience with monitoring platforms such as Grafana or Prometheus.
Experience with automation frameworks like Ansible.
Knowledge of data center operations, including server racks and cabling.
Python or Bash scripting skills.

Culture & Benefits

100% company-paid medical, dental, and vision insurance.
401(k) with a generous employer match and Employee Stock Purchase Program (ESPP).
Flexible PTO and comprehensive family-forming support via Carrot.
Mental wellness benefits through Spring Health and tuition reimbursement.
Catered lunch provided daily in office and data center locations.
