Company hidden
2 hours ago

Infrastructure Operations Engineer (AI)

Work format
hybrid
Employment type
fulltime
Grade
middle
English
b2
Country
US
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Infrastructure Operations Engineer (AI/GPU Cloud): Manage and optimize data center infrastructure to keep the GPU cloud efficient, reliable, and scalable, with an emphasis on Linux, Kubernetes, and networking. The role focuses on troubleshooting complex infrastructure incidents, implementing automation, and collaborating with cross-functional teams to improve service delivery.

Location: Must be based in North Carolina, US (Hybrid/Travel required)

Company

hirify.global is a GPU cloud provider engineered for AI, providing high-performance infrastructure for AI startups and enterprises.

What you will do

  • Handle day-to-day tickets and alerts in the support rotation, escalating issues to Engineering when necessary.
  • Manage and resolve infrastructure tickets using the internal system, maintaining clear communication with all parties.
  • Execute runbooks to resolve common issues and propose incremental improvements and fixes.
  • Monitor, troubleshoot, and triage platform issues, capturing logs for efficient handover.
  • Identify and implement automation opportunities to optimize operational processes.
  • Travel to hirify.global or customer locations for deployments, troubleshooting, and operational tasks.

Requirements

  • Location: Must be based in North Carolina, US
  • Strong fundamentals in Linux CLI, systemd, filesystems, permissions, and basic networking tools.
  • Solid understanding of IP addressing, subnets, VLANs, routing, DNS, and firewalls.
  • Experience with Kubernetes core concepts (nodes, pods, services, logs) and basic troubleshooting.
  • Ability to write simple Bash or Python scripts and use Git for version control.
  • Familiarity with GPU diagnostics (e.g., nvidia-smi) and observability dashboards.

Nice to have

  • Hands-on Kubernetes administration, operators, and storage/networking add-ons.
  • Knowledge of RDMA/InfiniBand, NCCL, or job schedulers for HPC.
  • Experience with Infrastructure as Code (Ansible, Terraform) and GitOps/CI/CD (GitHub Actions).
  • Experience with security tools like Teleport or Vault.

Culture & Benefits

  • Highly competitive compensation package including base salary and equity.
  • Dynamic progression plan tailored to individual ambitions.
  • "Human-First" flexibility and autonomy in shaping the workday.
  • Collaborative, remote-first culture within a fast-growing AI startup.
