Infrastructure Support Lead (AI)
Job description
TL;DR
Infrastructure Support Lead (GPU Cloud/AI): Lead and manage the US infrastructure support team that delivers high-performance GPU cloud services, with an emphasis on team leadership, service delivery (SLA), and operational excellence. The role focuses on managing Kubernetes and Linux-based infrastructure at scale, resolving complex technical incidents across compute and networking layers, and automating operational workflows.
Location: Must be based in the US. Remote-first team, but requires travel to or customer sites when needed.
Company
The company is a GPU cloud engineered for AI, providing cost-effective, high-performance infrastructure for AI startups and large enterprise customers.
What you will do
- Manage, coach, and mentor the US infrastructure support team, including performance reviews, development planning, and shift scheduling.
- Own ticket queue management and ensure strict adherence to ITIL processes across incidents, requests, and changes.
- Drive operational excellence by improving dashboards, alerting, and runbooks to reduce repeat incidents.
- Provide hands-on technical support across compute, storage, networking, and Kubernetes environments at scale.
- Act as the regional escalation point for high-impact incidents and lead post-incident reviews to identify recurring patterns.
- Collaborate with Senior Engineers on technical improvements and the development of operational tooling.
Requirements
- Must be based in the United States to lead the regional team.
- Proven experience leading or managing engineers in an operational support environment with a focus on meeting SLAs.
- Strong Linux systems engineering expertise with a track record of troubleshooting production compute, storage, and network layers.
- Experience operating and debugging Kubernetes environments and distributed systems.
- Solid understanding of networking fundamentals (L2/L3, routing, VLANs) and high-performance fabrics like RDMA/NVLink.
- Proficiency with scripting (Bash, Python) and Infrastructure as Code tools such as Ansible and Terraform.
Nice to have
- Experience with GPU platforms (NVIDIA/AMD) and performance diagnostics (nvidia-smi, NCCL).
- Exposure to HPC or distributed workloads involving InfiniBand or MPI.
- Experience with CI/CD or GitOps tooling.
- Experience working in multi-region environments.
Culture & Benefits
- Highly competitive compensation package including base salary and equity.
- Remote-first work culture with "Human-First Flexibility," granting autonomy to shape your own schedule.
- Dynamic progression plan tailored to individual ambitions and ownership of impact.
- Collaborative and innovative environment within a fast-growing tech startup.