Principal Technical Program Manager (AI Infrastructure)
Job description
TL;DR
Principal Technical Program Manager (AI Infrastructure): Drive the operational stability and scaling of high-performance GPU fleets and InfiniBand networks, with an emphasis on availability and uptime metrics. The role focuses on optimizing operational workflows, leading large-scale infrastructure build-outs, and connecting engineering execution to strategic business goals.
Location: Remote (Global) — Geography is no barrier to impact or connection
Company
A GPU cloud engineered for AI, providing cost-effective, high-performance infrastructure for AI startups and large enterprise customers.
What you will do
- Lead strategic operational programs, including new data center AI infrastructure build-outs and large-scale fleet software/firmware rollouts.
- Establish and drive accountability for critical infrastructure KPIs, specifically targets of 97.5% availability and 99% uptime.
- Analyze and optimize operational workflows across Fleet Operations, Network Operations, and SRE to reduce toil and improve MTTR.
- Act as the primary liaison between hardware, compute platform, and network engineering teams, as well as external GPU and network hardware vendors.
- Translate capacity planning models into actionable infrastructure delivery and readiness roadmaps.
- Identify and mitigate technical, schedule, and resource risks related to AI infrastructure scaling.
Requirements
- 5+ years of experience in Technical Program Management driving large-scale infrastructure or software engineering programs.
- Strong foundational understanding of data center infrastructure, distributed systems, Linux, and networking concepts.
- Proven expertise in modern program management methodologies (Agile, Scrum); PMP certification preferred.
- Experience defining and improving system performance based on operational metrics (SLOs, SLIs, MTTR).
- Ability to thrive in a fast-paced, high-growth environment and manage multiple priorities under ambiguity.
Nice to have
- Direct experience with data center infrastructure build-outs and hardware commissioning.
- Domain knowledge of AI/HPC infrastructure, including NVIDIA GPUs and InfiniBand/RDMA networks.
- Experience in hyperscale or public cloud environments supporting 24/7 mission-critical services.
- Familiarity with SRE principles, automation tooling, and CI/CD pipelines for infrastructure.
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
Culture & Benefits
- Highly competitive package including base salary and equity with annual reviews.
- Remote-first team culture with high autonomy and human-first flexibility.
- Opportunity to join a fast-growing tech startup and work on cutting-edge AI infrastructure.
- Dynamic progression plan tailored to individual ambitions and ownership.