Datacenter Hardware Operations Technician Lead (AI)
Job description
TL;DR
Datacenter Hardware Operations Technician Lead (AI): Serve as the senior on-site technical authority for hardware reliability and fleet health at a flagship AI campus, with an emphasis on diagnosing complex hardware failures and driving root cause analysis. The role focuses on maintaining operational performance for large-scale GPU, server, and storage infrastructure.
Location: Must be based onsite in Abilene, Texas, 5 days per week
Compensation: $86.4K – $228K
Company
An AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.
What you will do
- Serve as the senior on-site hardware operations lead for server, GPU, storage, and rack-level infrastructure.
- Drive technical triage and resolution of complex hardware failures impacting production systems.
- Partner with Fleet Health Engineering to investigate recurring hardware issues and improve fleet reliability.
- Lead root cause analysis (RCA) efforts for critical hardware incidents and develop corrective action plans.
- Collaborate with operations teams and OEM vendors to coordinate repairs, upgrades, and lifecycle activities.
- Establish and improve hardware maintenance procedures, operational runbooks, and troubleshooting standards.
Requirements
- 8+ years of experience supporting large-scale datacenter hardware infrastructure.
- Deep expertise with server platforms, GPU systems, storage infrastructure, and rack integration.
- Proven experience diagnosing complex hardware failures and leading repair efforts in production environments.
- Strong understanding of hardware reliability engineering principles and fleet-health management.
- Ability to partner effectively across engineering, operations, manufacturing, and vendor organizations.
- Must be able to work onsite in Abilene, Texas, 5 days per week.
Nice to have
- Experience supporting large-scale GPU clusters or AI/ML infrastructure environments.
- Familiarity with fleet health systems, telemetry platforms, and hardware monitoring tools.
- Experience with failure analysis methodologies such as FRACAS, RCCA, 5-Why, or FMEA.
- Knowledge of Linux system administration and hardware validation workflows.
- Industry certifications such as CompTIA Server+ or OEM hardware certifications.
Culture & Benefits
- Opportunity to work on world-class AI infrastructure at scale.
- Collaborative environment partnering with top-tier engineering and operations teams.
- Commitment to safety, diversity, and inclusion in the workplace.
- Focus on operational excellence and continuous improvement of infrastructure standards.