Sr. Software Engineer (Data Center Automation) (AI)

Work format

onsite

Employment type

full-time

Grade

senior

English

Country

Job description

TL;DR

Sr. Software Engineer (Data Center Automation) (AI): Manage and improve reliability across a multi-data center environment, with an emphasis on automating reliability workflows and observability solutions. The role focuses on reducing MTTR through proactive monitoring, optimizing Linux-based systems for AI workloads, and integrating software reliability with physical infrastructure.

Location: Memphis, TN

Company

hirify.global is focused on creating AI systems that accurately understand the universe and aid humanity in its pursuit of knowledge.

What you will do

Design and deploy scalable services in Python and Rust to automate reliability workflows, monitoring, and incident response.
Implement advanced observability stacks (metrics, logging, tracing) to provide real-time insights into multi-data center health.
Collaborate with network and facility operations to mitigate physical risks and automate fault tolerance and disaster recovery.
Troubleshoot complex hardware, environmental, and software issues in distributed environments to harden system resilience.
Optimize Linux kernels and container orchestration (Kubernetes) for high-performance AI compute environments.
Mentor junior engineers and drive a culture of automation and knowledge sharing.

Requirements

Must be based in, or able to work onsite in, Memphis, TN.
3+ years of experience in SRE, infrastructure engineering, or DevOps in large-scale production environments.
Strong production experience in Python and proficiency in a systems-level language like Rust, Go, or C++.
Deep knowledge of Linux systems administration, kernel tuning, and scripting for automation.
Practical experience with Docker, Kubernetes, and observability tools like Prometheus and Grafana.
Understanding of large-scale networking fundamentals (TCP/IP, routing, DNS).

Nice to have

5+ years of experience in hyperscale, cloud, or AI/ML training infrastructure.
Experience optimizing GPU clusters or high-throughput compute environments.
Background in bare-metal provisioning and multi-site failover mechanisms.
Experience integrating software tools with physical DC infrastructure (power, cooling).

Culture & Benefits

Flat organizational structure where initiative and excellence are rewarded with leadership.
High-impact environment focusing on engineering excellence and curiosity.
Collaborative culture emphasizing concise knowledge sharing and strong work ethic.
Opportunity to work on bleeding-edge AI infrastructure at a global scale.
