Sr. Software Engineer (Data Center Automation) (AI)
Job description
TL;DR
Sr. Software Engineer (Data Center Automation) (AI): Manage and improve reliability across a multi-data center environment, with an emphasis on automating reliability workflows and observability solutions. The role focuses on reducing MTTR through proactive monitoring, optimizing Linux-based systems for AI workloads, and integrating software reliability with physical infrastructure.
Location: Memphis, TN
Company
The company is focused on creating AI systems that accurately understand the universe and aid humanity in its pursuit of knowledge.
What you will do
- Design and deploy scalable services in Python and Rust to automate reliability workflows, monitoring, and incident response.
- Implement advanced observability stacks (metrics, logging, tracing) to provide real-time insights into multi-data center health.
- Collaborate with network and facility operations to mitigate physical risks and automate fault tolerance and disaster recovery.
- Troubleshoot complex hardware, environmental, and software issues in distributed environments to harden system resilience.
- Optimize Linux kernels and container orchestration (Kubernetes) for high-performance AI compute environments.
- Mentor junior engineers and drive a culture of automation and knowledge sharing.
Requirements
- Must be based in, or able to work onsite in, Memphis, TN.
- 3+ years of experience in SRE, infrastructure engineering, or DevOps in large-scale production environments.
- Strong production experience in Python and proficiency in a systems-level language like Rust, Go, or C++.
- Deep knowledge of Linux systems administration, kernel tuning, and scripting for automation.
- Practical experience with Docker, Kubernetes, and observability tools like Prometheus and Grafana.
- Understanding of large-scale networking fundamentals (TCP/IP, routing, DNS).
Nice to have
- 5+ years of experience in hyperscale, cloud, or AI/ML training infrastructure.
- Experience optimizing GPU clusters or high-throughput compute environments.
- Background in bare-metal provisioning and multi-site failover mechanisms.
- Experience integrating software tools with physical DC infrastructure (power, cooling).
Culture & Benefits
- Flat organizational structure where initiative and excellence are rewarded with leadership.
- High-impact environment focusing on engineering excellence and curiosity.
- Collaborative culture emphasizing concise knowledge sharing and a strong work ethic.
- Opportunity to work on bleeding-edge AI infrastructure at a global scale.