Senior/Staff Software Engineer (ML Infrastructure)

Work format

onsite

Employment type

fulltime

Grade

senior/principal

English

Country

Job description

Text:

TL;DR

Senior/Staff Software Engineer (ML Infrastructure): Designing, building, and operating foundational systems for large-scale machine learning and AI at Slack, with an emphasis on distributed model training, serving, and deployment. The role focuses on evolving GPU-backed inference infrastructure, optimizing data processing systems, and setting long-term architectural direction for ML infrastructure.

Location: Onsite in Seattle, Austin, Atlanta, or Bellevue, USA

Company

hirify.global's Slack AI team focuses on transforming how people work by making Slack an AI-powered operating system.

What you will do

Design, build, and operate scalable, reliable, and performant systems for ML model training, serving, and deployment.
Evolve GPU-backed inference infrastructure to support high-throughput, latency-sensitive AI workloads.
Architect and optimize distributed training and data processing systems using technologies like Ray, Airflow, or Spark.
Build and maintain Kubernetes-based platforms and orchestration layers, including tools like KubeRay and vLLM.
Architect solutions to bridge legacy systems with modern technologies while ensuring application stability.
Develop robust monitoring, observability, and alerting for production ML workloads.
Provide technical leadership through design reviews, mentorship, and by setting engineering standards.

Requirements

Significant professional experience in software engineering with a strong focus on infrastructure, backend systems, platform engineering, or MLOps.
Deep experience building and operating distributed systems, including expert-level knowledge of Kubernetes.
Hands-on experience with modern ML infrastructure and serving stacks (e.g., Ray, KubeRay, vLLM).
Experience working with GPU infrastructure, including performance optimization and operational management at scale.
Strong experience with data infrastructure and orchestration technologies (e.g., Airflow, Spark).
Experience building and operating cloud-native systems on public cloud platforms (AWS, GCP, Azure) and working with infrastructure as code.
Demonstrated ability to drive technical direction for complex systems and balance short-term delivery with long-term architectural goals.
Excellent written communication and ability to thrive in an asynchronous team environment.

Culture & Benefits

Join a team shaping the future of work by making Slack an AI-powered operating system.
Contribute to deep architectural decisions for large-scale, high-performance ML and AI systems.
Work on complex scalability and reliability challenges at the intersection of distributed systems and ML.
Opportunity to thrive in an asynchronous and globally distributed infrastructure team.