TL;DR
Senior/Staff Software Engineer (ML Infrastructure): Designing, building, and operating foundational systems for large-scale machine learning and AI at Slack, with an emphasis on distributed model training, serving, and deployment. Focus on evolving GPU-backed inference infrastructure, optimizing data processing systems, and setting long-term architectural direction for ML infrastructure.
Location: Onsite in Seattle, Austin, Atlanta, or Bellevue, USA
Company
The Slack AI team focuses on transforming how people work by making Slack an AI-powered operating system.
What you will do
- Design, build, and operate scalable, reliable, and performant systems for ML model training, serving, and deployment.
- Evolve GPU-backed inference infrastructure to support high-throughput, latency-sensitive AI workloads.
- Architect and optimize distributed training and data processing systems using technologies like Ray, Airflow, or Spark.
- Build and maintain Kubernetes-based platforms and orchestration layers, including tools like KubeRay and vLLM.
- Architect solutions to bridge legacy systems with modern technologies while ensuring application stability.
- Develop robust monitoring, observability, and alerting for production ML workloads.
- Provide technical leadership through design reviews, mentorship, and setting engineering standards.
Requirements
- Significant professional experience in software engineering with a strong focus on infrastructure, backend systems, platform engineering, or MLOps.
- Deep experience building and operating distributed systems, including expert-level knowledge of Kubernetes.
- Hands-on experience with modern ML infrastructure and serving stacks (e.g., Ray, KubeRay, vLLM).
- Experience working with GPU infrastructure, including performance optimization and operational management at scale.
- Strong experience with data infrastructure and orchestration technologies (e.g., Airflow, Spark).
- Experience building and operating cloud-native systems on public cloud platforms (AWS, GCP, Azure), and experience with infrastructure as code.
- Demonstrated ability to drive technical direction for complex systems and balance short-term delivery with long-term architectural goals.
- Excellent written communication and ability to thrive in an asynchronous team environment.
Culture & Benefits
- Join a team shaping the future of work by making Slack an AI-powered operating system.
- Contribute to deep architectural decisions for large-scale, high-performance ML and AI systems.
- Work on complex scalability and reliability challenges at the intersection of distributed systems and ML.
- Opportunity to thrive in an asynchronous and globally distributed infrastructure team.