Staff Machine Learning Engineer, GenAI Platform (AI)
Job description
TL;DR
Staff Machine Learning Engineer, GenAI Platform (AI): Architecting and scaling the company's Generative AI and LLM platform capabilities, with an emphasis on designing resilient, large-scale distributed systems. Focus on building self-serve LLM workflows and developing comprehensive evaluation & benchmarking infrastructure.
Location: Remote (United States)
Salary: $253,300 - $354,600 USD
Company
The company is a community of communities built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet.
What you will do
- Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform.
- Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters.
- Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning.
- Develop Comprehensive Evaluation & Benchmarking Infrastructure: Build scalable systems for automated regression detection, structured metrics tracking, and complex inference-heavy evaluation patterns.
- Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets required for modern GenAI workloads, optimizing for throughput and dynamic batching.
- Provide Technical Leadership & Mentorship: Analyze complex bottlenecks in distributed systems to optimize for performance and cost-efficiency.
Requirements
- 10+ years of work experience in a production software development environment or building complex distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline.
- GenAI/LLM Infrastructure Expertise: Proven track record of designing and operating large-scale ML systems, specifically working with distributed training frameworks and LLM serving/inference optimization.
- Distributed Systems Mastery: Hands-on experience managing fault-tolerant, petabyte-scale distributed systems and multi-node/multi-GPU training clusters.
- Advanced MLOps Knowledge: Deep understanding of modern ML orchestration, fine-tuning pipelines, and model evaluation methodologies. Experience with tools like Ray, MLflow, or similar ecosystem standards.
- GPU Experience: Hands-on practice with CUDA environments and GPU virtualization/containerization, all running within Kubernetes.
- Production Engineering Fundamentals: Hands-on experience with Kubernetes, Docker, and building production-quality code in Python and/or Go.
Culture & Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k with Employer Match
- Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
- Flexible Vacation & Paid Volunteer Time Off
- Generous Paid Parental Leave