Senior Machine Learning Operations Engineer II (AI)
Job description
TL;DR
Senior Machine Learning Operations Engineer II (AI): Designing and scaling infrastructure and automated pipelines to reliably train, deploy, and monitor ML models in production, with an emphasis on CI/CD, distributed infrastructure, and system reliability. Focus on automating Continuous Training (CT) pipelines, optimizing GPU/CPU clusters, and implementing robust observability for data and concept drift.
Location: Remote (Must be based in the US or Canada)
Salary: $148,000 – $216,000 USD (US) / $171,500 – $201,000 CAD (Canada)
Company
provides location sharing and safety services for families, serving nearly 100 million monthly active users globally.
What you will do
- Design and manage automated CI/CD and Continuous Training (CT) pipelines for ML model development and delivery.
- Containerize and scale ML models as high-availability microservices or batch processing workflows.
- Establish unified logging, alerting, and monitoring to track model performance, latency, and data drift.
- Provision and optimize cloud-based ML infrastructure using Infrastructure as Code (IaC) paradigms.
- Collaborate with product teams to drive infrastructure adoption via SDK/API development and system maintenance.
- Implement robust lineage tracking for data, code, and model artifacts to ensure compliance and reproducibility.
Requirements
- 5+ years of professional software engineering, DevOps, or data engineering experience.
- 2+ years specifically dedicated to building and maintaining MLOps infrastructure.
- Strong proficiency in Python and software engineering best practices (unit testing, modular design).
- Hands-on experience with Docker and Kubernetes (EKS, GKE).
- Familiarity with ML lifecycle tools: MLflow, Kubeflow, SparkML, and Airflow.
- Practical experience with major cloud ecosystems (AWS, GCP, or Databricks).
Nice to have
- Experience implementing production feature stores (e.g., Feast, Tecton) and model registries.
- Experience deploying and optimizing LLMs using frameworks like vLLM, Triton, or TGI.
- Proficiency with Terraform for managing reproducible environments.
- Familiarity with distributed computation engines such as Apache Spark, Ray, or Dask.
- Relevant cloud or architecture certifications (e.g., AWS ML Specialty, CKA).
Culture & Benefits
- Remote-first work environment with equipment and tool reimbursement.
- Comprehensive medical, dental, and vision insurance (100% paid for US employees).
- 401(k) matching (US) and RRSP with DPSP (Canada).
- Flexible PTO and 12 company-wide days off per year.
- Learning and Development programs to support professional growth.
- Free Platinum Membership for the employee's circle.