TL;DR
Senior ML Platform / ML Infrastructure Engineer (AI): Design and build standardized ML training and serving pipelines with an emphasis on real-time inference, infrastructure as code, and model lifecycle governance. The role focuses on implementing ultra-low-latency serving patterns, ensuring end-to-end observability, and collaborating on cost-efficient, secure architectures.
Location: Hybrid in Toronto or Montreal, Canada (2 days/week in-office).
Company
hirify.global is the #1 loyalty app for mobile gamers, helping them discover new games and earn rewards.
What you will do
- Design, build, and operate standardized ML training-to-serving pipelines using Airflow.
- Manage real-time and batch inference on AWS SageMaker, including multi-model endpoints and autoscaling.
- Implement ultra-low-latency serving patterns with Redis/Valkey for feature caching and online retrieval.
- Provision and manage ML/data infrastructure with Terraform, focusing on SageMaker, ECR/ECS/EKS, and network resources.
- Establish and manage model lifecycle governance with registries, approval workflows, and audit trails.
- Implement end-to-end observability for ML workflows, including data freshness, drift checks, and performance SLOs.
Requirements
- 5+ years building and operating production-grade ML/data platforms with a focus on serving, reliability, and developer experience.
- Strong software engineering skills in Python, Go, or Java for building resilient services and APIs.
- Deep experience with AWS SageMaker inference: endpoint configuration, containerization, autoscaling.
- Expertise using Redis/Valkey as an online feature store in ML serving contexts.
- Proven Terraform experience for end-to-end ML/data infrastructure management.
- Extensive experience with Airflow orchestration at scale: dependency modeling, DAG factories, and integrations.
Culture & Benefits
- Welcoming and fun work environment with team lunches, game nights, and company-wide events.
- Culture deeply rooted in growth, supported by a smart, dynamic, and enthusiastic team.
- Utilizes data to constantly learn, improve, and adapt.
- Fosters an environment where ideas are shared, boundaries are pushed, and calculated risks are encouraged.