Job description
TL;DR
Principal Engineer (ML Platform): Build and operate the ML platform systems that train, evaluate, and serve generative models in production, with an emphasis on reliability, scalability, performance, and resource efficiency in GPU/cloud environments. Focus on designing platform architecture across research and product workflows, improving scheduling/monitoring/debugging, and creating automation-friendly, agent-oriented tooling that reduces operational overhead.
Location: Remote (Europe)
Company
Synthesia develops an AI video platform for business and enterprise skill development.
What you will do
- Design and improve platform systems for model training, evaluation, and production serving.
- Build infrastructure and tooling to make ML workloads more reliable, scalable, and cost-efficient.
- Develop internal tools and workflows that are easy for both humans and agents to operate.
- Work on architecture for deploying, serving, and operating models across research and product environments.
- Improve scheduling, monitoring, and debugging for GPU and cloud workloads.
- Drive improvements across observability, automation, reliability, and developer experience.
Requirements
- Strong experience building or operating production systems with a focus on reliability, scalability, and maintainability.
- Systems mindset: think in terms of bottlenecks, failure modes, interfaces, resource usage, and long-term operability.
- Hands-on experience with cloud infrastructure, Linux, and infrastructure automation.
- Experience with Kubernetes and operating distributed workloads in production.
- Strong coding skills, ideally in Python or similar languages used for backend systems and tooling.
- Experience building internal platforms, developer tooling, or infrastructure abstractions used by other engineers.
Nice to have
- Experience operating ML infrastructure or model serving systems in production.
- Experience with observability and debugging in distributed systems.
- Familiarity with Terraform, Datadog, GitHub Actions, or similar tools.
- Experience building agentic or LLM-powered internal tools and workflow orchestration (e.g., Temporal).
- Familiarity with performance optimization, scheduling, or resource allocation problems.
Culture & Benefits
- Hands-on IC role with significant ownership and influence over technical direction.
- Close collaboration with researchers and product engineers to turn pain points into robust platform capabilities.
- Focus on pragmatic architectural tradeoffs as the platform scales.
- Emphasis on automation, reliability, and developer experience to reduce operational overhead.
Hiring process
- Interviews focused on production systems thinking, ML platform architecture, and reliability/scalability tradeoffs.
- Technical evaluation of hands-on experience with cloud, Linux, Kubernetes, and ML serving/training operations.
- Discussion of how experience translates to building automation-friendly, agent-oriented platform tooling.