Company hidden
2 days ago

Staff Machine Learning Engineer, GenAI Platform (AI)

$253,300 - $354,600
Work format
remote (USA only)
Employment type
fulltime
Grade
lead
English
b2
Country
US
Job from Hirify RU Global, a list of companies with Eastern European roots

Job description

TL;DR

Staff Machine Learning Engineer, GenAI Platform (AI): Architecting and scaling hirify.global's Generative AI and LLM platform capabilities, with an emphasis on designing resilient, large-scale distributed systems. Focus on building self-serve LLM workflows and developing comprehensive evaluation & benchmarking infrastructure.

Location: Remote (United States)

Salary: $253,300 - $354,600 USD

Company

hirify.global is a community of communities built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet.

What you will do

  • Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform.
  • Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters.
  • Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning.
  • Develop Comprehensive Evaluation & Benchmarking Infrastructure: Build scalable systems for automated regression detection, structured metrics tracking, and complex inference-heavy evaluation patterns.
  • Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets required for modern GenAI workloads, optimizing for throughput and dynamic batching.
  • Provide Technical Leadership & Mentorship: Analyze complex bottlenecks in distributed systems to optimize for performance and cost-efficiency.

Requirements

  • 10+ years of work experience in a production software development environment or building complex distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline.
  • GenAI/LLM Infrastructure Expertise: Proven track record of designing and operating large-scale ML systems, specifically working with distributed training frameworks and LLM serving/inference optimization.
  • Distributed Systems Mastery: Hands-on experience managing fault-tolerant, petabyte-scale distributed systems and multi-node/multi-GPU training clusters.
  • Advanced MLOps Knowledge: Deep understanding of modern ML orchestration, fine-tuning pipelines, and model evaluation methodologies. Experience with tools like Ray, MLflow, or similar ecosystem standards.
  • GPU Experience: Hands-on experience with CUDA environments and GPU virtualization/containerization, all within Kubernetes.
  • Production Engineering Fundamentals: Hands-on experience with Kubernetes, Docker, and building production-quality code in Python and/or Go.

Culture & Benefits

  • Comprehensive Healthcare Benefits and Income Replacement Programs
  • 401k with Employer Match
  • Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
  • Flexible Vacation & Paid Volunteer Time Off
  • Generous Paid Parental Leave
