TL;DR
Applied Research Engineer (AI): Build and manage GPU cluster infrastructure and training pipelines that support scalable model training, with an emphasis on distributed systems and cloud orchestration. The role focuses on optimizing cluster performance, ensuring fault tolerance for long-running jobs, and providing tooling that accelerates the research and development lifecycle.
Location: Remote (United States) or Hybrid (Redwood City, CA / San Francisco, CA)
Salary: $150,000 – $180,000 USD
Company
hirify.global helps enterprises transform expert knowledge into specialized AI at scale by focusing on data-centric AI development.
What you will do
- Manage and scale GPU cluster infrastructure on major cloud providers.
- Develop and operate job orchestration systems using Kubernetes or Slurm.
- Maintain and optimize ML training frameworks and post-training pipelines.
- Implement experiment tracking, dataset versioning, and artifact management solutions.
- Monitor cluster health and implement fault-tolerance and auto-recovery mechanisms for distributed workloads.
- Collaborate with research scientists to unblock infrastructure challenges and evolve training capabilities.
Requirements
- Must be located in the United States.
- Experience managing GPU clusters and cloud networking.
- Proficiency with distributed orchestration tools like Kubernetes or Slurm.
- Strong Python programming skills and familiarity with software engineering best practices.
- Knowledge of distributed training concepts including parallelism and memory optimization.
- Experience with ML experiment tracking and versioning tools.
Nice to have
- Practical experience with post-training workflows such as SFT or RLHF.
- Hands-on work with AWS HyperPod or similar cloud-native infrastructure.
Culture & Benefits
- High-growth environment with established, market-proven solutions.
- Opportunities to directly influence strategic technical decisions.
- Supportive environment focused on continuous learning and career growth.
- Equal Employment Opportunity employer committed to diversity and inclusivity.