Staff Infrastructure Engineer (AI)
Job description
TL;DR
Staff Infrastructure Engineer (AI): Design and implement large-scale infrastructure systems to support AI scientist training, evaluation, and deployment, with an emphasis on distributed environments, performance optimization, and container orchestration. The role focuses on building scalable VM/sandboxing architectures, optimizing data pipelines, and enabling stable reinforcement learning workflows.
Location: Hybrid with at least 25% office presence in San Francisco, USA
Salary: $340,000 - $425,000 USD annually
Company
is a public benefit corporation focused on creating reliable, interpretable, and steerable AI systems that are safe and beneficial for society.
What you will do
- Design and implement large-scale infrastructure for AI scientist training, evaluation, and deployment across distributed systems
- Identify and resolve infrastructure bottlenecks impacting scientific AGI progress
- Develop robust evaluation frameworks for scientific AGI measurement
- Build scalable VM/sandbox/container architectures for safe execution of long-horizon AI tasks
- Collaborate to translate experimental requirements into production-ready infrastructure
- Develop and optimize large-scale data pipelines and reinforcement learning training/inference workflows
Requirements
- 6+ years of experience in infrastructure engineering, with expertise in large-scale distributed systems
- Strong communication and collaboration skills
- Deep knowledge of performance optimization and system architectures for high-throughput ML workloads
- Experience with containerization (Docker, Kubernetes) and orchestration at scale
- Proven track record building large-scale data pipelines and distributed storage systems
- Ability to diagnose and resolve complex infrastructure challenges in production
- Experience working across the full ML stack from data pipelines to performance optimization
- Experience collaborating with researchers to scale experimental ideas
Nice to have
- Experience with language model training infrastructure and distributed ML frameworks (PyTorch, JAX)
- Background in AI research lab infrastructure or large-scale ML organizations
- Knowledge of GPU/TPU architectures and language model inference optimization
- Experience with cloud platforms (AWS, GCP) at enterprise scale
- Familiarity with VM and container orchestration
- Experience with workflow orchestration and experiment management systems
- Experience working with large-scale reinforcement learning
- Comfort with large-scale data pipelines (Beam, Spark, Dask)
Culture & Benefits
- Competitive compensation including equity and benefits
- Generous vacation and parental leave
- Flexible working hours and hybrid work policy
- Visa sponsorship available with immigration lawyer support
- Collaborative and impact-driven research environment
Hiring process
- Evaluation of relevant experience and skills
- Technical interviews focusing on infrastructure and distributed systems
- Assessment of communication and collaboration abilities