2 days ago
Site Reliability Engineer (AI Infrastructure)
Job description
TL;DR
Site Reliability Engineer (AI Infrastructure): Provisioning and operating Kubernetes-based clusters for AI workloads with an emphasis on automation, scalability, and observability. Focus on building the foundation for reliable global AI compute and solving complex networking and scheduling challenges.
Location: Global Remote / San Francisco, CA
Company
provides early-stage startups with scaled AI infrastructure and is building a global liquidity layer for AI compute.
What you will do
- Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
- Build automation and tooling to streamline cluster deployments and integrations.
- Debug customer issues across networking, storage, scheduling, and system layers.
- Improve reliability and scalability of both training and inference infrastructure.
- Design and implement monitoring, alerting, and observability for critical systems.
- Participate in on-call and incident response, leading postmortems and reliability improvements.
Requirements
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
- Strong Linux systems and networking fundamentals.
- Deep experience with Kubernetes and container orchestration at scale.
- Proficiency with Infrastructure-as-Code tools such as Terraform, Helm, and Ansible.
- Strong automation and scripting skills in Python, Go, or Bash.
- Experience with observability stacks including Prometheus, Grafana, Loki, and Datadog.
Nice to have
- Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton).
- Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).
- Customer-facing support or consulting experience.
Culture & Benefits
- High level of ownership and autonomy to shape how systems run.
- Direct collaboration with customers and providers.
- Opportunity to build the foundation for reliable, scalable AI infrastructure.
- Builder-centric environment focusing on solving hard engineering problems.