Staff Network Site Reliability Engineer (AI)
Job description
TL;DR
Staff Network Site Reliability Engineer (AI): Build and run the foundational network infrastructure for a full-stack AI cloud platform, with an emphasis on reliability targets, automation, and scalability. The role focuses on designing safer change workflows, evolving observability, and resolving complex network failures in high-throughput systems.
Location: United States (Must be authorized to work in the US)
Salary: $179,500 - $224,300 USD
Company
The company is building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment.
What you will do
- Define and own reliability goals for network services and critical paths (SLIs/SLOs, availability targets).
- Drive reliability improvements for site readiness and inter-site connectivity (DCI).
- Own incident response, lead investigations, and turn failures into durable fixes.
- Build and evolve observability via actionable metrics, logs, traces, and alerting.
- Design safer change workflows, including automation, CI/CD, and canarying for network changes.
- Collaborate with network and platform teams to embed operability into system designs.
Requirements
- Strong production Linux fundamentals and a structured approach to debugging complex systems.
- Solid understanding of networking basics (control plane vs data plane, latency/loss, failure domains).
- Hands-on experience operating and improving high-availability systems.
- Ability to write and maintain automation in Go or Python.
- Experience with modern infrastructure tooling such as IaC, CI/CD, and container platforms.
- Must be authorized to work in the United States.
Nice to have
- Experience with load balancers, tunneling, NAT64, or other datapath-heavy systems.
- Low-level networking performance background (eBPF/XDP, DPDK, kernel networking internals).
- Experience building network-safe delivery pipelines with automated verification and drift detection.
- Background in large-scale network observability and routing/flow telemetry.
Culture & Benefits
- Competitive compensation and benefits packages.
- Career growth and continuous learning opportunities.
- Culture of flexibility, ownership, and bold thinking.
- Opportunity to work on impactful AI projects within an international environment.