Network Engineer, Capacity And Efficiency (AI)
Job description
TL;DR
Network Engineer, Capacity And Efficiency (AI): Owns the cost, utilization, and attribution story for non-accelerator infrastructure, with an emphasis on the network, compute, and storage backbone. The focus is on building the network observability stack, finding efficiency gains, and driving cost attribution.
Location: San Francisco, CA or New York City, NY. Currently, we expect all staff to be in one of our offices at least 25% of the time.
Salary: $320,000 - $405,000 USD
Company
The company's mission is to create reliable, interpretable, and steerable AI systems.
What you will do
- Build the network observability stack by designing and deploying telemetry pipelines.
- Analyze inter-region traffic patterns, identify hot links and stranded capacity, and quantify the dollar impact.
- Design and operate traffic classification, marking, and shaping across the backbone.
- Tie network spend back to the teams and workloads that generate it.
- Partner across the company to influence teams to achieve outcomes.
- Extend our intent-based network configuration systems and write the tooling that turns your efficiency findings into safe, reviewable, and impactful changes.
Requirements
- Have 5+ years operating large-scale production networks.
- Be genuinely fluent across the stack: BGP, ECMP, VXLAN/EVPN or equivalent overlays, QoS, and L1/optical basics.
- Know at least one major CSP’s networking model deeply — AWS or GCP — and understand how their overlays interact with physical underlays.
- Have built or operated network telemetry at scale.
- Be comfortable writing Python or Go to build tooling, telemetry pipelines, infrastructure-as-code, and config management and automation for network devices, all of which you'll ship to production.
- Think quantitatively by default and communicate crisply.
Nice to have
- SRE experience for large-scale network infrastructure.
- Background on a cloud provider's networking team or a cloud networking product team.
- Familiarity with AI/ML infrastructure traffic patterns.
- Experience with HPC fabrics like InfiniBand, RoCE v2, lossless Ethernet, or custom high-radix topologies and an understanding of how job placement, congestion management, and adaptive routing interact at scale.
- Background in traffic engineering for large backbones and the operational judgment to know when TE is worth the complexity.
- Hands-on time with multi-cloud connectivity.
- Experience building cost/chargeback systems for shared infrastructure, or FinOps exposure in a large cloud environment.
Culture & Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Lovely office space in which to collaborate with colleagues.
Hiring process
- We encourage you to apply even if you do not believe you meet every single qualification.
- We think AI systems like the ones we're building have enormous social and ethical implications.
- We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.