Network Reliability Engineer (AI)
Job description
TL;DR
Network Reliability Engineer (AI): Design and operate the global network and reliability layer for a high-performance private supercomputer, with an emphasis on distributed compute, ML workloads, and real-time analytics. The role focuses on building scalable network architecture, automating infrastructure, and keeping mission-critical systems reliable.
Location: Must be based in San Francisco, California (On-site)
Salary: $210,000 – $240,000
Company
The company is a pioneering Causal AI platform that helps Fortune 100 enterprises prove business outcomes using trusted, causal evidence.
What you will do
- Architect and operate scalable, secure network architecture for large-scale machine learning workloads.
- Own network device configuration management end to end to ensure consistency and reliability.
- Improve system and network performance through automation, observability, and proactive capacity planning.
- Implement and manage complex network protocols including BGP, VPNs, and external peering.
- Build and maintain comprehensive monitoring, alerting, and incident response systems.
- Partner across engineering and data science to drive a culture of performance and reliability.
Requirements
- 8+ years in network or infrastructure engineering, with 5+ years in datacenter operations.
- Extensive hands-on experience with network devices (firewalls, switches, load balancers) and protocols like BGP, QoS, MPLS, and IPsec.
- Experience designing and operating modern datacenter network fabrics (spine-leaf, EVPN/VXLAN, ECMP).
- Proficiency in network automation and IaC tooling (Ansible, Terraform, Nornir) and IPAM/DCIM platforms.
- Strong operational experience with Linux-based production infrastructure and Kubernetes networking.
- Solid scripting skills in Python or Bash for debugging and automation.
Nice to have
- Experience with NVIDIA networking technologies (Cumulus Linux, InfiniBand, Spectrum-X, BlueField DPUs).
- Familiarity with data-intensive platforms like Spark, Airflow, or Kafka.
- Experience with storage network protocols and file systems such as NFS, Lustre, or iSCSI.
- Background in high-compliance or SOC 2 environments.
Culture & Benefits
- Work on cutting-edge infrastructure including one of the world's fastest private supercomputers.
- High-impact role with ownership over architecture decisions for Fortune 100-scale systems.
- Generous equity program to ensure meaningful ownership.
- Transparent compensation philosophy based on real-time market data.
- Collaborative environment with top-tier engineering talent.