Tech Lead (Network Observability)
Job description
TL;DR
Tech Lead (Network Observability): Leading the architecture and development of a high-performance network monitoring platform for RDMA, RoCE, and InfiniBand networks, with an emphasis on cross-stack observability and telemetry pipelines. Focus on optimizing low-latency data collection, designing scalable backend services, and troubleshooting complex issues across network layers for AI GPU clusters.
Location: Onsite in Palo Alto, California
Salary: $180,000 - $260,000
Company
is pioneering software-driven AI fabrics to increase GPU cluster utilization through cross-stack observability and performance acceleration.
What you will do
- Lead the architecture, design, and development of scalable network monitoring platforms for RDMA, RoCE, InfiniBand, and TCP/IP infrastructure.
- Build backend telemetry services, observability dashboards, alerts, diagnostics, and anomaly detection workflows.
- Troubleshoot complex production issues across application, OS, server, and network layers.
- Establish engineering standards, drive automation, and define technical roadmaps with cross-functional teams.
- Mentor engineers on distributed systems and high-performance networking best practices.
Requirements
- Degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field.
- Strong programming experience in C++, Go, Python, or Rust.
- Proven experience leading engineering teams or complex infrastructure projects.
- Hands-on experience with RDMA, RoCE, InfiniBand, and Linux networking (TCP/IP, routing, congestion control).
- Experience with monitoring and visualization tools such as Prometheus, Grafana, Datadog, or OpenTelemetry.
- Must be able to work onsite in Palo Alto, California.
Nice to have
- Experience supporting AI/ML, HPC, or GPU cluster infrastructure workloads.
- Knowledge of eBPF, XDP, DPDK, or Linux kernel networking tools (tcpdump, Wireshark, ethtool).
- Experience with Kubernetes, cloud infrastructure (AWS, GCP, Azure), and infrastructure automation.
- Experience designing time-series data systems and high-cardinality telemetry platforms.
Culture & Benefits
- Competitive compensation and eligibility for the company's equity program.
- Catered lunch.
- Friendly and inclusive workplace culture.
- Comprehensive benefits package.