Назад
Company hidden
4 дня назад

Tech Lead (Network Observability)

180 000 - 260 000$
Формат работы
onsite
Тип работы
fulltime
Грейд
lead
Английский
b2
Страна
US
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Tech Lead (Network Observability): Leading the architecture and development of a high-performance network monitoring platform for RDMA, RoCE, and InfiniBand networks with an accent on cross-stack observability and telemetry pipelines. Focus on optimizing low-latency data collection, designing scalable backend services, and troubleshooting complex network layers for AI GPU clusters.

Location: Onsite in Palo Alto, California

Salary: $180,000 - $260,000

Company

hirify.global is pioneering software-driven AI fabrics to increase GPU cluster utilization through cross-stack observability and performance acceleration.

What you will do

  • Lead the architecture, design, and development of scalable network monitoring platforms for RDMA, RoCE, InfiniBand, and TCP/IP infrastructure.
  • Build backend telemetry services, observability dashboards, alerts, diagnostics, and anomaly detection workflows.
  • Troubleshoot complex production issues across application, OS, server, and network layers.
  • Establish engineering standards, drive automation, and define technical roadmaps with cross-functional teams.
  • Mentor engineers on distributed systems and high-performance networking best practices.

Requirements

  • Degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field.
  • Strong programming experience in C++, Go, Python, or Rust.
  • Proven experience leading engineering teams or complex infrastructure projects.
  • Hands-on experience with RDMA, RoCE, InfiniBand, and Linux networking (TCP/IP, routing, congestion control).
  • Experience with monitoring and visualization tools such as Prometheus, Grafana, Datadog, or OpenTelemetry.
  • Must be able to work onsite in Palo Alto, California

Nice to have

  • Experience supporting AI/ML, HPC, or GPU cluster infrastructure workloads.
  • Knowledge of eBPF, XDP, DPDK, or Linux kernel networking tools (tcpdump, Wireshark, ethtool).
  • Experience with Kubernetes, cloud infrastructure (AWS, GCP, Azure), and infrastructure automation.
  • Experience designing time-series data systems and high-cardinality telemetry platforms.

Culture & Benefits

  • Competitive compensation and eligibility for the company's equity program.
  • Catered lunch.
  • Friendly and inclusive workplace culture.
  • Comprehensive benefits package.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →