Senior Engineer Network Observability
Job description
TL;DR
Senior Engineer Network Observability (System Engineering): Design, develop, and maintain network monitoring, telemetry, and observability systems for a large-scale GPU cloud network, with an emphasis on real-time insights, anomaly detection, and automated alerting. The role focuses on building scalable telemetry solutions, integrating diverse network platforms, and ensuring network reliability through advanced observability tools.
Location: Hybrid in Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA, with remote work considered for candidates located more than 30 miles from an office.
Salary: $139,000–$204,000
Company
The company is a publicly traded, AI-focused cloud infrastructure provider delivering high-performance GPU cloud platforms for AI labs, startups, and enterprises.
What you will do
- Develop and maintain network observability platforms using Python and Go.
- Collaborate with engineering teams to unify logs, metrics, and events from multiple network platforms into a single observability pipeline.
- Design scalable telemetry solutions using protocols such as gNMI and SNMP and tools such as Prometheus, Grafana, and Alertmanager.
- Integrate observability solutions across infrastructure and participate in architectural decisions.
- Participate in on-call rotations to troubleshoot and resolve observability issues.
- Mentor junior team members and promote continuous learning.
Requirements
- Must be a U.S. person or eligible to access export controlled information per U.S. Government regulations.
- Experience with Prometheus, Grafana, Alertmanager, gNMI, SNMP, Python, Go, Kubernetes, and Linux networking.
- Strong knowledge of IP networking, routing, switching, and network troubleshooting.
- Experience with network platforms such as Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, and SR Linux.
- Passion for automation and containerized workloads.
- Bachelor’s degree in Computer Science or a related field preferred.
Nice to have
- Experience with machine learning for anomaly detection.
- Network certifications like CCNA or CCNP.
- Familiarity with distributed tracing tools like OpenTelemetry, Jaeger, or Zipkin.
- Experience with advanced metrics, analytics, and event correlation.
Culture & Benefits
- Comprehensive medical, dental, vision, and life insurance fully paid by employer.
- Flexible PTO, paid parental leave, and family-forming support.
- 401(k) with employer match and employee stock purchase program.
- Casual work environment with catered lunches at office locations.
- Focus on innovation, collaboration, and continuous learning.