Job description
TL;DR
Staff Engineer, Network Observability: Define and evolve the technical direction for network observability, building resilient telemetry systems with an emphasis on scalable collectors, persistence, and alerting across logs, metrics, events, and flows. The role centers on leading cross-team standardization, making high-leverage architectural tradeoffs for reliability and incident response, and mentoring engineers while serving as a senior escalation point.
Location: Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
Salary: $207,000–$275,000 (base)
Company
CoreWeave provides cloud infrastructure and tools for building and scaling AI.
What you will do
- Set technical direction for network observability across multiple teams, aligning platform, data models, and telemetry strategy with long-term goals.
- Design and evolve scalable observability solutions using collectors (e.g., gNMI, SNMP, Prometheus scraping, OpenTelemetry), persistence (e.g., Loki, ClickHouse), and visualization/alerting (e.g., Grafana, Alertmanager) with a focus on reliability and future scale.
- Standardize observability patterns and improve signal quality across logs, metrics, events, flows, and related diagnostics.
- Lead high-leverage technical tradeoffs with engineering leadership to improve resilience, scalability, and operator efficiency.
- Serve as a go-to expert for critical observability challenges and coordinate during incidents.
- Mentor engineers through technical reviews and design guidance; participate in RFCs and architectural decisions; join rotating on-call as a senior escalation point.
Requirements
- Deep expertise building flexible network observability solutions across collectors, distribution, processing, persistence, alerting, analytics, and visualization.
- Experience as a Network Engineer, SRE, Software Engineer, or Systems Engineer in large-scale environments, with a track record operating observability or infrastructure platforms for multiple teams.
- Proven ability to lead through ambiguity and make sound architectural and operational tradeoffs balancing near-term needs and long-term maintainability.
- Strong systems thinking and practical experience designing resilient, scalable solutions that improve visibility and incident response.
- Proficiency with Python, Go, and Bash; familiarity with configuration management and templating (e.g., Ansible, Jinja2).
- Hands-on Linux and IP networking knowledge, including routing/switching and network troubleshooting; experience with networking platforms such as SONiC, HPE Junos, NVIDIA Cumulus Linux, Nokia SR OS, or SR Linux.
Nice to have
- Experience applying machine learning techniques/tools to proactively detect performance or security anomalies in network traffic.
- Experience with OpenTelemetry, Jaeger, Zipkin, or similar end-to-end tracing tooling.
- Experience shaping technical roadmaps and leading platform investments that improved reliability or scalability across multiple teams.
- Network certifications such as CCNA, CCNP, or similar.
Culture & Benefits
- Medical, dental, and vision insurance fully paid by CoreWeave; company-paid life insurance; disability coverage.
- 401(k) with generous employer match; Flexible PTO; tuition reimbursement; ESPP participation.
- Health Savings Account and Flexible Spending Account; mental wellness benefits via Spring Health.
- Paid parental leave and family-forming support; childcare support with Kinside.
- Flexible, casual work environment with a culture focused on innovative disruption.
Hiring process
- Interviews and technical evaluation focused on observability/networking expertise and architectural judgment.
- Discussion of role fit, experience alignment, and collaboration/mentorship approach.