Company hidden
3 hours ago

Staff+ Software Engineer, Observability (AI)

$405,000 - $485,000
Work format
hybrid
Employment type
full-time
Grade
lead
English
B2
Country
US

Job description

TL;DR

Staff+ Software Engineer (Observability, AI): Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across hirify.global’s multi-cluster infrastructure, with an emphasis on the reliability and operational excellence of research and product systems. The role focuses on reducing mean time to detection and resolution by building cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling.

Location: San Francisco, CA | New York City, NY | Seattle, WA. We expect all staff to be in one of our offices at least 25% of the time.

Salary: $405,000 - $485,000 USD

Company

hirify.global’s mission is to create reliable, interpretable, and steerable AI systems.

What you will do

  • Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across hirify.global’s multi-cluster infrastructure.
  • Own and evolve core observability platforms, driving migrations and architectural improvements that improve reliability, reduce cost, and scale with organizational growth.
  • Build instrumentation libraries, SDKs, and integrations that make it easy for engineering teams to emit high-quality telemetry from their services.
  • Drive alerting and SLO infrastructure that enables teams to define, monitor, and respond to reliability targets with minimal noise.
  • Reduce mean time to detection and resolution by building cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling.
  • Partner with Research, Inference, Product, and Infrastructure teams to ensure observability solutions meet the unique needs of each organization.

Requirements

  • Have 10+ years of relevant industry experience building and operating large-scale observability or monitoring infrastructure.
  • Have deep experience with at least one observability signal area (metrics, logging, tracing, or error analytics) and familiarity with the others.
  • Understand high-throughput data pipelines, columnar storage engines, and the tradeoffs involved in ingesting and querying telemetry data at scale.
  • Have experience operating or building on top of observability platforms such as Prometheus, Grafana, ClickHouse, OpenTelemetry, or similar systems.
  • Have strong proficiency in at least one of Python, Rust, or Go.
  • Have excellent communication skills and enjoy partnering with internal teams to improve their operational visibility and incident response capabilities.

Nice to have

  • Experience operating metrics systems at very high cardinality (hundreds of millions of active time series or more).
  • Experience with log storage migrations or operating columnar databases (ClickHouse, BigQuery, or similar) for analytics workloads.
  • Experience with OpenTelemetry instrumentation, collector pipelines, and tail-based sampling strategies.
  • Experience building or operating alerting platforms, on-call tooling, or SLO frameworks at scale.
  • Interest in applying AI/LLMs to operational workflows such as automated root cause analysis, anomaly detection, or intelligent alerting.

Culture & Benefits

  • Competitive compensation and benefits.
  • Optional equity donation matching.
  • Generous vacation and parental leave.
  • Flexible working hours.
  • Lovely office space in which to collaborate with colleagues.
