TL;DR
Staff+ Software Engineer (Observability, AI): Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across hirify.global’s multi-cluster infrastructure, with an emphasis on the reliability and operational excellence of research and product systems. The role focuses on reducing mean time to detection and resolution through cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling.
Location: San Francisco, CA | New York City, NY | Seattle, WA. All staff are expected to be in one of our offices at least 25% of the time.
Salary: $405,000 - $485,000 USD
Company
hirify.global’s mission is to create reliable, interpretable, and steerable AI systems.
What you will do
- Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across hirify.global’s multi-cluster infrastructure.
- Own and evolve core observability platforms, driving migrations and architectural improvements that improve reliability, reduce cost, and scale with organizational growth.
- Build instrumentation libraries, SDKs, and integrations that make it easy for engineering teams to emit high-quality telemetry from their services.
- Drive alerting and SLO infrastructure that enables teams to define, monitor, and respond to reliability targets with minimal noise.
- Reduce mean time to detection and resolution by building cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling.
- Partner with Research, Inference, Product, and Infrastructure teams to ensure observability solutions meet the unique needs of each organization.
Requirements
- Have 10+ years of relevant industry experience building and operating large-scale observability or monitoring infrastructure.
- Have deep experience with at least one observability signal area (metrics, logging, tracing, or error analytics) and familiarity with the others.
- Understand high-throughput data pipelines, columnar storage engines, and the tradeoffs involved in ingesting and querying telemetry data at scale.
- Have experience operating or building on top of observability platforms such as Prometheus, Grafana, ClickHouse, OpenTelemetry, or similar systems.
- Have strong proficiency in at least one of Python, Rust, or Go.
- Have excellent communication skills and enjoy partnering with internal teams to improve their operational visibility and incident response capabilities.
Nice to have
- Experience operating metrics systems at very high cardinality (hundreds of millions of active time series or more).
- Experience with log storage migrations or operating columnar databases (ClickHouse, BigQuery, or similar) for analytics workloads.
- Experience with OpenTelemetry instrumentation, collector pipelines, and tail-based sampling strategies.
- Experience building or operating alerting platforms, on-call tooling, or SLO frameworks at scale.
- Interest in applying AI/LLMs to operational workflows such as automated root cause analysis, anomaly detection, or intelligent alerting.
Culture & Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Lovely office space in which to collaborate with colleagues.