TL;DR
Staff+ Software Engineer (Observability): Building and maintaining scalable telemetry ingest and storage pipelines across hirify.global’s multi-cluster infrastructure with an accent on high-throughput data pipelines and cost-efficient columnar storage. Focus on reducing mean time to detection and resolution by building cross-signal correlation and AI-assisted diagnostic tooling.
Location: Expected to be in one of the offices at least 25% of the time.
Salary: £325,000 - £390,000 GBP
Company
hirify.global’s mission is to create reliable, interpretable, and steerable AI systems.
What you will do
- Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data.
- Own and evolve core observability platforms, driving migrations and architectural improvements.
- Build instrumentation libraries, SDKs, and integrations for engineering teams to emit high-quality telemetry.
- Drive alerting and SLO infrastructure to define, monitor, and respond to reliability targets.
- Reduce mean time to detection and resolution by building cross-signal correlation and AI-assisted diagnostic tooling.
- Partner with Research, Inference, Product, and Infrastructure teams to ensure observability solutions.
Requirements
- 10+ years of experience building and operating large-scale observability or monitoring infrastructure.
- Deep experience with at least one observability signal area (metrics, logging, tracing, or error analytics).
- Understand high-throughput data pipelines, columnar storage engines, and the tradeoffs involved in ingesting and querying telemetry data at scale.
- Experience operating or building on top of observability platforms such as Prometheus, Grafana, ClickHouse, OpenTelemetry, or similar systems.
- Proficiency in at least one of Python, Rust, or Go.
- Excellent communication skills and enjoy partnering with internal teams to improve their operational visibility and incident response capabilities.
Nice to have
- Experience operating metrics systems at very high cardinality (hundreds of millions of active time series or more).
- Experience with log storage migrations or operating columnar databases (ClickHouse, BigQuery, or similar) for analytics workloads.
- Experience with OpenTelemetry instrumentation, collector pipelines, and tail-based sampling strategies.
- Experience building or operating alerting platforms, on-call tooling, or SLO frameworks at scale.
- Experience with Kubernetes-native monitoring, eBPF-based observability, or continuous profiling.
- Interest in applying AI/LLMs to operational workflows such as automated root cause analysis, anomaly detection, or intelligent alerting.
Culture & Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Lovely office space in which to collaborate with colleagues.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →