Staff Site Reliability Engineer - Site Experience
Job description
TL;DR
Staff Site Reliability Engineer (Site Experience): Lead reliability engineering initiatives for critical user-facing systems at internet scale, with an emphasis on APIs, content delivery, feed generation, search, messaging, and real-time experiences. The role centers on designing highly available architectures, reducing operational risk, driving automation, and leading incident response.
Location: Dublin, Ireland
Company
The company is a community of communities built on shared interests. It is home to 100,000+ active communities and 126 million daily active unique visitors, making it one of the internet’s largest sources of information.
What you will do
- Drive reliability, scalability, and operational excellence for critical user-facing systems including APIs, content delivery, feeds, search, messaging, and real-time experiences.
- Partner with product and infrastructure teams to architect systems for massive global load, guiding decisions on failover, redundancy, degradation, traffic management, and capacity planning.
- Identify risks and bottlenecks, build mitigation strategies, and drive improvements to reduce incidents and enhance service health.
- Build automation and tooling to eliminate repetitive work and to improve deployment safety, incident response, and reliability guardrails.
- Lead incident response, blameless postmortems, root cause analysis, and long-term fixes.
- Champion best practices for SLIs/SLOs, capacity management, release engineering, and operational maturity; mentor engineers to raise reliability culture.
Requirements
- 8+ years in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems.
- Strong collaboration and communication skills to influence technical direction across teams.
- Experience supporting high-traffic, user-facing production environments.
- Deep understanding of distributed systems, networking, Linux systems, or cloud-native architectures.
- Strong programming skills in Go, Python, or similar.
- Strong knowledge of observability (metrics, logging, tracing, alerting), SLOs, automation, incident management, and performance optimization.
Nice to have
- Experience with internet-scale traffic, Kubernetes, containers, and cloud infrastructure.
- Familiarity with Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, CDN optimization, or global infrastructure.
- Open source contributions or technical community participation.
- Experience leading large-scale incident response and operational transformations.
Culture & Benefits
- Global benefits including workspace support, professional development, caregiving, family planning, gender-affirming care, mental health & coaching.
- Private medical, dental, and vision benefits; personal retirement savings with matching; cycle-to-work and tax saver schemes.
- Flexible vacation, paid volunteer time off, generous paid parental leave.