Назад
Company hidden
9 часов назад

Site Reliability Engineer (AI)

157 700 - 277 800$
Формат работы
hybrid
Тип работы
fulltime
Грейд
senior
Английский
b2
Страна
UK, US
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Senior Site Reliability Engineer (AI): Building and optimizing high-availability, fault-tolerant infrastructure solutions for an AI-powered enterprise platform with an accent on automating operational tasks, defining SLOs, and leading incident response. Focus on designing scalable systems on public clouds (AWS, GCP, Azure), proactive problem-solving, and championing reliability best practices for rapid growth.

Location: This is a hybrid position, based out of our New York City or London hubs.

Salary: $157,700–$277,800

Company

hirify.global is where leading enterprises orchestrate AI-powered work, providing an end-to-end platform for building and deploying AI agents grounded in company data.

What you will do

  • Automate operational tasks and infrastructure management by developing robust tools and platforms using Python, Go, or similar languages.
  • Design and implement scalable, fault-tolerant infrastructure solutions on public cloud providers (AWS, GCP, Azure) to support a high-traffic AI platform.
  • Own the reliability, performance, and efficiency of core services, defining and upholding stringent Service Level Objectives (SLOs) and Error Budgets.
  • Own the observability stack for monitoring, logging, and alerting systems across complex distributed systems.
  • Lead incident response, post-mortems, and root cause analyses to prevent future outages and build resilient system architecture.
  • Collaborate closely with product and engineering teams, providing expert guidance on system design for reliability, performance, and scalability.

Requirements

  • 7+ years of experience in site reliability engineering, DevOps, or similar roles focused on large-scale, high-availability production systems.
  • Deep expertise with cloud platforms (AWS strongly preferred), containerization technologies like Docker and Kubernetes, and Infrastructure-as-Code tools such as Terraform.
  • Strong proficiency in programming languages such as Python, Java, or Go for automation and monitoring.
  • Knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack).
  • Demonstrated ability to identify systemic weaknesses and propose innovative solutions to complex reliability problems.
  • Excellent communication, collaboration, and problem-solving skills with cross-functional teams.
  • A strong sense of ownership and accountability for mission-critical systems.

Culture & Benefits

  • Generous PTO and company holidays.
  • Medical, dental, and vision coverage for you and your family.
  • Paid parental leave for all parents (12 weeks) and fertility/family planning support.
  • Early-detection cancer testing and flexible spending account options.
  • Health savings account for eligible plans with company contribution (US employees).
  • Annual work-life stipends for wellness and learning & development.
  • Company-wide and team off-sites.
  • Competitive compensation, company stock options, and 401k (US employees).

Будьте осторожны: если вас просят войти в iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →