Company hidden
6 hours ago

Site Reliability Engineer (AI)

$142,696 – $158,303
Work format
remote (USA only)
Employment type
fulltime
Grade
senior
English
b2
Country
US
Job from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (AI): Establish and manage reliability, observability, and incident response for AI services, with an emphasis on SLOs, error budgets, and operational readiness reviews. The role focuses on automating toil, monitoring AI-specific failure modes, and ensuring production scalability.

Location: Remote (100% telework). US citizenship and a DoD Secret security clearance are required at time of hire.

Salary: $142,696 – $158,303 per year

Company

The company (name hidden in this listing) develops a diverse portfolio of high-technology solutions and products for defense and scientific missions.

What you will do

  • Define and manage SLOs and error budgets for every AI service moving to production.
  • Build and maintain monitoring, logging, and alerting infrastructure to detect degradations.
  • Establish incident management procedures, lead post-incident reviews, and drive corrective actions.
  • Perform operational readiness reviews to ensure services meet reliability, security, and operational standards.
  • Forecast capacity needs and monitor costs associated with tokens, compute, and storage.
  • Identify and automate repetitive operational tasks to eliminate toil.

Requirements

  • Bachelor's or Master's degree in Software Engineering, Computer Science, or a related STEM field.
  • 5-8+ years of production SRE or DevOps experience owning system reliability.
  • Hands-on experience with Prometheus, Grafana, Datadog, ELK, or CloudWatch.
  • Strong scripting and automation skills using Python, Bash, and IaC (Terraform or CloudFormation).
  • Experience with Docker and Kubernetes orchestration at scale.
  • US citizenship and a Department of Defense Secret security clearance are required at time of hire.

Nice to have

  • Experience with AI/ML production systems, including model serving and inference monitoring.
  • Multi-cloud experience across AWS, Azure, and GCP.
  • Familiarity with Google SRE principles and their practical application.
  • Experience in high-compliance environments such as defense, healthcare, or finance.

Culture & Benefits

  • 100% telework (fully remote).
  • 9/80 work schedule.
  • Highly competitive benefits package.
  • Collaborative environment valuing trust, honesty, and transparency.
  • Opportunity to define SRE practices from scratch for an AI platform.
