Company hidden
4 hours ago

Senior Staff Production Engineer (AI)

$140,000 - $200,000
Work format
remote (US only)/hybrid
Employment type
fulltime
Grade
senior
English
B2
Country
US

Job description

TL;DR

Senior Staff Production Engineer (AI): Drive automation and observability across a multi-cloud infrastructure, with an emphasis on reducing Mean Time to Mitigate (MTTM) and improving scalability. The role focuses on implementing self-healing systems, defining SLIs/SLOs, and leading incident response to ensure the reliability of a global platform.

Location: Hybrid in San Jose, CA (3 days a week) or remote within the US.

Salary: $140,000 - $200,000 USD

Company

hirify.global accelerates digital transformation by providing a cloud-native Zero Trust Exchange platform that secures connections between users, devices, and applications.

What you will do

  • Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments.
  • Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems.
  • Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets.
  • Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses.
  • Work with Engineering and partner teams to conduct operability reviews.

Requirements

  • 8+ years of experience managing reliability, scalability, and availability for large-scale production services.
  • Deep expertise in programming (e.g., Python, Go, or C/C++).
  • Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture.
  • Experience in high-stakes incident management and participation in a 24/7 on-call rotation.
  • Proficiency in using ITIL frameworks and incident data to improve service maturity through systematic problem management and technical operability reviews.

Nice to have

  • Extensive experience with public cloud (AWS, Azure, GCP) and Infrastructure-as-Code (Ansible, Terraform).
  • Experience with chaos engineering and disaster recovery planning at scale.
  • Expertise in global routing (BGP) and traffic tunneling (GRE, IPsec), with a deep understanding of L7 proxy architectures (HAProxy), DNS at scale, and OS networking stack internals.

Culture & Benefits

  • Impact in your role matters more than title, and trust is built on results.
  • Value constructive, honest debate and focus on getting to the best ideas faster.
  • Build high-performing teams that can make an impact quickly and with high quality.
  • Committed to building a team that reflects the communities served and the customers worked with.
  • Foster an inclusive environment that values all backgrounds and perspectives, emphasizing collaboration and belonging.
  • Offer comprehensive and inclusive benefits to meet the diverse needs of employees and their families throughout their life stages.
