Company hidden
2 days ago

Production Engineer – Team Lead (AI)

196,000 – 262,000 SGD
Work format: remote (Singapore only)/hybrid
Employment type: full-time
Grade: lead
English: B2
Country: Singapore
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Production Engineer – Team Lead (AI): Managing cloud infrastructure stability and reliability for an AI hyperscaler, with an emphasis on incident management, SLO tracking, and operational excellence. Focus on driving rapid incident resolution, implementing automation to reduce mean time to detect (MTTD) and mean time to resolve (MTTR), and mentoring Production Engineers.

Location: Hybrid in Singapore. Remote work may be considered for candidates located more than 30 miles from an office based on specialized skill sets.

Salary: 196,000 – 262,000 SGD

Company

hirify.global is the AI Hyperscaler™, delivering a cloud platform of cutting-edge services powering the next wave of AI.

What you will do

  • Act as Incident Commander, providing decisive leadership to resolve critical incidents and minimize platform impact.
  • Lead root cause analysis (RCA) and post-incident reviews (PIR) to implement sustainable long-term solutions.
  • Define and track Service Level Objectives (SLOs) to guide prioritization and improve system resilience.
  • Spearhead automation strategies to reduce MTTD and MTTR, increasing overall platform reliability.
  • Mentor and develop Production Engineers, fostering a culture of knowledge sharing and professional growth.
  • Develop and maintain incident response playbooks and escalation processes for diverse failure scenarios.

Requirements

  • 4+ years of experience in production engineering, SRE, or cloud operations.
  • Deep knowledge of Kubernetes-based infrastructure, AWS, or GCP.
  • Proficiency with monitoring tools like Prometheus and Grafana and a strong understanding of observability.
  • Hands-on experience with Python, Bash, and Terraform.
  • Proven ability to make critical decisions under pressure during high-stakes incidents.
  • Must be a U.S. person or otherwise eligible to access export-controlled information under U.S. Government regulations.

Nice to have

  • Previous experience in a formal Incident Commander role.
  • Advanced knowledge of distributed systems and containerization.
  • Experience developing and managing self-healing infrastructure.

Culture & Benefits

  • 100% company-paid medical, dental, and vision insurance.
  • 401(k) with generous employer match and Employee Stock Purchase Program (ESPP).
  • Flexible PTO and paid parental leave.
  • Catered lunch provided daily in office and data center locations.
  • Comprehensive wellness support via Spring Health and family-forming support through Carrot.
