Senior Site Reliability Engineer (AI)

Work format

onsite

Employment type

fulltime

Grade

senior

English

Country

Singapore

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Site Reliability Engineer (AI): Ensure the reliability, performance, and scalability of AI products, model-serving infrastructure, and backend API systems, with an emphasis on automating operations and improving observability. The focus is on building resilient systems, solving complex infrastructure problems, and supporting AI workloads in production.

Location: The role is based in the Singapore office and may require up to one trip per year.

Company

hirify.global is on a global mission to revolutionize the way the world games.

What you will do

Administer, monitor, and manage cloud-scale production environments for AI model APIs, backend services, and high-traffic web systems serving global users.
Design and implement fault-tolerant, autoscaling cloud architectures tailored for AI inference workloads, including GPU-based environments and software products.
Build automated self-recovery systems to ensure high availability, rapid failover, and cost-efficient resource usage for all software products.
Manage and monitor AI model-serving platforms, inference engines, vector databases, data pipelines, and software applications.
Implement and maintain comprehensive monitoring, logging, and alerting for all AI and backend services.
Work closely with software engineering, ML engineering, and release management to enhance operational procedures, deployment processes, and incident response workflows.

Requirements

5+ years of relevant experience in SRE, DevOps, infrastructure engineering, or cloud operations.
Experience operating production services with significant availability or scaling demands.
Strong knowledge of web technologies such as HTTP, REST, SSL, load balancers, and web proxies (NGINX).
Comfortable with Linux and Docker administration.
Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), container orchestration (AWS ECS or K8s), version control (Git), and databases (MySQL, NoSQL).
Strong ability to code and script (preferably Bash scripting and Python).
Strong analytical skills, with the ability to debug deployment problems without relying on developers.
Bachelor’s or Master’s degree in computer science, AI, or a similar discipline from an accredited institution.

Culture & Benefits

Opportunity to make a global impact while working with a team spread across 5 continents.
Gamer-centric #LifeAthirify.global experience that will accelerate your growth, both personally and professionally.
Inclusive, respectful, and fair workplace for every employee across all the countries we operate in.
