AI Reliability & Monitoring Engineering Lead (AI)

$256,000 - $276,000

Work format

hybrid

Employment type

full-time

Grade

lead

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

AI Reliability & Monitoring Engineering Lead (AI): Define, build, and maintain the infrastructure and processes that keep hirify.global’s AI-powered API and agentic systems reliable, scalable, and performant in production, with an emphasis on monitoring, availability, incident response, and automation. The role supports AI services and tools trusted by millions of developers globally.

Location: Expected to come into the office 3 days a week if based in the San Francisco Bay Area, Boston, Bangalore, Hyderabad, London, or New York.

Salary: $256,000 to $276,000

Company

hirify.global is the world’s leading API platform, used by more than 40 million developers and 500,000 organizations, including 98% of the Fortune 500.

What you will do

Develop and manage reliability metrics (SLOs) for AI-driven API services and agentic AI platform features.
Implement comprehensive observability and monitoring systems for real-time performance and fault detection.
Design and drive automated failover, recovery, and incident response strategies for high-availability AI infrastructure.
Optimize resource utilization, particularly GPU/accelerator efficiency, ensuring cost-effective AI system operation.
Collaborate closely with engineering, platform, and product teams to align reliability efforts with broader organizational goals.
Drive continuous improvement in deployment practices, monitoring approaches, and incident management processes.

Requirements

Have a strong background in AI reliability engineering, SRE, or DevOps for distributed systems.
Understand the unique challenges of maintaining large-scale AI systems and integrating AI-specific metrics into reliability frameworks.
Are experienced with cloud platforms, monitoring tools, and incident response automation.
Are comfortable collaborating across teams to influence best practices for AI system reliability and operational health.
Thrive in dynamic, fast-paced environments focused on delivering reliable, safe AI-powered services.

Nice to have

Hands-on experience with AI/ML infrastructure, including GPU/xPU optimization and scaling.
Familiarity with API platform operations and large-scale distributed services.
Prior experience building or operating observability tools tailored for AI and agentic systems.
Contributions to open-source projects or reliability engineering thought leadership.

Culture & Benefits

Flexible schedule and a fun, collaborative team.
Full medical coverage, flexible PTO, wellness reimbursement, and a monthly lunch stipend.
Wellness programs to help you stay in the best physical and mental health.
Frequent and fascinating team-building events will keep you connected, while our donation-matching program can support the causes you care about.
Hybrid work model.

