Senior Site Reliability Engineer

Work format

onsite

Employment type

full-time

Grade

senior

English level

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Site Reliability Engineer (AI Cloud): Ensure that hirify.global's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security, with an emphasis on defining SLOs, managing capacity across distributed GPU networks, and running incident response. The role focuses on designing monitoring and alerting, building automation for resource allocation, and implementing secure rollouts and tenant isolation.

Location: San Francisco, CA (onsite, full-time)

Company

hirify.global Labs is democratizing AI compute through an open-access GPU marketplace aggregating global resources for affordable AI inference and innovation.

What you will do

Define and maintain SLOs/SLAs for job success rates, trust, and economic efficiency in the GPU marketplace.
Build monitoring, alerting, and observability systems for deep infrastructure visibility.
Manage capacity planning, forecasting, and resource allocation across distributed GPU suppliers.
Lead incident response, on-call rotations, post-mortems, and resilience improvements to reduce MTTR.
Implement secure deployment mechanisms such as progressive rollouts, canary deployments, and automated rollbacks.
Enhance infrastructure security with tenant isolation, secrets and key management, and compliance frameworks.

Requirements

Expert-level SRE experience, including defining and monitoring SLOs/SLAs for production systems.
Strong skills in capacity planning, resource allocation, and cost optimization for distributed systems.
Proven track record of running incident response, on-call, and post-mortem processes that improve resilience.
Deep knowledge of deployment systems: progressive rollouts, canary, feature flags, rollbacks.
Proficient in observability: metrics, logging, tracing, alerting (Prometheus, Grafana, ELK).
Strong infrastructure security: tenant isolation, network segmentation, secrets/key management.
Knowledge of compliance (SOC 2, ISO 27001) and IaC, config management, CI/CD.
Excellent at debugging complex distributed systems under pressure.

Nice to have

Experience with GPU infrastructure, AI/ML platforms, or compute marketplaces.
Background in distributed/peer-to-peer systems or decentralized infrastructure.
Multi-tenancy security, container/runtime security, chaos engineering.
Cost optimization for cloud/GPU, high-uptime systems (99.9%+ SLAs).
Experience at AWS, Google Cloud, Azure, or infrastructure startups; open-source contributions.

Culture & Benefits

High-impact role in a Series A startup led by AI/Math/CS PhD founders.
Focus on open-source technology and making AI accessible globally.
Equal opportunity employer committed to diversity and inclusion.
