Senior Site Reliability Engineer
Job description
TL;DR
Senior Site Reliability Engineer (AI Cloud): Ensure the company's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security, with an emphasis on defining SLOs, managing capacity across distributed GPU networks, and running incident response systems. The role focuses on designing monitoring and alerting, building automation for resource allocation, and implementing secure rollouts and tenant isolation.
Location: San Francisco, CA (onsite, full-time)
Company
Labs is democratizing AI compute through an open-access GPU marketplace aggregating global resources for affordable AI inference and innovation.
What you will do
- Define and maintain SLOs/SLAs for job success rates, trust, and economic efficiency in the GPU marketplace.
- Build monitoring, alerting, and observability systems for deep infrastructure visibility.
- Manage capacity planning, forecasting, and resource allocation across distributed GPU suppliers.
- Lead incident response, on-call rotations, post-mortems, and resilience improvements to reduce MTTR.
- Implement secure deployment mechanisms like progressive rollouts, canary deployments, and automated rollbacks.
- Enhance infrastructure security with tenant isolation, secrets management, key systems, and compliance frameworks.
Requirements
- Expert in SRE with experience defining/monitoring SLOs/SLAs for production systems.
- Strong capacity planning, resource allocation, and cost optimization for distributed systems.
- Proven track record in incident response, on-call, and post-mortem processes that improve resilience.
- Deep knowledge of deployment systems: progressive rollouts, canary, feature flags, rollbacks.
- Proficient in observability: metrics, logging, tracing, alerting (Prometheus, Grafana, ELK).
- Strong infrastructure security: tenant isolation, network segmentation, secrets/key management.
- Knowledge of compliance (SOC 2, ISO 27001) and IaC, config management, CI/CD.
- Excellent debugging of complex distributed systems under pressure.
Nice to have
- Experience with GPU infrastructure, AI/ML platforms, or compute marketplaces.
- Background in distributed/peer-to-peer systems or decentralized infrastructure.
- Multi-tenancy security, container/runtime security, chaos engineering.
- Cost optimization for cloud/GPU, high-uptime systems (99.9%+ SLAs).
- Experience at AWS, Google Cloud, Azure, or infrastructure startups; open-source contributions.
Culture & Benefits
- High-impact role in a Series A startup led by AI/Math/CS PhD founders.
- Focus on open-source technology and making AI accessible globally.
- Equal opportunity employer committed to diversity and inclusion.