Senior Site Reliability Engineer
Job description
TL;DR
Senior Site Reliability Engineer (SRE/Observability): Design and operate observability layers for AI platforms, build automated findings-to-fix loops, and implement reliability hardening controls for internal AI and agentic systems, with an emphasis on telemetry, runtime resilience, and production readiness. The role focuses on codifying detections and operational checks as code, improving incident workflows and post-incident learning, and keeping AI systems measurable, resilient, and controllable in production.
Location: Mexico City, Mexico
Company
A global mobile games and interactive entertainment company that develops and live-operates a diversified portfolio of games.
What you will do
- Design and operate observability for AI platforms, including audit trails, tool-call logs, correlation IDs, traces, and runtime visibility across service boundaries.
- Build automated findings-to-fix loops for AI and cloud platforms by integrating signals into pragmatic remediation workflows.
- Implement reliability and hardening controls for internal AI systems (alerting, health checks, rollback drills, kill-switch validation, rate limiting, drift detection).
- Codify detections, policies, and operational checks as code to reduce toil and prevent regressions.
- Review platform and AI-application changes from a reliability and application-hardening perspective, focusing on secrets, telemetry, external calls, risky MCP usage, and production readiness.
- Own AI-platform operational readiness and partner with central IT/SOC teams for escalations, postmortems, and shared incident workflows.
Requirements
- 5+ years in SRE, production engineering, platform operations, or security automation with strong coding ability.
- Hands-on scripting and coding experience, especially Python, with comfort working against APIs, log pipelines, and automation workflows.
- Experience building observability and alerting systems in AWS or comparable cloud environments.
- Ability to reduce operational toil through automation while keeping signal quality high and false positives manageable.
- Comfort with incident handling, rollback thinking, and evidence-driven postmortems (SLA/SLO discussions included).
- Strong interest in AI systems, agent runtimes, and MCP-style integration risks.
Culture & Benefits
- Engineering role focused on building telemetry, automation, and runtime reliability for AI platforms (not a generic SOC position).
- Work with a global team and collaborate with central IT/SOC for escalations and incident workflows.
- Support for continuous improvement through automation, post-incident learning, and repeatable playbooks.
- Background checks required after a conditional offer due to access to sensitive/confidential information.