Staff Site Reliability Engineer
Job description
TL;DR
Staff Site Reliability Engineer (SRE/AI): Lead development of AI-assisted reliability tooling that analyzes tickets, logs, and traces to resolve outages faster, with an emphasis on observability, incident response, and automation. Focus on defining SLO/SLI frameworks, scaling cloud operations for SaaS, and mentoring engineers on SRE practices.
Remote Argentina
Company
Domino builds a platform for AI-driven organizations to develop and operate data science and AI solutions at scale, serving customers like Johnson & Johnson, GSK, and UBS.
What you will do
- Lead development of internal AI-assisted reliability tooling to analyze tickets, logs, traces, and documentation for faster outage resolution.
- Improve observability coverage and signal quality for critical customer-facing systems.
- Own end-to-end incident response, documentation, and prevention of recurrence.
- Guide development of customer-facing observability tools within the product.
- Define and mature SLO/SLI frameworks for priority services.
- Scale cloud operations for single-tenant SaaS and improve deployment reliability.
- Mentor engineers and shape SRE practices, workflows, and culture.
Requirements
- Deep experience in SRE, platform engineering, or software engineering with operational ownership.
- Fluency with Kubernetes, Linux, cloud platforms, and observability tooling for production problem-solving.
- Strong ability to identify and close reliability gaps in products, tools, and processes.
- Strong software engineering skills in Python or Go for building reliable internal tools.
- Comfort leading ambiguous technical work and influencing teams without authority.
- History of improving reliability through engineering and automation.
- Strong communication skills and experience mentoring others in technical decision-making.
- Sound judgment on AI/LLM tooling in operational workflows.
Nice to have
- Experience with LLM-based systems, retrieval workflows, SaaS operations, or support/developer tooling.
Culture & Benefits
- A growing, diverse team; people from all backgrounds are encouraged to apply.
- Value growth mindset, creative problem-solving, and seeking opportunities for success.
- Emphasize truth-seeking, authenticity, and continuous improvement in all areas.
- Environment of teaching and learning to equip employees for success.