Site Reliability Engineer (AI)
Job description
TL;DR
Senior Site Reliability Engineer (AI): Building and optimizing highly available, performant, and reliable AI platform services, with an emphasis on scalable infrastructure design and incident prevention. Focus on automating operational tasks, ensuring system resilience on public clouds, and leading incident response for complex distributed systems.
Location: Hybrid in New York City or London
Salary: $157,700–$277,800
Company
The company is a leader in enterprise generative AI, providing an end-to-end platform for businesses to orchestrate AI-powered work and deploy AI agents.
What you will do
- Automate operational tasks and infrastructure management by developing robust tools and platforms using Python, Go, or similar languages.
- Design and implement scalable, fault-tolerant infrastructure solutions on public cloud providers (AWS, GCP, Azure).
- Own the reliability, performance, and efficiency of core services, defining and upholding stringent Service Level Objectives (SLOs) and Error Budgets.
- Own the observability stack for monitoring, logging, and alerting systems.
- Lead incident response, post-mortems, and root cause analyses.
- Collaborate closely with product and engineering teams, providing expert guidance on system design for reliability, performance, and scalability.
Requirements
- 7+ years of experience in site reliability engineering, DevOps, or a similar role focused on building and operating large-scale, high-availability production systems.
- Deep expertise with cloud platforms (AWS strongly preferred), containerization technologies like Docker and Kubernetes, and Infrastructure-as-Code tools such as Terraform.
- Strong proficiency in programming languages such as Python, Java, or Go for automation and monitoring.
- Knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack).
- Excellent communication, collaboration, and problem-solving skills.
- A strong sense of ownership and accountability for mission-critical systems.
- Hybrid work from New York City or London hubs is required.
Culture & Benefits
- Generous PTO, plus company holidays.
- Medical, dental, and vision coverage for you and your family, and paid parental leave.
- Fertility and family planning support, and early-detection cancer testing.
- Flexible spending account, dependent FSA, and health savings account with company contribution.
- Annual work-life stipends for wellness, learning, and development.
- Company-wide off-sites and team off-sites.
- Competitive compensation, company stock options, and 401(k) (for US full-time employees).