TL;DR
Senior Site Reliability Engineer (AI): Building and optimizing high-availability, fault-tolerant infrastructure solutions for an AI-powered enterprise platform with an accent on automating operational tasks, defining SLOs, and leading incident response. Focus on designing scalable systems on public clouds (AWS, GCP, Azure), proactive problem-solving, and championing reliability best practices for rapid growth.
Location: This is a hybrid position, based out of our New York City or London hubs.
Salary: $157,700–$277,800
Company
hirify.global is where leading enterprises orchestrate AI-powered work, providing an end-to-end platform for building and deploying AI agents grounded in company data.
What you will do
- Automate operational tasks and infrastructure management by developing robust tools and platforms using Python, Go, or similar languages.
- Design and implement scalable, fault-tolerant infrastructure solutions on public cloud providers (AWS, GCP, Azure) to support a high-traffic AI platform.
- Own the reliability, performance, and efficiency of core services, defining and upholding stringent Service Level Objectives (SLOs) and Error Budgets.
- Own the observability stack for monitoring, logging, and alerting systems across complex distributed systems.
- Lead incident response, post-mortems, and root cause analyses to prevent future outages and build resilient system architecture.
- Collaborate closely with product and engineering teams, providing expert guidance on system design for reliability, performance, and scalability.
Requirements
- 7+ years of experience in site reliability engineering, DevOps, or similar roles focused on large-scale, high-availability production systems.
- Deep expertise with cloud platforms (AWS strongly preferred), containerization technologies like Docker and Kubernetes, and Infrastructure-as-Code tools such as Terraform.
- Strong proficiency in programming languages such as Python, Java, or Go for automation and monitoring.
- Knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack).
- Demonstrated ability to identify systemic weaknesses and propose innovative solutions to complex reliability problems.
- Excellent communication, collaboration, and problem-solving skills with cross-functional teams.
- A strong sense of ownership and accountability for mission-critical systems.
Culture & Benefits
- Generous PTO and company holidays.
- Medical, dental, and vision coverage for you and your family.
- Paid parental leave for all parents (12 weeks) and fertility/family planning support.
- Early-detection cancer testing and flexible spending account options.
- Health savings account for eligible plans with company contribution (US employees).
- Annual work-life stipends for wellness and learning & development.
- Company-wide and team off-sites.
- Competitive compensation, company stock options, and 401k (US employees).
Будьте осторожны: если вас просят войти в iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →