TL;DR
Senior Staff SRE (AI): Design, operate, and evolve cloud infrastructure and operational platforms that power mission-critical SaaS and IoT services, with an emphasis on observability, intelligent automation, and AIOps capabilities. The role focuses on defining the technical vision for operational intelligence, leading large-scale automation initiatives, and embedding reliability into system design.
Location: Remote (USA) or Onsite (San Diego, CA)
Salary: $207,000.00 – $261,000.00
Company
hirify.global designs, operates, and evolves cloud infrastructure and operational platforms for mission-critical SaaS and IoT services at global scale.
What you will do
- Define and drive long-term strategy for observability, operational intelligence, and reliability engineering across the organization.
- Lead the evolution towards intelligent operations by designing AIOps capabilities such as anomaly detection, event correlation, and automated remediation.
- Architect and lead the end-to-end observability platform across metrics, logs, traces, and events.
- Drive large-scale automation initiatives, including self-service infrastructure workflows, policy-as-code guardrails, and automated response.
- Partner with product, platform, and data teams to embed reliability, performance, cost efficiency, and fault tolerance into system design.
- Provide technical leadership during high-severity incidents and guide blameless postmortems.
Requirements
- 8–10+ years of experience in SRE, platform engineering, or cloud infrastructure roles supporting large-scale production environments.
- Demonstrated experience leading architecture, reliability strategy, or operational platforms across multiple teams.
- Deep expertise designing and operating large-scale AWS environments, including services like VPC, EC2, EKS/ECS, RDS/DynamoDB, and S3.
- Senior-level experience with observability platforms (New Relic, Datadog, Prometheus/Grafana, OpenTelemetry).
- Expert-level experience with Infrastructure-as-Code using Terraform and/or CloudFormation, including GitOps workflows.
- Strong scripting or programming skills (Python, Go, Bash) and expert understanding of Linux systems, networking, and Kubernetes.
Nice to have
- Experience implementing or evaluating AIOps capabilities such as anomaly detection or predictive alerting.
- Familiarity with applying machine learning or AI techniques to operational data or reliability workflows.
Culture & Benefits
- Comprehensive medical, dental, and vision insurance, with Health Savings Account and Flexible Spending Accounts.
- 401(k) plan with 401(k) match.
- Flexible Time Off (FTO) or Paid Time Off (PTO), plus 11 paid holidays and 1 inclusive holiday per year.
- Employee Well-Being program and Education Reimbursement Program.
- Commitment to building a diverse and inclusive workforce and providing reasonable accommodation for candidates with disabilities.