Senior Site Reliability Engineer
Job description
TL;DR
Senior Site Reliability Engineer (SRE): Ensure the IFIaaS SaaS platform is reliable, available, and performant, with an emphasis on SLO/SLI ownership, end-to-end observability, and incident response across distributed on-prem/hybrid environments. The role focuses on designing HA/DR failover, automating runbooks and provisioning, and maintaining secure, compliant operations while reducing MTTA/MTTR and toil.
Location: Hybrid — three in-office days per week in Minneapolis, Ottawa, Colorado, or Dallas (primary posting location: Shakopee, MN).
Salary: $129,098-$189,343 per year (US).
Company
The company provides identity-centric security solutions, enabling trusted identities, payments, and data protection.
What you will do
- Own SLOs/SLIs for availability (99.9%), latency, error rate, and quality of service across microservices.
- Design and operate observability (metrics, logs, traces, synthetic checks, and real-user monitoring) and instrument services with structured logs and trace context.
- Build health probes, SLA monitors, and on-call/monitoring tooling (e.g., Splunk On-Call, Prometheus, Datadog), and use metrics to detect and diagnose issues.
- Lead incident response (triage, communications, coordination, mitigation) and run blameless postmortems with actionable follow-ups.
- Maintain and improve runbooks, escalation paths, paging policies, and MTTA/MTTR reduction programs; implement war room protocols during incidents.
- Automate provisioning and configuration drift detection/correction; manage patching, backups/restores (RPO/RTO), and compliance evidence for PCI-DSS/PCI-CP and SOC 2/ISO 27001.
Requirements
- 5+ years of experience in SRE, DevOps, or software engineering supporting distributed, production-grade environments, including troubleshooting microservices and Windows/VMware systems in on-prem/hybrid infrastructure.
- Hands-on automation and observability experience, including Terraform/Ansible/DSC, CI/CD, and enterprise monitoring, logging, metrics, and tracing tools (e.g., Datadog, Prometheus, Splunk).
- Infrastructure automation proficiency (e.g., Terraform, Ansible, Jenkins, Octopus, PowerShell DSC).
- Proficiency in VMware, Windows Server administration, networking fundamentals, and system-level performance analysis.
- Hands-on experience operating and troubleshooting enterprise microservices, APIs, and distributed application stacks in on-prem/hybrid environments.
- Must provide after-hours production support on a rotational basis to ensure 24/7/365 availability.
Nice to have
- Experience operating in compliance-sensitive environments (PCI-DSS, PCI-CP, SOC 2) with strong integrity and accountability.
- Leadership behaviors and communication skills, including leading by example and driving operational excellence.
Culture & Benefits
- Hybrid flexibility with three in-office days per week; distributed workforce.
- Comprehensive US health and well-being programs, including medical, vision, dental, and 401(k) matching.
- Paid personal time off plus 12 paid holidays, parental leave, life/disability insurance, and education reimbursement (eligibility applies).
- Discretionary annual incentive plan eligibility.
- Focus on operational excellence, blameless postmortems, and continuous improvement.
Hiring process
- Recruiter screen followed by interviews to assess SRE/observability/incident response experience and operational practices.
- Compensation and eligibility details discussed with the recruiter.