Staff Site Reliability Engineer
Job description
TL;DR
Staff Site Reliability Engineer (AWS/Kubernetes): Establish and evolve SRE best practices across the organization, including reliability principles, error budgets, incident response, and observability strategy, with an emphasis on SLIs/SLOs, alerting, dashboards, and automation. The role focuses on designing software-driven infrastructure solutions, leading large initiatives, and improving platform resilience, scalability, and developer workflows.
Location: Remote
Company
The company is the fastest-growing healthcare technology company, building products that make prescriptions accessible and affordable, including the BlinkRx pharma-to-patient cloud and Quick Save for better access to medications.
What you will do
- Establish and evolve SRE best practices, including reliability principles, error budgets, incident response, postmortems, and operational readiness.
- Define and drive observability strategy for system health with SLIs/SLOs, alerting, dashboards, and service indicators.
- Design and implement software-driven infrastructure solutions to automate processes and reduce toil.
- Act as a technical leader: set priorities and influence decisions across cloud infrastructure, reliability tooling, and platform architecture.
- Own large, ambiguous initiatives from concept to delivery, aligning stakeholders across engineering, security, and product.
- Improve platform resilience, scalability, performance, and compliance; identify risks and lead upgrades.
- Partner with teams to enhance developer workflows, tooling, and operational maturity; provide mentorship and code reviews.
- Lead incident response, escalation, postmortems, and knowledge sharing through documentation.
Requirements
- Bachelor’s or Master’s in Computer Science or equivalent; 7+ years in SRE, infrastructure, or platform engineering at scale.
- Expert troubleshooting across the full stack (application, kernel, network); strong Linux and OS fundamentals.
- Advanced networking: load balancing, proxies, DNS, TCP/IP, NAT, service communication.
- Experience with Python, Go, and Bash for automating operations and building internal tools.
- Deep cloud experience (AWS preferred; GCP or Azure acceptable), Kubernetes (EKS, Helm), observability systems, containers, and microservices.
- IaC with Terraform, Pulumi, CloudFormation, or Ansible; holistic infrastructure design for cost, reliability, security.
Culture & Benefits
- Highly collaborative team of builders and operators inventing new approaches in healthcare.
- Impact millions of patients at the intersection of healthcare and finance; help build a generational company.
- Cross-functional environment that is relentlessly learning, curious, and aggressively collaborative.
- Equal opportunity employer valuing diversity.