Staff Site Reliability Engineer
Job description
TL;DR
Staff Site Reliability Engineer: Responsible for all aspects of production data center services, including servers, operating systems, storage, and supporting systems, with an emphasis on availability, latency, performance, efficiency, and scalability. Focus on incident response, troubleshooting, reducing toil, and driving systemic fixes through platform engineering.
Location: Remote (USA); San Jose, California, USA
Salary: $119,000 - $170,000 USD
Company
accelerates digital transformation to ensure our customers can be more agile, efficient, resilient, and secure.
What you will do
- Own the reliability of a large-scale cloud service by partnering with Engineering and Network teams to define requirements early and conduct operability reviews.
- Develop and operate end-to-end observability and incident tooling to manage SLOs/error budgets, reduce alert noise, and improve detection and diagnosis of system issues.
- Participate in an on-call rotation to lead full-cycle incident response and perform deep cross-stack troubleshooting to drive permanent software fixes.
- Build and maintain everything-as-code for fleet and service lifecycle, driving provisioning, configuration, release automation, and complex rollout/rollback workflows.
- Continuously improve platform hygiene through consistent OS/app upgrades, dependency/vulnerability patching, capacity and performance tuning, and strict CI/CD validation prior to production rollouts.
Requirements
- US Citizenship is required due to the nature of assigned customers.
- 5+ years of industry experience in software engineering, infrastructure software, and/or platform engineering.
- Proficiency in at least one programming language (such as Python, Bash, or Go) with demonstrated ability to write production-quality code.
- Strong Linux/Unix systems fundamentals and solid understanding of networking protocols and components.
- Proven experience operating production services and ability to participate in on-call rotations and support occasional after-hours or weekend deployments.
- Experience managing BSD in production, with a focus on driving systemic fixes through platform engineering.
Nice to have
- Proven expertise in operating Kubernetes at scale.
- Deep experience with the Prometheus/OpenTelemetry ecosystems, including instrumenting golden signals, defining SLOs, and tuning alerts for high-availability environments.
Culture & Benefits
- Comprehensive and inclusive benefits to meet the diverse needs of employees and their families throughout their life stages.
- Committed to building a team that reflects the communities served and the customers worked with.
- Foster an inclusive environment that values all backgrounds and perspectives, emphasizing collaboration and belonging.