Job description
TL;DR
Sr. SRE Engineer II (EPICS, NG-SIEM): Own the reliability and scalability of a next-generation SIEM platform, with an emphasis on end-to-end observability, coordinated scaling, and incident response across complex distributed pipelines. Focus on building automation and scaling systems that keep ingest, search, and workflow execution healthy under 24/7 high-volume load.
Location: Australia (Sydney) — hybrid; expected in the Sydney office (Level 18, 141 Walker Street, North Sydney) 2–3x a week.
Company
Global cybersecurity company building an AI-native security platform.
What you will do
- Design, build, and maintain end-to-end observability (monitoring and synthetic tests) across the NG-SIEM pipeline from ingest through search and workflow execution.
- Engineer coordinated scaling solutions that treat the NG-SIEM pipeline as a unified system and eliminate cascading bottlenecks across dependent components (e.g., Kafka, ingest pipelines, downstream services).
- Lead platform-wide incident response (P2 and above) as a subject matter expert, diagnosing and resolving multi-component failures and coordinating incident communications; participate in follow-the-sun on-call rotations.
- Build capacity forecasting and cost management models for end-to-end pipeline dimensions; develop tooling to track and surface cost drivers.
- Automate remediation via runbooks (e.g., pipeline-wide scaling responses, CID rebalancing, infrastructure healing) to resolve issues before customer impact.
- Collaborate with cross-functional teams to triage SLO breaches, drive problem management, and improve long-term platform resilience and efficiency.
Requirements
- 10+ years of experience in software engineering, site reliability engineering, or platform engineering, with significant time on large-scale distributed systems.
- Strong proficiency in at least one systems programming language (Go, Java, Rust, or C++) and one scripting language (Python, Bash).
- Deep experience with end-to-end observability: building monitoring pipelines, defining SLIs/SLOs, and creating dashboards for multi-service architectures.
- Proven ability to diagnose and resolve complex 24/7 incidents spanning multiple distributed components.
- Experience with coordinated capacity planning and scaling for large infrastructure footprints.
- Hands-on experience with streaming platforms (Kafka or similar), including backpressure, partition management, and consumer group dynamics at scale.
Nice to have
- Experience in a similar reliability/platform role at a hyperscaler (AWS, Azure, GCP) or large-scale SaaS provider.
- Track record of automated remediation and self-healing infrastructure.
- Cost modeling and unit economics for large compute and storage footprints.
- Familiarity with cloud-native architectures and serverless computing.
- Exposure to log management, cybersecurity products, or security operations workflows.
- Experience with disaster recovery planning and execution for multi-region systems.
Culture & Benefits
- Hybrid work, with an expectation of being in the Sydney office 2–3x per week.
- Market-leading compensation and equity awards.
- Comprehensive physical and mental wellness programs.
- Competitive vacation and holidays; paid parental and adoption leave.
- Professional development opportunities for all employees.
- Vibrant office culture and employee networks/volunteer opportunities.