TL;DR
Software Engineer (Site Reliability Engineering): Building and optimizing high-scale cloud services with an emphasis on platform uptime, performance, and health data visualization. The role focuses on turning monitoring strategy into active, high-fidelity signals for real-time alerting and incident response, and on integrating reliability testing into the software development lifecycle.
Location: Onsite in San Francisco, Seattle, Palo Alto, or Bellevue, USA
Company
hirify.global is a technology organization that manages high-level frameworks for measuring platform uptime and performance, bridging reporting and individual engineering teams.
What you will do
- Provide input into long-range platform requirements and operational guidelines, making health data actionable for service owners.
- Analyze and understand service telemetry, driving continuous improvement of health signals.
- Partner with internal engineering teams to integrate global availability standards into monitoring pipelines and automated alerting flows.
- Identify and mitigate onboarding friction by leveraging automated test suites for streamlined reliability signals.
- Serve as a technical subject matter expert for centralized infrastructure services (logging, monitoring, and data platforms).
- Lead the integration of failure signals into standard engineering workflows, ensuring that failures generate automated work items and prompt proactive investigations.
Requirements
- A related technical degree.
- 5+ years of proven experience in production environments (software engineer, systems engineer, service owner, or lead developer).
- Fluency in Java or a similar object-oriented language (Python, C++, etc.).
- Deep understanding of telemetry systems and experience building or managing production monitoring and alerting frameworks.
- Experience using Linux environments and the ability to navigate complex, distributed system architectures.
- Familiarity with core web technologies: HTTP, JSON, REST, and XML.
Nice to have
- Previous experience in a Service Owner or Technical Lead role within a high-scale, multi-tenant cloud environment.
- Strong background in Site Reliability Engineering (SRE) principles and industry-standard availability best practices.
- Experience with automated testing tools and practices (e.g., Selenium, integration testing, or chaos engineering).
- Log parsing and data analysis experience using platforms such as Splunk or ELK.
- Experience with SQL and relational databases (PostgreSQL, Oracle, etc.).
Culture & Benefits
- Be part of the Availability Standards team, influencing platform uptime and performance.
- Follow a consultative engineering approach, partnering with service owners.
- Advocate for the customer and influence the product roadmap by ensuring world-class availability.
- Work within a team focused on maintaining world-class availability.