TL;DR
Site Reliability Engineer (AWS): Improving and protecting the reliability, performance, and operability of production systems, with an emphasis on evolving AWS-based infrastructure. Focus on designing end-to-end observability, defining SLIs/SLOs, and contributing to tooling and automation.
Location: Remote from anywhere in the U.S. (excluding Alabama, Alaska, Connecticut, Hawaii, Kentucky, Mississippi, Nebraska, New Mexico, North Dakota, Rhode Island, South Dakota, West Virginia, and Wyoming), with an option for hybrid work in San Francisco, CA; Los Angeles, CA; Toronto, Canada; or Raleigh, NC.
Salary: $120,000–$150,000 base salary + annual bonus
Company
hirify.global is transforming the commercial contracting industry with a software platform that gives contractors AI-driven tools. The company has reached a $1 billion valuation and raised $275M+ in funding.
What you will do
- Drive and refine modern SRE practices including SLIs/SLOs and reliability reviews.
- Design and maintain end-to-end observability (metrics, logs, traces, dashboards, alerts).
- Partner with product and engineering teams on service reliability (architecture, failure modes, rollout, capacity).
- Evolve and operate AWS infrastructure using Infrastructure as Code (Terraform).
- Contribute code to services, tooling, and automation for reliability improvements.
- Participate in incident response and post-incident reviews for production issues.
Requirements
- 3+ years of professional experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
- Thorough understanding of, and hands-on experience with, modern SRE practices (SLIs/SLOs, automation, safe deployments, post-incident reviews).
- Software engineering experience, with the ability to write and maintain production-quality code in Python or Node.js/TypeScript.
- Strong observability skills, including designing metrics, logging, tracing, dashboards, and alerts, with experience in tools such as Datadog, Prometheus, and Grafana.
- Experience with AWS in production, Terraform, and container/orchestration platforms (Docker with ECS, EKS, or Kubernetes).
- English: B2 required.
Nice to have
- Experience using LLMs to assist with day-to-day work.
- Incident management experience (participating in or coordinating incident response, using tools such as incident.io, PagerDuty, or OpsGenie).
Culture & Benefits
- Generous equity grant and a MacBook provided.
- Comprehensive benefits package and flexible PTO.
- Work-from-home stipend.
- Hybrid work schedules in hubs with lunch provided for in-office days.
- Company events (BBQs, team-building activities, both in-person and virtual).
- Opportunities for growth, career advancement, and working with cutting-edge technology.