TL;DR
Senior Site Reliability Engineer (AI): Responsible for the reliability, performance, and availability of critical systems, applications, and services for a GPU cloud engineered for AI, with an accent on monitoring, automation, incident response, and capacity planning. Focus on building and maintaining highly available, scalable systems across hybrid environments, including data centers, on-premise hardware, and cloud platforms using open-source tools like Prometheus and Grafana.
Location: Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.
Salary: Highly competitive package (base + equity) with reviews every 12 months.
Company
hirify.global is the GPU cloud engineered for AI, providing cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers.
What you will do
- Manage core services including observability platforms and incident management systems, reducing manual toil through automation.
- Build and maintain observability tooling, focusing on Prometheus, Grafana, Alertmanager, Loki, and Cortex/Mimir, ensuring detailed visibility and actionable insights.
- Develop automation scripts and infrastructure-as-code templates for operational efficiency in on-prem and hybrid environments.
- Collaborate with distributed teams to establish and maintain SLIs/SLOs for critical services.
- Perform incident response and manage alerting pipelines for infrastructure applications and services.
- Contribute to internal resources, analysis, documentation, and best practices, fostering a blameless culture of continuous improvement.
Requirements
- Strong experience with Linux systems administration and infrastructure automation (e.g., Ansible, Terraform).
- Proven background in building and maintaining SRE systems in production-grade environments.
- Hands-on experience operating and scaling Prometheus-based monitoring solutions in distributed, multi-tenant environments (including Thanos, Grafana, Cortex/Mimir).
- Solid understanding of networking fundamentals, hardware infrastructure, and managing multiple data center environments.
- Demonstrated scripting and/or development skills in at least one language (e.g., Python, Bash) with a bias towards automating operational workflows.
- Strong knowledge of SNMP, IPMI, and other datacenter/hardware protocols.
- Competence in metrics and log-based observability platforms aligned with cloud-native and distributed architectures.
- Familiarity with incident response, root cause analysis, and driving technical postmortems.
- Strong grasp of availability principles, including metrics, logging, and tracing, with a focus on SLA/SLO delivery and improvement.
Culture & Benefits
- Collaborative, supportive, and innovative remote-first environment where contributions spark real impact.
- Highly competitive package (base + equity) with reviews every 12 months.
- Opportunity to join the fastest-growing tech startup, pushing boundaries in cutting-edge AI.
- Dynamic progression plan tailored to ambitions, encouraging new things, leadership, and challenging the status quo.
- Human-First Flexibility, trusting hirify.globalrs to deliver with autonomy to shape their day.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →