Senior Site Reliability Engineer (AI)
Job description
TL;DR
Senior Site Reliability Engineer (AI): Design and operate scalable, reliable, and secure infrastructure that supports large-scale AI and HPC workloads, with an emphasis on CI/CD pipelines, Kubernetes orchestration, and observability. The role focuses on driving automation, enforcing SRE best practices, and ensuring high availability across globally distributed platforms.
Location: Remote (United States)
Salary: $109,600 – $164,400
Company
The company is a leader in AI-powered cloud and digital infrastructure, providing sovereign AI capabilities and high-performance compute infrastructure globally.
What you will do
- Design and maintain robust CI/CD pipelines using GitLab CI, Azure DevOps, or Jenkins.
- Operate and optimize Kubernetes clusters to ensure scalability, performance, and resilience.
- Develop Infrastructure as Code (IaC) using Terraform, Helm, and Ansible to automate provisioning.
- Implement monitoring and observability stacks using Prometheus, VictoriaMetrics, Grafana, and ELK/EFK.
- Lead root cause analysis (RCA) and define SRE metrics including SLAs, SLOs, and error budgets.
- Provide mentorship to junior engineers and participate in on-call rotations for critical services.
Requirements
- Must be based in the United States.
- 5+ years of experience in DevOps, SRE, or platform engineering in production environments.
- Proven expertise in managing Kubernetes clusters (GKE, EKS, AKS, or self-managed).
- Proficiency in scripting and programming with Python, Bash, or Go.
- Strong experience with Terraform, Helm, or Ansible.
- Bachelor's or Master's degree in Computer Science or a related technical field.
Nice to have
- Experience supporting AI/ML or HPC workloads in production.
- Knowledge of GPU resource management and workload schedulers.
- Familiarity with large-scale distributed systems and performance tuning.
Culture & Benefits
- Inclusive environment with a diverse team of over 1,100 employees from 68 nationalities.
- Culture based on grit, passion, and driving meaningful impact.
- Comprehensive benefits package including bonuses on top of base salary.
- Focus on trust, accountability, and high performance.