Senior Site Reliability Engineer (SRE)
Описание вакансии
TL;DR
Senior Site Reliability Engineer (SRE/DevOps): Build and maintain scalable infrastructure and automation for traditional services and AI-driven workloads, with an emphasis on reliability, observability, and CI/CD for model deployment workflows. The role focuses on incident response, root-cause analysis, and partnering with AI/ML teams to support training, serving, and lifecycle management.
Location: Atlanta, GA
Salary: $99,090 - $123,860 USD (annual base)
Company
provides financial services focused on helping customers and communities build a better financial future.
What you will do
- Design, build, and maintain scalable infrastructure and automation tools for traditional and AI-based systems.
- Develop software to improve reliability and reduce manual toil.
- Implement and manage CI/CD pipelines, including model deployment workflows.
- Monitor performance, availability, and security using modern observability tools.
- Collaborate with data science and ML engineering teams to support AI/ML training, serving, and lifecycle management.
- Lead incident response, root cause analysis, and postmortem processes; advocate SRE principles across engineering and AI teams.
Requirements
- 5+ years of experience in SRE, DevOps, or software engineering.
- Strong programming skills (e.g., Python, Java).
- Experience supporting AI/ML workloads (model training, inference, GPU orchestration).
- Deep understanding of Linux systems, cloud platforms (primarily Azure and AWS), and container orchestration.
- Experience with infrastructure-as-code and version-control tooling (e.g., Terraform, Ansible, GitHub).
- Proficiency with monitoring/logging tools (e.g., Dynatrace) and strong networking, security, and distributed systems knowledge.
Nice to have
- Experience with AI model observability, drift detection, or performance monitoring.
- Contributions to open-source SRE/DevOps/ML infrastructure tools.
- Cloud platform certifications.
Culture & Benefits
- Health, dental, vision, and life insurance plans.
- 401(k) savings plan with generous company matching (up to 6%).
- Employer-paid retirement plan (cash balance retirement plan, 4%).
- Tuition reimbursement up to $5,250/year.
- Paid time off (20 days), paid company holidays, and a flexible Diversity Celebration Day.
- Paid volunteer time (40 hours per calendar year).
Hiring process
- Interviews to assess SRE/DevOps experience, reliability/observability practices, and experience with AI/ML infrastructure.
- Discussion of collaboration approach and incident/operations leadership.