This vacancy is archived

Company hidden
4 days ago

Site Reliability Engineer (Kubernetes)

Work format
remote
Employment type
full-time
Grade
middle
English
B2

Job description

TL;DR

Site Reliability Engineer (Kubernetes): Improve the availability, performance, and scalability of large-scale, multi-cloud SaaS environments, with an emphasis on automation, observability, and incident response. The role focuses on designing backend services and production engineering tools while integrating AI-assisted workflows to improve operational efficiency.

Company

hirify.global is a software company providing a platform to manage, accelerate, and secure software delivery from code to production.

What you will do

  • Support the reliability, performance, and scalability of large-scale, multi-cloud, Kubernetes-based SaaS environments.
  • Investigate and troubleshoot production issues across distributed systems and infrastructure in collaboration with Engineering teams.
  • Design and develop backend services, internal platforms, and production engineering tools using Python or Go.
  • Improve observability and operational readiness through SRE practices, monitoring, and capacity planning.
  • Evaluate and contribute to AI-assisted automation solutions to improve troubleshooting and production workflows.
  • Participate in on-call rotations and lead incident response to ensure system stability.

Requirements

  • 2-4 years of experience in SRE, Production Engineering, or DevOps roles.
  • Hands-on experience with Kubernetes-based containerized workloads.
  • Experience with at least one public cloud provider: AWS, GCP, or Azure.
  • Proficiency in developing backend services or automation tools using Python, Go, or similar languages.
  • Strong understanding of Linux fundamentals, networking, and production troubleshooting.
  • Familiarity with CI/CD tools and observability platforms like Prometheus or Grafana.

Nice to have

  • Experience using AI-assisted operational workflows for log analysis or incident triage.
  • Familiarity with agentic automation frameworks such as LangGraph or LangChain.
  • Experience with AI-assisted development tools like GitHub Copilot or Cursor.

Culture & Benefits

  • Opportunity to work on a mission-critical platform used by the majority of the Fortune 100.
  • Collaborative, impact-driven environment centered on modern SRE practices.
  • Continuous learning culture with exposure to cutting-edge technologies and AI integration.