Site Reliability Engineer (AI Infrastructure)

Work format

Remote (Europe, Canada, United States only)

Employment type

Full-time

Grade

Senior

English

Country

UK/US/Argentina +20 more

Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (AI Infrastructure): Building and scaling the systems that power AI agents in production, with an emphasis on reliability, observability, and developer experience. The role focuses on designing platform services, APIs, and SDKs that let teams consume AI infrastructure as a service safely and efficiently.

Location: Remote (Available in UK, Argentina, Brazil, Bulgaria, Canada, Chile, Colombia, Cyprus, Czech Republic, Hungary, Ireland, Lithuania, Mexico, Peru, Poland, Portugal, Romania, South Africa, Spain, Sweden, Switzerland, UAE)

Company

hirify.global is one of the world's longest-standing crypto platforms, trusted by over 10 million individuals and institutions globally.

What you will do

Design, build, and operate the infrastructure layer supporting AI agent workflows in production.
Develop platform services, APIs, SDKs, and self-service capabilities for engineering teams to consume AI infrastructure.
Manage compute, orchestration, and serving infrastructure for model inference using Kubernetes and AWS.
Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads.
Use Terraform for Infrastructure as Code (IaC) and maintain CI/CD pipelines for rapid deployment of AI services.
Collaborate with AI and Data Engineering teams to harden experimental agent prototypes into production systems.

Requirements

5+ years of experience as an SRE, Infrastructure Engineer, or Platform Engineer in a production environment.
Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production.
Proficiency with Terraform, Kubernetes, Docker, and AWS.
Strong scripting skills (bash/shell) and proficiency in Python.
Experience building developer platforms, internal tooling, or APIs consumed by engineering teams at scale.
Experience implementing incident response procedures and participating in on-call rotations.

Nice to have

Experience with agent orchestration frameworks like LangGraph or CrewAI.
Background in data infrastructure including Airflow, Kafka, or Spark.
Experience with Cloudflare's product ecosystem (networking, security, Zero Trust).
Experience working in fast-moving 0→1 environments or platform-building teams.

Culture & Benefits

Remote-first work environment across multiple global jurisdictions.
Merit-based hiring culture that celebrates diverse talents and perspectives.
Opportunity to work at the intersection of data infrastructure and applied AI in a high-stakes production environment.
Focus on developer experience and long-term scalability.
