Site Reliability Engineer (AI Infrastructure)
Job description
TL;DR
Site Reliability Engineer (AI Infrastructure): Build and scale the systems that power AI agents in production, with an emphasis on reliability, observability, and developer experience. The role focuses on designing platform services, APIs, and SDKs that enable safe and efficient consumption of AI infrastructure as a service.
Location: Remote (Available in UK, Argentina, Brazil, Bulgaria, Canada, Chile, Colombia, Cyprus, Czech Republic, Hungary, Ireland, Lithuania, Mexico, Peru, Poland, Portugal, Romania, South Africa, Spain, Sweden, Switzerland, UAE)
Company
The company is one of the world's longest-standing crypto platforms, trusted by over 10 million individuals and institutions globally.
What you will do
- Design, build, and operate the infrastructure layer supporting AI agent workflows in production.
- Develop platform services, APIs, SDKs, and self-service capabilities for engineering teams to consume AI infrastructure.
- Manage compute, orchestration, and serving infrastructure for model inference using Kubernetes and AWS.
- Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads.
- Utilize Terraform for Infrastructure as Code (IaC) and maintain CI/CD pipelines for rapid deployment of AI services.
- Collaborate with AI and Data Engineering teams to harden experimental agent prototypes into production systems.
Requirements
- 5+ years of experience as an SRE, Infrastructure Engineer, or Platform Engineer in a production environment.
- Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production.
- Proficiency with Terraform, Kubernetes, Docker, and AWS.
- Strong scripting skills (bash/shell) and proficiency in Python.
- Experience building developer platforms, internal tooling, or APIs consumed by engineering teams at scale.
- Experience implementing incident response procedures and participating in on-call rotations.
Nice to have
- Experience with agent orchestration frameworks like LangGraph or CrewAI.
- Background in data infrastructure including Airflow, Kafka, or Spark.
- Experience with Cloudflare's product ecosystem (networking, security, Zero Trust).
- Experience working in fast-moving 0→1 environments or platform-building teams.
Culture & Benefits
- Remote-first work environment across multiple global jurisdictions.
- Merit-based hiring culture that celebrates diverse talents and perspectives.
- Opportunity to work at the intersection of data infrastructure and applied AI in a high-stakes production environment.
- Focus on developer experience and long-term scalability.