Company hidden
Posted 22 hours ago

Sr. SRE Platform Software Engineer (AI)

Work format: hybrid
Employment type: full-time
Grade: senior
English: B2
Country: US

Job description

TL;DR

Sr. SRE Platform Software Engineer (AI/Cloud): Lead the design and evolution of a next-generation public cloud platform for AI and Bitcoin mining infrastructure, with an emphasis on global scalability, reliability, and performance. The role focuses on building observability stacks, Kubernetes operators, and automated hardware lifecycle management for large-scale GPU workloads.

Location: Hybrid (Remote and In Person) in San Jose, CA or Austin, TX

Company

hirify.global is a world-leading technology company specializing in AI and Bitcoin mining infrastructure, providing comprehensive computational solutions and cloud capabilities globally.

What you will do

  • Lead the end-to-end architecture for CPU, GPU, RDS, storage, networking, and AI services.
  • Develop and maintain critical SRE components including collection agents, metrics stores, and alerting frameworks.
  • Implement fault-prediction engines and automated remediation workflows.
  • Manage hardware lifecycle and DC operations via ZTP pipelines and BMC/IPMI management.
  • Build and optimize CI/CD and GitOps pipelines using Argo, Flux, and Helm.
  • Design multi-region deployments to power large-scale AI and enterprise workloads.

Requirements

  • 7+ years of production software engineering experience, including 2+ years of on-call operational experience.
  • Production-depth mastery of Go (preferred), Rust, or Java, and proficiency in Python.
  • Strong grasp of distributed systems fundamentals (Raft, Paxos, consistent hashing, idempotency).
  • Experience building production-scale observability stacks (Prometheus, VictoriaMetrics, Loki, etc.).
  • Proven experience writing Kubernetes controllers or CRDs handling production traffic.
  • Expertise in mTLS bootstrap and secrets management using HashiCorp Vault or cloud KMS.

Nice to have

  • Deep understanding of NVIDIA internals, DCGM, and MIG/vGPU partitioning.
  • Experience with InfiniBand or RoCE fabrics and NCCL collective tracing.
  • Management of HPC storage (Lustre, DDN, VAST, or NVMe-oF).
  • Hands-on experience with BMC, IPMI, and Redfish at OEM scale.
  • Familiarity with Kubernetes GPU Operator, Slurm controller, or Ray GCS.

Culture & Benefits

  • Opportunity to lead the architecture of high-impact AI computational infrastructure.
  • Hybrid work environment blending remote flexibility with in-person collaboration.
  • Commitment to equal employment opportunities and a diverse professional environment.
