Company hidden

22 hours ago

Sr. SRE Platform Software Engineer (AI)

Work format

hybrid

Employment type

full-time

Grade

senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Sr. SRE Platform Software Engineer (AI/Cloud): Lead the design and evolution of a next-generation public cloud platform for AI and Bitcoin mining infrastructure, with an emphasis on global scalability, reliability, and performance. The role focuses on building observability stacks, Kubernetes operators, and automated hardware lifecycle management for large-scale GPU workloads.

Location: Hybrid (Remote and In Person) in San Jose, CA or Austin, TX

Company

hirify.global is a world-leading technology company specializing in AI and Bitcoin mining infrastructure, providing comprehensive computational solutions and cloud capabilities globally.

What you will do

Lead the end-to-end architecture for CPU, GPU, RDS, storage, networking, and AI services.
Develop and maintain critical SRE components including collection agents, metrics stores, and alerting frameworks.
Implement fault-prediction engines and automated remediation workflows.
Manage hardware lifecycle and DC operations via ZTP pipelines and BMC/IPMI management.
Build and optimize CI/CD and GitOps pipelines using Argo, Flux, and Helm.
Design multi-region deployments to power large-scale AI and enterprise workloads.

Requirements

7+ years of production software engineering experience, including 2+ years of on-call operational experience.
Production-depth mastery of Go (preferred), Rust, or Java, and proficiency in Python.
Strong grasp of distributed systems fundamentals (Raft, Paxos, consistent hashing, idempotency).
Experience building production-scale observability stacks (Prometheus, VictoriaMetrics, Loki, etc.).
Proven experience writing Kubernetes controllers or CRDs handling production traffic.
Expertise in mTLS bootstrap and secrets management using HashiCorp Vault or cloud KMS.

Nice to have

Deep understanding of NVIDIA internals, DCGM, and MIG/vGPU partitioning.
Experience with InfiniBand or RoCE fabrics and NCCL collective tracing.
Management of HPC storage (Lustre, DDN, VAST, or NVMe-oF).
Hands-on experience with BMC, IPMI, and Redfish at OEM scale.
Familiarity with Kubernetes GPU Operator, Slurm controller, or Ray GCS.

Culture & Benefits

Opportunity to lead the architecture of high-impact AI computational infrastructure.
Hybrid work environment blending remote flexibility with in-person collaboration.
Commitment to equal employment opportunities and a diverse professional environment.
