Engineer, Supercomputing & Distributed Systems (AI)
Job description
TL;DR
Engineer, Supercomputing & Distributed Systems (AI): Building and operating the infrastructure for Krea's research and inference, including distributed training, Kubernetes GPU clusters, and petabyte-scale data pipelines, with an emphasis on custom distributed datastores and job orchestration systems. Focus on scaling workloads and research across clusters in multiple datacenters and building fault-tolerance systems for large-scale pretraining.
Location: On-site in San Francisco
Company
Krea is building next-generation AI creative tools, dedicated to making AI intuitive and controllable for creatives.
What you will do
- Design multi-stage pipelines that turn petabytes of raw data into clean, annotated datasets.
- Manage distributed training and inference on 1000+ GPU Kubernetes clusters.
- Profile and optimize dataloaders streaming thousands of images per second.
- Customize and train models to filter billions of images.
- Build fault-tolerance systems for large-scale pretraining.
Requirements
- Experience with Python, PyArrow, DuckDB, SQL, PyTorch, Pandas, NumPy.
- Experience with Kubernetes.
- Fundamental knowledge of containerization, operating systems, file systems, and networking.
- Intuition for distributed systems and a great mental model of how systems interact and function under different conditions.