Company hidden
5 days ago

Member of Engineering (Pre-training / Data Engineering, AI)

Work format
remote (Europe/United States only)
Employment type
full-time
Grade
senior
English
B2
Country
UK/US
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Member of Engineering (Pre-training / Data Engineering): Architecting and maintaining high-performance pipelines that process trillions of raw tokens into high-quality datasets for foundation models, with an emphasis on ingestion, deduplication, streaming systems, and petabyte-scale data handling. Focus on algorithmic sorting, distributed pipeline optimization, and bridging raw web crawls to GPU clusters to directly influence model performance.

Location: Remote (EMEA/East Coast); London, UK

Company

hirify.global is an AI company building agentic systems and coding assistants powered by frontier models to accelerate software development towards AGI for security-conscious enterprises.

What you will do

  • Build and maintain high-performance pipelines for processing trillions of tokens into diverse, high-quality datasets for pre-training foundation models and coding agents.
  • Engineer ingestion, deduplication, and streaming systems handling petabyte-scale data from raw web crawls to GPU clusters.
  • Optimize data modeling, algorithmic sorting, and distributed pipelines to enhance model performance.
  • Collaborate closely with Pre-training, Post-training, Evals, and Product teams to align datasets with model capabilities and use cases.

Requirements

  • Strong background in production-grade, distributed data systems for machine learning.
  • Experience with orchestration tools like Slurm, Airflow, or Dagster.
  • Experience with observability and reliability tooling: CI/CD, Grafana, Prometheus.
  • Infra skills: Git, Docker, Kubernetes, managed cloud services, batch inference (e.g., vLLM).
  • Expert-level Python, strong algorithmic foundations, proficiency with Polars, Dask, or PySpark.
  • An obsession with performance on large-scale GPU clusters and distributed pipelines.

Nice to have

  • Experience building trillion-scale SOTA pre-training datasets.
  • Translating research to production at scale.
  • Experience with OCR, web crawling, or evals.
  • Prior experience pre-training LLMs.

Culture & Benefits

  • Fully remote work with flexible hours.
  • 37 days/year of vacation & holidays.
  • Health insurance allowance for you & dependents.
  • Company-provided equipment, plus well-being, always-be-learning, and home-office allowances.
  • Frequent team get-togethers, including a monthly 3-day collaboration in Paris (Mon-Wed, with an open invitation to stay longer) and annual off-sites.
  • Diverse, inclusive, people-first culture with a low-ego, kind-hearted team focused on collaboration and mission.

Hiring process

  • Intro call with a Founding Engineer.
  • Technical interview(s) with a Founding Engineer.
  • Team fit call with the People team.
  • Final interview with a Founding Engineer.
