Data Ingestion Engineer (AI)
Job description
TL;DR
Data Ingestion Engineer (AI): Build and operate large-scale ingestion systems that turn web data into high-quality training corpora for frontier AI models, with an emphasis on distributed systems, data extraction, and pipeline scalability. The role focuses on experimenting with crawling strategies, optimizing dataset delivery, and closing the feedback loop between data collection and model performance.
Location: On-site in San Francisco, London, or New York
Company
An AI startup dedicated to building open-weight foundation models, drawing on talent from top research institutions.
What you will do
- Build and operate large-scale data ingestion systems, including web crawling, extraction, and dataset versioning
- Develop specialized crawlers to acquire high-priority data sources for training
- Analyze ingested data to identify quality gaps, redundancy, and performance bottlenecks
- Collaborate with researchers to evaluate how extraction methods impact model capabilities
- Scale ingestion pipelines to handle multi-TB to PB-scale data efficiently
- Debug production issues and maintain robust, observable ingestion infrastructure
Requirements
- Experience building web crawling or large-scale data acquisition systems using Ray, Beam, or Spark
- Familiarity with LLM training processes and an intuition for high-quality data
- Ability to work with PB-scale datasets and ensure system observability and maintainability
- Strong experimental mindset to iterate on system improvements based on data performance
- Excellent communication skills to articulate system behavior and architectural tradeoffs
- Must be able to work on-site in San Francisco, London, or New York
Culture & Benefits
- Competitive salary and equity packages
- Comprehensive medical, dental, and vision insurance
- Fully paid parental leave and family planning support
- Lunch and dinner provided daily
- Relocation support for eligible candidates
- Regular team off-sites and celebrations