Data Acquisition Engineer (AI)
Job description
TL;DR
Data Acquisition Engineer (AI): Designing and operating large-scale web crawlers and data pipelines for pre-training frontier LLMs, with an emphasis on distributed systems and high-throughput ingestion. Focus on maximizing data recall from high-value sources, building observability tooling, and aligning sourcing with model training needs.
Location: Remote (EMEA or US East Coast). Includes a monthly 3-day in-person collaboration in Paris (Monday-Wednesday).
Company
An AI research company building agentic systems and frontier models to accelerate software development and reach AGI.
What you will do
- Design, build, and operate a large-scale web crawler responsible for acquiring openly accessible internet data.
- Develop specialized deep crawlers targeting high-value sources to improve recall and coverage.
- Own the long-term roadmap for data acquisition in collaboration with data researchers.
- Build observability, monitoring, and debugging tooling to ensure infrastructure reliability.
- Collaborate with pre-training, post-training, and evaluations teams to align data priorities.
- Build high-throughput ingestion pipelines for rapidly onboarding and evaluating partner data.
Requirements
- Must be based in EMEA or US East Coast.
- Strong distributed systems background with proven experience building large-scale infrastructure (data pipelines, web crawlers).
- Proficiency in Python, including performance optimization and debugging complex production systems.
- Hands-on experience with web crawling, HTTP protocols, and distributed job queues.
- Familiarity with AWS, Kubernetes, and Docker for managing high-throughput workloads.
- Knowledge of data privacy, robots.txt adherence, and responsible crawling practices.
Nice to have
- Prior experience pre-training LLMs.
- Experience building trillion-scale SOTA pre-training datasets.
- Experience translating research into production at scale.
Culture & Benefits
- Fully remote work with flexible hours.
- 37 days of vacation and holidays per year.
- 16 weeks of flexible, full-pay parental leave.
- Health insurance allowance for employees and dependents.
- Company-provided equipment and allowances for home office and learning.
- Frequent team gatherings and mandatory monthly off-sites in Paris.
Hiring process
- Introductory call with a Founding Engineer.
- Technical interview(s) with Engineering team members.
- Team fit call with the People team.
- Final interview with a Founding Engineer.