Research Engineer - Data (AI)
Job description
TL;DR
Research Engineer - Data (AI): Build and drive the data foundation for research in materials, energy, and the physical sciences, with an emphasis on sourcing scientific datasets, integrating experimental data, and ensuring high-quality inputs for frontier models. The role focuses on designing scalable pipelines, data quality systems, and tooling for reproducibility and researcher collaboration.
Location: The lab is in Menlo Park. Candidates based in Menlo Park or San Francisco are preferred, with flexibility depending on the role.
Compensation: $350,000-$400,000 annual base salary, commensurate with experience.
Company
An AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly.
What you will do
- Own data strategy across the training stack, identifying gaps and shaping the roadmap with research leads.
- Source, evaluate, and procure external datasets in chemistry, physics, materials science, and more.
- Build pipelines for ingesting, processing, and versioning large-scale heterogeneous datasets.
- Design data quality systems including deduplication, filtering, and normalization at scale.
- Integrate lab experimental data, simulations, and model outputs into the training stack.
- Develop tooling for data inspection, querying, metadata tracking, and reproducibility.
- Collaborate on token budgets, data mixing, and curriculum design with ML engineers.
Requirements
- Bachelor’s degree or equivalent.
- Experience building large-scale data pipelines for LLM pretraining or midtraining.
- Expertise in data quality techniques such as MinHash, SimHash, perplexity filtering, and PII scrubbing.
- Experience working with scientific data formats (papers, patents, simulations, lab exports) and normalizing them.
- Experience with distributed processing using Spark, Ray, or Dask at TB/PB scale.
- Experience with dataset versioning and lineage tracking (e.g., DVC, Delta Lake).
- Strong Python skills for production tooling, and the ability to collaborate with ML researchers.
- Research mindset: comfortable running experiments and iterating.
Nice to have
- Curating scientific datasets for domain-adaptive pretraining or tuning.
- Synthetic data generation and verification.
- Background in physical science or engineering.
- Multimodal data integration (text, numerical, molecular, spectral).
Culture & Benefits
- Visa sponsorship: Yes, with legal support.
- Operate at frontier pace with deep expertise, ownership, and drive.
- Team of top scientists, engineers, and problem-solvers defining the frontier.