Company hidden
1 day ago

Research Engineer - Data (AI)

$350,000 - $400,000
Job type: full-time
Grade: senior
English: B2
Country: US

Job description

TL;DR

Research Engineer - Data (AI): Build and drive the data foundation for research efforts in materials, energy, and physical sciences, with an emphasis on sourcing scientific datasets, integrating experimental data, and ensuring high-quality inputs for frontier models. Focus on designing scalable pipelines, data quality systems, and tooling for reproducibility and researcher collaboration.

Location: Lab in Menlo Park. Candidates based in Menlo Park or San Francisco are preferred, with flexibility depending on the role.

Compensation: $350,000-$400,000 annual base, commensurate with experience.

Company

AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly.

What you will do

  • Own data strategy across the training stack, identifying gaps and shaping the roadmap with research leads.
  • Source, evaluate, and procure external datasets in chemistry, physics, materials science, and more.
  • Build pipelines for ingesting, processing, and versioning large-scale heterogeneous datasets.
  • Design data quality systems including deduplication, filtering, and normalization at scale.
  • Integrate lab experimental data, simulations, and model outputs into the training stack.
  • Develop tooling for data inspection, querying, metadata tracking, and reproducibility.
  • Collaborate on token budgets, data mixing, and curriculum design with ML engineers.

Requirements

  • Bachelor’s degree or equivalent.
  • Experience building large-scale data pipelines for LLM pretraining or midtraining.
  • Expertise in data quality techniques such as MinHash, SimHash, perplexity filtering, and PII scrubbing.
  • Experience with scientific data formats (papers, patents, simulations, lab exports) and their normalization.
  • Distributed processing with Spark, Ray, or Dask at TB/PB scale.
  • Dataset versioning and lineage tracking (DVC, Delta Lake).
  • Strong Python skills for production tooling; ability to collaborate with ML researchers.
  • Research mindset: comfortable with experiments and iteration.

Nice to have

  • Curating scientific datasets for domain-adaptive pretraining or tuning.
  • Synthetic data generation and verification.
  • Background in physical science or engineering.
  • Multimodal data integration (text, numerical, molecular, spectral).

Culture & Benefits

  • Visa sponsorship: Yes, with legal support.
  • Operate at frontier pace with deep expertise, ownership, and drive.
  • Team of top scientists, engineers, and problem-solvers defining the frontier.
