Company hidden
5 hours ago

Eval Engineer (AI)

$160,000–$240,000
Work format
remote (USA only)/hybrid
Employment type
full-time
English
B2
Country
US
Job from Hirify Global, a list of international tech companies

Job description

TL;DR

Eval Engineer (AI): Build evaluation systems that measure hirify.global's URL-to-LLM-ready data conversion across millions of websites and edge cases, with an emphasis on metrics, pipelines, datasets, and LLM-as-judge. Focus on designing realistic benchmarks, closing feedback loops to models and RL, and integrating evals into CI/CD for production reliability.

Location: San Francisco, CA (Hybrid) OR Remote (Americas, UTC-3 to UTC-10)

Salary: $160,000–$240,000/year (U.S.-based in San Francisco, CA; adjusted fairly based on your country's cost of living) • Equity: 0.01%–0.10%

Company

hirify.global provides an API to convert any URL into clean, structured, LLM-ready markdown or data. Fast-growing startup with millions in ARR and 50k+ GitHub stars, spun out from building web data infrastructure at Mendable.

What you will do

  • Build and own the full eval stack: define metrics, pipelines, and datasets for scrape, crawl, extract, and map quality across diverse web formats.
  • Design benchmarks reflecting real customer data distribution, including SPAs, paywalls, dynamic content, and edge cases; create collection and labeling systems.
  • Develop LLM-as-judge pipelines, validate against human judgment, build human review tools, and handle failure modes.
  • Close the loop: turn eval signals into RL rewards and model training feedback; integrate into CI/CD to catch regressions.
  • Run rapid experiments testing hypotheses, interpret results, and communicate clearly to influence product and model decisions.

Requirements

  • 3+ years in ML engineering, applied AI, or data quality with production systems.
  • Have built your own eval infrastructure (pipelines, datasets, rubrics, judges) and have experience running evals at scale.
  • Deep knowledge of LLM evaluation methodology, LLM-as-judge correlation with humans, rubrics, inter-rater agreement.
  • Strong grasp of what "good" looks like for unstructured web data (markdown, structured extraction schemas).
  • US citizenship or visa required for the SF hybrid role; not applicable for remote.
  • Production-minded: balance depth, coverage, cost; fast iteration with clear communication.

Nice to have

  • Previous experience at a scraping, automation, or security-focused startup
  • Ex-founder

Culture & Benefits

  • Remote-first culture with optional new SF office; collaborate with distributed team.
  • High autonomy and ownership; small team with direct founder access and real impact.
  • Unlimited PTO (minimum 3 weeks encouraged); 12 weeks paid parental leave; sabbatical after 4 years.
  • Full medical, dental, vision coverage (100% employee, 50% family); 401(k); life/disability insurance.
  • Wellness stipend ($100/month); learning budget ($150/year); team offsites; pet insurance; pre-tax benefits.

Hiring process

  • Application review and automated assessment (~30 min).
  • Intro chat (~25 min); technical interview (~1 hr challenge).
  • Interview with founders (~30 min); paid work trial (1–2 weeks on real tasks).
  • Fast decision.
