Engineering Manager, Agent Prompts & Evals (AI)
Job description
TL;DR
Engineering Manager, Agent Prompts & Evals (AI): Leading the team that owns the infrastructure for shipping model and prompt changes with confidence, including eval frameworks, system prompt pipelines, and regression-detection systems. Focus on measuring model behavior, building collaboration with other teams, and shaping the team's investment in frontier eval development and model launch automation.
Location: San Francisco, CA or New York City, NY. Expect all staff to be in one of our offices at least 25% of the time.
Salary: $1 - $2 USD
Company
Anthropic's mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial for users and society.
What you will do
- Lead and grow a team of prompt engineers and platform software engineers.
- Own the product-side eval platform and system prompt infrastructure, including versioning, deployment, rollback, and review tooling.
- Be a steady hand through model launches, serving as the backstop when things get chaotic.
- Build durable collaboration with other evals groups across the company, focusing on ownership boundaries and shared roadmaps.
- Recruit, close, and retain engineers who want to work at the intersection of product engineering and model behavior.
- Shape where the team invests next, considering paths into frontier eval development, model launch automation, and deeper prompt engineering support.
Requirements
- 8+ years in software engineering with 3+ years managing engineering teams, including experience leading a platform, infra, or developer-tooling team.
- A track record of building tooling and processes that make it easy for other teams to do the right thing.
- Comfort managing a team with a mixed charter: platform ownership, service-to-other-teams, and a launch-driven operational rhythm.
- Enough technical depth to engage on system design, review pipeline architecture, and be credible in debates with strong ICs.
- A product mindset and willingness to wear multiple hats when the work calls for it.
- Demonstrated ability to build and maintain peer relationships with partner orgs, negotiating ownership and aligning roadmaps.
Nice to have
- Prior exposure to LLM evals, ML experimentation platforms, or model quality work.
- Experience with A/B testing infrastructure, feature flagging, or gradual rollout systems.
- Background in devtools, CI/CD platforms, or testing infrastructure at scale.
- A history of managing teams that sit between two larger orgs and making that position an asset rather than a liability.
- Interest in AI safety and alignment.
Culture & Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Lovely office space in which to collaborate with colleagues.