Job Details

Job Information

Job Title :

Research Scientist, AI Evaluation Science

Job Code :

AWM-9747-Research Scientist, AI Evaluation Science

Job Announced :

5/12/2026

Job Closed :

5/17/2026

Pay Rate:

Negotiable

Duration:

Permanent

Other Information

Organization Name:

Apple

Organization Url:

www.apple.com

Address :

Seattle, WA, 98194, USA

City :

Seattle

State :

Washington

Country :

United States

Zip Code :

98194

Job Description

Role Number: 200649482-3337

Summary

AI systems are only as trustworthy as the methods used to evaluate them. At Apple, where AI powers experiences for billions of people, getting evaluation right is not a support function—it is a foundational science. Our team, part of Apple Services Engineering, is building that scientific foundation: rigorous, scalable evaluation methodology for LLMs, agentic systems, and human-AI interaction.

What makes this team unusual is its interdisciplinary core. You will work alongside measurement scientists (psychometrics, validity theory), ML researchers, and platform engineers—bringing together ML research, statistical rigor, and production engineering.

We are looking for a Research Scientist who treats evaluation methodology itself as a first-class research problem—someone with deep technical fluency in preference learning, reward modeling, or calibration theory, and the drive to advance the field while solving real problems at scale. We're hiring at multiple levels (early-career to senior researchers). What unites all candidates is depth of thinking about evaluation as a research problem.

Description

This is primarily a research role. You will formulate open problems in evaluation science, design experiments, publish findings, and drive projects from conception through completion. While you will also partner with platform engineers to ensure your methods are productionized into SDKs and APIs, the focus of the role is original research.

Our research team brings together ML scientists and measurement scientists to tackle evaluation as both a machine learning and a measurement problem, building methods that are technically innovative and scientifically valid. You will also work closely with a platform engineering team that translates research into production-ready SDKs and APIs used across Apple.

The successful candidate will have a strong publication record in evaluation-adjacent ML areas and a demonstrated ability to implement complex methods from recent papers, run large-scale experiments, and communicate results to both technical and non-technical audiences.

Minimum Qualifications

Ph.D. in Computer Science, Machine Learning, or a closely related field, with a research focus in evaluation-adjacent areas (preference learning, RLHF, human feedback, calibration, automated assessment)
Strong publication record at top-tier conferences (NeurIPS, ICML, ICLR, ACL, EMNLP), including first-author publications demonstrating independent research contributions
Deep technical expertise in at least one evaluation-adjacent ML area, with strong mathematical foundations: preference learning and reward modeling (RLHF, DPO, reward hacking, specification gaming); OR calibration theory, proper scoring rules, and statistical reliability; OR human-AI interaction methodology (active learning, annotation quality, preference elicitation)
Demonstrated ability to implement complex methods from recent papers and run large-scale experiments
Track record of translating research into practical systems—prototypes, tools, or methods adopted by others
Excellent written and verbal communication skills, including the ability to write clear research papers and explain complex concepts to diverse audiences

Preferred Qualifications

Publications specifically on evaluation methodology—papers about how to evaluate, not just papers that use evaluation to demonstrate model improvements
Strong hands-on experience with modern ML frameworks (PyTorch, JAX, or TensorFlow) and training or fine-tuning large language models
Experience with theoretical foundations of evaluation: measurement theory and validity frameworks, statistical learning theory (calibration, reliability, decision theory), or preference elicitation and aggregation
Specific research experience in one or more of: reward modeling and RLHF for alignment; LLM-as-judge approaches (calibration, rubric design, bias mitigation); benchmark design and validation (IRT, contamination detection); human evaluation methodology (protocol design, quality control); or agentic and multi-agent system evaluation
Demonstrated passion for evaluation as a research area: conference presentations, workshops, or tutorials on evaluation topics; open-source contributions to evaluation tools or benchmarks; active engagement with the evaluation research community
Experience with cross-disciplinary research, such as collaboration with social scientists, psychometricians, or domain experts
