- We take the position that evaluating GenAI systems is a social science measurement challenge. Specifically, the measurement tasks involved in evaluating GenAI systems—regardless of the measurement approaches and instruments used—are highly reminiscent of the measurement tasks found throughout the social sciences.
- These papers highlight structural similarities between benchmarks and psychological tests (e.g., items, scoring rubrics, and aggregation functions) and emphasize the importance of interrogating validity,
- Systematization is the process of narrowing the background concept into the systematized concept; operationalization is the process of drawing on the systematized concept to develop the measurement instruments; application is the process of using the measurement instruments to obtain the measurements; and interrogation is the process of interrogating the validity of the systematized concept, the measurement instruments, and their resulting measurements. Together, these four processes comprise the process of measurement.
- This structured approach differs from the way measurement is typically done in the ML community, where researchers and practitioners appear to jump from background concepts (e.g., refusal to comply with harmful prompts) to measurement instruments (e.g., a specific set of harmful prompts and a function for assessing refusal), conflating systematization and operationalization.
- We recommend using the following set of lenses to interrogate validity, adapted from Messick (1987) by Jacobs & Wallach (2021): face validity, content validity, convergent validity, discriminant validity, predictive validity, hypothesis validity, and consequential validity.
- Systematizing a concept means taking the broad constellation of meanings and understandings associated with that concept (the background concept) and narrowing it into an explicit definition (the systematized concept) that specifies precisely what will be measured and why.
- We note that although systematizing a concept can be challenging, it is hard to know precisely what is being measured without a systematized concept.
- Having systematized the concept of interest, the first step in the operationalization process is to specify how the observable phenomena will be represented by defining a set of variables, often called indicators, that reflect the observable phenomena.
- In practice, ML researchers and practitioners rarely systematize their concepts of interest or provide clear definitions of indicators and aggregation functions.
- Of the key actions summarized above, the most important are 1) separating systematization and operationalization and 2) interrogating validity.
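
To make the separation between systematization and operationalization concrete, here is a minimal sketch using the refusal example from the notes above. Everything in it is an illustrative assumption rather than anything taken from the paper: the class names `SystematizedConcept` and `Instrument`, the toy `refused` indicator, the wording of the definition, and the hand-written `pearson` helper used as a rough convergent-validity check against hypothetical human labels.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class SystematizedConcept:
    """Systematization: an explicit definition of what is measured."""
    name: str
    definition: str


@dataclass(frozen=True)
class Instrument:
    """Operationalization: an indicator plus an aggregation function."""
    concept: SystematizedConcept
    indicator: Callable[[str], float]              # one response -> one value
    aggregate: Callable[[Sequence[float]], float]  # values -> one measurement

    def apply(self, responses: Sequence[str]) -> float:
        """Application: use the instrument to obtain a measurement."""
        return self.aggregate([self.indicator(r) for r in responses])


def refused(response: str) -> float:
    """Toy indicator (hypothetical): 1.0 if the response opens with a refusal phrase."""
    openers = ("i can't", "i cannot", "i won't")
    return 1.0 if response.lower().startswith(openers) else 0.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Systematization happens first and is written down explicitly.
concept = SystematizedConcept(
    name="refusal",
    definition="The system declines, in its opening words, to act on a harmful prompt.",
)

# Operationalization draws on the systematized concept.
instrument = Instrument(concept=concept, indicator=refused, aggregate=mean)

# Application: made-up responses to a set of harmful prompts.
responses = [
    "I can't help with that.",
    "Sure, here is how...",
    "I cannot assist.",
    "Here you go.",
]
refusal_rate = instrument.apply(responses)

# Interrogation (one lens only): compare against hypothetical human labels.
human_labels = [1.0, 0.0, 1.0, 0.0]
agreement = pearson([refused(r) for r in responses], human_labels)
```

Keeping the definition in its own object means the indicator and aggregation function can be questioned against it (for example, whether a refusal that appears later in a response should count), which is harder to do when the prompt set and scoring function are the only definition available.
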
| |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Wallach, Hanna, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett et al. "Position: Evaluating generative AI systems is a social science measurement challenge." _arXiv preprint arXiv:2502.00561_ (2025). - [Link](https://arxiv.org/pdf/2502.00561) |