search focus and related core facets. Using these facets, one can properly deconstruct the major parts
of the user study at hand. In system or interface features evaluation research, the core facets usually
include system feature (which can depict system or interface feature manipulation) and behavior
and experiences (which include the user experience and performance evaluation metrics). Digging
into the connections among these core facets as well as their associations with other relevant facets
(e.g., language and education background of participants, pre-study training, or warm-up tasks) can
help us understand the strengths and limitations that result from the study design decisions and the
methods employed to compensate for the associated limitations. Therefore, compared to examining
different aspects of user studies separately, building a multi-level facet map (from core facets to
external relevant facets) based on our faceted framework can better facilitate user study evaluation,
as it enables researchers to evaluate the design decisions, compromises, and compensations within a
network of interrelated facets and factors.
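As a rough illustration of what such a map might look like in practice (a hypothetical sketch, not the book's notation), the core and external facets can be represented as a small labeled graph whose edges record the connections a researcher wants to examine:

```python
# Hypothetical sketch of a multi-level facet map as a labeled graph.
# Facet names echo the framework's terminology; the structure and the
# helper function are illustrative assumptions, not a prescribed format.

facet_map = {
    "core": ["system feature", "behavior and experiences"],
    "external": ["language background", "education background",
                 "pre-study training", "warm-up task"],
    "links": [
        ("system feature", "behavior and experiences"),
        ("pre-study training", "behavior and experiences"),
        ("language background", "behavior and experiences"),
    ],
}

def facets_linked_to(facet_name, fmap):
    """Return all facets connected to facet_name in the map."""
    return [a if b == facet_name else b
            for a, b in fmap["links"] if facet_name in (a, b)]

# Tracing the connections of a core facet surfaces the design decisions
# and external factors that bear on it.
print(facets_linked_to("behavior and experiences", facet_map))
```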
5.3 METAEVALUATION OF EVALUATION MEASURES
Similar to system or interface features evaluation, meta-evaluation of evaluation metrics also
contains several evaluation-related components or facets and involves ground truth selections.
However, in this type of research, the evaluation is conducted by comparing and judging different
evaluation measures, rather than IR systems. In the context of IIR studies, meta-evaluation analysis
usually involves comparisons among system-oriented performance measures, user-reported
measures, and search behavior measures of different types. Given the rapid development of feature
extraction and feature engineering techniques, to better capture user characteristics in IR, recent
works also take into account other rarely used user-oriented features, such as facial expressions and
physiological features (e.g., Gwizdka et al., 2017; Gwizdka and Zhang, 2015; Moshfeghi and Jose,
2013). This improvement in feature diversity creates more room for meta-evaluation as well as for the
development of user behavioral modeling.
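To ground these three families of measures, the following minimal Python sketch derives one of each for a single query from a toy interaction log. All data and variable names here are invented for illustration; only the nDCG formula itself is standard:

```python
import math

def ndcg(gains, k=10):
    """System-oriented measure: nDCG@k over graded relevance gains."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(gains, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

# Toy log for one query (hypothetical values).
query_log = {
    "ranked_gains": [3, 0, 2, 1, 0],  # graded relevance of ranked results
    "satisfaction": 4,                # user-reported, e.g., 5-point scale
    "clicks": [(2.1, 1), (15.4, 3)],  # (seconds elapsed, rank clicked)
}

system_measure = ndcg(query_log["ranked_gains"])     # system-oriented
user_reported = query_log["satisfaction"]            # user-reported
time_to_first_click = query_log["clicks"][0][0]      # search behavior

print(system_measure, user_reported, time_to_first_click)
```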
As a type of evaluation study, meta-evaluation of evaluation measures also faces the
ground truth problem, which usually involves the subfacets under the behavior and experiences
facet. To evaluate the effectiveness of system-oriented (e.g., Mean Average Precision, normalized
Discounted Cumulative Gain) and user-centered evaluation measures, researchers have compared them
with user-reported ground truth, such as users’ query-level and session-level satisfaction, in-situ
and post-search relevance judgments, and experienced workload and cognitive load, aiming to figure
out whether these features are indicative of at least some aspects of user judgment, experience, and search
performance (e.g., Jiang, He, and Allan, 2017; Luo et al., 2017; Smucker and Jethani, 2010). While
this approach enables the evaluation of evaluation metrics with users, it also comes with the chal-
lenges and limitations discussed above (e.g., individual differences, internal subjectivity, and com-
plexity of self-reported measures). Hence, a solid evaluation of ground truth measure(s) is a critical
prerequisite for this type of research.
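A common way to operationalize this comparison is to correlate each candidate measure with the chosen ground-truth measure across queries or sessions. The sketch below does this with Kendall's tau; the per-query values are invented for illustration:

```python
from scipy.stats import kendalltau

# Hypothetical per-query values; in a real study these would come from
# the user-study data. Satisfaction serves here as the ground truth.
ndcg_scores  = [0.91, 0.35, 0.62, 0.78, 0.50, 0.88]
satisfaction = [5, 2, 3, 4, 2, 5]

tau, p_value = kendalltau(ndcg_scores, satisfaction)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```

A rank correlation such as Kendall's tau is often preferred over a linear correlation here because satisfaction ratings are ordinal rather than interval-scaled.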