search focus and related core facets. Using these facets, one can properly deconstruct the major parts
of the user study at hand. In system or interface features evaluation research, the core facets usually
include system feature (which can depict system or interface feature manipulation) and behavior
and experiences (which include the user experience and performance evaluation metrics). Digging
into the connections among these core facets as well as their associations with other relevant facets
(e.g., language and education background of participants, pre-study training, or warm-up tasks) can
help us understand the strengths and limitations that result from the study design decisions and the
methods employed to compensate for the associated limitations. Therefore, compared to examining
different aspects of user studies separately, building a multi-level facet map (from core facets to
external relevant facets) based on our faceted framework can better facilitate user study evaluation,
as it enables researchers to evaluate the design decisions, compromises, and compensations within a
network of interrelated facets and factors.
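As a rough illustration of what such a map might look like in practice (a hypothetical sketch, not the book's notation), the core and external facets can be represented as a small labeled graph whose edges record the connections a researcher wants to examine:

```python
# Hypothetical sketch of a multi-level facet map as a labeled graph.
# Facet names echo the framework's terminology; the structure and the
# helper function are illustrative assumptions, not a prescribed format.

facet_map = {
    "core": ["system feature", "behavior and experiences"],
    "external": ["language background", "education background",
                 "pre-study training", "warm-up task"],
    "links": [
        ("system feature", "behavior and experiences"),
        ("pre-study training", "behavior and experiences"),
        ("language background", "behavior and experiences"),
    ],
}

def facets_linked_to(facet_name, fmap):
    """Return all facets connected to facet_name in the map."""
    return [a if b == facet_name else b
            for a, b in fmap["links"] if facet_name in (a, b)]

# Tracing the connections of a core facet surfaces the design decisions
# and external factors that bear on it.
print(facets_linked_to("behavior and experiences", facet_map))
```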
5.3 METAEVALUATION OF EVALUATION MEASURES
Similar to system or interface features evaluation, meta-evaluation of evaluation metrics also
contains several evaluation-related components or facets and involves ground truth selections.
However, in this type of research, the evaluation is conducted by comparing and judging different
evaluation measures, rather than IR systems. In the context of IIR studies, meta-evaluation analysis
usually involves comparisons among system-oriented performance measures, user-reported
measures, and search behavior measures of different types. Given the rapid development of feature
extraction and feature engineering techniques, to better capture user characteristics in IR, recent
works also take into account other rarely used user-oriented features, such as facial expressions and
physiological features (e.g., Gwizdka et al., 2017; Gwizdka and Zhang, 2015; Moshfeghi and Jose,
2013). This improvement in feature diversity creates more room for meta-evaluation as well as for the
development of user behavioral modeling.
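To ground these three families of measures, the following minimal Python sketch derives one of each for a single query from a toy interaction log. All data and variable names here are invented for illustration; only the nDCG formula itself is standard:

```python
import math

def ndcg(gains, k=10):
    """System-oriented measure: nDCG@k over graded relevance gains."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(gains, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

# Toy log for one query (hypothetical values).
query_log = {
    "ranked_gains": [3, 0, 2, 1, 0],  # graded relevance of ranked results
    "satisfaction": 4,                # user-reported, e.g., 5-point scale
    "clicks": [(2.1, 1), (15.4, 3)],  # (seconds elapsed, rank clicked)
}

system_measure = ndcg(query_log["ranked_gains"])     # system-oriented
user_reported = query_log["satisfaction"]            # user-reported
time_to_first_click = query_log["clicks"][0][0]      # search behavior

print(system_measure, user_reported, time_to_first_click)
```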
As a type of evaluation study, meta-evaluation of evaluation measures also faces the
ground truth problem, which usually involves the subfacets under the behavior and experiences
facet. To evaluate the effectiveness of system-oriented (e.g., Mean Average Precision, normalized
Discounted Cumulative Gain) and user-centered evaluation measures, researchers have compared them
with user-reported ground truth, such as users’ query-level and session-level satisfaction, in-situ
and post-search relevance judgments, and experienced workload and cognitive load, aiming to figure
out whether these features are indicative of at least some aspects of user judgment, experience, and search
performance (e.g., Jiang, He, and Allan, 2017; Luo et al., 2017; Smucker and Jethani, 2010). While
this approach enables the evaluation of evaluation metrics with users, it also comes with the chal-
lenges and limitations discussed above (e.g., individual differences, internal subjectivity, and com-
plexity of self-reported measures). Hence, a solid evaluation of ground truth measure(s) is a critical
prerequisite for this type of research.
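A common way to operationalize this comparison is to correlate each candidate measure with the chosen ground-truth measure across queries or sessions. The sketch below does this with Kendall's tau; the per-query values are invented for illustration:

```python
from scipy.stats import kendalltau

# Hypothetical per-query values; in a real study these would come from
# the user-study data. Satisfaction serves here as the ground truth.
ndcg_scores  = [0.91, 0.35, 0.62, 0.78, 0.50, 0.88]
satisfaction = [5, 2, 3, 4, 2, 5]

tau, p_value = kendalltau(ndcg_scores, satisfaction)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```

A rank correlation such as Kendall's tau is often preferred over a linear correlation here because satisfaction ratings are ordinal rather than interval-scaled.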