component of the user study evaluation here. Without a reliable ground truth measure, the entire
evaluation study would collapse.
In addition to the ground truth problem, there is a paradox in meta-evaluation research that often forces researchers to make compromises and tough choices in study design: meta-evaluation of evaluation metrics needs to be conducted on a relatively large, session-level dataset that usually includes both search behavior data and annotations from users and/or external assessors. However, in controlled laboratory studies, it is often hard to recruit a large group of participants to perform complex search and annotation tasks. Consequently, researchers usually find it difficult to definitively answer an IIR evaluation question with user study data.
The most common compromise made to resolve this paradox is asking participants to complete more search sessions (described by the subfacet called amount of tasks), usually in a within-subjects design (included in the experimental design facet), to ensure the richness of search and annotation data. For instance, in Luo et al. (2017), each participant conducted 20 tasks within 2 hours in a controlled lab environment. Although this data collection strategy can improve the richness of data and help control individual differences, as a compromise it comes with a series of limitations, such as potential learning effects and user fatigue. Fortunately, a variety of countermeasures can partially mitigate the negative effects of these limitations. For example, when doing complex search and annotation tasks, participants were asked to take a break between tasks to reduce possible fatigue (recorded under the quality control facet) (e.g., Jiang, He, and Allan, 2017). Also, the learning effects in completing a sequence of search tasks can be mitigated by randomizing task order (characterized under the task facet) (e.g., Mao et al., 2016).
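The randomization described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the function name and seed are assumptions, not part of any cited study): each participant receives an independent random permutation of the task set, so learning effects do not systematically align with task position, while a fixed seed keeps the assignment replicable.

```python
import random

def randomized_task_orders(task_ids, n_participants, seed=42):
    """Assign each participant an independent random task order.

    A fixed seed makes the assignment reproducible for replication,
    one of the goals of the faceted evaluation approach.
    """
    rng = random.Random(seed)
    # rng.sample with k == len(task_ids) yields a full permutation
    return [rng.sample(task_ids, k=len(task_ids))
            for _ in range(n_participants)]

orders = randomized_task_orders(["T1", "T2", "T3", "T4"], n_participants=3)
```

For stricter counterbalancing, a Latin-square assignment (each task appearing once in each position across participants) could replace the per-participant shuffle, at the cost of constraining the number of participants.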
In sum, for the evaluation of meta-evaluation user studies, one should also start with the specific research foci (e.g., system-oriented features, search behavior features, physiological and emotional indicators), deconstruct the study design into facets, and reorganize them as a facet map consisting of multiple levels of interrelated facets and factors. In faceted evaluation, the core facets directly related to the major goals and study design compromises should be identified (e.g., behavior and experience, task) and analyzed based upon their roles in addressing the proposed research problems. Then, other relevant subfacets can be added around the core facets in the facet map. The facet values, as well as the potential connections among them, should be identified and evaluated based on the extent to which the design decisions and manipulations in these subfacets jointly help answer the research questions and/or control the "damages" caused by the aforementioned study design decisions.
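A facet map of the kind just described can be represented as a simple nested structure. The sketch below is illustrative only: the facet and factor names are drawn from the examples in this section, but the grouping into "core" and "supporting" levels is an assumption about how one might encode the map, not a fixed taxonomy.

```python
# Hypothetical encoding of a facet map: core facets tied to the major
# research goals and design compromises, plus supporting subfacets
# added around them. Names mirror the examples discussed in the text.
facet_map = {
    "core": {
        "behavior and experience": ["search behavior data", "annotations"],
        "task": ["amount of tasks", "randomized task order"],
    },
    "supporting": {
        "experimental design": ["within-subjects"],
        "quality control": ["breaks between tasks"],
    },
}

def core_facets(fm):
    """Return the core facets, which faceted evaluation identifies first."""
    return sorted(fm["core"])
```

Evaluation would then proceed facet by facet, asking how each factor (and the connections between factors across facets) helps answer the research questions or offsets the damage from a design compromise.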
5.4 SUMMARY
This chapter explained and illustrated the faceted evaluation approach in three different lines of
user studies. Particularly, we proposed a multi-faceted, replicable evaluation method, facet mapping,