component of the user study evaluation here. Without a reliable ground truth measure, the entire
evaluation study would collapse.
In addition to the ground truth problem, there is a paradox in meta-evaluation research, which often requires researchers to make compromises and tough choices in study design: meta-evaluation of evaluation metrics needs to be conducted on a relatively large, session-level dataset that usually includes both search behavior data and annotations from users and/or external assessors. However, in the context of controlled laboratory studies, it is often hard to recruit a large group of participants to perform complex search and annotation tasks. Consequently, researchers usually find it difficult to definitively answer an IIR evaluation question with user study data.
The most common compromise made to resolve this paradox is asking participants to complete more search sessions (described by the subfacet called amount of tasks), usually in a within-subjects design (included in the experimental design facet), to ensure the richness of search and annotation data. For instance, in Luo et al. (2017), each participant conducted 20 tasks within 2 hours in a controlled lab environment. Although this data collection strategy can improve the richness of the data and help control individual differences, as a compromise it comes with a series of limitations, such as potential learning effects and user fatigue. Fortunately, a variety of compensations can partially mitigate the negative effects of these limitations. For example, when doing complex search and annotation tasks, participants were asked to take a break between tasks to reduce possible fatigue (recorded under the quality control facet) (e.g., Jiang, He, and Allan, 2017). Also, the learning effects in doing a sequence of search tasks can be mitigated by randomizing task order (characterized under the task facet) (e.g., Mao et al., 2016).
In sum, for the evaluation of meta-evaluation user studies, one should also start with the specific research foci (e.g., system-oriented features, search behavior features, physiological and emotional indicators), deconstruct the study design into facets, and reorganize them as a facet map consisting of multiple levels of interrelated facets and factors. In faceted evaluation, the core facets directly related to the major goals and study design compromises should be identified (e.g., behavior and experience, task) and analyzed based upon their roles in addressing the proposed research problems. Then, other relevant subfacets can be added around the core facets in the facet map. The facet values, as well as the potential connections among them, should be identified and evaluated based on the extent to which the design decisions and manipulations in these subfacets jointly help answer the research questions and/or control the "damages" caused by the aforementioned study design decisions.
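The facet-map construction described above can be sketched as a small annotated graph: facets hold the study's design decisions (facet values), and labeled links record how a decision in one facet manipulates, controls, or compensates for something in another. The sketch below is an illustrative assumption for exposition only; the class, the role labels, and the example facet values are not part of the book's framework, though the example links mirror the compensations discussed above (randomized task order vs. learning effects, breaks vs. fatigue).

```python
# Hypothetical sketch of a facet map as an annotated graph.
# Class name, role labels, and facet values are illustrative, not from the book.
from collections import defaultdict

class FacetMap:
    """Facets map to lists of facet values (design decisions);
    links record how decisions in one facet relate to another facet."""

    def __init__(self):
        self.facets = {}                # facet name -> list of facet values
        self.links = defaultdict(list)  # facet name -> [(other facet, role)]

    def add_facet(self, name, values):
        self.facets[name] = values

    def link(self, src, dst, role):
        # role examples: "manipulation", "context control", "damage control"
        self.links[src].append((dst, role))

fm = FacetMap()
fm.add_facet("task", ["randomized task order", "20 tasks per session"])
fm.add_facet("quality control", ["breaks between tasks"])
fm.add_facet("behavior and experience", ["learning effects", "user fatigue"])

# Connections: design decisions that compensate for limitations elsewhere
fm.link("task", "behavior and experience",
        "damage control: randomized task order mitigates learning effects")
fm.link("quality control", "behavior and experience",
        "damage control: breaks between tasks reduce fatigue")

for src, edges in fm.links.items():
    for dst, role in edges:
        print(f"{src} -> {dst}: {role}")
```

Walking such a graph makes the joint evaluation concrete: each core facet can be assessed individually via its values, while the labeled edges expose the "collaborations" that must be judged together.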
5.4 SUMMARY
This chapter explained and illustrated the faceted evaluation approach in three different lines of user studies. In particular, we proposed a multi-faceted, replicable evaluation method, facet mapping,
5.3 METAEVALUATION OF EVALUATION MEASURES
50 5. EVALUATING INTERACTIVE IR USER STUDIES OF DIFFERENT TYPES
in order to facilitate the deconstruction of complex user study designs and also to support the construction of facet networks in analyzing the connections and "collaborations" among different facets (see Figure 5.1). Given the goals of faceted user study evaluation, the main missions of the facet map are to: (1) identify the main facets and components of a user study; and (2) clarify the roles of different facet values, as well as the underlying connections among them, in answering the predefined research questions (e.g., manipulating experimental conditions, controlling the context of search interaction, damage control for the study design compromises in other facets). The faceted evaluation should include both the assessment of each individual facet (e.g., the quality of the task description and experimental system) and (more importantly) the evaluation of the collaboration among different facets within the same research context.
Improving the characterization and evaluation of IIR user studies will create sizable impacts on multiple aspects of IIR research and also lead to new challenges and opportunities for future work. Based upon the previously explained structure and missions of the faceted scheme, the following chapters will further discuss the implications of this faceted framework for IIR user study design, reporting, and evaluation practices, as well as possible directions for future work in related areas.