As shown above, IIR user studies in this group (system/interface feature evaluation) go beyond the classical system-oriented evaluation approach by taking into account users' perceptions, evaluations, and behavioral patterns in search interaction. The design and evaluation of search systems have been strongly influenced by computer science and algorithmic perspectives for decades. In contrast to traditional IR evaluation work, IIR evaluation takes a cognitive, user-centered perspective, puts the user (instead of the explicit query or document) at the center of search interaction, and evaluates an IR system based on the extent to which it successfully represents users' knowledge states and supports their work and search tasks (Belkin, 2000; Belkin et al., 2004; Ruthven, 2008). To accomplish this goal, researchers need to go beyond explicit queries and ranked documents to capture and represent multiple user-focused facets (e.g., users' knowledge states, task stage, search task difficulty) in search system evaluation.
2.4 METAEVALUATION OF EVALUATION METRICS
In addition to the two major types of IIR user studies discussed above, some studies take a step back from specific systems and problems and seek to evaluate the user-oriented measures applied in IR system evaluation. In this type of study, researchers usually measure user behaviors (e.g., query formulation, search result browsing and examination, eye movements) and experience (e.g., search satisfaction, task difficulty) with different sets of evaluation metrics (e.g., in-situ and post-search evaluation metrics, traditional search behavioral metrics and neuro-physiological metrics) and evaluate the effectiveness of these measures against a set of predefined user-oriented ground truths (e.g., search satisfaction and usefulness judgments). The major goal of this line of studies is to find or design reliable measures that can be applied in future standard IIR evaluations and search interaction studies.
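To make this methodology concrete, the following Python sketch illustrates the basic meta-evaluation step: given per-session values of several candidate metrics and users' post-search satisfaction ratings as the ground truth, rank the candidates by how strongly each correlates with satisfaction. All data, metric names, and values here are hypothetical placeholders for illustration, not results from any study cited in this chapter.

    # A minimal meta-evaluation sketch (hypothetical data and metric names):
    # correlate each candidate evaluation metric with users' post-search
    # satisfaction ratings, the ground truth in studies of this type.

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical per-session data for 8 search sessions.
    satisfaction = np.array([5, 2, 4, 1, 3, 5, 2, 4])  # ratings on a 1-5 scale

    candidate_metrics = {
        # mean relevance grade of clicked documents (TREC-style judgments)
        "avg_relevance":  np.array([2.1, 1.8, 2.0, 1.5, 1.9, 2.3, 1.7, 2.2]),
        # mean in-situ usefulness rating of clicked documents
        "avg_usefulness": np.array([4.5, 2.0, 3.8, 1.2, 2.9, 4.8, 2.2, 4.0]),
        # a simple behavioral signal: total dwell time on results (seconds)
        "total_dwell":    np.array([310, 95, 240, 60, 150, 330, 120, 280]),
    }

    # Spearman's rank correlation suits the ordinal satisfaction scale.
    for name, values in candidate_metrics.items():
        rho, p = spearmanr(values, satisfaction)
        print(f"{name:15s} rho={rho:+.2f} (p={p:.3f})")

In an actual study, such correlations would be computed over many more sessions and validated across task types and user groups before a metric is recommended for standard IIR evaluation.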
For example, to address the limitations of traditional TREC-style relevance judgments, Jiang, He, and Allan (2017) explored two improvements (i.e., collecting in-situ judgments to make relevance judgments contextual, and collecting multidimensional assessments to address different aspects of relevance and usefulness) and evaluated the new relevance framework using six different user experience measures as ground truth. Mao et al. (2016) designed a laboratory study in which they compared relevance annotations with document usefulness measures and demonstrated that a measure based on usefulness rather than annotated relevance correlates better with user satisfaction. In addition, they found that external assessors can provide high-quality usefulness annotations when additional search context information is provided to them. Chen et al. (2017) meta-evaluated a series of online and offline metrics (e.g., online behavioral features, offline relevance judgments) to study the extent to which these metrics can infer actual search satisfaction in different task scenarios.
As these examples show, this line of research sheds light on the connection between user characteristics and IIR evaluation measures and often proposes innovative user-oriented evaluation measures.
2.4 METAEVALUATION OF EVALUATION METRICS
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.87.83