to be necessary for a good result” (Robertson, 2008, p. 447). This balance among different facets
is much more difficult to reach in user studies than in classical system-focused IR evaluations,
primarily because of various individual differences and the dynamic nature of human cognition,
perception, and associated behaviors during the course of information seeking episodes. Therefore,
the aims of this work on IIR user study evaluation are to:
• identify the facets/dimensions of user studies, especially the easily ignored user-side
factors that may significantly alter the results of a study; and
• evaluate the strengths and limitations of different types of designs and the compromises
made in user studies with respect to answering their respective research questions.
Indeed, there is no single best toolkit for user studies that can be universally applied to all
IIR-related research problems. However, it is still beneficial for researchers to be clear about: (1)
given the research problem(s), what kind of balance they hope and need to reach among the different
dimensions or facets of a user study; and (2) what the potential impacts of the decisions and
compromises made in user study design are on the associated results and findings.
1.3 SYSTEMATIC REVIEW AND USER STUDY EVALUATION
In this work, our approach to developing knowledge and addressing the research problems discussed
above is to build a faceted framework and apply it in deconstructing, characterizing, and evaluating
various types of IIR user studies. To develop and implement the faceted approach, we first system-
atically reviewed and collected 462 IIR user study papers published in multiple high-quality IR
venues. Then, we developed an initial faceted framework from a small portion of the selected papers
and revised and updated it as new factors and facets emerged from the paper coding process.
In terms of the criteria for paper selection, we only included user studies in which researchers
proposed IIR-related research questions, recruited participants (instead of using simulated users),
and clearly articulated the major components of their study design (e.g., experimental system, task
design, test collection and corpus, interface). Hence, user studies in which researchers only used search
behavior datasets from existing user study collections (e.g., Text Retrieval Conference (TREC)
test collections,¹ NII Testbeds and Community for Information access Research (NTCIR) test col-
lections²), or large-scale search logs (e.g., Bing search logs, Yandex search logs) without designing
any new study were excluded from our analysis, as they did not really pertain to the dilemma of
balancing different facets and making difficult compromises in user study design (e.g., Deveaud et
al., 2018; Feild and Allan, 2013; Jiang et al., 2017; Luo, Zhang, and Yang, 2014; Spink et al., 2004).
¹ https://trec.nist.gov/
² http://research.nii.ac.jp/ntcir/index-en.html