during the time period in the selected venues. In cases where the paper type was not clear from the title and abstract, we read the main text until we figured out whether the paper actually met our criteria.
To implement the paper selection criteria defined above, we carefully examined the methodology section of each paper recently published in the selected venues and filtered out the research papers that (1) clearly indicated that the experimental data were extracted from existing large-scale datasets and test collections (e.g., TREC, NTCIR, CLEF, AOL search logs), or (2) showed no clear evidence confirming that the authors actually carried out a new user study to collect fresh data on user behavior, evaluation, and/or interaction experience.
Similar to the findings of Kelly and Sugimoto (2013), during paper selection we found many cases where IIR researchers published multiple papers using one dataset from the same user study. For these papers, to avoid repetition and potential bias, we included only the one that offers the most complete details about the user study design. In this way we can better focus on the diverse facets of user studies reported in different communities, and also obtain more details regarding IIR research topics and focuses, specific methods and techniques, and result reporting styles for faceted user study evaluation.
Overall, we obtained 462 IIR user study papers that jointly cover a wide range of research problems and topics, methods, and findings in the area of IR. Regarding the focuses of the selected venues, while SIGIR has published papers on both IIR evaluation and understanding user behavior and experience in recent years, its main focus is still on evaluating search system features, such as interface components, search assistants, and underlying ranking algorithms.
For example, Singla, White, and Huang (2010) evaluated and compared the performance of different trail-finding algorithms in web search and demonstrated the value of mining and recommending search trails based on users’ search logs collected via a widely distributed browser plugin. Wan, Li, and Xiao (2010) developed a multi-document summarization system that automatically extracts easy-to-understand English summaries (EUSUM) for non-native readers. They ran a user study to evaluate the novel system, and their results indicated that the EUSUM system can produce more useful, understandable summaries for non-native readers than baseline summarization techniques.
As illustrated in these two representative cases, system-evaluation-focused studies published in SIGIR (as well as other reputable IR venues) often start with a practical IR problem and present an innovative system, or a new component of a system, as a solution to the highlighted problem. The role of the user study, then, is to evaluate the system with users and to empirically demonstrate its value along a series of predefined facets or dimensions, such as usefulness, relevance, usability, perceived workload, and effectiveness. Despite this longstanding focus on IR evaluation, there has been growing research effort on understanding search behavior and experience within the IR community in recent years (e.g., Edwards and Kelly, 2017; Lagun and