Replication study: Despite their similarities across multiple facets, user studies are usually uniquely
designed for their respective research questions and may involve largely different systems, tasks, and
user populations. As a result, each IIR user study appears to be an isolated island and thus cannot
be quantitatively and collectively evaluated. To resolve this issue, major IR venues (e.g., SIGIR,
CLEF) encourage researchers to replicate existing user studies and test the reliability and validity
of the associated results (Ferro et al., 2016; Ferro and Kelly, 2018; Suominen et al., 2018). However,
future research efforts and new platforms are still needed for replication studies in both field and
controlled lab settings to answer two empirical questions: (1) Are the conclusions drawn from previous
user studies reliable? (2) Would a change in user study facet(s) (especially the under-reported
facets) significantly alter the results of a user study?
To address the meta-issues of replicating user studies, HCI researchers organized the RepliCHI
workshop in the SIGCHI community to discuss the significant findings and contributions
from revisiting existing works (cf. Wilson et al., 2014). However, the idea and movement of RepliCHI was
ultimately rejected by the community, as it was considered to be relevant only to a small sub-community
of HCI research. In contrast, given the longstanding emphasis on evaluation in IR, developing
new frameworks and methods to facilitate replication studies would certainly be of interest
to the entire community. Opening a new track for user study replication and addressing the two
aforementioned questions can help the IR community better answer the research questions of
interest and properly evaluate the existing user study facets as well as the entire toolkit.
“Bad” results presentation: In most, if not all, published IR empirical studies, researchers
successfully obtained “right” results that confirm at least some of the research hypotheses, theories,
and arguments. However, it is worth noting that in a wide range of unpublished user studies,
researchers actually encountered unexpected, “bad” results that fail to answer the research
questions in the expected way(s): for instance, in some user studies, none of the proposed research
hypotheses are confirmed by the collected empirical evidence; the prediction performance of a new
model is significantly worse than that of the baselines; most of the statistical test results (e.g., coefficients
of regression models, differences between predefined groups) are not statistically significant;
and so on. These surprises and “bad” results often go unnoticed and remain unpublished. Yet, if
we carefully examine the facets of the user studies behind these “bad” results, they may become
important gateways toward striking findings and/or methodological innovations.
Aside from the surprising findings that result in quantum leaps and unexpected innovations in
the corresponding research areas, we also need to explore other possible reasons (especially unnoticed
problems) that create unexpected, “bad” results in data analyses. The existence of unexpected results
may ultimately direct our attention to a series of critical, under-reported facets and factors. With
respect to the participants and systems, for example, participants may come from largely different
knowledge backgrounds and thus have very different understandings of the same task descriptions.
6.3 LIMITATIONS AND CHALLENGES