6.3 LIMITATIONS AND CHALLENGES
This work sought to develop a multi-faceted evaluation framework based on state-of-the-art IIR
user studies and to fully explore the richness of user study facets and the relations among them via
a faceted approach. Besides the implications discussed above, our analysis has two main limitations:
the sources of our data and the method we applied for analyzing and coding it.
Source of data: We conducted our analysis based on a narrowly defined set of IIR user study
papers, and it is important to be mindful of this when drawing conclusions. However, since the six
venues from which we extracted user study papers are major scholarly communication venues for
the IR community and usually publish high-quality, state-of-the-art IIR papers, selecting papers
and developing a coding scheme from this paper pool can, to some extent, guarantee the quality and
generalizability of the proposed faceted framework in future evaluation. Nevertheless, it is still possible
that we missed some subfacets, factors, or specific techniques discussed and/or manipulated
in research papers published in other IR-related venues (e.g., WWW, WSDM, CIKM,
ICTIR, JCDL, and ASIS&T AM).
Reliability of coding: Many of the facet values in the published research papers are not
straightforward (e.g., task type, source of tasks, user characteristics), which makes it difficult to extract
these facets in some cases. In addition, the coding analysis in our work was qualitative in nature for
most of the facets and thus not as replicable as the analysis of statistical measures (e.g., Sakai, 2016).
A broader scope of coding analysis may help validate the existing faceted evaluation framework and
revise some parts of it where necessary.
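One way to make such qualitative coding more transparent is to report inter-coder agreement alongside the coded facets. The following is a minimal sketch, not part of the original analysis, that computes Cohen's kappa for two hypothetical coders' task-type labels; the labels and the use of scikit-learn are illustrative assumptions only.

```python
# Minimal sketch: quantifying inter-coder agreement for one coded facet.
# The coder labels below are hypothetical and serve only as an illustration.
from sklearn.metrics import cohen_kappa_score

# Task-type labels assigned by two coders to the same five papers
coder_a = ["known-item", "exploratory", "exploratory", "transactional", "known-item"]
coder_b = ["known-item", "exploratory", "known-item", "transactional", "known-item"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa for the task-type facet: {kappa:.2f}")
```

Reporting agreement statistics of this kind would make qualitative facet coding easier to replicate and compare across studies.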
In addition to the limitations discussed above, IIR user studies face several challenges,
from which additional implications can be drawn for the IR community.
Reporting practice: Under-reported facets and missing information concerning research
context (e.g., task source, time length of the study) often lead to gaps and blind spots in the facet
network of user studies. These blind spots may eventually prevent researchers from conducting a
more comprehensive, in-depth meta-evaluation of user studies and make it difficult to explain some
of the patterns extracted from data (Kelly and Sugimoto, 2013). Also, the relationships between
different aspects and dimensions (e.g., participants and tasks, experimental search system features
and behavioral measures) in user study design are not always clear in the reported IIR user studies.
Future IIR research should improve its reporting practices and clarify the connections between
the main facets (including commonly under-reported facets), the research questions and focuses, and
the compromises made in study design. Additionally, in evaluation-oriented studies, researchers
also need to clearly explain the rationale for adopting particular ground truth measure(s) and report
the process of evaluating the ground truth measure(s). The proposed faceted framework can serve
as a toolkit for study design and a checklist for IIR user study reporting practices.
Replication study: Despite the similarities in multiple facets, user studies are usually uniquely
designed for their respective research questions, which may involve largely different systems, tasks, and
user populations. As a result, each IIR user study appears to be an isolated island and thus cannot
be quantitatively and collectively evaluated. To resolve this issue, major IR venues (e.g., SIGIR,
CLEF) encourage researchers to replicate existing user studies and test the reliability and validity
of the associated results (Ferro et al., 2016; Ferro and Kelly, 2018; Suominen et al., 2018). However,
future research efforts and new platforms are still needed for replication studies in both field and
controlled lab settings to answer two empirical questions: (1) Are the conclusions drawn from previous
user studies reliable? (2) Would a change in user study facet(s) (especially the under-reported
facets) significantly alter the results of a user study?
To address the meta-issues of replicating user studies, HCI researchers organized the RepliCHI
workshop in the SIGCHI community to discuss the significant findings and contributions
from revisiting existing works (cf. Wilson et al., 2014). However, the idea and movement of RepliCHI was
ultimately rejected by the community, as it was considered to be relevant only to a small sub-community
of HCI research. In contrast, given the longstanding emphasis on evaluation in IR, developing
new frameworks and methods to facilitate replication studies would certainly be of interest
to the entire community. Opening a new track for user study replication and addressing the two
aforementioned questions can help the IR community better answer the research questions of
interest and properly evaluate the existing user study facets as well as the entire toolkit.
“Bad” results presentation: In most, if not all, of the published IR empirical studies, researchers
successfully obtained “right” results that confirm at least some of the research hypotheses, theories,
and arguments. However, it is worth noting that in a wide range of unpublished user studies,
researchers actually encountered unexpected, “bad” results that fail to answer the research
questions in the expected way(s). For instance, in some user studies, none of the proposed research
hypotheses are confirmed by the collected empirical evidence; the prediction performance of a new
model is significantly worse than that of the baselines; or most of the statistical test results (e.g., coefficients
of regression models, differences between predefined groups) are not statistically significant.
These surprises and “bad” results often go unnoticed and remain unpublished. Yet, if
we carefully examine the facets of the user studies behind these “bad” results, they may become
important gateways toward striking findings and/or methodological innovations.
Aside from the surprising findings that result in quantum leaps and unexpected innovations in
the corresponding research areas, we also need to explore other possible reasons (especially unnoticed
problems) that create unexpected, “bad” results in data analyses. The existence of unexpected results
may ultimately draw our attention to a series of critical, under-reported facets and factors. With
respect to participants and systems, for example, participants may come from largely different
knowledge backgrounds and thus have very different understandings of the same task descriptions.
Also, in some IIR evaluation studies, the variation in system affordances may be too subtle.
Consequently, users’ search behavioral signals may not be sensitive enough to capture the implicit
variations. In crowdsourcing user studies, the lack of data quality control methods may lead to biased and
contaminated results in analyses. Besides, the adoption of specific search behavior and experience
measures can also determine the significance and direction of the results. For instance,
compared to browsing and search result examination behaviors, users’ query formulation strategies
may be more sensitive to external treatments and interventions. Similarly, compared to
task-level measures, task-stage or action-level dual-task measures can better capture the variations
in users’ perceived levels of cognitive load (Gwizdka, 2010). Establishing a new platform (e.g., a
forum or workshop on “bad” results and possibly problematic designs of user studies) for researchers to
present, re-analyze, and openly discuss “bad” results can largely enhance our understanding of the
roles of different facets as well as the possible ways in which an IIR user study design can go wrong.
The bridges between small-scale user studies and large-scale IR experiments: It is certainly useful
to learn about user characteristics and search interaction patterns in small-scale laboratory studies
with well-controlled conditions. However, it is also important to explore the possible connections
and implicit bridges between small-scale user studies and large-scale IR experiments. The application
and generalization of the findings from small-scale user studies can be highly valuable for
search system design and evaluation. To fully understand these implicit connections, we need to at
least answer the following questions.
1. To what extent can we generalize the models we learned from small-scale user studies
to large-scale datasets collected in the wild (e.g., large-scale search log data collected
through commercial search engines, data from TREC, CLEF, and NTCIR)?
2. To what extent can we automatically extract the main facet values directly from large-
scale search log data (e.g., task facets, session lengths, search behavioral measures)
and use them in experiments in various ways (e.g., as features in predictive models;
as ground truth for evaluation; as feedback or rewards for reinforcement learning)?
A minimal sketch of such facet extraction appears at the end of this section.
3. Based on the findings from small-scale user studies, what assumptions can we make
about different user study facets in order to support large-scale IR experiments where
high-quality, complete user annotation data is not always available?
Answering these questions can help us better clarify the contributions of and the connections
between small-scale IIR user studies and large-scale IR experiments.
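To make question 2 above more concrete, the sketch below shows one way simple facet values, such as session length and query count, might be derived from raw search log records. It is a minimal sketch under assumed conditions: the column names (user_id, timestamp, action) and the 30-minute session cutoff are illustrative assumptions, not a prescribed pipeline.

```python
# Illustrative sketch: deriving simple facet values (session length, query count)
# from a search log. Column names and the 30-minute session gap are assumptions.
import pandas as pd

log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2023-01-01 10:00", "2023-01-01 10:05", "2023-01-01 11:00",
        "2023-01-01 09:00", "2023-01-01 09:02",
    ]),
    "action": ["query", "click", "query", "query", "click"],
})

log = log.sort_values(["user_id", "timestamp"])
# Start a new session whenever the gap between consecutive actions exceeds 30 minutes.
new_session = log.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=30)
log["session_id"] = new_session.groupby(log["user_id"]).cumsum()

facet_values = log.groupby(["user_id", "session_id"]).agg(
    session_length_sec=("timestamp", lambda t: (t.max() - t.min()).total_seconds()),
    query_count=("action", lambda a: (a == "query").sum()),
)
print(facet_values)
```

Facet values extracted in this way could then be compared against the same facets coded in small-scale user studies, or used as features, ground truth, or feedback signals in large-scale experiments.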