6.3 LIMITATIONS AND CHALLENGES
This work sought to develop a multi-faceted evaluation framework based on state-of-the-art IIR
user studies and to fully explore the richness of user study facets and the relations among them via
a faceted approach. Besides the implications discussed above, our analysis has two main limitations:
the sources of our data and the method we applied for analyzing and coding it.
Source of data: We conducted our analysis based on a narrowly defined set of IIR user study
papers, and it is important to be mindful of this when drawing conclusions. However, since the six
venues from which we extracted user study papers are major scholarly communication venues for
the IR community and usually publish high-quality, state-of-the-art IIR papers, selecting papers
and developing a coding scheme from this paper pool can, to some extent, guarantee the quality and
generalizability of the proposed faceted framework in future evaluation. Nevertheless, it is still possible
that we missed some subfacets, factors, or specific techniques discussed and/or manipulated
in research papers published in other IR-related venues (e.g., WWW, WSDM, CIKM,
ICTIR, JCDL, and ASIS&T AM).
Reliability of coding: Many of the facet values in the published research papers are not
straightforward (e.g., task type, source of tasks, user characteristics), which makes it difficult to extract
these facets in some cases. In addition, the coding analysis in our work was qualitative in nature for
most of the facets and thus not as replicable as the analysis of statistical measures (e.g., Sakai, 2016).
A broader scope of coding analysis may help validate the existing faceted evaluation framework and
revise some parts of it where necessary.
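One way to make such qualitative coding more transparent is to report inter-coder agreement alongside the coded facets. The following is a minimal sketch, not part of the original analysis, that computes Cohen's kappa for two hypothetical coders' task-type labels; the labels and the use of scikit-learn are illustrative assumptions only.

```python
# Minimal sketch: quantifying inter-coder agreement for one coded facet.
# The coder labels below are hypothetical and serve only as an illustration.
from sklearn.metrics import cohen_kappa_score

# Task-type labels assigned by two coders to the same five papers
coder_a = ["known-item", "exploratory", "exploratory", "transactional", "known-item"]
coder_b = ["known-item", "exploratory", "known-item", "transactional", "known-item"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa for the task-type facet: {kappa:.2f}")
```

Reporting agreement statistics of this kind would make qualitative facet coding easier to replicate and compare across studies.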
In addition to the limitations discussed above, IIR user studies face several challenges,
from which additional implications can be drawn for the IR community.
Reporting practice: Under-reported facets and missing information concerning research
context (e.g., task source, time length of the study) often lead to gaps and blind spots in the facet
network of user studies. These blind spots may eventually prevent researchers from conducting a
more comprehensive, in-depth meta-evaluation of user studies and make it difficult to explain some
of the patterns extracted from data (Kelly and Sugimoto, 2013). Also, the relationships between
different aspects and dimensions (e.g., participants and tasks, experimental search system features
and behavioral measures) in user study design are not always clear in the reported IIR user studies.
Future IIR research should improve its reporting practices and clarify the connections between
the main facets (including commonly under-reported facets), the research questions and focuses, and
the compromises made in study design. Additionally, in evaluation-oriented studies, researchers
also need to clearly explain the rationale for adopting particular ground truth measure(s) and report
the process of evaluating the ground truth measure(s). The proposed faceted framework can serve
as a toolkit for study design and a checklist for IIR user study reporting practices.
Replication study: Despite the similarities in multiple facets, user studies are usually uniquely
designed for their respective research questions, which may involve largely different systems, tasks, and
user populations. As a result, each IIR user study appears to be an isolated island and thus cannot
be quantitatively and collectively evaluated. To resolve this issue, major IR venues (e.g., SIGIR,
CLEF) encourage researchers to replicate existing user studies and test the reliability and validity
of the associated results (Ferro et al., 2016; Ferro and Kelly, 2018; Suominen et al., 2018). However,
future research efforts and new platforms are still needed for replication studies in both field and
controlled lab settings to answer two empirical questions: (1) Are the conclusions drawn from previous
user studies reliable? (2) Would a change in user study facet(s) (especially the under-reported
facets) significantly alter the results of a user study?
To address the meta-issues of replicating user studies, HCI researchers organized the RepliCHI
workshop in the SIGCHI community to discuss the significant findings and contributions
from revisiting existing works (cf. Wilson et al., 2014). However, the idea and movement of RepliCHI was
ultimately rejected by the community, as it was considered to be relevant only to a small sub-community
of HCI research. In contrast, given the longstanding emphasis on evaluation in IR, developing
new frameworks and methods to facilitate replication studies would certainly be of interest
to the entire community. Opening a new track for user study replication and addressing the two
aforementioned questions can help the IR community better answer the research questions of
interest and properly evaluate the existing user study facets as well as the entire toolkit.
“Bad” results presentation: In most, if not all, of the published IR empirical studies, researchers
successfully obtained “right” results that confirm at least some of the research hypotheses, theories,
and arguments. However, it is worth noting that in a wide range of unpublished user studies,
researchers actually encountered unexpected, “bad” results that fail to answer the research
questions in the expected way(s). For instance, in some user studies, none of the proposed research
hypotheses are confirmed by the collected empirical evidence; the prediction performance of a new
model is significantly worse than that of the baselines; or most of the statistical test results (e.g., coefficients
of regression models, differences between predefined groups) are not statistically significant.
These surprises and “bad” results often go unnoticed and remain unpublished. Yet, if
we carefully examine the facets of the user studies behind these “bad” results, they may become
important gateways toward striking findings and/or methodological innovations.
Aside from the surprising findings that result in quantum leaps and unexpected innovations in
the corresponding research areas, we also need to explore other possible reasons (especially unnoticed
problems) that create unexpected, “bad” results in data analyses. The existence of unexpected results
may ultimately draw our attention to a series of critical, under-reported facets and factors. With
respect to participants and systems, for example, participants may come from largely different
knowledge backgrounds and thus have very different understandings of the same task descriptions.
Also, in some IIR evaluation studies, the variation in system affordances may be too subtle.
Consequently, users’ search behavioral signals may not be sensitive enough to capture the implicit
variations. In crowdsourcing user studies, the lack of data quality control methods may lead to biased and
contaminated results in analyses. Besides, the adoption of specific search behavior and experience
measures can also determine the significance and direction of the results. For instance,
compared to browsing and search result examination behaviors, users’ query formulation strategies
may be more sensitive to external treatments and interventions. Similarly, compared to
task-level measures, task-stage or action-level dual-task measures can better capture the variations
in users’ perceived levels of cognitive load (Gwizdka, 2010). Establishing a new platform (e.g., a
forum or workshop on “bad” results and possibly problematic designs of user studies) for researchers to
present, re-analyze, and openly discuss “bad” results can largely enhance our understanding of the
roles of different facets as well as the possible ways in which an IIR user study design can go wrong.
The bridges between small-scale user studies and large-scale IR experiments: It is certainly useful
to learn about user characteristics and search interaction patterns in small-scale laboratory studies
with well-controlled conditions. However, it is also important to explore the possible connections
and implicit bridges between small-scale user studies and large-scale IR experiments. The application
and generalization of the findings from small-scale user studies can be highly valuable for
search system design and evaluation. To fully understand these implicit connections, we need to at
least answer the following questions.
1. To what extent can we generalize the models we learned from small-scale user studies
to large-scale datasets collected in the wild (e.g., large-scale search log data collected
through commercial search engines, data from TREC, CLEF, and NTCIR)?
2. To what extent can we automatically extract the main facet values directly from large-
scale search log data (e.g., task facets, session lengths, search behavioral measures)
and use them in experiments in various ways (e.g., as features in predictive models;
as ground truth for evaluation; as feedback or rewards for reinforcement learning)?
A minimal sketch of such facet extraction appears at the end of this section.
3. Based on the findings from small-scale user studies, what assumptions can we make
about different user study facets in order to support large-scale IR experiments where
high-quality, complete user annotation data is not always available?
Answering these questions can help us better clarify the contributions of and the connections
between small-scale IIR user studies and large-scale IR experiments.
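To make question 2 above more concrete, the sketch below shows one way simple facet values, such as session length and query count, might be derived from raw search log records. It is a minimal sketch under assumed conditions: the column names (user_id, timestamp, action) and the 30-minute session cutoff are illustrative assumptions, not a prescribed pipeline.

```python
# Illustrative sketch: deriving simple facet values (session length, query count)
# from a search log. Column names and the 30-minute session gap are assumptions.
import pandas as pd

log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2023-01-01 10:00", "2023-01-01 10:05", "2023-01-01 11:00",
        "2023-01-01 09:00", "2023-01-01 09:02",
    ]),
    "action": ["query", "click", "query", "query", "click"],
})

log = log.sort_values(["user_id", "timestamp"])
# Start a new session whenever the gap between consecutive actions exceeds 30 minutes.
new_session = log.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=30)
log["session_id"] = new_session.groupby(log["user_id"]).cumsum()

facet_values = log.groupby(["user_id", "session_id"]).agg(
    session_length_sec=("timestamp", lambda t: (t.max() - t.min()).total_seconds()),
    query_count=("action", lambda a: (a == "query").sum()),
)
print(facet_values)
```

Facet values extracted in this way could then be compared against the same facets coded in small-scale user studies, or used as features, ground truth, or feedback signals in large-scale experiments.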