1 Introduction

Struggling search, as introduced by Hassan et al. [11], describes a search process in which users experience difficulty in finding the information they require. Generating struggling search tasks is essential both for studying users’ struggling search behaviors and for evaluating the performance of interactive search systems. However, creating such a task generator is not easy, and as a consequence there has been no unified task set for research on struggling search.

Existing methods of generating struggling search tasks can be roughly divided into two categories. One creates struggling tasks by increasing task complexity, e.g., “There are five countries whose names are also carried by chemical elements. France has two (Ga–Gallium and Fr–Francium), ... Please name the left country” [23]. The other tries to simulate a mission or a problem from real-life scenes and adheres to small-scale situated laboratory experiments, e.g., “You once heard that the Dave Matthews Band owns a studio in Virginia but you don’t know the name of it. The studio is located outside of Charlottesville and it’s in the mountains. What is the name of the studio?” [2]. Both task generation methods rely on researchers or professionals of certain areas. Task generation therefore requires extensive experience and a fertile imagination, and since there is no common pattern to follow, previous works usually ended up with small-sized task sets.

Though small-sized task sets can work well in some experimental laboratory studies [2, 23], they are not sufficient for large-scale and robust system evaluation or user studies. The potential effect of participant fatigue limits experiments to a small number of topics and similar situated tasks, making the evaluation inclined toward a subjective or biased perspective [25]. Besides, a task generation process that relies on a few experts or researchers is not cost-efficient at scale. This dictates the need for a robust and cost-efficient method to generate struggling search tasks for evaluation. Crowdsourcing has been shown to be a powerful means for recruiting low-cost participants who are readily available around the clock [8, 9]. This provides us with an alternative source of reliable human input. We therefore propose the use of crowdsourcing to generate struggling search tasks.

We focus on struggling search that manifests in fact finding or checking tasks. We propose a generic method to generate struggling search tasks for large-scale experimental study and develop an online crowd-powered platform, TaskGenie Footnote 1. Our method leverages paraphrased (redundant) information in online wikis and decomposes task generation into several low-effort steps, suitable for crowd workflows, to create questions that are difficult and can simulate struggling search. The method can easily be applied to topically dedicated wikis such as wikinews Footnote 2 for news and wikivoyage Footnote 3 for travel; in this paper we take English Wikipedia as the resource to generate a topically diverse set of struggling search tasks. With crowd participants, we generated 80 struggling search tasks across diverse topics. To check the feasibility of our task generation module, we carry out rigorous user-centric experiments and evaluations.

In previous works [27, 28], we conducted experiments on the whole Web, analyzing the characteristics of elicited user behaviors to verify that the generated tasks are qualified struggling search tasks. In this paper, we further explore the characteristics of the generated tasks by conducting experiments in a strictly confined search domain (i.e., Wikipedia), and we evaluate the cost of this task generation module. Results confirm that the proposed method can generate qualified struggling search tasks and is cost-effective. We consolidated the tasks and publicly released a task set Footnote 4 of 80 struggling search tasks. We also released the anonymized user logs gathered during task evaluation.

The remainder of this paper is organized as follows. Section 2 presents related work on struggling search and task design. Section 3 introduces the task generation framework and the implementation details of our test bed, TaskGenie. Section 4 provides an overview of the task generation experiment. In Sects. 5 and 6, we evaluate the quality of the generated tasks through user behavior analysis and verify that they conform to the characteristics of struggling search tasks. In Sect. 7, we roughly estimate the cost of our task generation method. Section 8 describes the publicly released task set. In Sect. 9, we discuss the caveats and intuition behind some of the research methods applied in this work. Conclusions and future work are described in Sect. 10.

2 Related Literature

We discuss related work in the following areas: struggling search and task design for struggling search.

Struggling Search. Struggling search describes a situation whereby a searcher experiences difficulty in finding the information they seek [11]. Within a single search, struggling can lead to frustration and dissatisfying search experiences, even if searchers ultimately meet their search objectives [10]. Characteristics of user behaviors have been used to identify whether a user was dealing with a struggling search task: searchers handling struggling search tasks experience difficulty in locating required information, tend to issue multiple similar queries and exhibit quick-back clicks as they cycle through attempts to find useful information [2, 10]. Struggling search has been studied using a variety of experimental methods, including log analysis [20], laboratory studies [2] and crowdsourced games [1]. Hassan et al. studied how to detect and support struggling search by extracting search sessions from real user logs [12, 20]; Aula et al. evaluated the influence of task difficulty on struggling search behaviors by setting up a small-scale laboratory experiment and an IR-based online study [2]. We evaluate the quality of the generated tasks by analyzing the user behaviors they elicit, based on behavioral features that have been shown to be useful for identifying struggling search [10, 12].

Task Design for Struggling Search. Researchers in sense-making have found that users suffer difficulty when there is an information gap between what they know and what they want to know [21], as they can seldom describe their questions clearly or find a way to get close to the answer. This sheds light on task design for struggling search tasks: key information or the task-solving strategy should not be given directly by the task. It has also been found that task complexity can increase task difficulty and thus affect learners’ perceptions of struggling [22]. On the other hand, task difficulty has been viewed from both objective and subjective perspectives [17]. From the subjective perspective, the same task can be difficult and complex for someone without background knowledge while being easy for an expert in the related domain [2, 5]. To some extent this indicates that task design for struggling search should either avoid cases that are highly influenced by domain knowledge or cover as many topics as possible. From the objective perspective, task difficulty can be related to task characteristics and independent of task performers, which has been supported by other works [6]: a task with unknown goals and an unexplored information space, accompanied by uncertainty and ambiguity, leads to high task complexity, in turn resulting in users struggling [17]. Drawing inspiration from previous work [27], we propose an online task generation framework for generating struggling search tasks at scale that cover various knowledge domains and are objectively difficult.

3 Task Generation Framework

3.1 Intuition and Method

We focus on a particular type of search task that exhibits search behavior suggestive of struggling: fact finding/checking tasks (“Looking for specific facts or pieces of information” [13]). Struggling search tasks differ from typical information retrieval tasks in that typical informational search tasks are more like information locating problems, which are well-defined, systematic and routine [26]. For example, consider the following struggling search task: “Dave Matthews Band owns a studio in Virginia, the studio is located outside of Charlottesville and it’s in the mountains. What is the name of the studio?” [2]. The answer to this question does exist in the document collection, but it cannot simply be matched to search queries or resolved using state-of-the-art information retrieval techniques. Rather, it can only be described using fragmented pieces of information and obtained by searchers through navigating and comprehending content within the information space. A searcher needs to collect relevant information from the documents, comprehend it, reason about it and very often repeat the process for several rounds, until they reach a conclusion with a certain confidence. This process involves information-seeking behavior, including searching, browsing, berry-picking and sense-making [19].

How can we easily find or frame questions with implicit answers at scale? In this paper, we leverage paraphrased sentences, which are abundant in common writing. To create a clear and logical flow while writing, an author tends to perform reasoning narratively, which naturally results in redundancy [7]. For instance, a statement following a causative sentence connector (i.e., a conjunctive adverb) [16], such as “in other words” or “that is to say,” is likely to be a paraphrase which repeats the meaning of the preceding sentence(s) in a more colloquial manner [4]. In theory, the information conveyed by the paraphrased sentence can be recovered by a searcher who has read through the preceding content; removing the paraphrased sentence therefore causes no information loss. The sentences following such connecting phrases are typically declarative statements, so it is straightforward to turn them into questions, with the statement containing the answer.

For example, in Fig. 1, we can hide the underlined sentence and turn it into a question: “Does Polypteridae belong to the Actinopteri?” (since “Polypteridae” and “Actinopteri” appear elsewhere in the article in different forms). By hiding the specific sentence that contains the answer, the answer is no longer directly identifiable through information locating. A searcher may identify text fragments like “Polypteridae” and “Actinopteri” as starting points. However, to understand their relation and answer the question, the searcher may need to know more and is therefore forced to explore the Web or Wikipedia further.

Fig. 1

Example of a paraphrased sentence in Wikipedia

This inspires us to generate struggling search tasks through the following steps:

1. Identify a paraphrased sentence;

2. Hide it from the document;

3. Create an informational question based on the given paraphrased sentence.

Since the answering sentence is hidden from the document, it is hard to obtain the answer through direct information locating. Moreover, because the paraphrased sentence usually lacks an accurate description or explanation of the information points it entails, a task generated from it simulates a real-life situation in which people have incomplete prior knowledge or means to meet their information need. This elicits a searcher’s struggling search behavior: the searcher may start from arbitrary documents that seem relevant, browse through parts of or the whole collection and reason about the possible answer. If the searcher is unfamiliar with the topic, they have to learn about it, since answering the question requires comprehension of related knowledge. Meanwhile, as the hidden sentence contains only redundant information, the searcher should be able to find the answer eventually.
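As an illustration of step 1, the following Python sketch locates sentences introduced by the two causative connectors used in this work. It is a minimal sketch of ours, not the platform’s implementation; the naive sentence splitting and the fixed connector list are simplifying assumptions.

```python
import re

# Causative sentence connectors treated as paraphrase indicators in this work.
CONNECTORS = ("in other words", "that is to say")

def find_paraphrased_sentences(article_text):
    """Return (index, sentence) pairs for sentences containing a connector.
    Sentences are split naively on end-of-sentence punctuation."""
    sentences = re.split(r'(?<=[.!?])\s+', article_text)
    return [(i, s) for i, s in enumerate(sentences)
            if any(c in s.lower() for c in CONNECTORS)]

sample = ("The family has retained several ancestral traits. "
          "In other words, it is considered a basal lineage of its class.")
print(find_paraphrased_sentences(sample))  # [(1, 'In other words, ...')]
```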

3.2 TaskGenie: A Crowd-Powered Platform

Based on the task generation method, we built an online platform for task generation called TaskGenie, aiming to (i) generate struggling search tasks through crowdsourcing and (ii) study user behavior within the generated struggling search tasks. As shown in Fig. 2 (see Footnote 5 for further details of Fig. 2), the platform operates in two phases: Task Generation, facilitating the creation of new struggling search tasks; and Task Completion, facilitating search experiments for solving the tasks.

Fig. 2

TaskGenie interface; a user guidelines regarding task completion; b search interface in task completion; c user guidelines regarding task generation; d task completion manual; e interface of the platform in task generation

Task Generation. For task generation, users are first guided to choose a conjunctive phrase from a drop-down list (Fig. 2: e*; “in other words,” “that is to say”). They are then presented with a filtered set of articles that contain (highlighted) statements with these conjunctive phrases (Fig. 2: e). Users are asked to understand the highlighted sentence in its article context and grasp the information it contains. Finally, they are asked to create a question based on the paraphrased sentence and to provide the answer and the source page of the question (Fig. 2: e**). Assuming that a task generated from a paraphrased sentence is closely related to its surrounding context, we automatically save the paraphrased sentence and its context (i.e., the two sentences preceding the paraphrased sentence) as the supporting information for the answer to the generated question.
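The supporting-information capture described above can be sketched as follows; the record fields and the sentence splitting are illustrative assumptions, not the platform’s actual schema.

```python
import re

def build_task_record(article_text, paraphrase_index, question, answer, source_page):
    """Bundle a generated question with the paraphrased sentence and the two
    preceding sentences, which serve as supporting information for the answer."""
    sentences = re.split(r'(?<=[.!?])\s+', article_text)
    return {
        "question": question,
        "answer": answer,
        "source_page": source_page,
        "paraphrased_sentence": sentences[paraphrase_index],
        "supporting_context": sentences[max(0, paraphrase_index - 2):paraphrase_index],
    }
```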

Task Completion. We present users with a generated task in the form of a question (Fig. 2: d*) that can be answered using a search engine (Fig. 2: b). Tasks are pulled randomly from our database, while a background mechanism ensures that each task is ultimately attempted an equal number of times. Users who do not like the task assigned to them can choose to change it, but only once. Users are tasked with finding the answer to the question by searching with our search engine. To ensure that users are genuinely invested in reasoning, understanding and finding the correct answer and are not merely guessing, we ask them to provide a justification in an open text field that supports their answer. Users are encouraged to copy and paste excerpts that provide evidence or justify their answers. Finally, we collect users’ opinions of the search task they completed from the following perspectives (Fig. 2: d**): (a) Task Qualification (whether or not users found the question difficult in comparison with their usual experience of searching the Web or Wikipedia for answers); (b) Task Difficulty Score (how difficult/complex users found the question to be). We divide the task difficulty scale into five equal parts with the following labels and corresponding score intervals on a sliding scale of 1–100: Easy (1–20), Moderate (21–40), Challenging (41–60), Demanding (61–80), Strenuous (81–100). Users select the task difficulty level and indicate an exact score using the scrollbar. We then ask users to indicate the reasons why they found the question difficult, with options (checkboxes) drawn from previous work analyzing struggling search [18]. To prevent forced choices in case users did not find the task difficult, they could select a checkbox labeled “Not Difficult.”
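The five-level scale maps onto the 1–100 slider as in the following small sketch (the function name is ours).

```python
def difficulty_label(score: int) -> str:
    """Map a 1-100 difficulty score to the five-level scale used in TaskGenie."""
    if not 1 <= score <= 100:
        raise ValueError("score must lie between 1 and 100")
    labels = ["Easy", "Moderate", "Challenging", "Demanding", "Strenuous"]
    return labels[min((score - 1) // 20, 4)]

assert difficulty_label(20) == "Easy"
assert difficulty_label(61) == "Demanding"
```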

3.3 System Implementation

Pluggable Web Search Engines. As a platform for task generation and evaluation, TaskGenie is designed to be compatible with mainstream web search engines (e.g., Google, Bing) that provide a standardized search API. These search engines can be plugged into TaskGenie as a backend search system to support task generation and be evaluated during task completion. In this paper, the Bing Web Search API is used in the experimental study.
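The pluggable design can be pictured as a minimal backend interface such as the sketch below. The class and method names are ours for illustration, and the concrete HTTP call to the Bing (or Google) endpoint is deliberately left out rather than guessed.

```python
from abc import ABC, abstractmethod

class SearchBackend(ABC):
    """Interface a web search engine must expose to be plugged into TaskGenie."""

    @abstractmethod
    def search(self, query: str, site: str = None) -> list:
        """Return ranked results, each with at least a title, URL and snippet.
        `site` optionally restricts the search domain (e.g., 'en.wikipedia.org')."""

class BingBackend(SearchBackend):
    def __init__(self, api_key: str):
        self.api_key = api_key  # subscription key for the Bing Web Search API

    def search(self, query, site=None):
        if site:
            # Domain restriction is expressed with the standard site: operator.
            query = f"site:{site} {query}"
        # The actual request to the Bing Web Search API endpoint is omitted here.
        raise NotImplementedError("issue the Bing Web Search API request")
```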

Domain Controlling and User Activity Logging. In TaskGenie, the search domain can easily be adjusted to support searching through different domains. For example, we set Wikipedia as the domain for task generation (i.e., retrieving all webpages containing paraphrased sentences from Wikipedia) and the entire Web as the domain for task completion. During task generation and completion, we also logged worker activity on the platform, including queries, clicks, key presses, etc., using PHP/Javascript and the jQuery library.

DOM Processing. During the task generation phase, it is useful to highlight paraphrased sentences to make it more convenient for searchers to locate a target sentence. During the task completion phase, on the other hand, it is essential to hide the direct answers in the retrieved documents in order to emulate a struggling search situation. Thus, in the two phases we need to either highlight or hide the paraphrased sentences. Drawing inspiration from previous work Footnote 6, we implement this by filtering and manipulating the DOM using Javascript. Given a retrieved webpage (DOM), we access all its child nodes recursively and match a regex of causative sentence connectors (“in other words,” etc.) against the content of each node. The matched sentences are then either hidden or transformed into a different sentence according to their syntax.
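The authors implement this step with Javascript on the live DOM; the Python sketch below (using BeautifulSoup, purely for illustration) mirrors the same logic of matching connector phrases in text nodes and dropping the matched sentences for the task completion view.

```python
import re
from bs4 import BeautifulSoup

CONNECTOR_RE = re.compile(r"\b(in other words|that is to say)\b", re.IGNORECASE)

def hide_paraphrased_sentences(html: str) -> str:
    """Blank out sentences containing a paraphrase connector in every text node,
    emulating the hidden-answer view shown during task completion."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=CONNECTOR_RE):
        sentences = re.split(r"(?<=[.!?])\s+", str(node))
        kept = [s for s in sentences if not CONNECTOR_RE.search(s)]
        node.replace_with(" ".join(kept))
    return str(soup)
```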

4 Task Generation

4.1 Wikipedia: Paraphrased Sentences

There are plenty of online archives and wikis. In this work, we choose Wikipedia as the source for our struggling search task generation framework, and “in other words” and “that is to say” as the conjunctive phrases used to identify paraphrased sentences. Wikipedia is one of the richest sources of encyclopedic information on the Web and generates a large amount of traffic. Prior work has highlighted the variety of factors that drive users to Wikipedia [24]. We explored the entire English Wikipedia (2018 version) and found 10,824 articles with, on average, one occurrence of the paraphrase connector “in other words,” and 2195 articles with “that is to say.” These findings suggest that Wikipedia is a good source of paraphrased sentences which can serve in the creation of difficult search tasks across diverse topics.

4.2 Study Design

We recruited 200 participants from Figure8 Footnote 7, a premier crowdsourcing platform. At the onset, willing participants were informed that the task entailed “generating a task for others within the Wikipedia domain.” Participants were then redirected to the external platform, TaskGenie, where they completed the mission. We logged all worker activity on the platform. During the task generation process, TaskGenie presents criteria to help users control the quality of the generated question. We urged users to ensure that (1) the selected sentence is a paraphrased sentence that contains enough information for creating a question; and (2) they search for the answer on Wikipedia to ensure that the generated question is challenging, meaning that although the answer cannot be found easily, it can eventually be obtained through searching and exploring. We incentivized participants to strictly adhere to these criteria by rewarding them with a post hoc bonus payment if they successfully created a struggling search task.

We restricted the participation to users from English-speaking countries to ensure that they understood the instructions adequately. On successfully creating a task, users received a mission completion code which they could then enter on the Figure8 platform to receive their monetary rewards.

4.3 Task Collection

To ensure the reliability of generated tasks, we filtered out participants in this phase using the following criteria:

i. Participants who did not follow the required syntax when creating a question during task generation. Since the aim of this phase is to generate a readable question (we described the basic syntax of a question in our instructions), those who did not meet this criterion were discarded.

ii. Participants who created questions lacking a self-sufficient description in the way the question is phrased (for example, “Reincarnation is possible?”) or who generated random questions ignoring the paraphrased sentence in the source page (for example, “Is Wikipedia the best page to find anything?”).

Using the aforementioned criteria, we filtered out 65 task generation cases, resulting in 135 generated tasks. For these 135 tasks, we hired two students to search for the answer of each task on the Web. We eliminated 55 tasks that were either duplicates or whose answer could be found within two interactions with the search system. We finally obtained 80 tasks that qualified as struggling search tasks.

5 Evaluation I: Task Characteristics

To evaluate the quality of the generated tasks and validate that they are struggling search tasks, we first examine the session-level features of users’ search behaviors shown to be useful for identifying struggling search in previous work [11, 20]: topical characteristics, query characteristics, click characteristics and task difficulty.

5.1 Study Design

To validate whether the generated tasks are struggling search tasks and are generally suitable for the study of struggling search, we conducted a web search experiment using the set of 80 generated tasks.

Through Figure8, we recruited 400 Level 3 participants (260 males and 140 females, with ages ranging from 18 to 57 years). Workers willing to participate in the web-based task evaluation experiment were asked to “search for the answer of a given task” using our platform, TaskGenie: Task Completion. For the web search experiment, we base the TaskGenie search system on the Bing Web Search API and set the search domain to the entire Web. We logged user activity throughout task completion. Using the task filtering criteria mentioned before, we filtered out 31 spam participants who entered arbitrary strings for the answer or supporting information, as well as those who did not finish the experiment. The evaluation is thus based on the 369 valid search sessions.

5.2 Topical Characteristics

We analyzed the topical distribution of the tasks and found that tasks on diverse topics could be generated through our task generation module. To categorize the generated tasks, we used the top two levels of categories of Curlie Footnote 8 (i.e., the Open Directory Project; dmoz.org). Assuming that the topic of a task is consistent with the topic of its source wiki page, we categorized the generated tasks by analyzing the topics of their source wiki pages. To this end, we used an automatic URL-based classifier [3] for topic categorization. We assigned the most frequently occurring topic of the source web page as the topic of each generated task.

Figure 3 shows the prevalence of topics among the generated tasks. We note that the task generation domain we chose, Wikipedia, contains few articles that correspond to everyday activities. Thus, only a few generated tasks were about topics spanning our daily lives, such as Shopping and Entertainment. Nevertheless, the generated tasks cover a wide range of topics.

Fig. 3

Percentage of topics in generated tasks (gray) and the corresponding success rate (green) for each topic; the category “Science” is further divided into second-level categories such as “biology” and “astronomy”

For each topic, we measured the success rate of tasks. For each generated task, we regard an answer as successful if the searcher’s answer is correct and the searcher provides meaningful supporting information that corroborates the answer (i.e., the supporting information is semantically similar to that given by the task creator). We evaluated the similarity between the supporting information given by searchers and that given by the task creator using an automatic text-level similarity evaluation method [14]. Of all the search sessions across different topics in our set, around 37% correspond to successful cases, which is slightly lower than the rate observed in real user logs (i.e., 40% in [20]). As shown in Fig. 3, the success rate varied across topics, ranging from 25% in “World” to 48% in “Science: Astronomy.”
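The similarity method of [14] is not reproduced here; as a hedged sketch, the check below uses TF-IDF cosine similarity as a stand-in, and both the exact-match answer check and the 0.5 threshold are our simplifying assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def is_successful(answer, gold_answer, support, creator_support, threshold=0.5):
    """A session counts as successful if the answer matches the gold answer and
    the supporting text resembles the task creator's supporting information."""
    if answer.strip().lower() != gold_answer.strip().lower():
        return False
    tfidf = TfidfVectorizer().fit_transform([support, creator_support])
    return cosine_similarity(tfidf[0:1], tfidf[1:2])[0, 0] >= threshold
```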

According to the type of answer that satisfies a given task, we further analyzed the generated tasks from two standpoints: yes/no tasks (37 in total; the answer to 19 of them is “yes” and to 18 of them is “no”) and fact finding tasks (43 in total). A two-tailed T-test comparing the success rates of the two types of tasks did not reveal a significant difference. We also found no significant difference between tasks generated from “in other words” and those generated from “that is to say.”

5.3 Query Characteristics

It has been found that searchers’ struggling is reflected in their queries [2, 10]. We examine the characteristics of the queries elicited by the generated tasks, focusing on query features (i.e., number of queries, query length) and query transition features (i.e., query similarity, query reformulation), which have been shown to be useful for determining struggling search sessions [11, 12].

Query Features. Users in general issue more queries to handle a struggling search task [11, 20]. On average, sessions for the generated tasks comprised 5–6 queries (\(M=5.55\)) with an average query length of 6 terms. Successful task solving sessions (5.48 queries, 4.76 terms per query on average) were slightly shorter than their unsuccessful counterparts (5.72 queries, 6.78 terms per query on average). We present an example to illustrate the queries within a search session.

Fig. 4

Samples of search sessions in user logs

Figure 4 shows a sample of how a searcher moved through a session to solve the task “Is a flowering plant a fruiting plant?”. We note that to solve a task generated in this work, a searcher generally issued even more queries, with longer query lengths, than the “3–4 queries averaging around 4 terms per query” observed in daily-life struggling search logs in previous work [20]. This difference may also be attributed to the difference in the tasks that were studied: the information required by the generated informational tasks is more specific and difficult to resolve than that of the tasks studied in previous works (e.g., finding the source page of a video).

We observed that the first query in both successful and unsuccessful search sessions is typically the task description itself or an excerpted sentence from the task description (8.93 terms on average), which is longer than the intermediate queries (5.81 terms on average) and the final queries (4.18 terms on average). Existing works show that there are generally two cases of first queries that correspond to struggling: (i) the query is too common, being general and ambiguous, or (ii) the query is quite uncommon, being overly specified [20]. From this, we note that a long, over-specified first query does not lead searchers to a target page and might consequently elicit struggling search. However, this struggling does not determine the final success or failure of the whole search session, which is consistent with the outcomes of prior work [20].

Query Similarity. It has been shown that in a struggling search session the later queries can be quite similar to the initial query; users experiencing struggle tend to reformulate queries that closely resemble the initial search [11, 20]. Based on prior works, we expect that in a struggling search task a user thinks of less diversified queries to explore alternatives, so that in the user logs unique terms of the initial query persist through future queries. To examine this, we measure the similarity between queries in a session. The similarity between any two queries \(Q_{i}\) and \(Q_{j}\) is computed using the Jaccard index:

$$\begin{aligned} \frac{\left| Q_{i} \cap Q_{j}\right| }{\left| Q_{i} \right| +\left| Q_{j}\right| -\left| Q_{i} \cap Q_{j}\right| } \end{aligned}$$
(1)

where \(\left| Q_{i}\right|\) is the number of unique terms in query \(Q_{i}\), and \(\left| Q_{i} \cap Q_{j}\right|\) is the number of matched terms in \(Q_{i}\) and \(Q_{j}\).

Before measuring the similarity between queries in a session, we first normalize the queries by lowercasing the query text, deleting stop words, stemming and unifying white space characters. For \(\left| Q_{i} \cap Q_{j}\right|\), we consider two terms to be matched if they are (i) exact matches: the two terms match exactly; or (ii) approximate matches: the Jaro–Winkler similarity (score) of the two terms is larger than 0.6. In this work, we only consider lexical query similarity. Assuming that users can seldom find alternative terms for the concepts or information points in the generated tasks without learning through searching, we do not consider semantic matches (i.e., two queries matching when their semantic similarity exceeds a certain threshold [11]). Figure 5 shows the average similarity of each query to the first query. We found that in both successful and unsuccessful search sessions, searchers generally issue similar queries in the first three rounds, which is consistent with the outcomes of the previous studies mentioned earlier [11, 20]. We found that in successful sessions, queries gradually become less similar to the initial query as the search progresses (though the difference was not statistically significant in a two-tailed T-test at the 0.05 level). Prior work established that struggling searchers cycle through queries as they attempt to conceive a correct query to locate target information (i.e., the query similarity in struggling search sessions is generally greater than 0.4) [11]. Our findings corroborate that struggling manifests during users’ quest to satisfy their information need, even if they finally succeed in their search missions.
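A sketch of Eq. (1) with the lexical matching described above; the jellyfish library is assumed for Jaro–Winkler similarity, and stop-word removal and stemming are elided for brevity.

```python
import jellyfish  # assumed dependency providing jaro_winkler_similarity

def lexical_match(t1, t2, threshold=0.6):
    """Exact match, or approximate match via Jaro-Winkler similarity > 0.6."""
    return t1 == t2 or jellyfish.jaro_winkler_similarity(t1, t2) > threshold

def query_similarity(q_i, q_j):
    """Jaccard index of Eq. (1) over unique, lexically matched query terms.
    Normalization here is only lowercasing; stop words and stemming are omitted."""
    terms_i, terms_j = set(q_i.lower().split()), set(q_j.lower().split())
    matched = sum(any(lexical_match(a, b) for b in terms_j) for a in terms_i)
    return matched / (len(terms_i) + len(terms_j) - matched)

print(query_similarity("flowering plant fruit",
                       "is a flowering plant a fruiting plant"))  # 0.6
```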

Fig. 5

Avg. query similarity in each step

Fig. 6

Avg No. of clicks per query

Query Reformulation. We delve into how users carry terms from one query to another in web search. We consider the three main query transition types used in previous works [11]: Term Addition: \(\ge\) 1 word added to the first query; Term Removal: \(\ge\) 1 word removed from the first query; Term Substitution: \(\ge\) 1 word substituted with another lexically matched term. Term matching is done using the lexical matching described earlier.

We found that term removal is generally the most popular strategy; almost all search sessions contain term removal cases. This can be explained by the fact that users consumed the task description prior to beginning the search session. Due to the nature of Wikipedia, most generated tasks pertain to topics that people may not encounter in their daily life. Thus, we reason that most people struggled to come up with alternative terms to describe the vague information need in the tasks. In such cases, over 2 terms were removed on average in the last query (\(M=2.41, SD=1.89\)). The high standard deviation can be explained by differences between the generated tasks. For instance, a task with a long (short) description of the information need could elicit a long (short) initial query, finally converging to a few keywords. Term substitution occurs more frequently in successful sessions than in unsuccessful sessions (though not statistically significantly, \(p = 0.052\)), which is consistent with previous work [11].
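The three transition types can be detected along the following lines; this is a simplified sketch of ours (the exact operational definitions in [11] may differ), reusing Jaro–Winkler matching via the assumed jellyfish library.

```python
import jellyfish

def term_match(t1, t2, threshold=0.6):
    return t1 == t2 or jellyfish.jaro_winkler_similarity(t1, t2) > threshold

def reformulation_types(prev_query, next_query):
    """Label a query transition with term addition/removal/substitution."""
    prev_terms = set(prev_query.lower().split())
    next_terms = set(next_query.lower().split())
    added = {t for t in next_terms if t not in prev_terms}
    removed = {t for t in prev_terms if t not in next_terms}
    # A removed term replaced by a lexically similar added term is a substitution.
    substituted = {(r, a) for r in removed for a in added if term_match(r, a)}
    labels = set()
    if substituted:
        labels.add("term substitution")
    if added - {a for _, a in substituted}:
        labels.add("term addition")
    if removed - {r for r, _ in substituted}:
        labels.add("term removal")
    return labels

print(reformulation_types("flowering plant fruiting plant",
                          "flowering plants produce fruit"))
```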

5.4 Click Characteristics

Prior works have shown that searchers experiencing “struggle” tend to exhibit no click actions or quick-back clicks (i.e., result clicks with a dwell time less than 10 s [15]) after certain queries [2, 11, 20]. This has been attributed to the difficulty experienced in locating target information. We examine the characteristics of users’ clicks on the SERPs in search sessions pertaining to the generated tasks.

On average, searchers exhibited 1.67 clicks after each query (\(M=1.67, SD=1.49\)), and over 62% of search sessions contained quick-back clicks. We further computed the average number of clicks for the sequence of queries in a session. Figure 6 shows the average change in the number of user clicks per query. We found that within the initial 4 queries there is no significant difference between successful and unsuccessful sessions in terms of the average number of clicks per query, while the difference becomes more pronounced thereafter. In particular, searchers in unsuccessful sessions issued fewer than 1 click on average after their last two queries. This is consistent with previous work, which also found that users in struggling search tasks tend to give up clicking on post-query results for the final query of an unsuccessful session [20]. From the click characteristics, we find that when solving the generated tasks, users exhibit the click patterns of struggling, part of which can indicate eventual mission failure.
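Quick-back clicks can be counted as in the sketch below, assuming click logs that record when a result was clicked and when the user returned to the SERP; the 10-second dwell threshold follows [15].

```python
QUICK_BACK_DWELL = 10.0  # seconds, following [15]

def quick_back_session_ratio(sessions):
    """Fraction of sessions containing at least one quick-back click.
    Each session is a list of (click_time, return_time) pairs in seconds;
    return_time is None when the user never came back to the SERP."""
    def has_quick_back(clicks):
        return any(ret is not None and ret - clk < QUICK_BACK_DWELL
                   for clk, ret in clicks)
    if not sessions:
        return 0.0
    return sum(has_quick_back(c) for c in sessions) / len(sessions)

sessions = [[(0.0, 4.2), (30.0, 95.0)], [(0.0, 60.0)]]
print(quick_back_session_ratio(sessions))  # 0.5: only the first session qualifies
```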

In contrast to our findings, Hassan et al. found that after several rounds of queries without locating any target information, struggling searchers tend to click on more results [11]. These contrasting findings can be explained by the difference in task types and difficulty levels. The generated tasks in our setup are generally fact finding tasks with unambiguous final goals, while the tasks in previous works are more akin to open-ended exploratory tasks (e.g., “software purchase advice,” “career development advice”).

5.5 Task Difficulty Analysis

Complementing the analysis of objective user behavior, we also investigate searchers’ subjective perception of task difficulty. In general, participants scored task difficulty at 57 on average (\(M=57, SD=17\)), which means the tasks are in general challenging yet not demanding. We note that all participants agreed that these tasks are much more difficult than typical IR tasks; among them, 77% of searchers thought the given tasks were more difficult in comparison with their general web search experience, rating task difficulty at 61 on average (i.e., demanding; \(M=61, SD=13\)).

Based on the reasons collected from previous work [18], we investigated, through self-reports, why the tasks made users perceive a “struggling search” experience during web search. Figure 7 illustrates the overall impact of the different reasons that contribute to users experiencing a “struggle” while completing the generated tasks across the entire Web. We found that the top 3 reasons cited for task difficulty were (1) task complexity, wherein workers believed that there were several components of the task that needed to be addressed; (2) difficulty in finding useful pages, wherein searchers encountered difficulty locating proper web pages to acquire the target information; and (3) specific requirements, wherein the struggle was due to the information need being so specific that it became more difficult to satisfy. The reasons spread across various aspects, including task features (40%), user aspects (26%), the interaction between user and system (24%), and the readability of documents (10%).

Fig. 7

Overview of the reasons why workers struggled in web search. Reasons are collected from 4 standpoints: task features (a, b); user aspects (c, d); user–system interaction (e, f); and document features (g, h)

We found that within the Wikipedia domain, paraphrased sentences are generally distributed across curated articles about history, literature, physics, biology, etc., which people may not encounter in daily life. Thus, we observe that the generated tasks depend on users’ subjective knowledge rather than on general scenarios that one may encounter in everyday life, which increases the task difficulty for most users; the information need of the generated tasks also requires users to process varied information from different perspectives. Moreover, the self-reported difficulty reasons indicate that expanding the search domain increases the difficulty of locating useful pages to satisfy the information need (note that searchers were unaware that the source for all generated tasks was Wikipedia).

We also analyzed the influence of the reasons on users’ perception of struggling. Results of a generalized linear regression indicate a collectively significant effect of the reasons on users’ perception of struggling in the web search experiment (\(\chi ^2 = 83.1, p < .01\)). Examining the individual predictors further, complexity (\(t=4.19, p<.001\)), specific requirements (\(t=1.57, p<.05\)), domain knowledge (\(t=2.03, p<.05\)), difficulty in finding useful pages (\(t=6.88, p<.001\)) and too much information (\(t=4.36, p<.001\)) were significant predictors in the model, while searchers’ poor learning experience, the system performance and whether the target document is hard to read were not key factors influencing users’ struggling experience.
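Such a regression could be run along the lines of the sketch below, assuming a pandas DataFrame with one row per session, binary reason indicators and the difficulty score as the response; the Gaussian-family GLM and the column names are our illustrative choices, not necessarily the exact model used above.

```python
import pandas as pd
import statsmodels.api as sm

def fit_reason_model(df: pd.DataFrame):
    """Regress perceived difficulty (1-100) on binary self-reported reasons.
    `df` is assumed to hold a 'difficulty' column plus one 0/1 column per
    reason, e.g. 'complexity', 'specific_requirements', 'useful_pages'."""
    reasons = [c for c in df.columns if c != "difficulty"]
    design = sm.add_constant(df[reasons].astype(float))
    model = sm.GLM(df["difficulty"].astype(float), design,
                   family=sm.families.Gaussian())
    return model.fit()  # .summary() lists per-reason coefficients and p-values
```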

6 Evaluation II: Stability of Task Characteristics

Given our findings in Evaluation I, we conclude that the generated tasks conform to the characteristics of struggling search tasks. In Evaluation II, we further verify the quality of the generated tasks when users apply advanced operations during search. In other words, we want to make sure that the tasks do not turn into simple look-up/information locating tasks once users apply advanced search operators of search engines Footnote 9, such as “site:” (narrows the search domain) and “AND” or “&” (finds webpages that contain all the terms or phrases).

6.1 Study Design

Since the tasks were generated directly from a confined domain (i.e., Wikipedia), we were interested in investigating whether struggling search behavior can be elicited even within such a confined domain. To this end, we studied user behavior during task completion in the Wikipedia domain and compared it with the behavior of users solving typical IR tasks in the same domain.

Tasks We randomly selected 10 tasks from the generated task set for the confined-domain experiment of searching on Wikipedia. For comparison, we chose 10 IR tasks selected randomly from the TREC 2014 Web Track dataset Footnote 10 and used in previous work by Gadiraju et al. to study user behavior [9]. Table 1 presents examples of the selected seed tasks. All tasks (struggling and traditional IR tasks) are made publicly available Footnote 11.

Table 1 Examples of generated tasks and traditional IR tasks

Study Procedure We recruited 200 Level 3 participants (63 females and 137 males, aged from 18 to 57 years) from Figure8, which means each task was solved by 10 users. Participants were redirected to the external platform, TaskGenie: Task Completion, and asked to “search for the answer of a given task within the search system.” During the experiment, for both types of tasks, participants were explicitly informed that they could use the advanced keyword operators provided by the search engine (based on the Bing API). We logged all user activity within the platform.

We discarded 16 users who did not enter an answer or supporting information, or who entered arbitrary strings in either text field. The analysis and evaluation are thus based on the 184 search sessions (96 struggling search sessions and 88 typical IR sessions).

6.2 Results

6.2.1 Queries and Advanced Operators for Queries

On average, users entered 3 queries (2 distinct queries) with an average query length of 8 terms in a struggling search session, while they issued fewer than 2 queries with an average query length of 4 terms in a typical IR search session. A two-tailed T-test showed a significant difference between the generated tasks and typical IR tasks in terms of the number of queries entered by users; \(t(182) = 2.71, p < 0.05\). Though explicitly informed about them, few participants in either type of task solving session applied the advanced query operators in practice. This may be caused by our experimental setting for the search domain: we constrained users’ search domain to Wikipedia, which means the search results were already highly filtered and refined. As a consequence, users did not need to perform many advanced operations to filter or refine search results while issuing queries.

In general, almost all users in the struggling search experiment issued repeated queries or queries with high similarity (i.e., Jaro–Winkler \(> 0.6\); see Sect. 5.3), which suggests that users struggled during search. Similar to our findings in Sect. 5.3, we found that in more than 79% of struggling sessions, the last query (M = 5.86 terms) entered by users is much shorter than the first query (M = 9.72 terms) of the session; \(t(94) = 1.82, p < 0.05\). However, in typical IR sessions, there is no such significant evolution of query length.

Compared to our findings in Sect. 5.3, we note that the number of queries issued in a constrained domain is smaller than on the whole Web. This indicates that users may struggle less with query reformulation in a constrained domain. However, for the generated tasks users still need to fire more than 3 queries to reach the answer in such a constrained domain, whereas they only need around 1 query to solve typical IR tasks. Based on these results, we reason that even in a constrained search domain, the generated tasks still conform to the characteristics of struggling search tasks in terms of query features, while the task difficulty is reduced compared to the whole Web.

6.2.2 Click Characteristics

We analyzed users’ clicks on the results corresponding to each query. We found that even though the search results were narrowed down, the generated tasks still elicited users’ struggling click behaviors.

We noticed that on average users fired 1 click per query. Even in the constrained search domain, 68% of search sessions (65 out of 96 generated task search sessions) contained quick-back clicks, which is even higher than the ratio of quick-back clicks we found in the Web experiment (Sect. 5.4, 62%). For search sessions containing two or more post-query clicks, the average click interval for struggling search tasks (\(M = 29.59, SD= 7.72\)) is significantly shorter than that for typical IR tasks (\(M = 109.16, SD = 42.39\)); two-tailed T-test, \(p<.05\). This further corroborates that quick-back clicks happened more frequently when solving the generated tasks than when solving typical IR tasks. Besides, we found that for typical IR tasks, participants could generally obtain the information through one click on the SERPs and did not need to navigate further through links in the resulting Wikipedia pages. For the generated struggling search tasks, however, participants sometimes still needed to click and navigate from one Wikipedia page to one or more other Wikipedia pages (no. of navigated pages: \(M = 1.88, SD = 1.02\)). We conducted a two-tailed T-test to compare the amount of navigation across the task types, and the results show a significant difference: \(t(182) = 1.96, p < .05\). From this, we note that for the generated struggling search tasks, even after locating the target information, users still need to become familiar with the statements and concepts in the related context, which increases the difficulty of the generated tasks.

6.2.3 Perception of Task Difficulty

The perceived task difficulty of typical IR tasks was found to be 35 (M=35, SD=25), indicating a moderate level of perceived difficulty. In contrast, the perceived difficulty of the generated tasks was found to be 56 (M=56, SD=22), indicating that users perceived the generated tasks to be challenging yet not too demanding to handle. We also noticed that, compared to the experiment on the whole Web (Sect. 5.5, task difficulty \(= 57\)), the average task difficulty score did not vary much when the search domain was confined. This suggests the stability of the generated struggling search tasks in terms of users’ task difficulty perception. A two-tailed T-test revealed a significant difference in users’ perception of task difficulty between the generated struggling search tasks and typical IR tasks; \(t(183) = 4.97; p < 0.01\).

7 Evaluation III: Task Generation Cost

Finally, we estimate the cost of the proposed task generation method to provide a reference for experimental setups in future work. We investigate the cost from two aspects: participants’ effort (behaviors) and payment.

We found that on average a task can be generated within 12 minutes Footnote 12 by the crowdworkers from Figure8. In this experiment, after choosing the causative sentence connector (i.e., the conjunctive adverb “in other words” or “that is to say”), users fired around 1.4 clicks on the SERPs of Wikipedia (i.e., the source for generating struggling search tasks in this paper) to find a proper paraphrased sentence. We found that the average number of clicks fired by participants who were given “in other words” (\(M = 1.17, SD = 1.04\)) was lower than that fired by participants who were given “that is to say” (\(M = 1.94, SD = 0.98\)). Besides, our investigation in Sect. 4.1 shows that Wikipedia contains many more articles with “in other words” than with “that is to say.” Based on these results, we reason that in Wikipedia the paraphrased sentences starting with “in other words” are a better source for generating struggling search tasks. After locating the paraphrased sentence, users navigated to around 3 webpages through the links in the context of the paraphrased sentence to learn about it and transform it into a reasonable question. We did not find any significant difference in the number of navigation clicks between participants who were given “in other words” and those who were given “that is to say.”

On Figure8, we compensated all 200 users at an hourly rate of 7.5 USD (i.e., 1.5 USD per task) to generate struggling search tasks, which cost a total of 300 USD. We note that the payment for each task does not exceed the price range of the current crowdsourcing market Footnote 13. Compared to previous task generation methods that require experts or professionals of certain areas and result in small-sized task sets [2, 23, 25], the method proposed in this paper can generate a large number of struggling search tasks through crowdsourcing and is shown to be cost-effective.

8 Publicly Released Task Set

For the benefit of the community, along with the TaskGenie platform, we also publicly release the generated task set and the (anonymized) user behavior logs gathered in our user study. We consolidated the 80 generated tasks with the following attributes: question, answer, source page (i.e., suffixes of the sharing URL “https://en.wikipedia.org/wiki/”), task type (i.e., “yes/no” or fact finding), task topic (i.e., the ODP categories), task difficulty level (i.e., according to the average difficulty score) and success rate. Figure 8 presents some samples of the generated struggling search tasks. The complete task set is available online (the URL is provided in the Introduction). This task set can be used to reliably simulate struggling search among users. For each task, we provide the basic success rate and task difficulty level, which can be used in the development and evaluation of methods to support users while they struggle in search tasks. We also provide the user behavior data collected in this work, including queries, clicks, etc. Moreover, our proposed framework can be used to generate struggling search tasks as per the topical/domain-related needs at hand.

Fig. 8

Struggling search tasks generated using TaskGenie

9 Discussion

Why do we need “humans”? Although paraphrased sentences are a good source for creating difficult questions, framing these questions automatically is far more challenging due to the variety in paraphrased sentences and their contexts; existing methods cannot automatically generate struggling search tasks in this manner. Humans, on the other hand, can easily identify those paraphrased sentences which are suitable for the creation of struggling search tasks. TaskGenie allows us to collect and study user behavioral logs while users solve struggling search tasks, and also supports the generation of struggling search tasks. Note that TaskGenie can easily be customized to execute only a single phase (task completion or task generation) if desired.

Effects of the document collection. In this work, we chose Wikipedia as the domain for generating struggling search tasks, and for simplicity we only considered paraphrased sentences introduced by the conjunctions “in other words” and “that is to say” as indicators of summarized, redundant information. However, our framework can easily be customized to include other conjunctions concomitant with paraphrased sentences. We also showed that the generated tasks correspond to a variety of topics. Moreover, our framework can readily be used to generate struggling search tasks for specific domains by drawing on the corresponding wikis Footnote 14. These include WikiTravel for traveling and places, tvTrope for television and movies, and WikiNews for news and events, all of which could be a potential source of paraphrased sentences. Thus, we argue that using this framework, a comprehensive struggling search task set that fits domain-related requirements can be realized.

Effects of the retrieval model. In this work, the generated tasks are not quantitatively balanced across topics. However, through a post-study analysis we found that advanced search syntax could help balance the topics of the generated tasks in a task set by locating paraphrased sentences pertaining to specific topics. For example, by issuing a call to the Bing API with the phrase “in other words” and the keyword recreation, restricted to the Wikipedia domain, we could locate Wikipedia articles containing the phrase “in other words” that correspond to the topic of “recreation.” We observed that in the task generation phase, despite instructions encouraging workers to select articles with highlighted paraphrased sentences more arbitrarily and to neglect the ranking order, some participants still selected the top-ranked results. As a consequence, we found a few duplicates among the generated tasks. Nevertheless, we collected 80 distinct tasks generated by users within the task generation framework that adequately elicited struggling search behavior.
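The topic-targeted retrieval described above amounts to composing queries of the following form; the exact operator syntax accepted by the Bing API is assumed, and the keyword choice is illustrative.

```python
def paraphrase_query(connector, topic_keyword, domain="en.wikipedia.org"):
    """Compose a query targeting articles that contain a paraphrase connector,
    restricted to one domain and biased toward a given topic keyword."""
    return f'"{connector}" {topic_keyword} site:{domain}'

print(paraphrase_query("in other words", "recreation"))
# "in other words" recreation site:en.wikipedia.org
```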

Task Pre-filtering Method. In this paper, the authors manually filtered struggling search tasks from the generated set of tasks. A manual task filtering step guarantees the high quality of struggling search tasks, but it becomes progressively more expensive as the task set grows. By analyzing the generated tasks, we note that when struggling search tasks are expressed in natural language, they are potentially more complex from a readability standpoint than typical IR tasks. Using K-means clustering (K=2; Euclidean distance) over two features of the generated tasks, average word complexity and readability, we found that the readability of tasks can be an indicator of struggling search tasks. Such clustering identified struggling search tasks with an accuracy of 80%, providing a pre-filtering method for scalable filtering of generated tasks that can be leveraged in the future.
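A sketch of this pre-filtering step, assuming scikit-learn for clustering and textstat for the readability features; the concrete feature definitions and the decision to keep the harder-to-read cluster are our assumptions.

```python
import numpy as np
import textstat
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def prefilter_struggling_tasks(task_texts):
    """Cluster candidate task descriptions (K=2, Euclidean distance) on average
    word complexity and readability, and keep the harder-to-read cluster."""
    features = np.array([
        [textstat.syllable_count(t) / max(1, len(t.split())),  # avg word complexity
         textstat.flesch_reading_ease(t)]                      # readability
        for t in task_texts])
    scaled = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
    keep = int(np.argmin([features[labels == k, 1].mean() for k in (0, 1)]))
    return [t for t, lab in zip(task_texts, labels) if lab == keep]
```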

We note that the reading comprehension ability of a worker plays an important role in the worker’s understanding of the preceding context and the accurate generation of a struggling search task from a paraphrased sentence. In the current setup, we recruited Level 3 workers from Figure8. However, we reason that, to optimize the generation of struggling search tasks using our framework, one could consider pre-screening crowd workers based on their proficiency in reading comprehension.

10 Conclusions and Future Work

By leveraging the summarized (redundant) information in paraphrased sentences, we proposed a task generation method and implemented it in an online crowd-powered framework. Through our task generation framework, we collected from crowd workers diverse questions with implicit task descriptions and unambiguous answers that can be found by exploring the relevant information space. While this also results in some simple look-up tasks, these can easily be filtered out using existing criteria. We conducted a web search experiment to evaluate task quality based on the characteristics of elicited user behaviors, and an advanced-operator search experiment to test the stability of task quality. We showed that high-quality struggling search tasks can be generated using our framework. We also conducted a concise investigation of the task generation cost of this method and found it to be as cost-effective as other ordinary crowdsourcing experiments. We believe that our framework and task set, together with the insights in this paper, will help in advancing and developing methods to support users in struggling search. In the near future, we will test the struggling search tasks with different search engines and explore a benchmark of how different search engines support such struggling fact finding or checking tasks.