ANADOLU SELÇUKLU DEVLETİ: 25

CHAPTER 1

INTRODUCTION

1.1 Motivation

Searching in a collection of documents is done using indexes of these documents. The simplest approach is that the single index stores information for every single docu- ment, then search query can be run over this index. Apparently, this basic approach is not scalable, as the index grows linearly proportional to the number of documents. Distributing the documents across separate nodes allows conducting search in paral- lel. In this distributed approach, there is a broker node that sends search queries to these nodes with the documents and the corresponding indexes, and then gather the results from them. Broker sends the query to all nodes without making any selection. Therefore, all documents are still be processed by the query. This approach is called as exhaustive search on a distributed architecture.

Allocating the documents in different nodes can be done in various ways. For exam- ple, documents can be grouped by their topical similarity, therefore we can say that related documents will be all together. It is also possible to allocate them according to their source information, i.e. URL information for web based search. If we distribute the documents to the nodes using such representative information, i.e., store topical clusters at each node; then at the broker node, we can estimate which clusters may yield better results for the query (called as the resource selection stage [1]). To do that, broker stores information about all the clusters, which is called as centralized sample index, CSI. This technique, called as selective search [1], does not process all the documents, however it still can be as effective as the exhaustive search.

In search concept, main purpose is to satisfy the user’s expectation, which naturally involves a diversity among different needs. To achieve this, different aspects of the query should be taken into consideration before returning results. However, queries aren’t always clear enough. Because of the ambiguity of the queries, it is difficult to understand user’s actual intention. Search engines use the similarities between the documents and the query for retrieval. However, this approach may lead to result lists that include only similar documents. Therefore, there should be an additional proce- dure to diversify the results to cover different possible user intentions. On the other hand, there is a trade-off between diversity and relevance: covering more aspects than the user’s need can cause including irrelevant documents.

For example, if we consider the web search engines, many users type only couple of words to find the information they are looking for. A user who wants to find something about fruit apple, may forget to add word ‘fruit’ to query, because of that, most probably the user will be misled to ‘Apple the Company’. The lack of more explanatory words like ‘fruit’, make it difficult to match user expectations. For this kind of ambiguous queries, search engines are expected to generate a result set which includes different aspects of the same query. Despite the fact that they should consider query ambiguities, the trade-off between diversity and relevance prevents them to add documents randomly to the result set just because they improve the diversity of the set [2].

For selective search, there are several layers where diversification can be applied. For example, at the broker node, diversification can be applied just before the result set is being returned, or each cluster may try to diversify their partial result set before sending them to broker. In all of these cases; search result diversification can be done either in an explicit way, where search engines have an external knowledge about queries; or implicitly (as will be discussed later). For both cases, the main purpose is to maximize the coverage to solve ambiguity issues and minimize redundancy, while keeping relevance still high.

1.2 Contribution of Thesis

In this thesis, we contribute the literature in several ways, listed as follows.

• We propose alternative ways of creating a centralized sample index (CSI) based on various document properties and evaluate their impact on the resource se- lection and subsequently, overall diversification effectiveness.

• We expand queries using word embeddings and apply diversification to obtain diverse expansion terms, as in [3]. As a novel contribution, we process such expanded queries on the CSI, i.e., during the resource selection stage, to obtain more diverse resources.

• For selective search, we explore the performance of search result diversifica- tion at different layers (i.e., at the broker vs. in the clusters) and at different stages, namely, before resource selection and before/after result merging. To the best of our knowledge, while diversification performance is investigated over a distributed setting for exhaustive search (i.e., with random assignment of documents to nodes), our work is the first to make such a detailed analysis in the context of selective search and by employing representative strategies for both implicit and explicit search result diversification.

1.3 Organization of Thesis

In Chapter 2, we will review the studies in the literature for the result diversification for selective search. This chapter will give background information about algorithms and approaches we adapted for our experiments; i.e., well-known implicit and explicit diversification algorithms, resource selection, query expansion and document alloca- tion techniques used in our experiments. In Chapter 3, we will explain the techniques that are used to create different CSIs and also how we expand the queries using word embeddings. In Chapter 4 we will present different approaches of diversification for selective search on which we build up our experiments. Then, respectively in Chap- ters 5 and 6, detailed information about experiments will be presented. Finally, in Chapter 7, we will conclude our work with a summary and possible future work.

CHAPTER 2

RELATED WORK

In this chapter, firstly we will present selective search and its document allocation and CSI generation steps that we also adapted for our work. We will also explain different studies for resource selection phase of selective search. Then, we will present various diversification approaches studied in the literature so far. Finally, we will discuss word embeddings and its usage to extend queries.

2.1 Selective Search

Selective search is an approach to search in very large scale environment. Partitioned documents are splitted into clusters. In the center of the selective search, there is a broker which technically manages the process. It sends the query to the clusters. In Figure 2.1, broker is responsible to send the query to selected clusters, that are 1 and

3. However, to decide which clusters to send, it first runs the query in centralized sampled index. After sending the query to selected clusters, broker then collects and merges the results. This technique, which was previously known as a cluster based retrieval [4] aims to reduce resource usages while keeping the accuracy of the results.

Kulkarni and Callan [1]’s studies show that selective search approach can be used instead of exhaustive search, since overall effectiveness of the resultant sets are not so different. Main advantage and purpose of the selective search is reducing the overall time spent by the search engine. In next sections, we will respectively take a look at the studies in clustering the documents, choosing a centralized index for them, and finally selection techniques over these centralized sample indexes.

Query

Run query

Broker CSI

Resource Selection

Select Blue

Figure 2.1: Selective search over partitioned clusters using CSI.

2.1.1 Document Allocation Policies

For selective search, each physical shards consist of collection of documents. The main issue here is to determine the relation between documents in same shards. These allocations can be done using any kind of information of documents. Currently, there are three different types of document allocation policies studied, these are random selection, source based selection and topic based selection [5]. Random based al- location choose documents randomly. Source based is mainly depends on the url addresses of web documents.

In this thesis, we applied topically partitioned shards using Kullback-Liebler diver- gence method that is introduced by Xu and Croft [6]. As shown in Algorithm 1 and in Figure 2.2; after documents in the collection are sampled, these sampled documents are used to iteratively generate cluster centroids that is used to assign documents.

Figure 2.2: Topically Document Allocation simulation.

Algorithm 1 Topically Partitioning using K-means and Kullback-Leibler divergence

Input: Document collection C, Number of Clusters K, Times of K-means N, Sam- pling Percentage P

Output: Topical clusters CT

1: SAMPLEDOCS ← RandomSampleDocs(C, K, P)

2: CT ← InitializeKMeans(SAMPLEDOCS, K)

3: CENTROIDS ← CalculateKLCentroids(RANDOMDOCS) // Kullback-Liebler divergence

4: for N Times do

5: for Sampled Document d ∈ C do

6: for k ∈ {1, . . . , K} do

7: FIND DISTANCE(CTi, D)

8: end for

9: Assign d to CTn where CTn is the closest cluster

10: end for

11: end for

return CT

2.1.2 Centralized Sample Index (CSI)

After documents are allocated different clusters, each one should be represented at broker in an index. In this index, instead of including information from all documents; it is possible to maintain selection quality at broker by including a subset of them. Current experiments in literature shows that this subset, namely centralized sample index can return effective results and represent the real clusters good enough [1].

Aly et al. [7] introduce a randomly generated centralized sample index approach. In this approach, each cluster will be represented equally at broker, because each one of clusters will send same percentage of documents that are randomly chosen. We adapted this approach to create CSI.

2.1.3 Resource Selection

Selective search is based on searching multiple clusters at the same time. To achieve that, subset of clusters are questioned by queries and selected clusters’ results are merged into a single list [8].

Here the main issue is how to choose correct clusters. In the literature, resource selection is mainly divided into two main categories, big document approach and small document approach. Former one is the first generation of resource selection techniques. In this technique, resources are treated as a bag of words, which are concatenation of its documents or its sampled documents. Based on this, resource selection task is reduced to a document retrieval task [9].

Another resource selection model is the small document model, which ranks the sam- ple documents according to occurrences of documents in the result set. In this thesis, we tried well known resource selection algorithms that apply small document model, these are Redde [10], ReddeTop [11], [12], CRCSLin, CRCSExp [13] and GAVG

[14].

Redde resource selection method is proposed after noticing that big document ap- proaches does not do well in environments that contain small and large documents. Instead of using meta documents, this approach uses the information about collection

size and result list. Running query in sampled documents creates a ranked list and from this list, a score for resource is calculated by only looking the total number of documents from that resource in top documents. This score is the score that shows relevancy of resource and query but it is not the final score. For finalizing score, as shown in Equation 2.1, Redde uses the ratio of the size of resources sampled docu- ments to the size of resource.

where

Score(Ci, q) = n ×

|Ci|

|Si|

(2.1)

• Ci is the cluster to score

• q is the query

• n is the number of documents in top results from Ci

• Si is the set of sampled documents from Ci

• |Ci| is the size of Ci, i.e. number of documents in Ci

• |Si| is the size of Si, i.e. number of sampled documents from Ci

In ReddeTop, which is proposed by Arguello et al. [12], scores of the documents are used instead of their ranks, it is shown in Equation 2.2

where

Score(Ci, q) =

d∈Si,d∈R

P (d, q) ×

|Ci|

|Si|

(2.2)

• R is the top N result obtained from CSI

• P (d, q) is the score of document d in query q

In addition to Redde and ReddeTop; CRCSLin and CRCSExp [13] other methods that take into consider ranks of the documents in the result list. Their equations are shown in Equation 2.3 and Equation 2.4 The main difference between CRCSLin and

CRCSExp is that latter consider ranks exponentially, where the former has linear equation.

Score(Ci, q) =

d∈Si,d∈R

n − r ×

−β×r

|Ci|

|Si|

|Ci|

(2.3)

where

Score(C , q) =

d∈Si,d∈R

× |Si|

(2.4)

• r is the rank of document d

• α and β are the coefficients

GAVG is the another resource selection algorithm which uses document scores as a variable to determine best clusters [14]. As it is shown in Equation 2.5, it calculates geometric averages of the document scores.

Score(Ci, q) = ( rr

d∈Si,d∈top m ofSi∈R

P (d, q))m

(2.5)

2.2 Diversification for Selective Search

For selective search, processed queries yield a result set which is a sorted document list with respect to their relevancies to the query. Search result diversification is the process of reordering of these result sets in a way that they will handle the ambiguity of the existing query. This problem is defined as a NP-hard problem [15].

Search result diversification methods are mainly divided into two categories, implicit and explicit techniques [16]. Implicit diversification methods don’t need external knowledge, and mainly depend on relevance results of the documents. To cover other aspects of the query, it re-rank these lists.

Many implicit diversification techniques try to maximize the total score of the set by taking into account relevance and dissimilarity scores. For example, MaxSum disper- sion algorithm [17], as shown in Algorithm 2, tries to maximize objective function,

Equation 2.6. Requirement to have document vectors in the place that implicit algo- rithm is being applied, is one drawback of this method. It needs document vectors to calculate SIM and DIV functions.

f (di, dj) = (1 − λ)(SIM (q, di) + SIM (q, dj)) + 2λDIV (di, dj) (2.6)

where

• di, dj documents to compare

• q query

• λ trade-off variable between similarity and dissimilarity

• SIM similarity function

• DIV dissimilarity function

Algorithm 2 MaxSum Dispersion Algorithm

Input: Document set S, result set size k

Output: Re-ranked list R, |R| = k

1: R ← InitializeEmptyResultList()

2: for i ∈ {1, . . . , k/2} do

3: FIND (di, dj)=argmaxdi,dj ∈S (f (di, dj))

4: SET R ← R U {di, dj}

5: SET S ← S \ {di, dj}

6: end for

7: if k is odd then

8: select an arbitrary document di ∈ S

9: SET R ← R U {di}

10: end if

return R

SY is another implicit diversification algorithm, that aims to find near duplicate doc- uments in the collection set [18]. In a sorted result set, it computes document to

document relevancy scores and removes the ones that have a score more than thresh- old, as shown in Algorithm 3.

Algorithm 3 SY Algorithm

Input: Document set S, result set size k, threshold λ

Output: Re-ranked list R, |R| = k 1: R ← InitializeEmptyResultList() 2: i ← 0

3: while i < k and i < |S| do

4: j ← i + 1

5: while j < |S| do

6: if SIM (S[i], S[j]) > λ then

7: REMOVE S[j] from S

8: else

9: j ← j + 1

10: end if

11: end while

12: SET R ← R U S[i]

13: i ← i + 1

14: end while

return R

Unlike implicit algorithms, explicit search result diversification algorithms use an ex- ternal knowledge, like subtopics of existing queries to cover different aspects. Query aspects are used to re-rank the candidate result list. xQuAD is one promising explicit diversification algorithm in the literature [19]. It tries to maximize its objective func- tion, Equation 2.7, by diversifying the relevance scores between the original query and documents, using the relation between subtopics of the query and documents. The product term in the equation computes novelty values for each aspect. xQuAD algorithm is shown in Algorithm 4.

fxQuAD(di) = (1 − λ)P (di|q) + λ P (qi|q)P (di|qi) rr (1 − P (dj|qi))

(2.7)

qi 

dj ∈R 

where

• P (di|q) is relevance of di to query q

• P (qi|q) is likelihood of the aspect qi for query q

• P (di|qi) is relevance of di to the aspect qi

Algorithm 4 xQuAD Algorithm

Input: Document set S, result set size k, tradeoff λ

Output: Re-ranked list R, |R| = k 1: R ← InitializeEmptyResultList() 2: i ← 0

3: while i < k and i < |S| do

4: FIND d∗ ← argmaxd∈S fxQuAD(d)

5: S ← S − d∗

6: R ← R U d∗

7: i ← i + 1

8: end while

return R

In another study, Hong and Si [20] propose two alternative approaches for search result diversification techniques. First of them is diversifying the document rank- ing while selecting clusters. This approach basically applies various diversification algorithms to result set which gained by processing the query over centralized sam- ple index, CSI. This algorithm, named as Diversification approach based on sample Documents (DDiv).

In the same study, Diversification approach based on Source-level estimation (DivS) is the other proposed diversification technique. It treats clusters like a single big document, and computes their probability of relevance of query. Main advantage is that DivS also naturally supports wider range of resource selection algorithms. On the other hand, it requires an additional computation power to compare scores of query aspects for each clusters.

Hong and Si [20] conclude their work that both DDiv and DivS can outperform tra- ditional methods. They also showed that resource level result diversification over CSI results can outperform source level diversification techniques. In this thesis, we will adapt their work and apply DDiv algorithm to compare different applications of search result diversification.

2.3 MMRE: Diversified Query Expansion using Word Embeddings

In this thesis, we also adapted an approach from Bouchoucha et al. [3] to expand the query words. They use the Maximal Marginal Relevance (MMR) implicit diversifica- tion algorithm [21] for selecting expansion terms. Their algorithm, called as MMRE (MMR-based Expansion), shown in Algorithm 5, searches for best documents which maximize MMRE function in Equation 2.8.

fmmre(wi) = λP (wi|q) − (1 − λ)MAXw‘∈R (P (w‘|q)) (2.8)

where

• P (wi|q) is relevance of wi to query q

• R is already expanded words

Algorithm 5 MMRE Algorithm Input: Dictionary D, result set size k Output: Expanded words R, |R| = k 1: R ← InitializeEmptyResultList()

2: i ← 0

3: while i < k do

4: FIND w∗ ← argmaxw∈D fmmre(w)

5: D ← D − w∗

6: R ← R U w∗

7: i ← i + 1

8: end while

return R

During our query expansion experiments, we used word embedding technique [22], [23]. In word embedding, words are represented as a multi dimensional vectors, which helps the task of similarity computation among words [24]. In this model of representation of words, similar words have similar vectors. We used GloVe dictio- nary to represent word embeddings of words in query sets, as GloVe is shown to outperform earlier word representation models [25].

CHAPTER 3

ANALYZING CSI PERFORMANCE WITH QUERY EXPANSIONS

In this chapter, we will present different CSI creation techniques that we used during our experiments. First we will explain how we created the query-biased access count based CSI and pagerank based CSI. We used these CSIs to compare the impact on overall diversification performance by running over different query sets. In addition to the MMRE query expansion algorithm, we also applied a query expansion algorithm based on only word similarities that will be explained in this chapter.

3.1 Query Biased - Access Count Based CSI

As explained in Chapter 2, we adapted random based CSI in our work. However, this approach mainly depends on random function, and it always generates different document set. To create a more robust and more representative CSI, we applied a query-biased technique and generated an access count based CSI, we showed this technique in Algorithm 6.

For this technique, first of all we needed a query set to collect return set. We used AOL query set, which includes approximately 100.000 queries. For each query, we counted how many times each document appeared in top 10 of query results.

After we retrieved the query-biased access counts of every documents in the total collection, we created two CSIs. First as we show in Algorithm 6, each cluster is represented same by applying sample rate in cluster level. In addition to this, as shown in Figure 3.1, we also collected best %1 of the documents in that list. In this work, we used the second alternative. We removed spam documents from the CSI.

d87 d54 d2 d62 d22 d34 ... d643

Remove spam documents Descending order

d87 d54 d62 d22 ... d643 Add Top N documents

Figure 3.1: Query-biased Account count based CSI simulation.

Spam documents are retrieved from Cormack [26]’s studies which based on the same dataset.

Algorithm 6 Query-biased Access Count based CSI

Input: Query Set Q, Document Collection D, Number of Clusters K, Sampling Rate S

Output: Document Set DS

1: for Query q ∈ Q do

2: Resulti ← ProcessQuery(D, q)

3: end for

4: POPULARDOCS ← COUNT in Resultn

5: for k ∈ {1, . . . , K} do

6: SamplingCounti ← S x Numberof Documents ∈ Shardi

7: DS ← TOP SamplingCounti Documents ∈ Shardi

8: end for

return DS

3.2 Pagerank Based CSI

There is also another way to detect popular documents in the collection. We used pagerank information of documents in the collection. %1 of the all documents are collected from the collection according to their Pagerank scores. Highest scored doc- uments are selected first. We applied spam prunning as we mentioned in the case of the access-based CSI.

3.3 Query Expansion with Word Embeddings

As we mentioned in Chapter 2, we adapted MMRE algorithm and applied it using word embeddings to generate different query sets. In addition to this diversified query set, we also generated another set which based on similarities between query and dictionary words.

For each query, first we simplified them by taking the average of word vectors. So that, each query is represented as a single vector. Then for each word in the dictionary, we computed similarity scores using word embeddings. Each query is expanded by an extra 5-words which are closest to the query vector.

Here, together with CSI techniques, we wanted to investigate the effect of expanded query sets on overall diversification metrics. It’s important to compare them together since we processed these expanded query sets over CSI for selective search. We aimed to reach more precise clusters by processing the expansions over CSI, since expansions handle possible vocabulary gaps.

CHAPTER 4

SEARCH RESULT DIVERSIFICATION FOR SELECTIVE SEARCH

In this chapter, we will explain different diversification approaches that can be applied on different layers of selective search. In addition to traditional approach, we adapted resource level diversification technique from [20]. Moreover, we also applied in- cluster level diversification. In the first section, we will discuss selective search along with search result diversification. Then, we will introduce three different diversifica- tion approaches that reveal the effectiveness of diversification at different stages of selective search.

4.1 Search Result Diversification for Selective Search

Traditionally, textual search systems divides collections into subsets, and the query is redirected through these collections. Even though, there is a gain in parallel query processing for this approach, it is still an exhaustive method. The query is processed for all documents. On the other hand, this computational cost can be redacted by only searching relevant shards among all. This approach, namely selective search, as introduced by Kulkarni and Callan [1], partitions collections into different clusters. Each cluster is represented by sampled documents in the centralized sampled index (CSI). Query is managed by broker. Broker processes the query over CSI, and selects best clusters. In Figure 2.1 it is shown after query is processed over CSI, only selected clusters will be chosen as a target.

Search result diversification’s primary aim is always to serve most optimum results with respect to diversity and relevance metrics. To achieve this, for query result lists, recent diversification algorithms mainly re-rank them. Many prior work determine

Resource Selection

CSI

result, DDiv

N/A

Merged result, BDiv

InCluster

Broker Node

Layers

Figure 4.1: Search result diversification at different stages and layers.

the best document using relevance and diversity estimation altogether in the linear equation [17]. Existing search result diversification algorithms are splitted into two main categories: implicit and explicit. In this work, we applied both of them.

It is possible to apply these search result diversification algorithms in different query result set. Naini et al. [27] studied the performance of diversification for exhaustive search by applying diversification at broker and nodes. For selective search; as we show in Figure 4.1; there exist different candidate sets for diversification. Broker produces result set by processing the query over CSI; moreover it also merges the result sets from chosen clusters. So that, at broker layer, CSI result diversification and merged result diversification are possible.

Furthermore, queries are processed inside the clusters, therefore these results can also be diversified before sending them back to broker. It is shown in Figure 4.1 as InCluster.

4.2 Diversification at Broker Node

First, we applied straightforward approach for search result diversification for selec- tive search. This approach diversifies the final result set at broker node, just after collecting the results from selected clusters.

As shown in Figure 4.2, query is first processed over CSI. Resource selection algo-

Figure 4.2: Diversification of merged results at Broker node.

rithm chooses best clusters using this result set. Then broker forwards the query to the selected clusters. Each cluster runs the query over their document collection, then returns their result set back to the broker. Broker merges all these results from dif- ferent clusters. Then, diversification is applied over this merged result set at broker node.

4.3 DDiv: Diversification Based on CSI

As we discussed in Chapter 2, Hong and Si [20] proposed a different diversification approach for selective search. In their work, they showed DDiv can outperform tradi- tional methods and their other proposed method, DivS as well. Therefore, we adapted DDiv which applies diversification before the relevant clusters are selected.

As shown in Figure 4.3, query is processed over CSI. Then, diversification is applied directly over this result set. Clusters will be chosen using this diverse set by resource selection algorithms. This approach diversifies in resource level to find diverse clus- ters and subsequently, final rankings as well.

Query

Broker Run query CSI

CSI results diversiﬁcation Resource Selection

Select Blue, Yellow

Figure 4.3: DDiv: Diversification based on CSI results.

4.4 Diversification inside the Selected Clusters

We applied another result level diversification approach in our experiments. Unlike the first straightforward approach, this method diversifies the result sets inside each clusters.

As shown in Figure 4.4, after query is forwarded to the relevant clusters by broker, each cluster diversifies its own result set for this query. Then, they return diversified set to the broker. This method can explore more distant documents, since each of clusters are represented with their diversified set at broker node.

Figure 4.4: Search result diversification inside clusters.

CHAPTER 5

EXPERIMENTAL SETUP

In this chapter, we will explain our all experimental setup. First, we will mention about data set we used. Then respectively, clustering technique, different CSIs, query expansion, query processing, resource selection and diversification approaches will be explained. Finally, in the last section we will describe evaluation metrics.

5.1 Data Set

During the experiments, we used Clueweb B as a document collection1. There are total of 50,220,538 documents exist in this collection. Total number of terms for this collection is 163,629,158.

We used spam list, generated by [26] to prune spam documents in our experiments. We also adapted pagerank scores of Clueweb B documents that is available online2.

We used Trec Web Track query sets for query processing [28, 29, 30, 31]. Trec query set 2009, 2010, 2011, 2012 are combined for experiments. They are also available online3. These 4 sets have total of 198 queries. For our explicit diversification exper- iments, we also used query aspects of this query sets.

Moreover, we also used AOL query set to create a query biased CSI [32]. AOL query set includes 100000 queries. We used a different query set, otherwise CSI could be biased over same query set. This makes evaluation difficult, especially for baseline comparisons, because query processing over that CSI would already return

1 https://lemurproject.org/clueweb09.php

2 https://lemurproject.org/clueweb09/pageRank.php

3 https://trec.nist.gov/data/webmain.html

best popular documents.

5.2 Document Allocation

To simulate complete selective search framework, we created topic based clusters from Clueweb B document collection. We used K-means algorithm as we discussed in Chapter 2. We used a subset of documents from the collection to apply K-means. These documents are selected randomly. Moreover, each cluster centroid is also ini- tialized randomly from this sample set. Then, rest of the sample set is also distributed to the clusters. Here, we used Kullback-Liebler divergence as described by Kulkarni and Callan [5] while computing the similarities between documents and centroid of clusters. Then K-means applied multiple times.

In his work, Hafızog˘lu [33] proved that under selective search environment, best pre- cision and diversity scores are achieved by splitting this dataset into 100 clusters. We used very same clusters with his work.

As we mentioned in Chapter 3, we applied spam prunning all three indexes.

5.3 Centralized Sample Indexes

After documents are allocated according to their topics, we created 3 different cen- tralized sample indexes. First CSI is created by random sampling. Each cluster is represented in CSI by 1% sampling rate. As a result of this sampling, CSI includes 502,200 documents.

In addition to random CSI, we also created a CSI which is created using AOL query set. This set is processed over all documents in the collection and according to their number of existence in the top 10, best documents are selected from each cluster. Similarly, we applied same strategy to create Pagerank CSI. However, this time we used pagerank information of Clueweb B dataset.

For search result diversification method comparisons, we used random based CSI. At the end of the Chapter 6, we also showed some additional comparisons between

access and random based CSIs.

5.4 Query Expansions using Word Embeddings

As we explained in Chapter 3, we used Global Vectors for Word Representations, GloVe4 on our query expansion experiments. We used 6B tokens, that includes 400,000 words in the dictionary. In this representation, each entity is represented by 100 dimensional vectors.

We created 2 new query sets that derived from Trec query set. For both, we used cosine similarity to compute relevancy between words and the query average. For di- versified expansion set, we diversified the most relevant 50 words to the query average vector. For MMRE function, we set lambda as 0.5.

5.5 Query Processing

As a query processing algorithm, Best Matching 25 is used [34]. Constants k1 and b is set to 1.2 and 0.5, respectively. We cleaned stop words from texts during query processing.

5.6 Resource Selection

We used Redde as a resource selection algorithm, since prior research favors it [20]. Hafızog˘lu [33] made a cluster coverage analyze in same dataset, which shows that top %10 of these clusters have the %99 of relevant documents. That’s why we also picked the best %10 of clusters. Each cluster returns their best 20 results, that’s why our evaluations are mostly based on top 20, 10 and 5 results. We applied Redde algorithm for top 200 documents of the result set.

4 https://nlp.stanford.edu/projects/glove/

5.7 Diversification

We applied both implicit and explicit search result diversification algorithms in this study. As an implicit algorithm, SY is implemented. xQuAD algorithm is also imple- mented, which is a very successful example of explicit algorithms. These algorithms are used as explained in Chapter 2. For xQuAD, BM25 query processing scores are normalized by sum of scores in result set. For both algorithms, we used λ values: 0.25, 0.5 and 0.75. Except the adapted DDiv method, we always diversified top 100 documents of the result set. For DDiv method, since diversification is applied directly over CSI, we diversified top 200 of CSI result set.

5.8 Evaluation Techniques

We used α-nDCG as a diversity performance metric for all experiments [35]. Using this technique, we showed diversity scores for top 5, 10 and 20 results. We used nDeval5 as a tool from Trec Web Track archives. nDeval calculates diversity metrics for a given query results, including α-nDCG.

In addition to diversity metrics, we also showed precision values for top 5, 10 and 20 results. These scores are computed using a tool named, trec_eval, which is also part of Trec Web Track. For both evaluation metrics, Trec provides a ground truths to compare.

5 http://www-personal.umich.edu/ kevynct/trec-web-2014/

CHAPTER 6

EXPERIMENTAL RESULTS

In this chapter, we will present our experimental results. First, we will present results for different CSIs together with query expansions. Next, we will compare different search result diversification methods.

6.1 CSI Results

We created random, access and pagerank based CSIs. We compared them under selective search environment as described in Chapter 2. In tables, mmre represents the diversified query expansions. Other expanded set we described in Chapter 3, is named as exp.

According to their effectiveness values in Table 6.1, access based beats other CSI alternatives.

We also analyzed CSI performances with query expansions. In Tables 6.2 and 6.3, it is shown that access based CSI can not gain from query expansions. On the other hand random and pagerank CSIs can improve overall diversity metrics for selective

Table 6.1: Effectiveness Results for CSI types.

P@5 P@10 P@20 a-nDCG@5 a-nDCG@10 a-nDCG@20

access 0.3381 0.3244 0.2865 0.2811 0.3162 0.3489

random 0.3289 0.3162 0.2825 0.2758 0.3129 0.3438

pagerank 0.3188 0.3025 0.2617 0.2639 0.3013 0.3295

Table 6.2: Diversity effectiveness results for CSI types with query expansions.

csi query expansion a-nDCG@5 a-nDCG@10 a-nDCG@20

- 0.2811 0.3162 0.3489

access mmre 0.2791 0.3132 0.3439

exp

0.2796 0.3133 0.3404

- 0.2639 0.3013 0.3295

pagerank mmre 0.2742 0.3117 0.3392

exp 0.2782 0.3121 0.3388

- 0.2758 0.3129 0.3438

random mmre 0.2816 0.3171 0.3481

exp 0.2819 0.314571 0.3440

Table 6.3: Precision effectiveness results for CSI types with query expansions.

csi query expansion P@5 P@10 P@20

- 0.3381 0.3244 0.2865

access mmre 0.3239 0.3112 0.2764

exp

0.3289 0.3076 0.2685

- 0.3188 0.3025 0.2617

pagerank mmre 0.3218 0.3091 0.2665

exp 0.3259 0.3025 0.2579

- 0.3289 0.3162 0.2825

random mmre 0.331 0.3173 0.2812

exp 0.3401 0.3193 0.2789

search.

In conclusion, we found that access based CSI can outperform other CSI techniques both in diversity and precision metrics. However, we also realized that it is possi- ble to improve the overall diversity quality by expanding query words using word embeddings. Random based CSI with mmre query expansions outperform all other alternatives at top 10 and top 5 results and it is barely beaten by access CSI at top 20 results. Therefore, random based CSI is a better choice in case of query expansions will run over CSI.

Table 6.4: Diversity effectiveness results for Implicit Diversification.

lambda a-nDCG@5 a-nDCG@10 a-nDCG@20

0.25 0.1899 0.1892 0.1924

BDiv 0.5 0.2338 0.2525 0.2685

0.75

0.2665 0.3048 0.3286

0.25 0.2787 0.3145 0.3453

DDiv 0.5 0.2801 0.3126 0.3452

0.75 0.2754 0.3097 0.3427

0.25 0.2576 0.27463 0.2808

InCluster 0.5 0.2626 0.2908 0.3106

0.75 0.2705 0.3107 0.3397

6.2 Implicit Diversification Results

We compared methods described in Chapter 4 in the implicit setup. We used these abbrebiations to identify methods in the tables: BDiv for Diversification at Broker Node; DDiv for Hong and Si [20]’s Diversification approach based on sample Docu- ments; InCluster for Diversification inside the Selected Clusters.

In the Tables 6.4 and 6.5, it is shown that DDiv method beats their compatitors as Hong and Si [20] claimed in their work. After we verified this, we compared mmre method with them. This showed us that adapted method mmre can outperform best scores of all these methods, as shown in Table 6.6.

Since mmre beat other methods, we combined this resource level diversification method with BDiv method. Here, we aimed to outperform mmre performance by applying a diversification to the final result set at broker node. This method represented as BDiv+mmre in the tables. Also, other resource level diversification method, DDiv could also improve its result set in same way. Therefore, we applied same strategy this method as well. In the tables, its abbreviation is DDiv+BDiv. None of these additional methods could beat mmre approach in implicit setup.

Table 6.5: Precision effectiveness results for Implicit Diversification.

lambda P@5 P@10 P@20

0.25 0.1279 0.0741 0.0411

BDiv 0.5 0.2081 0.1645 0.1112

0.75

0.2792 0.2401 0.1812

0.25 0.3279 0.3122 0.2782

DDiv 0.5 0.3289 0.3137 0.2812

0.75 0.3259 0.3132 0.2815

0.25 0.2426 0.1665 0.0959

InCluster 0.5 0.2538 0.2091 0.1475

0.75 0.2904 0.2553 0.2003

Table 6.6: Effectiveness results for Implicit Diversification with all methods’ best results.

P@5 P@10 P@20 a-nDCG@5 a-nDCG@10 a-nDCG@20

BDiv 0.2792 0.2401 0.1812 0.2665 0.3048 0.3286

DDiv 0.3279 0.3122 0.2782 0.2787 0.3145 0.3453

InCluster 0.2904 0.2553 0.2003 0.2705 0.3107 0.3397

mmre 0.331 0.3173 0.2812 0.2816 0.3171 0.3481

BDiv+mmre 0.2822 0.2371 0.1799 0.2745 0.3078 0.3296

DDiv+BDiv 0.2731 0.2365 0.1802 0.2653 0.3029 0.3283

Table 6.7: Diversity effectiveness results for Explicit Diversification.

lambda a-nDCG@5 a-nDCG@10 a-nDCG@20

0.25 0.3001 0.3368 0.3738

BDiv 0.5 0.3163 0.3514 0.3826

0.75

0.3215 0.3522 0.3824

0.25 0.2655 0.2998 0.3320

DDiv 0.5 0.2695 0.3047 0.3354

0.75 0.2729 0.3071 0.3392

0.25 0.2842 0.3197 0.3552

InCluster 0.5 0.2918 0.3268 0.3613

0.75 0.2945 0.3308 0.3647

6.3 Explicit Diversification Results

In the explicit setup, we found that BDiv outperforms other methods. In Tables 6.7 and 6.8, it is shown that DDiv method is actually behind the others with respect to precision and diversity.

As we have shown in previous section, we merged some methods together to improve their performances. In explicit setup, BDiv method clearly beats others, therefore we used mmre method together with BDiv to improve cluster selection. BDiv results could also improve by combining it with DDiv. Since it can also improve cluster selections. That’s why, we investigated their results, too. In the Table 6.9, it is shown that none of the methods could actually beat BDiv method.

As an additional experiment, we employed access based CSI for the best methods found above. In Table 6.10, we showed their results in implicit setup. Best scored method, DDiv with random based CSI still outperforms access based CSI. For explicit setup, similarly, BDiv with random based CSI beats its access based alternative as shown in Table 6.11.

Table 6.8: Precision effectiveness results for Explicit Diversification.

lambda P@5 P@10 P@20

0.25 0.3675 0.3523 0.3117

BDiv 0.5 0.3766 0.3533 0.3033

0.75

0.3838 0.3533 0.2957

0.25 0.3157 0.3076 0.2734

DDiv 0.5 0.3198 0.3102 0.2744

0.75 0.3279 0.3112 0.2749

0.25 0.3371 0.3223 0.2845

InCluster 0.5 0.3431 0.3218 0.2853

0.75 0.3472 0.3269 0.2873

Table 6.9: Effectiveness results for Explicit Diversification with all methods’ best results.

P@5 P@10 P@20 a-nDCG@5 a-nDCG@10 a-nDCG@20

BDiv 0.3766 0.3533 0.3033 0.3163 0.3514 0.3826

DDiv 0.3279 0.3112 0.2749 0.2729 0.3071 0.3392

InCluster 0.3472 0.3269 0.2873 0.2945 0.3308 0.3647

mmre 0.331 0.3173 0.2812 0.2816 0.3171 0.3481

BDiv+mmre 0.3665 0.3472 0.2967 0.3154 0.3475 0.3789

DDiv+BDiv 0.3706 0.3447 0.2954 0.3088 0.3463 0.3786

Table 6.10: Effectiveness results for Access and Random based CSIs for Implicit Diversification.

csi P@5 P@10 P@20 a-nDCG@5 a-nDCG@10 a-nDCG@20

BDiv access 0.2832 0.233 0.1805 0.2705 0.3036 0.3287

random 0.2792 0.2401 0.1812 0.2665 0.3048 0.3286

DDiv access 0.331 0.3112 0.281 0.2773 0.3130 0.3447

random 0.3279 0.3122 0.2782 0.2787 0.3145 0.3453

Table 6.11: Effectiveness results for Access and Random based CSIs for Explicit Diversification.

csi P@5 P@10 P@20 a-nDCG@5 a-nDCG@10 a-nDCG@20

BDiv access 0.3645 0.3503 0.297 0.3166 0.3525 0.3799

random 0.3766 0.3533 0.3033 0.3163 0.3514 0.3826

DDiv access 0.335 0.3137 0.2789 0.2769 0.3104 0.3422

random 0.3279 0.3112 0.2749 0.2729 0.3071 0.3392

CHAPTER 7

CONCLUSION AND FUTURE WORK

In this thesis, we provided an in-depth analysis of search result diversification in the context of selective search, and proposed extensions to improve diversification effectiveness. First, we showed that creating a CSI based on the past document access statistics yields both more relevant and diverse results than a CSI based on randomly sampled documents. However, by processing queries expanded with diverse terms during resource selection, it is also possible to obtain diversification performance that is comparable to the latter. This finding is also important to show that even when such past statistics are not available, a random CSI together with diversified query expansion performs reasonably well.

Second, we investigated the diversification performance at different layers (i.e., at the broker vs. in the clusters) and at different stages, namely, before resource selection and before/after result merging. We found that when a representative implicit diver- sification method, namely, SY, is employed; the best diversification performance is obtained by selecting diverse resources as suggested by Hong and Si [20]. While doing so, simply processing expanded queries with diverse terms over the CSI, as we proposed in this thesis, outperform the previous approach in Hong and Si [20]. Inter- estingly, the findings vary for the explicit diversification. By employing a represen- tative explicit approach, xQuAD, we demonstrated that diversifying merged results at the broker is superior to diversifying partial results at the clusters or diversifica- tion during the resource selection. This is again a new and contradicting finding with respect to Hong and Si [20], and implies that when there are more clues for diversi- fication, it is better to conduct it at a more fine-grain level, i.e., over the result list, rather than attempting to diversify resources as a whole.

There are various future research directions for our work. In particular, in our addi- tional experiments we observed that by conducting a selective expansion of queries, i.e., expanding only a subset of them and/or setting different thresholds for expansion terms on a per query basis, it is possible to further improve the diversification effec- tiveness. We plan to explore such selective expansion approaches as our future work. As a second research direction, we will investigate the diversification efficiency for selective search. For instance, it is possible to create summary vectors for the doc- uments to conduct the diversification methods at the broker more efficiently. Such optimizations are also left for our future work.

REFERENCES

[1] Anagha Kulkarni and Jamie Callan. Selective search: Efficient and effective search of large textual collections. ACM Transactions on Information Systems (TOIS), 33(4):17, 2015.

[2] Rodrygo LT Santos, Craig Macdonald, Iadh Ounis, et al. Search result diversi-

fication. Foundations and TrendsQR in Information Retrieval, 9(1):1–90, 2015.

[3] Arbi Bouchoucha, Jing He, and Jian-Yun Nie. Diversified query expansion us- ing conceptnet. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 1861–1864. ACM, 2013.

[4] Xiaoyong Liu and W Bruce Croft. Cluster-based retrieval using language mod- els. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193. ACM, 2004.

[5] Anagha Kulkarni and Jamie Callan. Document allocation policies for selective searching of distributed indexes. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 449–458. ACM, 2010.

[6] Jinxi Xu and W Bruce Croft. Cluster-based language models for distributed re- trieval. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 254–261. ACM, 1999.

[7] Robin Aly, Djoerd Hiemstra, and Thomas Demeester. Taily: shard selection using the tail of score distributions. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 673–682. ACM, 2013.

[8] Milad Shokouhi, Luo Si, et al. Federated search. Foundations and TrendsQR in Information Retrieval, 5(1):1–102, 2011.

[9] James P Callan, Zhihong Lu, and W Bruce Croft. Searching distributed col- lections with inference networks. In ACM SIGIR Forum, volume 51, pages 160–167. ACM, 2017.

[10] Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 298–305. ACM, 2003.

[11] Jaime Arguello, Jamie Callan, and Fernando Diaz. Classification-based re- source selection. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 1277–1286. ACM, 2009.

[12] Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-Francois Crespo. Sources of evidence for vertical selection. In Proceedings of the 32nd inter- national ACM SIGIR conference on Research and development in information retrieval, pages 315–322. ACM, 2009.

[13] Milad Shokouhi. Central-rank-based collection selection in uncooperative dis- tributed information retrieval. In European Conference on Information Re- trieval, pages 160–172. Springer, 2007.

[14] Jangwon Seo and W Bruce Croft. Blog site search using resource selection. In Proceedings of the 17th ACM conference on Information and knowledge man- agement, pages 1053–1062. ACM, 2008.

[15] Dorit S Hochbaum. Approximation algorithms for NP-hard problems. PWS Publishing Co., 1996.

[16] Rodrygo LT Santos, Jie Peng, Craig Macdonald, and Iadh Ounis. Explicit search result diversification through sub-queries. In European conference on informa- tion retrieval, pages 87–99. Springer, 2010.

[17] Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. In Proceedings of the 18th international conference on World wide web, pages 381–390. ACM, 2009.

[18] Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben, and Ujwal Gadiraju. Groundhog day: near-duplicate detection on twitter. In Proceedings of the 22nd international conference on World Wide Web, pages 1273–1284. ACM, 2013.

[19] Rodrygo LT Santos, Craig Macdonald, and Iadh Ounis. Exploiting query re- formulations for web search result diversification. In Proceedings of the 19th international conference on World wide web, pages 881–890. ACM, 2010.

[20] Dzung Hong and Luo Si. Search result diversification in resource selection for federated search. In Proceedings of the 36th international ACM SIGIR con- ference on Research and development in information retrieval, pages 613–622. ACM, 2013.

[21] Jaime G Carbonell and Jade Goldstein. The use of mmr, diversity-based rerank- ing for reordering documents and producing summaries. In SIGIR, volume 98, pages 335–336, 1998.

[22] Saar Kuzi, Anna Shtok, and Oren Kurland. Query expansion using word em- beddings. In Proceedings of the 25th ACM international on conference on in- formation and knowledge management, pages 1929–1932. ACM, 2016.

[23] Kezban Dilek Onal, Ismail Sengor Altingovde, and Pinar Karagoz. Utilizing word embeddings for result diversification in tweet search. In AIRS, pages 366–

378. Springer, 2015.

[24] Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In Pro- ceedings of the 52nd Annual Meeting of the Association for Computational Lin- guistics (Volume 2: Short Papers), pages 302–308, 2014.

[25] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Em- pirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://www.aclweb.org/anthology/ D14-1162.

[26] GV Cormack. Waterloo spam rankings for the clueweb09 dataset@ online, 2009.

[27] Kaweh Djafari Naini, Ismail Sengor Altingovde, and Wolf Siberski. Scalable and efficient web search result diversification. ACM Transactions on the Web (TWEB), 10(3):15, 2016.

[28] Charles L. A. Clarke, Nick Craswell, and Ian Soboroff. Overview of the TREC 2009 web track. In Proceedings of The Eighteenth Text REtrieval Conference, TREC 2009, Gaithersburg, Maryland, USA, 2009.

[29] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Gordon V. Cormack. Overview of the TREC 2010 web track. In Proceedings of The Nineteenth Text REtrieval Conference, TREC 2010, Gaithersburg, Maryland, USA, 2010.

[30] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Ellen M. Voorhees. Overview of the TREC 2011 web track. In Proceedings of The Twentieth Text REtrieval Conference, TREC 2011, Gaithersburg, Maryland, USA, 2011.

[31] Charles L. A. Clarke, Nick Craswell, and Ellen M. Voorhees. Overview of the TREC 2012 web track. In Proceedings of The Twenty-First Text REtrieval Conference, TREC 2012, Gaithersburg, Maryland, USA, 2012.

[32] Greg Pass, Abdur Chowdhury, and Cayley Torgeson. A picture of search. In

InfoScale, volume 152, page 1, 2006.

[33] Fatih Hafızog˘lu. Improving the efficiency of distributed information retrieval using hybrid index partitioning. Master’s thesis, 2018.

[34] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance frame-

work: Bm25 and beyond. Foundations and TrendsQR

3(4):333–389, 2009.

in Information Retrieval,

[35] Charles LA Clarke, Maheedhar Kolla, Gordon V Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 659–666. ACM, 2008.

Sayfalar

30 Haziran 2024 Pazar

25

Hiç yorum yok:

Yorum Gönder