
 Chapter 1


INTRODUCTION


Video has become an influential tool for capturing and delivering information in personal, professional, and educational applications. This trend has raised the importance of efficient and effective video retrieval systems for large video collections. Most video retrieval systems rely only on hand-crafted features and manual annotations. The motion of individual objects, arguably the most salient information carried by videos, is usually ignored in video retrieval. On the human-computer interaction front, sketching and speech are two modalities that complement each other in describing motion in a video. Sketching indicates the placement of objects together with their motion paths. Through speech, users can indicate events and their relationships. For instance, in a soccer match, one can mark the coordinates of a player by drawing a circle, and specify a goal scored by that player by drawing an arrow heading from the player to the goalpost. However, drawing an arrow would not by itself suffice to describe a goal, since the arrow could also indicate a forward pass or a free kick. Through speech, users can explicitly indicate that the motion is a goal shot.

However, we do not have adequate information about how speech and sketching can be used for video retrieval, since there are no existing video retrieval systems based on these modalities. To gather information on the joint use of these modalities, a protocol must be designed that prompts participants to talk and draw simultaneously. Also needed are applications to collect speech and sketching data and to process these data for later use. Moreover, little research has been conducted on the co-interpretation of these modalities and on generating a plain-text search query from them.

This paper presents a protocol together with a set of software tools to stimulate participants to draw and talk in a video retrieval task. We tested the protocol and the software tools on a soccer match retrieval use case. We ensured ecological validity by involving participants with relevant expertise in this area. The software tools are customizable and the protocol is easy to follow. Next, we developed a model capable of interpreting simultaneously given sketching and speech inputs and understanding the motion sequence described by a user. The model performs well on the data we collected. Finally, a video retrieval system was built by integrating the model into a database back-end designed for big multimedia collections. Several user evaluation studies were conducted on the retrieval system. The results of these studies suggest that the data collection software, the data collection protocol, and the multimodal interpretation model serve as a powerful trio for building multimodal motion-based video retrieval systems on big multimedia collections.

The rest of this paper is structured as follows. Section 2 presents related work on developing sketch- and speech-based multimodal systems. Section 3 briefly describes the selection of a relevant video search use case. Section 4 describes the software tools and the protocol used in the data collection studies, with some insights on the collected data. Section 5 defines the multimodal interpretation model and presents its performance results. Section 6 illustrates the design of the database system used in the video retrieval system. Section 7 presents the user interface design of the video retrieval system, describes the protocol followed in the user evaluation studies, and reports the results of these studies. Section 8 concludes the paper.

 






Chapter 2


RELATED WORK


A substantial body of work exists on the multimodal representation of motion in the human-computer interaction literature. In light of this work, our research was planned to encourage the natural use of sketching and speech for motion scenarios in video clips. In this context, sketch interpretation, natural language understanding, and video retrieval systems are within the scope of our research.

Sketching is the first modality that drew human-computer interaction researchers' attention to developing motion-based video retrieval systems. The spatial information conveyed by sketching has been treated as the primary information to retrieve in video clips. A motion-based video retrieval system for soccer videos uses sketching to describe the motion of the ball and players [Al Kabary and Schuldt, 2013]. Users can also limit the search area by drawing a boundary shape. To specify events in addition to the motions drawn, a user interface with selection buttons is provided. In the system designed by [Collomosse et al., 2009], users are allowed to illustrate much more complex scenarios without limiting the area. The system classifies all arrows as motion cues and the rest of the symbols as objects. Each motion cue is linked to the closest object within a certain distance. Both systems use only sketching as the input method, limiting the users' ability to articulate the spatial information of objects and their motions. Speech, which allows events and the relationships of motions to be specified comfortably, is overlooked in these systems.

The works mentioned above imply that the symbols used to describe motions are different from other symbols. In this context, usage patterns for depicting motion have become a popular topic in human-computer interaction research. The user interface developed by [Thorne et al., 2004] defines a set of sketched symbols for basic body gestures. For instance, one needs to draw a bowl-like shape to describe jumping, whilst a small semi-ellipse must be drawn for walking. The motion to animate is composed of such basic motion drawings. Primitive motion symbols are recognized by expressing these symbols in terms of arcs and lines [Hammond and Davis, 2006a]. Another interface for controlling robots leverages sketching to articulate the path of a robot [Skubic et al., 2007]. Moreover, Kitty, which was developed for cartoonists, also uses free-form sketching to construct paths for objects [Kazi et al., 2014].

In addition to these works, research has been done on expressing the motion of articulated objects in video clips. In the work by [Mauceri et al., 2015], three different query modes are proposed for describing the motion of objects: appending arrows to sub-parts of a previously drawn object, appending arrows to sub-parts of an articulated object template, and laying out key-frames of the motion by moving the joints of an articulated object template. User evaluation studies concluded that the query modes using articulated object templates are more accurate than drawing an object and appending arrows to its sub-parts. However, it must be noted that using object templates reduces the naturalness of a given motion query.

Most of the works presented above take advantage of free-form strokes and arrows to illustrate motion. Comparing these two query modes, arrows have more expressive power than free-form strokes because their arrow heads reveal direction. This motivated us to use arrows rather than free-form strokes for depicting motion in the development of our video retrieval system.

There are some multimodal systems that support the joint use of speech and pen without performing video retrieval. For instance, QuickSet [Cohen et al., 1997] enables users to locate objects on a map and define their movements by using pen gestures and speech. The system employs hidden Markov models and neural networks to recognize pen gestures. To interpret pen gestures and speech together, it extracts features from each modality and tries to find the best match between them through statistical methods on those features. Once the joint interpretation with the highest probability is found, the objects are manipulated through the instructions inferred from the multimodal interpretation. Additionally, IBM has developed a text editor that uses speech and pen gestures together for manipulating text [Vergo, 1998]. Users can type text through speech and perform pen gestures, supported by utterances, to manipulate a particular piece of text. The system utilizes feature-based statistical machine translation methods [Papineni et al., 1997]. These methods append the features extracted from pen gestures to the natural language features obtained from speech, and apply statistical machine translation on the combined features in order to infer the instructions given by a user.

Empirical studies on the cooperative use of speech and sketching have been another direction in human-computer interaction research. A few studies have investigated the use of speech and sketching. [Adler and Davis, 2007a] designed a study to observe how people use speech and sketching simultaneously and to ultimately implement a multimodal whiteboard application based on these observations. They asked students from a circuit design class to describe and draw a floor plan for a room, design an AC/DC transformer, design a full adder, and describe the final project they had developed for their circuit design class. In each session, the experimenter and the participant sit across from each other at a table. On each side of the table there is a tablet running an application that replicates whatever is drawn on one tablet onto the other. The participant describes each of the four scenes above via speech and sketching. Throughout the sessions, the experimenter also adds to the participant's drawing and asks questions, leading to a bi-directional conversation. Later, [Adler and Davis, 2007b] designed another study for collecting speech and sketched symbols in order to develop a multimodal mechanical system design interface. They asked six participants to talk about and draw a pendulum system on a whiteboard. Although these studies investigate the joint use of speech and sketching, they did not have multimedia retrieval in focus. Our protocol and set of tools focus on investigating the collaborative use of speech and sketching for multimedia retrieval.

Moreover, [Oviatt et al., 2000] have drawn attention to the unavailability of multimodal corpora for developing new methods for multimodal language processing. They have underlined the need for methods and tools for obtaining multimodal inputs. Our protocol and software tools help human-computer interaction researchers address this issue by enabling the simultaneous collection of speech and sketching.

The long progression in natural language understanding has also motivated us to focus on multimodal interpretation for motion-based video retrieval. [Yu et al., 2013] developed LibShortText, a library leveraging support vector machines and logistic regression on bag-of-words features. This library is useful when the dictionary of a natural language dataset is limited. Later, neural networks became popular as natural language datasets grew in size. The advantage of these networks is that they are also capable of learning how to extract features from a short text segment at the character or word level (or both). The model designed by [dos Santos and Gatti, 2014] constructs features of a short text (namely embeddings) through long short-term memory (LSTM) structures. Another work, by [Joulin et al., 2016], combines traditional feature extractors with feed-forward neural networks. The advantage of that architecture is its runtime: the text classification model is constructed more quickly than with the other two approaches. We designed our multimodal interpretation framework by leveraging these state-of-the-art systems, and the classification performance results suggest that the given data collection protocol and the multimodal interpretation framework can be used end-to-end for building sketch-and-speech-based video retrieval systems.

Furthermore, there have been several advances in sketch recognition frameworks, which was another motivation for us to focus on more complex user interfaces supporting natural interaction. The first milestone in the recognition of sketched symbols without definite rules was the IDM feature extraction method [Ouyang and Davis, 2009]. This method opened the way to more accurate sketched symbol classifiers. Next, [Tumen et al., 2010] investigated the use of this feature extraction method with different machine learning heuristics, obtaining far more accurate classifiers. Moreover, [Yesilbek et al., 2015] conducted several experiments to find the intersection of the hyper-parameter ranges yielding the best classification accuracies on different sketched symbol datasets when using the SVM classifier with different feature extraction methods.

 





Chapter 3


DATA COLLECTION STUDY


We designed a protocol and implemented software tools to collect speech and sketching simultaneously from people having different levels of expertise in soccer. By recruiting people with a range of expertise in soccer, we ensured the ecological validity of the tools and the protocol. The tools and the protocol are also highly customizable, meaning that they can be modified for use in different motion-based video retrieval use cases. Next, we constructed a visual vocabulary to ease sketching and prepared video clips to present during the data collection sessions. Finally, the data collected from the participants was annotated and prepared for use in several machine learning models for multimodal interpretation.


3.1 Retrieval Use Case


We aimed to design a multimodal interpretation framework for simultaneously given sketching and speech, and thus to build a system that performs motion-based video retrieval with the given input. For this reason, the use case was initially planned to frequently involve motion scenarios regardless of the context. However, the open-ended nature of sketching and speech makes it difficult to handle different contexts in the same video retrieval framework. For example, the sketched symbols and words uttered to describe a soccer drill differ greatly from those used for chess matches. To build a retrieval system operating on both chess and soccer matches, we would have to put in twice as much effort as we would for either of them alone. Within this context, we decided to narrow down the use case and choose a single case involving motion scenarios. Considering its popularity, we finally selected the retrieval of soccer match videos as the use case of our video retrieval system.

 



3.2 Software Tools


We developed three applications: one to collect speech and sketching simultaneously from the participants, another to provide a video search front-end for the participant, and a third to annotate the symbols drawn during a session.


3.2.1 Data Collection


We developed two applications running on Android tablets, one for the experimenter [Altıok, 2017b] and another for the participant [Altıok, 2017c]. As seen in Figure 3.1, both applications have a canvas with a background image that can be customized based on the use case scenario. These applications use a shared canvas, that is, whatever gets drawn on one tablet is replicated on the other tablet and is also written to a text file on both sides for later use. To replicate the drawings, the applications communicate with each other over a Wi-Fi network. If a user wants to highlight a part of her drawing, she can hover her stylus pen close to the screen; the wizard (experimenter) sitting on the opposite side can see where the user is highlighting. Moreover, both applications record their users on video through the built-in tablet cameras while the applications are in use. Note that there are three buttons on the bottom-left corner. The leftmost button clears the entire canvas. The rightmost one activates the eraser mode, with which the user can erase any part of her drawing. By clicking the button in the middle, she can activate the drawing mode.

Each application produces two outputs:


A text file (alternatively called a sketch stream file) containing a list of the following actions, each recorded with a time stamp (a hypothetical example is sketched after this list):


Starting a stroke

Ending a stroke

Adding a point to sketch

Hovering over a point

 




 


Figure 3.1: A screen-shot from the data collection application, with the soccer field background image chosen for soccer match retrieval (with explanations of the UI components)




Clearing entire canvas


A video containing the entire session
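To make the sketch stream file more concrete, the following is a minimal sketch of what such a file and a parser for it could look like. The action names, field order, and timestamp format shown here are assumptions for illustration, not the exact layout produced by the data collection applications.

```python
# Hypothetical sketch stream excerpt (format is an assumption):
#
#   1498061123451 STROKE_START
#   1498061123460 POINT 412 233
#   1498061123475 POINT 418 240
#   1498061124102 STROKE_END
#   1498061125800 HOVER 512 300
#   1498061130010 CLEAR

def parse_sketch_stream(lines):
    """Group timestamped actions into strokes of (x, y, time) points."""
    strokes, current = [], None
    for line in lines:
        parts = line.split()
        timestamp, action = int(parts[0]), parts[1]
        if action == "STROKE_START":
            current = []
        elif action == "POINT" and current is not None:
            current.append((int(parts[2]), int(parts[3]), timestamp))
        elif action == "STROKE_END" and current is not None:
            strokes.append(current)
            current = None
        elif action == "CLEAR":
            strokes.clear()
    return strokes
```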


For the transcription of each session video collected on the participant's side, we marked the start and end points between which the participant talked about a video clip to search. Afterwards, we connected the speaker output to a virtual microphone through a virtual audio driver [Muzychenko, 2015] to redirect the participants' voices to the speech recognizer as if they were real-time utterances. Next, we transcribed each participant's session video using the IrisTK [Skantze and Al Moubayed, 2012] test recognizer application with Google Cloud Speech Recognition API support [Google, 2017]. Once we obtained the transcripts, we corrected mistranscribed words by hand.


3.2.2 Wizard Application for Video Search


Another application serves as the video search front-end for the participant [Altıok, 2017a]. This application uses video clips prepared for retrieval tasks.

 



The application consists of two windows, one for the experimenter and another for the participant.

Participant's Window: This window is shown on the participant's screen. It contains a video display where a video clip selected in the experimenter's window is played 3 times.


Experimenter’s Window: This window supports two main functionalities: playing the video clip to search and navigating through videos. The window is presented in Figure 3.2.


Playing a video clip to search: For each motion event, when the experimenter finds the correct video clip, the participant and the experimenter move on to the next event by clicking the topmost button on the left panel. Once the button is clicked, a video clip of the next event is randomly picked and played 3 times on the participant's screen. If all events have been covered, the application displays a message indicating that the session is over.

Navigating through video clips: For each motion event, once the participant draws and verbally describes the scenario of a video clip, the experimenter looks at the list of video clips on his side and picks one for preview by clicking the green buttons. When the wizard is happy with his selection, he sends the video to be played on the participant's screen by clicking the "send video" button at the bottom.



3.3 Visual Vocabulary


Due to the open-ended nature of sketching, participants needed help in establishing a visual vocabulary for describing objects and their motions. To this end, we supplied them with a visual vocabulary. We reviewed the notations used in Soccer Systems and Strategies [Bangsbo and Peitersen, 2000] (see Figure 3.3) and Goal.com's live scoring system1 (see Figure 3.4). We wanted participants to memorize as few types of symbols as possible, therefore we combined these notations and removed the zig-zag symbol for dribbling. There appears to be no need to indicate dribbling explicitly: a player motion arrow drawn from a player right after sketching a ball motion arrow coming to that player implies dribbling. One can also say a player is dribbling while drawing a player motion arrow. The resulting visual vocabulary is presented in Figure 3.5.


Figure 3.2: A screen-shot from the wizard application for video search (experimenter's side), with explanations of the UI components



Figure 3.3: Drawing notation used in Soccer Systems and Strategies

 




 



Figure 3.4: Drawing notation used in Goal.com’s live scoring system




3.4 Video Clips


Our goal is to retrieve clips describing motions in soccer videos. This requires a database of videos augmented with the motion data of players and the ball. In the soccer match retrieval case, there are two alternative ways of finding such video clips:

Using motion sensors in matches: Making arrangements with soccer clubs, mounting motion sensors on players and the ball and taking videos of these matches together with their motion data

Using a soccer simulation game: Finding an open-source soccer game and modifying its source code to record coordinates of the ball and players together with frames at every moment.


Ideally, the first option is preferable. However, this choice requires a massive infrastructure of motion sensors mounted on players and several cameras located around a soccer field. Due to this infeasibility, the second option is the more practical way to prepare videos augmented with motion data. Although compiling videos from a soccer game is far from reality, these videos are similar to real ones in terms of the events that might happen in a match, such as passes, shots, and kicks. Presenting a soccer clip from a soccer game should not affect the participants' behavior when using speech and sketching. Hence, to prepare the soccer video clips presented to participants during the sessions, we modified the open-source soccer simulation game Eat the Whistle [Greco, 2008] as described above (see Figure 3.6 for a screen-shot). Afterwards, we recorded several soccer matches played between AI-controlled teams. For each match recorded, we compiled the subsequent frames into a video. Finally, for each of the events listed in Table 3.1, we prepared video clips from the match videos.


Figure 3.5: Drawing notation used in the data collection sessions



Figure 3.6: A screen-shot from Eat the Whistle





3.4.1 Events for a Session


Searching for scenes of all of the events above might make a session too long. After a while, a participant might get exhausted and not provide proper data. Thus, we decided to divide the events among participants.

 




Corner kick
Shot hitting the post
Direct free kick
Shot saved by a goalkeeper
Indirect free kick
Throw in
Header
Shot going wide
Bicycle
Foul
Goal
Poke tackle
Pass
Sliding tackle

Table 3.1: Motion events that might happen in a soccer match




For each participant, we chose either the first or the last 7 events, so that successive participants did not describe the same events.


3.5 Method


This section presents the physical set-up and the protocol followed.



3.5.1 Physical Set-Up


We aim to investigate the incorporation of speech and sketching in video retrieval. To this end, we initially planned a typical Wizard-of-Oz set-up where participants are asked to draw and talk about a scenario on a computer without being told that the computer is controlled by a human. However, as we established in our pilot runs, after a while participants hesitate to talk while staring at a computer screen that does not produce any feedback indicating that their speech is being attended to. To motivate participants to talk, we allowed limited feedback from the experimenter. The experimenter does not engage in a conversation, but gives feedback while listening to the participant by nodding his head or saying "OK, I'm listening to you" as needed.

As illustrated in Figure 3.7, each session takes place at a table with two tablets and a computer with an external screen plugged in. On the participant's side, there is a screen to display videos to the participant, and a tablet on which to describe the motion in a video clip through sketching. On the experimenter's side, there is a computer running the software that picks a video clip randomly and displays it on the participant's screen, and another tablet showing the participant's drawing.


3.5.2 Protocol


In each session, the participant and the experimenter sit across from each other at a table. Initially, the experimenter provides a set of sample sketched symbols describing the visual vocabulary for constructing drawings. To practice the use of the visual vocabulary, the experimenter opens a video clip and asks the participant to describe the movement of the objects in the clip. Next, they move on to the search tasks. In the search tasks, there are different motion events to describe through speech and sketching. For each motion event, the experimenter picks a video clip of the event randomly and plays the clip on the participant's screen 3 times. Then, the participant draws the motion of the objects in the clip on her tablet while talking at the same time. The experimenter then looks at the videos of the event on his screen, finds a video that is similar to the scenario drawn and uttered by the participant, and plays the video on the participant's screen 3 times. Finally, the experimenter asks whether the video is the one presented at the beginning. If the participant says yes, they proceed to the next motion event. Otherwise, the participant tries to produce a more accurate description of the scene to be retrieved by improving the drawing and speech, and the retrieval process repeats.


3.6 Output and Annotation


Of the 18 participants, 7 were soccer fans, 6 were soccer players, 3 were soccer coaches, and 2 were soccer referees. Altogether, the sessions took 661 minutes and 24 seconds; the average duration of a session was 36 minutes and 44 seconds.

 









 


Figure 3.7: Physical set up; (a) for the participant, (b) for the experimenter

 



The following annotation methods were applied to all the sketched symbols and utterances collected in these sessions.



3.6.1 Sketched Symbol Annotation


We implemented an application to annotate individual sketched symbols [Altıok, 2017d]. This software takes a sketch stream file generated by a collector application as input. As shown in Figure 3.8, the right pane displays the stream. A user can select a part of the stream and see the drawing generated by the selection on the left panel. She can then save the drawing, with a name, in the following formats:


Points table: A list of tuples (x, y, time, strokenum), where (x, y) are the coordinates of a sketch point, time is the time at which the point was generated, and strokenum is the ID of the stroke the point belongs to. In each sketched symbol, the stroke ID starts from 0 and is incremented by one whenever a new stroke starts. (A minimal loading sketch is given after this list.)


Image: An image of the drawing.
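As a concrete illustration of the points table format described above, the following Python sketch loads such a file into per-stroke point lists. The whitespace-separated row layout is an assumption for illustration, not the tool's documented format.

```python
from collections import defaultdict

def load_points_table(path):
    """Load a points table with one "x y time strokenum" row per point."""
    strokes = defaultdict(list)          # stroke ID -> list of (x, y, time)
    with open(path) as f:
        for row in f:
            x, y, t, stroke_id = row.split()
            strokes[int(stroke_id)].append((float(x), float(y), int(t)))
    # Return the strokes ordered by stroke ID (0, 1, 2, ...).
    return [strokes[k] for k in sorted(strokes)]
```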





Figure 3.8: A screen-shot from the sketched symbol annotation application, with explanations of the UI components



Using this software, annotation of the sketched symbols was done in two ways:

 




Ball 88

Ball Motion 313

Player Motion 250

Player (side 1) 480

Player (side 2) 396

Total 1527


Table 3.2: Distribution of the sketched symbols among symbol types



West 283

East 181

Total 464


Table 3.3: Distribution of labeled motion arrow symbols among the directions



Annotation by symbol type: We labeled symbols with one of the following classes: ball, ball motion arrow, player motion arrow, player from side 1, and player from side 2. The distribution of the symbols among these classes is presented in Table 3.2, together with some samples (see Figure 3.9). Note that the ball symbol was not used frequently, so ball symbols were excluded from the training set of our sketched symbol classifier; the location of the ball can be clearly inferred from the starting point of a ball motion arrow.


Annotation by direction: We also labeled the motion arrow symbols with their directions (heading either west or east). Note that some motion symbols were excluded from the training set since they did not have an arrow head, which is an important indicator of direction. The distribution of the labeled motion symbols among the classes of this annotation method is presented in Table 3.3.

 




 


Figure 3.9: Sample symbols collected in the sessions



3.6.2 Utterance Annotation


We also collected 9296 words in the sessions. A summary of the words uttered is presented in Table 3.4. Some sample utterances are shown in Table 3.5.²

For each of the 183 phrases describing soccer drills, we generated n-grams with n ranging from 1 to 6 (a minimal n-gram extraction sketch is given after the list below). Next, each n-gram was labeled with one of the following motion events:

Bicycle


Corner kick


Dribbling


Free kick: All kicks far from or close to the goalpost are included in this category.


Header

 










Words uttered: 9296
Minimum words uttered in a session: 233
Maximum words uttered in a session: 925
Average number of words uttered in a session: 516.4

Table 3.4: Summary of collected utterances in the sessions















Segment Event being described

...the ball is coming to the post... Shot hitting the post

...he is throwing the ball... Throw in

...he is heading to the ball... Header

...he is passing the ball in this way... Pass

...this is an indirect free kick... Indirect free kick


Table 3.5: Some sample utterances

 




Bicycle 198

Corner kick 406

Dribbling 2590

Free kick 687

Header 218

Pass 3694

Shot 4081

Tackle 2262

Throw-in 312

None 39663

Total 54111


Table 3.6: Summary of the n-grams generated from utterances



Pass


Shot: All of the shots heading to the goalpost, going wide, cleared by the goalkeeper and resulting in a goal are included in this category.


Tackle: All of the sliding tackles, poke tackles and fouls are included in this category.


Throw-in


None: The n-grams implying more than one event, or none of the events above, are labeled None.


Using this annotation method, we aimed to build a text classification model capable of finding the motion events described in the sub-segments of a long text segment. The distribution of the n-grams among the motion events is presented in Table 3.6.
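As a concrete illustration of the n-gram extraction step mentioned above, the following Python sketch enumerates every n-gram with n from 1 to 6 for a phrase. Tokenization by whitespace and the example phrase are assumptions for illustration, not the exact pipeline used on the Turkish transcripts.

```python
def ngrams_up_to(phrase, max_n=6):
    """Return every contiguous n-gram (1 <= n <= max_n) of a phrase."""
    words = phrase.split()
    grams = []
    for n in range(1, min(max_n, len(words)) + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

# Hypothetical example (translated segment):
# ngrams_up_to("he is passing the ball")
# -> ["he", "is", ..., "he is", "is passing", ..., "he is passing the ball"]
```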

 



3.6.3 Sketched Symbol - Utterance Mapping


To measure the performance of our multimodal interpretation framework, we also labeled each motion arrow drawn in the data collection sessions with the corresponding motion event described by the participant. Some motion arrows were not labeled because they described none of the motion events. For instance, a ball motion arrow indicating a ball coming to a player, but whose source of arrival is not certain, could not be associated with any motion event. Through the strategy described above, we gathered 397 motion arrow - motion event mappings from 148 scene descriptions. A sample mapping is illustrated in Figure 3.10.



Figure 3.10: A sample mapping between motion arrows and motion events

 



Notes to Chapter 3



1 A sample is available on-line at http://www.goal.com/tr/match/galatasaray-vs be relax


2 Note that the utterances were collected in Turkish.

 





Chapter 4


MULTIMODAL INTERPRETATION


Using recent advances in sketched symbol classification and natural language understanding, we have developed a multimodal interpretation framework that is able to understand the sequence of motion events from simultaneously given sketching and speech inputs. To come up with the most accurate framework, we ran several experiments with different sketched symbol and text classification models. Next, we devised several co-interpretation algorithms combining different sketched symbol and text classifiers. These co-interpretation models exhibited notable performance on the data collected from the participants. This implies that our multimodal interpretation algorithm, together with the data collected through our software tools and protocol, is suitable for building bimodal and natural interfaces for expressing motion.


4.1 Sketched Symbol Classification


Although the state-of-the-art sketched symbol classifiers leverage neural networks [Seddati et al., 2016], such methods do not seem to give the best results on our symbol dataset, since its size does not seem sufficient for training (see Table 3.2). For small datasets, the best results have been obtained through support vector machines with the RBF kernel trained on the IDM feature space [Ouyang and Davis, 2009, Yesilbek et al., 2015, Tumen et al., 2010]. Thus, we evaluated only the SVM-RBF classifier combined with the IDM feature extraction pipeline. The assessment was done using grid search with 5-fold cross validation on the (C, γ) SVM-RBF hyper-parameter pairs such that −1 ≤ log2 C ≤ 10 and −10 ≤ log2 γ ≤ 1. The distribution of accuracy rates among the symbol classes, for the parameter setting where the overall accuracy rate is maximal, is presented in Figure 4.1. As seen in the figure, the accuracy rate for every class is either close to or above 90%. This means that the SVM classification method with the IDM feature extraction mechanism is sufficient for recognizing individual sketched symbols expressing motion.
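For concreteness, the following scikit-learn sketch mirrors this grid search setup. The IDM feature extraction step is assumed to be available as a hypothetical helper extract_idm_features (IDM is not part of scikit-learn), and symbols and labels stand in for the annotated dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: IDM feature vectors of the sketched symbols, y: class labels.
# extract_idm_features, symbols, and labels are hypothetical placeholders.
X = np.array([extract_idm_features(symbol) for symbol in symbols])
y = np.array(labels)

param_grid = {
    "C": 2.0 ** np.arange(-1, 11),       # -1 <= log2 C <= 10
    "gamma": 2.0 ** np.arange(-10, 2),   # -10 <= log2 gamma <= 1
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```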



Figure 4.1: Classification accuracies of the SVM-RBF + IDM classifier / feature extractor combination





4.2 Natural Language Understanding


A speech input is basically a long text in which a sequence of motion events is described. To find the individual motion events mentioned in such texts, a text classifier must be trained so that it classifies sub-parts as accurately as possible. To this end, we created a dataset of n-grams from the transcripts collected from the participants. Because of the irregular nature of spoken language, these n-grams are not grammatically correct and are thus called short texts. We then built text classification models by leveraging recent short text classification frameworks. In our work, we tested several variants of the neural network-based model proposed by [dos Santos and Gatti, 2014] and of [Yu et al., 2013]'s SVM and logistic regression-based models (implemented in a library called LibShortText), all listed below. Note that despite not being designed solely for short texts, FastText [Joulin et al., 2016] was also included in the experiments because of its considerably short training and classification runtimes.

Unigram + SVM: Support vector machine classification on the unigram feature domain [Yu et al., 2013].


Bigram + SVM: Support vector machine classification on the bigram feature domain [Yu et al., 2013].


Unigram + LR: Logistic regression classification on the unigram feature domain [Yu et al., 2013].


Bigram + LR: Logistic regression classification on the bigram feature domain [Yu et al., 2013].

CHARSCNN: A long short-term memory (LSTM) based deep neural network constructing mathematical representations of texts by learning word embeddings at both the word and character levels [dos Santos and Gatti, 2014].


CHARSCNN (word embeddings only): A variant of the CHARSCNN model such that the mathematical representations of texts are constructed by learning only word-based word embeddings.


CHARSCNN (character embeddings only): A variant of the CHARSCNN model such that the mathematical representations of texts are constructed by learning only character-based word embeddings.

 



Unigram + FastText: A FastText classifier trained on the unigram features [Joulin et al., 2016].


Bigram + FastText: A FastText classifier trained on the bigram features [Joulin et al., 2016].


To assess the classification performance of all these variants, the n-grams were randomly divided into 5 chunks, 4 of which were used for training and 1 for testing. The hyper-parameter search for the SVM and LR-based classifiers was done internally by the LibShortText library [Yu et al., 2013], whilst the default parameters were used for the CHARSCNN and FastText models. Figures 4.2, 4.3, and 4.4 show the F1, precision, and recall rates of all frameworks, respectively.
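As an illustration, the bigram FastText variant can be trained and evaluated with the official fasttext Python bindings roughly as follows. The file names and the label-prefixed line format are assumptions about the preprocessing, not the exact setup used in the experiments.

```python
import fasttext

# train.txt / test.txt: one n-gram per line, prefixed with its label,
# e.g. "__label__pass he is passing the ball" (hypothetical example).
model = fasttext.train_supervised(input="train.txt", wordNgrams=2)

# Evaluate on the held-out chunk: test() returns (#samples, precision, recall).
n, precision, recall = model.test("test.txt")
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")

# Classify a new short text segment.
labels, probs = model.predict("he is passing the ball")
```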



Figure 4.2: F1 rates of different text classification architectures



As seen in the figures, all the rates are above 65%, which suggests that a sufficient infrastructure for understanding motion events from short texts has been established.

 




 


Figure 4.3: Precision rates of different text classification architectures







Figure 4.4: Recall rates of different text classification architectures



Additionally, although support vector machines showed remarkable performance on the dataset, using these traditional models would not be a proper design decision for building a real-life multimodal video retrieval system. This is because such models rely on word frequencies to compute features, leading to misclassification of sentences containing words not included in the training set.

 



Considering all these reasons, we decided to use the FastText model on bigram features, which had reached the highest F1 rates (all above 95%), and we expected it to identify the motion events in a transcript most accurately during the user evaluation studies of the video retrieval system.


4.3 Co-Interpretation of Sketching and Speech


Using the trained sketched symbol and short text classifiers, we designed a heuristic algorithm that matches the sketched motion symbols to the sequence of motion events found in a text. This algorithm makes it possible to convert multimodal input into a plain text query that is later used to find the soccer match segments containing the motions described. A sample matching is illustrated in Figure 4.5.




Figure 4.5: An illustration of sketched symbol - motion event matching



The multimodal interpretation algorithm consists of two steps. In the first step, the transcribed speech input is split into short texts using the text classifier. Since the text classifier was trained on n-grams with n ranging from 1 to 6, the lengths of the resulting segments must be between 1 and 6 words (both inclusive). The optimal segmentation is the one with the maximum segmentation log-probability, defined as the sum of the maximum classification log-probabilities of the individual short text segments. This problem was modeled as a dynamic programming problem as follows:

f(·): the function returning the maximum classification log-probability of a short text segment

T[i]: the maximum sum of classification log-probabilities of short text segments when the first i words are segmented

W: the array of words in a transcribed text

$$T[i] = \max_{j \in \{1, 2, \ldots, \min(6, i)\}} \Big\{\, T[i-j] + f\big(W[(i-j+1):i]\big) \,\Big\}$$

T[n]: the value to maximize, where n is the number of words in W
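A minimal Python sketch of this segmentation step is given below. The helper best_log_prob stands in for the text classifier's maximum classification log-probability of a segment; it is a hypothetical placeholder, not part of any library, and the sketch is an illustration under these assumptions rather than the system's actual implementation.

```python
def segment_transcript(words, best_log_prob, max_len=6):
    """Split words into segments of 1..max_len words maximizing the sum of
    per-segment maximum classification log-probabilities (T in the text)."""
    n = len(words)
    T = [0.0] * (n + 1)          # T[i]: best score for the first i words
    back = [0] * (n + 1)         # back[i]: length of the last segment
    for i in range(1, n + 1):
        best, best_j = None, 1
        for j in range(1, min(max_len, i) + 1):
            score = T[i - j] + best_log_prob(words[i - j:i])
            if best is None or score > best:
                best, best_j = score, j
        T[i], back[i] = best, best_j
    # Recover the segmentation from the back-pointers.
    segments, i = [], n
    while i > 0:
        segments.append(words[i - back[i]:i])
        i -= back[i]
    return list(reversed(segments))
```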

The input to the second step is the array of the highest-probability motion event classes of these segments. Initially, the None entries of the array, which do not express any motion event, are excluded. Next, the remaining entries are distributed among the sequence of motion arrow symbols drawn, through another dynamic programming problem formulated as follows:

E: the array of motion events (excluding Nones)

Ball motion events = [header, corner kick, pass, bicycle, free kick, shot, throw-in]

Player motion events = [tackle, dribbling]

b(·): the number of ball motion events in an array

p(·): the number of player motion events in an array


$$r(\text{arrow}, E) = \begin{cases} -b(E) & \text{if } p(E) = 0 \text{ and the arrow indicates a player motion} \\ 1 - b(E) & \text{if } p(E) > 0 \text{ and the arrow indicates a player motion} \\ -p(E) & \text{if } b(E) = 0 \text{ and the arrow indicates a ball motion} \\ 1 - p(E) & \text{if } b(E) > 0 \text{ and the arrow indicates a ball motion} \end{cases} \tag{4.1}$$


K : The array of sketched motion arrow symbol classes (sorted by drawing order, from the first to last)

 




 

T [i, j] = maxp=0,1,...,i−1{T [i − p, j − 1] + r(K[j], E[(i − p + 1) : i])} if j > 1

r(K[j], E[1 : i]) if j = 1

n : number of items in the E array m : number of items in the K array T [n, m] : The value to maximize

 



(4.2)

 

To explain further, this dynamic programming model tries to distribute the motion events detected in a transcribed text among the motion arrow symbols drawn by the user, while preserving the temporal order of both the events and the drawings. There are two kinds of reward functions, one for matching a ball motion arrow and another for matching a player motion arrow. Each function yields 1 point if at least one motion event relevant to the motion arrow symbol is assigned to it, and it subtracts 1 point for each irrelevant motion event assigned to it. This leads to a homogeneous distribution of the motion event entries while preserving the relevance between motion arrows and events. Note that, depending on the speech and sketch inputs, some motion arrows may end up with no motion events assigned.
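A small Python sketch of the reward function r and the two event lists, following Equation 4.1 and the definitions above; the event names are taken from the text, while the data representation and function signature are assumptions for illustration.

```python
BALL_EVENTS = {"header", "corner kick", "pass", "bicycle",
               "free kick", "shot", "throw-in"}
PLAYER_EVENTS = {"tackle", "dribbling"}

def reward(arrow_is_ball_motion, events):
    """r(arrow, E): +1 if at least one relevant event is assigned to the arrow,
    minus 1 for every irrelevant event assigned to it (Equation 4.1)."""
    b = sum(e in BALL_EVENTS for e in events)      # ball motion events in E
    p = sum(e in PLAYER_EVENTS for e in events)    # player motion events in E
    if arrow_is_ball_motion:
        return (1 if b > 0 else 0) - p
    return (1 if p > 0 else 0) - b
```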

This two-step matching algorithm was inspired by [Hse et al., 2004]'s work on segmenting sketched symbols, which splits sketched symbols into primitive tokens such as arcs and lines. That segmentation process is based on a dynamic programming model where segments are fitted to either an arc or a line.

We evaluated our matching algorithm on the dataset of sketched symbol - motion event mappings. Since the algorithm can assign multiple motion events to a single motion arrow symbol, a motion event - arrow match is considered correct if the correct motion event is among those matched to the arrow. The matching algorithm was evaluated with different text classification models (CHARSCNN and FastText with bigram features). The evaluation results are presented in Tables 4.1, 4.2, and 4.3. As seen in the tables, the FastText-based segmentation is more accurate than the CHARSCNN-based segmentation.

 




Criteria                                            CHARSCNN            FastText
Average accuracy rate in a search task              68.88% ± 33.61%     75.38% ± 30.16%
(correctly matched motion arrows in a search
task / all motion arrows in the search task)
Overall accuracy rate                               63.48% (252/397)    69.02% (274/397)
(correctly matched motion arrows in all search
tasks / all motion arrows in all search tasks)

Table 4.1: Accuracy rates of the multimodal interpretation algorithm





This is because FastText is more accurate than CHARSCNN at short text classification. Moreover, there are high accuracy rates for the motion events with many examples, such as pass, tackle, and free kick. However, this is not the case for the dribbling class, since the variation of the words within this class is larger than in the other classes. It is thought that having more keywords but fewer samples for a motion event might lower the accuracy rate. Furthermore, as seen in Table 4.3, our multimodal interpretation framework outperforms a random matcher. Since there are 8 distinct ways of matching a ball motion arrow (7 ball motion events + matching none of the events), a random matcher is 12.5% accurate. In the case of player motions, there are 3 different ways of matching (2 player motion events + matching none of the events), and thus its accuracy is 33.3%.

 




Motion event   Accuracy (CHARSCNN)   Accuracy (FastText)
Header         40% (2/5)             60% (3/5)
Corner kick    87.5% (7/8)           87.5% (7/8)
Pass           78.57% (77/98)        84.69% (83/98)
Bicycle        40% (2/5)             20% (1/5)
Free kick      90.91% (10/11)        90.91% (10/11)
Shot           73.44% (47/64)        96.88% (62/64)
Throw in       60% (6/10)            70% (7/10)
Tackle         73.68% (28/38)        86.84% (33/38)
Dribbling      46.2% (73/158)        43.04% (68/158)

Table 4.2: Distribution of accuracy rates of the multimodal interpretation algorithm among the motion events




Within this context, our two-step multimodal interpretation heuristic seems applicable for use in our final multimodal video retrieval system. Owing to its higher accuracy, FastText was chosen as the text classification model of the video retrieval system.


4.4 User Interaction Mechanism


Based on the sketched symbol classifier, the text classifier, and the multimodal interpretation model, the user interaction mechanism of the video retrieval system is planned as shown in Figure 4.6. As seen in the figure, the video retrieval system has been designed to take speech and sketching simultaneously. Users draw with a stylus pen. To distinguish individual symbols, users must pause drawing briefly once a symbol is complete. If a user stops drawing for more than a second, the coordinates of the points in the most recent symbol are sent to the sketched symbol classifier and, depending on the classification result, the symbol is beautified. On the speech side, the speech input is transcribed through Google's speech recognition API [Google, 2017].

 




Motion arrow type   Accuracy (random matcher)   Accuracy (CHARSCNN)   Accuracy (FastText)
Ball motion         12.5% (1/8)                 75.12% (151/201)      86.06% (173/201)
Player motion       33.3% (1/3)                 51.5% (101/196)       51.5% (101/196)

Table 4.3: Distribution of accuracy rates of the multimodal interpretation algorithm among the motion arrow types



To overcome mistranscriptions due to network breakages, users are also allowed to type what they said directly into the system through the keyboard. Finally, once a user finishes describing a scene, the transcribed text and the sketched symbol classification results are evaluated together by the multimodal interpretation algorithm.
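A schematic sketch of this pause-triggered classification loop is shown below. The one-second threshold comes from the text; the classifier and canvas objects and their methods are hypothetical placeholders, not the system's actual interfaces.

```python
import time

PAUSE_SECONDS = 1.0   # pause that marks the end of a symbol (from the text)

def on_pen_idle(last_point_time, current_symbol_points, classifier, canvas):
    """Called periodically; once the user has paused for more than a second,
    classify the pending symbol and beautify it on the canvas."""
    if current_symbol_points and time.time() - last_point_time > PAUSE_SECONDS:
        label = classifier.classify(current_symbol_points)
        canvas.beautify(current_symbol_points, label)   # e.g. circle or arrow
        current_symbol_points.clear()
```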





Figure 4.6: The user interaction mechanism of the video retrieval system

 



4.5 Search Query Generation


The multimodal input must be converted into a plain text search query to perform retrieval on the database back end. Generating back-end search queries involves three steps. As the first step, the end points of the motion arrows drawn on the canvas are detected. The end points of a motion arrow are assumed to be the two points of the arrow that are furthest apart [Hammond and Davis, 2006b]. These end points are then marked as starting and ending points depending on the orientation of the arrow. If an arrow is heading west, the end point with the smaller x coordinate is marked as the starting point and the other as the ending point; the assignment is reversed if the arrow is heading east. To determine the orientation, we also trained a binary SVM classifier on the IDM feature domain. The 5-fold cross validation results of this model are depicted in Figure 4.7.¹ As seen in the figure, the classifier is accurate in determining the orientation of an arrow provided its head is drawn clearly.

As the second step, each motion arrow is associated with a player symbol. Our system does not care about the team names, only that the teams are different. The player performing a motion is assumed to be the one closest to the arrow within a 40-pixel radius. The teams are distinguished by whether the player symbols are filled. Moreover, if a motion arrow is associated with a player, its starting point is updated to the center of the associated player symbol.² Furthermore, if there is another arrow within a 40-pixel radius of the ending point of an arrow, the motions indicated by these arrows are assumed to be performed by the same team. As the last step, the multimodal interpretation algorithm is executed. For each arrow, the end points, the team information, and its motion events are packed into a plain text query. Figure 4.8 illustrates the process of generating a plain text search query from multimodal input.
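The sketch below illustrates the first two steps under the stated assumptions (furthest-point-pair end points and a 40-pixel association radius). It is a simplified illustration, not the system's actual implementation; players is assumed to be a list of player symbol centers.

```python
from itertools import combinations
import math

def arrow_endpoints(points):
    """Return the two points of an arrow that are furthest apart."""
    return max(combinations(points, 2),
               key=lambda pair: math.dist(pair[0], pair[1]))

def associate_player(arrow_start, players, radius=40):
    """Return the closest player center within `radius` pixels, or None."""
    candidates = [(math.dist(arrow_start, center), center) for center in players]
    dist, center = min(candidates, default=(None, None))
    return center if dist is not None and dist <= radius else None
```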

 




 


Figure 4.7: Classification accuracies of the SVM-RBF + IDM classifier / feature extractor combination for finding out the orientation of a motion arrow symbol




Figure 4.8: An illustration of the database query generation

 



Notes to Chapter 4



1 Note that the hyperparameter ranges used in the grid search are the same as the ones used for the symbol classification.


2 The center of a player symbol is the average of all points (center of mass) included in the symbol.

 





Chapter 5


DATABASE SYSTEM


To enable motion-based retrieval using plain text queries generated from multimodal input, a database system has been designed and several retrieval algorithms have been implemented. The following subsections give a brief overview of the database system design and the retrieval algorithms that comply with the plain text search query format.


5.1 Database Design


Our database system treats every single motion event (header, corner kick, pass, bicycle, free kick, throw-in, tackle, shot, and dribbling) as a primitive motion scenario. Every motion scenario described by a user is considered to be an aggregation of primitive motion scenarios. Following this approach, we designed the database to store and index a set of primitive soccer match cutouts, each of which is represented by a feature vector (id, fs, fe, sx, sy, ex, ey, t, e). All features are explained below (a minimal sketch of this representation follows the list).


id: A unique number created for every soccer match. This prevents compiling primitive motion scenarios from different matches into the same video clip.


fs: The frame number where the cutout starts


fe: The frame number where the cutout ends


sx: The x coordinate of the point where the primitive motion starts


sy: The y coordinate of the point where the primitive motion starts

 



ex: The x coordinate of the point where the primitive motion ends


ey: The y coordinate of the point where the primitive motion ends


t: The team of the player involved in the primitive motion scenario (gets either 1 or 2)


e: The event name of the primitive motion scenario (either of header, corner kick, pass, bicycle, free kick, throw-in, tackle, shot and dribbling)
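A minimal sketch of this cutout representation as a Python dataclass, following the feature list above; the field types are assumptions, and the actual storage and indexing are handled by ADAMpro, which is not shown here.

```python
from dataclasses import dataclass

@dataclass
class PrimitiveCutout:
    """One primitive motion cutout of a soccer match (see the feature list above)."""
    match_id: int   # id: unique per soccer match
    fs: int         # frame number where the cutout starts
    fe: int         # frame number where the cutout ends
    sx: float       # x coordinate where the primitive motion starts
    sy: float       # y coordinate where the primitive motion starts
    ex: float       # x coordinate where the primitive motion ends
    ey: float       # y coordinate where the primitive motion ends
    t: int          # team of the involved player (1 or 2)
    e: str          # event name, e.g. "pass" or "shot"
```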


We recorded 5 soccer matches between AI-controlled teams in the Eat the Whistle game [Greco, 2008]. For each match, we extracted the primitive motion events and stored their feature representations in ADAMpro, a scalable database utility for big multimedia collections [Giangreco and Schuldt, 2016]. To extract the features of each primitive motion cutout, a state machine was modeled and implemented for each event type. All state machines are explained below, and illustrations of the more complex ones are provided (a simplified code sketch of the pass state machine is given after the list).

Header: All the subsequent frames from one where a player is heading the ball, to the latest one where the ball is free and has not been picked up by someone else are considered to be about a header event (see Figure 5.1).

Corner kick: All the subsequent frames from one where a corner kick is set, to the latest one where the ball is free and has not been picked up by someone else are considered to be about a corner kick event.

Pass: If the team having the ball does not change but the player does, all the subsequent frames from the one where the ball gets free to the one where the ball is picked up again are considered to be about a pass event (see Figure 5.2).

Bicycle: All the subsequent frames from one where a player is doing a bicycle kick, to the latest one where the ball is free and has not been picked up by someone else are considered to be about a bicycle event.

 



Free kick: All the subsequent frames from one where a free kick is set, to the latest one where the ball is free and has not been picked up by someone else are considered to be about a free kick event.


Shot: If the ball is picked up by the opponent’s goalkeeper or it goes wide, all the subsequent frames from the one where the ball gets free to the one where it goes wide or caught by the goalkeeper are considered to be about a shot event.


Throw-in: All the subsequent frames from one where a throw-in is set, to the latest one where the ball is free and has not been picked up by someone else are considered to be about a throw-in event.


Tackle: All the subsequent frames from one where a player tries to steal the ball to the latest one where the ball is free and has not been picked up by someone else are considered to be about a tackle event.


Dribbling: All the subsequent frames from one where the ball is picked up by a player to the one where he loses it are considered to be about a dribbling event (see Figure 5.3).
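As referenced above, the following is a simplified illustration of the pass state machine under an assumed per-frame input format; it is not the implementation used to populate the database.

```python
# Hypothetical per-frame input: (frame_no, owner), where owner is None when
# the ball is free, or a (team, player) tuple when a player controls the ball.
def find_passes(frames):
    """Return (release_frame, pickup_frame) ranges where the same team but a
    different player picks the ball up again, i.e. pass events."""
    passes = []
    last_owner, release_frame = None, None
    for frame_no, owner in frames:
        if owner is None:
            if last_owner is not None and release_frame is None:
                release_frame = frame_no            # ball got free
        else:
            if release_frame is not None and last_owner is not None:
                same_team = owner[0] == last_owner[0]
                different_player = owner[1] != last_owner[1]
                if same_team and different_player:
                    passes.append((release_frame, frame_no))
            last_owner, release_frame = owner, None  # ball picked up
    return passes
```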


Moreover, regardless of whether they belong to a primitive motion event, all match frames are saved to disk in order to prevent gaps between the video clips of primitive motions. If two successive primitive events have other frames in between, these frames are also included in the video clip presented in the search results.


5.2 Retrieval Framework


The ADAMpro utility supports two modes of retrieval:


Boolean retrieval: Retrieval based on exact matching between a query and a document

 




 


Figure 5.1: An illustration of the state machine finding out the headers in the ball motion record of a soccer match



Vector space retrieval: Retrieval based on the similarity between a query and a document. The smaller the distance between the feature vectors of two items, the more similar they are.


Our retrieval framework exploits these two methods to retrieve video clips containing the motions described. First, for each motion entry in a plain text query, all primitive video clips having one of the motion events listed in the entry are retrieved through boolean retrieval (see Algorithm 1). Next, on the video clips found in the previous step, vector space retrieval is performed using the position information as the query vector. Finally, starting from the first entry, for each cutout found for a motion entry, the closest cutout of the next motion entry occurring within 250 frames of the first frame is sought.¹ If such a cutout exists, it is marked as the cutout of the next motion entry, and the seeking process repeats between the next two motion entries; otherwise, the seeking process is terminated. If the chain of seeking processes reaches the last motion entry, the frames between the first frame of the first entry's cutout and the last frame of the last entry's cutout are compiled into a video clip to be presented as a search result. Note that the results are presented sorted by their similarity scores in descending order (see Algorithm 2).

 




 


Figure 5.2: An illustration of the state machine finding out the passes in the motion record of a soccer match



The similarity score of a video clip is defined as the sum of the squared Euclidean distances of the cutouts it includes.²

 













Algorithm 1: An algorithm that finds primitive video cutouts for each motion entry in a search query

Data: K (an array of motion event entries in a search query)
Result: results (an array of search results for each motion entry in the search query)

results ← []
max_number_of_results ← 10
for each entry in K do
    boolean_query ← (e == entry.motion_events[0])
    possible_event_count ← entry.motion_events.length
    for each n in [1, 2, ..., possible_event_count − 1] do
        boolean_query ← boolean_query || (e == entry.motion_events[n])
    end
    query_vector ← <entry.sx, entry.sy, entry.ex, entry.ey>
    results.append(database.vector_space_retrieval(query_vector,
        max_number_of_results, database.boolean_retrieval(boolean_query)))
end

 







 

Algorithm 2: An algorithm yielding search results as video clips

Data: results (an array of search results for each motion entry in the search query)
Result: S (an array of video clips with their similarity scores)

max_frame_count ← 250
motion_query_count ← results.length
first_motion_entry_results ← results[0]
first_motion_entry_event_count ← first_motion_entry_results.length
S ← []
for each n in [0, 1, ..., first_motion_entry_event_count − 1] do
    first_event_clip ← first_motion_entry_results[n]
    frame_range ← first_event_clip.frame_range
    similarity_score ← first_event_clip.similarity_score
    for each m in [1, 2, ..., motion_query_count − 1] do
        next_results ← results[m]
        next_results_count ← next_results.count
        min_frame_diff ← max_frame_count
        closest_clip ← None
        last_frame_range ← None
        for each p in [0, 1, ..., next_results_count − 1] do
            frame_diff ← next_results[p].frame_range.start − frame_range.start
            if frame_diff ≤ min_frame_diff and next_results[p].id == first_event_clip.id then
                min_frame_diff ← frame_diff
                closest_clip ← next_results[p]
                last_frame_range ← next_results[p].frame_range
            end
        end
        if closest_clip == None then
            break
        else
            frame_range ← frame_range ∪ last_frame_range
            similarity_score ← similarity_score + closest_clip.similarity_score
        end
        if m == motion_query_count − 1 then
            S.add((frame_range, similarity_score))
        end
    end
end
// Sort S by similarity score in descending order
S.sort(key = similarity_score, order = ’descending’)

 


 

 

















 


Figure 5.3: An illustration of the state machine finding out the dribbling events in the motion record of a soccer match

 



Notes to Chapter 5



1 Note that the candidate cutouts are from the same soccer match.


2 The squared Euclidean distances of the cutouts are computed by ADAMpro.

 





Chapter 6


USER EVALUATION STUDIES


A video retrieval system has been built by integrating the multimodal input evaluation framework into the database back end.¹ More details on the system's UI components and the protocol followed in the user evaluation studies, together with some insights on the results, are provided in the following subsections.


6.1 User Interface Design


The user interface components of the video retrieval system are shown in Figure 6.1. As seen in the figure, the UI consists of two main columns. The left column includes a text input field where the speech input is transcribed and displayed. Users can edit its content with the keyboard in case the connection to Google's speech recognition service [Google, 2017] breaks or the text is mistranscribed. Below the text field, there is a drawing canvas with a soccer field background image. The buttons just below the canvas clear the canvas, remove the most recently drawn symbol, start the multimodal interpretation and trigger a search. The bottommost section on the left panel is a video display. The right panel lists the search results as thumbnails. When a user clicks on a thumbnail, the corresponding video clip is played in the video display. Note that whenever the compilation of a video clip is finished, the clip is sent to the right panel without waiting for the other clips to be compiled. Thus users can watch the clips that have already been compiled while the search is still in progress, which provides a better user experience than presenting all the clips at once. To replay the most recently watched video, users can click on the blue button below the display. During the user studies, whenever participants think that they have found the correct video, they can click on the green button to evaluate the correctness of the

 



recently watched video.





Figure 6.1: The UI components of the video retrieval system





Moreover, to enhance the user experience while drawing, a symbol is beautified when the user stops drawing it. As illustrated in Figure 6.2, if the user pauses for more than a second, the symbol is sent to the sketched symbol classifier and beautified according to the classification result. This gives users the impression that the system is actively responding to their input. If the symbol is classified as a player, the system draws a circle with a radius of 40 pixels. Otherwise, the system detects the endpoints of the symbol and draws an arrow (either dashed or solid).
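A small sketch of this beautification step is given below; the classifier, canvas and stroke helpers are hypothetical placeholders for the actual UI components, and only the one-second pause, the 40-pixel player circle and the endpoint-based arrow drawing come from the description above.

PAUSE_THRESHOLD = 1.0   # seconds of inactivity before a symbol is beautified
PLAYER_RADIUS = 40      # radius of the beautified player circle, in pixels

def beautify_symbol(strokes, seconds_since_last_point, classifier, canvas):
    if seconds_since_last_point < PAUSE_THRESHOLD:
        return  # the user may still be drawing this symbol
    label = classifier.classify(strokes)
    if label == 'player':
        # Players are redrawn as circles with a fixed 40-pixel radius.
        canvas.draw_circle(center=strokes.center_of_mass(), radius=PLAYER_RADIUS)
    else:
        # Motion symbols are redrawn as arrows between the detected endpoints;
        # whether the arrow is dashed or solid depends on the symbol class.
        start, end = strokes.endpoints()
        canvas.draw_arrow(start, end, style=classifier.arrow_style(label))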

To illustrate the joint interpretation of sketching and speech on the drawing canvas, each motion event is written near the center of mass of the arrow the event is assigned to (see Figure 6.3). If a user thinks that a motion event is not assigned to an arrow correctly, she can click on the motion events listed on the arrow and add the desired event(s) by marking them in a pop-up window.

 




 



Figure 6.2: A sample symbol beautification in the video retrieval system




6.2 Method


The user evaluation studies take place around a table shown in Figure 6.4. Initially, the participant sits in front of the computer located at the right half of the table. Next, the experimenter gives a brief overview of the study through a presentation on the computer. After the presentation, the experimenter plays two short video clips and asks the participant to draw the motions of the ball and players in these clips on a sheet of paper with a soccer field background image, using the symbols given in the visual vocabulary (see Figure 6.5). Then the experimenter reviews the symbols drawn. If there is a misdrawn symbol, he asks the participant to re-draw it.

Once the training on sketched symbols is complete, the experimenter removes access to the visual vocabulary and the search tasks begin. The participant sits on the left side of the table. There are five search tasks: a warm-up task and four actual tasks. The participant's interactions with the search engine are not recorded during the warm-up

 




 



Figure 6.3: A sample illustration of multimodal interpretation on the drawing canvas






Figure 6.4: The physical set-up of the user evaluation studies




task. For each search task, a cutout from one of the 5 indexed soccer matches is randomly selected and played on the screen2. Next, the participant tries to find this video clip through our video retrieval system within 8 minutes. Within this time period,

 




 


Figure 6.5: The visual vocabulary used during the user evaluation studies




the participant is allowed to perform as many video searches as she would like. Whenever she thinks that she has found the correct video, she clicks on the green button below the video display of the retrieval system3. Then the computer located at the right half of the table evaluates whether the video clip submitted by the participant is correct. If the video clip is correct4, the search task ends successfully. Otherwise, the search task continues until the time is over. If the time runs out and the correct video clip has not been found, the search task fails.

Moreover, in order to observe how participants established their search strategies based on time and number of attempts, we reviewed the scoring functions used in the assessment of video retrieval systems. The metric developed by [Schoeffmann et al., 2014] uses two parameters: the time passed until finding the correct video (t) and the number

of attempts made until finding the correct video (p). The scoring function is as follows

(s denotes the points gained from a search task, and tmax denotes the time allowed in a search task):

s(t, p) = \frac{100 - 50\,\frac{t}{t_{max}}}{p}

This formula enables a trade-off between time and number of attempts. The score decreases linearly with time, whereas the marginal score loss caused by an additional attempt is inversely proportional to the square of the number of attempts already made. Since each further attempt therefore costs less than the previous one, the formula motivates the participants

 



to make as many attempts as possible until finding the correct clip. The participants favoring time cannot be distinguished from the ones favoring number of attempts. To overcome this problem, we created a new scoring function with a parametrized trade-off between time and number of attempts. The function is derived from the functions yielding the point loss per unit of time or per attempt, namely T and P.

T(t; w, \alpha, t_{avg}) = -100\, w \left(\frac{t}{t_{avg}}\right)^{\alpha}

(t: the time passed until finding the correct video clip, t_avg: the arithmetic mean time passed until finding the correct video or failing)

P(p; w, \beta, p_{avg}) = -100\, (1 - w) \left(\frac{p}{p_{avg}}\right)^{\beta}

(p: the number of attempts until finding the correct video clip, p_avg: the arithmetic mean number of attempts until finding the correct video clip)

The −100 coefficient in both functions implies that a participant can lose up to 100 points. The w coefficient is a real number in the [0, 1] range; it determines the linear contribution of time and number of attempts to the total loss. The last factor in both functions is an exponentiation of the fraction of time passed or attempts made (α, β ∈ [0, 1]). Setting α = 0 in T means that every unit of time passed causes the same point loss. If α > 0, the point loss increases with respect to time. However, as α approaches 1, the point loss for a particular time or attempt count decreases.

By taking the indefinite integrals of the loss functions as follows, the scoring function used in our evaluation studies has been derived. Note that the constants coming from the integrals are set to 0.

S(t, p; w, \alpha, \beta, t_{avg}, p_{avg}) = 100 + \int T(t; w, \alpha, t_{avg})\, d\!\left(\frac{t}{t_{avg}}\right) + \int P(p; w, \beta, p_{avg})\, d\!\left(\frac{p}{p_{avg}}\right)

= 100 - 100\, w \left(\frac{t}{t_{avg}}\right)^{\alpha+1} \frac{1}{\alpha+1} - 100\, (1 - w) \left(\frac{p}{p_{avg}}\right)^{\beta+1} \frac{1}{\beta+1}

The points gained from a search task are the product of the video clip similarity5 and the value obtained from the scoring function. If the search task fails, the points gained from the task are zero. The total points collected in an evaluation study are computed by summing the points gained from the actual search tasks.
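A runnable sketch of the derived scoring function is shown below, using the averages reported in the notes to this chapter (t_avg = 285.32 seconds and p_avg = 1.63 attempts); the example parameter values are illustrative only.

def score(t, p, w, alpha, beta, t_avg=285.32, p_avg=1.63):
    """Points gained from a search task before weighting by clip similarity."""
    time_loss = 100.0 * w * (t / t_avg) ** (alpha + 1) / (alpha + 1)
    attempt_loss = 100.0 * (1 - w) * (p / p_avg) ** (beta + 1) / (beta + 1)
    return 100.0 - time_loss - attempt_loss

# Example: a hypothetical participant finding the clip in 120 s with 2 attempts,
# weighting time and attempts equally (w = 0.5) and alpha = beta = 0.5.
print(round(score(t=120, p=2, w=0.5, alpha=0.5, beta=0.5), 2))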

 



6.3 Results


We conducted the user evaluation studies on our video retrieval system by involving 15 participants. The following criteria were used in the assessment of the system:


Sketched symbol classification performance


Multimodal interpretation performance


Search performance of the participants


Search runtime


The search strategies established by the participants


6.3.1 Sketched Symbol Classification Performance


To evaluate the classification performance of the sketched symbol classifier used in the video retrieval system, we first manually detected the symbols that were drawn once and not removed from the drawing canvas afterwards. These symbols, which we call permanent symbols, are assumed to have eventually been classified correctly. For each permanent symbol, we also inspected the number of trials needed to reach the correct classification. A lower number of trials for a symbol indicates that the symbol classifier works accurately on the symbols drawn by the participants. Within this context, we kept statistics of the number of trials for each permanent symbol. As presented in Table 6.1, the most frequently drawn symbols are the player symbols, which also have the lowest number of trials. The symbol type with the greatest number of trials is the player motion arrow heading to the west. Analyzing the results altogether, the mean number of trials for all classes is below 1. Considering the sizes of the sketched symbol datasets collected and used for benchmarking [Eitz et al., 2012], we concluded that the support vector machine based classification with the IDM feature extraction method [Ouyang and Davis, 2009] exhibited a remarkable performance on our

 



dataset of sketched symbols. We believe that this finding is useful since it can open the way to data-efficient learning methods for sketched symbols.


Class or orientation | Mean number of trials | Maximum number of trials | Number of permanent symbols | Permanent symbols with ≥ 1 trial
Player from side 1   | 0.15 ± 0.56 | 6 | 184 | 21
Player from side 2   | 0.12 ± 0.40 | 3 | 130 | 12
Ball motion/east     | 0.21 ± 0.70 | 6 | 101 | 21
Ball motion/west     | 0.28 ± 0.80 | 7 | 103 | 28
Player motion/east   | 0.27 ± 0.77 | 5 | 53  | 14
Player motion/west   | 0.46 ± 1.43 | 8 | 41  | 14

Table 6.1: Summary of the number of trials for permanent symbols





6.3.2 Multimodal Interpretation Performance


Out of the 250 motion arrow symbols drawn, the participants added motion events to 77 symbols (30.80%) through the pop-up window. In other words, 77 symbols were not matched with the correct motion event, which means the multimodal

 



interpretation system is 100% − 30.80% = 69.20% accurate on the motion arrows drawn by the participants. Moreover, on average, the participants added motion events to only 30.42% of the motion arrow symbols drawn during a search task (with a standard deviation of 17.70%). Considering that our dataset of n-grams is considerably smaller than the datasets used for benchmarking popular text classification models [dos Santos and Gatti, 2014], we concluded that our multimodal interpretation model exhibited a notable performance on the combined speech and sketching data.





6.3.3 Search Performance of Participants


Out of the 60 video clips searched for by the participants, 37 were found correctly (see Figure 6.6 for the number of clips found correctly by each participant). This means that our video retrieval system's accuracy is 37/60 = 61.67%. Taking the complexity and ambiguity of both speech and sketching into account, this accuracy rate is noteworthy. This retrieval system is the first of its kind in terms of the modalities used, and it therefore constitutes a baseline for future video retrieval systems.

Moreover, for each motion event, we analyzed the fraction of the video clips found correctly. As seen in Table 6.2, the highest accuracy was obtained for corner kicks and the second highest for shots. For the other events, the accuracy rates ranged between 41.67% and 61.54%. These statistics imply that the sketching and speech produced by the participants to describe shots and corner kicks were consistent with each other, whereas for the other events the sketching and speech might not have been consistent among the participants. Due to the limited graphics quality of the Eat the Whistle game, the participants might also have misinterpreted the motion events presented in the video clips, resulting in wrong video clips returned by the retrieval system. For instance, some participants interpreted a free kick as a pass, and the retrieval system then returned only passes rather than free kicks.

 




 


Figure 6.6: The distribution of the video clips found correctly among the participants





6.3.4 Search Runtime


The average completion time of a search was 83.5 seconds (with a standard deviation of 67.58 seconds). On average, the participants found the correct video within the first 26.69 seconds of a search run (with a standard deviation of 21.87 seconds). Combining these results, the participants were able to find the correct video within roughly one-third (26.69/83.5 ≈ 1/3) of a search run; most of the remaining runtime was spent on compiling the video clips found. This implies that the retrieval system tends to list the desired results near the top, so the retrieval algorithm contributes to a good user experience. Note that this system is the first of its kind in terms of the modalities employed; the runtime reported here can therefore be considered a baseline for future multimodal motion retrieval systems.

 




Motion event | Video clips presented | Video clips correctly found by a participant | Accuracy
Header       | 13 | 8  | 61.54%
Corner kick  | 10 | 9  | 90.00%
Pass         | 12 | 5  | 41.67%
Bicycle      | 12 | 6  | 50.00%
Free kick    | 8  | 4  | 50.00%
Shot         | 13 | 11 | 84.62%
Throw-in     | 14 | 8  | 57.14%
Tackle       | 21 | 11 | 52.38%
Dribbling    | 34 | 20 | 58.82%

Table 6.2: Distribution of video clip retrieval accuracies among the motion events




6.3.5 Search Strategies


Using our new scoring metric with different combinations of w, α and β, we projected how the participants established their search strategies based on time67 and attempt count8. All the participants' scores are presented in Figures 6.7, 6.8, 6.9, 6.10 and 6.11. As seen in the figures, the scores for w = 1 with different (α, β)9 values are greater than those for w = 0. The participants' scores increase with the linear weight on the time loss. Most participants favored time over attempt count; thus the loss in overall score due to time was smaller than the loss due to attempt count.

Moreover, for different values of w10, we averaged the scores of each participant over the different combinations of α and β. As shown in Figure 6.12, 5 participants reached their maximum score at w = 0, whereas the other 10 participants reached their maximum at w = 1. This implies that the majority of the participants used time more efficiently than attempt count. Also, our retrieval system seems

 




 


Figure 6.7: The participants’ scores for different combinations of α and β when w = 0





Figure 6.8: The participants’ scores for different combinations of α and β when w = 0.25

 




 


Figure 6.9: The participants’ scores for different combinations of α and β when w = 0.5





Figure 6.10: The participants' scores for different combinations of α and β when w = 0.75

 




 


Figure 6.11: The participants' scores for different combinations of α and β when w = 1



to support both time-favoring and attempt-count-favoring search strategies without huge performance gaps.

 









 



Figure 6.12: Average scores of the participants for different values of w (participants are enumerated from 1 to 15)

 



Notes to Chapter 6



1 Source code of the system is available online at http://github.com/ozymaxx/soccersearch


2 Among the 4 video clips played in the actual tasks, one clip includes a single primitive motion event, one includes two consecutive primitive motion events, one includes three consecutive primitive motion events, and one includes four consecutive primitive motion events.

3 This action is called an attempt in the rest of the paper.


4 A video clip is said to be correct if the intersection of its frames with the frames of the actual clip is at least 80 percent of the frames of the actual one.


5 The number of frames at the intersection between the video clip found by the participant and the correct one / the number of frames in the correct video clip


6 We used second as the unit of time.


7 The average duration of finding a correct video clip was 285.32 seconds, hence tavg

was taken as 285.32.


8 The average number of attempts made in a search task was 1.63, hence pavg was taken as 1.63.

9 α, β ∈ {0, 0.5, 1}

10 w ∈ {0, 0.25, 0.5, 0.75, 1}

 




Chapter 7


CONCLUSIONS



To gain prior knowledge of how humans articulate motion through speech and sketching, we designed a protocol and implemented a set of software tools for encouraging participants to draw and talk in a video retrieval task. The protocol and the software tools were assessed on the soccer match retrieval use case scenario. Ecological validity in these tasks was ensured by involving participants with varying levels of expertise in soccer. The software tools are customizable to the retrieval use case, and the protocol is easy to follow. Afterwards, to build a natural interaction mechanism for depicting motion, we developed a multimodal interpretation model that is able to understand the sequence of motion events specified through speech and sketching. This heuristic model is based on two sub-models, a sketched symbol classifier and a text classifier. Considering the scarcity of the data we collected, the multimodal interpretation model and its sub-models exhibit a remarkable classification performance. These results also show that the software and protocol used in data collection pair well with the multimodal interpretation model for building speech-and-sketch-based user interfaces for characterizing motion. As the final step, we developed a multimodal video retrieval system by combining the natural interaction mechanism with a database back-end designed for big multimedia collections. The system was assessed through user evaluation studies. The study results show that the retrieval system works accurately on real-life multimodal inputs. Since this system is the first of its kind in terms of the modalities employed, it establishes a baseline for future video retrieval systems, and there is considerable room for improvement in both the multimodal interpretation and the video retrieval algorithm. We believe that this system has opened the way for

 


studies concerning multimodal interaction for effective and efficient video retrieval.

 





BIBLIOGRAPHY


[Adler and Davis, 2007a] Adler, A. and Davis, R. (2007a). Speech and sketching: An empirical study of multimodal interaction. In Proceedings of the 4th Eurographics workshop on Sketch-based interfaces and modeling, pages 83–90. ACM.


[Adler and Davis, 2007b] Adler, A. and Davis, R. (2007b). Speech and sketching for multimodal design. In ACM SIGGRAPH 2007 courses, page 14. ACM.


[Al Kabary and Schuldt, 2013] Al Kabary, I. and Schuldt, H. (2013). Towards sketch-based motion queries in sports videos. In Multimedia (ISM), 2013 IEEE International Symposium on, pages 309–314. IEEE.


[Altıok, 2017a] Altıok, O. C. (2017a). Ad-Hoc Video Search Application for Multi-Modal Data Collection.


[Altıok, 2017b] Altıok, O. C. (2017b). Multi-Modal Collector (Experimenter's Side).

[Altıok, 2017c] Altıok, O. C. (2017c). Multi-Modal Collector (Participant's Side).

[Altıok, 2017d] Altıok, O. C. (2017d). Sketch Annotator for Multi-Modal Collector.

[Bangsbo and Peitersen, 2000] Bangsbo, J. and Peitersen, B. (2000). Soccer systems and strategies. Human Kinetics.


[Cohen et al., 1997] Cohen, P. R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L., and Clow, J. (1997). Quickset: Multimodal interaction for distributed applications. In Proceedings of the fifth ACM international conference on Multimedia, pages 31–40. ACM.

 



[Collomosse et al., 2009] Collomosse, J. P., McNeill, G., and Qian, Y. (2009). Storyboard sketches for content based video retrieval. In Computer Vision, 2009 IEEE 12th International Conference on, pages 245–252. IEEE.


[dos Santos and Gatti, 2014] dos Santos, C. and Gatti, M. (2014). Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69–78.


[Eitz et al., 2012] Eitz, M., Richter, R., Boubekeur, T., Hildebrand, K., and Alexa,

M. (2012). Sketch-based shape retrieval. ACM Trans. Graph., 31(4):31–1.


[Giangreco and Schuldt, 2016] Giangreco, I. and Schuldt, H. (2016). Adampro: Database support for big multimedia retrieval. Datenbank-Spektrum, 16(1):17–26.


[Google, 2017] Google (2017). Cloud Speech Recognition API.

[Greco, 2008] Greco, G. (2008). Eat the Whistle.

[Hammond and Davis, 2006a] Hammond, T. and Davis, R. (2006a). Ladder: A language to describe drawing, display, and editing in sketch recognition. In ACM SIGGRAPH 2006 Courses, page 27. ACM.


[Hammond and Davis, 2006b] Hammond, T. and Davis, R. (2006b). Tahuti: A geometrical sketch recognition system for UML class diagrams. In ACM SIGGRAPH 2006 Courses, page 25. ACM.


[Hse et al., 2004] Hse, H., Shilman, M., and Newton, A. R. (2004). Robust sketched symbol fragmentation using templates. In Proceedings of the 9th international conference on Intelligent user interfaces, pages 156–160. ACM.


[Joulin et al., 2016] Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2016).

Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

 



[Kazi et al., 2014] Kazi, R. H., Chevalier, F., Grossman, T., and Fitzmaurice, G. (2014). Kitty: sketching dynamic and interactive illustrations. In Proceedings of the 27th annual ACM symposium on User interface software and technology, pages 395–405. ACM.


[Mauceri et al., 2015] Mauceri, C., Suma, E. A., Finkelstein, S., and Souvenir, R. (2015). Evaluating visual query methods for articulated motion video search. International Journal of Human-Computer Studies, 77:10–22.


[Muzychenko, 2015] Muzychenko, E. (2015). Virtual Audio Cable.


[Ouyang and Davis, 2009] Ouyang, T. Y. and Davis, R. (2009). A visual approach to sketched symbol recognition. In IJCAI, volume 9, pages 1463–1468.


[Oviatt et al., 2000] Oviatt, S., Cohen, P., Wu, L., Duncan, L., Suhm, B., Bers, J., Holzman, T., Winograd, T., Landay, J., Larson, J., et al. (2000). Designing the user interface for multimodal speech and pen-based gesture applications: state-of-the-art systems and future research directions. Human-Computer Interaction, 15(4):263–322.


[Papineni et al., 1997] Papineni, K. A., Roukos, S., and Ward, T. R. (1997). Feature-based language understanding. In Fifth European Conference on Speech Communication and Technology.


[Schoeffmann et al., 2014] Schoeffmann, K., Ahlström, D., Bailer, W., Cobârzan, C., Hopfgartner, F., McGuinness, K., Gurrin, C., Frisson, C., Le, D.-D., Del Fabro, M., et al. (2014). The video browser showdown: a live evaluation of interactive video search tools. International Journal of Multimedia Information Retrieval, 3(2):113–127.


[Seddati et al., 2016] Seddati, O., Dupont, S., and Mahmoudi, S. (2016). Deepsketch 2: Deep convolutional neural networks for partial sketch recognition. In Content-

 



Based Multimedia Indexing (CBMI), 2016 14th International Workshop on, pages 1–6. IEEE.


[Skantze and Al Moubayed, 2012] Skantze, G. and Al Moubayed, S. (2012). Iristk: a statechart-based toolkit for multi-party face-to-face interaction. In Proceedings of the 14th ACM international conference on Multimodal interaction, pages 69–76. ACM.


[Skubic et al., 2007] Skubic, M., Anderson, D., Blisard, S., Perzanowski, D., and Schultz, A. (2007). Using a hand-drawn sketch to control a team of robots. Autonomous Robots, 22(4):399–410.


[Thorne et al., 2004] Thorne, M., Burke, D., and van de Panne, M. (2004). Motion doodles: an interface for sketching character motion. In ACM Transactions on Graphics (TOG), volume 23, pages 424–431. ACM.


[Tumen et al., 2010] Tumen, R. S., Acer, M. E., and Sezgin, T. M. (2010). Feature extraction and classifier combination for image-based sketch recognition. In Proceedings of the Seventh Sketch-Based Interfaces and Modeling Symposium, pages 63–70. Eurographics Association.


[Vergo, 1998] Vergo, J. (1998). A statistical approach to multimodal natural language interaction. In Proceedings of the AAAI98 Workshop on Representations for Multimodal Human-Computer Interaction, pages 81–85.


[Yesilbek et al., 2015] Yesilbek, K. T., Sen, C., Cakmak, S., and Sezgin, T. M. (2015). SVM-based sketch recognition: which hyperparameter interval to try? In Proceedings of the workshop on Sketch-Based Interfaces and Modeling, pages 117–121. Eurographics Association.


[Yu et al., 2013] Yu, H., Ho, C., Juan, Y., and Lin, C. (2013). Libshorttext: A library for short-text classification and analysis. Rapport interne, Department of Computer Science, National Taiwan University. Software available at http://www.csie.ntu.edu.tw/~cjlin/libshorttext.



 CHAPTER 1




INTRODUCTION






1.1 Motivation





Searching in a collection of documents is done using indexes of these documents. The simplest approach is to store information about every single document in a single index, over which search queries are run. Obviously, this basic approach is not scalable, as the index grows linearly with the number of documents. Distributing the documents across separate nodes allows searching in parallel. In this distributed approach, there is a broker node that sends search queries to the nodes holding the documents and their corresponding indexes, and then gathers the results from them. The broker sends the query to all nodes without making any selection; therefore, all documents are still processed for every query. This approach is called exhaustive search on a distributed architecture.

Allocating the documents to the different nodes can be done in various ways. For example, documents can be grouped by their topical similarity, so that related documents end up together. It is also possible to allocate them according to their source information, e.g., URL information for web-based search. If we distribute the documents to the nodes using such representative information, i.e., store topical clusters at each node, then at the broker node we can estimate which clusters may yield better results for the query (the resource selection stage [1]). To do that, the broker stores information about all the clusters in the so-called centralized sample index (CSI). This technique, called selective search [1], does not process all the documents, yet it can still be as effective as exhaustive search.

 

In search, the main purpose is to satisfy the user's expectation, which naturally involves a diversity of different needs. To achieve this, different aspects of the query should be taken into consideration before returning results. However, queries are not always clear enough, and because of their ambiguity it is difficult to understand the user's actual intention. Search engines use the similarities between the documents and the query for retrieval, but this approach may lead to result lists that include only similar documents. Therefore, an additional procedure is needed to diversify the results so that they cover different possible user intentions. On the other hand, there is a trade-off between diversity and relevance: covering more aspects than the user needs can introduce irrelevant documents.

For example, if we consider web search engines, many users type only a couple of words to find the information they are looking for. A user who wants to find something about the fruit apple may forget to add the word 'fruit' to the query and will most probably be misled towards 'Apple the Company'. The lack of more explanatory words like 'fruit' makes it difficult to match user expectations. For this kind of ambiguous query, search engines are expected to generate a result set that covers different aspects of the same query. Even though they should account for query ambiguity, the trade-off between diversity and relevance prevents them from adding documents to the result set merely because they improve the diversity of the set [2].

For selective search, there are several layers where diversification can be applied. For example, at the broker node, diversification can be applied just before the result set is returned, or each cluster may diversify its partial result set before sending it to the broker. In all of these cases, search result diversification can be done either in an explicit way, where search engines have external knowledge about queries, or implicitly (as will be discussed later). For both cases, the main purpose is to maximize coverage to resolve ambiguity and to minimize redundancy, while keeping relevance high.

 

1.2 Contribution of Thesis



In this thesis, we contribute to the literature in several ways, listed as follows.


We propose alternative ways of creating a centralized sample index (CSI) based on various document properties and evaluate their impact on resource selection and, subsequently, on overall diversification effectiveness.

We expand queries using word embeddings and apply diversification to obtain diverse expansion terms, as in [3]. As a novel contribution, we process such expanded queries on the CSI, i.e., during the resource selection stage, to obtain more diverse resources.

For selective search, we explore the performance of search result diversification at different layers (i.e., at the broker vs. in the clusters) and at different stages, namely, before resource selection and before/after result merging. To the best of our knowledge, while diversification performance has been investigated in a distributed setting for exhaustive search (i.e., with random assignment of documents to nodes), our work is the first to make such a detailed analysis in the context of selective search and to employ representative strategies for both implicit and explicit search result diversification.



1.3 Organization of Thesis



In Chapter 2, we review the studies in the literature on result diversification for selective search. This chapter gives background information about the algorithms and approaches we adapted for our experiments, i.e., well-known implicit and explicit diversification algorithms, resource selection, query expansion and document allocation techniques. In Chapter 3, we explain the techniques used to create different CSIs and how we expand the queries using word embeddings. In Chapter 4, we present the different diversification approaches for selective search on which we build our experiments. Then, in Chapters 5 and 6 respectively, detailed information about the experiments is presented. Finally, in Chapter 7, we conclude our work with a summary and possible future work.

 


 

CHAPTER 2




RELATED WORK






In this chapter, we first present selective search and the document allocation and CSI generation steps that we adapted for our work. We also explain different studies on the resource selection phase of selective search. Then, we present various diversification approaches studied in the literature so far. Finally, we discuss word embeddings and their usage for expanding queries.




2.1 Selective Search



Selective search is an approach to searching in very large-scale environments. Documents are partitioned into clusters, and at the center of selective search there is a broker node that manages the process and sends the query to the clusters. In Figure 2.1, the broker is responsible for sending the query to the selected clusters, which are 1 and 3. To decide which clusters to send the query to, it first runs the query on the centralized sample index. After sending the query to the selected clusters, the broker collects and merges the results. This technique, previously known as cluster-based retrieval [4], aims to reduce resource usage while preserving the accuracy of the results.

The studies of Kulkarni and Callan [1] show that selective search can be used instead of exhaustive search, since the overall effectiveness of the resulting sets is not very different. The main advantage and purpose of selective search is reducing the overall time spent by the search engine. In the next sections, we will respectively look at studies on clustering the documents, choosing a centralized index for them, and finally selection techniques over these centralized sample indexes.

 

Figure 2.1: Selective search over partitioned clusters using CSI.










2.1.1 Document Allocation Policies






For selective search, each physical shard consists of a collection of documents. The main issue here is to determine the relation between the documents in the same shard. These allocations can be done using any kind of document information. Currently, three different types of document allocation policies have been studied: random, source-based and topic-based allocation [5]. Random allocation chooses documents randomly, while source-based allocation mainly depends on the URL addresses of web documents.

In this thesis, we applied topically partitioned shards using the Kullback-Leibler divergence method introduced by Xu and Croft [6]. As shown in Algorithm 1 and in Figure 2.2, after the documents in the collection are sampled, these sampled documents are used to iteratively generate the cluster centroids that are used to assign documents.

 











   






Figure 2.2: Topical document allocation simulation.




Algorithm 1 Topical partitioning using K-means and Kullback-Leibler divergence

Input: Document collection C, Number of Clusters K, Number of K-means iterations N, Sampling Percentage P
Output: Topical clusters CT
1: SAMPLEDOCS ← RandomSampleDocs(C, K, P)
2: CT ← InitializeKMeans(SAMPLEDOCS, K)
3: CENTROIDS ← CalculateKLCentroids(CT) // Kullback-Leibler divergence
4: for N times do
5:     for each sampled document d ∈ SAMPLEDOCS do
6:         for k ∈ {1, . . . , K} do
7:             compute Distance(CENTROIDS_k, d)
8:         end for
9:         Assign d to CT_k where CT_k is the closest cluster
10:    end for
11:    CENTROIDS ← CalculateKLCentroids(CT) // recompute centroids after reassignment
12: end for
return CT
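The assignment step of Algorithm 1 can be sketched in Python as follows; documents and centroids are represented as term-frequency and term-probability dictionaries, and the simple Jelinek-Mercer-style smoothing used here is an assumption of the sketch rather than the exact estimation used in the thesis.

import math

def kl_divergence(doc_tf, centroid_prob, vocab_size, lam=0.9):
    """KL(document || centroid) over the document's terms, with simple smoothing."""
    doc_len = sum(doc_tf.values())
    divergence = 0.0
    for term, tf in doc_tf.items():
        p_doc = tf / doc_len
        p_cen = lam * centroid_prob.get(term, 0.0) + (1 - lam) / vocab_size
        divergence += p_doc * math.log(p_doc / p_cen)
    return divergence

def assign_documents(documents, centroids, vocab_size):
    """Return, for every document, the index of the closest topical centroid."""
    assignments = []
    for doc_tf in documents:
        distances = [kl_divergence(doc_tf, c, vocab_size) for c in centroids]
        assignments.append(distances.index(min(distances)))
    return assignments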

 

2.1.2 Centralized Sample Index (CSI)


After the documents are allocated to different clusters, each cluster should be represented at the broker in an index. Instead of including information from all documents in this index, it is possible to maintain selection quality at the broker by including only a subset of them. Experiments in the literature show that this subset, namely the centralized sample index, can return effective results and represent the real clusters well enough [1].

Aly et al. [7] introduce a randomly generated centralized sample index approach in which each cluster is represented equally at the broker, because every cluster sends the same percentage of randomly chosen documents. We adapted this approach to create our CSI.



2.1.3 Resource Selection


Selective search is based on searching multiple clusters at the same time. To achieve that, a subset of clusters is queried and the selected clusters' results are merged into a single list [8].

The main issue here is how to choose the right clusters. In the literature, resource selection is mainly divided into two categories, the big document approach and the small document approach. The former is the first generation of resource selection techniques; resources are treated as bags of words formed by concatenating their documents or their sampled documents. Based on this, the resource selection task is reduced to a document retrieval task [9].

The other resource selection model is the small document model, which scores resources according to the occurrences of their sampled documents in the result set. In this thesis, we tried well-known resource selection algorithms that apply the small document model, namely Redde [10], ReddeTop [11], [12], CRCSLin, CRCSExp [13] and GAVG [14].


The Redde resource selection method was proposed after noticing that big document approaches do not do well in environments that contain both small and large resources. Instead of using meta documents, this approach uses information about the collection size and the result list. Running the query over the sampled documents creates a ranked list, and from this list a score for a resource is calculated by counting the documents from that resource among the top documents. This count reflects the relevance of the resource to the query, but it is not the final score. To finalize the score, as shown in Equation 2.1, Redde scales it by the ratio of the size of the resource to the size of its sampled documents.


 



Score(C_i, q) = n \times \frac{|C_i|}{|S_i|}    (2.1)

where

C_i is the cluster to score
q is the query
n is the number of documents from C_i in the top results
S_i is the set of sampled documents from C_i
|C_i| is the size of C_i, i.e. the number of documents in C_i
|S_i| is the size of S_i, i.e. the number of sampled documents from C_i


In ReddeTop, which is proposed by Arguello et al. [12], the scores of the documents are used instead of their ranks, as shown in Equation 2.2:


 




Score(C_i, q) = \sum_{d \in S_i,\, d \in R} P(d, q) \times \frac{|C_i|}{|S_i|}    (2.2)

where

R is the top N results obtained from the CSI
P(d, q) is the score of document d for query q


In addition to Redde and ReddeTop, CRCSLin and CRCSExp [13] are other methods that take the ranks of the documents in the result list into consideration. Their equations are shown in Equation 2.3 and Equation 2.4. The main difference between CRCSLin and CRCSExp is that the latter considers ranks exponentially, whereas the former uses a linear function of the rank.


 


Score(C_i, q) = \sum_{d \in S_i,\, d \in R} (n - r) \times \frac{|C_i|}{|S_i|}    (2.3)

Score(C_i, q) = \sum_{d \in S_i,\, d \in R} \alpha\, e^{-\beta r} \times \frac{|C_i|}{|S_i|}    (2.4)

where

r is the rank of document d in the result list R
n is the number of top-ranked documents considered
α and β are coefficients


GAVG is another resource selection algorithm that uses document scores to determine the best clusters [14]. As shown in Equation 2.5, it calculates the geometric average of the document scores.


 

Score(C_i, q) = \left( \prod_{d \in \mathrm{top}\ m\ \mathrm{of}\ S_i,\, d \in R} P(d, q) \right)^{1/m}    (2.5)
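The following compact Python sketch illustrates the small-document scoring functions above (Redde, ReddeTop, CRCSLin, CRCSExp and GAVG). Each CSI result is assumed to be a dictionary carrying the id of its source cluster and a retrieval score; cluster_size and sample_size stand for |C_i| and |S_i|, and the default parameter values are illustrative only.

import math

def redde(csi_results, cluster, cluster_size, sample_size, top_n=200):
    hits = sum(1 for d in csi_results[:top_n] if d["cluster"] == cluster)
    return hits * cluster_size / sample_size

def redde_top(csi_results, cluster, cluster_size, sample_size, top_n=200):
    return sum(d["score"] for d in csi_results[:top_n]
               if d["cluster"] == cluster) * cluster_size / sample_size

def crcs_lin(csi_results, cluster, cluster_size, sample_size, top_n=200):
    return sum(top_n - rank for rank, d in enumerate(csi_results[:top_n])
               if d["cluster"] == cluster) * cluster_size / sample_size

def crcs_exp(csi_results, cluster, cluster_size, sample_size,
             alpha=1.2, beta=0.28, top_n=200):
    return sum(alpha * math.exp(-beta * rank)
               for rank, d in enumerate(csi_results[:top_n])
               if d["cluster"] == cluster) * cluster_size / sample_size

def gavg(csi_results, cluster, m=10):
    scores = [d["score"] for d in csi_results if d["cluster"] == cluster][:m]
    if not scores:
        return 0.0
    # Geometric mean of the top-m sampled document scores (simplified: if fewer
    # than m documents are present, the mean is taken over the available ones).
    return math.exp(sum(math.log(s) for s in scores) / len(scores))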

 


2.2 Diversification for Selective Search



For selective search, processed queries yield a result set, a list of documents sorted with respect to their relevance to the query. Search result diversification is the process of reordering this result set in a way that handles the ambiguity of the query; this is an NP-hard problem [15].

Search result diversification methods are mainly divided into two categories, implicit and explicit techniques [16]. Implicit diversification methods do not need external knowledge and mainly depend on the relevance results of the documents; they re-rank the result lists to cover other aspects of the query.

Many implicit diversification techniques try to maximize the total score of the set by taking both relevance and dissimilarity scores into account. For example, the MaxSum dispersion algorithm [17], shown in Algorithm 2, tries to maximize the objective function in Equation 2.6. One drawback of this method is that document vectors must be available wherever the implicit algorithm is applied, since they are needed to compute the SIM and DIV functions.



f (di, dj) = (1 − λ)(SIM (q, di) + SIM (q, dj)) + 2λDIV (di, dj) (2.6)

where


di, dj documents to compare

q query

λ trade-off variable between similarity and dissimilarity

SIM similarity function

DIV dissimilarity function


Algorithm 2 MaxSum Dispersion Algorithm

Input: Document set S, result set size k
Output: Re-ranked list R, |R| = k
1: R ← InitializeEmptyResultList()
2: for i ∈ {1, . . . , k/2} do
3:     FIND (di, dj) = argmax_{di,dj ∈ S} f(di, dj)
4:     SET R ← R ∪ {di, dj}
5:     SET S ← S \ {di, dj}
6: end for
7: if k is odd then
8:     select an arbitrary document di ∈ S
9:     SET R ← R ∪ {di}
10: end if
return R
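A brief Python sketch of MaxSum dispersion is given below; sim and div are caller-supplied similarity and dissimilarity functions (e.g. cosine similarity and its complement), and the greedy pair selection mirrors Algorithm 2.

from itertools import combinations

def max_sum_dispersion(query, docs, k, sim, div, lam=0.5):
    """Greedily pick k documents maximizing f(di, dj) pair by pair."""
    remaining = list(docs)
    selected = []

    def f(di, dj):
        return (1 - lam) * (sim(query, di) + sim(query, dj)) + 2 * lam * div(di, dj)

    for _ in range(k // 2):
        if len(remaining) < 2:
            break
        di, dj = max(combinations(remaining, 2), key=lambda pair: f(*pair))
        selected.extend([di, dj])
        remaining.remove(di)
        remaining.remove(dj)
    if k % 2 == 1 and remaining:
        selected.append(remaining[0])  # arbitrary document for odd k
    return selected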



SY is another implicit diversification algorithm, which aims to find near-duplicate documents in the result set [18]. In a sorted result set, it computes document-to-document similarity scores and removes the documents whose score exceeds a threshold, as shown in Algorithm 3.


Algorithm 3 SY Algorithm

Input: Document set S, result set size k, threshold λ
Output: Re-ranked list R, |R| = k
1: R ← InitializeEmptyResultList()
2: i ← 0
3: while i < k and i < |S| do
4:     j ← i + 1
5:     while j < |S| do
6:         if SIM(S[i], S[j]) > λ then
7:             REMOVE S[j] from S
8:         else
9:             j ← j + 1
10:        end if
11:    end while
12:    SET R ← R ∪ S[i]
13:    i ← i + 1
14: end while
return R
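The SY filtering step can be sketched as follows; ranked is the relevance-ordered result list and sim is a document-to-document similarity function, both assumptions of the sketch.

def sy_diversify(ranked, k, sim, threshold):
    docs = list(ranked)
    result = []
    i = 0
    while i < k and i < len(docs):
        # Drop every later document that is too similar to the current one.
        docs = docs[:i + 1] + [d for d in docs[i + 1:]
                               if sim(docs[i], d) <= threshold]
        result.append(docs[i])
        i += 1
    return result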



Unlike implicit algorithms, explicit search result diversification algorithms use external knowledge, such as the subtopics of existing queries, to cover different aspects. The query aspects are used to re-rank the candidate result list. xQuAD is one promising explicit diversification algorithm in the literature [19]. It tries to maximize its objective function, Equation 2.7, by combining the relevance scores between the original query and the documents with the relation between the subtopics of the query and the documents. The product term in the equation computes a novelty value for each aspect. The xQuAD algorithm is shown in Algorithm 4.


 

f_{xQuAD}(d_i) = (1 - \lambda)\, P(d_i|q) + \lambda \sum_{q_i} P(q_i|q)\, P(d_i|q_i) \prod_{d_j \in R} \big(1 - P(d_j|q_i)\big)    (2.7)

where


P (di|q) is relevance of di to query q


P (qi|q) is likelihood of the aspect qi for query q


P (di|qi) is relevance of di to the aspect qi



Algorithm 4 xQuAD Algorithm

Input: Document set S, result set size k, trade-off λ
Output: Re-ranked list R, |R| = k
1: R ← InitializeEmptyResultList()
2: i ← 0
3: while i < k and i < |S| do
4:     FIND d∗ ← argmax_{d ∈ S} f_xQuAD(d)
5:     S ← S − d∗
6:     R ← R ∪ d∗
7:     i ← i + 1
8: end while
return R
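A condensed Python sketch of the xQuAD selection loop is shown below. The relevance estimates P(d|q), P(a|q) and P(d|a) are assumed to be pre-computed and passed in as dictionaries (e.g. from normalized BM25 scores, as described in Chapter 5).

def xquad(candidates, k, rel, aspect_prob, aspect_rel, lam=0.5):
    """Greedy xQuAD re-ranking of the candidate documents."""
    selected, remaining = [], list(candidates)
    while len(selected) < k and remaining:
        def objective(d):
            diversity = 0.0
            for a, p_a in aspect_prob.items():
                # Novelty: how poorly aspect a is covered by already selected docs.
                novelty = 1.0
                for d_sel in selected:
                    novelty *= (1.0 - aspect_rel[a].get(d_sel, 0.0))
                diversity += p_a * aspect_rel[a].get(d, 0.0) * novelty
            return (1 - lam) * rel[d] + lam * diversity
        best = max(remaining, key=objective)
        remaining.remove(best)
        selected.append(best)
    return selected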



In another study, Hong and Si [20] propose two alternative approaches for search result diversification in selective search. The first diversifies the document ranking while selecting clusters: it applies a diversification algorithm to the result set obtained by processing the query over the centralized sample index (CSI). This algorithm is named the Diversification approach based on sample Documents (DDiv).

In the same study, the Diversification approach based on Source-level estimation (DivS) is the other proposed technique. It treats each cluster like a single big document and computes its probability of relevance to the query. Its main advantage is that DivS naturally supports a wider range of resource selection algorithms. On the other hand, it requires additional computational power to compare the scores of the query aspects for each cluster.

 

Hong and Si [20] conclude that both DDiv and DivS can outperform traditional methods. They also showed that resource-level diversification over the CSI results can outperform source-level diversification techniques. In this thesis, we adapt their work and apply the DDiv algorithm to compare different applications of search result diversification.



2.3 MMRE: Diversified Query Expansion using Word Embeddings



In this thesis, we also adapted an approach from Bouchoucha et al. [3] to expand the query words. They use the Maximal Marginal Relevance (MMR) implicit diversification algorithm [21] for selecting expansion terms. Their algorithm, called MMRE (MMR-based Expansion) and shown in Algorithm 5, greedily selects the terms that maximize the MMRE function in Equation 2.8.


f_{MMRE}(w_i) = \lambda\, P(w_i|q) - (1 - \lambda) \max_{w' \in R} P(w_i|w')    (2.8)

where

P(w_i|q) is the relevance of w_i to the query q
P(w_i|w') is the relevance of w_i to an already selected expansion term w'
R is the set of already selected expansion terms


Algorithm 5 MMRE Algorithm

Input: Dictionary D, result set size k
Output: Expanded words R, |R| = k
1: R ← InitializeEmptyResultList()
2: i ← 0
3: while i < k do
4:     FIND w∗ ← argmax_{w ∈ D} f_MMRE(w)
5:     D ← D − w∗
6:     R ← R ∪ w∗
7:     i ← i + 1
8: end while
return R
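On top of word embeddings, MMRE term selection can be sketched as follows; cosine similarity stands in for the P(·|·) estimates, and the candidate dictionary is assumed to map terms to their embedding vectors.

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mmre_expand(query_vec, candidates, k, lam=0.5):
    """candidates: dict mapping a candidate term to its embedding vector."""
    selected = []
    pool = dict(candidates)
    while len(selected) < k and pool:
        def f(term):
            relevance = cosine(query_vec, pool[term])
            # Redundancy with respect to already selected expansion terms.
            redundancy = max((cosine(pool[term], candidates[t]) for t in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=f)
        selected.append(best)
        del pool[best]
    return selected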

 

During our query expansion experiments, we used word embeddings [22], [23]. In word embedding models, words are represented as multi-dimensional vectors, which facilitates similarity computation among words [24]; similar words have similar vectors. We used the GloVe dictionary to represent the word embeddings of the words in the query sets, as GloVe has been shown to outperform earlier word representation models [25].

 


 

CHAPTER 3




ANALYZING CSI PERFORMANCE WITH QUERY EXPANSIONS






In this chapter, we present the different CSI creation techniques that we used during our experiments. First we explain how we created the query-biased access count based CSI and the pagerank-based CSI. We used these CSIs to compare their impact on overall diversification performance by running different query sets over them. In addition to the MMRE query expansion algorithm, we also applied a query expansion algorithm based only on word similarities, which is also explained in this chapter.



3.1 Query Biased - Access Count Based CSI



As explained in Chapter 2, we adapted the random-based CSI in our work. However, this approach depends entirely on a random function and generates a different document set every time. To create a more robust and more representative CSI, we applied a query-biased technique and generated an access count based CSI; this technique is shown in Algorithm 6.

For this technique, we first needed a query set to collect result sets. We used the AOL query set, which includes approximately 100,000 queries. For each query, we counted how many times each document appeared in the top 10 of the query results.

After we retrieved the query-biased access counts of every document in the whole collection, we created two CSIs. In the first, as shown in Algorithm 6, each cluster is represented equally by applying the sampling rate at the cluster level. In the second, as shown in Figure 3.1, we collected the best 1% of the documents in the global access count list. In this work, we used the second alternative. We removed spam documents from the CSI.

 















Figure 3.1: Query-biased access count based CSI simulation.





Spam documents are identified using the spam list from Cormack [26]'s studies, which are based on the same dataset.



Algorithm 6 Query-biased Access Count based CSI

Input: Query Set Q, Document Collection D, Number of Clusters K, Sampling Rate S
Output: Document Set DS
1: for each query q ∈ Q do
2:     Result_q ← ProcessQuery(D, q)
3: end for
4: ACCESSCOUNTS ← number of appearances of each document in the top 10 of all Result_q
5: for k ∈ {1, . . . , K} do
6:     SamplingCount_k ← S × number of documents in Shard_k
7:     DS ← DS ∪ (top SamplingCount_k documents of Shard_k by access count)
8: end for
return DS
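A simplified Python sketch of this construction is given below; it follows the per-shard variant of Algorithm 6, and the search function, the shard layout and the spam set are assumptions of the sketch.

from collections import Counter

def build_access_count_csi(queries, search, shards, sampling_rate=0.01, spam=frozenset()):
    """Build a query-biased CSI from per-shard access counts.

    search: function running a query over the full collection, returning ranked doc ids.
    shards: dict mapping a shard id to the list of document ids it holds.
    """
    access_counts = Counter()
    for q in queries:
        for doc_id in search(q)[:10]:        # count top-10 appearances
            access_counts[doc_id] += 1

    csi_docs = []
    for shard_docs in shards.values():
        quota = int(sampling_rate * len(shard_docs))
        ranked = sorted((d for d in shard_docs if d not in spam),
                        key=lambda d: access_counts[d], reverse=True)
        csi_docs.extend(ranked[:quota])
    return csi_docs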

 

3.2 Pagerank Based CSI



There is also another way to detect popular documents in the collection: using the pagerank information of the documents. The top 1% of all documents, ranked by their Pagerank scores, are collected from the collection, with the highest-scoring documents selected first. We applied spam pruning as we did for the access-based CSI.



3.3 Query Expansion with Word Embeddings



As we mentioned in Chapter 2, we adapted the MMRE algorithm and applied it using word embeddings to generate a new query set. In addition to this diversified query set, we also generated another set based only on the similarities between the query and the dictionary words.

For each query, we first computed the average of its word vectors, so that each query is represented as a single vector. Then, for each word in the dictionary, we computed similarity scores using the word embeddings. Each query is expanded with the 5 extra words that are closest to the query vector.

Here, together with the CSI techniques, we wanted to investigate the effect of the expanded query sets on the overall diversification metrics. It is important to compare them together, since we process these expanded query sets over the CSI for selective search. We aimed to reach more precise clusters by processing the expansions over the CSI, since the expansions handle possible vocabulary gaps.
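The similarity-based expansion can be sketched as follows; glove is assumed to be a dictionary mapping words to their 100-dimensional GloVe vectors, and 5 expansion terms are added per query as described above.

import numpy as np

def expand_query(query_terms, glove, n_expansions=5):
    """Expand a query with the dictionary words closest to its average vector."""
    vectors = [glove[t] for t in query_terms if t in glove]
    if not vectors:
        return list(query_terms)
    query_vec = np.mean(vectors, axis=0)

    def cosine(v):
        return float(np.dot(query_vec, v) /
                     (np.linalg.norm(query_vec) * np.linalg.norm(v) + 1e-12))

    scored = ((w, cosine(v)) for w, v in glove.items() if w not in query_terms)
    best = sorted(scored, key=lambda x: x[1], reverse=True)[:n_expansions]
    return list(query_terms) + [w for w, _ in best]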

 


 

CHAPTER 4




SEARCH RESULT DIVERSIFICATION FOR SELECTIVE SEARCH






In this chapter, we explain different diversification approaches that can be applied at different layers of selective search. In addition to the traditional approach, we adapted the resource-level diversification technique from [20]. Moreover, we also applied in-cluster diversification. In the first section, we discuss selective search along with search result diversification. Then, we introduce three different diversification approaches that reveal the effectiveness of diversification at different stages of selective search.



4.1 Search Result Diversification for Selective Search



Traditionally, textual search systems divide collections into subsets, and the query is redirected to these collections. Even though this approach gains from parallel query processing, it is still an exhaustive method: the query is processed against all documents. This computational cost can instead be reduced by searching only the relevant shards. This approach, namely selective search, as introduced by Kulkarni and Callan [1], partitions the collection into different clusters. Each cluster is represented by sampled documents in the centralized sample index (CSI). The query is managed by the broker, which processes the query over the CSI and selects the best clusters. As shown in Figure 2.1, after the query is processed over the CSI, only the selected clusters are targeted.

The primary aim of search result diversification is to serve the most nearly optimal results with respect to diversity and relevance metrics. To achieve this, recent diversification algorithms mainly re-rank the query result lists. Many prior works determine

 

Figure 4.1: Search result diversification at different stages and layers.



the best document using relevance and diversity estimates together in a linear equation [17]. Existing search result diversification algorithms are split into two main categories, implicit and explicit; in this work, we applied both.

These search result diversification algorithms can be applied to different query result sets. Naini et al. [27] studied the performance of diversification for exhaustive search by applying diversification at the broker and at the nodes. For selective search, as we show in Figure 4.1, there exist several candidate sets for diversification. The broker produces a result set by processing the query over the CSI, and it also merges the result sets coming from the chosen clusters. Therefore, at the broker layer, both CSI result diversification and merged result diversification are possible.

Furthermore, queries are processed inside the clusters, so these results can also be diversified before they are sent back to the broker. This is shown in Figure 4.1 as InCluster.



4.2 Diversification at Broker Node



First, we applied the straightforward approach to search result diversification for selective search. This approach diversifies the final result set at the broker node, just after collecting the results from the selected clusters.

As shown in Figure 4.2, the query is first processed over the CSI.

 

 

Figure 4.2: Diversification of merged results at Broker node.



The resource selection algorithm chooses the best clusters using this result set. Then the broker forwards the query to the selected clusters. Each cluster runs the query over its document collection and returns its result set back to the broker. The broker merges all these results from the different clusters. Then, diversification is applied over this merged result set at the broker node.



4.3 DDiv: Diversification Based on CSI



As we discussed in Chapter 2, Hong and Si [20] proposed a different diversification approach for selective search. In their work, they showed that DDiv can outperform traditional methods as well as their other proposed method, DivS. Therefore, we adapted DDiv, which applies diversification before the relevant clusters are selected.

As shown in Figure 4.3, the query is processed over the CSI. Then, diversification is applied directly over this result set, and the clusters are chosen from this diversified set by the resource selection algorithm. This approach diversifies at the resource level to find diverse clusters and, subsequently, diverse final rankings as well.

 

Figure 4.3: DDiv: Diversification based on CSI results.



4.4 Diversification inside the Selected Clusters



We applied another result-level diversification approach in our experiments. Unlike the first, straightforward approach, this method diversifies the result sets inside each cluster.

As shown in Figure 4.4, after the query is forwarded to the relevant clusters by the broker, each cluster diversifies its own result set for this query and returns the diversified set to the broker. This method can explore more distant documents, since each cluster is represented at the broker node by its diversified set.

 















 

Figure 4.4: Search result diversification inside clusters.

 


 

CHAPTER 5




EXPERIMENTAL SETUP






In this chapter, we explain our entire experimental setup. First, we describe the data set we used. Then, respectively, the clustering technique, the different CSIs, query expansion, query processing, resource selection and the diversification approaches are explained. Finally, in the last section, we describe the evaluation metrics.



5.1 Data Set



During the experiments, we used Clueweb B as the document collection1. There are a total of 50,220,538 documents in this collection, and the total number of terms is 163,629,158.

We used the spam list generated by [26] to prune spam documents in our experiments. We also used the pagerank scores of the Clueweb B documents, which are available online2.

We used the Trec Web Track query sets for query processing [28, 29, 30, 31]. The Trec query sets from 2009, 2010, 2011 and 2012 were combined for the experiments; they are also available online3. These 4 sets contain a total of 198 queries. For our explicit diversification experiments, we also used the query aspects of these query sets.

Moreover, we used the AOL query set to create the query-biased CSI [32]. The AOL query set includes 100,000 queries. We used a different query set here because otherwise the CSI would be biased towards the same query set, which would make evaluation difficult, especially for baseline comparisons, since query processing over that CSI would already return

 

1 https://lemurproject.org/clueweb09.php

2 https://lemurproject.org/clueweb09/pageRank.php

3 https://trec.nist.gov/data/webmain.html

 

the most popular documents.



5.2 Document Allocation



To simulate the complete selective search framework, we created topic-based clusters from the Clueweb B document collection using the K-means algorithm as discussed in Chapter 2. We used a randomly selected subset of documents from the collection to apply K-means, and each cluster centroid was also initialized randomly from this sample set. Then, the rest of the sample set was distributed to the clusters. Here, we used the Kullback-Leibler divergence as described by Kulkarni and Callan [5] to compute the similarities between documents and cluster centroids. K-means was then applied multiple times.

In his work, Hafızoğlu [33] showed that, under the selective search environment, the best precision and diversity scores are achieved by splitting this dataset into 100 clusters. We used the very same clusters as his work.

As we mentioned in Chapter 3, we applied spam pruning to all three indexes.



5.3 Centralized Sample Indexes



After the documents were allocated according to their topics, we created 3 different centralized sample indexes. The first CSI was created by random sampling: each cluster is represented in the CSI with a 1% sampling rate. As a result of this sampling, the CSI includes 502,200 documents.

In addition to the random CSI, we also created a CSI using the AOL query set. This set is processed over all documents in the collection, and the best documents are selected from each cluster according to the number of times they appear in the top 10. Similarly, we applied the same strategy to create the Pagerank CSI, but this time using the pagerank information of the Clueweb B dataset.

For the search result diversification method comparisons, we used the random-based CSI. At the end of Chapter 6, we also show some additional comparisons between the access-based and random-based CSIs.

 





5.4 Query Expansions using Word Embeddings



As explained in Chapter 3, we used Global Vectors for Word Representation (GloVe)4 in our query expansion experiments. We used the 6B-token pre-trained vectors, whose dictionary includes 400,000 words; in this representation, each word is represented by a 100-dimensional vector.

We created two new query sets derived from the Trec query set. For both, we used cosine similarity to compute the relevancy between candidate words and the query average vector. For the diversified expansion set, we diversified the 50 words most relevant to the query average vector. For the MMRE function, we set lambda to 0.5.
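
An MMR-style selection over GloVe vectors, corresponding to the mmre setting above (candidate pool of 50 words, λ = 0.5), can be sketched as follows; the exact formulation is given in Chapter 3, and the vectors dictionary and n_expand parameter are illustrative assumptions.

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def mmre_expansion(query_terms, vectors, n_candidates=50, n_expand=10, lam=0.5):
    # vectors: dict mapping a word to its 100-dimensional GloVe vector.
    q_avg = np.mean([vectors[t] for t in query_terms if t in vectors], axis=0)
    # Candidate pool: the 50 words most similar to the query average vector.
    pool = sorted((w for w in vectors if w not in query_terms),
                  key=lambda w: cosine(vectors[w], q_avg), reverse=True)[:n_candidates]
    selected = []
    while pool and len(selected) < n_expand:
        def mmre_score(w):
            relevance = cosine(vectors[w], q_avg)
            redundancy = max((cosine(vectors[w], vectors[s]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=mmre_score)
        selected.append(best)
        pool.remove(best)
    return selected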




5.5 Query Processing



As the query processing algorithm, Best Matching 25 (BM25) is used [34]. The constants k1 and b are set to 1.2 and 0.5, respectively. We removed stop words from the texts during query processing.
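
With k1 = 1.2 and b = 0.5 as above, the BM25 score of a document for a query can be sketched as below. The exact IDF variant used in our implementation may differ slightly; this is only meant to make the scoring function concrete.

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.5):
    # doc_tf: term frequencies in the document; df: document frequencies in the collection.
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score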




5.6 Resource Selection



We used Redde as the resource selection algorithm, since prior research favors it [20]. Hafızoğlu [33] made a cluster coverage analysis on the same dataset, which shows that the top 10% of the clusters contain 99% of the relevant documents; therefore, we also picked the best 10% of the clusters. Each cluster returns its best 20 results, which is why our evaluations are mostly based on the top 20, 10 and 5 results. We applied the Redde algorithm over the top 200 documents of the result set.
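
A simplified sketch of this selection step is given below: each top-ranked CSI document votes for its cluster with a weight proportional to how many collection documents it represents. The cluster_of, cluster_size and sample_size inputs are assumed to be available; the sketch omits details of the original Redde formulation.

def redde_select(csi_results, cluster_of, cluster_size, sample_size, fraction=0.1):
    # csi_results: the top-ranked CSI documents for the query (we use the top 200).
    scores = {}
    for doc_id in csi_results:
        c = cluster_of(doc_id)
        weight = cluster_size[c] / sample_size[c]   # how many documents this sample represents
        scores[c] = scores.get(c, 0.0) + weight
    # Keep the best 10% of the clusters, i.e., 10 clusters out of 100 in our setup.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max(1, int(len(cluster_size) * fraction))]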


4 https://nlp.stanford.edu/projects/glove/

 

5.7 Diversification



We applied both implicit and explicit search result diversification algorithms in this study. As the implicit algorithm, SY is implemented; as the explicit algorithm, we implemented xQuAD, a very successful representative of this family. These algorithms are used as explained in Chapter 2. For xQuAD, the BM25 query processing scores are normalized by the sum of the scores in the result set. For both algorithms, we used the λ values 0.25, 0.5 and 0.75. Except for the adapted DDiv method, we always diversified the top 100 documents of the result set; for the DDiv method, since diversification is applied directly over the CSI, we diversified the top 200 documents of the CSI result set.
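
For concreteness, the xQuAD re-ranking step can be sketched as follows, with BM25 scores normalized to [0, 1] by the sum of scores in the result set and uniform aspect likelihoods over the Trec query aspects. This is a generic sketch of the algorithm, not a transcript of our implementation.

def xquad(results, aspect_results, lam=0.5, k=100):
    # results: {doc_id: normalized BM25 score for the initial query}
    # aspect_results: {aspect: {doc_id: normalized BM25 score for that aspect}}
    aspects = list(aspect_results)
    p_aspect = 1.0 / len(aspects) if aspects else 0.0
    uncovered = {a: 1.0 for a in aspects}   # probability that aspect a is not yet covered
    selected, candidates = [], dict(results)
    while candidates and len(selected) < k:
        def xquad_score(d):
            diversity = sum(p_aspect * aspect_results[a].get(d, 0.0) * uncovered[a]
                            for a in aspects)
            return (1 - lam) * candidates[d] + lam * diversity
        best = max(candidates, key=xquad_score)
        selected.append(best)
        for a in aspects:
            uncovered[a] *= (1.0 - aspect_results[a].get(best, 0.0))
        del candidates[best]
    return selected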



5.8 Evaluation Techniques



We used α-nDCG as the diversity performance metric for all experiments [35]. Using this metric, we report diversity scores for the top 5, 10 and 20 results. We used nDeval5, a tool from the Trec Web Track archives, which calculates diversity metrics, including α-nDCG, for a given set of query results.
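
α-nDCG rewards documents that cover subtopics not yet seen earlier in the ranking. A simplified sketch of the gain computation is given below with α = 0.5, as in the official tool; it omits the normalization by the ideal ranking that nDeval performs, and the subtopics_of helper is an illustrative assumption.

import math

def alpha_dcg(ranking, subtopics_of, alpha=0.5, depth=20):
    # subtopics_of(doc_id) returns the set of subtopics for which the document is relevant.
    seen = {}          # subtopic -> how many times it has already been covered
    dcg = 0.0
    for rank, doc_id in enumerate(ranking[:depth], start=1):
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics_of(doc_id))
        dcg += gain / math.log2(rank + 1)
        for s in subtopics_of(doc_id):
            seen[s] = seen.get(s, 0) + 1
    return dcg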

In addition to the diversity metrics, we also report precision values for the top 5, 10 and 20 results. These scores are computed using trec_eval, a tool that is also part of the Trec Web Track. For both evaluation metrics, Trec provides ground truths to compare against.


















5 http://www-personal.umich.edu/~kevynct/trec-web-2014/

 

CHAPTER 6




EXPERIMENTAL RESULTS






In this chapter, we present our experimental results. First, we give the results for the different CSIs together with query expansions. Next, we compare the different search result diversification methods.



6.1 CSI Results



We created the random, access and pagerank based CSIs and compared them under the selective search environment described in Chapter 2. In the tables, mmre denotes the diversified query expansions; the other expanded set, described in Chapter 3, is named exp.

According to the effectiveness values in Table 6.1, the access-based CSI beats the other CSI alternatives.

We also analyzed the CSI performances with query expansions. Tables 6.2 and 6.3 show that the access-based CSI cannot gain from query expansions. On the other hand, the random and pagerank CSIs can improve the overall diversity metrics for selective search.



Table 6.1: Effectiveness Results for CSI types.



csi        P@5     P@10    P@20    a-nDCG@5  a-nDCG@10  a-nDCG@20

access     0.3381  0.3244  0.2865  0.2811    0.3162     0.3489
random     0.3289  0.3162  0.2825  0.2758    0.3129     0.3438
pagerank   0.3188  0.3025  0.2617  0.2639    0.3013     0.3295

 

Table 6.2: Diversity effectiveness results for CSI types with query expansions.


csi        query expansion   a-nDCG@5  a-nDCG@10  a-nDCG@20

access     -                 0.2811    0.3162     0.3489
access     mmre              0.2791    0.3132     0.3439
access     exp               0.2796    0.3133     0.3404
pagerank   -                 0.2639    0.3013     0.3295
pagerank   mmre              0.2742    0.3117     0.3392
pagerank   exp               0.2782    0.3121     0.3388
random     -                 0.2758    0.3129     0.3438
random     mmre              0.2816    0.3171     0.3481
random     exp               0.2819    0.314571   0.3440




Table 6.3: Precision effectiveness results for CSI types with query expansions.


csi        query expansion   P@5     P@10    P@20

access     -                 0.3381  0.3244  0.2865
access     mmre              0.3239  0.3112  0.2764
access     exp               0.3289  0.3076  0.2685
pagerank   -                 0.3188  0.3025  0.2617
pagerank   mmre              0.3218  0.3091  0.2665
pagerank   exp               0.3259  0.3025  0.2579
random     -                 0.3289  0.3162  0.2825
random     mmre              0.331   0.3173  0.2812
random     exp               0.3401  0.3193  0.2789






In conclusion, we found that the access-based CSI can outperform the other CSI techniques in both diversity and precision metrics. However, we also observed that it is possible to improve the overall diversity quality by expanding the query words using word embeddings. The random-based CSI with mmre query expansions outperforms all other alternatives for the top 10 and top 5 results, and it is only barely beaten by the access-based CSI for the top 20 results. Therefore, the random-based CSI is a better choice when query expansions will be run over the CSI.

 

Table 6.4: Diversity effectiveness results for Implicit Diversification.


method     lambda  a-nDCG@5  a-nDCG@10  a-nDCG@20

BDiv       0.25    0.1899    0.1892     0.1924
BDiv       0.5     0.2338    0.2525     0.2685
BDiv       0.75    0.2665    0.3048     0.3286
DDiv       0.25    0.2787    0.3145     0.3453
DDiv       0.5     0.2801    0.3126     0.3452
DDiv       0.75    0.2754    0.3097     0.3427
InCluster  0.25    0.2576    0.27463    0.2808
InCluster  0.5     0.2626    0.2908     0.3106
InCluster  0.75    0.2705    0.3107     0.3397






6.2 Implicit Diversification Results




We compared the methods described in Chapter 4 in the implicit setup. We use the following abbreviations to identify the methods in the tables: BDiv for diversification at the broker node; DDiv for Hong and Si [20]'s diversification approach based on sample documents; InCluster for diversification inside the selected clusters.

Tables 6.4 and 6.5 show that the DDiv method beats its competitors, as Hong and Si [20] claimed in their work. After verifying this, we compared the mmre method with them. As shown in Table 6.6, the adapted mmre method can outperform the best scores of all these methods.

Since mmre beat the other methods, we combined this resource-level diversification method with the BDiv method, aiming to improve on the mmre performance by applying an additional diversification to the final result set at the broker node. This method is represented as BDiv+mmre in the tables. The other resource-level diversification method, DDiv, could also improve its result set in the same way, so we applied the same strategy to this method as well; in the tables, its abbreviation is DDiv+BDiv. None of these additional methods could beat the mmre approach in the implicit setup.

 




Table 6.5: Precision effectiveness results for Implicit Diversification.


method     lambda  P@5     P@10    P@20

BDiv       0.25    0.1279  0.0741  0.0411
BDiv       0.5     0.2081  0.1645  0.1112
BDiv       0.75    0.2792  0.2401  0.1812
DDiv       0.25    0.3279  0.3122  0.2782
DDiv       0.5     0.3289  0.3137  0.2812
DDiv       0.75    0.3259  0.3132  0.2815
InCluster  0.25    0.2426  0.1665  0.0959
InCluster  0.5     0.2538  0.2091  0.1475
InCluster  0.75    0.2904  0.2553  0.2003












Table 6.6: Effectiveness results for Implicit Diversification with all methods’ best results.



method      P@5     P@10    P@20    a-nDCG@5  a-nDCG@10  a-nDCG@20


BDiv 0.2792 0.2401 0.1812 0.2665 0.3048 0.3286

DDiv 0.3279 0.3122 0.2782 0.2787 0.3145 0.3453

InCluster 0.2904 0.2553 0.2003 0.2705 0.3107 0.3397

mmre 0.331 0.3173 0.2812 0.2816 0.3171 0.3481

BDiv+mmre 0.2822 0.2371 0.1799 0.2745 0.3078 0.3296

DDiv+BDiv 0.2731 0.2365 0.1802 0.2653 0.3029 0.3283

 

Table 6.7: Diversity effectiveness results for Explicit Diversification.


method     lambda  a-nDCG@5  a-nDCG@10  a-nDCG@20

BDiv       0.25    0.3001    0.3368     0.3738
BDiv       0.5     0.3163    0.3514     0.3826
BDiv       0.75    0.3215    0.3522     0.3824
DDiv       0.25    0.2655    0.2998     0.3320
DDiv       0.5     0.2695    0.3047     0.3354
DDiv       0.75    0.2729    0.3071     0.3392
InCluster  0.25    0.2842    0.3197     0.3552
InCluster  0.5     0.2918    0.3268     0.3613
InCluster  0.75    0.2945    0.3308     0.3647


6.3 Explicit Diversification Results



In the explicit setup, we found that BDiv outperforms the other methods. Tables 6.7 and 6.8 show that the DDiv method actually falls behind the others with respect to both precision and diversity.

As in the previous section, we merged some methods together to improve their performance. In the explicit setup, the BDiv method clearly beats the others, therefore we used the mmre method together with BDiv to improve the cluster selection. The BDiv results could also be improved by combining it with DDiv, since DDiv can also improve the cluster selection; that is why we investigated their combined results, too. Table 6.9 shows that none of these methods could actually beat the BDiv method.

As an additional experiment, we employed the access-based CSI for the best methods found above. Table 6.10 shows their results in the implicit setup: the best-scoring method, DDiv with the random-based CSI, still outperforms its access-based counterpart. Similarly, for the explicit setup, BDiv with the random-based CSI beats its access-based alternative, as shown in Table 6.11.

 

Table 6.8: Precision effectiveness results for Explicit Diversification.


method     lambda  P@5     P@10    P@20

BDiv       0.25    0.3675  0.3523  0.3117
BDiv       0.5     0.3766  0.3533  0.3033
BDiv       0.75    0.3838  0.3533  0.2957
DDiv       0.25    0.3157  0.3076  0.2734
DDiv       0.5     0.3198  0.3102  0.2744
DDiv       0.75    0.3279  0.3112  0.2749
InCluster  0.25    0.3371  0.3223  0.2845
InCluster  0.5     0.3431  0.3218  0.2853
InCluster  0.75    0.3472  0.3269  0.2873





Table 6.9: Effectiveness results for Explicit Diversification with all methods’ best results.



method      P@5     P@10    P@20    a-nDCG@5  a-nDCG@10  a-nDCG@20


BDiv 0.3766 0.3533 0.3033 0.3163 0.3514 0.3826

DDiv 0.3279 0.3112 0.2749 0.2729 0.3071 0.3392

InCluster 0.3472 0.3269 0.2873 0.2945 0.3308 0.3647

mmre 0.331 0.3173 0.2812 0.2816 0.3171 0.3481

BDiv+mmre 0.3665 0.3472 0.2967 0.3154 0.3475 0.3789

DDiv+BDiv 0.3706 0.3447 0.2954 0.3088 0.3463 0.3786





Table 6.10: Effectiveness results for Access and Random based CSIs for Implicit Diversification.



method  csi     P@5     P@10    P@20    a-nDCG@5  a-nDCG@10  a-nDCG@20

BDiv    access  0.2832  0.233   0.1805  0.2705    0.3036     0.3287
BDiv    random  0.2792  0.2401  0.1812  0.2665    0.3048     0.3286
DDiv    access  0.331   0.3112  0.281   0.2773    0.3130     0.3447
DDiv    random  0.3279  0.3122  0.2782  0.2787    0.3145     0.3453

 


















Table 6.11: Effectiveness results for Access and Random based CSIs for Explicit Diversification.



method  csi     P@5     P@10    P@20    a-nDCG@5  a-nDCG@10  a-nDCG@20

BDiv    access  0.3645  0.3503  0.297   0.3166    0.3525     0.3799
BDiv    random  0.3766  0.3533  0.3033  0.3163    0.3514     0.3826
DDiv    access  0.335   0.3137  0.2789  0.2769    0.3104     0.3422
DDiv    random  0.3279  0.3112  0.2749  0.2729    0.3071     0.3392

 


 

CHAPTER 7




CONCLUSION AND FUTURE WORK






In this thesis, we provided an in-depth analysis of search result diversification in the context of selective search, and proposed extensions to improve diversification effectiveness. First, we showed that creating a CSI based on past document access statistics yields both more relevant and more diverse results than a CSI based on randomly sampled documents. However, by processing queries expanded with diverse terms during resource selection, it is also possible to obtain diversification performance comparable to that of the access-based CSI. This finding is important because it shows that even when such past statistics are not available, a random CSI together with diversified query expansion performs reasonably well.

Second, we investigated the diversification performance at different layers (i.e., at the broker vs. in the clusters) and at different stages, namely, before resource selection and before/after result merging. We found that when a representative implicit diversification method, namely SY, is employed, the best diversification performance is obtained by selecting diverse resources as suggested by Hong and Si [20]. While doing so, simply processing expanded queries with diverse terms over the CSI, as we proposed in this thesis, outperforms the previous approach of Hong and Si [20]. Interestingly, the findings differ for explicit diversification. By employing a representative explicit approach, xQuAD, we demonstrated that diversifying the merged results at the broker is superior to diversifying the partial results at the clusters or to diversification during resource selection. This is again a new finding that contradicts Hong and Si [20], and implies that when there are more clues for diversification, it is better to conduct it at a more fine-grained level, i.e., over the result list, rather than attempting to diversify resources as a whole.

 

There are various future research directions for our work. In particular, in our additional experiments we observed that by conducting a selective expansion of queries, i.e., expanding only a subset of them and/or setting different thresholds for expansion terms on a per-query basis, it is possible to further improve the diversification effectiveness. We plan to explore such selective expansion approaches as our future work. As a second research direction, we will investigate the diversification efficiency for selective search. For instance, it is possible to create summary vectors for the documents to conduct the diversification methods at the broker more efficiently. Such optimizations are also left for our future work.

 

REFERENCES




[1] Anagha Kulkarni and Jamie Callan. Selective search: Efficient and effective search of large textual collections. ACM Transactions on Information Systems (TOIS), 33(4):17, 2015.


[2] Rodrygo LT Santos, Craig Macdonald, Iadh Ounis, et al. Search result diversification. Foundations and Trends in Information Retrieval, 9(1):1–90, 2015.


[3] Arbi Bouchoucha, Jing He, and Jian-Yun Nie. Diversified query expansion using ConceptNet. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 1861–1864. ACM, 2013.


[4] Xiaoyong Liu and W Bruce Croft. Cluster-based retrieval using language models. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193. ACM, 2004.


[5] Anagha Kulkarni and Jamie Callan. Document allocation policies for selective searching of distributed indexes. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 449–458. ACM, 2010.


[6] Jinxi Xu and W Bruce Croft. Cluster-based language models for distributed retrieval. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 254–261. ACM, 1999.


[7] Robin Aly, Djoerd Hiemstra, and Thomas Demeester. Taily: shard selection using the tail of score distributions. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 673–682. ACM, 2013.

 

[8] Milad Shokouhi, Luo Si, et al. Federated search. Foundations and Trends in Information Retrieval, 5(1):1–102, 2011.

[9] James P Callan, Zhihong Lu, and W Bruce Croft. Searching distributed collections with inference networks. In ACM SIGIR Forum, volume 51, pages 160–167. ACM, 2017.

[10] Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 298–305. ACM, 2003.

[11] Jaime Arguello, Jamie Callan, and Fernando Diaz. Classification-based resource selection. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 1277–1286. ACM, 2009.

[12] Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-Francois Crespo. Sources of evidence for vertical selection. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 315–322. ACM, 2009.

[13] Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In European Conference on Information Retrieval, pages 160–172. Springer, 2007.

[14] Jangwon Seo and W Bruce Croft. Blog site search using resource selection. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 1053–1062. ACM, 2008.

[15] Dorit S Hochbaum. Approximation algorithms for NP-hard problems. PWS Publishing Co., 1996.

[16] Rodrygo LT Santos, Jie Peng, Craig Macdonald, and Iadh Ounis. Explicit search result diversification through sub-queries. In European conference on information retrieval, pages 87–99. Springer, 2010.

[17] Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. In Proceedings of the 18th international conference on World wide web, pages 381–390. ACM, 2009.

 

[18] Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben, and Ujwal Gadiraju. Groundhog day: near-duplicate detection on twitter. In Proceedings of the 22nd international conference on World Wide Web, pages 1273–1284. ACM, 2013.

[19] Rodrygo LT Santos, Craig Macdonald, and Iadh Ounis. Exploiting query reformulations for web search result diversification. In Proceedings of the 19th international conference on World wide web, pages 881–890. ACM, 2010.

[20] Dzung Hong and Luo Si. Search result diversification in resource selection for federated search. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 613–622. ACM, 2013.

[21] Jaime G Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, volume 98, pages 335–336, 1998.

[22] Saar Kuzi, Anna Shtok, and Oren Kurland. Query expansion using word embeddings. In Proceedings of the 25th ACM international on conference on information and knowledge management, pages 1929–1932. ACM, 2016.

[23] Kezban Dilek Onal, Ismail Sengor Altingovde, and Pinar Karagoz. Utilizing word embeddings for result diversification in tweet search. In AIRS, pages 366–378. Springer, 2015.


[24] Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308, 2014.

[25] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://www.aclweb.org/anthology/D14-1162.

[26] GV Cormack. Waterloo spam rankings for the ClueWeb09 dataset. Online, 2009.

 

[27] Kaweh Djafari Naini, Ismail Sengor Altingovde, and Wolf Siberski. Scalable and efficient web search result diversification. ACM Transactions on the Web (TWEB), 10(3):15, 2016.

[28] Charles L. A. Clarke, Nick Craswell, and Ian Soboroff. Overview of the TREC 2009 web track. In Proceedings of The Eighteenth Text REtrieval Conference, TREC 2009, Gaithersburg, Maryland, USA, 2009.

[29] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Gordon V. Cormack. Overview of the TREC 2010 web track. In Proceedings of The Nineteenth Text REtrieval Conference, TREC 2010, Gaithersburg, Maryland, USA, 2010.

[30] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Ellen M. Voorhees. Overview of the TREC 2011 web track. In Proceedings of The Twentieth Text REtrieval Conference, TREC 2011, Gaithersburg, Maryland, USA, 2011.

[31] Charles L. A. Clarke, Nick Craswell, and Ellen M. Voorhees. Overview of the TREC 2012 web track. In Proceedings of The Twenty-First Text REtrieval Conference, TREC 2012, Gaithersburg, Maryland, USA, 2012.

[32] Greg Pass, Abdur Chowdhury, and Cayley Torgeson. A picture of search. In InfoScale, volume 152, page 1, 2006.


[33] Fatih Hafızoğlu. Improving the efficiency of distributed information retrieval using hybrid index partitioning. Master's thesis, 2018.

[34] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.

 


[35] Charles LA Clarke, Maheedhar Kolla, Gordon V Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 659–666. ACM, 2008.