A Discourse Search Engine Based on Rhetorical Structure Theory



Kuyten, Pascal, Bollegala, Danushka ORCID: 0000-0003-4476-7003, Hollerit, Bernd, Prendinger, Helmut and Aizawa, Kiyoharu
(2015) A Discourse Search Engine Based on Rhetorical Structure Theory. In: European Conference on Information Retrieval.

Text: ecir15.pdf - Author Accepted Manuscript (264kB)

Abstract

Representing a document as a bag of words and using keywords to retrieve relevant documents have seen great success in large-scale information retrieval systems such as Web search engines. The bag-of-words representation is computationally efficient and, with proper term weighting and document ranking methods, can perform surprisingly well for such a simple document representation. However, it ignores the rich discourse structure of a document, which could provide useful clues when determining the relevance of a document to a given user query. We develop the first-ever Discourse Search Engine (DSE), which exploits the discourse structure in documents to overcome the limitations of bag-of-words document representations in information retrieval. We use Rhetorical Structure Theory (RST) to represent a document as a discourse tree connecting numerous elementary discourse units (EDUs) via discourse relations. Given a query, our discourse search engine can retrieve not only documents relevant to the query, but also individual statements from those documents that describe some discourse relation to the query. We propose several ranking scores that consider the discourse structure in the documents to measure the relevance of a pair of EDUs to a query. Moreover, we combine those individual relevance scores using a random decision forest (RDF) model to create a single relevance score. Despite the numerous challenges of constructing a rich document representation using the discourse relations in a document, our experimental results show that it improves the F-score in an information retrieval task. We publicly release our manually annotated test collection to expedite future research in discourse-based information retrieval.
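
The retrieval idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the helper names, the example tree, and the relation labels are all hypothetical, and a single word-overlap score stands in for the several discourse-aware ranking scores and the random decision forest that the paper combines. The sketch only shows how a discourse tree over EDUs can yield relation-labelled EDU pairs that are then ranked against a query.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Toy discourse-tree node (hypothetical, for illustration only).

    A leaf holds the text of one EDU; an internal node holds the label of
    the discourse relation that connects its two children.
    """
    relation: Optional[str] = None
    text: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.text is not None


def edus(node: Node) -> List[str]:
    """Return the EDU texts under a node, in left-to-right order."""
    if node.is_leaf():
        return [node.text]
    return edus(node.left) + edus(node.right)


def related_pairs(node: Node) -> Iterator[Tuple[str, str, str]]:
    """Yield (relation, left EDU, right EDU) for every internal node."""
    if node.is_leaf():
        return
    for left_edu in edus(node.left):
        for right_edu in edus(node.right):
            yield (node.relation, left_edu, right_edu)
    yield from related_pairs(node.left)
    yield from related_pairs(node.right)


def overlap(query: str, text: str) -> float:
    """Fraction of query words that occur in the text (assumed stand-in score)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def search(tree: Node, query: str) -> List[Tuple[float, str, str, str]]:
    """Rank EDU pairs by the better-matching EDU of each pair; drop zero scores."""
    scored = [
        (max(overlap(query, l), overlap(query, r)), rel, l, r)
        for rel, l, r in related_pairs(tree)
    ]
    hits = [s for s in scored if s[0] > 0]
    return sorted(hits, key=lambda s: s[0], reverse=True)


# Hypothetical three-EDU document with made-up relation labels.
tree = Node(
    relation="cause",
    left=Node(text="heavy rain fell overnight"),
    right=Node(
        relation="contrast",
        left=Node(text="the river flooded"),
        right=Node(text="the bridge stayed open"),
    ),
)

results = search(tree, "river flooded")
for score, relation, left_edu, right_edu in results:
    print(f"{score:.2f} {relation}: {left_edu} | {right_edu}")
```

Each result pairs the EDU that matches the query with another EDU and the relation linking them, which is the kind of statement-level answer the abstract describes. A fuller system would replace `overlap` with several relevance scores and learn how to combine them.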

Item Type: Conference or Workshop Item (Paper)
Subjects: QA75
Depositing User: Symplectic Admin
Date Deposited: 09 Feb 2015 11:01
Last Modified: 16 Dec 2022 04:43
DOI: 10.1007/978-3-319-16354-3_10
URI: https://livrepository.liverpool.ac.uk/id/eprint/2006345