Topical Structure in Long Informal Documents
It has long been acknowledged that discourse is more than a sequence of sentences but, for the purposes of many Natural Language Processing tasks, it is often modelled exactly in that way.
In this dissertation, we propose a practical approach to modelling discourse structure, with an emphasis on it being computationally feasible and easily applicable. Instead of following one of the many linguistic theories of discourse structure, we attempt to model the structure of a document as a tree of topical segments. Each segment encapsulates a span that concentrates on a particular topic at a certain level of granularity. Each span can be further sub-segmented based on finer fluctuations of topic. The lowest (most refined) level of segmentation is individual paragraphs.
In our model, each topical segment is described by a segment centre - a sentence or a paragraph that best captures the contents of the segment. In this manner, the segmenter effectively builds an extractive hierarchical outline of the document. In order to achieve these goals, we use the framework of factor graphs and modify a recent clustering algorithm, Affinity Propagation, to perform hierarchical segmentation instead of clustering.
While far from being a solved problem, topical text segmentation is not uncharted territory. The methods developed so far, however, perform least well where they are most needed: on documents that lack rigid formal structure, such as speech transcripts, personal correspondence or literature. The model described in this dissertation is geared towards dealing with just such types of documents.
In order to study how people create similar models of literary data, we built two corpora of topical segmentations, one flat and one hierarchical. Each document in these corpora is annotated for topical structure by 3-6 people.
The corpora, the model for hierarchical segmentation and a software are main contributions of this work.