ABSTRACT

The enormous amount of data published and disseminated over networks is intensifying the demand for novel and effective natural language processing (NLP) techniques and applications for managing, analyzing, summarizing, and extracting knowledge from the vast body of available text. Graph theory is a well-established discipline that offers useful solutions for NLP applications such as text summarization and machine translation, and for related information retrieval tasks. Graph methods are extensively employed in a number of text processing applications. By using algorithms that consider the global graph structure of a document, rather than the characteristics of unstructured sets of objects, graph-based techniques improve a wide range of NLP tasks.

Automatic summarization facilitates management of the output from information retrieval systems. Random walk summarization, which uses the notion of lexical centrality, can be applied to both single and multiple documents, since a graph can be constructed from information inferred from one or several documents at various levels of detail. Random walk summarization can be extended with the theory of biased random walks to address question-centered passage retrieval, wherein user questions are posed in natural language and passages that answer those questions are retrieved from the input document.
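As an illustration of the random walk idea described above, the following is a minimal sketch and not the method of any particular system. It assumes whitespace tokenization, cosine similarity over raw word counts as the edge weight, and a PageRank-style damping factor; the names rank_sentences, cosine, damping, and bias are hypothetical. Passing a bias vector (for example, each sentence's similarity to a user question) turns the uniform walk into a biased one.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_sentences(sentences, damping=0.85, iterations=50, bias=None):
    """Score sentences by the stationary distribution of a random walk
    over the sentence-similarity graph (assumed setup, see lead-in)."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Weighted adjacency matrix: similarity between distinct sentences.
    weights = [[cosine(bags[i], bags[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out = [sum(row) for row in weights]
    # Teleport distribution: uniform, or proportional to the bias vector.
    total = sum(bias) if bias is not None else 0.0
    if total:
        teleport = [b / total for b in bias]
    else:
        teleport = [1.0 / n] * n
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if out[j]:
                    incoming += scores[j] * weights[j][i] / out[j]
                else:
                    # A sentence with no similar neighbours jumps
                    # according to the teleport distribution.
                    incoming += scores[j] * teleport[i]
            new_scores.append((1 - damping) * teleport[i] + damping * incoming)
        scores = new_scores
    return scores

sentences = [
    "graph methods rank sentences",
    "random walks rank sentences on a graph",
    "the weather is sunny today",
]
scores = rank_sentences(sentences)

# Biased walk: favour sentences similar to a (hypothetical) user question.
question = Counter("weather today".split())
bias = [cosine(question, Counter(s.lower().split())) for s in sentences]
biased_scores = rank_sentences(sentences, bias=bias)
```

In the unbiased run the two mutually similar sentences receive the highest scores, while the biased run shifts weight toward the sentence that overlaps with the question.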

Keyword extraction with random walk algorithms can outperform advanced supervised methods; the vertices of the graph correspond to sequences of words that are representative of the input document. Topic identification based on centrality algorithms addresses the automated detection of categories or topics appropriate to the input document, with the potential to create a dynamic ranking of the topics within a framework. Once relevant categories or topics have been identified, topic segmentation divides the text into segments by constructing a graph with the text sentences as nodes and weighted edges reflecting pairwise sentence similarity, which can be measured with cosine similarity or a comparable metric. Normalized cut techniques are graph methods that segment the text by jointly measuring similarity within each partition and dissimilarity across partitions. Results improve further with graph-based frameworks that capture long-range lexical relationships among sentences and integrate within- and across-category similarities in a single model.
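The normalized cut criterion mentioned above can be sketched for the simplest case, splitting a text into two contiguous segments. This is a minimal illustration under stated assumptions, not a published segmentation algorithm: it uses whitespace tokenization, cosine similarity over raw word counts, and an exhaustive search over boundaries; the names best_boundary and cosine are hypothetical. For a split into A and B, the score is cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V), where cut sums the edge weights crossing the boundary and assoc sums the weights from a segment to all sentences.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_boundary(sentences):
    """Return (k, ncut) for the split sentences[:k] / sentences[k:]
    with the lowest normalized cut, or None if no valid split exists."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    # Fully connected sentence graph with cosine-weighted edges.
    w = [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    best = None
    for k in range(1, n):
        left, right = range(k), range(k, n)
        # Total similarity crossing the boundary (dissimilarity term).
        cut = sum(w[i][j] for i in left for j in right)
        # Association of each segment with the whole graph.
        assoc_left = sum(w[i][j] for i in left for j in range(n))
        assoc_right = sum(w[i][j] for i in right for j in range(n))
        if not assoc_left or not assoc_right:
            continue  # skip degenerate splits (e.g. empty sentences)
        ncut = cut / assoc_left + cut / assoc_right
        if best is None or ncut < best[1]:
            best = (k, ncut)
    return best

sentences = [
    "cats chase mice",
    "mice fear cats",
    "stocks fell sharply",
    "investors sold stocks",
]
boundary, score = best_boundary(sentences)
```

Here the lowest-scoring boundary falls between the second and third sentences, because no words are shared across that point. Practical systems typically handle more than two segments and use more efficient optimization than exhaustive search.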

Discourse relationships that interconnect text segments can be represented by graphs, which permit crossed dependencies among statements and thus encode finer sets of dependencies than tree data structures. Acyclic word graphs efficiently reduce the search space of machine translation by eliminating redundancies among the candidate translations. Accurate cross-lingual retrieval of information from multilingual documents is accomplished through graph-based representations in which co-occurrence graphs are traversed with random walks. The graph-based question answering technique provides reasonable accuracy along with easier portability to new domains, without the need for separate question classification steps and named entity recognizers. Graphs also play a significant role in capturing crucial terms in the input text.