ABSTRACT

Context: Large-scale software projects adopt bug tracking systems such as Bugzilla and Jira to manage bug fixes and store their information. Mining bug repositories is essential for automating some maintenance activities, including the classification of new bug reports into severity levels. The severity level of a bug specifies its negative impact on the system and is a major factor in setting the priority with which the bug is scheduled for fixing. Several approaches for severity assignment have been proposed in the literature. The accuracy of any approach relies on two key factors: (i) the information retrieval model used to represent the bug reports and (ii) the classification technique used.

Objectives: This study proposes an approach that uses word embedding and deep learning models to automatically predict the severity class of newly submitted bug reports. Our motivation was twofold: (i) word embedding models can capture semantic relations between words and sentences, and (ii) deep learning models have been reported to achieve higher accuracy than traditional machine learning models in several text classification applications.

Method: Embedding models were trained to represent bug reports as fixed-length vectors. Embedding vectors are a robust representation because they capture semantic relations among words and sentences. The embedding vectors were used to train five deep learning architectures (CNN, LSTM, GRU, hybrid CNN-LSTM and hybrid CNN-GRU) to classify the bug reports into their severity classes. We experimented with five models because each has different capabilities: CNN can extract the key features of an input and reduce the size of the feature space, which reduces noise and enhances classification performance, while LSTM and GRU can build representations of sequences of words (sentences).
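
As a rough illustration of how a convolutional layer can turn a variable-length bug report into a fixed-size feature vector, the sketch below embeds each token, slides a filter over consecutive word windows, and keeps only the strongest response per filter (max-over-time pooling). This is not the authors' implementation: the embedding table, the filter weights, and the names `EMBEDDINGS`, `KERNELS`, `embed`, `conv_max_pool` and `extract_features` are all hypothetical and chosen only for readability.

```python
from typing import Dict, List

# Toy 3-dimensional embedding table. The values are made up for illustration;
# a trained embedding model would supply them in practice.
EMBEDDINGS: Dict[str, List[float]] = {
    "crash": [0.9, 0.1, 0.0],
    "on": [0.0, 0.1, 0.1],
    "startup": [0.2, 0.8, 0.1],
    "typo": [0.0, 0.1, 0.9],
}
UNKNOWN: List[float] = [0.0, 0.0, 0.0]  # fallback for out-of-vocabulary words

# Two hypothetical filters, each spanning a window of two words.
KERNELS: List[List[List[float]]] = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]


def embed(tokens: List[str]) -> List[List[float]]:
    """Map each token to its embedding vector (one row per token)."""
    return [EMBEDDINGS.get(token, UNKNOWN) for token in tokens]


def conv_max_pool(matrix: List[List[float]], kernel: List[List[float]]) -> float:
    """Slide one filter over consecutive word windows and keep the largest response."""
    width = len(kernel)
    responses = []
    for start in range(len(matrix) - width + 1):
        window = matrix[start:start + width]
        score = sum(
            weight * value
            for kernel_row, word_vector in zip(kernel, window)
            for weight, value in zip(kernel_row, word_vector)
        )
        responses.append(max(0.0, score))  # ReLU activation
    # Reports shorter than the filter width produce no windows at all.
    return max(responses) if responses else 0.0


def extract_features(tokens: List[str], kernels: List[List[List[float]]]) -> List[float]:
    """Return one pooled feature per filter, regardless of report length."""
    matrix = embed(tokens)
    return [conv_max_pool(matrix, kernel) for kernel in kernels]


features = extract_features(["crash", "on", "startup"], KERNELS)
```

Whatever the length of the report, `extract_features` returns one value per filter, which is the sense in which the convolution-and-pooling step reduces the feature space. A real model would learn the filter weights and feed the pooled features to a classification layer.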

Results: Experimental results on two large-scale open bug repositories, Eclipse and Mozilla, demonstrated that CNN is the superior architecture for identifying the bug severity level. Compared with four previous studies, CNN improved the F-measure of four severity levels in Eclipse and three severity levels in Mozilla. Moreover, using embedding models boosted the performance of traditional machine learning classifiers. Furthermore, experiments showed that some traditional classifiers, such as KNN and SVM, could achieve the macro-average performance of the CNN when they were trained on the embedding vectors instead of the bag-of-words model.
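
For readers unfamiliar with the metrics mentioned above, the following minimal sketch computes a per-class F-measure and its macro average, taken here as the unweighted mean of the per-class scores. That is one common definition and is assumed rather than confirmed by the abstract. The function names `per_class_f1` and `macro_f1` and the sample labels are hypothetical, not data from the study.

```python
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute the F-measure (harmonic mean of precision and recall) for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Average the per-class F-measures, giving every severity class equal weight."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up severity labels for four bug reports (illustration only).
actual = ["major", "major", "minor", "critical"]
predicted = ["major", "minor", "minor", "critical"]
macro = macro_f1(actual, predicted)
```

Because every class contributes equally to the macro average, a rare severity level influences the score as much as a frequent one, which matters when severity classes are imbalanced.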