The motivation of the article is to avoid intermediate representations of the source code, represent the code directly as an image, and extract semantic information from it to improve the performance of defect prediction.
First, consider the motivating example in Fig. 1. Although File1.java and File2.java both contain 1 if statement, 2 for statements, and 4 function calls, the semantics and structural characteristics of the two files differ. To verify whether converting source code into images can help distinguish such code, the author conducted an experiment: map each character to a pixel according to its decimal ASCII value, arrange the pixels into a matrix, and obtain an image of the source code. The author points out that the images of different source files are visibly different.
Fig. 1 Motivation Example
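The ASCII-to-pixel mapping in this experiment can be sketched as follows (a minimal illustration; the fixed row width and the zero padding of the last row are my assumptions, not details from the paper):

```python
import numpy as np

def code_to_image(source: str, width: int = 32) -> np.ndarray:
    """Map each character to its ASCII value (0-255) and arrange the
    values row by row into an 8-bit grayscale pixel matrix."""
    vals = [ord(c) % 256 for c in source]
    rows = -(-len(vals) // width)  # ceiling division
    # Pad the last row with zeros so the matrix is rectangular.
    padded = vals + [0] * (rows * width - len(vals))
    return np.array(padded, dtype=np.uint8).reshape(rows, width)

# A 24-character snippet becomes a 3x8 grayscale image.
img = code_to_image("if (x > 0) { return x; }", width=8)
```

Because every pixel is just a character code, two files with identical statement counts but different text still produce different images, which is the property the experiment relies on.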
The main contributions of the article are as follows:
Convert the code into an image and extract semantic and structural information from it;
Propose an end-to-end framework that combines a self-attention mechanism with transfer learning for defect prediction.
The model framework proposed in the article is shown in Figure 2, which is divided into two stages: source code visualization and deep transfer learning modeling.
Fig. 2 Framework
The article converts each source file into 6 images; the process is shown in Figure 3. The decimal ASCII codes of the source code characters are converted into 8-bit unsigned integer vectors, which are arranged by rows and columns into an image matrix; the 8-bit integers map directly to gray levels. To address the small size of the original dataset, the author proposes a dataset expansion method based on color enhancement: the values of the three color channels R, G, and B are permuted to generate 6 color images. This looks confusing at first: after changing the channel values, shouldn't the semantic and structural information change? The author explains this in a footnote, shown in Figure 4.
Fig. 3 Source code visualization process
Fig. 4 Article footnote 2
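The augmentation step can be sketched as follows: given three channel planes derived from the code image (however the paper actually derives them; the toy constant planes below are purely illustrative), the six augmented images are simply the six orderings of those planes.

```python
from itertools import permutations
import numpy as np

def six_channel_permutations(r, g, b):
    """Stack three channel planes in every channel order:
    (R,G,B), (R,B,G), (G,R,B), (G,B,R), (B,R,G), (B,G,R)."""
    return [np.stack(p, axis=-1) for p in permutations((r, g, b))]

# Toy planes standing in for channel values derived from code pixels.
r = np.full((4, 4), 10, dtype=np.uint8)
g = np.full((4, 4), 20, dtype=np.uint8)
b = np.full((4, 4), 30, dtype=np.uint8)
images = six_channel_permutations(r, g, b)  # 6 distinct color images
```

Note that permuting channels only relabels the same pixel values, which is consistent with the footnote's claim that the underlying information is preserved.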
The article uses a Deep Adaptation Network (DAN) to capture the semantic and structural information of the source code. To strengthen the model's ability to express important information, the author adds an Attention layer to the original DAN structure. The training and testing process is shown in Figure 5: conv1–conv5 come from AlexNet, and four fully connected layers fc6–fc9 serve as the classifier. The author notes that, for a new project, training a deep learning model requires a large amount of labeled data, which is hard to obtain. Therefore, the model is first pre-trained on ImageNet 2012, and the pre-trained parameters are used to initialize and fine-tune all convolutional layers, reducing the gap between code images and the natural images in ImageNet 2012.
Fig. 5 Training and testing process
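The role of the added Attention layer can be illustrated with a toy channel-attention operation on a conv feature map (the paper's exact attention formulation is not spelled out in this summary, so this is an illustrative stand-in, not the author's layer):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_attention(features: np.ndarray) -> np.ndarray:
    """Score each feature map by its global average activation,
    softmax the scores, and rescale the maps so that channels
    carrying important information are emphasized."""
    # features: (channels, height, width), e.g. a conv5 output.
    scores = features.mean(axis=(1, 2))       # one score per channel
    weights = softmax(scores)                 # sum to 1 across channels
    return features * weights[:, None, None]  # reweight each map

feats = np.random.default_rng(0).standard_normal((8, 6, 6))
out = channel_attention(feats)  # same shape, channels reweighted
```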
Code images are generated for both the labeled code in the Source project and the unlabeled code in the Target project, and fed into the model simultaneously; the two share the convolutional layers and the Attention layer to extract their respective features. MK-MMD (Multi-Kernel Maximum Mean Discrepancy) between Source and Target is computed on the fully connected layers. Since the Target data has no labels, cross-entropy is computed only on the Source data. The model is trained on this loss function using mini-batch stochastic gradient descent.
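The discrepancy term can be sketched with a bank of Gaussian kernels (a biased MK-MMD estimate in NumPy; the bandwidth set and batch shapes are illustrative assumptions). The full training loss would then combine the Source cross-entropy with this term under some trade-off weight:

```python
import numpy as np

def mk_mmd(source: np.ndarray, target: np.ndarray,
           bandwidths=(0.5, 1.0, 2.0, 4.0)) -> float:
    """Biased MK-MMD estimate between two feature batches, averaging
    a bank of Gaussian (RBF) kernels, one per bandwidth."""
    def avg_kernel(a, b):
        # Pairwise squared distances between rows of a and b.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.mean([np.exp(-d2 / (2 * s ** 2)).mean()
                        for s in bandwidths])
    return (avg_kernel(source, source) + avg_kernel(target, target)
            - 2 * avg_kernel(source, target))

rng = np.random.default_rng(0)
src = rng.standard_normal((32, 16))       # "Source" fc features
tgt_same = rng.standard_normal((32, 16))  # similar distribution
tgt_far = rng.standard_normal((32, 16)) + 3.0  # shifted distribution
```

Minimizing this quantity pulls the Source and Target feature distributions together on the fully connected layers, which is what lets a classifier trained only on Source labels transfer to the Target project.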
In the experimental part, the author selected all open-source Java projects in the PROMISE data repository and collected their version numbers, class names, and bug labels. The source code was downloaded from GitHub based on version number and class name. In total, data from 10 Java projects were collected. The dataset structure is shown in Figure 6.
Fig. 6 Dataset structure
For intra-project defect prediction, the article selects the following baseline models for comparison:
For cross-project defect prediction, the article selects the following baseline models for comparison:
To summarize: although the paper was written two years ago, the idea is still relatively novel, avoiding a series of intermediate code representations such as ASTs and directly converting the code into images for feature extraction. But I am still confused: does the image converted from the code really contain the semantic and structural information of the source code? It doesn't feel very explainable, haha. We need to do some experimental analysis later.
The above is the detailed content of Using software visualization and transfer learning in software defect prediction.