JCCD – A Flexible Java Code Clone Detector API

Getting started

Introduction

JCCD is an API for building your own code clone detector. It has a pipeline architecture consisting of five phases: parsing, preprocessing, pooling, comparing and filtering. For further information on the architecture and the implementation of custom detectors see the corresponding chapters (coming soon).
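
To make the five phases more tangible, here is a minimal, self-contained sketch of such a pipeline in plain Java. It does not use JCCD and does not claim to mirror its implementation: the class "ToyPipeline" and all of its methods are hypothetical, the "parser" only splits text into lines instead of building a syntax tree, and comparing is reduced to exact equality of the normalized text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy pipeline; it is NOT part of JCCD and works on lines, not on an AST.
class ToyPipeline {

    // Phase 1, parsing: split the source text into candidate units (here: non-empty lines).
    static List<String> parse(String source) {
        List<String> units = new ArrayList<>();
        for (String line : source.split("\n")) {
            if (!line.trim().isEmpty()) {
                units.add(line);
            }
        }
        return units;
    }

    // Phase 2, preprocessing: normalize a unit (here: collapse all whitespace).
    static String preprocess(String unit) {
        return unit.trim().replaceAll("\\s+", " ");
    }

    // Phase 3, pooling: group units that share the same normalized form.
    static Map<String, List<String>> pool(List<String> units) {
        Map<String, List<String>> pools = new LinkedHashMap<>();
        for (String unit : units) {
            pools.computeIfAbsent(preprocess(unit), k -> new ArrayList<>()).add(unit);
        }
        return pools;
    }

    // Phases 4 and 5, comparing and filtering: a pool with at least two members
    // counts as a match; units shorter than minLength characters are dropped as trivial.
    static List<List<String>> compareAndFilter(Map<String, List<String>> pools, int minLength) {
        List<List<String>> groups = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : pools.entrySet()) {
            if (entry.getValue().size() >= 2 && entry.getKey().length() >= minLength) {
                groups.add(entry.getValue());
            }
        }
        return groups;
    }

    // Runs all phases in order.
    static List<List<String>> detect(String source, int minLength) {
        return compareAndFilter(pool(parse(source)), minLength);
    }

    public static void main(String[] args) {
        String source = "return n * factorial(n-1);\n}\nreturn  n *  factorial(n-1);\n}\n";
        System.out.println(detect(source, 5));
    }
}
```

In this sketch the two "return" lines end up in one group because they only differ in whitespace, while the closing braces are filtered out as too short.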

In this tutorial we use the ASTDetector, a ready-made implementation of the pipeline, with a basic configuration to show how a finished detector can be used. Step by step, we will extend its configuration to find more matches between the two files "TestFileOne" and "TestFileTwo".


01 package test;
02 
03 public class TestFileOne {
04 
05   public int factorial(int n){
06     if(n == 0){
07       return 1;
08     }else{
09       return n * factorial(n-1);
10     }
11   }
12   
13   public int gcdOne(int a, int b) {
14     while (b != 0) {
15       if (a > b) {
16         a = a - b;
17       } else {
18         b = b - a;
19       }
20     }
21     return a;
22   }
23   
24   public int mul(int a, int b){
25     int n = 0;
26     for(int i = 0; i < b; i++){
27       n += a;
28     }
29     return n;
30   }
31 }

01 package test;
02 
03 public class TestFileTwo {
04 
05   public int factorial(int n){
06     if(n == 0){
07       return 1;
08     }else{
09       return n * factorial(n-1);
10     }
11   }
12   
13   public int gcdTwo(int c, int d) {
14     while (d != 0) {
15       if (c > d) {
16         c = c - d;
17       } else {
18         d = d - c;
19       }
20     }
21     return c;
22   }  
23 
24   public double mul(double a, long b){
25     double n = 0.0;
26     for(long i = 0l; i < b; i++)
27       n += a;
28     
29     return n;
30   }
31 }

First code clone detection

Our first step is to match the "factorial"-method in lines 5 to 11. This method is identical in both files.

05   public int factorial(int n){
06     if(n == 0){
07       return 1;
08     }else{
09       return n * factorial(n-1);
10     }
11   }

05   public int factorial(int n){
06     if(n == 0){
07       return 1;
08     }else{
09       return n * factorial(n-1);
10     }
11   }

To use JCCD we first create an "ASTDetector". The next step is to configure it. For this example we just need to define which files to work on. We create a "JCCDFile"-array with our files and feed it into our detector. Then we start the detector by calling the "process"-method. To see some output we pass the result of the "process"-method to the "printSimilarityGroups"-method.

    // Create the detector and tell it which files to compare.
    APipeline detector = new ASTDetector();
    JCCDFile[] files = {
        new JCCDFile("doku/TestFileOne.java"),
        new JCCDFile("doku/TestFileTwo.java")
        };
    detector.setSourceFiles(files);
    // Run the pipeline and print the similarity groups that were found.
    APipeline.printSimilarityGroups(detector.process());

The output should be:

Similarity Group 9
================================================================
test/TestFileTwo.java(5.1−11.1)
test/TestFileOne.java(5.1−11.1)
================================================================

The similarity group is a unique group of similarities in which our match was found. The matches themselves are described as package/filename(start_linenumber.start_column - end_linenumber.end_column).
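
If you want to process this output in your own tools, the location format can be taken apart with a regular expression. The following is only a sketch in plain Java: the class "MatchLocation" is a hypothetical helper and not part of JCCD, and it assumes the format exactly as printed above. It accepts both the ASCII hyphen and the Unicode minus sign as separator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JCCD: parses one location line of the printed output.
class MatchLocation {
    final String path;
    final int startLine;
    final int startColumn;
    final int endLine;
    final int endColumn;

    // path(startLine.startColumn - endLine.endColumn); the separator may be
    // an ASCII hyphen or the Unicode minus sign (U+2212).
    private static final Pattern FORMAT =
        Pattern.compile("(.+)\\((\\d+)\\.(\\d+)\\s*[-\\u2212]\\s*(\\d+)\\.(\\d+)\\)");

    MatchLocation(String path, int startLine, int startColumn, int endLine, int endColumn) {
        this.path = path;
        this.startLine = startLine;
        this.startColumn = startColumn;
        this.endLine = endLine;
        this.endColumn = endColumn;
    }

    // Throws IllegalArgumentException if the line does not follow the assumed format.
    static MatchLocation parse(String line) {
        Matcher m = FORMAT.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected format: " + line);
        }
        return new MatchLocation(
            m.group(1),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(4)),
            Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        MatchLocation loc = parse("test/TestFileOne.java(5.1-11.1)");
        System.out.println(loc.path + " lines " + loc.startLine + " to " + loc.endLine);
    }
}
```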

Match clones with different method and variable names

Now we will match the "gcdOne"- and "gcdTwo"-method in lines 13-22 in addition to our previous match. These two methods are not identical: they have different method and variable names.

13   public int gcdOne(int a, int b) {
14     while (b != 0) {
15       if (a > b) {
16         a = a - b;
17       } else {
18         b = b - a;
19       }
20     }
21     return a;
22   }

13   public int gcdTwo(int c, int d) {
14     while (d != 0) {
15       if (c > d) {
16         c = c - d;
17       } else {
18         d = d - c;
19       }
20     }
21     return c;
22   }

To match them, we must generalize the method and variable names. For this we use operators (you can find a list of available operators here). The operators "GeneralizeMethodDeclarationNames" and "GeneralizeVariableNames" provide the necessary functionality and are predefined in JCCD. We add them to our detector with the "addOperator"-method.

    APipeline detector = new ASTDetector();
    JCCDFile[] files = {
        new JCCDFile("test/TestFileOne.java"),
        new JCCDFile("test/TestFileTwo.java")
        };
    detector.setSourceFiles(files);

    // Ignore differences in method and variable names.
    detector.addOperator(new GeneralizeMethodDeclarationNames());
    detector.addOperator(new GeneralizeVariableNames());
    APipeline.printSimilarityGroups(detector.process());

The output should be:

Similarity Group 21
================================================================
test/TestFileOne.java(5.1−11.1)
test/TestFileTwo.java(5.1−11.1)
================================================================

Similarity Group 15
================================================================
test/TestFileTwo.java(13.1−22.1)
test/TestFileOne.java(13.1−22.1)
================================================================

In addition to the previous match, the detector now also matches the "gcd"-methods. As we see, our matches now appear in different similarity groups than before. This is a result of internal operations.
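
The idea behind generalizing names can be illustrated without JCCD. The sketch below is a deliberately simplified, token-based stand-in: the class "NameGeneralizer" is hypothetical, it works on plain text with a regular expression rather than on a syntax tree, and it only knows the few keywords that occur in the "gcd"-methods. It replaces every other identifier with the placeholder "ID", after which both methods produce the same text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical token-based illustration, not part of JCCD.
class NameGeneralizer {
    // Only the keywords needed for this example; a real tool would know all of them.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "public", "int", "while", "if", "else", "return"));

    private static final Pattern IDENTIFIER =
        Pattern.compile("\\b[A-Za-z_][A-Za-z0-9_]*\\b");

    // Replaces every identifier that is not a listed keyword with the placeholder "ID".
    static String generalize(String code) {
        Matcher m = IDENTIFIER.matcher(code);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            String replacement = KEYWORDS.contains(word) ? word : "ID";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        // Collapse whitespace so that layout differences do not matter.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String one = "public int gcdOne(int a, int b) { while (b != 0) { if (a > b) "
            + "{ a = a - b; } else { b = b - a; } } return a; }";
        String two = "public int gcdTwo(int c, int d) { while (d != 0) { if (c > d) "
            + "{ c = c - d; } else { d = d - c; } } return c; }";
        System.out.println(generalize(one).equals(generalize(two))); // prints "true"
    }
}
```

Note that mapping all names to a single placeholder is coarse: it also makes "a - b" and "b - a" indistinguishable. Whether JCCD's operators behave this way is not stated in this tutorial.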

Matching different number types and missing blocks

Next we want to match the "mul"-method in lines 24-30. In this case, the methods have different types and use different number literals. Additionally, the for-loop in the second file has no block.

24   public int mul(int a, int b){
25     int n = 0;
26     for(int i = 0; i < b; i++){
27       n += a;
28     }
29     return n;
30   }

24   public double mul(double a, long b){
25     double n = 0.0;
26     for(long i = 0l; i < b; i++)
27       n += a;
28     
29     return n;
30   }

In order to match these two methods, we must insert the block in the for-loop, generalize the types and unify the number literals. To do this we add the "CompleteToBlock"-, "GeneralizeMethodArgumentTypes"-, "GeneralizeMethodReturnTypes"-, "GeneralizeVariableDeclarationTypes"- and "NumberLiteralToDouble"-operator to our detector.

    APipeline detector = new ASTDetector();
    JCCDFile[] files = {
        new JCCDFile("test/TestFileOne.java"),
        new JCCDFile("test/TestFileTwo.java")
        };
    detector.setSourceFiles(files);
    detector.addOperator(new GeneralizeMethodDeclarationNames());
    detector.addOperator(new GeneralizeVariableNames());
    // Add the missing block, then ignore differences in types and number literals.
    detector.addOperator(new CompleteToBlock());
    detector.addOperator(new GeneralizeMethodArgumentTypes());
    detector.addOperator(new GeneralizeMethodReturnTypes());
    detector.addOperator(new GeneralizeVariableDeclarationTypes());
    detector.addOperator(new NumberLiteralToDouble());
    APipeline.printSimilarityGroups(detector.process());

The output should be:

Similarity Group 24
================================================================
test/TestFileTwo.java(3.25−31.0)
test/TestFileOne.java(3.25−31.0)
================================================================

Because we now match all methods, we get only one match: the bodies of both classes.
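
Unifying number literals can be sketched in the same spirit. The example below is a plain-Java illustration with a hypothetical class "LiteralNormalizer"; it is not JCCD's "NumberLiteralToDouble"-operator and makes no claim about how that operator is implemented. It assumes simple decimal literals with an optional type suffix and rewrites each of them as a double literal, so that "0", "0.0" and "0l" all become "0.0".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical text-based illustration, not part of JCCD.
class LiteralNormalizer {
    // Simple decimal literals such as 0, 0.0 or 0l; hex, octal and exponents are ignored.
    private static final Pattern NUMBER =
        Pattern.compile("\\b(\\d+(?:\\.\\d+)?)[lLfFdD]?\\b");

    // Rewrites every matched literal in the canonical form produced by Double.toString.
    static String normalize(String code) {
        Matcher m = NUMBER.matcher(code);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            double value = Double.parseDouble(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(Double.toString(value)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("int n = 0;"));      // prints "int n = 0.0;"
        System.out.println(normalize("double n = 0.0;")); // prints "double n = 0.0;"
        System.out.println(normalize("long i = 0l;"));    // prints "long i = 0.0;"
    }
}
```

After this step the three declarations only differ in their types, which is the part the type-generalizing operators take care of in the tutorial.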

Last but not least

The only thing remaining now is to get a match for the two files as a whole. They differ in the class and file name, so we need to accept different file names and generalize the class names. As you might have guessed, we need to add some more operators: the "GeneralizeClassDeclarationNames"- and "AcceptFileNames"-operator.

    APipeline detector = new ASTDetector();
    JCCDFile[] files = {
        new JCCDFile("test/TestFileOne.java"),
        new JCCDFile("test/TestFileTwo.java")
        };
    detector.setSourceFiles(files);
    detector.addOperator(new GeneralizeMethodDeclarationNames());
    detector.addOperator(new GeneralizeVariableNames());
    detector.addOperator(new CompleteToBlock());
    detector.addOperator(new GeneralizeMethodArgumentTypes());
    detector.addOperator(new GeneralizeMethodReturnTypes());
    detector.addOperator(new GeneralizeVariableDeclarationTypes());
    // Ignore differences in class names and file names as well.
    detector.addOperator(new GeneralizeClassDeclarationNames());
    detector.addOperator(new NumberLiteralToDouble());
    detector.addOperator(new AcceptFileNames());
    APipeline.printSimilarityGroups(detector.process());

The output should be:

Similarity Group 25
================================================================
test/TestFileTwo.java(1.0−31.0)
test/TestFileOne.java(1.0−31.0)
================================================================