CPD false positives

I’m using Community Edition 8.6.1 and Developer Edition 8.6; both give me the same false positives in duplicate code detection:

I’ve lots of lines that look like

$this->input(77, 27, $data->get('cl.tischler-fenster-tuer.fenster-fenstertueren.gang-und-schliessbar-richten-wohn-kueche'));

None of these lines are equal; they differ both in the numbers (77, 27) and in the texts, although the texts may be partly the same (e.g. …cl.tischler-fenster-tuer.fenster-fenstertueren.…).

I’ve tried the same code with PMD-CPD (V6.31.0) which finds duplications in the strings (100-token wide parts of strings match other 100-token wide parts of strings in other lines like in the example above), but not a single duplication spanning more than 2 lines.

SonarQube detects large chunks of duplicate code - e.g. starting in CarpenterV8.php line 10: “Duplicated By …/TilerV8.php Lines: [10 – 87]” whereas in TilerV8.php line 10 it states: “Duplicated By …/CarpenterV8.php Lines: [10 – 241]”
Those lines start with the same code:

{
public function fill(Collection $data): Fpdf
{

but after that they are different -

//Wohn-/Küche
$this->input(77, 27, $data->get('cl.tischler-fenster-tuer.fenster-fenstertueren.gang-und-schliessbar-richten-wohn-kueche'));
$this->input(77, 30.8, $data->get('cl.tischler-fenster-tuer.fenster-fenstertueren.flugeldichtung-erneuern-wohn-kueche'));

vs.

//ALLGEMEINE ARBEITEN
$this->input(95.5, 24, $data->get('cl.fliesenleger.allgemeine-arbeiten.schutz-von-bauteilen-flaeche'));
$this->input(139.5, 24, $data->get('cl.fliesenleger.allgemeine-arbeiten.fussbodenschutz-flaeche'));

None of these lines is reported by PMD-CPD as a duplicate.

I would expect duplications to be found as stated in the documentation (100 successive and duplicated tokens spanning at least 10 lines of code).

Hello @sebastian.dietrich ,

Thank you for your input. We will investigate your problem and get back to you as soon as possible.

Best,
Nils

FYI: That bug still exists in the latest SonarQube version (8.7.0.41497). Anyway, I suspect the bug is in the sonar-scanner CPD Executor (I’m using sonar-scanner-cli-4.6.0.2311-windows and, as far as I know, sonar-scanner-jenkins). Somehow the scanner cannot parse PHP files correctly (I do not experience such problems with Java or C# in other projects).

SonarSource’s Copy/Paste Detection replaces numbers and strings with placeholders to increase the speed of comparing code sequences. In your particular case, this means that lines such as

$this->input(77, 27, $data->get('cl.tischler-fenster-tuer...'));

are translated to

$this, ->, input, (, $NUMBER, ,, $NUMBER, ,, $data, ->, get, (, $CHARS, ), ), ;

They will then be compared and considered as identical sequences. This behavior is intended by us. You may adjust the sensitivity of the CPD in the configuration Analysis Parameters->Duplications

Best,
Nils
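
The normalization described in the post above can be sketched as follows. This is only an illustration, not SonarSource's implementation: the regex tokenizer is a simplification made up for this example, and treating `->` as its own token is an assumption. Only the placeholder names `$NUMBER` and `$CHARS` come from the post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralNormalizer {
    // Alternatives are tried left to right: single-quoted string, number,
    // variable or identifier, the -> operator, then any other single
    // non-whitespace character. (Simplified; not a real PHP lexer.)
    private static final Pattern TOKEN = Pattern.compile(
        "'[^']*'|\\d+(?:\\.\\d+)?|\\$?[A-Za-z_]\\w*|->|\\S");

    // Splits a line into tokens and replaces every literal with a placeholder.
    static List<String> normalize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String t = m.group();
            if (t.startsWith("'")) {
                tokens.add("$CHARS");   // any string literal
            } else if (Character.isDigit(t.charAt(0))) {
                tokens.add("$NUMBER");  // any numeric literal
            } else {
                tokens.add(t);          // everything else is kept as-is
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String carpenter = "$this->input(77, 27, $data->get("
            + "'cl.tischler-fenster-tuer.fenster-fenstertueren.gang-und-schliessbar-richten-wohn-kueche'));";
        String tiler = "$this->input(95.5, 24, $data->get("
            + "'cl.fliesenleger.allgemeine-arbeiten.schutz-von-bauteilen-flaeche'));";

        System.out.println(normalize(carpenter));
        // Different literals, identical normalized token sequence:
        System.out.println(normalize(carpenter).equals(normalize(tiler))); // prints "true"
    }
}
```

With literals replaced this way, every `$this->input(...)` line in the two files yields the same 16 tokens, so a long run of such lines can reach the token and line thresholds even though no two source lines are textually equal.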

Hm - I understand. So to increase performance you no longer detect duplicate code but just similar code. One can adjust the sensitivity (number of tokens, number of lines), but there is no way to adjust this behavior.

Can you please elaborate on why this behavior is intended (besides being faster)?

By the way, in my case the CPD Executor took 896 ms for 1166 files, while on the same codebase another duplicate-code detector (Simian) took 2170 ms for 2461 files (with a 10-line threshold). Per file, Simian (which finds true duplicate code) is therefore only about 15% slower.
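
The 15% figure follows from normalizing both timings per file. A minimal sketch of that arithmetic, using only the numbers quoted above (the class and method names are made up for this example):

```java
class ScanTimeComparison {
    // Average time per analyzed file, in milliseconds.
    static double msPerFile(double totalMs, int files) {
        return totalMs / files;
    }

    public static void main(String[] args) {
        double cpd = msPerFile(896, 1166);     // CPD Executor figures from the post
        double simian = msPerFile(2170, 2461); // Simian figures from the post

        System.out.println("CPD ms/file:    " + cpd);
        System.out.println("Simian ms/file: " + simian);
        System.out.println("Simian / CPD:   " + simian / cpd);
    }
}
```

This gives roughly 0.77 ms per file for CPD and 0.88 ms per file for Simian, a ratio of about 1.15. Since the two tools looked at different numbers of files, this is only a rough comparison.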

Sorry. It’s not about performance.
And, in fact, it’s not a bug, it’s a feature.
I believe that most developers would like to be notified if 2 big blocks of code are so similar that the only differences are in the literals they use.
That’s why our PHP analyzer considers that, for the detection of duplications, all literals are equivalent.
I think that’s the same for most of our analyzers.

I agree that this approach gives a surprising result on your code which seems to contain a lot of hardcoded data.
If this is limited to a subset of the analyzed files, I suggest that you exclude those files from the duplication detection. There’s a documentation page about that.
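
As a sketch of such an exclusion, the `sonar.cpd.exclusions` analysis parameter takes a comma-separated list of file path patterns. The pattern below is only an assumption based on the file names mentioned in this thread (CarpenterV8.php, TilerV8.php) and would need to be adapted to the actual project layout:

```properties
# sonar-project.properties
# Hypothetical pattern: exclude the form-filling classes from duplication detection only.
# These files are still analyzed for all other rules.
sonar.cpd.exclusions=**/*V8.php
```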

I’ve been using SonarQube for ages, mostly on Java code, and have never seen that behavior on Java code. But if it helps, I might try to simulate it in Java.

I wonder whether similar (but not duplicate) code is in fact bad programming practice. I know of no smell, design rule or principle that speaks against similar code. I can’t think of an example where similar (but not duplicate) code would be improved by any refactoring.

Moreover, I doubt that developers want to be notified of similar code as “duplicate code”. If they need a check for similar code, they would like it to be marked as “similar code”. No other tool I know of (PMD-CPD, Simian) marks such code as duplicate. In SonarQube, duplicate code cannot be marked as “won’t fix”, and CPD exclusions only work on whole files.

In my code I have tons of such duplicates. E.g. the following types of code are as well marked as duplicates - none of them are duplications in the sense that a design- or code-change would improve the quality:

  • array definitions spanning more than 10 lines → get marked as duplicates since the array is filled with literals
  • queries spanning more than 10 lines → if they are structurally equal to another query
  • usage of builder patterns → if the builder builds an object with >= 10 attributes, then several usages of the same builder will result in duplicate code
  • usage of functional programming → if the lambdas span >= 10 lines then a similar expression results in duplicate code

Sorry if my wording is not correct, I’m not a PHP programmer. But I assume my point is understandable: There are a lot of good and valid programming practices that lead to similar code which is fine and even preferable.
So I assume that in most cases similar code (that is not duplicate code) is probably good code and should therefore not be marked as duplicate code.

Sorry for the late reply.

Thank you for sharing your views.

I can understand that long definitions of arrays using literals can trigger the detection of duplications which you may consider irrelevant. At the same time, that’s probably some kind of hardcoded data and I suppose that it should only happen in a relatively small part of the analyzed project: you should be able to exclude those files from the duplication detection.

I’m more concerned about the 3 other duplication cases you mention (“queries”, “builder pattern” and “functional programming”). I don’t really understand them, and I would really appreciate seeing the code so that I can form an opinion on those “duplications”. Would you agree to share your code? It could be done privately.

I’ve written some example code on GitHub: SebastianDietrich/sonarqube.cpd.false.positives (examples of false positives on duplicate code in SonarQube). It compiles but will not run (there is no main() method); nevertheless, it shows the 3 duplication cases so you can understand them.

It consists of an Entity (Book) with more than 10 attributes (which is quite common for such domain objects). This entity is both annotated with @Entity (for DB-querying purposes using hibernate and JPA) and @Builder (so I don’t need to write builder-pattern code).
The second class contains “similar” code:

  • using the builder to build objects of this class.
  • using the lambdas of the java streams on collections.
  • using querydsl for querying books from the db
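
The builder case in the list above can be sketched as follows. This is not the code from the linked repository: it uses a hand-written builder instead of Lombok's `@Builder` so that the block is self-contained, shows only 4 hypothetical attributes instead of more than 10, and all values are placeholders.

```java
class BuilderSimilarityDemo {

    // Hypothetical stand-in for the Book entity; only 4 attributes are shown.
    static class Book {
        final String title;
        final String author;
        final String isbn;
        final int pages;

        private Book(Builder b) {
            this.title = b.title;
            this.author = b.author;
            this.isbn = b.isbn;
            this.pages = b.pages;
        }

        static Builder builder() {
            return new Builder();
        }
    }

    // Hand-written equivalent of what a generated builder provides.
    static class Builder {
        private String title;
        private String author;
        private String isbn;
        private int pages;

        Builder title(String v) { this.title = v; return this; }
        Builder author(String v) { this.author = v; return this; }
        Builder isbn(String v) { this.isbn = v; return this; }
        Builder pages(int v) { this.pages = v; return this; }

        Book build() { return new Book(this); }
    }

    // Two usages of the same builder: the structure is identical,
    // only the literals differ.
    static Book first() {
        return Book.builder()
            .title("Title A")
            .author("Author A")
            .isbn("isbn-a")
            .pages(100)
            .build();
    }

    static Book second() {
        return Book.builder()
            .title("Title B")
            .author("Author B")
            .isbn("isbn-b")
            .pages(200)
            .build();
    }

    public static void main(String[] args) {
        System.out.println(first().title + " / " + second().title);
    }
}
```

A detector that treats all literals as equivalent sees the same token sequence in `first()` and `second()`. With 4 attributes the chains stay below a 10-line threshold; with more than 10 attributes, as in the entity described above, each chain would span more than 10 lines.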

Neither querydsl nor JPA/hibernate nor lombok is java-specific magic - other languages like PHP have similar frameworks to encourage writing such “similar” code.

Such code is typical for business objects and will always result in “similar” code if the business object has more than 10 attributes. Still, it is perfectly clean code; SonarQube only complains about missing tests and the reuse of string constants.

SonarQube gives me no duplications for it. So the Java scanner does not seem to flag the kind of similar code that is marked as duplicate in PHP code.

I’m not a php developer, but I’ve seen code on the php project I am scanning, that resembles such code:

Additionally every pattern using method-chaining will eventually result in “similar” code and thus be marked as duplicate in sonarqube PHP.

Thanks for taking the time to set up an example repository for Java.
However, I’m a bit puzzled. My understanding is that you have a problem with SonarQube’s behavior in PHP. If we discuss “CPD false positives” for PHP, we should have a look at PHP code, not Java code. I understand that you want the same behavior for PHP as for Java. I could argue that the current behavior for PHP is really the same as for JavaScript, and I could show you an example of real code on SonarCloud.

Do you have PHP code to highlight the false positives you mention?
We need real PHP code to be able to form an opinion.
Ideally, we would look at a public project.
I can also open a private thread if you want to share private code with me.

As mentioned earlier, I have been using SonarQube for ages on more than 100 projects. I’ve never seen these false positives on those projects, but since they were mostly Java, I was not sure whether these false positives are language-specific. So I wrote that example code in Java and found that they do not occur in Java.
Besides I thought that the Java code would show you the concepts of necessarily similar code when using builders, querying databases or using collections - regardless of which programming language is used.

I wonder why the behavior is language-specific. Why would similar code be marked as duplicate code in PHP while it is not marked as duplicate code in Java? I can see no reason for this.

But to come back to your questions. Sure I have PHP code that shows these false positives, but it is much more complicated than that example. It is not a public project and I have to ask for permission to show you the code even in a private thread.