CPD false positives

I’m using SonarQube Community Edition 8.6.1 and Developer Edition 8.6; both give me the same false positives in duplicate code detection.

I have lots of lines that look like this:

$this->input(77, 27, $data->get('cl.tischler-fenster-tuer.fenster-fenstertueren.gang-und-schliessbar-richten-wohn-kueche'));

None of these lines are identical: they differ both in the numbers (77, 27) and in the texts, although the texts may be partly the same (e.g. …cl.tischler-fenster-tuer.fenster-fenstertueren.).

I’ve tried the same code with PMD-CPD (V6.31.0), which finds duplications in the strings (100-token-wide parts of strings match other 100-token-wide parts of strings in other lines, as in the example above), but not a single duplication spanning more than 2 lines.

SonarQube detects large chunks of duplicate code - e.g. starting in CarpenterV8.php line 10: “Duplicated By …/TilerV8.php Lines: [10 – 87]” whereas in TilerV8.php line 10 it states: “Duplicated By …/CarpenterV8.php Lines: [10 – 241]”
Those lines start with the same code:

{
public function fill(Collection $data): Fpdf
{

but after that they are different -

//Wohn-/Küche
$this->input(77, 27, $data->get('cl.tischler-fenster-tuer.fenster-fenstertueren.gang-und-schliessbar-richten-wohn-kueche'));
$this->input(77, 30.8, $data->get('cl.tischler-fenster-tuer.fenster-fenstertueren.flugeldichtung-erneuern-wohn-kueche'));

vs.

//ALLGEMEINE ARBEITEN
$this->input(95.5, 24, $data->get('cl.fliesenleger.allgemeine-arbeiten.schutz-von-bauteilen-flaeche'));
$this->input(139.5, 24, $data->get('cl.fliesenleger.allgemeine-arbeiten.fussbodenschutz-flaeche'));

None of these lines are detected by PMD-CPD as duplicates.

I would expect duplications to be found like stated in the documentation (100 successive and duplicated tokens spanning at least 10 lines of code).

Hello @sebastian.dietrich ,

Thank you for your input. We will investigate your problem and get back to you as soon as possible.

Best,
Nils

FYI: That bug still exists in the latest SonarQube version (8.7.0.41497). Anyway, I suspect the bug is in the sonar-scanner CPD Executor (I’m using sonar-scanner-cli-4.6.0.2311-windows and, afaik, sonar-scanner-jenkins). Somehow the scanner cannot correctly parse PHP files (I do not experience such problems with Java or C# in other projects).

SonarSource’s Copy/Paste Detection (CPD) replaces numbers and strings with placeholders to increase the speed of comparing code sequences. That is, in your particular case, the lines are always translated into the same tokens. Lines such as

$this->input(77, 27, $data->get('cl.tischler-fenster-tuer...'));

are translated to

$this, ->, input, (, $NUMBER, ,, $NUMBER, ,, $data, ->, get, (, $CHARS, ), ), ;

They will then be compared and considered identical sequences. This behavior is intended by us. You may adjust the sensitivity of the CPD in the configuration under Analysis Parameters -> Duplications.
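A minimal sketch of this kind of literal normalization, for illustration only. It is not SonarQube's actual implementation: the function name `normalizeTokens` is hypothetical, the tokenizer is PHP's built-in `token_get_all`, and the placeholders `$NUMBER` and `$CHARS` are taken from the token list above.

```php
<?php
// Hypothetical helper (not SonarQube code): tokenize a PHP snippet and
// replace every number and string literal with a placeholder.
function normalizeTokens(string $code): array
{
    $out = [];
    foreach (token_get_all('<?php ' . $code) as $token) {
        if (!is_array($token)) {
            $out[] = $token; // single-character tokens such as ( ) , ;
            continue;
        }
        [$id, $text] = $token;
        switch ($id) {
            case T_OPEN_TAG:
            case T_WHITESPACE:
            case T_COMMENT:
            case T_DOC_COMMENT:
                break; // not relevant for the comparison
            case T_LNUMBER:
            case T_DNUMBER:
                $out[] = '$NUMBER';
                break;
            case T_CONSTANT_ENCAPSED_STRING:
                $out[] = '$CHARS';
                break;
            default:
                $out[] = $text;
        }
    }
    return $out;
}

$carpenter = '$this->input(77, 27, $data->get(\'cl.tischler-fenster-tuer.fenster-fenstertueren.gang-und-schliessbar-richten-wohn-kueche\'));';
$tiler     = '$this->input(95.5, 24, $data->get(\'cl.fliesenleger.allgemeine-arbeiten.schutz-von-bauteilen-flaeche\'));';

echo implode(' ', normalizeTokens($carpenter)), "\n";
// $this -> input ( $NUMBER , $NUMBER , $data -> get ( $CHARS ) ) ;

var_dump(normalizeTokens($carpenter) === normalizeTokens($tiler)); // bool(true)
```

Under this assumption, two lines that differ only in their literals produce the same token sequence, which is why long runs of such lines are reported as one duplicated block.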

Best,
Nils

Hm - I understand. So to increase performance you no longer detect duplicate code but just similar code. One can adjust the sensitivity (number of tokens, number of lines) but there is no possibility to adjust this behavior.

Can you please elaborate on why this behavior is intended (besides being faster)?

By the way, in my case the CPD Executor took 896 ms for 1166 files, while on the same codebase another duplicate-code detector (Simian) took 2170 ms for 2461 files (with a 10-line threshold). So I conclude that Simian (which finds true duplicate code) is just 15% slower per file.
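The per-file arithmetic behind that figure can be checked with a short sketch. It uses only the numbers quoted above; since the two tools scanned different numbers of files, the comparison is approximate.

```php
<?php
// Figures quoted above.
$cpdMsPerFile    = 896 / 1166;   // SonarQube CPD Executor
$simianMsPerFile = 2170 / 2461;  // Simian, 10-line threshold

// Relative per-file overhead of Simian compared with CPD, in percent.
$overheadPercent = ($simianMsPerFile / $cpdMsPerFile - 1) * 100;

printf(
    "CPD: %.2f ms/file, Simian: %.2f ms/file, overhead: %.0f%%\n",
    $cpdMsPerFile,
    $simianMsPerFile,
    $overheadPercent
);
// CPD: 0.77 ms/file, Simian: 0.88 ms/file, overhead: 15%
```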

Sorry. It’s not about performance.
And, in fact, it’s not a bug, it’s a feature.
I believe that most developers would like to be notified if 2 big blocks of code are so similar that the only differences are in the literals they use.
That’s why our PHP analyzer considers that, for the detection of duplications, all literals are equivalent.
I think that’s the same for most of our analyzers.

I agree that this approach gives a surprising result on your code which seems to contain a lot of hardcoded data.
If this is limited to a subset of the analyzed files, I suggest that you exclude those files from the duplication detection. There’s a documentation page about that.
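As a rough sketch, such an exclusion could look like this in the scanner configuration. The `sonar.cpd.exclusions` property is the duplication-exclusion setting; the file pattern is only an assumption based on the file names mentioned in this thread (CarpenterV8.php, TilerV8.php) and would need to match your actual layout.

```properties
# sonar-project.properties
# Assumed pattern: skip duplication detection for the form-filling classes.
sonar.cpd.exclusions=**/*V8.php
```

The excluded files are still analyzed for other issues; they are only left out of the duplication detection.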

I’ve been using SonarQube for ages, mostly on Java code, and have never seen that behavior there. But if it helps, I might try to simulate it in Java.

I wonder if similar (but not duplicate) code is in fact bad programming practice. I know of no smell, design rule or principle that speaks against similar code. I can’t think of an example where similar (but not duplicate) code would be improved by any refactoring.

Moreover, I doubt that developers want to be notified of similar code as “duplicate code”. If they need a check for similar code, they would like it to be marked as “similar code”. No other tool I know of (PMD-CPD, Simian) marks such code as duplicate. In SonarQube, duplicate code cannot be marked as “won’t fix” and CPD exclusions only work on whole files.

In my code I have tons of such duplicates. E.g. the following types of code are also marked as duplicates, although none of them are duplications in the sense that a design or code change would improve the quality:

  • array definitions spanning more than 10 lines → get marked as duplicates since the array is filled with literals
  • queries spanning more than 10 lines → if they are structurally equal to another query
  • usage of builder patterns → if the builder builds an object with >= 10 attributes then several usages of the same builder will result in duplicate code
  • usage of functional programming → if the lambdas span >= 10 lines then a similar expression results in duplicate code

Sorry if my wording is not correct, I’m not a PHP programmer. But I assume my point is understandable: There are a lot of good and valid programming practices that lead to similar code which is fine and even preferable.
So I assume that in most cases similar code (that is not duplicate code) is probably good code and should therefore not be marked as duplicate code.