Detecting Code Duplications Within the Same Python File in SonarQube

Hi everyone,

I’m quite new to SonarQube and currently exploring its features. As part of my evaluation, I created two simple Python scripts that share some duplicated code. SonarQube successfully detects duplicated lines and code blocks between these two files, which is great.

However, I noticed that it doesn’t detect similar duplications within the same file. Is this behavior intentional? And if not, is there a specific setting I need to adjust to enable duplication detection within a single Python script?

For testing, I am using the following command to run the analysis and view the results on my local SonarQube server (localhost):
sonar-scanner.bat -D"sonar.projectKey=python-test" -D"sonar.sources=." -D"sonar.host.url=http://localhost:9000" -D"sonar.token=sqp_..."

Any guidance would be greatly appreciated!

Best regards,
Aaron

Hello!

Yes, SonarQube is capable of detecting code duplication within the same file. You can see an example of this in my SonarQube Cloud organization.

This functionality works out of the box and does not require any special configuration. However, SonarQube does provide several configuration parameters related to duplication detection that you can adjust if you want to fine-tune its behavior or results. You can find detailed information about these parameters in the documentation: Duplication Check Configuration.

Hello Colin,

Thank you for your message.

Unfortunately, I’m unable to access the first link you shared due to restrictions behind my company firewall. However, I did go through the documentation you referred to regarding duplication detection.

Let me explain my use case from the beginning.

Here is an example Python file I’m analyzing:

import math, random, datetime, os  # unnecessary imports


class BaseClass:
    def greet(self):
        print("Hello from BaseClass")


class SubClassA(BaseClass):
    def greet(self):

        # duplicate code within a function
        print("Hello from SubClassA")
        print("Hello from SubClassA")

        # duplicate code across multiple classes
        a = 5 + 5
        b = a * 2
        c = b - 3
        d = b / 4
        e = a * 4
        f = e * e
        g = a + b
        h = 2 + e
        i = f + g
        print("A calc:", d)


class SubClassB(BaseClass):
    def greet(self):
        print("Hello from SubClassB")
        print("Hello from SubClassB")
        a = 5 + 5
        b = a * 2
        c = b - 3
        d = b / 4
        e = a * 4
        f = e * e
        g = a + b
        h = 2 + e
        i = f + g
        print("B calc:", d)


class SubClassC(BaseClass):
    def greet(self):
        print("Hello from SubClassC")
        print("Hello from SubClassC")
        
        a = 5 + 5
        b = a * 2
        c = b - 3
        d = b / 4
        e = a * 4
        f = e * e
        g = a + b
        h = 2 + e
        i = f + g

        a = 5 + 5
        b = a * 2
        c = b - 3
        d = b / 4
        e = a * 4
        f = e * e
        g = a + b
        h = 2 + e
        i = f + g

        print("C calc:", d)


def overly_complicated_function(
    a: int, b: str, c: float
):  # only 'a' and 'b' are used
    if a > 0:
        if a < 10:
            if a != 5:
                if b == "test":
                    print("test passed")
                else:
                    if b != "":
                        print("not empty")
                    else:
                        print("empty string")
            else:
                print("a is 5")
        else:
            print("a is too big")
    else:
        print("a is non-positive")


def uses_typing(x: int, y: str) -> str:
    print(y)
    return str(x)


def unused_function(z, w, u):  # only z is used
    return z + 1


def                              main()         :  # a lot of spaces
    objA = SubClassA()
    objB = SubClassB()
    objC = SubClassC()
    objA.greet()
    objB.greet()
    objC.greet()

    result1 = overly_complicated_function(3, "hello", "not used")  # c is unnecessary
    result2 = uses_typing("not an int", 42)  # type annotations ignored
    print(result2)

    math.sqrt(16)  # calculated but not used


main()

My goals are:

  1. Detect all code duplications (within functions, across classes, etc.).
  2. Catch formatting issues according to PEP8.
  3. Identify type inconsistencies using function type hints.
  4. Report unused imports, variables, functions, and classes.
  5. Detect unused statements like math.sqrt(16) when the result is unused.
  6. Receive refactoring suggestions, e.g., extracting repeated logic into a method in the base class.

Currently, I’m using the following SonarScanner command:

sonar-scanner.bat -D"sonar.projectKey=python-test" -D"sonar.sources=." -D"sonar.host.url=http://localhost:9000" -D"sonar.token=..." -D"sonar.cpd.python.minimumTokens=3" -D"sonar.cpd.python.minimumLines=3"

Despite this, I’m not seeing these issues reflected in the SonarQube dashboard on localhost:9000. SonarQube tells me that there are no code duplications at all. So my questions are:

  • Can SonarQube detect all of the above-mentioned issues?
  • If yes, what additional configuration or setup is required?

I’d really appreciate some guidance on how to ensure these kinds of issues are detected and visualized properly in SonarQube.

Best regards,
Aaron

Hi Aaron,

I’ve reproduced the duplication behavior you describe, both within a file and across files. It’s not clear to me what’s going on here, but I suspect it comes down to how the algorithm works. Per the docs:

For a block of code to be considered as duplicated:

  • Non-Java projects:
    • There should be at least 100 successive and duplicated tokens.
    • Those tokens should be spread at least on:
      • 10 lines of code for other languages

I’ve also reproduced your missing issues.

Normally, we try to keep it to one topic per thread. Otherwise it can get messy, fast. But I suspect there’s some underlying mechanism here, and finding it will solve most of this at once. So while I (we) reserve the right to ask you to create other topics, for now I’ll let it ride as is.

This has been flagged for the language experts.

 
Ann

Hi Ann,

I hope you’re doing well.

I wanted to check in and see if there have been any updates regarding the issues I raised—particularly the duplication behavior and the missing issues you were able to reproduce. Have you or the language experts been able to identify the root cause or determine whether this behavior is expected?

Additionally, I’m wondering if some of the issues might stem from the fact that I’m currently using the community version of SonarQube rather than a paid edition. If that’s a factor, it would be important for me to know.

I really need to analyze whether SonarQube is able to meet the previously mentioned goals I outlined in my earlier post, so any insight you can provide would be greatly appreciated.

Looking forward to hearing from you.

Best regards,
Aaron

Hi Aaron,

We’re waiting for the language experts. Hopefully they’ll be along soon.

I doubt the edition you’re using is relevant here.

 
Ann

Hello,

For your code duplication issue, the duplicated block

a = 5 + 5
b = a * 2
c = b - 3
d = b / 4
e = a * 4
f = e * e
g = a + b
h = 2 + e
i = f + g

is made of around 40 tokens spread across 9 lines, which is below the threshold for detection.
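To make the arithmetic concrete, here is a rough sketch that counts the tokens in that block with Python's standard `tokenize` module and compares the result with the default thresholds quoted earlier in this thread (100 tokens, 10 lines). This is only an approximation: SonarQube's duplication detector uses its own tokenizer, so its exact count may differ from what Python reports.

```python
import io
import tokenize

BLOCK = """\
a = 5 + 5
b = a * 2
c = b - 3
d = b / 4
e = a * 4
f = e * e
g = a + b
h = 2 + e
i = f + g
"""

# Token types that only describe layout and carry no code content.
LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER,
}

def count_tokens(source):
    """Count the non-layout tokens Python's own tokenizer finds in source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in LAYOUT)

token_count = count_tokens(BLOCK)
line_count = len(BLOCK.splitlines())

# Default thresholds for non-Java languages, as quoted from the docs above.
print(token_count, "tokens; meets 100-token threshold:", token_count >= 100)
print(line_count, "lines; meets 10-line threshold:", line_count >= 10)
```

Under Python's tokenizer each assignment contributes five tokens, so the block falls short on both counts.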

We support some PEP8 formatting rules, but that has not been our focus; you can import issues from other linters instead.

For the type inconsistencies, rule S6555 triggers on your code at both "not an int" and 42; the second is a secondary location. You can get more details on an issue by clicking on it.
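As a side note on why a static rule is needed here: Python does not enforce type hints at runtime, so the mismatched call from the sample file runs without any error. A minimal demonstration, reusing the `uses_typing` function from the file above:

```python
def uses_typing(x: int, y: str) -> str:
    print(y)
    return str(x)

# The arguments contradict the annotations (str passed for int, int for str),
# yet the call succeeds because annotations are not checked at runtime.
result = uses_typing("not an int", 42)
print(repr(result))
```

Only a static analyzer or type checker will report this kind of mismatch.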

For the unused imports, our rule S1128 is not in the default Sonar way quality profile, so it needs to be activated manually. It also only supports the from a import b syntax.

For the unused statements, we have rules such as S905 and S2201. In this case, you are right and I have created SONARPY-2949 to fix this false negative.

There is currently no feature for refactoring suggestions as complicated as this, but many of our rules do have quickfixes. We also have AI Codefix to help for more complicated fixes.
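For illustration only, this is roughly what the manual refactoring asked about in goal 6 could look like. The helper name `_calc` is hypothetical and not something SonarQube generates; the sketch keeps only the values that feed the printed result.

```python
class BaseClass:
    def greet(self):
        print("Hello from BaseClass")

    def _calc(self):
        # Hypothetical shared helper: the arithmetic that was repeated
        # in every subclass, reduced to the values that are actually used.
        a = 5 + 5
        b = a * 2
        return b / 4


class SubClassA(BaseClass):
    def greet(self):
        print("Hello from SubClassA")
        print("A calc:", self._calc())


class SubClassB(BaseClass):
    def greet(self):
        print("Hello from SubClassB")
        print("B calc:", self._calc())


SubClassA().greet()
SubClassB().greet()
```

With the arithmetic in one place, the subclasses no longer contain the duplicated block.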

Have a nice day,


Hello Ghislain,

Thank you for the detailed explanation.

To follow up on the code duplication issue: is there a way to lower the threshold for detection so that smaller duplicated blocks like the one I mentioned (around 40 tokens over 9 lines) are also detected?

Best regards,
Aaron

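As I mentioned in my first reply, SonarQube provides configuration parameters for duplication detection that you can adjust; see Duplication Check Configuration in the documentation.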

@Colin

You might have overlooked it, but I wrote the following in my second post.

Currently, I’m using the following SonarScanner command:

sonar-scanner.bat -D"sonar.projectKey=python-test" -D"sonar.sources=." -D"sonar.host.url=http://localhost:9000" -D"sonar.token=..." -D"sonar.cpd.python.minimumTokens=3" -D"sonar.cpd.python.minimumLines=3"

Despite this, I’m not seeing these issues reflected in the SonarQube dashboard on localhost:9000. SonarQube tells me that there are no code duplications at all.

With best regards
Aaron

I did miss that – sorry!

You need to use py instead of python.

sonar.cpd.py.minimumTokens and sonar.cpd.py.minimumLines respectively.

There’s no great reason we use py instead of python as the key here, but you can always be sure by checking the GET api/languages/list Web API.
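A small sketch of how that lookup could be scripted with Python's standard library. The response shape (a `languages` array of objects with `key` and `name`) and the use of the token as a basic-auth username are assumptions here, so check them against the Web API documentation on your own server. The sample payload is hand-written for illustration, not captured from a real instance.

```python
import base64
import json
import urllib.request

def parse_language_keys(payload):
    """Map language name -> language key from an api/languages/list response."""
    data = json.loads(payload)
    return {lang["name"]: lang["key"] for lang in data.get("languages", [])}

def fetch_language_keys(base_url, token=""):
    """Query a SonarQube server for its language keys (needs network access)."""
    request = urllib.request.Request(base_url.rstrip("/") + "/api/languages/list")
    if token:
        # Assumption: the token is sent as the basic-auth username
        # with an empty password.
        credentials = base64.b64encode(f"{token}:".encode()).decode()
        request.add_header("Authorization", f"Basic {credentials}")
    with urllib.request.urlopen(request) as response:
        return parse_language_keys(response.read().decode("utf-8"))

# Hand-written sample payload (assumed shape), so the parsing can be
# tried without a running server.
SAMPLE = '{"languages": [{"key": "py", "name": "Python"}, {"key": "java", "name": "Java"}]}'
keys = parse_language_keys(SAMPLE)
print(keys["Python"])
```

Against a local server, the call would be something like `fetch_language_keys("http://localhost:9000", token)`.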

Now those duplicated lines show up.
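For reference, a minimal sketch that builds the two threshold arguments from a language key, so the key is written in only one place. The helper name `cpd_threshold_args` is made up for this example; only the `sonar.cpd.<language key>.minimumTokens` and `sonar.cpd.<language key>.minimumLines` property names come from the thread.

```python
def cpd_threshold_args(language_key, min_tokens, min_lines):
    """Return the -D arguments that set the duplication thresholds for one language."""
    return [
        f"-Dsonar.cpd.{language_key}.minimumTokens={min_tokens}",
        f"-Dsonar.cpd.{language_key}.minimumLines={min_lines}",
    ]

# "py" is the language key for Python, not "python".
args = cpd_threshold_args("py", 3, 3)
print(" ".join(args))
```

The printed arguments can be appended to the sonar-scanner command in place of the two `sonar.cpd.python.*` properties.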


Hi @Colin,

Thanks for your message — using "py" instead of "python" indeed solved the issue with detecting duplicated lines of code within the same file! It worked exactly as expected. Thanks a lot!

That said, I find it a bit confusing that "py" is used in this case, while the documentation elsewhere consistently uses "sonar.python...". For example:

Python code is analyzed by default as compatible with Python 2 and Python 3. Some issues will be automatically silenced to avoid raising False Positives. In order to get a more precise analysis, you can specify the Python versions your code supports via the sonar.python.version parameter.

The accepted format is a comma-separated list of versions having the format "X.Y". Here are some examples:
sonar.python.version=2.7
sonar.python.version=3.8
sonar.python.version=2.7, 3.7, 3.8, 3.9

source to documentation

It would be great if the naming could be made more consistent across the platform for better clarity.

Best regards
Aaron

Absolutely, there are a few tricky cases here. This is some very old SonarQube logic (see SONAR-1501 from 2010)! The language key isn’t really used in other parts of SonarQube.

Since this is an advanced configuration that’s rarely used, it’s unlikely we’ll change how the analysis parameters work.

However, we could certainly improve our documentation to make it clearer where to find the language keys when they’re required. I’ll make sure to flag this for review—thanks for pointing it out!

So sorry this caused an issue. I have updated the Python pages to correct this error so it won’t impact future users.