Hi @Iwo_Polanski,
We sincerely apologize for the delay. We were finally able to allocate the resources to investigate this issue properly and get to the bottom of it.
TL;DR
You are right: Copy Paste Detection is not working properly, specifically on long """ string templates in Kotlin.
We are going to fix it right away and release the fix in the next version of the Sonar Kotlin analyzer.
After that, depending on the product you are using, you may get the fix in days (SonarQube Cloud), or in the following release of SonarQube Server.
We will inform you via this ticket when the issue is fixed in the analyzer.
Note that a very long list of strings may still trigger a duplication, because that is how Copy Paste Detection works. In your scenario, however, and in many others you may encounter, the problem should be fixed for good.
Technical details
Here are some technical details that may help you and others understand the current behavior of the duplication detection.
As @Colin mentioned earlier, Copy Paste Detection (CPD for short) ignores string literals, and that’s true in most if not all languages where we implemented it: it replaces the actual string content with a placeholder. In the case of Kotlin, the placeholder is LITERAL.
String templates, in languages that support them, have a complex structure, since expressions can be injected into them: e.g. """a $x b""".
In Kotlin, a single string template is made of many literal string template entries, which are fragments of the overall string template.
For example, if we take the first property in the RemoteConfigDefaults object you provided:
val METERING_REWARDED_VIDEO_CONFIG: String = """
{
"us": {
"logged_user":
{
"is_enabled":false,
"internal_ad_unit_id": "/2165551/brainly_android_app/RewardedadUnit_house_ads",
"rewarded_videos_threshold": 5
},
"unlogged_user":
{
"is_enabled":false,
"internal_ad_unit_id": "/2165551/brainly_android_app/RewardedadUnit_house_ads",
"rewarded_videos_threshold": 3
}
}
}
""".trimIndent()
the string template """ ... """ translates into a very long series of literal string template entries:
PsiElement(OPEN_QUOTE): """
LITERAL_STRING_TEMPLATE_ENTRY x78
...
PsiElement(CLOSING_QUOTE): """
For the METERING_BASE_CONFIG string template there are 498 LITERAL_STRING_TEMPLATE_ENTRY entries!
When implementing the tokenization of string templates used for CPD, we emitted a LITERAL placeholder token for each entry, instead of emitting a single one for the entire string template, as we assumed that entries would only be created when expressions were injected in the template (e.g. for $x in """a $x b""").
However, that’s not the case, and 78 entries are created for the string template above, even though it contains no injected expression at all.
The detection basically compares sequences of tokens, and when there are at least 100 equal tokens over at least 10 lines (docs here), a duplication is detected.
This also explains why you observed a duplication only covering part of the string template: the tokenized string looks like the following:
object RemoteConfigDefaults {
@ JvmField val METERING_REWARDED_VIDEO_CONFIG : String =
""" LITERAL x78 times ... """ . trimIndent ( )
@ JvmField val METERING_BASE_CONFIG : String =
""" LITERAL x495 times """ . trimIndent ( )
@ JvmField val APP_ONBOARDING_CONFIG : String = """
...
So the detection would find any sequence of at least 100 LITERAL tokens over 10+ lines and report it as a duplication.
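As a rough illustration of the threshold logic described above (this is not the actual CPD implementation), the sketch below models a token as its text plus a line number and checks a run of equal tokens against the two thresholds. The Token class, the helper names, and every count other than 78 are assumptions made up for the example.

```kotlin
// Simplified model of a CPD token: the text used for comparison and its source line.
data class Token(val text: String, val line: Int)

// Thresholds quoted above, hard-coded here for illustration only.
const val MIN_TOKENS = 100
const val MIN_LINES = 10

// Length of the run of pairwise-equal tokens starting at a[i] and b[j].
fun commonRunLength(a: List<Token>, i: Int, b: List<Token>, j: Int): Int {
    var n = 0
    while (i + n < a.size && j + n < b.size && a[i + n].text == b[j + n].text) n++
    return n
}

// True if the run of `length` tokens starting at a[start] meets both thresholds.
fun isDuplication(a: List<Token>, start: Int, length: Int): Boolean {
    if (length < MIN_TOKENS) return false
    val lines = a[start + length - 1].line - a[start].line + 1
    return lines >= MIN_LINES
}

// Hypothetical helper: `count` LITERAL placeholders spread evenly over `lines` lines.
fun literals(count: Int, firstLine: Int, lines: Int): List<Token> =
    List(count) { k -> Token("LITERAL", firstLine + k * lines / count) }

fun main() {
    // Three templates with different content, all reduced to LITERAL placeholders.
    val small = literals(count = 78, firstLine = 1, lines = 18)
    val medium = literals(count = 120, firstLine = 1, lines = 30)
    val large = literals(count = 300, firstLine = 1, lines = 60)

    val shortRun = commonRunLength(small, 0, large, 0)
    println("$shortRun tokens -> duplication: ${isDuplication(small, 0, shortRun)}")

    val longRun = commonRunLength(medium, 0, large, 0)
    println("$longRun tokens -> duplication: ${isDuplication(medium, 0, longRun)}")
}
```

In this model the 78-token template stays below the threshold on its own, while two longer templates with unrelated content still match each other, because only the placeholders are compared.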
The possible solutions we are thinking about are:
- either reporting a single LITERAL token at string template level, and ignoring injected expressions altogether
- or trying to find a better compromise: keeping entries when they correspond to injected variables, and dropping them when they are not relevant for CPD
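To make the difference between these options more concrete, here is a minimal sketch using hypothetical stand-ins for template entries. These are not the real Kotlin PSI classes, and the token emitted for an injected expression in the second option is purely an assumption. Quote tokens are left out for brevity.

```kotlin
// Hypothetical, simplified stand-ins for the entries of a string template.
sealed interface Entry
data class LiteralEntry(val text: String) : Entry      // a fragment of plain text
data class ExpressionEntry(val code: String) : Entry   // an injected expression such as $x

// Behaviour described earlier: one LITERAL placeholder per entry.
fun tokenizePerEntry(entries: List<Entry>): List<String> =
    entries.map { "LITERAL" }

// First option: a single LITERAL for the whole template, injections ignored.
fun tokenizeWholeTemplate(entries: List<Entry>): List<String> =
    listOf("LITERAL")

// Second option (one possible compromise): collapse consecutive literal entries
// into one LITERAL and keep a token for each injected expression.
fun tokenizeCollapsingLiterals(entries: List<Entry>): List<String> {
    val tokens = mutableListOf<String>()
    var previousWasLiteral = false
    for (entry in entries) {
        when (entry) {
            is ExpressionEntry -> {
                tokens += entry.code  // assumption: the expression text is used as the token
                previousWasLiteral = false
            }
            is LiteralEntry -> {
                if (!previousWasLiteral) tokens += "LITERAL"
                previousWasLiteral = true
            }
        }
    }
    return tokens
}

fun main() {
    val entries = listOf(
        LiteralEntry("a "), ExpressionEntry("x"),
        LiteralEntry(" b"), LiteralEntry("\n"), LiteralEntry("c"),
    )
    println(tokenizePerEntry(entries))            // one placeholder per entry
    println(tokenizeWholeTemplate(entries))       // a single placeholder
    println(tokenizeCollapsingLiterals(entries))  // placeholders separated by the injection
}
```

With either option, a template without injections contributes a single LITERAL token, whatever its length.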
Note that, even with the most conservative approach (the first one), you could still have duplications when there is a long series of LITERAL tokens, potentially coming from strings with different content. The typical example is a list of more than 100 strings, over more than 10 lines:
val strings = listOf(
"line 1 string 1", "line 1 string 2", ... "line 1 string 10",
...
"line 10 string 1", "line 10 string 2", ... "line 10 string 10",
)
that would generate:
"listOf", ",", ("LITERAL", ",") x 100, ")"
These scenarios, however, are much less likely.
Best regards,
Antonio