False positive S7189: Using a dataframe column in multiple joins

Language: Python + PySpark
Rule: S7189 “Consider caching or persisting this DataFrame.”
SonarQube Enterprise Edition v2025.2

In this example, df1 is joined with df2, and the result is then joined with df3 and df4 in turn. Because columns of df1 are referenced in every join condition, the rule triggers, apparently counting df1 as used three times. I think this is a false positive: the DataFrame itself is only used once, as the left side of the first join. The other occurrences are just column references.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

df1 = spark.read.csv("data1.csv")  # False positive: consider caching or persisting this DataFrame
df2 = spark.read.csv("data2.csv")
df3 = spark.read.csv("data3.csv")
df4 = spark.read.csv("data4.csv")

df1.join(df2, df1.IdA == df2.Id, how='inner') \
    .join(df3, df1.IdB == df3.Id, how='inner') \
    .join(df4, df1.IdC == df4.Id, how='inner')
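
To make the distinction concrete, here is a rough sketch that uses Python's standard `ast` module to separate the two kinds of occurrences of `df1` in the chain above. This is not how SonarQube implements S7189, which I have no insight into. The helper name `classify_uses` and the two categories are my own assumptions, and the chain is wrapped in parentheses instead of backslashes only to keep the string literal simple.

```python
import ast

# The join chain from the example, as source text.
# Parentheses replace the backslash continuations (illustration only).
SNIPPET = """
(df1.join(df2, df1.IdA == df2.Id, how='inner')
    .join(df3, df1.IdB == df3.Id, how='inner')
    .join(df4, df1.IdC == df4.Id, how='inner'))
"""


def classify_uses(source: str, name: str) -> dict:
    """Hypothetical helper: count attribute accesses on `name`, split into
    method-call receivers (df1.join(...)) and plain references (df1.IdA)."""
    tree = ast.parse(source)

    # Attribute nodes that are directly called, e.g. `df1.join` in `df1.join(...)`.
    callees = {
        id(node.func)
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    }

    counts = {"method_calls": 0, "column_refs": 0}
    for node in ast.walk(tree):
        is_attr_on_name = (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == name
        )
        if not is_attr_on_name:
            continue
        if id(node) in callees:
            counts["method_calls"] += 1  # DataFrame used as an operand
        else:
            counts["column_refs"] += 1   # only a column is referenced
    return counts


counts = classify_uses(SNIPPET, "df1")
print(counts)
```

Under this classification, `df1` is the receiver of a single `join` call, and the remaining three occurrences are the column references `df1.IdA`, `df1.IdB` and `df1.IdC`. I would expect only the first kind to count toward the reuse threshold.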