Language: Python + PySpark
Rule: S7189 “Consider caching or persisting this DataFrame.”
SonarQube Enterprise Edition v2025.2
In this example, the dataframe df1 is joined with another dataframe, the result is joined again, and so on. However, because columns from df1 are used in every join condition, the rule triggers: it counts df1 as being used three times. I think this is a false positive, because it is not the dataframe itself that is reused, only its columns in the join conditions.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()
df1 = spark.read.csv("data1.csv") # False positive: consider caching or persisting this DataFrame
df2 = spark.read.csv("data2.csv")
df3 = spark.read.csv("data3.csv")
df4 = spark.read.csv("data4.csv")
df1.join(df2, df1.IdA == df2.Id, how='inner') \
   .join(df3, df1.IdB == df3.Id, how='inner') \
   .join(df4, df1.IdC == df4.Id, how='inner')
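For contrast, here is a minimal sketch of the situation I understand the rule is meant to flag: a DataFrame that genuinely feeds several independent computations, where caching it avoids re-reading the source for each action. The file name and column names below are hypothetical; cache() is the standard PySpark API.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheExample").getOrCreate()

# Hypothetical input: orders.csv with columns "country" and "amount".
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# The same DataFrame is the input of two independent actions, so each one
# would otherwise re-read and re-parse the CSV. Caching it once is the
# kind of reuse S7189 is asking about.
orders.cache()

orders.groupBy("country").count().show()
orders.filter(orders.amount > 100).count()

In my original snippet there is only a single action on a single chained result, so no such recomputation of df1 happens; only its column references appear multiple times.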