False positive S7189: PySpark dataframe caching when columns used in join

Language: Python + PySpark
Rule: S7189 “Consider caching or persisting this DataFrame.”
SonarQube Enterprise Edition v2025.2

The variable `df` is assigned outside the loop and used inside it, so at first glance the rule should trigger. However, `df` is reassigned on every iteration, so it is not the same DataFrame being reused: each DataFrame is consumed exactly once, and caching it would not help. I believe this is a false positive. The same situation arises when the variable is reassigned outside a loop.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

def transform_df(df, i):  # stub so the snippet is self-contained; returns a new DataFrame each call
    return df.withColumn(f"step_{i}", df[df.columns[0]])

df = spark.read.csv("data.csv")  # False positive: consider caching or persisting this DataFrame
for i in range(5):
    df = transform_df(df, i)

Hello @thomas.schouten,

Thanks for reaching out! The PySpark rules are brand new, and every piece of feedback on them is very valuable to us.

I agree with you that this rule shouldn’t be raised in the case you describe. I was able to reproduce the problem on my side and have created a ticket: Jira. We will take care of this FP in an upcoming sprint.

Thanks again for posting here,
Cheers,