Language: Python + PySpark
Rule: S7189 “Consider caching or persisting this DataFrame.”
SonarQube Enterprise Edition v2025.2
The variable ‘df’ is defined outside the loop and used inside it, so this rule would normally be expected to trigger. In this case, however, the same DataFrame is not actually reused: ‘df’ is reassigned on every iteration, so each iteration works on the DataFrame returned by the previous one. I believe this is a false positive. The same issue is raised when the variable is reassigned without a loop.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()
df = spark.read.csv("data.csv") # False positive: consider caching or persisting this DataFrame
for i in range(5):
    df = transform_df(df, i)
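To make the reasoning concrete, here is a minimal plain-Python sketch of the same reassignment pattern. It does not use Spark: `Frame` is a hypothetical stand-in for an immutable DataFrame, and the body of `transform_df` is assumed, since the real function is not shown above. It only counts how often each object is consumed, assuming `transform_df` uses its input exactly once.

```python
class Frame:
    """Hypothetical stand-in for an immutable DataFrame (not a Spark class)."""

    def __init__(self, name):
        self.name = name
        self.uses = 0  # how many times this exact object was consumed

    def with_step(self, i):
        # Every transformation consumes this object once and returns a new one.
        self.uses += 1
        return Frame(f"{self.name}->step{i}")


def transform_df(df, i):
    # Assumed body: the real transform_df is not shown in the report.
    return df.with_step(i)


frames = [Frame("read")]
df = frames[0]
for i in range(5):
    df = transform_df(df, i)  # df is rebound to a new object each iteration
    frames.append(df)

# No single object is consumed more than once, so there is nothing to cache.
max_uses = max(f.uses for f in frames)
print(max_uses)
```

If the loop instead kept reading from the original `df` without reassigning it, that one object would be consumed five times, which is the situation the rule is presumably meant to catch.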