False positive S7189: PySpark dataframe caching when columns used in join

Language: Python + PySpark
Rule: S7189 “Consider caching or persisting this DataFrame.”
SonarQube Enterprise Edition v2025.2

The variable `df` is assigned outside the loop and used inside it, so at first glance the rule should trigger. However, `df` is reassigned on every iteration, so it is not the same DataFrame being reused: each DataFrame is consumed exactly once, and caching it would not help. I believe this is a false positive. The same situation arises when the variable is reassigned outside a loop.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

def transform_df(df, i):  # stub so the snippet is self-contained; returns a new DataFrame each call
    return df.withColumn(f"step_{i}", df[df.columns[0]])

df = spark.read.csv("data.csv")  # False positive: consider caching or persisting this DataFrame
for i in range(5):
    df = transform_df(df, i)

Hello @thomas.schouten,

Thanks for reaching out! The PySpark rules are brand new, and every piece of feedback on them is very valuable to us.

I agree with you that this rule shouldn’t be raised in the case you describe. I was able to reproduce the problem on my side and have created a ticket: Jira. We will take care of this FP in an upcoming sprint.

Thanks again for posting here,
Cheers,