Language: Python + PySpark
Rule: S7189 “Consider caching or persisting this DataFrame.”
SonarQube Enterprise Edition v2025.2
In this example, the dataframe df1 is joined with another dataframe, the result is joined again, and so on. However, because columns from df1 are used in every join condition, the rule triggers: it counts df1 as being used three times. I think this is a false positive, because it is not the dataframe itself that is reused, only its columns in the join conditions.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()
df1 = spark.read.csv("data1.csv") # False positive: consider caching or persisting this DataFrame
df2 = spark.read.csv("data2.csv")
df3 = spark.read.csv("data3.csv")
df4 = spark.read.csv("data4.csv")
df1.join(df2, df1.IdA == df2.Id, how='inner') \
   .join(df3, df1.IdB == df3.Id, how='inner') \
   .join(df4, df1.IdC == df4.Id, how='inner')
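For contrast, here is a minimal sketch of the situation I understand the rule is meant to flag: a DataFrame that genuinely feeds several independent computations, where caching it avoids re-reading the source for each action. The file name and column names below are hypothetical; cache() is the standard PySpark API.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheExample").getOrCreate()

# Hypothetical input: orders.csv with columns "country" and "amount".
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# The same DataFrame is the input of two independent actions, so each one
# would otherwise re-read and re-parse the CSV. Caching it once is the
# kind of reuse S7189 is asking about.
orders.cache()

orders.groupBy("country").count().show()
orders.filter(orders.amount > 100).count()

In my original snippet there is only a single action on a single chained result, so no such recomputation of df1 happens; only its column references appear multiple times.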