Hello data engineers,
Earlier this year, we introduced an initial set of 8 rules to help you write more efficient, high-quality PySpark code. We've now expanded this ruleset with 5 additional rules to help you catch common pitfalls and promote best practices.
The new rules:

- S7193: PySpark `DataFrame` `toPandas` function should be avoided
- S7468: PySpark `dropDuplicates` subset argument should not be provided with an empty list
- S7469: PySpark's `DataFrame` column names should be unique
- S7470: PySpark's `RDD.groupByKey`, when used in conjunction with `RDD.mapValues` with a commutative and associative operation, should be replaced by `RDD.reduceByKey`
- S7471: `master` and `appName` should be set when constructing PySpark `SparkContext`s and `SparkSession`s
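To illustrate the idea behind S7470: `groupByKey` materializes every value for a key before aggregating, whereas `reduceByKey` folds values into a running result, so Spark can combine partial results on each partition before shuffling them across the network. Here is a minimal plain-Python sketch of the two semantics (no Spark cluster needed; the `pairs` data is made up for illustration):

```python
from collections import defaultdict
from operator import add

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# groupByKey + mapValues(sum): collect every value per key first,
# then aggregate the full list (all values must be held at once)
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
grouped_then_summed = {k: sum(vs) for k, vs in groups.items()}

# reduceByKey(add): fold each value into a single running total per key,
# which is valid because addition is commutative and associative
totals = {}
for k, v in pairs:
    totals[k] = add(totals[k], v) if k in totals else v

assert grouped_then_summed == totals  # both give {'a': 9, 'b': 6}
```

Because the reduce-style aggregation only ever keeps one accumulated value per key, the equivalent Spark job sends far less data over the wire, which is why the rule recommends `rdd.reduceByKey(add)` over `rdd.groupByKey().mapValues(sum)` for commutative and associative operations.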
These rules are available on SonarQube Cloud and will be available in SonarQube Server 2025.3 and upcoming SonarQube IDE releases. Your feedback is welcome below.
Jean