Hello data engineers,
Earlier this year, we introduced an initial set of 8 rules to help you write more efficient, high-quality PySpark code. We've now expanded this ruleset with 5 additional rules to help you catch common pitfalls and promote best practices.
The new rules:

- S7193: PySpark `DataFrame` `toPandas` function should be avoided
- S7468: PySpark `dropDuplicates` subset argument should not be provided with an empty list
- S7469: PySpark's `DataFrame` column names should be unique
- S7470: PySpark's `RDD.groupByKey`, when used in conjunction with `RDD.mapValues` with a commutative and associative operation, should be replaced by `RDD.reduceByKey`
- S7471: `master` and `appName` should be set when constructing PySpark `SparkContext`s and `SparkSession`s
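To illustrate the idea behind S7470: `groupByKey` materializes every value for a key before aggregating, whereas `reduceByKey` folds values into a running result, so Spark can combine partial results on each partition before shuffling them across the network. Here is a minimal plain-Python sketch of the two semantics (no Spark cluster needed; the `pairs` data is made up for illustration):

```python
from collections import defaultdict
from operator import add

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# groupByKey + mapValues(sum): collect every value per key first,
# then aggregate the full list (all values must be held at once)
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
grouped_then_summed = {k: sum(vs) for k, vs in groups.items()}

# reduceByKey(add): fold each value into a single running total per key,
# which is valid because addition is commutative and associative
totals = {}
for k, v in pairs:
    totals[k] = add(totals[k], v) if k in totals else v

assert grouped_then_summed == totals  # both give {'a': 9, 'b': 6}
```

Because the reduce-style aggregation only ever keeps one accumulated value per key, the equivalent Spark job sends far less data over the wire, which is why the rule recommends `rdd.reduceByKey(add)` over `rdd.groupByKey().mapValues(sum)` for commutative and associative operations.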
These rules are available on SonarQube Cloud and will be available in SonarQube Server 2025.3 and upcoming SonarQube IDE releases. Your feedback is welcome below.
Jean