Hello data engineers,
Earlier this year, we introduced an initial set of 8 rules to help you write more efficient, high-quality PySpark code. We've now expanded this ruleset with 5 additional rules to help you catch common pitfalls and promote best practices.
The new rules:
- S7193: PySpark `DataFrame.toPandas` function should be avoided
- S7468: PySpark `dropDuplicates` subset argument should not be provided with an empty list
- S7469: PySpark's `DataFrame` column names should be unique
- S7470: PySpark's `RDD.groupByKey`, when used in conjunction with `RDD.mapValues` with a commutative and associative operation, should be replaced by `RDD.reduceByKey`
- S7471: `master` and `appName` should be set when constructing PySpark `SparkContext`s and `SparkSession`s
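To illustrate S7470: grouping every value per key and then reducing (`groupByKey().mapValues(...)`) shuffles all values across the network before any combining happens, while `reduceByKey` combines values locally first. For a commutative and associative operation such as addition, the two patterns produce the same result, which can be sketched in plain Python without a Spark cluster. The function names below are illustrative stand-ins, not part of the PySpark API:

```python
from collections import defaultdict
from functools import reduce
from operator import add

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Pattern flagged by S7470: materialize every value per key, then reduce.
# In PySpark this would be: rdd.groupByKey().mapValues(lambda vs: reduce(add, vs))
def group_by_key_then_map_values(kv_pairs):
    groups = defaultdict(list)
    for key, value in kv_pairs:
        groups[key].append(value)  # all values are kept in memory per key
    return {key: reduce(add, values) for key, values in groups.items()}

# Preferred pattern: fold values pairwise as they arrive, keeping one
# accumulator per key. In PySpark: rdd.reduceByKey(add), which combines
# values on each partition before the shuffle.
def reduce_by_key(kv_pairs, op):
    acc = {}
    for key, value in kv_pairs:
        acc[key] = op(acc[key], value) if key in acc else value
    return acc

assert group_by_key_then_map_values(pairs) == reduce_by_key(pairs, add)
# Both yield {"a": 9, "b": 6}
```

Because the operation is commutative and associative, the order in which values are combined does not matter, which is exactly why `reduceByKey` can safely pre-aggregate on each partition.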
These rules are available now on SonarQube Cloud and will be available in SonarQube Server 2025.3 and upcoming SonarQube for IDE releases. Your feedback is welcome below.
Jean