5 More Rules for PySpark

Hello data engineers,

Earlier this year, we introduced an initial set of 8 rules to help you write more efficient, higher-quality PySpark code. We’ve now expanded this ruleset with 5 additional rules to help you catch common pitfalls and promote best practices.

The new rules:

  • S7193: PySpark DataFrame toPandas function should be avoided
  • S7468: PySpark dropDuplicates subset argument should not be provided with an empty list
  • S7469: PySpark’s DataFrame column names should be unique
  • S7470: PySpark’s RDD.groupByKey, when used in conjunction with RDD.mapValues with a commutative and associative operation, should be replaced by RDD.reduceByKey
  • S7471: master and appName should be set when constructing PySpark SparkContexts and SparkSessions

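As an illustration of the reasoning behind S7470: `rdd.groupByKey().mapValues(sum)` ships every individual value across the shuffle before folding, while `rdd.reduceByKey(operator.add)` combines values on each partition first, so far less data moves over the network. Since the two produce the same result only when the operation is commutative and associative, here is a plain-Python sketch (no Spark cluster needed) of that equivalence, with the corresponding PySpark calls noted in comments:

```python
from itertools import groupby
from operator import add, itemgetter

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Pattern flagged by S7470 -- in PySpark: rdd.groupByKey().mapValues(sum).
# Every value for a key is gathered first, then folded.
grouped = {
    key: sum(v for _, v in values)
    for key, values in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
}

# Preferred pattern -- in PySpark: rdd.reduceByKey(add).
# Values are folded pairwise as they arrive, which lets Spark compute
# partial results per partition before shuffling.
reduced = {}
for key, value in pairs:
    reduced[key] = add(reduced[key], value) if key in reduced else value

# Because addition is commutative and associative, both give the same answer.
assert grouped == reduced == {"a": 9, "b": 6}
```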
These rules are available now on SonarQube Cloud, and will be available in SonarQube Server 2025.3 and upcoming SonarQube for IDE releases. Your feedback is welcome below.

Jean