Write High-Quality PySpark Python Code with SonarQube

Hello data engineers,

We heard from many of you that you’d like SonarQube to help you avoid pitfalls when working with PySpark. We’re happy to share that SonarQube can now detect performance, maintainability, and correctness issues in your PySpark code, in both Python and Jupyter Notebook files. The following rules are available on SonarQube Cloud and will be available in the next release of SonarQube Server (Developer Edition and above) and SonarQube IDE. A short illustrative sketch of each rule follows the list.

  • S7181: PySpark Window functions should always specify a frame
  • S7182: The “subset” argument should be provided when using PySpark DataFrame “dropDuplicates” method
  • S7187: PySpark Pandas DataFrame columns should not use a reserved name
  • S7189: PySpark DataFrames used multiple times should be cached or persisted
  • S7191: PySpark withColumns should be preferred over withColumn when multiple columns are specified
  • S7192: The “how” parameter should be specified when joining two PySpark DataFrames
  • S7195: PySpark lit(None) should be used when populating empty columns
  • S7196: Complex logic provided to PySpark “withColumn”, “filter” and “when” methods should be refactored into separate expressions
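
To make these pitfalls concrete, here are minimal sketches of what each rule targets; the data, column names, and thresholds are invented for illustration. For S7181, an ordered window without an explicit frame defaults to `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, which often yields surprising results for functions like `last`:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10), ("a", 2, 20), ("a", 3, 30)], ["grp", "ts", "val"]
)

# Noncompliant: with orderBy but no explicit frame, the default frame ends
# at the current row, so last("val") returns the current row's value
# rather than the group's last value.
w = Window.partitionBy("grp").orderBy("ts")
df.withColumn("last_val", F.last("val").over(w)).show()

# Compliant: state the frame explicitly so the intent is unambiguous.
w = (
    Window.partitionBy("grp")
    .orderBy("ts")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
df.withColumn("last_val", F.last("val").over(w)).show()
```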
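
For S7182, calling `dropDuplicates` without `subset` deduplicates on every column, which may silently keep rows you consider duplicates:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "2024-01-01"), (1, "2024-01-02"), (2, "2024-01-01")], ["id", "seen"]
)

# Noncompliant: rows are compared on all columns, so both rows for id=1
# survive even if "id" alone should define a duplicate.
deduped = df.dropDuplicates()

# Compliant: name the columns that define a duplicate.
deduped = df.dropDuplicates(subset=["id"])
```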
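
For S7187, pandas-on-Spark reserves some column names for its own bookkeeping; the exact list is in the rule description, and `__index_level_0__` below is an assumed example of such an internal name:

```python
import pyspark.pandas as ps

# Noncompliant (assumed example): "__index_level_0__" is a name
# pandas-on-Spark uses internally to track the index, so shadowing it
# invites clashes with the library's own columns.
psdf = ps.DataFrame({"__index_level_0__": [1, 2, 3]})

# Compliant: pick a plain, descriptive column name instead.
psdf = ps.DataFrame({"row_id": [1, 2, 3]})
```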
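
For S7189, a DataFrame that feeds several actions is recomputed from scratch for each one unless you cache or persist it:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# Compliant: cache a DataFrame reused by several actions, and release it
# when done. Without cache(), both actions below would rebuild the full
# lineage of "df" independently.
df.cache()
total = df.count()
per_bucket = df.groupBy("bucket").count().collect()
df.unpersist()
```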
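
For S7191, adding several columns one `withColumn` at a time grows the query plan node by node; `withColumns` (available since Spark 3.3) does it in one step:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])

# Noncompliant: every withColumn call adds a new projection to the plan.
df2 = df.withColumn("sum", F.col("a") + F.col("b")).withColumn(
    "diff", F.col("a") - F.col("b")
)

# Compliant (Spark 3.3+): add all columns in a single withColumns call.
df2 = df.withColumns(
    {"sum": F.col("a") + F.col("b"), "diff": F.col("a") - F.col("b")}
)
```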
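
For S7192, an unspecified join type defaults to `"inner"`, and a reader cannot tell whether dropping unmatched rows was intentional:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
users = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["id", "name"])
orders = spark.createDataFrame([(1, 99.0)], ["id", "amount"])

# Noncompliant: defaults to an inner join, silently dropping users
# without orders.
joined = users.join(orders, on="id")

# Compliant: spell out the join type, even when it is "inner".
joined = users.join(orders, on="id", how="left")
```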
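
For S7195, placeholder values such as an empty string are real values, not missing ones, and they break `isNull` and `na` handling downstream:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["id"])

# Noncompliant: "" is a real value, so isNull() will not flag it.
df2 = df.withColumn("comment", F.lit(""))

# Compliant: use lit(None), casting it so the column gets a proper type.
df2 = df.withColumn("comment", F.lit(None).cast("string"))
```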
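
Finally, for S7196, naming intermediate expressions keeps `withColumn`, `filter`, and `when` calls readable (the business conditions here are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(25, "FR", 120.0)], ["age", "country", "spend"])

# Noncompliant: the whole condition is crammed into one when() call.
df2 = df.withColumn(
    "segment",
    F.when(
        (F.col("age") >= 18)
        & (F.col("country") == "FR")
        & (F.col("spend") > 100),
        "premium",
    ).otherwise("standard"),
)

# Compliant: name the intermediate expressions, then combine them.
is_adult = F.col("age") >= 18
is_domestic = F.col("country") == "FR"
is_big_spender = F.col("spend") > 100

df2 = df.withColumn(
    "segment",
    F.when(is_adult & is_domestic & is_big_spender, "premium").otherwise(
        "standard"
    ),
)
```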

We welcome your feedback on these rules.

Jean