SonarQube - PySpark Code Quality challenges

Must-share information (formatted with Markdown):

  • which versions are you using (SonarQube, Scanner, Plugin, and any relevant extension)

SonarQube Developer Edition - 8.3.1 and Python Scanner

  • what are you trying to achieve

PySpark (Running PySpark in Azure Databricks notebook) code quality check.

  • what have you tried so far to achieve this

We set up a code quality check in SonarQube for a project written in PySpark, but we are getting errors:

`spark` is not defined, and the same error appears for constants defined in other files.

We run our PySpark code in Databricks, which provides the Spark library as part of the runtime, and our code relies on that built-in library. SonarQube does not detect this and raises errors; the attached screenshot shows one example. Please guide us on how to measure the code quality of PySpark code.

There is one more issue: we define some constants in a separate file, which is imported by the main Python file, and these constants are used only in the main file.

For example, we have a constant called SQL that is defined in a separate file and referenced in the main file, and we get the same error as in the image above.

Please help and let us know if you have any further questions.

Hi @Pravish_Jain,

The issues you see come from rule S3827: this rule flags any variable that is neither defined nor imported (via a Python import statement).

Can you please share how the spark variable is imported? Does it rely on Databricks’ special “%run” commands?

Please be aware that the Python analyzer does not have any special support for Databricks (or any other notebook format) for the time being.

How is SQL imported? Is it something specific to Databricks?

  1. Can you please share how spark variable is imported? Is it relying on Databricks special “%run” commands?

The Azure Databricks runtime version we use includes Apache Spark 2.4.5 with Python 3 support.
Since the runtime itself includes Spark, we leverage Spark functionality through it.
So the variable “spark” is not explicitly imported anywhere in the Python notebook; it is provided directly by the Databricks runtime.
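One way to keep such notebook code analyzable is to obtain the SparkSession explicitly instead of relying on the implicit runtime variable, so static analyzers see a definition for the `spark` name. A minimal, hedged sketch (the `except` fallback is only there so the snippet also runs where pyspark is unavailable; on Databricks, `getOrCreate()` returns the session the runtime has already created, so behavior there is unchanged):

```python
# Sketch: give static analyzers an explicit definition of "spark".
# On Databricks, getOrCreate() reuses the runtime's existing session.
try:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
except Exception:
    # Fallback so this snippet also runs where pyspark/Java is not set up.
    spark = None
```

With this at the top of the notebook, rule S3827 no longer sees `spark` as an undefined name.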

  2. How is SQL imported? Is it something specific to Databricks?

The constants are defined in a separate Python file (Constants.py), which is imported using the Databricks %run command:
%run ./Constants.py # has constants like: AVRO = "avro", SQL = "sql"
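A workaround on the SonarQube side is to replace the `%run` dependency with a regular `import`, which rule S3827 does understand. A minimal sketch, assuming the constants file is named Constants.py as above; the temporary-directory setup exists only to make the snippet self-contained (in a real project, Constants.py is simply a checked-in module next to the notebook):

```python
import pathlib
import sys
import tempfile

# Stand-in for the checked-in Constants.py module from the thread.
tmp = tempfile.mkdtemp()
pathlib.Path(tmp, "Constants.py").write_text('AVRO = "avro"\nSQL = "sql"\n')
sys.path.insert(0, tmp)

# A plain import (instead of "%run ./Constants.py") is visible to
# static analysis, so SQL and AVRO are no longer flagged as undefined.
from Constants import AVRO, SQL

print(AVRO, SQL)  # avro sql
```

Databricks can import regular `.py` files on the cluster's Python path, so this keeps the notebook working while making the names resolvable for SonarQube.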

And since, as you mentioned, the Python analyzer doesn’t have any special support for Databricks, is there any proposed timeline for adding it?
@Andrea_Guarino… anything on this from your side?

Hi @Pravish_Jain,

Support for notebooks (like Databricks or Jupyter) is something we might want to add, but I cannot give a timeline because it is not at the top of our priorities.

For the time being, I would suggest disabling rule S3827, which generates those false positives, in your Quality Profile.
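If disabling the rule for the whole Quality Profile is too broad, the same effect can be scoped to one project through analysis parameters. A sketch, assuming the standard `sonar.issue.ignore.multicriteria` exclusion mechanism in `sonar-project.properties` and that all affected files are Python sources:

```properties
# sonar-project.properties: ignore S3827 findings in Python files only
sonar.issue.ignore.multicriteria=e1
sonar.issue.ignore.multicriteria.e1.ruleKey=python:S3827
sonar.issue.ignore.multicriteria.e1.resourceKey=**/*.py
```

This keeps S3827 active for other projects sharing the Quality Profile.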