SonarQube - PySpark Code Quality challenges

Must-share information (formatted with Markdown):

  • which versions are you using (SonarQube, Scanner, Plugin, and any relevant extension)

SonarQube Developer Edition - 8.3.1 and Python Scanner

  • what are you trying to achieve

PySpark (Running PySpark in Azure Databricks notebook) code quality check.

  • what have you tried so far to achieve this

We set up a code quality check in SonarQube for a project written in PySpark, but we are getting errors:

`spark` is not defined, and the same error appears for constants defined in other files.

We run our PySpark code in Databricks, which provides the Spark library as part of the runtime, and our code relies on that built-in library. SonarQube does not detect this and raises errors; the attached screenshot shows one example. Please guide us on how to measure the code quality of PySpark code.

There is one more issue: we define some constants in a separate file, which is imported by the main Python file, and these constants are used only in the main file.

For example, we have a constant called SQL that is defined in a separate file and referenced in the main file, and we get the same error as in the image above.

Please help and let us know if you have any further questions.

Hi @Pravish_Jain,

The issues you see come from rule S3827: this rule flags any variable that is neither defined nor imported (via a Python import statement).

Can you please share how the spark variable is imported? Does it rely on Databricks’ special “%run” commands?

Please be aware that the Python analyzer does not have any special support for Databricks (or any other notebook format) for the time being.

How is SQL imported? Is it something specific to Databricks?

  1. Can you please share how spark variable is imported? Is it relying on Databricks special “%run” commands?

The Azure Databricks runtime version we use includes Apache Spark 2.4.5 with Python 3 support.
Since the runtime itself includes Spark, we leverage Spark functionality through it.
So the variable “spark” is not explicitly imported anywhere in the Python notebook; it is provided directly by the Databricks runtime.
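One way to keep such notebook code analyzable is to obtain the SparkSession explicitly instead of relying on the implicit runtime variable, so static analyzers see a definition for the `spark` name. A minimal, hedged sketch (the `except` fallback is only there so the snippet also runs where pyspark is unavailable; on Databricks, `getOrCreate()` returns the session the runtime has already created, so behavior there is unchanged):

```python
# Sketch: give static analyzers an explicit definition of "spark".
# On Databricks, getOrCreate() reuses the runtime's existing session.
try:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
except Exception:
    # Fallback so this snippet also runs where pyspark/Java is not set up.
    spark = None
```

With this at the top of the notebook, rule S3827 no longer sees `spark` as an undefined name.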

  2. How is SQL imported? Is it something specific to Databricks?

The constants are defined in a separate Python file (Constants.py), which is imported using the Databricks %run command:
%run ./Constants.py # has constants like: AVRO = "avro", SQL = "sql"
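A workaround on the SonarQube side is to replace the `%run` dependency with a regular `import`, which rule S3827 does understand. A minimal sketch, assuming the constants file is named Constants.py as above; the temporary-directory setup exists only to make the snippet self-contained (in a real project, Constants.py is simply a checked-in module next to the notebook):

```python
import pathlib
import sys
import tempfile

# Stand-in for the checked-in Constants.py module from the thread.
tmp = tempfile.mkdtemp()
pathlib.Path(tmp, "Constants.py").write_text('AVRO = "avro"\nSQL = "sql"\n')
sys.path.insert(0, tmp)

# A plain import (instead of "%run ./Constants.py") is visible to
# static analysis, so SQL and AVRO are no longer flagged as undefined.
from Constants import AVRO, SQL

print(AVRO, SQL)  # avro sql
```

Databricks can import regular `.py` files on the cluster's Python path, so this keeps the notebook working while making the names resolvable for SonarQube.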

And since, as you mentioned, the Python analyzer doesn’t have any special support for Databricks, is there any proposed timeline for adding it?
@Andrea_Guarino… anything on this from your side?

Hi @Pravish_Jain,

Support for notebooks (like Databricks or Jupyter) is something we might want to add, but I cannot give a timeline because it is not at the top of our priorities.

For the time being, I would suggest disabling rule S3827, which generates those false positives, in your Quality Profile.
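If disabling the rule for the whole Quality Profile is too broad, the same effect can be scoped to one project through analysis parameters. A sketch, assuming the standard `sonar.issue.ignore.multicriteria` exclusion mechanism in `sonar-project.properties` and that all affected files are Python sources:

```properties
# sonar-project.properties: ignore S3827 findings in Python files only
sonar.issue.ignore.multicriteria=e1
sonar.issue.ignore.multicriteria.e1.ruleKey=python:S3827
sonar.issue.ignore.multicriteria.e1.resourceKey=**/*.py
```

This keeps S3827 active for other projects sharing the Quality Profile.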