Analyzing a Python project with SonarCloud generates the following output:
INFO: The Python analyzer was able to leverage cached data from previous analyses for 0 out of 147 files. These files were not parsed.
How can I leverage cached data to speed up the analysis?
The same project contains a single JavaScript file that takes a relatively long time to analyze but is rarely changed. The analysis generates the following output:
INFO: Hit the cache for 0 out of 1
INFO: Miss the cache for 1 out of 1: ANALYSIS_MODE_INELIGIBLE [1/1]
How can I leverage the cache for the JavaScript file?
Sorry, I should have been more specific. The analysis runs inside a Docker container on a build server. I have already mounted ~/.sonar/cache into the Docker container, which speeds up the loading/downloading of plugins. However, that folder seems to contain only plugins and no analysis results. Where are those cached?
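For context, the mount described above can be sketched roughly as follows. This is only an illustration: the image name and container paths are placeholders, and the assumption is that the scanner resolves its local data under `SONAR_USER_HOME` (which defaults to `~/.sonar`, with the plugin cache in its `cache` subdirectory):

```shell
# Hypothetical sketch: reuse the host's scanner cache inside the container.
# "my-build-image" and the container path are placeholders.
docker run --rm \
  -v "$HOME/.sonar/cache:/root/.sonar/cache" \
  -e SONAR_USER_HOME=/root/.sonar \
  my-build-image \
  sonar-scanner
```

As noted, this directory only holds downloaded plugins, not analysis results.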
In fact, incremental analysis is enabled for JavaScript on SonarCloud, but I don’t believe it’s available yet for Python. Sorry, I had forgotten that and had to do some digging.
I would expect it to be available for Python “soon”.
Incremental analysis is available for all languages except C++ and C# (support is coming in the next few weeks, by the end of March at the latest). It works only for pull request analyses; branch analyses still do a full scan each time you push to the branch. The announcement will be published later today.
So it works for JavaScript, TypeScript, and Python, and there is nothing you need to do to enable it. It relies on a cache on the server side, so nothing needs to change in your scanner-side configuration.
Some INFO logs are there mainly to help us understand what’s going on when we receive logs from users, and we have already identified that some of these logs are misleading for Python. The fix should be deployed soon.
The only thing you can check is to have this enabled in Administration > General Settings:
@Mr-Pepe
Do you see your Pull Request analyses running faster than a month ago?
The analysis of that particular project is running faster now compared to a month ago. However, that seems mostly due to a faster JavaScript analysis (single JavaScript file in an otherwise Python-only project). The JavaScript analysis now takes 5197ms instead of 41943ms. The Python analysis now takes 6633ms instead of 7863ms.
The newer (faster) pull request did not even change any Python files that SonarCloud cares about (none in sonar.sources). Does the Python sensor simply need that time even if all files can be retrieved from the cache? Are many network calls made to determine which files have to be checked?
I am generally looking into ways to speed up our pipelines and parallelized a lot of steps. However, SonarCloud has to be run sequentially after other steps because it reads in test results. This adds 30 to 60 seconds to each pipeline run of a pull request.
39 seconds is not an extremely long time, but it makes up a significant part of an otherwise well-optimized pipeline. Incremental analysis seems to be enabled already, so where else could I realize performance improvements? Can sensors be executed in parallel? Can the analysis be split into parts? The loading of plugins and rules could run in parallel with other pipeline steps, and the sensors could execute afterwards. Actually, only the sensors that read in test results (e.g., PythonXUnitSensor and Cobertura) would have to be deferred.
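To make the dependency concrete: the sensors that read external reports are driven by scanner properties, so those reports must exist before the scanner starts. A sketch of the relevant invocation (the property names are the standard ones for the Python analyzer; the report paths are placeholders):

```shell
# Hypothetical sketch: these properties feed the test-result sensors
# (xunit and coverage), so the report files must be generated by earlier
# pipeline steps before sonar-scanner runs. Paths are placeholders.
sonar-scanner \
  -Dsonar.python.xunit.reportPath=reports/xunit.xml \
  -Dsonar.python.coverage.reportPaths=reports/coverage.xml
```

This is why the scan currently has to run sequentially after the test steps.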
Last question: Why is caching only enabled for PR builds? Will it become available for branch builds in the future?
Incremental analysis on PRs is just the beginning of a long journey, and we definitely want to enable incremental analysis on branches in the future. We did it first on PRs because we believe that is where developers get the biggest benefit today, while waiting a bit longer for branch analysis is more acceptable. We had to make a choice between PRs and branches, and PRs won.
The good news is that we have ideas for improvements. Our goal is to be very fast (less than 10 seconds) when a PR only touches languages/files that we don’t support. This should be the case in a “no change” scenario.
Today, we retrieve the cached data for the security analysis in this line even when there is no need to retrieve it:
This is why you see time spent on this sensor even though you changed no Python files. We have already identified this as a potential source of time gain.
We can also expect some time gain in this step:
Load active rules (done) | time=3128ms
In the coming months, on your example, you could potentially expect a gain of 10-15 seconds.
Meanwhile, there is not much you can do on your side. The only limiting factor will be the speed of your storage, because of the amount of I/O we do. If you can afford a fast SSD, that could help you reach your goal.