SonarCloud Outage: JsSecuritySensor plugin update is broken

The updated SonarCloud plugins have broken Sonar analysis for the TS security rules, which now take more than 10 times longer.

This started shortly after noon UTC.

We have looked at logs from noon UTC and from 9PM UTC and noticed the following:

The first thing to note is that the Load/download plugins step went from 3s to 48s. This is a strong sign that the plugins are new and not the ones from our daily cache.

The Python Sensor step used to take 77s and now takes 81s. Nothing abnormal here; for Python the performance is comparable.

The Code Quality and Security for Go step used to take 85s and now takes 93s, also very similar. The GolangCI-Lint step went from 7s to 5s, also normal.

For JavaScript/TypeScript:

  • JavaScript analysis 5.7s vs 5.8s - ok
  • TypeScript analysis 145s vs 150s - ok

Security sensors:

  • PythonSecuritySensor 23s vs 16s - ok
  • JsSecuritySensor 116s vs pipeline timed out after more than 18m in this step.

There is a huge problem with the JsSecuritySensor. Apparently there was a Friday production push that is causing significant customer issues. Is there anyone on staff at SonarSource to address production outages? We had a massive problem with SonarSource on a Friday back in September, but we still lack a prompt escalation procedure.

Some more details:

  • Step JsSecuritySensor 117s → pipeline timed out after 36m in this step
  • rule S6105 12s → 2m51s
  • rule S5696 12s → 3m7s
  • rule S5334 13s → 2m36s
  • rule S2083 10s → 2m47s
  • rule S5146 N/A → 3m21s
  • rule S5147 12s → 3m58s
  • rule S5883 N/A → 3m37s
  • rule S5131 12s → N/A
  • rule S2076 N/A → 3m7s
  • rule S5131 N/A → 2m37s
  • rule S2631 10s → 2m21s
  • rule S5144 N/A → 2m52s
  • rule S3649 12s → 2m51s
  • rule S6096 10s → pipeline timed out
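
To put numbers on the slowdown, the per-rule timings above can be converted to seconds and compared. This is a minimal sketch in Python; the helper names `to_seconds` and `slowdown` are illustrative, and only rules that have both a before and an after timing are included.

```python
import re

def to_seconds(duration: str) -> int:
    """Convert a duration such as '2m51s' or '12s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", duration)
    if not match or not duration:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds or 0)

# Before/after timings copied from the list above (rules with both values only).
timings = {
    "S6105": ("12s", "2m51s"),
    "S5696": ("12s", "3m7s"),
    "S5334": ("13s", "2m36s"),
    "S2083": ("10s", "2m47s"),
    "S5147": ("12s", "3m58s"),
    "S2631": ("10s", "2m21s"),
    "S3649": ("12s", "2m51s"),
}

def slowdown(before: str, after: str) -> float:
    """Ratio of the new duration to the old one."""
    return to_seconds(after) / to_seconds(before)

for rule, (before, after) in timings.items():
    print(f"{rule}: {slowdown(before, after):.1f}x slower")
```

Every rule with both timings comes out at 12x or more, which is consistent with the "more than 10 times longer" estimate.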

The Java heap is also running out of memory, even though it is not full:
[2021-03-26T23:06:45.475Z] INFO: Final Memory: 1397M/2816M
[2021-03-26T23:06:45.475Z] INFO: ------------------------------------------------------------------------
[2021-03-26T23:06:45.475Z] ERROR: Error during SonarScanner execution
[2021-03-26T23:06:45.475Z] java.lang.OutOfMemoryError: Java heap space
[2021-03-26T23:06:45.475Z] at java.base/java.util.HashMap.newNode(Unknown Source)
[2021-03-26T23:06:45.475Z] at java.base/java.util.HashMap.putVal(Unknown Source)
[2021-03-26T23:06:45.475Z] at java.base/java.util.HashMap.put(Unknown Source)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I.B(na:615)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I$$Lambda$3306/0x0000000801237440.get(Unknown Source)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I.A(na:2905)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I.C(na:2201)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I.A(na:1799)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.A.T.A(na:1995)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I.B(na:713)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I$$Lambda$3306/0x0000000801237440.get(Unknown Source)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I.A(na:2905)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I.C(na:2201)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.I.A(na:1799)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.A.T.A(na:1995)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.C.C.A(na:706)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.C.C$$Lambda$3304/0x0000000801236c40.apply(Unknown Source)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.G.A(na:1507)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.G$$Lambda$3305/0x0000000801237040.apply(Unknown Source)
[2021-03-26T23:06:45.475Z] at java.base/java.util.HashMap.replaceAll(Unknown Source)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.G.replaceAll(na:1051)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.C.C.A(na:706)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.G.A(na:1094)
[2021-03-26T23:06:45.475Z] at com.sonar.security.analysis.D.D.G.A(na:2623)
[2021-03-26T23:06:45.476Z] at com.sonar.security.analysis.D.A.T.A(na:2192)
[2021-03-26T23:06:45.476Z] at com.sonar.security.analysis.D.C.A(na:851)
[2021-03-26T23:06:45.476Z] at com.sonar.security.analysis.D.C.A(na:1965)
[2021-03-26T23:06:45.476Z] at com.sonar.security.analysis.D.C.B(na:499)
[2021-03-26T23:06:45.476Z] at com.sonar.security.analysis.D.C.A(na:2226)
[2021-03-26T23:06:45.476Z] at com.sonar.security.analysis.D.C.A(na:3458)
[2021-03-26T23:06:45.476Z] at com.sonar.security.analysis.D.J$_B.A(na:1114)
[2021-03-26T23:06:45.476Z] at com.sonar.security.analysis.D.J$_B.C(na:696)

Considering that we’ve seen the heap error multiple times while only 50% of the heap was in use, I strongly suspect that there is a memory fragmentation problem with the new plugins.

Hi @sodul

The open incident (see https://sonarcloud.statuspage.io/) is related to this issue. A fix is in the works, which will essentially roll back these versions of the analyzers.

Regards,
@AlxO

Hello @sodul,

thank you very much for your report. We are currently investigating these problems to understand and properly fix them in the new version of the JS analyzer.

Just to clarify, you reported both a timeout in your CI pipeline and an OOM error. For any given analysis run, only one or the other should be possible, not both. Hence, I assume you are talking about different analyses on different projects?

Thanks,
@Malte_Skoruppa

Hi @sodul,

in fact, it might be the case that the OOM was the reason for the timeout: The security sensor was killed by the OOM, and so it did not produce any further results until the worker thread reached the timeout.

How much memory did you allocate exactly for the JVM (i.e., what is your -Xmx)?

Thanks,
@Malte_Skoruppa

That is correct. We first ran into the timeout issue, so we bumped the timeout, but then we ran out of memory, so we increased the container memory from 5GB to 6GB and the -Xmx from 2.25GB to 2.75GB. Then we ran into a timeout, which we bumped, and then the OOM again. Each time we hit the heap issue I noticed that the final memory usage was less than 50% of the actual heap size, which is consistent with memory fragmentation issues.
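
For reference, the 2.75GB heap corresponds to the 2816M total shown in the "Final Memory" log line. The sketch below shows how such settings could be derived, assuming the heap size is passed through the `SONAR_SCANNER_OPTS` environment variable (the thread does not say how it is actually set in this pipeline); the function names are illustrative.

```python
def xmx_flag(heap_gb: float) -> str:
    """Build a JVM -Xmx flag in megabytes from a heap size in GB."""
    return f"-Xmx{int(heap_gb * 1024)}m"

def non_heap_headroom_gb(container_gb: float, heap_gb: float) -> float:
    """Memory left in the container for everything other than the Java heap."""
    return container_gb - heap_gb

# Values quoted in the post: (container GB, heap GB) before and after the bump.
for container_gb, heap_gb in [(5, 2.25), (6, 2.75)]:
    # Assumption: the flag is handed to the scanner via SONAR_SCANNER_OPTS.
    env = {"SONAR_SCANNER_OPTS": xmx_flag(heap_gb)}
    print(env, "headroom:", non_heap_headroom_gb(container_gb, heap_gb), "GB")
```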

Stephane,

thank you for your feedback.

To give you some context, the JS analyzer in the previous version was more of a feature preview and the new JS analyzer (which was rolled back for the time being) performs a substantially deeper analysis. Therefore, a longer running time is expected. However, in your case the time increase is indeed more substantial than what we observed in our tests.

The same thing goes for the memory requirements. As you seem to be working rather close to the limit you used before, it could potentially be the case that the new version needed, say, twice as much memory as before for your particular project. Without enough memory, the analysis may first slow down before it runs into an OOM error. So if the timeout is too low, you run into a timeout; by increasing the timeout, you then get to the point where you get an OOM error. By then increasing the memory slightly (but not enough), it then takes longer for the OOM to occur, so you may run into a timeout again, and so on. In other words: It is possible your problem could be fixed by giving the scanner more than enough memory (I cannot say for sure how much would be needed, as that depends on the analyzed application’s source code: For example, the more modules you include in the source code, the higher the memory requirements for analysis).

Finally, please note that the fact that the final memory usage does not match the actual heap size does not necessarily indicate a memory fragmentation problem. This is because the “final memory” output that you see refers to the memory used by the plugin after the core engine crashed with an OOM, so much of the used memory may already have been garbage collected. But it may very well have reached the limit while it was running (hence the OOM).
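
The ratio in question can be read directly from the scanner log. The small sketch below parses the `Final Memory` line quoted earlier in the thread; the regular expression assumes the `usedM/totalM` format shown there, and the function name is illustrative.

```python
import re

def final_memory_ratio(log_line: str) -> float:
    """Return used/total from a 'Final Memory: 1397M/2816M' log line."""
    match = re.search(r"Final Memory: (\d+)M/(\d+)M", log_line)
    if match is None:
        raise ValueError("no 'Final Memory' figures found")
    used_mb, total_mb = map(int, match.groups())
    return used_mb / total_mb

line = "[2021-03-26T23:06:45.475Z] INFO: Final Memory: 1397M/2816M"
print(f"{final_memory_ratio(line):.1%} of the heap in use when the line was logged")
```

Because the line is logged after the failure, a value just under 50% is a snapshot and says nothing about peak usage during the analysis.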

Whatever the case, we are currently looking into optimizing the memory and time requirements of our core engine to mitigate this problem.

Best,
@Malte_Skoruppa

Thank you for the details. If the new analysis is going to take significantly longer and use more memory, this is going to be a deal breaker for our CI pipelines, especially since Sonar does not allow analyzing different languages in parallel.

What would be ideal for us is to allow the Sonar scanner to analyze each language independently and then have a final gathering step to aggregate the results and push them to sonar cloud.
This would allow us to run our CI pipelines faster overall.

We do that for our code coverage reports already. Each language has its own set of stages running in parallel in independent Kubernetes pods, they all complete at different times, then we have a final Sonar stage that aggregates all the coverage reports. If each language could have its own Sonar stage, this would help mitigate the slowness of Sonar. We would have no issue having a final Sonar upload stage whose sole job is to aggregate the per-language analyses and upload them atomically to SonarCloud.

This would be more cost efficient as well since the memory and CPU could be tuned for each language instead of for the most hungry analysis.
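
To make the proposal concrete, here is a minimal fan-out/fan-in sketch in Python. It is purely hypothetical: per the replies in this thread the scanner does not support per-language analysis with a separate upload step, and `analyze_language` and `aggregate` are placeholder functions standing in for stages that do not exist.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_language(language: str) -> dict:
    """Placeholder for a per-language analysis stage (hypothetical)."""
    # A real stage would run in its own pod and return a partial report.
    return {"language": language, "issues": []}

def aggregate(reports: list[dict]) -> dict:
    """Placeholder for the final stage that merges the partial reports."""
    return {
        "languages": sorted(r["language"] for r in reports),
        "issues": [issue for r in reports for issue in r["issues"]],
    }

def run_pipeline(languages: list[str]) -> dict:
    # Fan out: one worker per language, all running concurrently.
    with ThreadPoolExecutor(max_workers=len(languages)) as pool:
        reports = list(pool.map(analyze_language, languages))
    # Fan in: a single aggregated result that would be uploaded atomically.
    return aggregate(reports)

print(run_pipeline(["go", "python", "typescript"]))
```

With this shape, the wall-clock time of the analysis is bounded by the slowest language rather than by the sum of all of them.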

I know Sonar has the ‘monorepo’ support where we could manage each language as a separate project on Sonar, but unfortunately that would make our PR configuration on GitHub more complicated (would they report as separate checks on the branch protection? What if the PR should skip some language?), and we have many common resources across the languages we use (protobuffers, swagger, …). So we need the atomic reporting to ensure everything works together when on main.

The other problem we have is that once we identified the rules that caused the problems, we were unable to disable them. We have a custom set of rules, ‘company way’, that inherits from ‘Sonar Way’, and we can add rules, but not disable rules. We tried to disable the rules from the sonar properties file, but that only disabled reporting, not analysis of these rules.

Can the rules management be updated to allow for disabling rules? We could ‘clone’ Sonar Way and disable the rules but that would mean that when ‘Sonar Way’ is updated we would have to manually re-copy it, manually re-disable all the rules we need to avoid and change the parent of Company Way. Support for disabling in child rules would really help.

Hello Stephane,

As you correctly assessed, currently it’s not possible to disable a rule from an inherited quality profile. This is an issue we are gonna look into to improve the situation, no ETA though.

Regarding your suggestion of analyzing different languages, it’s indeed not possible at the moment, but this is a suggestion worth having a look at. I escalated the topic to our PMs, but I can’t tell you yet if it’s something we will want to invest in or not.

Separately, yesterday we deployed a new, faster version of the JS/TS security analyzer. It contains the same new rules as the version that was making your project OOM and time out, but it is way faster. We still observed a small average duration increase compared to the previous version, but it should not be an issue anymore. You can find more info about it in this thread.

Thanks for the update, Gregoire. Our Sonar stage is now taking 14m on average when it was taking 10-11m before the updates. Sonar is now the slowest part of our pipeline, and since we cannot parallelize it, it is difficult for us to keep using it. While having the analysis is definitely a plus, the slowness is a big burden now.

Is there a way we could perform a nightly analysis, cache the results, and feed the cache to our CI pipeline? We have to find a solution to get our CI pipelines under reasonable times. Our pull request turnaround times are becoming a problem.

So if I understand correctly, when you talk about caching the result, you are thinking about not analyzing your PR anymore? Because if you want to feed some cached results to a PR analysis, then you are not actually analyzing the PR’s new code, so you might as well disable it for PRs.

If that’s what you have in mind, you could possibly shift your workflow to use the new code period instead of relying on PRs, setting it up based on your released version and running the analysis nightly on your main branch.
You could also keep the analysis on PR but stop making it mandatory for merging.

I’m aware that it’s probably not the kind of solution you are looking for, but unfortunately right now we don’t yet have a solution for this kind of issue where the analysis of a language becomes too long. This is something we will want to address at some point, but as you can guess it won’t happen overnight.

I’m talking about doing an incremental analysis so that we do not take as long on every PR. SonarLint is very snappy, for example, and will give me analysis on open files; it does not take 10-15m for me to get results. If the CLI analysis were able to cache some of the computation between the main branch from the past 24h and the latest main, or a fresh PR, then we might be able to speed up the Sonar stage. We use Kubernetes to run our CI pipelines, so any caching has to be done explicitly. Today we cache the plugins, which saves us 1m per analysis, unless the plugins change. Are there Sonar cache files that we could stash and then unstash before analyzing?
We could then create a separate pipeline that would only cache, and ideally not even publish to SonarSource … if that’s an option.
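
The kind of explicit cache described here can be sketched as follows. This is an illustration only: the cache key scheme, the directory layout, and the function names are assumptions, and the upload to and download from object storage is left out.

```python
import hashlib
import tarfile
from pathlib import Path

def cache_key(versions: dict[str, str]) -> str:
    """Derive a cache key that changes whenever any plugin version changes.

    The name-to-version mapping is a hypothetical input; where it would
    come from is not covered here.
    """
    canonical = "\n".join(f"{name}={ver}" for name, ver in sorted(versions.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def stash(cache_dir: Path, archive: Path) -> None:
    """Pack a cache directory into a .tgz archive."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(cache_dir, arcname=".")

def unstash(archive: Path, cache_dir: Path) -> None:
    """Unpack a previously stashed archive into a fresh directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(cache_dir)
```

Keying the archive name on the plugin versions means a plugin update produces a cache miss instead of silently restoring stale files.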

Hello @sodul,

I don’t know exactly what languages you are analyzing, but since you mentioned several, if one of them is C or C++, we do have a feature to cache analysis results in these languages, and then analyze incrementally. You can look at “Analysis cache” in the documentation.

There have been some discussions to do something similar for other languages (this is language-specific, because a change in one file can have impacts in other unmodified files, so this requires good knowledge of the dependencies between files), but I’m not aware of current actions on this subject. Is there a language on which you think you would benefit most from this?
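
The dependency problem mentioned here can be illustrated with a small sketch: given a map of which files import which, the set to re-analyze is the changed files plus everything that depends on them, directly or transitively. The graph and file names below are invented for illustration, and this does not describe how any Sonar analyzer actually works.

```python
from collections import defaultdict, deque

def files_to_reanalyze(imports: dict[str, set[str]], changed: set[str]) -> set[str]:
    """Return the changed files plus all files that transitively depend on them."""
    # Invert the graph: for each file, record which files import it.
    dependents = defaultdict(set)
    for source, targets in imports.items():
        for target in targets:
            dependents[target].add(source)

    # Breadth-first walk over the reverse edges, starting from the changed files.
    result = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in result:
                result.add(dependent)
                queue.append(dependent)
    return result

# Hypothetical project: api.ts imports auth.ts, which imports util.ts.
imports = {
    "api.ts": {"auth.ts"},
    "auth.ts": {"util.ts"},
    "ui.ts": set(),
}
print(sorted(files_to_reanalyze(imports, {"util.ts"})))
```

A change to a widely imported file pulls in most of the project, which is one reason an incremental mode needs accurate, language-specific dependency information.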

Our top 3 languages are Go, Python and TypeScript, in that order.
It seems that TypeScript is the slowest of the bunch for us and could probably benefit the most, but more of our engineers would benefit from faster Go.

We do have some C/C++ but that’s insignificant in our codebase.

Hello @sodul,

Thank you for the interesting information!

As @JolyLoic mentioned, although incremental analysis is not on our short term roadmap, we have this in mind for the future of the service.

Maybe there are other improvements that can be done to the process. Could you give us more details around the integration of SonarCloud into your workflow? I think you have mentioned Jenkins and GitHub in this thread, and it would be good to understand how all this is used with SonarCloud.

Hi @Martin_Bednorz,

We operate in a serverless infrastructure. We do use GitHub Actions, but that’s for a very small part of our work and we can pretty much ignore that.

The vast majority of our development is done on a monorepo that is hosted on GitHub.com, our CI/CD infrastructure is in AWS.

To run our CI/CD pipelines we use Jenkins. We use the GitHub plugin so that every code commit to one of our repositories with a Jenkinsfile triggers a CI pipeline.
The Jenkins workload itself is also serverless. The Controller is running on EKS (AWS version of Kubernetes) and all pipeline stages get a fresh new Pod spun up with new containers. Each stage may use a custom container for the task that needs to be completed. The big advantage of that setup is that we scale our capacity automatically in minutes and we scale down our EC2 instances as well. This is very cost efficient.

Now it does introduce a few differences:

  • No persistence from run to run. The volumes are always brand new which helps guarantee reproducibility.
  • Any caching we perform is done by uploading/downloading a tgz from S3. We are considering EFS but it would not be cost effective for us yet.
  • Memory tuning is tricky. We do not want to allocate more than needed in order to keep our costs under control so right sizing is more important than with traditional hardware. We need to account for Kernel Memory, Cache and Processes memory in Kubernetes. We have some tooling to help introspect memory consumption and we have a good handle on this nowadays.
  • We perform stages in parallel as much as possible. I cannot go into the number details but we do have many parallel stages to run a large number of unit tests on each PR.
  • We do optimize which stages to run for each PR. If a PR only has Python code, we skip our Go and TypeScript stages, and tell Sonar to ignore these files (since it is for a PR, that’s fine).
  • Once a test stage is completed we collect the test results and coverage information and store the tgz on s3 for use later in the CI pipeline.
  • Once all test stages are completed the Sonar stage triggers, it runs on an ephemeral pod, just like all other stages. It clones the repo like the other stages, we unshallow that clone to get full blame information, then we download all the tgz files from previous stages. We then run the sonar-scanner which is not taking much longer.
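
The stage selection described in the list above could look roughly like this. It is a sketch, not the actual tooling used in this pipeline: the extension map, the example paths, and the function names are assumptions, and the exclusion patterns are simply what one might pass to the scanner's `sonar.exclusions` property.

```python
from pathlib import PurePosixPath

# Assumed mapping from file extension to language stage.
STAGE_BY_EXTENSION = {".go": "go", ".py": "python", ".ts": "typescript", ".tsx": "typescript"}

def stages_for(changed_files: list[str]) -> set[str]:
    """Return the language stages that a PR's changed files require."""
    return {
        STAGE_BY_EXTENSION[ext]
        for ext in (PurePosixPath(f).suffix for f in changed_files)
        if ext in STAGE_BY_EXTENSION
    }

def exclusions_for(skipped_stages: set[str]) -> str:
    """Build a comma-separated glob list covering the skipped languages."""
    patterns = sorted(
        f"**/*{ext}" for ext, stage in STAGE_BY_EXTENSION.items() if stage in skipped_stages
    )
    return ",".join(patterns)

# Hypothetical PR that only touches Python code and documentation.
changed = ["services/billing/handler.py", "docs/README.md"]
needed = stages_for(changed)
skipped = set(STAGE_BY_EXTENSION.values()) - needed
print(needed, exclusions_for(skipped))
```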

We are working on splitting up our repository into micro repos, but this will take time and is mostly an option for new projects while the existing codebase will still grow. We really think that if we could run the Sonar stage in parallel pods, one per language to analyze, then aggregate the data in a final stage, this could cut our Sonar time in half or even to a third.

I’m more than happy to set up a call to get into more details, but this is about as much as I’m able to share on a public forum.

Hello @sodul,

Thank you for sharing those details.

My initial feeling is that our monorepo feature might be a good fit for this setup, as it will allow the parallel execution of the analysis.

You would get a check per project that is being analyzed in that PR. How is your monorepo structured? If it is per project, you could set up your pipelines to only run when code in that project changed, and thus also only then trigger a SonarCloud analysis. If a PR touches multiple projects, they would run in parallel and you would get multiple checks and Quality Gates.

I suppose it would work if we were to trigger each language (Go, Python, TypeScript) separately. Do you know of OSS projects on GitHub that use Sonar in this manner? I would love to see how the monorepo integration would work with GH, especially the fact that multiple Sonar checks would be tied to a single PR.

Hello @sodul,

Sorry for getting back to you so late. Unfortunately I don’t know of an open source project, but you could take a look at one of my test projects: GitHub - martinbdz-test/monorepo. There is no real code, but it could be helpful to see how SonarCloud behaves in a monorepo setup where changes are made to either one or multiple projects.

Let me know if you have any questions.