Well-cached scans sometimes take a very long time

Must-share information (formatted with Markdown):

  • which versions are you using:
    • SonarQube 9.9.0
    • sonar-scanner 4.8.2856
  • how is SonarQube deployed: Helm
  • what are you trying to achieve: Scan code fast
  • what have you tried so far to achieve this: Use server-side caching.

Basically, the situation is this: with a cold cache the scan takes about 80 minutes; with a hot cache it usually takes about 10 minutes, give or take. This is on a 72-core Azure agent.

CPU utilization is usually around 100% for most of the scan, which is good. However, on some scans we get a lot of cache hits but the scan still takes 40-60 minutes. In these cases I’ve observed that CPU utilization stays around 100% for about 10 minutes (like on a normal hot-cache scan), then drops off, no more log messages appear, and 30-50 minutes later I get a message from the Z3 solver like this:

2023-03-29T09:06:07.2483235Z ##[debug] 09:06:07.246 - PCH: unique=0 use=0 (forceInclude=0,throughHeader=0,firstInclude=0) out of 3950 (forceInclude=0,throughHeader=0)
2023-03-29T09:06:07.9459789Z ##[debug] 09:06:07.944 - SE: 3938 out of 3950
2023-03-29T09:06:07.9461022Z ##[debug] 09:06:07.944 - Z3 refutation rate: 1853 out of 2710
2023-03-29T09:06:08.7757329Z ##[debug] 09:06:08.774 - Subprocess(es) done in 2841659ms

In that time CPU utilization is near 0 and no log messages are output, so my suspicion is that it’s just hanging somewhere. Alternatively, could this be a “normal” edge case of the Z3 solver? If so, how can it be avoided?

Hi Carl!

A similar case was reported in CPP-3682.

Right now there is nothing you can do about this, unless you are willing to disable Z3 refutation completely - which is not recommended, as it would increase the number of false-positive reports.

According to some experimental data in CPP-4267, Z3 refutation should not take long, except for some edge cases where we measured nine and a half minutes of running time for a single refutation call - which is really bad. In general, such calls should not take longer than a couple of seconds at most.

Thanks for raising this, this helps to prioritize a fix for this as part of CPP-4267.


Hi Balázs!

Thank you for the quick response. I’d prefer not to have more false positives. Do you have any idea how these refutation times can be reduced, though? What is causing this to sometimes eat up 40-60 minutes? Could refutation be parallelized better? CPU usage is near 0 while it’s doing the refutation.

Best regards,

Carl

No. I can only see two options: 1) disable refutation, or 2) exclude the long-running file(s) if there are only a handful of them. If I had to recommend one, I’d recommend the first; I don’t think the second option is a scalable solution.

We don’t know yet. Most projects seem to work just fine, but in some cases we can see that something goes wrong. This suggests to me that a few edge cases are likely causing the slowdown.

In general, it shouldn’t need parallelization beyond what we do right now by analyzing each translation unit (TU) individually. If there are TUs that take a long time, then that is most likely a bug on our side, and we should probably bound the analysis time somehow. Under these two assumptions, the analysis should utilize all cores throughout the whole run.

We know it’s a painful issue, and we will likely prioritize it accordingly, but I cannot give a deadline.

@balazs.benics since this happens often enough, I checked, and the PR diff size doesn’t seem to be the reason. Even small PRs sometimes trigger this performance cliff.

The interesting part I found is that the same PR will repeatedly trigger this performance cliff, meaning it’s probably a systematic issue.

However, since sonar-scanner, even with -X, doesn’t provide much information about what the Z3 solver is doing, I can’t see which files/TUs are causing this. Maybe there is some pattern there, e.g. many includes, large files, etc.

Is there some debug option on the Z3 solver that could be bubbled up to sonar-scanner? That way maybe we can help diagnose this issue faster.

Best regards,

Carl

As of now, we don’t log Z3 refutation times. So no, we don’t have anything to help the investigation.

Have you checked that with disabled Z3 refutation, the PR analysis takes considerably less time?

@Carl My colleague suggested that you could sort log entries like this one:
16:22:52 15:22:48.079 INFO: [pool-3-thread-3] …
by pool and thread. This way, by subtracting two subsequent timestamps on the same thread, you get a rough idea of how long the given file took to analyze.

After finding the offending file, you could create a reproducer for it by using the sonar.cfamily.reproducer=/path/to/file.cpp option. Let me know if you would prefer to share it only privately, in which case I’ll PM you to open a channel.
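
In case it helps, passing that property on the scanner command line would look roughly like this (adjust it to however your pipeline invokes the scanner; -X just keeps the debug logging you are already using):

sonar-scanner -X \
  -Dsonar.cfamily.reproducer=/path/to/file.cpp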

@balazs.benics I tried this now

cat sonarqube.log | rg --color never -p '\d\d:\d\d:\d\d\.\d\d\d - \[pool' | perl -pe 's/.*(\d\d:\d\d:\d\d\.\d\d\d) - \[pool-(\d*)-thread-(\d*)\]\s+(.*)/$2-$3-$1-$4/' | sort -h | code -

I put the log in sonarqube.log for this, used rg to filter all the messages from the scanner, perl to reorder the fields, and sort -h to sort by the new key (pool number, thread number, then timestamp).
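
To actually do the subtraction, something like this rough awk sketch could be run over the saved output of that pipeline (reordered.log is just a placeholder name here, and it ignores midnight wrap-around):

awk -F'-' '
  {
    key = $1 "-" $2                 # pool-thread
    split($3, t, /[:.]/)            # HH:MM:SS.mmm
    secs = t[1]*3600 + t[2]*60 + t[3] + t[4]/1000
    if (key in prev)
      printf "%s %.3f %s\n", key, secs - prev[key], $0   # gap since the previous entry on this thread
    prev[key] = secs
  }' reordered.log | sort -k2,2 -rn | head -20   # biggest gaps first

The largest gaps should point at the TU(s) where the analysis stalls.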

Looking at the file, it doesn’t seem particularly big. However, it has recursion in it, which might be causing the issue here.

Looks good.

Is the running-time distribution considerably different if you disable Z3 refutation?
Can you share the sonarqube.log file?

What recursion are you referring to?

I’m afraid I can’t share the sonarqube.log. It’s also not particularly interesting, except that this one file takes a long time to process and, according to its own debug message, Z3 also takes a long time.

I found another build today that took a long time, and it was the same file.

Recursion in the sense that there’s a recursive function in the file. I’m guessing this causes the Z3 solver to hit some edge case where it goes too deep or something.

To clarify, are you still not comfortable sharing the reproducer with us privately?
It would be interesting to see the concrete case you are dealing with.

Maybe, if you are lucky, you can redact the sensitive parts while preserving the reason why it’s so slow, although the reduction process is quite fragile, so I can’t wholeheartedly ask you to do that.
