Very slow analysis after 8.1 -> 8.8 upgrade

We have SonarQube 8.8 Community Edition running on a Kubernetes cluster via the community helm chart. We recently upgraded from image version tag 8.1-community-beta to 8.8-community. The UI was extremely slow after the upgrade, and we've since deleted hundreds of branches that hadn't been deleted automatically and run a VACUUM FULL on our postgres instance. This cleaned up a few hundred gigabytes of data and the UI now seems to be working just fine.

Unfortunately, analysis is still slow. We're using the latest-tagged sonarqube-scanner image, which takes about 8 minutes to run on the project - that's fine and hasn't changed - but the server-side analysis by the compute engine has gone from ~5 minutes to ~30 minutes.

We've disabled all third-party plugins, disabled duplicate detection (via a ** exclusion), and set the following JVM properties:

sonar.search.javaOpts: "-Xmx4G -Xms4G"
sonar.ce.javaOpts: "-Xmx4G -Xms4G"
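
For anyone wanting to do the same, a duplication exclusion like ours can be passed to the scanner as a property - roughly the following, though this is a sketch rather than our exact configuration:

# sketch: exclude everything from copy/paste (duplication) detection at scan time
sonar-scanner -Dsonar.cpd.exclusions='**/*'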

The following is a lightly-parsed version of the compute engine logs from the server.

| Load analysis metadata                                 | status=SUCCESS, time=26ms                                                                                                     |
| Initialize                                             | status=SUCCESS, time=0ms                                                                                                      |
| Build tree of components                               | components=11558, status=SUCCESS, time=40891ms                                                                                |
| Validate project                                       | status=SUCCESS, time=17ms                                                                                                     |
| Load quality profiles                                  | status=SUCCESS, time=308ms                                                                                                    |
| Load Quality gate                                      | status=SUCCESS, time=16ms                                                                                                     |
| Load new code period                                   | status=SUCCESS, time=17ms                                                                                                     |
| Detect file moves                                      | reportFiles=9180, dbFiles=9180, addedFiles=0, status=SUCCESS, time=424ms                                                      |
| Load duplications                                      | duplications=0, status=SUCCESS, time=15793ms                                                                                  |
| Compute cross project duplications                     | status=SUCCESS, time=0ms                                                                                                      |
| Compute size measures                                  | status=SUCCESS, time=40314ms                                                                                                  |
| Compute new coverage                                   | status=SUCCESS, time=163552ms                                                                                                 |
| Compute coverage measures                              | status=SUCCESS, time=13832ms                                                                                                  |
| Compute comment measures                               | status=SUCCESS, time=34ms                                                                                                     |
| Copy custom measures                                   | status=SUCCESS, time=9ms                                                                                                      |
| Compute duplication measures                           | status=SUCCESS, time=28ms                                                                                                     |
| Compute size measures on new code                      | status=SUCCESS, time=38ms                                                                                                     |
| Compute language distribution                          | status=SUCCESS, time=27ms                                                                                                     |
| Compute test measures                                  | status=SUCCESS, time=19ms                                                                                                     |
| Compute complexity measures                            | status=SUCCESS, time=48ms                                                                                                     |
| Load measure computers                                 | status=SUCCESS, time=0ms                                                                                                      |
| Compute Quality Profile status                         | status=SUCCESS, time=37ms                                                                                                     |
| Execute component visitors                             | status=SUCCESS, time=189988ms                                                                                                 |
| Checks executed after computation of measures          | status=SUCCESS, time=0ms                                                                                                      |
| Compute Quality Gate measures                          | status=SUCCESS, time=0ms                                                                                                      |
| Compute Quality profile measures                       | status=SUCCESS, time=10ms                                                                                                     |
| Generate Quality profile events                        | status=SUCCESS, time=82ms                                                                                                     |
| Generate Quality gate events                           | status=SUCCESS, time=5ms                                                                                                      |
| Check upgrade possibility for not analyzed code files. | status=SUCCESS, time=0ms                                                                                                      |
| Persist components                                     | status=SUCCESS, time=329ms                                                                                                    |
| Persist analysis                                       | status=SUCCESS, time=21ms                                                                                                     |
| Persist analysis properties                            | status=SUCCESS, time=38ms                                                                                                     |
| Persist measures                                       | inserts=90, status=SUCCESS, time=65ms                                                                                         |
| Persist live measures                                  | insertsOrUpdates=529036, status=SUCCESS, time=192873ms                                                                        |
| Persist duplication data                               | insertsOrUpdates=0, status=SUCCESS, time=4ms                                                                                  |
| Persist new ad hoc Rules                               | status=SUCCESS, time=0ms                                                                                                      |
| Persist issues                                         | cacheSize=34 KB, inserts=8, updates=59, merged=0, status=SUCCESS, time=464ms                                                  |
| Persist project links                                  | status=SUCCESS, time=0ms                                                                                                      |
| Persist events                                         | status=SUCCESS, time=58ms                                                                                                     |
| Persist sources                                        | status=SUCCESS, time=245065ms                                                                                                 |
| Persist cross project duplications                     | status=SUCCESS, time=0ms                                                                                                      |
| Enable analysis                                        | status=SUCCESS, time=53ms                                                                                                     |
| Update last usage date of quality profiles             | status=SUCCESS, time=62ms                                                                                                     |
| Purge db                                               | status=SUCCESS, time=352ms                                                                                                    |
| Index analysis                                         | status=SUCCESS, time=8192ms                                                                                                   |
| Update need issue sync for branch                      | status=SUCCESS, time=10ms                                                                                                     |
| Send issue notifications                               | status=SUCCESS, time=10ms                                                                                                     |
| Publish task results                                   | status=SUCCESS, time=0ms                                                                                                      |
| Trigger refresh of Portfolios and Applications         | status=SUCCESS, time=0ms                                                                                                      |
| Webhooks                                               | globalWebhooks=0, projectWebhooks=0, status=SUCCESS, time=8ms                                                                |
| Pull Request Decoration                                | status=SUCCESS, time=0ms                                                                                                      |

The total duration in this case was 1993457ms (33 minutes).

Warnings produced from this scan were:

Dependencies/libraries were not provided for analysis of SOURCE files. The 'sonar.java.libraries' property is empty. Verify your configuration, as you might end up with less precise results.
Unable to import 1 RuboCop report file(s).
Please check that property 'sonar.ruby.rubocop.reportPaths' is correctly configured and the analysis logs for more details.
Missing blame information for 4402 files. This may lead to some features not working correctly. Please check the analysis logs.

None of these warnings, from my limited knowledge, sound like they'd make the scan slower.

In terms of the slowest entries (times in ms):

| Persist sources                                        |  245065 |
| Persist live measures                                  |  192873 |
| Execute component visitors                             |  189988 |
| Compute new coverage                                   |  163552 |
| Build tree of components                               |   40891 |
| Compute size measures                                  |   40314 |
| Load duplications                                      |   15793 |
| Compute coverage measures                              |   13832 |
| Index analysis                                         |    8192 |

Observations:

Do the slow "Persist sources" and "Persist live measures" steps suggest a slow database (postgres) issue? Is there any guidance here? Our postgres server feels small, but it was not stressed at all during the analysis.

In this analysis we did not provide Sonar with any code coverage information, so why did “compute new coverage” take almost 3 minutes?

Is there a way to speed up the "Execute component visitors" step? It does not look like this is multi-threaded - or, if it is, can I increase the thread count? Our server CPU usage was very low (the usage covered the web UI, compute engine, and Elasticsearch together), so I do not think there were any CPU bottlenecks or garbage collection issues.

Any advice or suggestions on where to look next would be greatly appreciated - Sonar has proved extremely valuable, but we want to make sure it meets our needs before considering purchasing the hosted option. It's also worth saying that, to my knowledge, nothing around the filesystem has changed - so I don't think the bottleneck is there.

Hello @Martin_Foot,

At first sight I would say that the problem comes from the database. The slow Persist sources and Persist live measures steps are symptomatic of this root cause.
Could you check in particular the health of your live_measures and file_sources tables?

  • How big are they (in GB and in number of entries)?
  • A full vacuum reclaims space, but I am not sure about the indexing. Can you confirm that the DB has also been fully reindexed after the upgrade (in particular those two tables)?

I also see something I don't understand. You mentioned that the total execution time was 1993457ms (33 minutes), but if I add up everything (let's say the main time contributors) I reach a total of about 15 minutes. Where does the (huge) gap come from?

Olivier

Hi @OlivierK and thanks for the response! Well spotted, I had missed the biggest offender:

[o.s.c.t.s.ComputationStepExecutor] Extract report | status=SUCCESS | time=736446ms

This confused me when posting, too, because I was sure I'd seen report extraction taking a long time and had investigated it before. From looking at the repo, I'd guess the time is spent extracting a zip file. This step alone takes longer than the entire analysis did before the upgrade.

I'll get back to you on the table sizes and a reindex - assuming this means REINDEX DATABASE "sonarDB" (or just the tables), I haven't done this explicitly. I only ran VACUUM FULL, as I saw it mentioned here.

Thanks!

Here’s the sizes and row counts of the tables:

sonarDB=> SELECT pg_size_pretty( pg_database_size('sonarDB') );
 pg_size_pretty 
----------------
 31 GB
(1 row)

sonarDB=> SELECT pg_size_pretty( pg_total_relation_size('live_measures') );
 pg_size_pretty 
----------------
 10 GB
(1 row)

sonarDB=> SELECT pg_size_pretty( pg_total_relation_size('file_sources') );
 pg_size_pretty 
----------------
 5906 MB
(1 row)

sonarDB=> SELECT COUNT(*) FROM file_sources;
 count  
--------
 766834
(1 row)

sonarDB=> SELECT COUNT(*) FROM live_measures;
  count   
----------
 41112988
(1 row)

I've just run a branch report after reindexing those two tables (REINDEX TABLE live_measures and REINDEX TABLE file_sources), and the report took 1714549ms (about 29 minutes) - not wholly different from the previous scan.

OK, I believe I have solved this issue.

I ran an ANALYZE on the sonarDB database but did not see any improvement.

I also enabled the postgres slow query log, and during the analysis the only query that took over 5 seconds was this one, which suggests postgres is not having to do large non-indexed joins:

2021-05-13 11:25:22.219 GMT [2365] LOG: duration: 5237.327 ms execute <unnamed>: SELECT
p.uuid,
p.path,
p.module_uuid as moduleUuid,
fs.src_hash as srcHash,
fs.revision
FROM components root
INNER JOIN components p on
p.project_uuid=root.uuid
and p.enabled=true
and p.scope='FIL'
INNER JOIN file_sources fs ON
fs.file_uuid=p.uuid
where
root.uuid=$1
2021-05-13 11:25:22.219 GMT [2365] DETAIL: parameters: $1 = 'AXkiaT-5v9lU_N2D-Wog'
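
For reference, enabling the slow query log amounts to something like the following (a sketch - it assumes superuser access; the 5000 is a 5-second threshold in milliseconds, and pg_reload_conf() applies it without a restart):

# log any statement slower than 5 seconds
psql -d sonarDB -c "ALTER SYSTEM SET log_min_duration_statement = 5000;"
psql -d sonarDB -c "SELECT pg_reload_conf();"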

I then used YourKit to profile the compute engine process during an analysis and looked at the resulting flamegraph.

It's clear from the staircase of functions on the left of the flamegraph that persisting live measures made up the majority of the runtime, and that it was slowed down mostly by time spent in org.apache.ibatis.session.defaults.DefaultSqlSession.commit(). There's a chunk of time missing that isn't visible here, though - hovering over the frames inside YourKit showed only about 5 minutes of actual time spent in the CE code path.

I then went to look at the filesystem that postgres is using, an EFS volume. It doesn't seem terribly slow (though it's not great), and it isn't bad enough to cause such a slowdown, especially given that the slow query log isn't showing anything particularly slow:

postgresql@sonar-postgresql-0:/$ pg_test_fsync -f /tmp/file
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      1129.834 ops/sec     885 usecs/op
        fdatasync                          1241.245 ops/sec     806 usecs/op
        fsync                               786.327 ops/sec    1272 usecs/op
        fsync_writethrough                              n/a
        open_sync                           994.000 ops/sec    1006 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                       579.749 ops/sec    1725 usecs/op
        fdatasync                           984.110 ops/sec    1016 usecs/op
        fsync                               735.919 ops/sec    1359 usecs/op
        fsync_writethrough                              n/a
        open_sync                           467.525 ops/sec    2139 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write           814.711 ops/sec    1227 usecs/op
         2 *  8kB open_sync writes          421.977 ops/sec    2370 usecs/op
         4 *  4kB open_sync writes          243.189 ops/sec    4112 usecs/op
         8 *  2kB open_sync writes          124.718 ops/sec    8018 usecs/op
        16 *  1kB open_sync writes           53.923 ops/sec   18545 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                 953.955 ops/sec    1048 usecs/op
        write, close, fsync                 761.787 ops/sec    1313 usecs/op

Non-sync'ed 8kB writes:
        write                            183792.449 ops/sec       5 usecs/op

So this was a bit of a dead end (although some postgres tuning could likely bring that commit time down somewhat).
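
As an illustration of the kind of knob I mean (not something I've tried here, and it trades durability for speed - a crash could lose the most recent transactions):

# illustration only: relax commit durability to reduce per-commit latency on slow storage
psql -d sonarDB -c "ALTER SYSTEM SET synchronous_commit = 'off';"
psql -d sonarDB -c "SELECT pg_reload_conf();"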

I then had a look at the events tab in YourKit and saw a large number of tiny writes of temp files to /opt/sonarqube/temp/ce during zip file extraction. I have now mounted an in-memory filesystem there, and the zip extraction step went from ~700,000ms to ~4,000ms. The runtime of my entire report went from ~30 minutes to ~5.
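
For anyone wanting to try the same outside of a Kubernetes setup, the idea is simply a memory-backed mount over the CE temp directory - roughly the sketch below (the size is an arbitrary example; on Kubernetes the usual equivalent is a memory-backed emptyDir volume mounted at that path):

# sketch: back the compute engine temp directory with memory
# (pick a size large enough to hold the extracted analysis report)
mount -t tmpfs -o size=2g tmpfs /opt/sonarqube/temp/ce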

Is this a safe change to keep moving forwards?

Thanks!

Hello @Martin_Foot,
Sorry for the delay in replying.
Indeed, our requirement is to have fast local disks for all SonarQube partitions. I don't know what the original disk setup was, but the improvement with an in-memory filesystem is indeed impressive.
I cannot guarantee that this is a safe change, since it all depends on the reliability of the disk (whether in-memory or physical). We typically recommend SSDs.
What I can say is:

  • The unzipping and processing of files is temporary for the duration of the CE task, so there is no need for long-term persistence.
  • If something goes wrong with the in-memory filesystem, you should notice rather quickly, since it would cause several (or all) CE background tasks to end in error due to IO problems.

Olivier