Diagnose performance difference between two environments

Must-share information (formatted with Markdown):

  • which versions are you using (SonarQube, Scanner, Plugin, and any relevant extension)
    SonarQube v10.6 (92116)
  • how is SonarQube deployed: zip, Docker, Helm
    Helm
  • what are you trying to achieve
    Get our production instance to match the performance of our staging instance
  • what have you tried so far to achieve this

Do not share screenshots of logs – share the text itself (bonus points for being well-formatted)!

We’re running SonarQube EE and have a staging instance for testing and a production instance. Both are hosted using Helm in Kubernetes clusters in Azure.

Staging info:

  • Azure Database for PostgreSQL flexible servers, Standard_B1ms size
  • Node pool size Standard_D16ads_v5 using ephemeral disk
  • Node pool is tainted such that the only workload schedulable on it is SonarQube
  • Persistent volume via Azure storage, not sure on size but max 20k IOPS
  • Storage, Database and SonarQube in same Azure subscription
  • Helm chart resource requests/limits set to 16Gi and 8 cores

Production info:

  • Azure Database for PostgreSQL flexible servers, Standard_D4ds_v4 size
  • Node pool size Standard_D32ads_v5 using ephemeral disk
  • Node pool is tainted such that the only workload schedulable on it is SonarQube
  • Persistent volume via Azure storage, not sure on size but max 20k IOPS
  • Storage and SonarQube in same Azure subscription, Database in different Azure subscription (will be moved but haven’t been able to do so yet)
  • Helm chart resource requests/limits set to 16Gi and 8 cores

When analyzing a mixed C and C++ project of around 209k lines of code, the staging instance is completing the background task in ~9.5 minutes, but the production instance is taking ~18-20 minutes.

I’ve attached extracts from the ce.log files on both staging and production, and also put them into a chart to highlight the differences (filtered to events taking longer than 10 seconds).
staging_ce.log (11.7 KB)
prod_ce.log (11.9 KB)

| Event | Staging (ms) | Production (ms) |
| --- | ---: | ---: |
| Extract report | 358343 | 483608 |
| Build tree of components | 9633 | 12876 |
| Load file hashes and statuses | 9733 | 12191 |
| Compute size measures | 9864 | 11383 |
| Compute new coverage | 13488 | 16732 |
| Execute component visitors | 50078 | 449638 |
| Persist live measures | 4269 | 103235 |
| Persist duplication data | 379 | 10110 |
| Persist sources | 34450 | 43079 |
| Time between final event and “Executed task” | 67439 | 79821 |

Hey there.

Ultimately, it’s hard to do a direct comparison since you have some variables (sizing, being located on the same Azure subscription).

However, my first recommendation for any performance issues on a Postgres database would be this:

If that doesn’t help – I would like to know more about the dataset on staging. Is it a clone of production as it is today? Something else?

Hi Colin,

> However, my first recommendation for any performance issues on a Postgres database would be this:

Autovacuum was on but I tried running that anyway; didn’t make a difference.
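What I ran was along these lines (a sketch only; the connection details are placeholders, and the query just surfaces the largest tables before forcing a manual vacuum):

```bash
# Check when (auto)vacuum last touched the biggest SonarQube tables, then run a
# manual VACUUM ANALYZE. Connection details are placeholders.
psql "host=<db-host> user=sonarqube dbname=sonarqube sslmode=require" <<'SQL'
SELECT relname, n_live_tup, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC
LIMIT 20;
VACUUM (VERBOSE, ANALYZE);
SQL
```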

> Ultimately, it’s hard to do a direct comparison since you have some variables (sizing, being located on the same Azure subscription).

You’re right that the Azure resource sizes differ, but the production resources are essentially the staging ones “and then some”: more cores and/or more RAM, higher throughput limits, and so on. What I don’t have a good feel for is the performance impact, if any, of the production instance being in a different subscription from the production database.

> If that doesn’t help – I would like to know more about the dataset on staging. Is it a clone of production as it is today? Something else?

The staging database is a clone of production. The number of analyses run on the project will have diverged by now, as I’ve been testing back and forth between staging and production. We’ve tried playing around with the resources given to the SonarQube pod (more cores, more RAM, etc.), but none of that has made any difference so far.
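For reference, the resource changes were applied through the chart values, roughly like this (a sketch; the release name, repo alias and `resources` value keys are my best guesses for the official SonarSource chart, so check them against your chart version):

```bash
# Bump the pod's requests/limits via the chart values. Release, namespace and
# value keys below are assumptions; adjust to your deployment.
helm upgrade sonarqube sonarqube/sonarqube -n sonarqube --reuse-values \
  --set resources.requests.cpu=8 \
  --set resources.requests.memory=16Gi \
  --set resources.limits.cpu=8 \
  --set resources.limits.memory=16Gi
```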

Thanks,
Sean

Just a follow up… more testing today strongly suggests that the issue is network related. The Azure vnet configuration is different in the staging environment. I ran an analysis this morning before switching anything around, and the background analysis took ~22 minutes. I shut down both production and staging instances, pointed the staging instance at the production database, purged the staging elasticsearch index (AFAIK this is required when swapping databases?), started the staging instance back up, and repeated the test. This time it took ~11 minutes.
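By “purged the elasticsearch index” I mean removing the local index data so SonarQube rebuilds it from the (new) database on startup. Roughly this (a sketch; the path assumes a default layout, and `es8` is my understanding of the directory name for SonarQube 10.x – on Kubernetes the directory lives on the persistent volume):

```bash
# With SonarQube stopped, delete the on-disk Elasticsearch index so it is
# rebuilt from the database on the next start. Path is an assumption for a
# default install (SONARQUBE_HOME=/opt/sonarqube, SonarQube 10.x -> data/es8).
rm -rf /opt/sonarqube/data/es8
```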

@Colin does that make sense to you? If the network connection to the database is poor (unsure if poor throughput or poor latency), would you expect to see issues like this?
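In case it helps anyone comparing the two environments, one way to sanity-check the database round trip from inside the cluster would be something like this (a sketch; host, credentials and namespace are placeholders, and single-digit-millisecond timings would be the healthy end):

```bash
# Throwaway postgres client pod in the cluster; \timing reports per-statement
# round-trip times, which approximate network latency for trivial queries.
# All connection details below are placeholders.
kubectl -n sonarqube run pg-latency --rm -i --restart=Never \
  --image=postgres:16 --env=PGPASSWORD='<password>' --command -- \
  psql "host=<db-host> user=sonarqube dbname=sonarqube sslmode=require" \
  -c '\timing on' -c 'SELECT 1;' -c 'SELECT 1;' -c 'SELECT 1;'
```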

Hey there.

Yes, definitely! As documented:

> Hosts and locations
>
> For optimal performance, the SonarQube server and database should be installed on separate hosts, and the server host should be dedicated. The server and database hosts should be located on the same network.

Just an update, and this will conclude the issue for us… it really does all revolve around disk and network (latency and/or throughput). The culprits on our side were:

  • Database in different network
  • (helm) Using Azure Fileshare for persistent storage
  • (VM) Slow disks for temporary storage and elasticsearch storage

Going forward, we are going to switch over to a dedicated VM with fast storage. Azure ephemeral disks are a good fit for this, and with a VM we don’t have to keep chasing Kubernetes probe timeouts as our database grows, as we did with the Helm deployment.
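If anyone wants to reproduce the disk side of the comparison, a quick way to compare the storage options (ephemeral disk vs. Azure Files vs. managed disk) is a small random-I/O benchmark along these lines (a sketch; the path, size and runtime are just examples):

```bash
# Random 4k read/write test against the SonarQube data path; compare latency
# and IOPS between storage backends. Flag values here are examples only.
fio --name=sq-randrw --directory=/opt/sonarqube/data --size=1g \
    --rw=randrw --bs=4k --iodepth=16 --numjobs=1 --ioengine=libaio \
    --direct=1 --runtime=60 --time_based --group_reporting
```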
