which versions are you using (SonarQube, Scanner, Plugin, and any relevant extension)
SonarQube v10.6 (92116)
how is SonarQube deployed: zip, Docker, Helm
Helm
what are you trying to achieve
Get our production instance to match the performance of our staging instance
what have you tried so far to achieve this
We’re running SonarQube EE and have a staging instance for testing and a production instance. Both are hosted using Helm in Kubernetes clusters in Azure.
Staging info:
Azure Database for PostgreSQL flexible servers, Standard_B1m size
Node pool size Standard_D16ads_v5 using ephemeral disk
Node pool is tainted such that the only workload schedulable on it is SonarQube
Persistent volume via Azure storage, not sure of the size but max 20k IOPS
Storage, Database and SonarQube in same Azure subscription
Helm chart resource requests/limits set to 16Gi and 8 cores
Production info:
Azure Database for PostgreSQL flexible servers, Standard_D4ds_v4 size
Node pool size Standard_D32ads_v5 using ephemeral disk
Node pool is tainted such that the only workload schedulable on it is SonarQube
Persistent volume via Azure storage, not sure of the size but max 20k IOPS
Storage and SonarQube in same Azure subscription, Database in different Azure subscription (will be moved but haven’t been able to do so yet)
Helm chart resource requests/limits set to 16Gi and 8 cores
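Both environments apply the node pool taint and the pod resources in the same way. Below is a rough sketch of how that can be done on AKS with the SonarQube chart; the resource group, cluster, node pool, release and namespace names are placeholders, and the chart value names (resources, tolerations) should be verified against helm show values for the chart version in use.

```
# Taint the dedicated node pool so that only pods with a matching toleration
# (i.e. SonarQube) can be scheduled onto it. (--node-taints on "update" needs a
# recent azure-cli; alternatively set it when creating the pool with "nodepool add".)
az aks nodepool update \
  --resource-group my-rg --cluster-name my-aks --name sonarqubepool \
  --node-taints "dedicated=sonarqube:NoSchedule"

# Set the 16Gi / 8-core requests and limits plus the matching toleration on the chart.
helm upgrade --install sonarqube sonarqube/sonarqube -n sonarqube --reuse-values \
  --set resources.requests.cpu=8 \
  --set resources.requests.memory=16Gi \
  --set resources.limits.cpu=8 \
  --set resources.limits.memory=16Gi \
  --set 'tolerations[0].key=dedicated' \
  --set 'tolerations[0].operator=Equal' \
  --set 'tolerations[0].value=sonarqube' \
  --set 'tolerations[0].effect=NoSchedule'
```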
When analyzing a mixed C and C++ project of around 209k lines of code, the staging instance is completing the background task in ~9.5 minutes, but the production instance is taking ~18-20 minutes.
I’ve attached extracts from the ce.log files on both staging and production, and also put them into a chart to highlight the differences (filtered to events taking longer than 10 seconds). Attachments: staging_ce.log (11.7 KB), prod_ce.log (11.9 KB)
However, my first recommendation for any performance issue on a Postgres database would be to check that autovacuum is enabled and running, or to run a manual VACUUM/ANALYZE.
Autovacuum was on but I tried running that anyway; didn’t make a difference.
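For anyone landing here later, a manual vacuum/analyze of the SonarQube database from psql looks roughly like this (host, user and database names below are placeholders):

```
# Reclaims dead tuples and refreshes planner statistics for the whole database.
# VERBOSE prints per-table progress so you can see which tables are touched.
PGPASSWORD='<password>' psql \
  -h my-postgres.postgres.database.azure.com -U sonar -d sonarqube \
  -c "VACUUM (VERBOSE, ANALYZE);"
```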
Ultimately, it’s hard to do a direct comparison, since there are some differing variables (instance sizing, and whether everything is located in the same Azure subscription).
You’re right that the sizes of the Azure resources are different; however, the production ones match or exceed the staging ones, e.g. more cores and/or more RAM, higher throughput limits, etc. I don’t have a good understanding of what impact, if any, the production instance being in a different subscription than the production database could have on performance.
If that doesn’t help – I would like to know more about the dataset on staging. Is it a clone of production as it is today? Something else?
The staging database is a clone of production, although the number of analyses run on the project will have diverged by now, as I’ve been testing back and forth between staging and production. We’ve also tried playing around with the resources given to the SonarQube pod (more cores, more RAM, etc.), but none of that has made any difference so far.
Just a follow-up… more testing today strongly suggests that the issue is network-related. The Azure vnet configuration is different in the staging environment. I ran an analysis this morning before switching anything around, and the background analysis took ~22 minutes. I then shut down both the production and staging instances, pointed the staging instance at the production database, purged the staging Elasticsearch index (AFAIK this is required when swapping databases?), started the staging instance back up, and repeated the test. This time it took ~11 minutes.
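For anyone wanting to reproduce that swap, this is roughly what it looks like with the Helm chart. The jdbcOverwrite value names, the StatefulSet/PVC names and the release/namespace are assumptions to check against helm show values and kubectl get pvc for your deployment:

```
# Stop the staging instance (production was also stopped, so that only one
# SonarQube server is ever connected to the database).
kubectl -n sonarqube scale statefulset sonarqube-sonarqube --replicas=0

# Drop the claim backing SonarQube's data/ directory so the embedded Elasticsearch
# index is rebuilt from scratch against the new database on the next start.
kubectl -n sonarqube delete pvc <sonarqube-data-pvc>

# Point staging at the production database; the upgrade rolls the pod back out.
helm upgrade sonarqube sonarqube/sonarqube -n sonarqube --reuse-values \
  --set jdbcOverwrite.enable=true \
  --set jdbcOverwrite.jdbcUrl='jdbc:postgresql://prod-db.example.com:5432/sonarqube' \
  --set jdbcOverwrite.jdbcUsername=sonar \
  --set jdbcOverwrite.jdbcPassword='<password>'
```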
@Colin does that make sense to you? If the network connection to the database is poor (I’m unsure whether it’s poor throughput or poor latency), would you expect to see issues like this?
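In case it helps others comparing environments: one crude way to compare round-trip latency to the database from inside each cluster is to time a trivial query from a throwaway client pod (hostname and credentials below are placeholders), then compare the staging and production timings:

```
# One-off postgres client pod; "time" covers TCP connect + auth + a trivial query,
# which is dominated by network round trips, so a large difference between clusters
# points at the network path to the database.
kubectl run pg-latency-test -it --rm --restart=Never --image=postgres:16 \
  --env=PGPASSWORD='<password>' -- \
  bash -c 'time psql -h prod-db.postgres.database.azure.com -U sonar -d sonarqube -c "SELECT 1;"'
```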
For optimal performance, the SonarQube server and database should be installed on separate hosts, and the server host should be dedicated. The server and database hosts should be located on the same network.
Just an update, and this will conclude the issue for us… it really does all revolve around disk and network (latency and/or throughput). The things we were doing that turned out to be the culprits:
Database in different network
(helm) Using Azure Fileshare for persistent storage
(VM) Slow disks for temporary storage and Elasticsearch storage
Going forward, we are going to switch over to a dedicated VM with fast storage. Azure ephemeral disks are good for this, and with a VM we don’t have to keep chasing probe timeouts as our database grows, as we did with the Helm deployment.
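For anyone debugging a similar setup, a crude way to compare the storage options (Azure Files-backed persistent volume vs local/ephemeral disk) is a sequential write test with dd from inside the pod or VM; the paths below are placeholders and the numbers are only indicative:

```
# Writes 1 GiB and forces it to disk before reporting throughput (conv=fdatasync),
# so the page cache does not inflate the result. Run it once on the persistent
# volume path and once on the local/ephemeral disk used for temp and Elasticsearch
# storage, then compare the MB/s figures.
dd if=/dev/zero of=/opt/sonarqube/data/ddtest bs=1M count=1024 conv=fdatasync
rm /opt/sonarqube/data/ddtest
```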