We are attempting to upgrade from Version 7.9.3 to 8.4.0.
Our initial attempt at this upgrade failed due to the server becoming unresponsive during DBMigration #3208.
We had to abort the upgrade and rollback our installation once we encountered this issue.
The next day we created a test replica of our sonar server/database to attempt the upgrade on (so as to not affect our production installation). During this upgrade attempt we observed the same behavior. The server became unresponsive and we lost our ssh session. We decided to let it be, and after 7 hours the server became responsive again once DBMigration #3208 had completed. However, the sonar process had died so we had to restart it, at which point it continued patching for another 2 hours and completely finished the Database Migration in 9 hours total.
Logging observed:
2020.07.23 15:11:38 INFO web[][DbMigrations] #3208 'Remove old Security Review Rating measures'...
2020.07.23 15:12:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1521 rows processed (25 items/sec)
2020.07.23 15:13:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 249 rows processed (4 items/sec)
2020.07.23 15:14:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 499 rows processed (4 items/sec)
...
2020.07.23 16:36:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1249 rows processed (0 items/sec)
2020.07.23 16:37:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1249 rows processed (0 items/sec)
2020.07.23 16:38:17 WARN web[][o.s.p.ProcessEntryPoint] Fail to start web
java.lang.IllegalArgumentException: Unable to create shared memory :
at org.sonar.process.sharedmemoryfile.AllProcessesCommands.<init>(AllProcessesCommands.java:103)
at org.sonar.process.sharedmemoryfile.DefaultProcessCommands.<init>(DefaultProcessCommands.java:34)
at org.sonar.process.sharedmemoryfile.DefaultProcessCommands.secondary(DefaultProcessCommands.java:52)
at org.sonar.server.app.WebServer.isOperational(WebServer.java:69)
at org.sonar.server.app.WebServer.getStatus(WebServer.java:61)
at org.sonar.process.ProcessEntryPoint.waitForStatus(ProcessEntryPoint.java:121)
at org.sonar.process.ProcessEntryPoint.launch(ProcessEntryPoint.java:104)
at org.sonar.process.ProcessEntryPoint.launch(ProcessEntryPoint.java:81)
at org.sonar.server.app.WebServer.main(WebServer.java:99)
Caused by: java.io.FileNotFoundException: /opt/sonarqube-8.4.0.35506/temp/sharedmemory (Too many open files in system)
at java.base/java.io.RandomAccessFile.open0(Native Method)
at java.base/java.io.RandomAccessFile.open(RandomAccessFile.java:345)
at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:259)
at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:214)
at org.sonar.process.sharedmemoryfile.AllProcessesCommands.<init>(AllProcessesCommands.java:100)
... 8 common frames omitted
2020.07.23 16:38:17 INFO web[][o.s.p.ProcessEntryPoint] Hard stopping process
2020.07.23 16:38:18 WARN web[][o.s.p.ProcessEntryPoint$HardStopperThread] Can not stop in 1000ms
2020.07.23 16:38:18 WARN web[][o.s.s.a.EmbeddedTomcat] Failed to stop web server
org.apache.catalina.LifecycleException: A child container failed during stop
at org.apache.catalina.core.ContainerBase.stopInternal(ContainerBase.java:997)
at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:257)
at org.apache.catalina.core.StandardService.stopInternal(StandardService.java:486)
at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:257)
at org.apache.catalina.core.StandardServer.stopInternal(StandardServer.java:787)
at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:257)
at org.apache.catalina.startup.Tomcat.stop(Tomcat.java:450)
at org.sonar.server.app.EmbeddedTomcat.terminate(EmbeddedTomcat.java:104)
at org.sonar.server.app.WebServer.hardStop(WebServer.java:83)
at org.sonar.process.ProcessEntryPoint$HardStopperThread.lambda$new$0(ProcessEntryPoint.java:219)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
2020.07.23 16:38:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1249 rows processed (0 items/sec)
2020.07.23 16:38:27 WARN web[][o.s.s.p.d.m.DatabaseMigrationExecutorServiceImpl] Pool DatabaseMigrationExecutorServiceImpl did not terminate
2020.07.23 16:39:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1249 rows processed (0 items/sec)
2020.07.23 16:40:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1249 rows processed (0 items/sec)
2020.07.23 16:41:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1249 rows processed (0 items/sec)
...
2020.07.23 21:59:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1521 rows processed (0 items/sec)
2020.07.23 22:00:18 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1521 rows processed (0 items/sec)
2020.07.23 22:00:31 INFO web[][o.s.s.p.d.m.s.MassUpdate] 1521 rows processed (0 items/sec)
2020.07.23 22:00:31 INFO web[][DbMigrations] #3208 'Remove old Security Review Rating measures': success | time=24532914ms
Information about installation:
- SonarQube running on r5.2xlarge EC2 instance
- SonarQube DB running Postgres 11.6 on an m5.large instance in RDS
Configurations:
- vm.max_map_count: 1048576
- fs.file-max: 262144
- nofile: 262144
- nproc: 16384
- sonar.ce.javaOpts=-Xmx20G
- sonar.web.javaOpts=-Xmx28G
- sonar.search.javaOpts=-Xmx12G -Xms12G
I did some monitoring of the queries that are running during DBMigration #3208. This query seems to be the killer:
delete from live_measures where project_uuid = $1 and metric_id in ($2, $3)
Our database has 208265000 rows in that table
Is there something we can do to speed up DBMigration #3208? 9 hours of downtime on our production instance is not feasible. Perhaps an index is missing that we could add? Maybe our configuration needs some tuning? Any advice would be great!