[Tech Story] A little story about SonarQube 7.9.1

Investigation details

Our performance tests helped us narrow down the problem to an upgrade step that injects a lot of data into the database. A lot like 650,000,000 rows. While PostgreSQL and Oracle perform the migration in less than 24h, SQL Server takes 14 days to complete (yup, that's an estimate, we didn't really wait that long).

Inserting 650M rows on PostgreSQL or Oracle runs at 30k inserts/sec, while on SQL Server it starts at 5k inserts/sec and slowly decreases to 200 inserts/sec. On AWS RDS monitoring, the write IOPS look bad. Something is wrong.

At this stage, SonarQube is reading from a table, doing a little bit of transformation, and inserting the results into another table. Nothing fancy, everything should be fine. Since the code performs well on PostgreSQL and Oracle, there must be something wrong with SQL Server, right?

Chapter 1: Challenge the environment

The first suspect on our list is the driver version! Let's check with a more recent version. Search the patch notes for performance improvements. No luck.

Since there are different versions of the driver depending on the version of Java, let's check with another JRE! No luck.

Something must be wrong with the DB engine version! We were running our tests with the oldest version of Microsoft SQL Server we are compatible with, so let's check with a 2017 version. The exact same results.

With an Enterprise Edition instead of a Standard Edition? Same results.

Something must be wrong in the DB configuration. Everyone double-checks that everything is right. Check the RDS instance, the database properties, the provisioned IOPS. No luck.

Hmm, then let's try to insert data with plain SQL scripts. Come on! Performance is fine. At this point we must conclude there is no issue with the environment; time to close this chapter and look somewhere else.

Chapter 2: Check connections and isolation

Something must be wrong with the DB connection pool used by SonarQube. We open two transactions: one with a read-only isolation level, and another one for writing. Maybe the pool is foobared and the transactions are somehow in conflict. We tried a version without reading, with only one connection, inserting random data. Same result? Depressing :'(
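For context, the suspected setup boils down to something like this (a rough sketch; the URL, credentials and the pool-free wiring are made up, not the actual SonarQube code):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Minimal sketch of the two-connection pattern under suspicion
// (URL, credentials and table handling are illustrative, not SonarQube code).
public class TwoConnections {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:sqlserver://localhost;databaseName=sonar";
        try (Connection read = DriverManager.getConnection(url, "sonar", "sonar");
             Connection write = DriverManager.getConnection(url, "sonar", "sonar")) {
            read.setReadOnly(true);      // only streams rows from the source table
            write.setAutoCommit(false);  // inserts into the target table, committed manually
            // ... read, transform, insert ...
            write.commit();
        }
    }
}
```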

Double-check that SQL statements are correctly handled: Java try-with-resources, proper statement release. And oh my gosh, we found some bugs! Let's run the code again with those improvements, I'm sure it will be blazing fast! Nope, still slow as hell.
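The kind of fix this boiled down to looks roughly like this (a sketch; the table and column names are made up):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of the statement-handling fix: try-with-resources guarantees the
// PreparedStatement is closed even if the insert throws. Table and column
// names are made up.
class RowInserter {
    static void insertRow(Connection connection, String uuid, byte[] payload) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO target_table (uuid, payload) VALUES (?, ?)")) {
            ps.setString(1, uuid);
            ps.setBytes(2, payload);
            ps.executeUpdate();
        } // closed here, no leaked statement handle
    }
}
```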

Remove all the indexes! Tune the batch size! Analyse the code in production with an APM (Glowroot)! Try all the things!! But sadly nothing changes the insertion rate per second. Maybe SQL Server is just slow, and that's life. How can people be using it?
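For reference, the batching pattern being tuned is roughly this (a sketch; the batch size and table name are arbitrary, not values from SonarQube):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch of the batching pattern we kept tuning. The batch size of 250 is
// arbitrary, and target_table/payload are made-up names.
class BatchedInserter {
    static void insertBatched(Connection connection, List<byte[]> payloads) throws SQLException {
        connection.setAutoCommit(false);
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO target_table (payload) VALUES (?)")) {
            int pending = 0;
            for (byte[] payload : payloads) {
                ps.setBytes(1, payload);
                ps.addBatch();
                if (++pending == 250) {       // flush and commit every N rows
                    ps.executeBatch();
                    connection.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {                // flush the last partial batch
                ps.executeBatch();
                connection.commit();
            }
        }
    }
}
```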

Time to take a breath. We now trust the environment we are executing our tests on. We also removed all SonarQube code to simply insert random data into a table without any index.

What are the remaining options?

Chapter 3: Think outside the box

At some point, the best thing is to start from scratch: a JDBC driver, a mug of coffee, one connection, a for loop, a PreparedStatement and a few commits… And guess what?

Here we go: that small loop reaches 30k inserts per second! Woohoo!
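For the curious, the whole experiment fits in a few lines (a minimal sketch; the URL, credentials, table name, payload size and row count are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;
import java.util.UUID;

// Bare-bones reproduction: one connection, one PreparedStatement, a for loop
// and periodic commits. URL, credentials, table name and row count are
// illustrative, not the exact test setup.
public class InsertLoop {
    public static void main(String[] args) throws Exception {
        Random random = new Random();
        byte[] payload = new byte[1024];
        try (Connection connection = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=test", "user", "password")) {
            connection.setAutoCommit(false);
            try (PreparedStatement ps = connection.prepareStatement(
                    "INSERT INTO test_insert (uuid, payload) VALUES (?, ?)")) {
                for (int i = 0; i < 1_000_000; i++) {
                    random.nextBytes(payload);
                    ps.setString(1, UUID.randomUUID().toString());
                    ps.setBytes(2, payload);
                    ps.executeUpdate();
                    if (i % 1_000 == 0) {
                        connection.commit();    // commit every thousand rows
                    }
                }
                connection.commit();
            }
        }
    }
}
```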

But the journey is not over: after a few hours, the rate started to decrease. The RDS monitoring? Showing that the write IOPS were increasing.

So now, for some reason, the rate decreases while the write IOPS increase. We found out that the index behind a primary key on Microsoft SQL Server is clustered. That means the actual data for a row is stored in the same storage space as the index (hence the name clustered): the index is the table structure. And the index is a balanced tree, or b-tree. On a b-tree, inserting a value may cause node splits as the data goes through the tree. In a very large b-tree, we do not benefit from sequential reads/writes because the data pages are so far apart on disk that they don't fit the disk cache. Moreover, non-sequential keys increase fragmentation (our PKs are random UUIDs). In the general case we might not see the difference, but here our data contains BLOBs of unlimited size. That's a lot to move every time. The effect is visible on the DB engine in the number of exclusive IO waits (PAGEIOLATCH_EX).

OK, let's remove the PK during the insertion, and put it back when done. An interesting lead, which helped stabilize the throughput at a high rate.
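In code, the workaround is roughly this (a sketch; the table, constraint and column names are invented, and on a real schema you would script out the existing constraint instead):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of the workaround: drop the clustered PK, bulk insert into the heap,
// then recreate the PK once at the end. Table, constraint and column names
// are invented for illustration.
class PkToggle {
    static void dropPk(Connection connection) throws SQLException {
        try (Statement ddl = connection.createStatement()) {
            ddl.execute("ALTER TABLE target_table DROP CONSTRAINT pk_target_table");
        }
    }

    static void recreatePk(Connection connection) throws SQLException {
        try (Statement ddl = connection.createStatement()) {
            ddl.execute("ALTER TABLE target_table ADD CONSTRAINT pk_target_table "
                    + "PRIMARY KEY (uuid)");   // rebuilt once, after the bulk insert
        }
    }
}
```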

Chapter 4: Final step

One version of the code is slow, the other is fast. The last step is to merge the slow code into the fast one, piece by piece, until it becomes slow. And boom, the guilty line of code is revealed:
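Reconstructed as a sketch (the column index and variable names are not the original ones), it is the null branch of the value binding:

```java
// Reconstructed sketch of the slow binding (not the literal SonarQube line):
// ps is the insert PreparedStatement, value the nullable metric value.
if (value == null) {
    ps.setNull(3, java.sql.Types.DECIMAL);   // the column is DECIMAL, so this looks correct
} else {
    ps.setDouble(3, value);
}
```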

When the value is null, we are using setNull with Types.DECIMAL. The column type is decimal, so this should have worked.

Under the hood, Microsoft SQL Server ends up doing an implicit CONVERT(). If we change to Types.DOUBLE, baoum, the rocket flies to the sky! It goes from 200 inserts/s to 30k/s just by using the double type. Indeed, this is not the expected behavior from the Microsoft SQL Server JDBC driver, and we filed a bug report with a reproducer so that it can be tackled at the JDBC driver level: https://github.com/microsoft/mssql-jdbc/issues/1106
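The fix, in the same sketch form:

```java
// Same sketch with the fix: bind the NULL as DOUBLE instead of DECIMAL,
// which avoids the implicit CONVERT() and restores the ~30k inserts/s rate.
if (value == null) {
    ps.setNull(3, java.sql.Types.DOUBLE);
} else {
    ps.setDouble(3, value);
}
```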
