[Tech Story] A little story about SonarQube 7.9.1

A few weeks ago, SonarSource was about to release SonarQube 7.9 LTS. This version will be supported for 18 months; it introduces many new features, including our first move on security; all teams got involved and worked very hard on it. In a nutshell, this is a highly awaited release you do not want to mess up.

Sadly, the day before the release, our upgrade tests reported very bad performance on Microsoft SQL Server, while the previous validations we had made on PostgreSQL and Oracle were fine.

In such a situation, a reptilian reflex could be: “Let’s put all available devs in a room, order pizzas and lock the doors until a solution is found!”

Well, this is an option, but such an approach may have disadvantages:

  • It puts a lot of pressure on the shoulders of a small group of people who have already worked hard for months on this version
  • It reduces the group to a certain category of people, which may limit the options / approaches to investigate the problem
  • You never know when (/if) you will reopen the doors :sweat_smile:

#WeAreSonarSource

In such a situation, company culture can have a big impact on the decisions being made. For us, in this case, two of our core values influenced our next steps:

The delivery:

We want to continuously deliver. We believe in increments, timebox and baby steps as means to deliver often, deliver simple, get feedback and continue improving.

Smarter together:

We believe that no one is as smart as all of us. We operate as a group with minimal hierarchy using collective intelligence, each of us being co-responsible for decisions made as well as relations with others.

(definitions copy-pasted from our internal wiki)

What do you think happened? Did we order pizzas right away? :innocent:
Not at all! Team representatives of Development, Product Management, Marketing, Services and Business met to find the best way to address the problem.

The output of this meeting? We then decided to:

  • raise the awareness of our users in the release notes
  • deliver what we already had
  • start a sprint to tackle the problem

Why was delivering important for us?

At this point in time, we had no clue what the problem was: environment, configuration, bug, … It was hard to guess how much time it would take to identify the root cause, find a solution and deliver a fix. It means that if we delayed the release, there was no way for us to foresee how long all our users would have to wait.
Those do not look like sane conditions to investigate a problem.

In addition, we knew that only a fraction of our users might be impacted by this problem: users with a significant SonarQube dataset who are using Microsoft SQL Server. All the rest of our users can safely enjoy this new LTS without being bothered by this problem.

Why did we run a sprint?

The ultimate goal is to avoid the tunnel effect. Defining a sprint of 5 days helped us to set expectations and put ourselves in good working conditions.

Without any time frame:

  • You start guessing that your next test is the last one, which most of the time prevents you from industrializing your setup and your tests
  • There is no end until you find the solution, which makes it hard to keep the team’s motivation
  • It is harder to handle the expectations of the rest of the world (other SonarSourcers, users, customers)

What about collective intelligence mentioned earlier?

First, we collectively weighed the pros and cons and took the decision with all impacted team representatives.
Secondly, the team we set up for the investigation was a mix of people coming from different teams:

  • development team (x3)
  • support (x1)
  • SonarCloud infrastructure (x1) (since tests were running on RDS)

The investigation:

To cut a long story short, the team managed to identify the problem on the last day of the sprint, and the fix was shipped as part of 7.9.1. The root cause was a limitation of the JDBC driver. Here is the report we opened on the JDBC driver side: https://github.com/microsoft/mssql-jdbc/issues/1106

If you are interested in the investigation details, we will post the step-by-step details in a comment below.

What did we learn?

First, it was 5 days of nice and intense collaboration. Some people in the team never had the occasion to collaborate before. There were depressing dead ends, funny wrong paths and a nice conclusion.

Here are some of the key points that really helped us:

Know what you are testing

The environment, the measure you are using or the piece of code you are testing: everybody in the team must be 100% confident about what is being done. If there is any doubt, the team will lose a lot of time in double validation, re-running tests and so on.

From the beginning, we saved a lot of time by relying on numbered versions of SonarQube coming from our CI pipeline. When we were losing faith in test results, we were able to quickly confirm which code we were testing. This would have been far more complex if we had used snapshot binaries coming from our own computers.

Make your tests easy to replay

From the early stages, replaying tests should cost nothing (or almost nothing). It is important to dedicate time to making your tests easy to deploy and easy to run. You never know how many times you will have to run them before finding a solution.

Simplify and compare

If the context looks too complex, start from scratch and work in a simpler one! First through raw SQL, then with a simple Java program, we found what performance we should normally get with SonarQube. And it was then quicker to tweak it to get a repro of the original issue!

Team coordination

Having a good level of collaboration is important in such a situation. If the team is not in sync, there are many ways to lose efficiency: machines are underused while waiting for tests, several people can end up on the same environment, there are hesitations on what has been tested and what the next step is…

This is even more important when the team is, like ours, spread over many sites.

Here are the tools we used to help the collaboration:

  • A dedicated Slack channel for all informal discussions
  • Short alignment meetings to recap current situation and define clear action plans for next steps as often as needed
  • A shared document providing key information about test environments, machine distribution, location of tests in progress, free environments.

Keep team motivation

When you focus on the same problem for many days in a row, it is important to keep the team motivated. Regularly recap the progress being made so as not to lose track of achievements, and have fun!! This is fuel for keeping the energy high in the team.

Learn from past experience

A lesson we take out of this is to run performance tests a bit earlier in the process.

Even if they had been scheduled for a while, performance tests arrived a bit too late in our roadmap. It added last-minute extra pressure on the team’s shoulders when we realized there was remaining work to do to reach good performance.

Guess what? This last point is already in our agenda. :wink:

Wrap up:

There are many exciting challenges like this at SonarSource. We have big ambitions, high motivation and we try to learn every day from our successes and failures. If you are interested in knowing more about our culture and what we try to achieve as a company, I recommend having a look at the #WeAreSonarSource website.

Looking forward to seeing you in the community, or maybe one day in one of our offices.
Pierre, Sebastien, Simon, Alex & Jean-Denis.


Investigation details

Our performance tests helped us narrow down the problem to an upgrade step that injects a lot of data into the database. A lot, like 650,000,000 rows. While PostgreSQL and Oracle perform the migration in less than 24h, SQL Server takes 14 days to complete it. (Yup, that’s an estimation, we didn’t really wait that long.)

Inserting 650M rows on PostgreSQL or Oracle runs at 30k inserts/sec, while on SQL Server it starts at 5k inserts/sec and slowly decreases to 200 inserts/sec. On the AWS RDS monitoring, the write IOPS look bad. Something is wrong.

At this stage, SonarQube reads from a table, does a little bit of transformation, and inserts the results into another table. Nothing fancy, everything should be fine. Since the code performs well on PostgreSQL and Oracle, there must be something wrong with SQL Server, right?
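To picture that step, here is a minimal sketch of the pattern over JDBC. The table and column names are made up for illustration and the DataSource is assumed; this is not the actual SonarQube migration code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

// Minimal sketch of a "read, transform, insert" migration step over JDBC.
// Table and column names are made up; this is not the actual SonarQube code.
class MigrationStepSketch {
  void migrate(DataSource dataSource) throws Exception {
    try (Connection read = dataSource.getConnection();
         Connection write = dataSource.getConnection();
         PreparedStatement select = read.prepareStatement(
             "SELECT uuid, value FROM source_table");
         PreparedStatement insert = write.prepareStatement(
             "INSERT INTO target_table (uuid, value) VALUES (?, ?)")) {

      write.setAutoCommit(false);
      int pending = 0;
      try (ResultSet rs = select.executeQuery()) {
        while (rs.next()) {
          insert.setString(1, rs.getString("uuid"));
          insert.setDouble(2, rs.getDouble("value")); // "a little bit of transformation" goes here
          insert.addBatch();
          if (++pending % 1_000 == 0) {               // flush and commit by batches
            insert.executeBatch();
            write.commit();
          }
        }
      }
      insert.executeBatch();
      write.commit();
    }
  }
}
```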

Chapter 1: Challenge the environment

The first suspect on our list is the driver version! Let’s check with a more recent version. Search the patch notes for performance improvements. No luck.

Since there are different versions of the driver depending on the version of Java, let’s then check with another JRE! No luck.

Something must be wrong with the DB engine version! We were running our tests with the oldest version of Microsoft SQL Server we are compatible with; let’s check with a 2017 version. The exact same results.

With an Enterprise Edition instead of a Standard Edition? Same results.

Something must be wrong in the DB configuration. Everyone double-checks that everything is right. Check the RDS instance, the database properties, the provisioned IOPS. No luck.

Hmm, then let’s try to insert data with regular SQL scripts. Come on! Performance is fine. From this point we must conclude there is no issue with the environment; we need to close this chapter and look somewhere else.

Chapter 2: Check connections and isolation

Something must be wrong with the DB connection pool used by SonarQube. We open 2 transactions: one with a read-only isolation level, and another one for writing. Maybe the pool is foobared and the transactions are somehow in conflict. We tried a version without reading, with only one connection, inserting random data. Same result? Depressing :’(

Double-check that SQL statements are correctly handled. Check the Java try-with-resources, and improve statement release. And oh my gosh, we found some bugs! Let’s run the code again with those improvements, I’m sure it will be blazing fast! Nope, still slow as hell.
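For context, “correctly handled” here means roughly the following. This is a generic try-with-resources sketch with a made-up query, not the exact SonarQube code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class StatementHandlingSketch {
  // Generic sketch: the statement and result set are closed automatically,
  // even when an exception is thrown, so no handles or server-side cursors leak.
  void readUuids(Connection connection, String key) throws SQLException {
    try (PreparedStatement ps = connection.prepareStatement(
             "SELECT uuid FROM some_table WHERE kee = ?")) { // made-up query
      ps.setString(1, key);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString("uuid"));
        }
      }
    }
  }
}
```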

Remove all the indexes! Tune the batch size! Analyse the code in production with an APM (Glowroot)! Try all the things!! But sadly nothing changes the insertion rate per second. Maybe SQL Server is just slow, and that’s life. How can people be using it?

Time to take a breath. We now trust the environment we are executing our tests on. We also removed all SonarQube code to simply insert random data in a table without any index.

What are the remaining options?

Chapter 3 : Think outside the box

At some point, the best thing is to start from scratch: a JDBC driver, a mug of coffee, one connection, a for loop, a prepareStatement and a few commits… And guess what?

Here we go: that small loop reaches 30k insertions per second! Woohoo!
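For the curious, that “small loop” was essentially something like the sketch below. The connection string, table name and batch size are illustrative, not the exact reproducer we used:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

// Stripped-down sketch of the "from scratch" benchmark: one connection, one
// prepared statement, a for loop and a commit every few thousand rows.
// Connection string, table name and batch size are illustrative.
public class RawInsertBenchmark {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:sqlserver://localhost:1433;databaseName=bench;user=sa;password=<secret>")) {
      conn.setAutoCommit(false);
      try (PreparedStatement ps = conn.prepareStatement(
               "INSERT INTO bench_table (uuid, payload) VALUES (?, ?)")) {
        long start = System.currentTimeMillis();
        int rows = 1_000_000;
        for (int i = 1; i <= rows; i++) {
          ps.setString(1, UUID.randomUUID().toString()); // random, non-sequential keys, like our PKs
          ps.setString(2, "some payload");
          ps.addBatch();
          if (i % 5_000 == 0) {                          // flush and commit by batches
            ps.executeBatch();
            conn.commit();
          }
        }
        ps.executeBatch();
        conn.commit();
        long elapsedSec = Math.max(1, (System.currentTimeMillis() - start) / 1000);
        System.out.println(rows + " rows inserted, ~" + (rows / elapsedSec) + " inserts/sec");
      }
    }
  }
}
```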

But the journey is not over: after a few hours, the rate started to decrease. The RDS monitoring? It showed that the write IOPS were increasing.

So now, for some reason, the rate decreases while the write IOPS increase. We found out that the index backing a Primary Key on Microsoft SQL Server is clustered. That means that the actual data for a row is stored in the same storage space as the index (hence the name clustered). The index is the table structure. And the index is a balanced tree, or b-tree. On a b-tree, inserting a value may cause node splits as the data goes through the tree. In a very large b-tree, we do not benefit from sequential reads/writes because data pages are so far apart on the disk that they don’t fit the disk cache. Moreover, non-sequential keys increase fragmentation (our PKs are random UUIDs). In the general case, we may not see the difference, but here our data contains BLOBs of unlimited size. That’s a lot to move every time. The effect on the DB engine is visible in the number of exclusive IO waits (PAGEIOLATCH_EX).

OK, let’s remove the PK during the insertion, and put it back when done. An interesting lead, which helped stabilize the throughput at a high rate.
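As a rough sketch of that workaround (table and constraint names are hypothetical, and the real migration handles errors and restarts more carefully), it boils down to dropping the clustered primary key before the bulk insert and recreating it afterwards:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Rough sketch of the workaround: drop the clustered primary key before the
// bulk insert, re-create it once all rows are in. Table and constraint names
// are hypothetical.
class PrimaryKeyWorkaroundSketch {
  void dropPrimaryKey(Connection conn) throws SQLException {
    try (Statement st = conn.createStatement()) {
      st.execute("ALTER TABLE target_table DROP CONSTRAINT pk_target_table");
    }
  }

  void recreatePrimaryKey(Connection conn) throws SQLException {
    try (Statement st = conn.createStatement()) {
      st.execute("ALTER TABLE target_table ADD CONSTRAINT pk_target_table PRIMARY KEY (uuid)");
    }
  }
}
```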

Chapter 4 : Final step

One piece of code is slow, the other is fast. The last step is to merge the slow code into the fast one until it becomes slow. And boom, the guilty line of code is revealed:

When the value is null, we are using setNull with Types.DECIMAL. The column type is decimal. It should have worked.

Under the hood, Microsoft SQL Server ends up doing an implicit CONVERT(). If we change to Types.DOUBLE, baoum, the rocket flies to the sky! It goes from 200 inserts/s to 30k/s just by using the double type. Indeed, this is not the expected behavior from the Microsoft SQL Server JDBC driver, and we filed a bug report with a reproducer so that they can tackle this at the JDBC driver level: https://github.com/microsoft/mssql-jdbc/issues/1106
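In code, the difference boils down to something like this. The parameter index and statement are simplified for illustration; the full reproducer is attached to the GitHub issue above:

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;

// Simplified illustration of the guilty line and its fix, for a nullable
// DECIMAL column bound as the first parameter of an insert statement.
class SetNullIllustration {
  void bindNullableValue(PreparedStatement ps, Double value) throws SQLException {
    if (value == null) {
      // Slow: with Types.DECIMAL, the server ends up doing an implicit CONVERT()
      // on every row -> ~200 inserts/s in our tests.
      // ps.setNull(1, Types.DECIMAL);

      // Fast: with Types.DOUBLE, the conversion goes away -> back to ~30k inserts/s.
      ps.setNull(1, Types.DOUBLE);
    } else {
      ps.setDouble(1, value);
    }
  }
}
```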
