Few weeks ago, SonarSource was about to release SonarQube 7.9 LTS. This version will be supported for 18 months; it introduces many new features including our first move on security; all teams got involved and worked very hard on it. In a nutshell, this is a highly awaited release you do not want to mess up with.
Sadly, the day before the release, our upgrade tests reported very bad performances on Microsoft SQL Server while previous validation we made on PostgreSQL and Oracle were fine.
In such a situation, a reptilian reflex could be: “Let’s put all available devs in a room, order pizzas and lock the doors until a solution is found!’’
Well, this is an option, but such an approach may have disadvantages:
- It put a lot of pressure on the shoulders of a small group of people that already worked hard for months on this version
- It reduces the group to a certain category of people which may limit the options / the approach to investigate the problem
- You never know when (/if) you will reopen the doors
In such a situation, company culture can have a big impact on decision being taken. For us in this case, two of our core values influenced our next steps:
We want to continuously deliver. We believe in increments, timebox and baby steps as means to deliver often, deliver simple, get feedback and continue improving.
We believe that no one is as smart as all of us. We operate as a group with minimal hierarchy using collective intelligence, each of us being co-responsible for decisions made as well as relations with others.
(definitions copy-paste from our internal wiki)
What do you think happened? Did we ordered pizzas right away?
Not at all! Team representatives of Development, Product Management, Marketing, Services and Business met to find the best way to address the problem.
The output of this meeting? We then decided to:
- raise the awareness of our users in the release note
- deliver what we already had
- start a sprint to tackle the problem
Why delivering was important for us?
At this point of time, we had no clue what the problem was: environment, configuration, bug, … Hard to guess how much time it will take to identify the root cause, find a solution, deliver a fix. It means that if we delayed the release, there was no way for us to foresee how long all our users will have to wait.
It does not look like sain conditions to investigate a problem.
In addition, we knew that only fraction of our users may be impacted by this problem: users with significant SonarQube dataset who are using Microsoft SQL Server. All the rest of our users can safely enjoy this new LTS without being bothered by this problem.
Why did we run sprint?
The ultimate goal is to avoid the tunnel effect. Defining a sprint of 5 days helped us to set expectations and put ourselves in good working conditions.
Without any time frame:
- You start guessing your next test is the last one which most of the time prevent you to industrialize your setup and your tests
- There is no end until you find the solution which may be hard to keep team’s motivation
- Harder to handle expectation of the rest of the world (other SonarSourcers, users, customers)
What about collective intelligence mentioned earlier?
First, we collectively weigh pros and cons and took decision with all impacted team representatives.
Secondly, team we setup for the investigation was a mix coming from different teams:
- development team (x3)
- support (x1)
- SonarCloud infrastructure (x1) (since tests were running on RDS)
To put a long story short, the team managed to identify the problem the last day of the sprint and fix has been shipped as part of 7.9.1. The root cause was a limitation of the jdbc driver. Here is the report we open on jdbc driver side.
If you are interested in the investigation details, we will post step by step details in a comment bellow.
What did we learn?
First it was 5 days of nice and intense collaboration. Some people in the team never had the occasion to collaborate before. There were depressing deadends, funny wrong paths and a nice conclusion.
Here are some of the key point that really helped us:
Know what you are testing
The environment, the measure you are using or the piece of code you are testing, everybody in the team is 100% confident about what is being done. If there is any doubt, the team will lose a lot of time in double validation, re-running tests and so on.
From the beginning, we saved a lot of time relying on numbered version of SonarQube coming from our CI pipeline. When we were losing faith in tests results, we were able to quickly confirm the code we were testing. This would have been far more complex if we would have used snapshot binaries coming from our computers.
Make your tests easy to replay
From the early stage replaying tests should cost nothing (or almost). This is important to dedicate time to make your tests easy to deploy and easy to run. Never know how many times you will have to run them before finding a solution.
Simplify and compare
If the context looks too complex, start from scratch to work in a simpler one! First through raw SQL, then with a simple Java program, we found what performance we should normally get with SonarQube. And it was then quicker to tweak it to get to a repro of the original issue!
Having good level of collaboration is important in such a situation. If the team is not in sync, there are many ways to lose efficiency: machines are under used pending for tests, many people can be on the same environment, hesitations on what has been tested and what is the next step…
This is even more important when the team is like us, spread over many sites.
Here are the tools we used to help the collaboration:
- A dedicated Slack channel for all informal discussions
- Short alignment meetings to recap current situation and define clear action plans for next steps as often as needed
- A shared document providing key information about tests environments, machine distribution, location of tests in progress, free environment.
Keep team motivation
When you focus on the same problem many days in a row, this is important to keep motivation in the team. Regularly recap the progress being made to not lose track of achievement and have fun!! This is fuel for keeping high energy in the team.
Learn from past experience
A lesson we take out of this is to run tests performances a bit earlier in the process
Even if there were schedule since a while, performance tests arrived a bit too late in our roadmap. It added a last minute extra pressure on team shoulders when we realize there were remaining work to do to reach good performances.
Guess what? This last point is already in our agenda.
There are many exciting challenge like this at SonarSource. We have big ambitions, high motivation and we try to learn everyday from our success and failure. If you are interested to know more about our culture and what we try to achieve as a company, I recommend you having a look to #WeAreSonarSource website.