Issue: Quality Gate failed: Compute engine task did not complete within time

Hey all.

I’m sure there’s plenty to say about this situation once it’s resolved, but I can confirm that right now:

  • someone is looking at the situation
  • tasks have resumed processing, though it will take time to catch up

Thanks. We need urgent answers; a paid service cannot be unavailable for 24 hours.

After 24 hours, the status page finally got updated.

SonarSource should have a ticketing system for troubleshooting instead of relying solely on a forum.


@Colin_SonarSource
A similar issue was reported in 2018:
SonarCloud - Stuck on publishing quality gate result

All my builds are broken because of this. I now have 40 developers twiddling their thumbs, unable to work properly.

The issue has been ongoing for 24 hours. The overall cost of outages like this to businesses is catastrophic.

What is going on?

My last analysis has finished. Maybe the issue is fixed.

My last run was successful too. It ran a bit longer than usual at 13 minutes, but it worked.
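
For anyone whose pipeline hard-fails with the error in the thread title while waiting for the quality gate result: one possible workaround is to poll the Compute Engine task status yourself and decide in your own CI step how long to wait. The sketch below is only an illustration under assumptions, not an official SonarSource recommendation: it assumes the scanner wrote a `report-task.txt` file containing a `ceTaskUrl` entry (the `.scannerwork/` path varies by scanner) and that a `SONAR_TOKEN` environment variable holds an API token. If your scanner version supports the `sonar.qualitygate.wait` / `sonar.qualitygate.timeout` analysis parameters, raising the timeout may be the simpler option.

```python
# Minimal sketch (assumptions above): poll the Compute Engine task status
# instead of letting the build hard-fail when the quality gate result is slow.
import os
import time

import requests


def read_report_task(path=".scannerwork/report-task.txt"):
    """Parse the key=value file the scanner writes after analysis (path is an assumption)."""
    with open(path) as f:
        return dict(line.strip().split("=", 1) for line in f if "=" in line)


def wait_for_ce_task(timeout_s=1800, poll_s=15):
    """Return the final CE task status, or 'TIMEOUT' if it never finishes in time."""
    task_url = read_report_task()["ceTaskUrl"]  # e.g. https://sonarcloud.io/api/ce/task?id=...
    auth = (os.environ["SONAR_TOKEN"], "")      # token as username, empty password
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(task_url, auth=auth, timeout=30).json()["task"]["status"]
        if status in ("SUCCESS", "FAILED", "CANCELED"):
            return status
        time.sleep(poll_s)  # PENDING / IN_PROGRESS: keep waiting
    return "TIMEOUT"


if __name__ == "__main__":
    print(wait_for_ce_task())
```

Obviously this cannot make the queue move any faster; it only lets the pipeline keep waiting (or fail gracefully) on its own terms instead of breaking on the scanner's default timeout.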

The service has caught up on the backlog and is now fully operational. We are monitoring it actively. The issue is still under investigation.
You can follow the status at: https://sonarcloud.statuspage.io/


@Vince, @Colin_SonarSource,

Thank you for fixing the issue. Once the dust settles, will you be publishing a detailed post mortem? What I am particularly interested in is not necessarily what caused this specific outage (it happens, and such incidents can be unique). What we want to see is what tools and procedures SonarSource will put in place so that paying customers can escalate such outages and be acknowledged quickly.

I think I speak for many of your paying customers here when I say that this incident has put a dent in our trust in SonarSource, and that a strong response is required to help repair the damage.


@sonarsourcers @Colin_SonarSource: Finally, thanks for fixing it.

We would all like to know what went wrong, why there was no acknowledgement for more than 24 hours, and whether you are considering a proper support/ticketing system or intend to keep following this same process. There should be a proper way to reach you and, if needed, an escalation channel as well. It is clearly high time for this!

Also, it’s really annoying that you have restricted the number of replies per day while no one was acknowledging the issue. I raised it on Twitter as well, with no response as usual. I would appreciate it if such restrictions were removed.

Also, the status page still doesn’t mention what went wrong, and there are no details about the cause of the downtime.


@Colin_SonarSource @sonarsourcers: can someone kindly reply to my message above?

We all have a right to know what went wrong; this is a paid, licensed service.

It is really interesting that you have recorded the Compute Engine outage as a major outage with only 50 minutes of unavailability, when in reality it was closer to 30 hours or more!
I’m now even more interested to know why.


Excellent point. SonarSource does not seem to know how to handle problems, which includes removing posts that criticize the long wait to resolve a global problem that lasted more than 24 hours yet is shown as only 50 minutes.

Given that we don’t need any of the esoteric languages, we are now looking seriously at self-hosting the open source version.

Thanks @Mukesh_RT2019,

The Status Page information is now fixed. The displayed time was based on when we stood up the Status Page incident, rather than on when the issue actually started.

@AlxO thank you for correcting the timeline. Will you follow up with a post mortem and a procedure for customers to escalate so that outages don’t go unnoticed for this long in the future?

Hi everyone. First of all, we are very sorry for the inconvenience caused to you and the teams you support. We are conscious of the extent of the impact. Analysis report processing was not functional for more than 24 hours and this is not acceptable.

This incident uncovered gaps in our monitoring and escalation systems. A follow-up is in progress, including a postmortem review, which will trigger actions to prevent this from happening again. Some simple actions have already been implemented, and others will follow. The community input from this thread will also be used in that process.

Please accept our apologies for the outage.

Regards,
@AlxO


@AlxO: thanks for the update, Alex.
Do you have any update on what exactly the issue was and what went wrong? Presumably you know the cause by now, since that is how you eventually fixed it.

Thanks again.