Plugin download fails

@ganncamp This issue has never been resolved for us. Something about the way SonarQube specifically sends responses to our load balancer is causing this failure. None of the hundreds of other servers behind the same load balancer face this issue. Please schedule resources to investigate; I can reproduce it regularly and provide any logs.

The end user error is always some variant of the following:
java.lang.IllegalStateException: Fail to download plugin [iac] into /opt/sonar-scanner/.sonar/_tmp/fileCache2582033732133438722.tmp
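
For reference, the scanner-side stack trace behind this message can be pulled with debug output; a quick sketch (-X is the standard scanner debug flag, and the host URL is just a placeholder):

$ sonar-scanner -X -Dsonar.host.url=https://sonarqube.[MASKED].com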

Hi @jasondamour,

Why is your single SonarQube instance behind a load balancer?

 
Ann

It really shouldn’t matter why we chose this particular setup. We have a single load balancer serving as the single entry point for several applications. This same load balancer handles 50k+ requests/day for our Maven artifact server with no issues; it is only SonarQube that causes this problem. Here are the load balancer logs: the backend is taking several minutes per request for plugin downloads, so it appears to be a SonarQube server issue, not the network layer.

I can provide, and already have provided, my full server configuration details:

  env:
  - name: SONAR_WEB_HTTP_MAXTHREADS
    value: "200"
  - name: SONAR_WEB_HTTP_ACCEPTCOUNT
    value: "50"
  - name: SONAR_WEB_JAVAADDITIONALOPTS
    value: "-Xms2048m -Xmx2048m"
  - name: SONAR_CE_JAVAADDITIONALOPTS
    value: "-Xms2048m -Xmx2048m"
  - name: SONAR_SEARCH_JAVAADDITIONALOPTS
    value: "-Xms1024m -Xmx1024m"

  # Web Xmx + CE Xmx (x2) + Search Xmx = 2048 + 2048x2 + 1024 MB ≈ 7Gi
  resources:
    requests:
      cpu: 2
      memory: 7Gi
    limits:
      cpu: 4
      memory: 7Gi

I also have months of data on all 527 custom Prometheus metrics from SonarQube that could be used to debug. I have meticulously inspected each one already and found no indicator of the issue, but I will provide full access if it helps.

We’re an enterprise customer and have escalated this a couple of times to our TAM, who has redirected us here since we will not pay $20k for a support package for this one simple issue. So I’ll continue to escalate here until it’s resolved.

I’ve tried debugging this myself for months. I even opened a PR to SonarQube, which was accepted and released but didn’t fix the issue: Need ability to configure Tomcat KeepAliveTimeout - #3 by jasondamour

Adding on. The GCP load balancer logs this for all the failed requests:
statusDetails: "backend_connection_closed_after_partial_response_sent"
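
For anyone who wants to pull those entries themselves, something like this against Cloud Logging should work (a sketch; the resource type and field path are what external HTTPS load balancer logs normally use, so adjust to your project):

$ gcloud logging read \
    'resource.type="http_load_balancer" AND jsonPayload.statusDetails="backend_connection_closed_after_partial_response_sent"' \
    --limit=20 --freshness=1d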

@ganncamp What else can I provide to aid identifying the issue here?

Hi,

I’ve flagged this for more expert attention. Hopefully, they’ll be along soon.

 
Ann

Hey @jasondamour ,

Have you tried hitting the same endpoints (or just a single one) the same way the scanners do, on the same machine where SonarQube is installed, so that you can check download times while bypassing the load balancer?

For instance: http://your-sq-instance/api/plugins/download?plugin=javascript&acceptCompressions=pack200

I would like to confirm whether that’s a server-side or client-side issue.
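
Something along these lines would do it (just a sketch; localhost:9000 assumes the default port, and any installed plugin key works):

$ curl -sS -o /dev/null \
    -w 'status=%{http_code} time_total=%{time_total}s size=%{size_download}\n' \
    "http://localhost:9000/api/plugins/download?plugin=javascript&acceptCompressions=pack200"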

Sure @jacek.poreda, I will test it this week.

Hi all, is there any news regarding this issue? I have a similar problem, but I don’t have a load balancer, just a reverse proxy. Manually downloading the files that my pipeline could not fetch is really fast. Downloading the other plugins in the pipeline also takes under 20ms each; it just stops at a random one.

Hey @jasondamour, do you have any updates?

Hi @jacek.poreda, I think I have an update. I can reproduce the error without the GCP load balancer hop.

I deployed a pod on the same cluster as our SonarQube server and ran a load test downloading a plugin from the server, once via the internal K8s service and once via the external HTTPS load balancer. I originally tried the commonly used Apache Benchmark (ab) tool, but our load balancer requires HTTP/1.1 and ab only supports HTTP/1.0, so both tests use h2load with the same configuration.

External Load Balancer Load Test:

$ h2load  -n 1000 -c 100 -T 10s https://sonarqube.[MASKED].com/api/plugins/download?plugin=cpp

starting benchmark...
spawning thread #0: 100 total client(s). 1000 total requests
TLS Protocol: TLSv1.3
Cipher: TLS_AES_128_GCM_SHA256
Server Temp Key: X25519 253 bits
Application protocol: h2
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
finished in 514.40s, 1.91 req/s, 102.50MB/s
requests: 1000 total, 994 started, 989 done, 980 succeeded, 20 failed, 20 errored, 4 timeout
status codes: 993 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 51.49GB (55285310234) total, 59.62KB (61047) headers (space savings 83.16%), 51.45GB (55238990142) data
                     min         max         mean         sd        +/- sd
time for request:   322.14ms      14.37s       7.28s       2.31s    72.65%
time for connect:    18.36ms     89.99ms     60.50ms     19.29ms    61.63%
time to 1st byte:   121.94ms       7.20s       2.98s       2.13s    70.93%
req/s           :       0.00        0.15        0.10        0.02    84.88%

h2load doesn’t have the best client-side logging, but we can easily check our Google Cloud logs and see the same error that is plaguing our CI/CD systems. The requests return 200 and it looks like all the data is transmitted, but the request never terminates, and the load balancer log summarizes it as statusDetails: "backend_connection_closed_after_partial_response_sent"

Internal Load Test:

$ h2load -v -n 1000 -c 100 -T 10s http://sonarqube-sonarqube.default:9000/api/plugins/download?plugin=cpp
...
progress: 70% done
progress: 80% done
progress: 90% done
[stream_id=19] x-content-type-options: nosniff
[stream_id=19] cache-control: no-cache, no-store, must-revalidate
[stream_id=19] sonar-md5: 6b77cb35e18da7134308ed017e64adb2
[stream_id=19] content-type: application/java-archive
[stream_id=19] date: Tue, 01 Nov 2022 00:33:07 GMT
[stream_id=19] x-envoy-upstream-service-time: 91
[stream_id=19] server: envoy
[stream_id=19] :status: 200
[stream_id=19] x-frame-options: SAMEORIGIN
[stream_id=19] x-xss-protection: 1; mode=block
[stream_id=19] x-content-type-options: nosniff
[stream_id=19] cache-control: no-cache, no-store, must-revalidate
[stream_id=19] sonar-md5: 6b77cb35e18da7134308ed017e64adb2
[stream_id=19] content-type: application/java-archive
[stream_id=19] date: Tue, 01 Nov 2022 00:33:07 GMT
[stream_id=19] x-envoy-upstream-service-time: 81
[stream_id=19] server: envoy

The logs stop here and never continue. This is reproducible with these load configurations; it always fails around the 90% mark. In the external test, the load balancer is configured with a 180-second timeout, which is why most requests stop there. But in the internal test, even after configuring h2load with a 10-second timeout (-T 10s), the test hangs indefinitely. According to the SonarQube Tomcat metrics, SonarQube’s embedded web server has a timeout of 60,000 ms, i.e. 1 minute. However, even after letting the test sit for nearly an hour, it does not complete.

I’ve configured SonarQube with the following log options for debugging:

- name: SONAR_LOG_LEVEL_WEB
  value: DEBUG
- name: SONAR_LOG_LEVEL_APP
  value: DEBUG
- name: SONAR_WEB_ACCESSLOGS_ENABLE
  value: "true"

Going into the filesystem, I see the following log entries:

access.log

...
127.0.0.1 - - [01/Nov/2022:02:59:26 +0000] "GET /api/plugins/download?plugin=cpp HTTP/1.1" 200 - "-" "h2load nghttp2/1.47.0" "AYQxBwFk/o1eN4YOAAwG"
127.0.0.1 - - [01/Nov/2022:02:59:26 +0000] "GET /api/plugins/download?plugin=cpp HTTP/1.1" 200 - "-" "h2load nghttp2/1.47.0" "AYQxBwFk/o1eN4YOAAwL"
127.0.0.1 - - [01/Nov/2022:02:59:26 +0000] "GET /api/plugins/download?plugin=cpp HTTP/1.1" 200 - "-" "h2load nghttp2/1.47.0" "AYQxBwFk/o1eN4YOAAwM"
127.0.0.1 - - [01/Nov/2022:02:59:26 +0000] "GET /api/plugins/download?plugin=cpp HTTP/1.1" 200 - "-" "h2load nghttp2/1.47.0" "AYQxBwFk/o1eN4YOAAwP"
127.0.0.1 - - [01/Nov/2022:02:59:26 +0000] "GET /api/plugins/download?plugin=cpp HTTP/1.1" 200 - "-" "h2load nghttp2/1.47.0" "AYQxBwFk/o1eN4YOAAwQ"
127.0.0.1 - - [01/Nov/2022:02:59:26 +0000] "GET /api/plugins/download?plugin=cpp HTTP/1.1" 200 - "-" "h2load nghttp2/1.47.0" "AYQxBwFk/o1eN4YOAAwO"
127.0.0.1 - - [01/Nov/2022:02:59:26 +0000] "GET /api/plugins/download?plugin=cpp HTTP/1.1" 200 - "-" "h2load nghttp2/1.47.0" "AYQxBwFk/o1eN4YOAAwN"

This is the end of the file, with no new entries, while the test sits hanging. There is no indication of any issue; in fact, I can’t find a problem in any of the log files. I have all the files saved, so please let me know what steps you would like me to try.

Hello @jasondamour, huge thanks for taking the time to help us troubleshoot this by giving us all that information and testing the different scenarios.

Based on what you gave us, there is still something I would like to try, given the configuration of your cluster.

Could you try to run the same tests, but inside the pod, with a sidecar container targeting localhost?

To do that, the chart has Values.extraContainers, which lets you define sidecar containers. (There is an h2load Docker image, but it might be old and untrustworthy; we could build a small one for this or just script everything in an official OS image.)
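
As a minimal sketch (the image and command are placeholders; you would kubectl exec into the sidecar afterwards and run the downloads against localhost:9000):

  extraContainers:
  - name: loadtest
    image: alpine:3.16
    command: ["sh", "-c", "sleep infinity"]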

I have seen similar cases caused by the cloud infrastructure, and I would like to run the most precise test possible, as close to the SonarQube instance as we can get.

Depending on the result, I will be very happy to keep investigating what could be wrong.

PS: I have never worked in depth with Google Cloud, but on similar cloud platforms I have seen many issues caused by (quick checks are sketched after the list):

  • the IOPS of the disk
  • the bandwidth of the network interface card
  • bugs in the container network interface (CNI)
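
For the first two points, quick node-level checks could look like this (a sketch using standard sysstat tools, run on the node or from a privileged debug pod):

$ iostat -dx 5 3   # extended disk stats: utilization and await times
$ sar -n DEV 5 3   # per-interface network throughput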

Another point: the default monitoring the cloud gives us might be misleading. For example, when it was a Microsoft process taking up all the IOPS, it did not show up on the graph, so on our side everything looked ‘OK’.

PS2: that does not exclude the possibility of a SonarQube bug, and I hope the test will give us that information.

Thanks Jeremy,

Today I tested further by running requests from the SonarQube container itself, bypassing the Istio Envoy sidecar proxy we have. Since I can’t install anything on the SonarQube image (a good thing), I ran the test like this:

$ while true; do wget -S http://localhost:9000/api/plugins/download?plugin=cpp -O /dev/null; done

saving to '/dev/null'
null                 100% |**********************************************************************************************************************************************| 53.1M  0:00:00 ETA
'/dev/null' saved
Connecting to localhost:9000 (127.0.0.1:9000)
  HTTP/1.1 200 
  X-Frame-Options: SAMEORIGIN
  X-XSS-Protection: 1; mode=block
  X-Content-Type-Options: nosniff
  Cache-Control: no-cache, no-store, must-revalidate
  Sonar-MD5: 6b77cb35e18da7134308ed017e64adb2
  Content-Type: application/java-archive
  Transfer-Encoding: chunked
  Date: Tue, 01 Nov 2022 18:14:12 GMT
  Connection: close
  
saving to '/dev/null'
null                 100% |**********************************************************************************************************************************************| 53.1M  0:00:00 ETA
'/dev/null' saved
Connecting to localhost:9000 (127.0.0.1:9000)
  HTTP/1.1 200 
  X-Frame-Options: SAMEORIGIN
  X-XSS-Protection: 1; mode=block
  X-Content-Type-Options: nosniff
  Cache-Control: no-cache, no-store, must-revalidate
  Sonar-MD5: 6b77cb35e18da7134308ed017e64adb2
  Content-Type: application/java-archive
  Transfer-Encoding: chunked
  Date: Tue, 01 Nov 2022 18:14:13 GMT
  Connection: close

These logs continue forever; I let it run for about 15 minutes with no issues. On one hand, I suspect it is an issue with how the Envoy proxy handles the content sent by SonarQube, possibly the Transfer-Encoding: chunked header. On the other hand, I’ve seen other reports of the same issue on the forum from people who do not use Istio/Envoy.

Tomorrow or Monday I will try the sidecar image, to generate higher multi-threaded load.
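
The plan is simply to mirror the earlier internal command from the sidecar (same flags as the tests above, just pointed at the local web process):

$ h2load -n 1000 -c 100 -T 10s http://localhost:9000/api/plugins/download?plugin=cpp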

Hi everyone, I’ll post my last update.

The issue is resolved. It was somehow caused by the Istio proxy we had in our stack. I don’t know how or why, since the same Envoy istio-proxy works fine for hundreds of our other servers. I wish I could say for sure.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.