SQ - LTS 7.9.4, after upgrade issue

Hi Team,

We recently upgraded SQ from 6.7.1 to 7.9.4 LTS. The technical components in use are listed below.

  1. SQ 7.9.4 LTS
  2. DB - AWS Aurora 10.7 Postgresql
  3. JDK - OpenJDK - 11.0.5
  4. Using AWS EFS for the data persistency to data and extensions dirs
  5. Deployed in Kubernetes cluster

Issue:

  1. After the upgrade it worked fine until today, when it suddenly went down. It does not come back up even after multiple restarts. Investigating the logs, we found the warning below:
2020.10.13 19:32:44 WARN  es[][o.e.c.InternalClusterInfoService] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
  2. We then removed the es6 folder to force a full reindex. It is now stuck at the step below and has remained there for almost an hour:
2020.10.13 18:04:01 INFO  web[][o.s.s.e.IndexerStartupTask] Indexing of type [projectmeasures/auth/projectmeasure] ...
2020.10.13 18:05:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)
2020.10.13 18:06:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)
2020.10.13 18:07:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)
2020.10.13 18:08:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)
2020.10.13 18:09:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)
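
The stall shown in the log excerpt above can be quantified with a small script. This is a minimal Python sketch, not a SonarQube tool: it assumes the web log keeps the line format shown in this thread, and the names `BULK_LINE`, `stalled_seconds` and `SAMPLE` are hypothetical. It reports how long the most recent run of "0 requests processed" reports has lasted.

```python
import re
from datetime import datetime

# Assumed line format, copied from the excerpt in this thread:
# 2020.10.13 18:05:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)
BULK_LINE = re.compile(
    r"^(?P<ts>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2})\s+\w+\s+"
    r"web\[[^\]]*\]\[o\.s\.s\.es\.BulkIndexer\]\s+"
    r"(?P<count>\d+) requests processed"
)


def stalled_seconds(lines):
    """Return the duration (in seconds) of the trailing run of zero-progress reports."""
    run_start = None
    run_end = None
    for line in lines:
        match = BULK_LINE.match(line.strip())
        if not match:
            continue  # not a BulkIndexer progress line
        ts = datetime.strptime(match.group("ts"), "%Y.%m.%d %H:%M:%S")
        if int(match.group("count")) == 0:
            if run_start is None:
                run_start = ts
            run_end = ts
        else:
            # Any progress resets the stall window.
            run_start = run_end = None
    if run_start is None:
        return 0
    return int((run_end - run_start).total_seconds())


# The lines posted above, used as sample input.
SAMPLE = [
    "2020.10.13 18:04:01 INFO  web[][o.s.s.e.IndexerStartupTask] Indexing of type [projectmeasures/auth/projectmeasure] ...",
    "2020.10.13 18:05:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)",
    "2020.10.13 18:06:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)",
    "2020.10.13 18:07:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)",
    "2020.10.13 18:08:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)",
    "2020.10.13 18:09:03 INFO  web[][o.s.s.es.BulkIndexer] 0 requests processed (0 items/sec)",
]
```

Calling `stalled_seconds(SAMPLE)` on the lines above covers the window from the first to the last zero-progress report. To run it against a full log, pass the open file object instead of the list.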

Please suggest what the issue could be here. Let me know if any additional details are needed.

Thx

Hello @sandeepsharmadevops,

It’s hard to understand what’s happening here. Note that we do not yet support Kubernetes deployment, and I don’t know exactly what “AWS EFS for the data persistency to data and extensions dirs” is. I suspect the issue is coming from one of these things.

Have you tried to simplify your deployment and change things one by one in order to isolate the problem? You listed 5 of the changes you’ve made from SQ 6.7 to 7.9; doing things one by one will allow you to identify where things go south, then work on this specific part.

I hope this helps.

Cheers

Thanks @Antoine for your update. Let me clarify the items.

SQ LTS 6.7.1 has been in use for almost 3 years and hosts 15K+ projects. It is deployed in a Kubernetes cluster with AWS EFS, an NFS-compliant storage service, for data persistence. As part of the upgrade from 6.7.1 to 7.9.4, only the SQ version and its required JDK version changed; everything else is the same.

1. SQ 7.9.4 LTS
2. DB - AWS Aurora 10.7 Postgresql
3. JDK - OpenJDK - 11.0.5
4. Using AWS EFS for the data persistency to data and extensions dirs
5. Deployed in Kubernetes cluster

The latest issue we are facing is as follows:

  1. Per the SQ recommendations, NFS/CIFS-based storage should not be used to host SQ data, as it can hurt performance and introduce a single point of failure.
    [AWS EFS (a highly available, NFS-compliant storage service) has been used since day one with SQ 6.7.1 to persist the extensions and data dirs. SQ hosts 15K+ projects, and the ES index data is around 9+ GB.]

  2. After the upgrade, SQ does not come up and logs the warnings below when it uses the same AWS EFS storage as before. Without AWS EFS (on local storage) it works fine and comes up, even when it has to recreate the indexes (9+ GB). To mitigate the index generation issue with AWS EFS, we cleaned retired projects out of the DB, and the ES data is now around 700 MB (measured with SQ started on local storage, i.e. without AWS EFS).

2020.10.13 12:37:14 WARN  es[][o.e.c.InternalClusterInfoService] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
2020.10.13 12:38:04 WARN  es[][o.e.c.InternalClusterInfoService] Failed to update node information for ClusterInfoUpdateJob within 15s timeout

Given the current ES data size, SQ should come up with AWS EFS, but it does not: it logs the above warnings and fails to start.

Questions:

  1. Does SQ not support NFS/CIFS storage solutions at all, and can situations like ours arise because of that?
  2. If it does support them, what best practices should one adopt when hosting SQ on such storage?
  3. What could be the reason behind the above warning messages: I/O latency or something else?
  4. Is there any timeout value that can be increased at the SQ level to avoid such issues, so that SQ can work seamlessly?

I hope the above details clarify the issues, so that recommendations/suggestions can be shared accordingly.

Let me know if any additional details are required.

Thx

Hello,

As I can’t test anything using AWS EFS it is hard to comment about it. The requirement about not using NFS/CIFS solution for storage is more a strong suggestion than a hard requirement, in the sense that it can work, it’s just usually slower, less reliable, etc. In your case it worked before apparently.

There is no timeout value that could be changed for that. The issue could be related to IO indeed, it could also be about a lack of memory: a user with a similar issue here fixed it by providing more memory to his container.

I can’t really investigate here, but if you want to use this solution, what I can suggest is:

  • enable debug mode to see if there is more logging, which could help to understand what is going on; also make sure to analyze all log messages, not only this warning
  • check what is the real I/O rate on your EFS disk to understand if there is at least something happening or not at all
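
One rough way to compare I/O behaviour between local storage and the EFS mount is to time small synced writes in each directory. The following is a minimal Python sketch under stated assumptions: it is not a SonarQube or AWS tool, the function names `probe_write_latency` and `summarize` are hypothetical, and small fsynced writes are only an approximation of what Elasticsearch does on disk. The directory to probe (for example the mounted data dir) has to be supplied by the caller.

```python
import os
import statistics
import tempfile
import time


def probe_write_latency(directory, rounds=20, size=4096):
    """Write and fsync `rounds` small temp files in `directory`.

    Returns the list of per-write latencies in seconds.
    Temp files are removed after each round.
    """
    payload = os.urandom(size)
    latencies = []
    for _ in range(rounds):
        fd, path = tempfile.mkstemp(dir=directory, prefix="ioprobe-")
        try:
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to the backing store
            latencies.append(time.perf_counter() - start)
        finally:
            os.close(fd)
            os.remove(path)
    return latencies


def summarize(latencies):
    """Return min, median and max latency in milliseconds."""
    return {
        "min_ms": min(latencies) * 1000,
        "median_ms": statistics.median(latencies) * 1000,
        "max_ms": max(latencies) * 1000,
    }
```

Running `summarize(probe_write_latency(path))` once against a local directory and once against the EFS-backed directory gives two sets of numbers to compare. A large gap would point towards storage latency rather than memory.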

Cheers

Makes sense, @Antoine. We investigated the points below before reaching out to the community for expert comments/suggestions.

  1. Ran with DEBUG log level; nothing specific found.
  2. Increased the heap for all SQ internal components, i.e. CE (1 GB), WEB (1 GB), ES (4 GB). SQ worked fine with even lower values on local storage, and the observed heap consumption was only about 50%. Are there any recommendations on increasing the heap further when using AWS EFS storage?

I will go ahead with other suggested action items.

Thanks again!