Just an update, and this will conclude the issue for us: it really does all come down to disk and network performance (latency and/or throughput). The things we were doing that turned out to be the culprits:
- Database in a different network
- (Helm) Using Azure Fileshare for persistent storage
- (VM) Slow disks for temporary storage and Elasticsearch storage
Going forward, we are switching to a dedicated VM with fast storage. Azure ephemeral disks are good for this, and with a VM we don’t have to keep chasing probe timeouts as our database grows, the way we did with the Helm deployment.
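
For anyone landing here with similar symptoms, one rough way to check whether disk or network latency is the bottleneck is to time small synced writes on the storage path and TCP connects to the database. The following is only a minimal Python sketch under assumptions, not a proper benchmark: the function names, sample counts, and the `db.example.internal:5432` endpoint are hypothetical placeholders, and a dedicated tool will give more reliable numbers.

```python
import os
import socket
import statistics
import tempfile
import time


def fsync_latencies(directory, samples=20, block_size=4096):
    """Time write + fsync of small blocks in `directory` (seconds per sample)."""
    payload = os.urandom(block_size)
    timings = []
    # The temporary file is created inside `directory` and removed on exit.
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(samples):
            start = time.perf_counter()
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force the data down to the storage layer
            timings.append(time.perf_counter() - start)
    return timings


def tcp_connect_latencies(host, port, samples=5, timeout=3.0):
    """Time TCP connection setup to host:port (seconds per sample)."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return median and worst-case latency in milliseconds."""
    return {
        "median_ms": statistics.median(timings) * 1000,
        "max_ms": max(timings) * 1000,
    }


if __name__ == "__main__":
    # Point this at the directory backed by the storage you want to check.
    print("disk:", summarize(fsync_latencies(tempfile.gettempdir())))
    # Hypothetical database endpoint; replace with your own host and port.
    # print("db:", summarize(tcp_connect_latencies("db.example.internal", 5432)))
```

Running it once against the slow path (for example the file share mount) and once against local fast storage should make the difference in median and worst-case latency easy to compare.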