On February 26, 2024, Cornerstone's engineering team responded swiftly to alerts from our internal monitoring tools that signalled issues with queuing functionality affecting clients in the US SL1 (AWS) Prod swimlane.
We began investigating immediately and traced the root cause to one of the instances in the cluster that manages the queuing functionality. Further analysis pinpointed a specific queue component within the cluster that had become stuck, which caused the observed delays. To resolve this, we promptly removed the problematic queue component and processed the backlog in the queue, restoring normal operations.
To minimize the risk of recurrence, Cornerstone worked with the application team to implement a permanent fix. This fix aims to safeguard against similar disruptions and ensure the continued reliability of our systems.