Server downtime today

Started by Sami, February 12, 2022, 03:48:16 PM

Sami

We had a rather long downtime on our servers today, but now we are back online!

The reason for this outage was an unexpected problem in the server operating system upgrade process.

This task had been well tested before it was started, and the exact same upgrade had already been performed on two other production servers at our disposal (which use the same configuration), and naturally also by hundreds of other people worldwide before us, so we fully expected it to complete within the normal <5 minute timeframe here too. But for some unknown reason the process failed for us and left the servers unreachable (it appears to be a bug in the operating system upgrade process).

This failure forced us to start troubleshooting the upgrade, which took over an hour, but unfortunately this path seemed to lead nowhere, or at least would have taken considerably more time. After this we made the decision to proceed with a total rebuild of the server environment, which took another few hours but luckily went without problems, since this is something we're prepared for and have a pre-planned procedure for.

All systems should now be restored with no data loss at all, and the simulations have resumed from the point where they were left in the morning when the upgrade started. Please let us know if any system isn't operating normally.

Sorry for the prolonged downtime - that's very unusual for us, as you might know. :)