Back online after a server failure - and significant loss of data
microstudio.dev is back online today after a complete server failure on Saturday, July 8th. Unfortunately, data was lost in the process: the site has been restored to the state it was in 2 months ago.
What happened
The server went down on Saturday evening (CEST), July 8th. In the hosting provider's administration dashboard, the server was shown as running but disconnected from the internet. We contacted them immediately, and after about an hour they got back to us with an answer: the server was completely dead, and all they could do was provide a replacement. As soon as we received their answer, the server's status changed to "unrecoverable" and every option we had on it was gone (such as restarting in rescue mode). So far, no big deal though; this is something we were prepared for: all we needed was to reinstall on the new server and reload the data from the latest backup. The only problem was that I was far away from home and office, so I needed the (always benevolent) help of my pal @mattamore. We agreed on the phone that we would do the reinstall the next morning (Sunday).
Data is missing
On Sunday morning, we were preparing the new server instance and had a quick look at the data backups, only to notice that something was off with the dates of the files. We correctly had one "snapshot" folder created every day, but in the most recent ones, the newest file we could find was dated May 9th. After some digging, we understood what had happened: every day, our backup script was syncing files from the production server to a local server; then a snapshot of the folder was created with the timestamp of the day, and a notification was sent to us to confirm that the daily backup had completed. The only problem was that, for some reason, the syncing (using rsync commands) had not actually been working since May 9th. We had been making a snapshot of the same old folder contents every day. This went undetected because the script kept sending a backup confirmation every day, even though the sync was failing. We still have to find out why the syncing stopped working (we will look into this soon).
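For illustration only, here is a minimal sketch (in Python, with hypothetical paths and thresholds, not our actual script) of the kind of check that would have caught this: the backup is only considered successful if rsync exits cleanly and the mirrored folder actually contains recent files.

```python
import subprocess
import sys
import time
from pathlib import Path

# Hypothetical locations; the real backup script uses different paths.
SOURCE = "backup@production.example.com:/var/data/"
DEST = Path("/backups/mirror/")
MAX_AGE_SECONDS = 24 * 3600  # newest synced file must be less than a day old

def newest_mtime(root: Path) -> float:
    """Return the most recent modification time found under root."""
    return max((p.stat().st_mtime for p in root.rglob("*") if p.is_file()), default=0.0)

def run_backup() -> bool:
    # rsync returns a non-zero exit code on any failure; never ignore it.
    result = subprocess.run(["rsync", "-az", "--delete", SOURCE, str(DEST)])
    if result.returncode != 0:
        print(f"rsync failed with exit code {result.returncode}", file=sys.stderr)
        return False
    # Independent sanity check: the mirror must actually contain fresh files,
    # otherwise the sync is silently stale even if rsync "ran".
    if time.time() - newest_mtime(DEST) > MAX_AGE_SECONDS:
        print("mirror contains no recent files; sync looks stale", file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    if not run_backup():
        sys.exit(1)  # a failing exit code should trigger an alert, not a confirmation
```

The key point is that the confirmation should depend on evidence that fresh data arrived, not merely on the script having run to the end.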
What's next
I want to say how mortified I am that I let all this happen. I should have tested the backups more thoroughly and more often. I feel terrible, knowing that many of you have lost hours or days of your work. I sincerely, deeply apologize to you all.
(We are currently trying to get some help from our hosting provider; this is our last hope. If they somehow manage to give us access to the server, its SSD, or its files, we should be able to recover the missing 2 months of data. We are waiting for their answer, fingers crossed.)
Update: we have exhausted all the options with our hosting provider; the server remains unreachable. We offered to buy back the server or its SSD, but they declined.
I will spend the next few days rethinking the backup system, with additional redundancy, reliable automated verification and alerting, and a manual verification process. I also want to maintain a spare server, ready to take over whenever the main server fails, so that none of this can ever happen again.
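One idea for the automated verification, sketched below under assumptions about the layout (one dated folder per daily snapshot under a hypothetical snapshots/ directory): a separate check could compare the two most recent snapshots and raise an alert if they are identical, which is exactly the failure mode we just experienced. It could false-positive on a day where nothing changed, but for a site with daily activity that is a trade-off worth making.

```python
import filecmp
import sys
from pathlib import Path

# Hypothetical layout: one dated folder per daily snapshot, e.g. snapshots/2023-07-08/
SNAPSHOT_ROOT = Path("/backups/snapshots")

def snapshots_identical(a: Path, b: Path) -> bool:
    """Recursively compare two snapshot folders; True means nothing changed."""
    cmp = filecmp.dircmp(a, b)
    if cmp.left_only or cmp.right_only or cmp.diff_files or cmp.funny_files:
        return False
    return all(snapshots_identical(a / sub, b / sub) for sub in cmp.common_dirs)

def check_latest_snapshots() -> None:
    snapshots = sorted(p for p in SNAPSHOT_ROOT.iterdir() if p.is_dir())
    if len(snapshots) < 2:
        return
    latest, previous = snapshots[-1], snapshots[-2]
    if snapshots_identical(latest, previous):
        # Identical consecutive snapshots suggest the sync is silently stale.
        print(f"ALERT: {latest.name} is identical to {previous.name}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    check_latest_snapshots()
```

Run from a machine other than the backup server itself, a check like this stays useful even when the backup script is the thing that is broken.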