I want to apologize to anyone who has found me difficult to reach or slow to respond over the past four or five weeks.
About eight weeks ago, I did a major server and workstation upgrade for a client. We basically replaced everything. The upgrade and conversion went quite well, actually. The owner was more than a little happy, in fact -- he said, "Jon, I have to give you kudos; everything is way faster and it works almost perfectly." Nice, and I got my completion payment.
Well, that happiness lasted about another 10 days. My server monitoring e-mails began showing strange errors. Whoops! Most USB bus devices weren't being recognized -- what's up with that? The server was reporting that the RAID was "Dead", even though it continued to serve files properly. We got the equipment vendor involved in trying to diagnose the problems, but even with their best engineers on the case, no one could fix or even explain the issues. Fortunately, I had the old server on standby and was able to sync them before the vendor attempted a fix. Good thing. When I had contacted the vendor, the servers were still serving files, no problem. After two attempts at a fix -- it was all bad -- the server wasn't talking to anyone. Oh well. 760 GB of data, over a million files, was toast. Fortunately, we were also using a "cloud" continuous backup, so our actual data loss was only a few files. But everything still had to be reinstalled, from the ground up, on a reinitialized server.
It's difficult to assign blame directly. The system worked right for about 20 total days; then something really bad happened. The vendor got right on it, but they were as mystified as I was. It appears this was a unique failure, and they assigned a top engineer (unfortunately for me, he was in Sydney, Australia, literally half a world away). However, there was not a question I could pose (other than "why?") that he could not answer directly and without reference, so I would call him very knowledgeable. We were able to ditch the "call support and give them your case number" part of the usual relationship, and he contacted me directly by telephone and e-mail repeatedly. I am a "pretty to very" experienced UNIX/Linux systems engineer and I was baffled; so were the vendor and all their engineers. Oh well, doggie do does happen. It cost me a huge amount of foregone revenue, as I had to redo the install and bring back all the data. Not frikkin' pretty. The client and I split the cost of reinstallation and restoration, so there was a lot of mutual pain, but they are back to being reasonably happy, which is as it should be.
This certainly makes the perfect argument for "RAID is not a substitute for proper backup". Had we not had local USB-based backups, a cloud continuous-backup service, and a standby server (how many companies have all those levels of protection?), we would have been royally screwed. How you lose two partitions on a Linux RAID system is still inexplicable.
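For what it's worth, the kind of monitoring e-mail that first tipped me off can be automated cheaply. Here's a minimal sketch, assuming Linux md software RAID (the post doesn't say which RAID implementation was involved, and the helper name `check_mdstat` is my own); it just looks for a failed member (an underscore in the `[UU]` status field of `/proc/mdstat`):

```shell
#!/bin/sh
# Hypothetical RAID health check for Linux md software RAID.
# Reads an mdstat-format file and reports OK or DEGRADED.

check_mdstat() {
    # $1: path to an mdstat-format file (normally /proc/mdstat)
    # A failed member shows up as "_" in the status field, e.g. [U_]
    if grep -q '\[U*_U*\]' "$1"; then
        echo "DEGRADED"
        return 1
    fi
    echo "OK"
    return 0
}
```

Dropped into a cron job that mails you on a nonzero exit, even something this crude gives you the early warning that lets you sync a standby server *before* anyone attempts a fix.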
The real price of this misadventure was paid by yours truly, in the form of 14- to 16-hour workdays for several weeks in a row and many sleepless, stress-induced nights. I worked a lot of freakin' hours and did exhaustive testing ex post facto. Fortunately, all now appears well; so be it and God bless.
Thus, I have not been able to give any time to problems other than "immediate need" service responses to ongoing issues. My apologies to anyone who feels I have been inattentive; that was not my intention -- I was just toast, LOL. I'm going to spend this week trying to get everything caught up, and hopefully a disaster like this will not befall any of you (or me).
Additionally, I have moved to a new office -- a consequence of doing the original work. In the end, it was all just more than I had time for. The move was underway when things went "BOOM" -- fortunately, I am almost out from under water on that too, and I now have a nice spot to do video conferencing from. I would still like to have the month I lost back, LOL.
Thanks for your patience.