I’ve posted this before in a Friday Fun comment in the Veeam Community a few months back, but I think it bears repeating, especially because Veeam really did save the day, and maybe a few lives as well! Details may be fuzzy, but it’s all true. It’s a long story, so maybe set aside some time for some light reading, or use it as a 15-minute distraction from work so that you can reset your brain. And I hope you learn something too!
Setting the Stage
Long, long ago (okay, it was 2020) in a galaxy far, far away (otherwise known as a small rural community in centralish Nebraska), there resides a community hospital that shall remain nameless. We, an eastern Nebraska-based MSP, sourced this new client, and it’s looking like the beginning of a fabulous relationship.
I arrive onsite on a hot June day as part of the initial sales call and to review what their infrastructure looks like. COVID is more or less just beginning to hit the area. We sit down and talk about the services we provide in an overwhelmingly hot conference room, because the building AC has failed and a large temporary, portable AC system outside is forcing lukewarm air into the building, just not into the part that we’re in. After we’ve talked about our services and how we can serve them, we take a look at their hardware and discover that they don’t know what everything is running on, what everything does, or whether they even own the gear in their server rack. Perhaps not such a beautiful partnership after all, but it’s too early to tell.
I close on the contract and onboard the client. All is well. We move them off of a competing backup product over to Veeam. Said competing product shall also remain nameless (due to just how good I’ve found it is at silently failing and leaving you high and dry when you need it most). We’re also copying data off-site to a Veeam Cloud Connect provider (shout-out to 11:11 Systems/Iland) because it’s Nebraska – tornadoes happen, among other potential disasters.
Now we tell the client that their hardware is old. Not just old, but it came from some other clients of their former MSP – we find references to who owned this hardware in its previous life. The equipment’s good days were spent with its previous owners; when they retired it, it was quietly handed off to the previous MSP, who eventually repurposed it and donated it to this hospital. Turns out the hardware consists of an 8-ish-year-old EqualLogic PS6100 SAN and three 12th-generation Dell PowerEdge servers, all close to their 7th or 8th birthdays. One of the servers, an R720, is sitting there doing nothing but sucking up power and generating heat – I guess that’s a “cold” spare. The other two, an R720 and an R820, are happily computing away, although they’re starting to feel their age. We inform the client of these risks but are told that they have no money and to just keep things alive, though they’ll see what they can do. I begin to question this partnership and fear that some rough days are ahead.
“One Year Later….”
By now, the year is 2021 and COVID is in full swing. The US Government is throwing money around like it doesn’t matter, and the hospital receives a substantial amount of CARES money to support patient recovery from the pandemic. The hospital asks us what they need, and foolishly, we think they’re going to follow our recommendations, which include, in addition to the construction already occurring on the building, new servers, a SAN, networking, a phone system, a nurse call system, access control systems, and who knows what else! We think things are looking up for replacing the aging infrastructure because the hospital has a little money in its pocket. And then we find out that they don’t have as much money as they’re going to need. We prioritize what is “really” needed, among it a dedicated server for Veeam backups and, of course, two new ESXi hosts and a SAN. Rebuilding the core network is also a high priority because everything rides the network, including the new phone system that is a “must-have”.
Nothing happens…until something does. I get a report from front-line support that the IP phone systems are stuttering really badly. Compute systems are slow. I log in remotely, check the servers, and realize that the inlet temperature on one of their hosts seems high. I do some quick Google math since I don’t speak Celsius well, and realize that in freedom units, the temperature is 3 digits, and the first two digits are 1’s. Not good. Turns out the AC failed and nobody noticed. When we have an on-site employee open the server room door, he’s met with blast-furnace temperatures, and I immediately see temperatures begin to fall on the server. Good signs…facilities addresses the server room AC issues. Or so we thought.
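For anyone else who doesn’t speak Celsius, the math is just °F = °C × 9/5 + 32, so anything in the mid-40s °C at the server inlet is already north of 110 °F. Here’s a minimal sketch of that conversion plus a sanity-check threshold – the ~27 °C ceiling is my own rule of thumb for a small server room, not a Dell spec:

```python
# Quick "freedom units" conversion and sanity check -- a minimal sketch.
# The 27 C (~80 F) alert threshold is my own rule of thumb, not a vendor spec.

def c_to_f(celsius: float) -> float:
    """Convert Celsius to Fahrenheit: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32

def check_inlet(celsius: float, alert_at_c: float = 27.0) -> None:
    fahrenheit = c_to_f(celsius)
    status = "OK" if celsius <= alert_at_c else "TOO HOT - go check the AC"
    print(f"Inlet: {celsius:.1f} C / {fahrenheit:.1f} F -> {status}")

if __name__ == "__main__":
    check_inlet(22.0)   # a healthy server room
    check_inlet(44.0)   # a three-digit reading starting with "11" (111.2 F)
```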
We request a sit-down with the client. Perhaps if we can physically lay hands on the hardware and instill some fear in them of what will happen if hardware fails, they’ll understand the dire need they have for an update. The CEO went fairly deep into how the grant process and Medicare funding work: Medicare funding is based on the number of patients they see per day (they’re in the single digits), but they can adjust their request for federal funding, and so on. I felt for the guy…he knew what he was talking about, and I knew they didn’t have the money for the hardware but were working on it, albeit very slowly. The client says they understand, and they’ll start pulling together money from CARES funds, local donations, and maybe some grant requests to pay this all off. The CEO also requests more information on the hardware and what could happen if things fail – all for the grant request process.
A couple of months later, I get reports that the AC failed again. Turns out construction debris blocked the AC airflow and the server room overheated again. This cannot be good for already old hardware. I literally tell them that the two heat incidents probably shaved off whatever little time remained on hardware that was already long in the tooth. The CEO says they’re working on it, and we resubmit an updated proposal for approval because this stuff really needs to be purchased. They’re ready to commit, but their pocketbooks still aren’t having it.
Disaster Strikes
A couple of months later, the entire environment is down. On the weekend, of course, and the hospital is about a 2-hour drive from the on-call engineer. Blind as to what exactly is happening, he drives 2 hours on-site to troubleshoot. I hear through the grapevine from next-level support that there’s an issue, and I’m brought up to speed so I can start formulating ideas if needed, but nothing has been escalated to me yet. The engineer drives 2 hours back to the office during the night to get some hardware and then 2 hours back to the client site. Sunday morning rolls around…still having issues, and the problem is escalated to me. My wife’s not happy, but she knows the drill. I head to the office, grab a car, and grab whatever hardware I can find without really knowing what the issue is, including hardware that’s running my lab environment. After arriving a few hours later, I set up shop and find the EqualLogic SAN acting wonky. Looks like the controllers are rebooting every few minutes. Hosts are online, but the VMs are hung in memory like they do when their underlying storage suddenly drops out from under them. I console into the controllers and find that the supercapacitor board on both controllers has failed due to age (and probably heat stress), causing BSD kernel panics, and both controllers are boot-looping.
I find out that the hospital has diverted all incoming patients, emergency and non-emergency, to other hospitals “around the area”. Around the area means 20-30 minutes or more away. July 4th is looming around the corner, so we all pray that nobody tries to blow off a hand or take some of the larger shells to the chest. Instead, the small staff sits around and enjoys the break from the daunting task of taking care of patients and dealing with the festive injuries that are simply bound to happen. Nobody even bothers to ask me how long until things will be back up – which is refreshing if you’ve ever been in an outage situation where they want to know when things will be back up, pestering you every 15 minutes. Shift changes happen, and nurses converse with the original engineer onsite to see if he even went home, since he was there when their last shift ended.
Slow Recovery, but Veeam Saves the Day
Turns out the spare SAN I brought (a PS6000) doesn’t use the same modules because it’s a generation older – it has actual batteries. I slap the extra hard drives I was able to scavenge into one host and set up a local datastore on the added disks. The Veeam server was virtual and all is lost, so I begin the long process of connecting to the internet, downloading a Windows Server ISO, deploying a new VM, downloading Veeam, and installing and building a new Veeam backup server. I don’t really know what condition the backups are in, so my fingers are crossed. Fortunately, since we kept the Veeam configuration database backup in the VCC storage, I reconnect to the repo to grab the config database backup, restore it, and take a look at what I have to work with. Turns out all VMs were successfully backed up the night before except for one Domain Controller, whose backup failed because the repo was getting full…no biggie, just restore the other Domain Controller first, let it stabilize, and then bring up the older restore point on the VM that wasn’t as recently backed up and let AD replicate the changes.
We sit for a few hours while VMs restore. I call out a third engineer, this time from our phone team. She’s going to need to build new VoIP appliances because when those VMs were built, they were never added to backups. Could have been worse. SEVERAL hours later, the hospital is back online and can begin accepting patients again. I make sure the backups are getting all of the machines, including the phone system. All is well. I prep for the two-hour drive home, taking the failed storage array with me, and call the boss man to tell him not to let the initial responding engineer drive, as he’s been up for 40 hours and needs to sleep…the car can stay there and get picked up at a later date. He already has his own health issues with diabetes and such; we don’t need to add being dead to the growing list of concerns.
“Sounds like somebody has a case of the Mondays”
The next day, the client reports that some of their restored files are out of date. How can that be? As it turns out, the DC that was restored from an older restore point also happened to be a file server that we didn’t know about. And they need the data.
I need to rebuild the failed array that I had brought back to the office with me. Fortunately, another client of mine had managed to trash a PS4100 that was equally old, and I knew they had it sitting in storage. That one I know failed because they tried to do power maintenance, the UPS batteries ran out of juice, the array went hard down, and 5 out of 12 drives failed. Don’t ask me how I get all of the bright, shining stars for clients; it’s the luck of being the Senior Engineer to whom all escalations come to be resolved. Google says there’s a chance that my parts-swap theory could actually work.
I email that client and ask if he’d be willing to donate his old array that’s taking up space and might have the right parts I need to help this little hospital. He is, and I drive an hour to pick up the array and an hour back to my office. The parts match, as it turns out, and I’m able to bring the PS6100 back online. I plug it into my now-reconnected lab environment, figure out the iSCSI CHAP authentication, map some datastores, and bring up the VM in question so that I can extract the files from that server. I copy the files to the DC back in production, putting the versions recovered from the old array alongside the files that had since changed on the restored VM – they now have duplicates of some data, but at least they have the old data back as well. It turns out to be 125 MB worth of data, mostly a spreadsheet that is updated by nurses as they make their rounds or something like that. An awful lot of work for a spreadsheet, but in the end, zero data loss. A win is a win. I have no doubt that if the client was still using ShadowProtect (oops, did I name names?) from their previous MSP, we would not have been able to restore quickly. I have little doubt that we would not have been able to restore at all.
The ongoing saga….that eventually comes to a close
In the end, the client did find the money to get the hardware purchased. Actually, some of it is purchased on credit because the CEO knows the money is coming eventually and gives us the business equivalent of “the check is in the mail”. We throw in a new PowerVault ME4 array, two new hosts, and a new physical Veeam server with local storage so that we don’t have to waste time building a new Veeam server next time. The client is claiming the replacement hardware on insurance because the service life was cut short by the AC failures. The insurance company, probably partially due to COVID, goes through about three claims adjusters during the process. I had provided a root cause analysis after the incident, and we resubmit it to the adjuster. In the RCA, I did state that the hardware was already really old, but that the AC failures didn’t help whatever life it had left. I also had a phone conversation with one of the claims adjusters to give some context to the RCA, because they’re walking a fine line in my opinion, and I want to make sure we’re not aiding in some sort of insurance fraud.
The client eventually has issues paying us for our services. Threats of lawsuits ensue as the client tries to wait out the insurance company for the money to pay us, but we’re not a bank, and we’re not in the business of extending them credit. We eventually get paid but end up firing the client a few months later because they’re just not a good fit for us, as much as it pains everyone involved. We have to do what’s right for us, but we give them plenty of time to get onboarded by someone else. They end up hiring one of our former engineers who had left for a different company to handle the offboarding, and he gets roped into their insurance claim as well. The insurance company states they’ll pay the depreciated value of the failed hardware, but won’t pay for the new hardware because that was an improvement and not a like-for-like replacement. Seems fair to me. The client wants us to provide a value for the failed hardware. Everything was end-of-life and only suitable for use in a lab environment, if even that. We don’t give hardware values to the client, but I’d estimate the value of all the replaced hardware at about $500 tops, and I fully expect to see the insurance company cut a check to the client for maybe $100.
Since this was originally written, the client has been offboarded. I’m a little curious about what the insurance company paid out to them, but really, anything over $0 is too much given the malfeasance that occurred here after so many warnings. Whatever they got paid, I’m pretty sure it was “go away” money.
Lessons Learned (and reinforced)
Anybody who listened to Aesop’s fables knows that there’s a moral to the story. And here are a few from this one.
- Listen to your MSP, Partners, and Service Providers. They’re probably not trying to extract as much money as possible from you. Any good provider is going to have your best interests at heart. Recommendations come for a reason, and if that flag is being waved frantically, take a moment to figure out why and address that issue.
- Hardware has a lifespan. It’s probably longer than what the vendor will support, but why take that risk? Dell will warrant and support hardware for up to seven years. In my experience, you can probably get 8-10 years in the best case. There’s third-party support, but if you’re needing to buy support elsewhere, again, take a moment to find out why you’re doing that. It’s probably for the wrong reasons.
- Make sure you’re monitoring your hardware and can receive alerts. I guarantee you that both controllers on the failed SAN didn’t fail at the same time. One was probably boot-looping for weeks or even months, but we had no idea until they had both failed.
- Make sure you have a plan to recover from a disaster. Flying by the seat of your pants might work, but do you really want to? Figure out the best way to work yourself out of a bad scenario and be prepared for it.
- Test your backups. It’s been said that your backups are only as good as the data you can restore from them. We were lucky that all VMs except one had been backed up the night before. The reason the single domain controller backup had failed? Turns out the repository was running out of space, so Veeam refused to continue backing it up.
- Monitor your backups. Make sure that if you have failures, you know about them and address them – see the sketch after this list for one crude, out-of-band way to sanity-check that backups are still landing where they should.
- Keep an offsite copy of your data. And make sure it’s encrypted. Seriously, an offsite copy, while it may not feel necessary, is super helpful. It took a lot less time to begin restoring data because I had a copy of the configuration database, passwords for everything, etc. I didn’t have to completely reinvent the wheel, just fix the hub and spokes and reattach the existing tire. For every one of my clients, their configuration database is stored somewhere offsite. It’s either at one of their secondary sites, in a VCC repository that they have somewhere, or in my VCC repo that I host with my service provider console if they have no other option.
- Documentation. Keeping that configuration database backup and copies of backups offsite is only useful if you can access that data. If you lose your passwords and configuration, it’s going to be hard to get to it. Make sure your documentation is kept up to date, because the only thing that might be worse than no documentation is outdated, inaccurate documentation.
- If you can’t use Veeam, use something. Generally speaking, a decent backup that isn’t great but has what you need is still a good backup. As previously mentioned, at least you have a copy of your data somewhere, right? Also make sure to remember that if you don’t need much, Veeam does have a Community Edition!
- Use Veeam. Seriously! Sure, there are other products out there. But why would you use anything other than the best-in-class system to protect your critical (and non-critical) business data? Plus, honestly, the peace of mind it brings helps you sleep better at night, and that tends to be worth it at almost any cost – but Veeam really isn’t that expensive for the return you get on keeping your business running. What’s the cost of your business being down? What’s the cost of losing your business? Veeam has some free (Community Edition) options for backing up your own systems, but even paying for the product tends to be well worth the investment.
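On the backup-monitoring point above: Veeam’s own email notifications (or Veeam ONE) are the proper way to do this, but a dumb, independent script that simply checks whether fresh backup files keep appearing in the repository – and whether the repository is about to fill up, which is exactly what bit us with that one Domain Controller – makes a nice second opinion. A minimal sketch only; the repository path, the 26-hour window, and the 10% free-space floor are all assumptions for illustration:

```python
# Crude, out-of-band repository check -- a minimal sketch, not a replacement
# for Veeam's built-in notifications or Veeam ONE. Run it from a scheduled
# task on the repository server. REPO_PATH, the 26-hour window, and the 10%
# free-space floor are assumptions for illustration.

import shutil
import sys
import time
from pathlib import Path

REPO_PATH = Path(r"D:\VeeamRepo")   # hypothetical repository location
MAX_AGE_HOURS = 26                  # daily jobs plus a little slack
MIN_FREE_FRACTION = 0.10            # alert before the repo actually fills up

def newest_backup_age_hours(repo: Path) -> float:
    """Age in hours of the newest .vbk/.vib backup file under the repo."""
    backup_files = [p for ext in ("*.vbk", "*.vib") for p in repo.rglob(ext)]
    if not backup_files:
        return float("inf")
    newest = max(f.stat().st_mtime for f in backup_files)
    return (time.time() - newest) / 3600

if __name__ == "__main__":
    problems = []

    age = newest_backup_age_hours(REPO_PATH)
    if age > MAX_AGE_HOURS:
        problems.append(f"newest backup file is {age:.1f} hours old")

    total, used, free = shutil.disk_usage(REPO_PATH)
    if free / total < MIN_FREE_FRACTION:
        problems.append(f"repository has only {free // 2**30} GiB free")

    if problems:
        # In real life, email this or push it to your RMM instead of printing.
        print("ALERT: " + "; ".join(problems))
        sys.exit(1)
    print(f"OK: newest backup {age:.1f} hours old, {free // 2**30} GiB free")
```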
My apologies for the length of this story, so if you made it this far I appreciate you. Hopefully you found it entertaining, educational, and worth the time that I took to write (and rewrite) it and that you took to read it.
I really enjoyed this story. Having been in IT for about 35 years, I’ve seen similar stuff and it just keeps coming! Our company recently acquired another hospital system that we found, as we began supporting it, was still running on ESXi 5.5 and old hosts with uptimes in the 5-7 year range (the ones that didn’t crash on a regular basis). Some were running off SD boot cards with failed mirrors and corrupt filesystems. We also encountered a Cisco Voice server that was just being decommissioned, with uptime over 9 years and approaching 10. We are keeping it running even though it isn’t doing anything now, just to see if it can get to 10 years. Don’t apologize for the length of the story – I could read a whole book of stories like this, and there’s probably interesting detail you left out.