
Schrödinger’s backup: When good documentation goes bad


This is the 212th article in the Spotlight on IT series. If you'd be interested in writing an article on the subject of backup, security, storage, virtualization, mobile, networking, wireless, cloud and SaaS, or MSPs for the series, PM Eric to get started.


Photo credit: 'No Matter' Project

The concept of Schrödinger’s backup is a simple one, but I believe it’s an important one for us in IT. It states: “The condition of any backup is unknown until a restore is attempted.”

We can quite happily feed tape libraries their required daily, weekly and monthly diet of tapes, ensure that disk backup systems have enough space, and smile when the emails appear saying that everything backed up and verified just fine. But the true test comes when a restore is required — be it a single file or an entire system. At that point, we find out whether the backups really were as successful as the emails claim, whether we’ve left something out, and whether the documentation is valid (if it even exists).

A few years ago — back when physical servers were more abundant than virtual ones — I was fortunate enough to have one of those rare moments when the planets aligned, and I managed to get agreement from management to purchase a server to be used as a test restore platform for our then-key systems. To go along with this, the IT department as a team came up with the “random restore” concept.

This was just a simple spreadsheet that listed all of the key systems along with a few other restore types, such as a file restore or an AD object restore. A macro in the spreadsheet would pick a restore more or less at random. Now, there were a few rules in the macro: no system could go untested for more than six months, and the first few restores were reserved for the servers the company had highlighted as essential to running the business and the systems that business-critical applications depended on — services like Active Directory, print servers and Exchange.
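If you fancy building something similar, the selection logic is simple enough to automate. Here’s a minimal sketch in Python rather than a spreadsheet macro; the system names, dates and six-month threshold below are illustrative assumptions, not our actual list.

```
import random
from datetime import date, timedelta

# Hypothetical restore catalogue -- the real thing was a spreadsheet with one
# row per key system or restore type and the date it was last tested.
RESTORE_TESTS = [
    {"name": "Active Directory object restore", "last_tested": date(2013, 1, 10)},
    {"name": "Exchange mailbox database restore", "last_tested": date(2012, 11, 2)},
    {"name": "Print server rebuild", "last_tested": date(2013, 3, 5)},
    {"name": "Single file restore from tape", "last_tested": date(2013, 2, 20)},
]

# Rule: no system may go untested for more than roughly six months.
MAX_AGE = timedelta(days=182)


def pick_restore_test(tests, today=None):
    """Pick the next test: anything overdue is chosen first, otherwise pick at random."""
    today = today or date.today()
    overdue = [t for t in tests if today - t["last_tested"] > MAX_AGE]
    pool = overdue or tests
    return random.choice(pool)


if __name__ == "__main__":
    chosen = pick_restore_test(RESTORE_TESTS)
    print("This month's restore test: " + chosen["name"])
```

Biasing the pick towards anything overdue is what enforces the six-month rule; everything else is a straight random draw.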

For the very first restore test, the scenario was that the Exchange server had died and everything else was working. At the time, we had a keen junior in the IT department who wanted to have a go at the restore by following the existing documentation. So, he was duly dispatched with a hefty document to do the restore. About two hours later he was back with said hefty document, now with more notes scribbled over it than anything I’d ever seen before.

Suffice it to say that the restore was an unmitigated disaster — not because the person doing the restore was a junior, but because things had changed (most notably, one of the Exchange databases had been moved to a different drive because of space issues and the documentation had never been updated).

Over the course of what was fortunately a quiet Friday afternoon, most of the IT department got involved in one way or another. We found tech notes to fix issues that cropped up, we made notes of those issues, and we covered each other on support calls to ensure that the day-to-day work carried on. In short, it was a team-builder’s dream.

Eventually, we did get Exchange restored, but the lessons learned were important.

Five people had spent a large part of the day dealing with a system that had been well documented when it was first installed, but over time several minor things had changed, and that had broken the documentation. As a team, we’d pulled together and managed to make it work.

Of course, we weren’t under any pressure to get the system up and running. We didn’t have the finance director screaming for his emails, nor the lawyers. If the restore had been needed because of a real failure, with people screaming at us to get it working, I think it would have taken twice as long. After all, it’s easy to panic and try anything just to get the system back up rather than stopping, taking a step back and really looking at the problem.

The IT department as a whole learned a few valuable lessons:

  • It didn’t matter how many “backup successful” emails we’d received: the documentation was wrong, so any restore would have been painful no matter what.
  • Restore documentation shouldn’t just contain the restore steps. It should also include serial numbers for software, contact numbers for support companies and support contract reference numbers. If that’s a problem, those details should live in another system, with a link or other reference in the restore documentation.
  • Any documentation should record when it was last updated and last tested, for there is nothing worse than documentation that is wrong. Having no documentation is better than having wrong documentation.
  • All documentation should have a glossary page explaining the acronyms to ensure that the person reading the documentation understands what the person who wrote it was trying to say.

After the Exchange server was finally restored and we’d had a lessons-learned exercise, the term “Schrödinger’s Backup” was coined. It’s something I’ll certainly never forget.

Over the next few months we tested all sorts of restores: we requested tapes back from offsite, rebuilt AD from the ground up, restored SQL databases to a point in time, restored CIFS shares and much more.

We got the chance to try things out during restores, to learn, and to gain confidence in systems we’d normally only administer. (Plus, we all got something extra to put on the CV/résumé.) All in all, it was a very worthwhile exercise and an excellent investment of time.

Have you ever received any painful backup/restore lessons? Share your stories, thoughts and tips in the comments below!

