This is the 303rd article in the Spotlight on IT series. If you’re interested in writing an article on backup, security, storage, virtualization, mobile, networking, wireless, cloud and SaaS, or MSPs for the series, PM Eric to get started.
It was 4:30 p.m. on a Friday. I had just left work for a weekend with family and friends. Tom Petty played about as loud as my wife’s VW Bug could handle. I hadn’t noticed my phone going off non-stop for about 10 minutes — an email every two minutes from our battery backup informing me that the entire plant’s power was down, two phone calls from my onsite colleague and two more from my manager.
That isn’t what this story is about, though. I had intended to write this post over that weekend. I was so excited to be a part of the Spiceworks Community and to provide one of the articles I always loved reading. Instead, I spent the weekend in an actual disaster recovery situation. Nothing kills the motivation for a DR story like an actual DR situation...
A year into my help desk call center job, I found myself moving into my dream position. I started as the network and systems admin for a medium-sized company, a huge role change with a lot to learn. I hadn’t been there for two months before the inevitable happened: my first complete server crash. How exciting… I mean, terrible, and ultimately terrifying.
I had learned plenty about backups and restores. I knew about fulls, incrementals, differentials, all of that. It looked great on paper, although I had never actually touched a piece of backup software until I started this position. What I did know was simple: the backups were running on schedule, they were succeeding, and they were easily accessible.
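In case those terms are fuzzy, here’s a rough sketch of what each one means when it comes time to restore. The job names and schedule below are made up for illustration, not our actual setup: a full stands on its own, an incremental captures changes since the previous backup, and a differential captures everything since the last full.

```python
# Illustrative only: made-up backup job names, not the real schedule from this story.

full = ["sun_full"]
incrementals = ["mon_inc", "tue_inc", "wed_inc", "thu_inc"]       # changes since the previous backup
differentials = ["mon_diff", "tue_diff", "wed_diff", "thu_diff"]  # changes since the last full

# To get back to Thursday evening's state:
incremental_restore = full + incrementals            # the full, then every incremental in order
differential_restore = full + [differentials[-1]]    # the full, then only the newest differential

print(incremental_restore)   # ['sun_full', 'mon_inc', 'tue_inc', 'wed_inc', 'thu_inc']
print(differential_restore)  # ['sun_full', 'thu_diff']
```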
Before I left for the day, my co-worker and I noticed a strange error message in the system tray. It mysteriously disappeared before I could read it all, so I shrugged it off and figured I’d check the event logs in the morning. I packed my bags and headed home. My co-worker rebooted the server “just in case.”
The machine never came back on.
I had trained for this. I knew what to do. One by one, we tried all of the expected first steps. Stop and start it: what does it do? That doesn’t seem right. What about safe mode? Nothing? We jumped through all the right hoops. What’s going on here?
We started diving into some more technical fixes. (Isn’t it amazing how long it ends up taking? Each fix is always one step away from being done. “We’ll just run this, it should be 20 minutes, and we’ll be running again.” It always ends up taking 40 minutes, and still doesn’t work. Repeat five similar fixes, and here we are well past midnight with no running virtual server. I’m not alone on that, am I? I’m guessing not.)
Before I started, my manager had been using an IT consulting business to cover my role. Since I was still new, we decided it would be best to involve them in the incident. For most of the night, we let them take the lead. I was reading TechNet articles and browsing the forums to see what other people had tried, all the while on the phone with the consultants and remoted into the servers through vSphere so I could watch as they tried different things, trusting their guidance. After all, I was still new. I could practically hear David Attenborough describing me as a timid cub being led by the knowledge of my mother consultant. If they suggested something, that must be the best next step.
By this time, my wife was sound asleep. She had kicked me out of the room so I wouldn’t keep her up all night tapping away on the keyboard and taking phone calls. My second bedroom, formerly filled with guitars, my electric drum set, the keyboard (of the piano nature, silly techies) and what used to be my “music computer,” had been transformed into an all-night office.
I spent the late hours of the night searching through articles and forums, trying to find things to try. I often found myself on the same articles the consultants had seen and been trying. It was nice to know I was on the right track with the experts. It was about 2:30 a.m. when I learned something: how awesome bare-metal restores are.
My tech, Shane, had finally asked a question I had a great answer for: “When was the last good backup of this machine?”
We had a backup running daily, so there would be a good one from the day before, from about 5:30 p.m. They decided it was time. We created a new virtual machine on the host and allocated the virtual hardware. I knew that we didn’t need to load an OS first; that’s a pretty cool thing about a bare-metal restore, right?
We loaded an ISO of the recovery boot CD (we were using Symantec System Recovery at the time) and I pointed them to the NAS that held the images. It was confusing to me at first: Why would we only have to pick the incremental backup? Would we have to load the other data afterwards? How long would that take?
A few clicks later we had the restore process started. We only had to pick the most recent backup location and it knew what to do. What did I learn from this? 22 minutes. It took 22 minutes to finish the restore. 22 minutes after we clicked start we had a running server, as if nothing had happened. All I could think was, “Why the flip did we just spend HOURS trying to fix this?”
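If you’re wondering how it “knew what to do,” the trick is that each recovery point records which image it depends on, all the way back to the base full. Here’s a rough sketch of that idea in Python. The file names and the parent map are hypothetical, made up just to show the shape of it; Symantec System Recovery tracks that metadata on its own, which is why one click on the newest point was enough.

```python
# A sketch of the chain resolution the recovery software did for us when we
# picked only the newest recovery point. File names and the parent map below
# are hypothetical; the real product keeps this metadata itself.

PARENTS = {
    # recovery point            -> the image it depends on (None = base full)
    "fileserver_thu_i003.iv2i": "fileserver_wed_i002.iv2i",
    "fileserver_wed_i002.iv2i": "fileserver_tue_i001.iv2i",
    "fileserver_tue_i001.iv2i": "fileserver_base.v2i",
    "fileserver_base.v2i": None,
}

def restore_chain(latest):
    """Walk from the chosen recovery point back to the base full image."""
    chain = []
    point = latest
    while point is not None:
        chain.append(point)
        point = PARENTS[point]
    return list(reversed(chain))  # apply the base full first, then each incremental

print(restore_chain("fileserver_thu_i003.iv2i"))
# ['fileserver_base.v2i', 'fileserver_tue_i001.iv2i',
#  'fileserver_wed_i002.iv2i', 'fileserver_thu_i003.iv2i']
```

You pick the newest point, and the software quietly lays the base image and every incremental in order onto the empty VM.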
It was amazing to me. We took nothing and made it work in 22 minutes. In retrospect, if I had been a more confident tech at the time, I would have known how static the data on that server was. I would have known almost immediately that we’d lose nothing by starting with a restore from backup, and that it would be the fastest, easiest option. That’s exactly what I learned from it, though: we have a phenomenal piece of technology at our disposal, and now I know how to use it.
--
Thoughts? Questions? Favorite Tom Petty song? Chime in in the comments below!