Server crashes will happen to every business sooner or later, but knowing how to diagnose the problem quickly can help you get back up and running faster.
Having a server go down is a nightmare scenario for any business. An unexpected crash can leave your business completely unable to operate and could, in severe circumstances, see vital data irretrievably lost. Or, in less serious instances, your website could just go offline for a few minutes while you turn it off and on again.
But how do you know which of these scenarios you're facing? The first step in fixing an unexpected server crash is understanding exactly what the root cause of the problem is. Only then will you be able to put in place backup procedures and mitigation processes to get the server up and running again and run contingencies to ensure you can continue operating in the meantime.
However, with so many things that can go wrong, this may not always be an easy task. So here's a quick primer on some of the key questions you should be asking, as well as some of the most common root causes of a crash that you need to be aware of.
Step one: What type of crash is it?
The first thing you need to do is establish what type of crash you're faced with. This will help you narrow down the root cause - for example, whether it is a hardware or software issue - and serve as a good starting point for your diagnostics.
For instance, does the server power on at all? If not, you are almost certainly looking at a hardware issue, most likely related to the power supply. If it boots up but shows the dreaded Blue Screen of Death (BSoD), this is usually an indicator of a hardware failure or a bad device driver.
Meanwhile, if the server loads, but critical services aren't functioning, this can suggest a network problem or configuration issue. These can be among the hardest issues to pin down, as there could be a wide variety of potential causes, so it's important to know where to look for further clues.
Also, don't forget the simple stuff! It can be easy to focus so closely on more complex, technical issues deep within the configuration of a server that more basic issues can go completely overlooked. There are many tales in the industry of admins going through lengthy troubleshooting processes, only to find someone has simply unplugged the wrong cable and shut down the server.
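The questions in this step can be written down as a simple decision order, which is handy for a runbook. The following is a minimal sketch only: the Observation fields and the triage function are hypothetical names invented for illustration, and the returned hints simply restate the rules of thumb above rather than any definitive diagnosis.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What an admin can see at the console (illustrative fields only)."""
    cables_seated: bool      # power and network cables checked by hand
    powers_on: bool          # does the server power on at all?
    shows_bsod: bool         # does boot end in a Blue Screen of Death?
    services_running: bool   # are critical services working after boot?


def triage(obs: Observation) -> str:
    """Return the most likely problem area, checking the simple stuff first."""
    if not obs.cables_seated:
        return "check cabling first"
    if not obs.powers_on:
        return "hardware, most likely the power supply"
    if obs.shows_bsod:
        return "hardware failure or bad device driver"
    if not obs.services_running:
        return "network or configuration issue"
    return "no crash symptoms observed"


# Example: the server boots, but critical services are not functioning.
print(triage(Observation(cables_seated=True, powers_on=True,
                         shows_bsod=False, services_running=False)))
# prints "network or configuration issue"
```

The order of the checks matters: cabling comes first precisely because it is the cheapest thing to rule out.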
Step two: Deciphering the errors
If you're staring at a BSoD, it can feel daunting having a long string of errors to look at. However, if you understand the structure of the screen, it becomes a lot easier to make sense of it. In particular, one key area to focus on is the actual error message that appears at the top of the screen.
Here are some of the more common messages you may encounter here, as well as what they indicate:
- KMODE_EXCEPTION_NOT_HANDLED. A kernel-mode program raised an exception that the error handler did not catch. This usually indicates a faulty or incorrectly configured device driver.
- REGISTRY_ERROR. There is a serious problem in the registry.
- INACCESSIBLE_BOOT_DEVICE. The operating system is unable to read from the hard disk.
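- UNEXPECTED_KERNEL_MODE_TRAP. The CPU raised a trap that the kernel could not handle. This often signifies faulty hardware, most commonly memory problems.
- BAD_POOL_HEADER. A kernel memory pool header is corrupted. This can be hard to interpret, but it often indicates that the issue has something to do with a recent change in the system.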
- KERNEL_DATA_INPAGE_ERROR. The OS is unable to read a page of kernel data from the page file.
- NMI_HARDWARE_FAILURE. A hardware malfunction has occurred, but the hardware abstraction layer (HAL) is unable to identify the specific cause.
- NTFS_FILE_SYSTEM. A problem occurred in the NTFS file system driver, which often means you have a corrupted hard disk.
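If you find yourself looking these up often, the list above can be kept as a small lookup table. This is a sketch under stated assumptions: STOP_CODE_HINTS and explain_stop_code are hypothetical names, and the hints are loose paraphrases of the list above, not an official reference.

```python
# Hints paraphrased from the list above; extend the table as needed.
STOP_CODE_HINTS = {
    "KMODE_EXCEPTION_NOT_HANDLED": "likely a device driver problem",
    "REGISTRY_ERROR": "serious problem in the registry",
    "INACCESSIBLE_BOOT_DEVICE": "OS cannot read from the hard disk",
    "UNEXPECTED_KERNEL_MODE_TRAP": "possible memory problems",
    "BAD_POOL_HEADER": "possibly tied to a recent system change",
    "KERNEL_DATA_INPAGE_ERROR": "kernel data unreadable from the page file",
    "NMI_HARDWARE_FAILURE": "hardware fault with no identified cause",
    "NTFS_FILE_SYSTEM": "possible hard disk corruption",
}


def explain_stop_code(text: str) -> str:
    """Find the first known stop code in free text and return its hint."""
    normalized = text.upper()
    for code, hint in STOP_CODE_HINTS.items():
        if code in normalized:
            return f"{code}: {hint}"
    return "unknown stop code"


# Example: text copied by hand from the screen, in any letter case.
print(explain_stop_code("stop code: bad_pool_header"))
# prints "BAD_POOL_HEADER: possibly tied to a recent system change"
```

A hint like this is only a starting point for diagnosis; it tells you where to look first.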
Below the error message, you'll see a list of OS modules that have been loaded into memory successfully, so you can rule these out as the cause. Then, there is a list of those that didn't load, which could help you narrow down the issue to a specific module.
Step three: Using Safe Mode and event logs
The next thing to try is booting up your server in Safe Mode. If you can't do this, then the chances are your issue is hardware-related. If you can, this points to a driver-related issue.
If you're able to boot into Safe Mode, you'll have access to the Event Viewer, which should be able to offer further clues as to what the cause of the problem is. If the logs don't settle it, you can then turn to Device Manager and disable all devices that are not required for the server to start.
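The same events that Event Viewer displays can also be pulled from a script, which makes it easier to save them for later comparison. The sketch below assumes a Windows server with the built-in wevtutil command available. The function names and the choice of the 20 newest Critical and Error entries from the System log are illustrative assumptions.

```python
import subprocess
from typing import List


def build_query_command(max_events: int = 20) -> List[str]:
    """Build a wevtutil call for the newest Critical/Error System events."""
    # Level 1 is Critical and Level 2 is Error in the Windows event schema.
    xpath = "*[System[(Level=1 or Level=2)]]"
    return [
        "wevtutil", "qe", "System",   # qe = query events from the System log
        f"/q:{xpath}",                # XPath filter on event level
        f"/c:{max_events}",           # maximum number of events to return
        "/rd:true",                   # reverse direction: newest first
        "/f:text",                    # plain-text output instead of XML
    ]


def fetch_recent_errors(max_events: int = 20) -> str:
    """Run the query on a Windows host and return its text output."""
    result = subprocess.run(build_query_command(max_events),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling fetch_recent_errors() from an elevated prompt should return the matching entries as text. On any other platform the call fails, because wevtutil is not present.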
Then re-enable the disabled devices one at a time, rebooting the server after each, to find out which of them, if any, is causing the problem by process of elimination. If the server crashes again after you enable a device, there's your issue.
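The process of elimination is easy to lose track of across several reboots, so it can help to see the loop written out. This is a sketch of the logic only: find_faulty_device, the boots_cleanly callback and the device names are all hypothetical, and in practice "boots cleanly" is something you check by hand after each reboot.

```python
from typing import Callable, Iterable, List, Optional


def find_faulty_device(optional_devices: Iterable[str],
                       boots_cleanly: Callable[[List[str]], bool]) -> Optional[str]:
    """Enable devices one at a time; return the first one that causes a crash."""
    enabled: List[str] = []
    # Baseline: with every optional device disabled, the server must boot.
    if not boots_cleanly(list(enabled)):
        raise RuntimeError("crash persists with all optional devices disabled")
    for device in optional_devices:
        enabled.append(device)
        if not boots_cleanly(list(enabled)):
            return device          # enabling this device brought the crash back
    return None                    # no single device reproduced the crash


# Simulated reboot check: pretend one made-up device causes the crash.
def simulated_boot(enabled: List[str]) -> bool:
    return "raid-controller" not in enabled


print(find_faulty_device(["nic-2", "raid-controller", "usb-hub"], simulated_boot))
# prints "raid-controller"
```

The baseline check is worth keeping. If the server still crashes with every optional device disabled, the cause lies elsewhere and the loop would only waste reboots.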
Once you've narrowed it down, the hard work begins. Depending on the type of problem you've encountered, a fix could be as simple as rolling back a recent change or updating a configuration. More complicated problems will need expert attention and could take much longer to resolve, in which case you'll need to have backups and contingencies at the ready to ensure continued service.