Server troubleshooting is a fine art, but there are some straightforward methods and tips to get things running smoothly.
ITIL methodology delves into how to troubleshoot a server or a related issue more deeply, but the general theme is to narrow down the problem as quickly and efficiently as possible.
Take a step back and think about how to logically resolve an issue during an outage. For example, if a user complains that they can't access something, find out whether other users have the same issue; if they do, you can rule out a problem localized to a single end-user device.
Use this generalist guide in concert with your organization's guidelines and technical strengths to think about server troubleshooting processes and procedures.
1. Identify the server problem's area of effect
One of the first pieces of information you need is how widespread the outage or slowdown is, as well as what it affects. What seems like a network issue could be a damaged cable affecting one PC or a small cluster.
If multiple users are affected by the same issue, that rules out local variables such as hardware problems on a single PC or software misuse.
If you have multiple sites, are they all affected? This will help determine if the issue lies with a localized server.
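As a rough illustration of the scoping questions above, the following Python sketch groups incoming problem reports by site and counts distinct users. The report fields (`user`, `site`) and the function name `summarize_scope` are hypothetical; a real help desk system will have its own schema.

```python
from collections import defaultdict


def summarize_scope(reports):
    """Classify how widespread an issue is from a list of problem reports.

    Each report is assumed to be a dict with hypothetical "user" and "site" keys.
    Returns (scope label, {site: number of distinct affected users}).
    """
    users_by_site = defaultdict(set)
    for report in reports:
        users_by_site[report["site"]].add(report["user"])

    total_users = sum(len(users) for users in users_by_site.values())
    if total_users == 0:
        scope = "no reports"
    elif total_users == 1:
        scope = "single user"      # likely a local device or usage problem
    elif len(users_by_site) == 1:
        scope = "single site"      # look at that site's servers and links
    else:
        scope = "multiple sites"   # look at shared infrastructure

    return scope, {site: len(users) for site, users in users_by_site.items()}
```

Calling `summarize_scope` with two reports from different users at the same site would return the label `"single site"`, which suggests starting with the infrastructure local to that site.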
Follow these general steps to help identify server problems:
Members of a big IT team are accustomed to finger-pointing among departments. The help desk hears about a slow application, and the sysadmin blames the network; the network admin blames the storage area network (SAN); the storage admin blames the software.
If you're troubleshooting a server issue -- particularly if it's something vague, such as a slow application -- identify what area of the data center infrastructure is affected. When multiple servers and applications are malfunctioning, this usually rules out a problem with a single server and points to the network or storage arrays. With virtualization, check the physical host location of any affected VMs to see whether they share the same, potentially compromised, hardware.
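The check for shared physical hosts can be sketched in a few lines of Python. This assumes you can export a VM-to-host mapping from your hypervisor's inventory; the mapping format and the name `shared_hosts` are illustrative assumptions, not part of any particular product.

```python
from collections import Counter


def shared_hosts(affected_vms, vm_to_host):
    """Return physical hosts that run more than one affected VM, with counts.

    vm_to_host is an assumed {vm_name: host_name} mapping exported from your
    hypervisor inventory. VMs missing from the mapping are ignored.
    """
    counts = Counter(vm_to_host[vm] for vm in affected_vms if vm in vm_to_host)
    return {host: count for host, count in counts.items() if count > 1}
```

If every affected VM sits on the same host, that host is a strong candidate for the root cause; if they are spread evenly, shared network or storage is more likely.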
The process of elimination usually points to a clear culprit, but not always. Find commonalities on issues and try different combinations of factors to narrow down the possibilities. For example, perhaps the issue is that copying from one file share to another is taking too long. Is it slow if you copy from one server to another on the same site? If so, it's not the WAN. Is it slow if you copy between local disks on the server? If so, it's not the SAN or LAN. If you have to resort to packet capturing or I/O speed tests, server troubleshooting could take a long time.
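To compare the copy scenarios described above on equal terms, it can help to time the same test file across each path. The Python sketch below uses only the standard library; the function name `time_copy` and any paths you pass to it are placeholders for your own environment.

```python
import shutil
import time
from pathlib import Path


def time_copy(source, destination):
    """Copy a file and return (elapsed seconds, throughput in MB/s)."""
    source = Path(source)
    size_mb = source.stat().st_size / (1024 * 1024)

    start = time.perf_counter()
    shutil.copyfile(source, destination)
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very small files or coarse timers.
    throughput = size_mb / elapsed if elapsed > 0 else float("inf")
    return elapsed, throughput
```

Run it once between local disks, once between servers on the same site and once across sites, then compare the throughput figures. Operating system caching can skew repeated runs, so treat the numbers as a rough comparison rather than a benchmark.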
Follow these general steps to determine if the server itself is the problem:
Documentation is an incredibly valuable troubleshooting tool. Easy access to your environment's topology and an understanding of how an application works on it enable swift server troubleshooting.
Have a solid understanding of the data center operations: How many servers are involved with each application? What are the basic network settings? What infrastructure lives where? This proves valuable, for example, if you have two application servers that clients connect to via round-robin domain name system (DNS) and half of your users report issues. You know from the start that each server handles roughly half of the users, so the problem most likely lies with one of the two, and you won't waste time troubleshooting the healthy server.
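In the round-robin DNS scenario, one quick way to find the unhealthy server is to resolve every address behind the shared name and test each one individually. The sketch below uses Python's standard `socket` module; the helper names are illustrative, and a successful TCP connection only shows that the port is reachable, not that the application behind it is healthy.

```python
import socket


def resolve_backends(hostname, port):
    """Return the sorted, de-duplicated addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_backend(address, port, timeout=3.0):
    """Return True if a TCP connection to address:port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(hostname, port):
    """Map every address behind a hostname to its TCP reachability."""
    return {addr: check_backend(addr, port) for addr in resolve_backends(hostname, port)}
```

For example, `check_all("app.example.internal", 443)` (a hypothetical hostname) would return one entry per server address, making it obvious if only one of the two is refusing connections.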
Follow these general steps to maintain detailed documentation on server settings and connections:
Communication is key in server troubleshooting. For example, your colleague changed a server setting last night. The next day, something doesn't work. You need to know about the change, as it is a likely culprit. Large companies have change process forms so everyone is on the same page, but not every IT team has that luxury -- or hindrance, depending on how you look at it.
Communication helps the data center team prepare and proactively watch the environment when a new application or other change goes into production. Otherwise, they'll reactively ask about the new application, its deployment and resource demands when end users start to complain about poor functionality.
Follow these general steps to communicate effectively with your team and company:
Save time troubleshooting server problems with a detailed and ongoing overview of operations.
There are many monitoring tools available for different sizes and structures of data centers. When configured correctly, they track key metrics, such as latency and I/O speeds, which provide the information needed to involve appropriate storage or network people. Monitoring tools also alert you to potentially useful information, such as a drive with 1% free disk space that's primed to cause server issues.
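Dedicated monitoring tools handle this far more thoroughly, but the idea behind a free-space alert is simple enough to sketch with Python's standard library. The 10% default threshold and the function names below are arbitrary assumptions for illustration.

```python
import shutil


def free_percent(path):
    """Return the percentage of free space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on unusual pseudo-filesystems
    return usage.free / usage.total * 100


def low_space_alerts(paths, threshold_percent=10.0):
    """Return (path, free percent) pairs for paths below the threshold."""
    alerts = []
    for path in paths:
        percent = free_percent(path)
        if percent < threshold_percent:
            alerts.append((path, round(percent, 1)))
    return alerts
```

A scheduled job could call `low_space_alerts` with the mount points you care about and forward any results to whatever notification channel your team already uses.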
Many products also monitor services, so if a critical service crashes and stops, the tool will send an alert or automatically attempt to restart it based on the rules you set.
Surprisingly, server and related logs are often overlooked.
When an issue comes up, technicians often think they know what the cause is and spend hours trying to prove their theory. A few minutes spent reading the logs will frequently reveal the exact cause of the problem. Permission issues, for example, are easier to fix if you know which two components are trying to communicate and under which account.
Check the Event Viewer logs on Microsoft Windows or syslogs on Unix/Linux servers for warnings and errors. Application logs are also worth reading, as they often contain error data that points you in the right direction of a root cause. Retain log data in a sequestered storage location to track long-term server status and behavior.
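A first pass over a text log can be as simple as counting lines by severity to see whether errors spiked around the time of the incident. The Python sketch below assumes the severity appears as a plain word (`error`, `warning` or `critical`) somewhere in each line; real syslog and application formats vary, so the pattern is an assumption to adjust.

```python
import re
from collections import Counter

# Assumed severity keywords; adjust to match your own log format.
SEVERITY = re.compile(r"\b(error|warning|critical)\b", re.IGNORECASE)


def count_severities(lines):
    """Count log lines by the first severity keyword found in each line."""
    counts = Counter()
    for line in lines:
        match = SEVERITY.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts
```

Because the function accepts any iterable of strings, you can pass it an open file handle directly, for example `count_severities(open(path, encoding="utf-8", errors="replace"))`, where `path` points to a log file on your system.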
Follow these general steps to have server stats and logs available:
Some server administrators consider a request for vendor assistance a defeat -- don't. After a comprehensive check on the basics, spend a few minutes to log a call, rather than wait until several hours into an outage.
When the IT environment is healthy, take the time to check the specific details of the support service-level agreements (SLAs) with your organization's vendors. If the vendor won't contact you until the next working day, log the problem as early as possible to stave off a frustrating night.
Many vendors have specific instructions -- the details of which are usually available online -- on how to troubleshoot server issues. Check resources from the vendor's knowledge base and online forums.
Follow these general steps to stay ahead of provider issues:
It can be frustrating when server troubleshooting and resolution require more than five minutes, but don't be afraid to ask for help. Preparation, communication and a strong understanding of your environment are the tools of a hero who saves the day.
Editor's Note: This article was originally written by Adam Fowler in 2020. TechTarget Editors revised it in 2023 to improve the reader experience.