March 18, 2001
How NOT to Troubleshoot a Network
I was recently asked to assist in providing an overall `health check' of an existing network for a client-of-a-client. This was presented as a great opportunity to meet a new group of network people, as well as provide some basic network information about the client's network.
Because this was done as a third-party engagement, I wasn't able to communicate directly with the client's network group prior to my arrival. All of the information I received about this job came directly from the middleman.
** Tip #1: Don't take the word of a middleman, no matter how nice they are. Most of the time, the middleman isn't given the entire story, and the middleman isn't usually interested in getting the entire story; that's why you're there. Often, a `simple' network engagement really means `fix all problems on our network.' This particular situation was a very casual engagement provided as a favor. Had this been a formal engagement, there would have been the usual paperwork that sets the client's expectations. **
I've done these `favors' before, and I was prepared for a very different story once I arrived on-site. At least I took my own advice in this regard, because I unfortunately dismissed many other network troubleshooting fundamentals once I arrived.
The client was interested in basic network health statistics, but as we unpacked our equipment they stated that there was `this one network problem that perhaps you could look at?' Now that the other shoe had dropped, I said that I'd do whatever I could to help, and I'd offer any suggestions that might become apparent.
The client's problem was the universal network ailment, `the network is slow.' A user in the accounting department had been using an application for a number of months, but the application had become progressively slower. A call to the software manufacturer was met with a common theme - it must be the network! The network team was against the wall, and had to prove that the network wasn't involved and show where the problem was actually occurring.
This problem sounded easy enough, so we mapped out our game plan. We'd place the Sniffer in the closet containing the user's workstation connection, and we would use the traffic redirection capabilities of the switch to provide the stream of network packets for the Sniffer. Once the initial trace was made, we'd determine the logical next step for analysis. Sounds easy, doesn't it?
** Tip #2: Always have a plan B. In this client's environment, the station wiring was done by one team, the access switches were installed by another team, and a third entity was responsible for the physical security of the data closet. We had to communicate with three separate entities to gain access to the closet and begin our analysis task. **
In this case, everything that could have gone wrong did go very wrong. After some dead time standing around the hallway while the client went to find the key to the data closet, we were given access to the closet. We wrote down the user's workstation jack number, and searched through the usual maze of wires until we found what we believed was the user's switch port. Unfortunately, there were no available ports on the switch for the Sniffer, so we disconnected an unused connection to free a port for our Sniffer.
As I searched through cables, the client's switch expert was having problems logging into the switch. A combination of laptop serial port inconsistencies, multiple access passwords to the company's switches, and a bad hub created a series of delays that found us one hour later without any further progress. We weren't able to log into the switch to redirect the network traffic, we found that the cable that we had mapped back to the user's workstation didn't really belong to the user, and the hub that we brought as a backup didn't work properly either.
** Tip #3: Make sure your tool kit has the proper tools, and that they work properly! It's not helpful to have a laptop to use for portable terminal access if the serial port is flaky. Hubs are only helpful for network taps if the proper cables are included, and they are in optimal condition. Coordination of an enterprise's switch configurations can be time-consuming on the front-end, but it will reap many benefits when problems are occurring. **
We punted, and decided to get a new hub and connect it in the user's office. This connectivity solution was not as professional as a closet link, but at least we had access to the data as it traversed the network, and we knew that this data definitely belonged to the user!
We started a capture session, and asked the user to move through the steps that usually triggered the slowdown. Sure enough, we immediately had slowdowns! After more than two hours of fumbling around the network, we had a small level of success.
The resulting protocol decode showed that a request was sent from the client's workstation, and the workstation waited about 45 seconds to receive a response to the initial request. From this information, it was obvious that we needed to investigate either the server's role in this slowdown or the workstation configuration.
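To make the measurement concrete, here is a minimal sketch of how a request-to-response gap can be pulled out of a list of captured packets. It is not how the Sniffer does it internally; the `Packet` record, the `response_delays` helper, the IP addresses, and the timestamps are all hypothetical, chosen only so that the first gap lands near the 45 seconds we saw.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    timestamp: float  # seconds since the start of the capture
    src: str          # source address
    dst: str          # destination address


def response_delays(packets, client, server):
    """Pair each client request with the next server reply and return the gaps."""
    delays = []
    pending = None  # timestamp of the oldest request still waiting for a reply
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        if pkt.src == client and pkt.dst == server:
            if pending is None:
                pending = pkt.timestamp
        elif pkt.src == server and pkt.dst == client and pending is not None:
            delays.append(pkt.timestamp - pending)
            pending = None
    return delays


# Hypothetical trace: one slow exchange followed by one normal exchange.
trace = [
    Packet(0.00, "10.0.0.5", "10.0.0.9"),
    Packet(45.25, "10.0.0.9", "10.0.0.5"),
    Packet(46.00, "10.0.0.5", "10.0.0.9"),
    Packet(46.50, "10.0.0.9", "10.0.0.5"),
]
delays = response_delays(trace, client="10.0.0.5", server="10.0.0.9")
print(delays)
```

A gap measured this way, at the workstation, only says how long the user waited. It does not say where the time went, which is why a second capture point matters.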
We didn't have any information about the inner workings of this application, and we didn't completely understand the relationship of the client to the server, or any of the back-end processes related to this application. If this study were done in different circumstances, we'd probably consider getting a representative from the development team involved to provide us with an overview of the application's basic structure.
We needed to view the application's data movement from the perspective of both the client and the server simultaneously. Although we had the equipment on-site, we hadn't set everything up for the first trace.
** Tip #4: Smoke `em if you've got `em! We dragged all of this equipment on-site, and didn't even use it for our initial trace. Many network problems are intermittent, and there was no guarantee that this problem would be reproducible during our engagement. If you're going to make a trace, make sure all of your equipment is recording at all times! **
Fortunately, we were able to connect our Gigabit Sniffer Pro to a central backbone switch and redirect all traffic to-and-from the central database server to the Sniffer Pro analyzer. Since we were now bringing all of our guns to bear on this problem, we also started server-based traces with Microsoft's Performance Monitor application.
We knew that this was a judgment call. By starting PerfMon recordings on the server, we could possibly change the results of our troubleshooting by changing the operation of the database server. In retrospect, it may have been better to perform another end-to-end analysis of the slowdown before enabling PerfMon and running the test again. At the time, however, we weren't really sure how reproducible the problem would be, and we wanted to make sure we got everything on this final run.
We ran the trace again, and fortunately saw the slow-down symptom appear again (sometimes it _IS_ better to be lucky than good). This trace verified that the server received the request, and the request bounced around for 45 seconds inside of the database server before the reply was sent. Because we were capturing all traffic into the server and out of the server, we knew that no additional processes were externally affecting our application slowdown.
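The reasoning behind the two-point capture can be sketched in a few lines. This is an illustration only, with a hypothetical `split_delay` helper and made-up timestamps; the one figure taken from our traces is the roughly 45-second wait. Each difference is taken on a single analyzer's clock, so the two capture points don't need synchronized clocks.

```python
def split_delay(client_sent, client_received, server_received, server_replied):
    """Split the wait seen by the user into server time and network time.

    client_* timestamps come from the capture near the workstation,
    server_* timestamps come from the capture near the server.
    """
    total = client_received - client_sent            # what the user experienced
    server_time = server_replied - server_received   # time spent inside the server
    network_time = total - server_time               # whatever is left is transit
    return total, server_time, network_time


# Hypothetical timestamps (seconds) from the two capture points.
total, server_time, network_time = split_delay(
    client_sent=0.0,
    client_received=45.5,
    server_received=100.25,
    server_replied=145.5,
)
print(total, server_time, network_time)
```

If nearly all of the total shows up as server time, as it did for us, the network is carrying the request and the reply promptly and the delay is inside the server.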
In the end, the client was pleased with our results. Although we didn't solve the application slowdown, we were able to definitively provide an explanation for the slowdown and suggest some further studies to help resolve the problem. More importantly for the client, they had proof that their network was running at top speed, and the heat was off the network team.
We'd probably work through this problem a bit differently next time, but that's why it's called a learning process. Perfection is a great goal, but where's the fun in that?
Posted by james_messer at March 18, 2001 11:05 PM
