March 14, 2001
The Case of the DOS Database Delay
The day started simply enough. I had a cup of java, read a couple of network reports, and took a quick look over the file servers. The network was running well, probably because most people were out of the office over the holiday break. The calm air should have been my first clue.
We received an urgent call from the Accounting department (why is it ALWAYS the Accounting department?). A legacy DOS-based database program ran a job every week, and the job normally took fifteen to twenty minutes. This month, the run took almost two hours!
After a quick meeting among the database application developers, the network analyst team, and the usual management representation, we went to work. In the meeting, the database developers said they had not changed the application, and had recently re-indexed the database for speed. The user felt that the application was growing slower and slower each week, and blamed the network and the file server for the slow response times.
Remember, this was a DOS application running from a file server. This application didn't have the advantages of a client/server application, where some processes run on the client and other processes run on the server. For this DOS application, every database process used the client's CPU, and all data was pulled from the file server, across the network, to the workstation. Although this application wasn't a flashy client/server application, it presented plenty of complexity.
The Accounting department had another file server, one that was only used when testing applications before adding them to the corporate network. The application and data were moved from the production server to the test server, and the job was run again. Interestingly enough, the delay remained. We concluded that the problem didn't appear to be specific to the file server. We still weren't sure if the network was the issue, so we plugged in a Network General (now Network Associates) Sniffer, and sat in the network room while the database job ran again.
We watched the Sniffer, which showed the network with no physical errors and very few network and application retransmissions. In fact, the entire monitoring period showed the network was running at peak efficiency and putting a LOT of data over the wire.
The file server statistics showed similar results - no errors, but tons of data was transferred. After the job was complete, the network had transferred hundreds of megabytes of data! We'd found the reason the job was taking so long, but we weren't sure why the large data transfer was occurring.
We met again (corporations don't work unless you have plenty of meetings). This time, we presented our findings to the team and asked the programming team about the large data transfer. With this new information, the programmers revised their initial report and stated that they weren't sure if the database had been re-indexed or not.
So, we hovered over the database team while they double-checked their data. A few minutes later, the database had been re-indexed, and the job was run again. This time, the job ran in less than 15 minutes. I spent the next few minutes holding my associate away from the throats of the database team. We had spent an entire working day troubleshooting a 'network problem', only to find the ultimate problem was fixed with a five-minute database re-indexing that we were originally told had been completed.
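For anyone wondering how a missing index turns into hundreds of megabytes on the wire, here is a rough back-of-the-envelope sketch in Python. Every number in it (record size, index entry size, table size, number of lookups) is made up for illustration and was not measured from the Accounting job, and the function names are hypothetical. It only assumes what was described above: the workstation does all the work, so every byte the database touches has to cross the network from the file server.

```python
import math

# All of these values are assumptions for illustration, not measurements.
RECORD_SIZE = 512        # bytes per database record
INDEX_ENTRY_SIZE = 16    # bytes per index entry
RECORD_COUNT = 20_000    # records in the table
LOOKUPS = 50             # record lookups performed by the job


def bytes_full_scan(lookups, record_count, record_size):
    """Bytes pulled across the network when there is no usable index.

    Each lookup reads records from the file server until it finds a match,
    which is assumed to be half the table on average.
    """
    return lookups * (record_count // 2) * record_size


def bytes_indexed(lookups, record_count, record_size, entry_size):
    """Bytes pulled across the network when a sorted index is available.

    Each lookup is modeled as a binary search over the index entries,
    followed by a single record read.
    """
    probes = math.ceil(math.log2(record_count))
    return lookups * (probes * entry_size + record_size)


if __name__ == "__main__":
    scan = bytes_full_scan(LOOKUPS, RECORD_COUNT, RECORD_SIZE)
    indexed = bytes_indexed(LOOKUPS, RECORD_COUNT, RECORD_SIZE, INDEX_ENTRY_SIZE)
    print(f"full scan: {scan:,} bytes")
    print(f"indexed:   {indexed:,} bytes")
```

With these invented numbers, the unindexed job drags roughly a quarter of a gigabyte to the workstation, while the indexed job moves only a few tens of kilobytes. The real application certainly behaved differently in the details, but the shape of the problem is the same: the network faithfully delivered everything it was asked for, and it was simply asked for far too much.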
This case taught us a few critical points:
* Use every tool available, including protocol analysis tools and file server statistics.
* Misunderstandings can occur between people. Double check the easy fixes first, and work towards the harder fixes.
* When things are slow, watch your back. :)
Posted by james_messer at March 14, 2001 06:47 PM
