SOFTWARE REVIEWS, PROGRAMMING TIPS, AND SOFTWARE SOLUTIONS FOR ALL YOUR BUSINESS NEEDS.

Four weeks of Hell ...

Added on by Ray Insalaco.
Weekend One ...
I guess the best way to start is at the beginning. At work we have two Dell PowerEdge 4600 servers. One of them is running VMware ESX 3.0 hosting seven servers: three Ubuntu servers and four Windows 2003 Enterprise servers. The second server is our ERP and MS SQL server. We had purchased ESX to install on that server so that we could move things around and even out our server utilization. That is pretty much where things started to go bad. I went in on a Saturday to install some new network cards in the two servers. When I got to work there was an error message on the ERP server that had caused the nightly processing to stop. I shut down the server, installed the new cards, and then booted it. Everything came up fine, so I started working on clearing the error that had stopped the processing. After an hour I gave Tech Support a call. After trying several things, we decided to let the tech remote into the server to try and find the problem. After four hours and three reboots the error was cleared and the nightly processing ran fine. On Monday we found out that two modules from the ERP system were totally hosed and would not run. It was decided that we would work around this until the next weekend.

The Next Weekend ...
Two days before we were going to install ESX on the second server, things started to run really slow on the server running ESX. We let it run for the rest of the week, figuring that we could bring everything down and reboot the server on the weekend. It had been running for about eight months without being rebooted, so we hoped that was all it was. That would have been a nice treat. When we started the server for the reboot, it could not find any drives. It has eight drives in a pair of RAID 5 arrays. Nothing could get the RAID controller to see the drives, so we ended up pulling and reseating everything. It then saw the drives, but none of the containers on the arrays. It was around this time that I started thinking that just about anything was better than where this was going. We ended up having to initialize all the drives and rebuild the arrays to get the server to see them. It took four hours for the arrays to initialize and become usable. At this point we had a bare-metal server with no OS on it. We put the VMware install CD in, booted into the installer, and started installing ESX. About 20 percent into the install there was an error: unable to write to the drives. Both arrays had corrupted and had to be rebuilt again. Another four hours later we were ready to try again. This time everything installed fine and booted into ESX. I started moving the backup VMware images onto the server and left for the day.

Day two started pretty well. I walked in and was able to start all the images that I had moved over before I left the night before. The images were about four months old, but we run tape backups every night to pick up all the changes since the images were made. Now a quick restore from tape and everything would be ready to go. This is the point at which day two went into the trash. A recent Windows update had killed our backup software's ability to restore over the image and then reboot. We had tested this when we first set things up, and disaster recovery worked great: restore the image, boot the image, restore the tape that we want, insert the DR floppy, reboot, and restore. Well, one of the Windows updates rendered the DR floppy worthless. The good news was that all the data was on the tapes, but we had to pull the data off the tapes manually and then move it where it needed to be. This whole process ended up taking 23 hours to get everything back up and running.

The Third Weekend ...
Finally we get in on a Saturday and everything is running well. I made sure the backup had run without error in case something went wrong and we needed it. We rebooted into the installer for VMware ESX and installed it without any problems. A quick reboot and we were up and running in VMware. I created the new VMware image for the ERP server and started installing Windows 2003 Enterprise Server. About halfway through the install, the alarm went off on the server. A quick check of the error code traced it back to a bad memory module. We ended up having to pull half of the memory to clear the error and get it running clean. So now we have this server with too little memory to run what we have installed on it. We moved forward with the planned upgrades, knowing that things would be slow until we got some new memory installed. Everything else went smoothly.

The Fourth Weekend ...
We have memory to install. Yeah ... We shut down the server and installed the memory. After booting up we gave the VM more memory, and it was amazing how much better it ran. For the first hour, anyway; then the ESX server locked up. I have to say I had never seen that before. After rebooting, it did not take long before I got to see it again. After pulling the new memory, everything started running fine. At this point it has been running for just over a week while we wait for the memory to be replaced. It seems that our supplier sent us memory that should work, but does not match what we were already using. As it stands today, we should have the replacement memory early next week.

I guess the moral of this story is not to let yourself get pushed into putting off hardware replacement when it is time. At this point one of these servers is over five years old and the other is just over four years old. Both of them are more than a year out of warranty. Going forward I will be making sure that the servers get replaced on a three-year cycle. Along the same lines, look at the number of drives and power supplies we have lost in desktops this year. We have almost thirty desktop computers in use that are seven years old and another fifteen that are six years old. I think that I am seeing a large budget request for next year.