SQL Server on VMware Server
We are deploying Deltek Vision 4.1 as our new financial management system in March. We started work a while back on this project. I built the infrastructure in VMware Server running on top of SuSE Linux Enterprise Server 9. We are using a three-box Vision implementation, with a separate VM running Windows 2003 Server Standard Edition on dedicated VMware Server hosts for each of Vision Web, Vision Reporting and Vision SQL Server. The virtualization is to allow for disaster recovery and portability of hardware. The database analyst and programmer guys got started quite a while ago getting the reports that our project managers rely on in our old system working in Vision. We’ve also been testing and troubleshooting Vision and training the accounting staff during this time.
A problem started manifesting itself with Vision and SQL Server after some of the data was imported into SQL Server and we started doing queries on it. The problem would occur particularly often whenever nested select statements were used in a query. SQL Server would fail to execute the query and error out with one of four different errors: error 5243, 5242, 823, or 682. All of these errors have multiple meanings, but a common thread is I/O problems to do with physical disks or storage drivers, when SQL Server does lots of writes in TEMPDB. In our case because we are using SQL Server on a virtual machine, it implied some kind of problem with virtual disks or with the VMware virtual storage controller driver, or possibly with the underlying filesystem on the VMware host.
Several VMware Server knowledgebase and discussion posts mentioned similar problems regarding SQL Server 2000 and SQL Server 2005 on VMware Server.
To confirm that the problem was a VMware problem, which was just a suspicion initially, I built a physical Windows 2003 Server that was otherwise identically configured to the virtual one. On the physical server the queries never failed in our tests.
That made us fairly confident that we had a problem with VMware Server somewhere. When I initially created the virtual machine, I built a reiserfs partition on our IBM DS4300 SAN to store it. When I built the VM, I created a 100 GB virtual disk that was configured in 2 GB chunks, and I did not preallocate all the space at build time. I thought that perhaps the I/O problems were occurring when the VM writes the TEMPDB and new storage was allocated as the virtual disk expanded. I decided to convert the virtual disk to a fully preallocated disk using vmware-vdiskmanager, which is a command-line tool that comes with VMware Server. I did that conversion and then tested the new VM on non-production hardware that was very similar to the production blade server, except that it had locally attached storage instead of a SAN LUN. The problem almost went away. It went from erroring out more than half the time to erroring out about once in 15 or 20 runs. That indicated that I was on the right track.
We had a momentary lapse of reason and thought that the VM might run better on a Windows VMware Server host. I moved the 100 GB preallocated disk version of the VM to a Windows XP Pro workstation running VMware Server. The error occurred nearly every time, so we abandoned that ill-conceived path.
Next, I thought that either reiserfs couldn’t cut it, or VMware Server couldn’t cut it. Since I had just received a new workstation from the vendor, I configured it with OpenSUSE 10.2 and formatted the disk in ext3. I also built an ESX Server 3 evaluation server in Engineering, on an IBM x334 pizza box connected to a fibre-channel SAN. I copied the 100 GB VM to both my new workstation and the ESX Server.
On my workstation, the moderately demanding test query ran 50 times in a row without failing until I gave up on it, proving out that ext3 works better as an underlying filesysetm for VMware Server, at least when the VM you are hosting is Windows 2003 Server with SQL Server 2005.
On the ESX server, the query also worked every time, which was fully expected.
Finally, I decided that even if the ext3 and preallocated disk fix didn’t fix the problem 100% of the time, it was worth applying it to the production system, so that the problems would be reduced during training. It would also buy us time to decide whether or not to buy ESX Server for about $10,000CDN including one year of support.
I shut down the production Vision database server after hours. I moved the existing non-preallocated VM off the SAN LUN that it used for production. Then I reformatted the reiserfs partition on the SAN LUN to ext3. After the format I was surprised to find that the available space was smaller than it was with reiserfs. I had to resize the SAN LUN up a few gigabytes to allow me to convert the non-preallocated disk to a fully preallocated one back on the SAN LUN. After the resize, I recreated the ext3 partition and used vmware-vdiskmanager to convert the non-preallocated disk to a preallocated one. The VM booted and ran fine after the disk conversion.
On the converted production VM, all errors appear to have ceased and performance may have improved slightly as well. We have decided to proceed to deployment on VMware Server using this configuration.
Take Away Points
- The problem referenced in this article occurs on SQL Server 2000 and SQL Server 2005. We discovered this after the fact while working on something else.
- It is a good idea to run VMware Server on Linux, not on a Windows host.
- It is a good idea to use ext3 instead of reiserfs as the filesystem to store your virtual machines. Other Linux filesystems might be suitable as well, but were not tested.
- Filesystems formatted with ext3 use more space for overhead than reiserfs.
- VMware Server is similar in performance to VMware ESX server for Windows 2003 virtual machines running SQL Server under light to moderate loads.
- In the future I will try very hard to not have to move a 100 GB virtual machine all over my network. It takes a long time to repeatedly move 100 GB worth of files from system to system. (duh!)
- When working with troubleshooting on large virtual machines, it is great to have lots and lots of fast storage nearby on the network. Speculative changes are much less hair-raising if you have lots of room to backup your virtual machines.
- You can do awesome stuff with virtual machines that you just can’t even consider unless you have lab hardware coming out your ears and an army of lab monkeys to help you.