SQL Server on VMware Server
2007-02-09
We are deploying Deltek Vision 4.1 as our new financial management system in March. We started work a while back on this project. I built the infrastructure in VMware Server running on top of SuSE Linux Enterprise Server 9. We are using a three-box Vision implementation, with a separate VM running Windows 2003 Server Standard Edition on dedicated VMware Server hosts for each of Vision Web, Vision Reporting and Vision SQL Server. The virtualization is to allow for disaster recovery and portability of hardware. The database analyst and programmer guys got started quite a while ago getting the reports that our project managers rely on in our old system working in Vision. We’ve also been testing and troubleshooting Vision and training the accounting staff during this time.
A problem started manifesting itself with Vision and SQL Server after some of the data was imported into SQL Server and we started doing queries on it. The problem would occur particularly often whenever nested select statements were used in a query. SQL Server would fail to execute the query and error out with one of four different errors: error 5243, 5242, 823, or 682. All of these errors have multiple meanings, but a common thread is I/O problems to do with physical disks or storage drivers, when SQL Server does lots of writes in TEMPDB. In our case because we are using SQL Server on a virtual machine, it implied some kind of problem with virtual disks or with the VMware virtual storage controller driver, or possibly with the underlying filesystem on the VMware host.
Several VMware Server knowledgebase and discussion posts mentioned similar problems regarding SQL Server 2000 and SQL Server 2005 on VMware Server.
To confirm that the problem was a VMware problem, which was just a suspicion initially, I built a physical Windows 2003 Server that was otherwise identically configured to the virtual one. On the physical server the queries never failed in our tests.
That made us fairly confident that we had a problem with VMware Server somewhere. When I initially created the virtual machine, I built a reiserfs partition on our IBM DS4300 SAN to store it. When I built the VM, I created a 100 GB virtual disk that was configured in 2 GB chunks, and I did not preallocate all the space at build time. I thought that perhaps the I/O problems were occurring when the VM writes the TEMPDB and new storage was allocated as the virtual disk expanded. I decided to convert the virtual disk to a fully preallocated disk using vmware-vdiskmanager, which is a command-line tool that comes with VMware Server. I did that conversion and then tested the new VM on non-production hardware that was very similar to the production blade server, except that it had locally attached storage instead of a SAN LUN. The problem almost went away. It went from erroring out more than half the time to erroring out about once in 15 or 20 runs. That indicated that I was on the right track.
We had a momentary lapse of reason and thought that the VM might run better on a Windows VMware Server host. I moved the 100 GB preallocated disk version of the VM to a Windows XP Pro workstation running VMware Server. The error occurred nearly every time, so we abandoned that ill-conceived path.
Next, I thought that either reiserfs couldn’t cut it, or VMware Server couldn’t cut it. Since I had just received a new workstation from the vendor, I configured it with OpenSUSE 10.2 and formatted the disk in ext3. I also built an ESX Server 3 evaluation server in Engineering, on an IBM x334 pizza box connected to a fibre-channel SAN. I copied the 100 GB VM to both my new workstation and the ESX Server.
On my workstation, the moderately demanding test query ran 50 times in a row without failing until I gave up on it, proving out that ext3 works better as an underlying filesysetm for VMware Server, at least when the VM you are hosting is Windows 2003 Server with SQL Server 2005.
On the ESX server, the query also worked every time, which was fully expected.
Finally, I decided that even if the ext3 and preallocated disk fix didn’t fix the problem 100% of the time, it was worth applying it to the production system, so that the problems would be reduced during training. It would also buy us time to decide whether or not to buy ESX Server for about $10,000CDN including one year of support.
I shut down the production Vision database server after hours. I moved the existing non-preallocated VM off the SAN LUN that it used for production. Then I reformatted the reiserfs partition on the SAN LUN to ext3. After the format I was surprised to find that the available space was smaller than it was with reiserfs. I had to resize the SAN LUN up a few gigabytes to allow me to convert the non-preallocated disk to a fully preallocated one back on the SAN LUN. After the resize, I recreated the ext3 partition and used vmware-vdiskmanager to convert the non-preallocated disk to a preallocated one. The VM booted and ran fine after the disk conversion.
On the converted production VM, all errors appear to have ceased and performance may have improved slightly as well. We have decided to proceed to deployment on VMware Server using this configuration.
Take Away Points
- The problem referenced in this article occurs on SQL Server 2000 and SQL Server 2005. We discovered this after the fact while working on something else.
- It is a good idea to run VMware Server on Linux, not on a Windows host.
- It is a good idea to use ext3 instead of reiserfs as the filesystem to store your virtual machines. Other Linux filesystems might be suitable as well, but were not tested.
- Filesystems formatted with ext3 use more space for overhead than reiserfs.
- VMware Server is similar in performance to VMware ESX server for Windows 2003 virtual machines running SQL Server under light to moderate loads.
- In the future I will try very hard to not have to move a 100 GB virtual machine all over my network. It takes a long time to repeatedly move 100 GB worth of files from system to system. (duh!)
- When working with troubleshooting on large virtual machines, it is great to have lots and lots of fast storage nearby on the network. Speculative changes are much less hair-raising if you have lots of room to backup your virtual machines.
- You can do awesome stuff with virtual machines that you just can’t even consider unless you have lab hardware coming out your ears and an army of lab monkeys to help you.
Entry Filed under: Feature Article, Linux, Virtualization. .
12 Comments Add your own
Leave a Comment
Some HTML allowed:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>
Trackback this post | Subscribe to the comments via RSS Feed
1.
Erick | 2007-02-23 at 18:20
Thanks for the article. I’ve been searching the web for hours trying to find some sort of tips as to what I should do with my virtual machine setup. I am seeing the same errors as you are seeing, but with a fairly busy website. I have been using XFS as a file system to store virtual machines and thought that had something to do with corruption. I was going to try ext3 and being you said it works for you, I may just give it a try now.
The hardware I am using is very fast hardware, so I didn’t see how that was an issue. I had looked over the thread you posted on the VMWare forums, and I pretty much have the same issue as all those people. I will report back after moving it over to ext3 and let you how it goes. I’m getting inconsistency errors about every three hours.
2.
Mark | 2007-03-25 at 23:13
Have you considered a non-journaling file system like (basic I know) ext2 or atleast turn off some of the intensive ext3 items (in fstab) ie:
/dev/sda6 /var ext3 data=writeback,noatime,nodiratime
ESX of course uses vmfs (vmware specific) hence why you might be getting 100% results.
3.
Scott | 2007-03-26 at 08:24
Mark: Thanks for your suggestion. After reading some tuning documentation on ext3, I’m going to try noatime,nodiratime first, but I’m not sure about data=writeback. SQL server requires scrupulous ordered writes, and from my limited understanding of data=writeback, some writes might be made out of order. Of course, I could be mistaken, so I’ll probably try out some variations in Engineering
4.
Marco | 2007-07-19 at 02:37
I think your “overhead” on the ext3 filesystems just represents the reserved root space. This is preallocated space that can only be used by root and processes that run under this account. It is there to prevent from crashing a system by user error (e.g. filling the filesystem to 100% with no use).
For partitions that are just used for data, or where no user-access is allowed you can switch that off: tune2fs -m 0 .
The default reserved space is 5%.
5.
Craig Chumney | 2007-10-01 at 09:03
Thanks for taking the time to post. We stumbled on to your blog when researching a SQL Server VMWare issue. I would hate to think how many hours you saved us. Please send your email address.
Thanks
Craig
6.
Scott Flowers | 2007-10-01 at 13:53
This post has consistently been my most popular post. It has gotten far more hits than any other thing I’ve written in this blog. I’ve also received several emails from people who have expressed thanks for the information, saying that I’d saved them a lot of hassle and head-scratching. I’m thrilled to get that kind of feedback, and that’s one of the reasons I write this blog.
7.
Kelly | 2007-11-22 at 10:20
I guess one question that I haven’t seen answered is whether its safe to ignore the error. I’ve got a small shop and its an error I very rarely see on our develoment server (vmware server on CentOS/ext3). But I do see it from time to time. Are there consequences to ignoring the error?
-k
8.
Scott | 2007-11-22 at 10:27
I guess the safety of ignoring the error depends on the type of query you’re trying to run. If you are doing a SELECT, then the consequence of ignoring the error is that you don’t get your result. If you are doing an INSERT or UPDATE, then the changes to the database will be lost.
9.
Todd | 2008-03-31 at 13:43
I have written a few whitepapers about running SQL Server on VMware and I found your blog entry today while doing some google searches. I am hosting a webchat tomorrow at 3 CST to discuss SQL Server on VMware and would like to invite you (and anybody else who is interested) to come and participate in the discussion. Here is the link for the chat - http://www.delltechcenter.com/page/TechCenter+Chat
Thanks - Todd
10.
SQL Server on VMware Serv&hellip | 2008-05-05 at 14:06
[...] Interesting… May 5, 2008 | Filed Under Words [...]
11.
Richard | 2008-06-10 at 05:41
You will have data corruption running SQL 2005 in VMWare Workstation hosted VM with Windows Host OS if you use virtual SCSI disks (the recommended and default option when creating VMs).
It would appear that virtual IDE disks are OK.
Its easy to demo the problem by running sqliosim.exe testing tool from Microsoft inside any Windows VM using SCSI virtual disks. No need to install SQL to prove there is a problem.
VMWare have finally acknowledged there is a BUG (Bug# 288470) in VMWare Workstation. Look out for a fix. Till then, I would suggest you not try run SQL2005 in VM unless you are doing predestrian speed queries.
12.
Scott | 2008-06-10 at 11:36
I never really bothered much with VMware Workstation hosted on Windows, other than to confirm it is excessively slow compared to hosted on Linux. Everything I do that’s using a hosted virtualization platform uses Linux as the host.