Home > Feature Article, Linux, Virtualization > SQL Server on VMware Server

SQL Server on VMware Server

We are deploying Deltek Vision 4.1 as our new financial management system in March. We started work a while back on this project. I built the infrastructure in VMware Server running on top of SuSE Linux Enterprise Server 9. We are using a three-box Vision implementation, with a separate VM running Windows 2003 Server Standard Edition on dedicated VMware Server hosts for each of Vision Web, Vision Reporting and Vision SQL Server. The virtualization is to allow for disaster recovery and portability of hardware. The database analyst and programmer guys got started quite a while ago getting the reports that our project managers rely on in our old system working in Vision. We’ve also been testing and troubleshooting Vision and training the accounting staff during this time.

A problem started manifesting itself with Vision and SQL Server after some of the data was imported into SQL Server and we started doing queries on it. The problem would occur particularly often whenever nested select statements were used in a query. SQL Server would fail to execute the query and error out with one of four different errors: error 5243, 5242, 823, or 682. All of these errors have multiple meanings, but a common thread is I/O problems to do with physical disks or storage drivers, when SQL Server does lots of writes in TEMPDB. In our case because we are using SQL Server on a virtual machine, it implied some kind of problem with virtual disks or with the VMware virtual storage controller driver, or possibly with the underlying filesystem on the VMware host.

Several VMware Server knowledgebase and discussion posts mentioned similar problems regarding SQL Server 2000 and SQL Server 2005 on VMware Server.

To confirm that the problem was a VMware problem, which was just a suspicion initially, I built a physical Windows 2003 Server that was otherwise identically configured to the virtual one. On the physical server the queries never failed in our tests.

That made us fairly confident that we had a problem with VMware Server somewhere. When I initially created the virtual machine, I built a reiserfs partition on our IBM DS4300 SAN to store it. When I built the VM, I created a 100 GB virtual disk that was configured in 2 GB chunks, and I did not preallocate all the space at build time. I thought that perhaps the I/O problems were occurring when the VM writes the TEMPDB and new storage was allocated as the virtual disk expanded. I decided to convert the virtual disk to a fully preallocated disk using vmware-vdiskmanager, which is a command-line tool that comes with VMware Server. I did that conversion and then tested the new VM on non-production hardware that was very similar to the production blade server, except that it had locally attached storage instead of a SAN LUN. The problem almost went away. It went from erroring out more than half the time to erroring out about once in 15 or 20 runs. That indicated that I was on the right track.

We had a momentary lapse of reason and thought that the VM might run better on a Windows VMware Server host. I moved the 100 GB preallocated disk version of the VM to a Windows XP Pro workstation running VMware Server. The error occurred nearly every time, so we abandoned that ill-conceived path.

Next, I thought that either reiserfs couldn’t cut it, or VMware Server couldn’t cut it. Since I had just received a new workstation from the vendor, I configured it with OpenSUSE 10.2 and formatted the disk in ext3. I also built an ESX Server 3 evaluation server in Engineering, on an IBM x334 pizza box connected to a fibre-channel SAN. I copied the 100 GB VM to both my new workstation and the ESX Server.

On my workstation, the moderately demanding test query ran 50 times in a row without failing until I gave up on it, proving out that ext3 works better as an underlying filesysetm for VMware Server, at least when the VM you are hosting is Windows 2003 Server with SQL Server 2005.

On the ESX server, the query also worked every time, which was fully expected.

Finally, I decided that even if the ext3 and preallocated disk fix didn’t fix the problem 100% of the time, it was worth applying it to the production system, so that the problems would be reduced during training. It would also buy us time to decide whether or not to buy ESX Server for about $10,000CDN including one year of support.

I shut down the production Vision database server after hours. I moved the existing non-preallocated VM off the SAN LUN that it used for production. Then I reformatted the reiserfs partition on the SAN LUN to ext3. After the format I was surprised to find that the available space was smaller than it was with reiserfs. I had to resize the SAN LUN up a few gigabytes to allow me to convert the non-preallocated disk to a fully preallocated one back on the SAN LUN. After the resize, I recreated the ext3 partition and used vmware-vdiskmanager to convert the non-preallocated disk to a preallocated one. The VM booted and ran fine after the disk conversion.

On the converted production VM, all errors appear to have ceased and performance may have improved slightly as well. We have decided to proceed to deployment on VMware Server using this configuration.

Take Away Points

  • The problem referenced in this article occurs on SQL Server 2000 and SQL Server 2005. We discovered this after the fact while working on something else.
  • It is a good idea to run VMware Server on Linux, not on a Windows host.
  • It is a good idea to use ext3 instead of reiserfs as the filesystem to store your virtual machines. Other Linux filesystems might be suitable as well, but were not tested.
  • Filesystems formatted with ext3 use more space for overhead than reiserfs.
  • VMware Server is similar in performance to VMware ESX server for Windows 2003 virtual machines running SQL Server under light to moderate loads.
  • In the future I will try very hard to not have to move a 100 GB virtual machine all over my network. It takes a long time to repeatedly move 100 GB worth of files from system to system. (duh!)
  • When working with troubleshooting on large virtual machines, it is great to have lots and lots of fast storage nearby on the network. Speculative changes are much less hair-raising if you have lots of room to backup your virtual machines.
  • You can do awesome stuff with virtual machines that you just can’t even consider unless you have lab hardware coming out your ears and an army of lab monkeys to help you.
About these ads
  1. Erick
    2007-02-23 at 18:20

    Thanks for the article. I’ve been searching the web for hours trying to find some sort of tips as to what I should do with my virtual machine setup. I am seeing the same errors as you are seeing, but with a fairly busy website. I have been using XFS as a file system to store virtual machines and thought that had something to do with corruption. I was going to try ext3 and being you said it works for you, I may just give it a try now.

    The hardware I am using is very fast hardware, so I didn’t see how that was an issue. I had looked over the thread you posted on the VMWare forums, and I pretty much have the same issue as all those people. I will report back after moving it over to ext3 and let you how it goes. I’m getting inconsistency errors about every three hours.

  2. Mark
    2007-03-25 at 23:13

    Have you considered a non-journaling file system like (basic I know) ext2 or atleast turn off some of the intensive ext3 items (in fstab) ie:
    /dev/sda6 /var ext3 data=writeback,noatime,nodiratime

    ESX of course uses vmfs (vmware specific) hence why you might be getting 100% results.

  3. 2007-03-26 at 08:24

    Mark: Thanks for your suggestion. After reading some tuning documentation on ext3, I’m going to try noatime,nodiratime first, but I’m not sure about data=writeback. SQL server requires scrupulous ordered writes, and from my limited understanding of data=writeback, some writes might be made out of order. Of course, I could be mistaken, so I’ll probably try out some variations in Engineering

  4. 2007-07-19 at 02:37

    I think your “overhead” on the ext3 filesystems just represents the reserved root space. This is preallocated space that can only be used by root and processes that run under this account. It is there to prevent from crashing a system by user error (e.g. filling the filesystem to 100% with no use).
    For partitions that are just used for data, or where no user-access is allowed you can switch that off: tune2fs -m 0 .
    The default reserved space is 5%.

  5. Craig Chumney
    2007-10-01 at 09:03

    Thanks for taking the time to post. We stumbled on to your blog when researching a SQL Server VMWare issue. I would hate to think how many hours you saved us. Please send your email address.
    Thanks
    Craig

  6. 2007-10-01 at 13:53

    This post has consistently been my most popular post. It has gotten far more hits than any other thing I’ve written in this blog. I’ve also received several emails from people who have expressed thanks for the information, saying that I’d saved them a lot of hassle and head-scratching. I’m thrilled to get that kind of feedback, and that’s one of the reasons I write this blog.

  7. Kelly
    2007-11-22 at 10:20

    I guess one question that I haven’t seen answered is whether its safe to ignore the error. I’ve got a small shop and its an error I very rarely see on our develoment server (vmware server on CentOS/ext3). But I do see it from time to time. Are there consequences to ignoring the error?

    -k

  8. 2007-11-22 at 10:27

    I guess the safety of ignoring the error depends on the type of query you’re trying to run. If you are doing a SELECT, then the consequence of ignoring the error is that you don’t get your result. If you are doing an INSERT or UPDATE, then the changes to the database will be lost.

  9. 2008-03-31 at 13:43

    I have written a few whitepapers about running SQL Server on VMware and I found your blog entry today while doing some google searches. I am hosting a webchat tomorrow at 3 CST to discuss SQL Server on VMware and would like to invite you (and anybody else who is interested) to come and participate in the discussion. Here is the link for the chat – http://www.delltechcenter.com/page/TechCenter+Chat

    Thanks – Todd

  10. Richard
    2008-06-10 at 05:41

    You will have data corruption running SQL 2005 in VMWare Workstation hosted VM with Windows Host OS if you use virtual SCSI disks (the recommended and default option when creating VMs).

    It would appear that virtual IDE disks are OK.

    Its easy to demo the problem by running sqliosim.exe testing tool from Microsoft inside any Windows VM using SCSI virtual disks. No need to install SQL to prove there is a problem.

    VMWare have finally acknowledged there is a BUG (Bug# 288470) in VMWare Workstation. Look out for a fix. Till then, I would suggest you not try run SQL2005 in VM unless you are doing predestrian speed queries.

  11. 2008-06-10 at 11:36

    I never really bothered much with VMware Workstation hosted on Windows, other than to confirm it is excessively slow compared to hosted on Linux. Everything I do that’s using a hosted virtualization platform uses Linux as the host.

  12. Joe
    2008-07-24 at 15:27

    Scott,

    Thanks for this post. Is SQL server still running fine on your VM since you wrote this blog message? Would I be crazy to put SQL server on a VM in a production environment?

  13. fornetti
    2008-08-31 at 05:19

    I do not believe this

  14. 2008-09-10 at 15:53

    Very interesting article.
    We are using something similar running a mysql+apache web server in a VM hosted on CentOS5.
    We started noticing journal errors and the LVM would remount read-only.
    We are using an ext3 fs but the VMware hard drive is NOT preallocated.
    We are currently using tune2fs to delete the journal and recreate it.

    I am now wondering if we should preallocate the VM drive.
    We are running VMware server on Centos5.
    We also have a large InnoDB mysql database running on the host.

    The problem showed worse after converting to a PAE Kernel to get the remaining 13Gb Ram (Only 3Gb Showed up with standard kernel)

  15. Terence
    2008-11-02 at 13:26

    Nice article. I’ve been using vwware for the past 4 years and each year I manage to tweak it a bit more. Many of my clients cant afford those expensive SCSI/fibre channel systems but ext3 can be tweaked…as can be mdadm (software raid)

    I use OpenSuse 11 with the kernel heavily optimized…I cut down as much as I can from it.

    1. add nodiratime,noatime,commit=60 up to 300 to the mount of the drive
    2. set read ahead of drive to 1MB
    blockdev –setra 1024 /dev/sda
    3. Set read ahead drive cache
    echo 256 > /sys/block/sda/queue/read_ahead_kb
    4. Use LVMs and per VM use its seperate LVM

    And to solve the large disks issue I am now using FreeNAS (as a VM of course) and connect it to the server using iScsi..such that the server and data are seperate

  16. 2009-01-11 at 02:20

    What an incredibly informative post; thanks Scott! I was concerned about migrating my SQL 2K8 server to an ESXi server and now I feel much better.

    Terence, please consider making a blog post with everything you’ve discovered! I’m running VMware Server 2 with SuSE 11.1 but haven’t tweaked it just yet. A post like that would be super helpful :)

  17. Blaine
    2009-03-28 at 20:02

    I am running SEL 10 and VMware Server 2.0 we had similar issues with reiserfs and switched to ext3.

    We are using an Areca (ARC-1231) Raid Controller with 1GB cache.

    I have found that performing snapshots while the system is running will produce useless snapshots because I recovered from one and the database was totally corrupted. If you shut the VM down and take the snapshot everything is fine.

    I have not tested the new features with the latest VMware Server 2.0 that use VSS to quiesce the database during snapshots. If you have let me know. I am planning on doing some testing tonight or tomorrow.

  18. payne
    2009-04-29 at 09:18

    I hope this is not a stupid question, but can anyone tell me if using Symantec net backup would be an acceptable and troublefree way to backup our SQl server instances as follows?
    We are in testing stage -only one user deployed: we have ESX upon which Win 2k3 server is running. Installed on the Win2k3 Server are multiple instances of SQL server express 2005 linked to multiple Recorder 6 (hybrid biological databases) which are also installed on the Win2k3 server. There is a single Citrix Zen app 5 installed on the win2003 server. The Citrix app serves up the various recorder 6 apps to where they are required to users on our Novell network. Our IM department has proposed to use Symantec/Veritas Net Backup to backup the entire Win2003 server instead of using the SQL management console to run individual backups.

  19. test
    2010-04-16 at 20:13

    test to see whether i need to register to comment.

  20. Mahmoud
    2012-02-08 at 05:56

    try to disable fire wall in vm machine

  1. 2008-05-05 at 14:06
  2. 2009-11-16 at 09:15

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: