In the ongoing battle between email and disk space in our company, I lost another round. Our users managed to fill up all the disk space on one node of our mail server cluster this past weekend, and I’ve spent all day trying to get things back to normal. All the wonderful telephone calls from irate users are really motivating. Probably about as motivating to me as all the requests we make to them to please keep their mailboxes cleaned up. It’s amazing how much downtime a full disk can cause. Mail servers consist of various components that handle messages in different layers. When one component’s disk fills up, messages start queuing at a higher level in the system, which causes other disks to fill rapidly and server components to start thrashing. That slows everything down, even on the servers that aren’t full.
It’s fun and games until everything sorts itself out. The most challenging part is trying to explain to users why they can’t have infinite storage for email, because they think disks are so cheap. They don’t see that high-performance fibre-channel disks cost an order of magnitude more than the disks in their cheapo home PC, and that we also have to pay for equipment to back everything up.
I should have everything back to normal late tonight, with a bit of additional space available to stave off the next crash until we have a working automatic archiving system.
Our executives camp out in the offices of whoever is away when they travel. Sometimes they camp out on your computer. I’ve got the perfect deterrent for keeping non-technical staff off your computer when you are away from your desk: Check out the “Das Keyboard II”.
It’s totally blank and totally cool, and totally impossible for your typical executive to work.
We are in the final stretch before deploying our Deltek Vision setup. One of the outstanding items left is getting Novell Identity Manager set up to replicate eDirectory users into Vision as Vision login accounts. That way our users will be able to use their everyday login IDs in Vision.
Vision is built on Microsoft SQL Server 2005, and it stores its user accounts as rows in a SQL table. The passwords are stored as SHA1 hashes. We have a fair amount of experience with Identity Manager, because we use it to maintain a centralized enterprise-wide eDirectory with all our user credentials in it and we also use it to automatically provision our GroupWise accounts. We’ve never used it to synchronize user credentials with a database through the IDM JDBC driver, though, so this is a bit of a learning experience.
It took me a few days to get a grasp of how the Identity Manager driver for JDBC works, learn how to configure it and install JDBC drivers, learn how to work with SQL Server (I have very little recent experience with it), and figure out how to get eDirectory to spit up clear passwords. We manage passwords in GroupWise, upon initial account creation, but we don’t synchronize passwords except between eDirectory instances, which is easy, so figuring out password synchronization was a bit of work.
Last night I had the “A-Ha!” moment and figured out how to get eDirectory to cough up passwords in the clear upon a password change, and now I have everything synchronizing over to our Vision SQL Server. The only thing left to do is to transform the output so that the clear text password is replaced with a SHA1 hash of itself, in lower case, before the data is stuffed into SQL Server 2005. Then it’s a matter of me working with Bart so he can write some triggers and stored procedures in SQL to take the data that I’m synchronizing from eDirectory to a transfer table and insert it into the proper Vision tables.
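To make the hashing step concrete, here is a minimal Python sketch of the transformation described above: take the clear text password and produce its SHA1 hash as a lower-case hex string. This is only an illustration, not the Identity Manager policy itself. The function name is made up, and the UTF-8 encoding is an assumption; the byte encoding Vision actually hashes would need to be confirmed against a known account.

```python
import hashlib


def vision_password_hash(clear_text: str, encoding: str = "utf-8") -> str:
    """Return the SHA1 hash of a password as a 40-character lower-case hex string.

    The function name and the default encoding are assumptions made for
    illustration; they do not come from Vision or Identity Manager.
    """
    # hexdigest() always returns lower-case hexadecimal digits.
    return hashlib.sha1(clear_text.encode(encoding)).hexdigest()
```

Comparing the result for a test account against the hash already stored in the Vision table is a quick way to check that the encoding assumption holds.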
I read today about an easy-to-use exploit on Solaris that allows login via telnet without even a password. I tried it out on a couple of our Solaris boxes and it worked. The exploit works as follows: Add -l “-rusername” to the telnet command line, where username is the name of the user you want to become as you telnet into the Solaris box. For example, this will log in as “bob” on a server called “solaris.box.com”:
telnet -l "-rbob" solaris.box.com
The exploit doesn’t work for root (-l “-rroot” doesn’t work), but it worked on my Solaris boxes for every other user that I tried.
Here’s how you turn off Telnet on Solaris 10: Become root and then run this command.
svcadm disable telnet
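After disabling the service, it is worth confirming from another machine that nothing is still listening on the telnet port. The following is a small Python sketch for that check; the function name is made up for illustration, and it assumes telnet is on its standard TCP port 23.

```python
import socket


def port_is_open(host: str, port: int = 23, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, False otherwise.

    Port 23 is the standard telnet port; the function name is hypothetical.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling port_is_open("solaris.box.com") against each Solaris box should return False once telnet has been disabled. A False result can also mean the host is unreachable, so it only counts as confirmation if the box responds on other ports.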
I have two questions: In 2007, why the heck is the telnet service enabled by default on Solaris 10? And why didn’t I turn that off when I set up those boxes? The answer to the second question is that I stupidly expected it to be disabled by default.
We are deploying Deltek Vision 4.1 as our new financial management system in March. We started work a while back on this project. I built the infrastructure in VMware Server running on top of SuSE Linux Enterprise Server 9. We are using a three-box Vision implementation, with a separate VM running Windows 2003 Server Standard Edition on dedicated VMware Server hosts for each of Vision Web, Vision Reporting and Vision SQL Server. The virtualization is to allow for disaster recovery and portability of hardware. The database analyst and programmer guys got started quite a while ago getting the reports that our project managers rely on in our old system working in Vision. We’ve also been testing and troubleshooting Vision and training the accounting staff during this time.
A problem started manifesting itself with Vision and SQL Server after some of the data was imported into SQL Server and we started doing queries on it. The problem would occur particularly often whenever nested select statements were used in a query. SQL Server would fail to execute the query and error out with one of four different errors: error 5243, 5242, 823, or 682. All of these errors have multiple meanings, but a common thread is I/O problems to do with physical disks or storage drivers, when SQL Server does lots of writes in TEMPDB. In our case because we are using SQL Server on a virtual machine, it implied some kind of problem with virtual disks or with the VMware virtual storage controller driver, or possibly with the underlying filesystem on the VMware host.
Several VMware Server knowledgebase and discussion posts mentioned similar problems regarding SQL Server 2000 and SQL Server 2005 on VMware Server.
To confirm that the problem was a VMware problem, which was just a suspicion initially, I built a physical Windows 2003 Server that was otherwise identically configured to the virtual one. On the physical server the queries never failed in our tests.
That made us fairly confident that we had a problem with VMware Server somewhere. When I initially created the virtual machine, I built a reiserfs partition on our IBM DS4300 SAN to store it. When I built the VM, I created a 100 GB virtual disk that was configured in 2 GB chunks, and I did not preallocate all the space at build time. I thought that perhaps the I/O problems were occurring when the VM writes the TEMPDB and new storage was allocated as the virtual disk expanded. I decided to convert the virtual disk to a fully preallocated disk using vmware-vdiskmanager, which is a command-line tool that comes with VMware Server. I did that conversion and then tested the new VM on non-production hardware that was very similar to the production blade server, except that it had locally attached storage instead of a SAN LUN. The problem almost went away. It went from erroring out more than half the time to erroring out about once in 15 or 20 runs. That indicated that I was on the right track.
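For reference, the conversion step can be sketched as a small Python wrapper around vmware-vdiskmanager. The -r (convert) and -t (disk type) options and the meaning of type 2 (a single preallocated disk) are written from memory of the tool's usage text, so treat them as assumptions and check them against the help output of your installed version before running anything. The function names and file names are hypothetical.

```python
import subprocess
from typing import List

# Disk type 2 is assumed to mean "preallocated virtual disk" in the
# vmware-vdiskmanager usage text; verify this on your own installation.
PREALLOCATED = 2


def build_convert_command(source_vmdk: str, target_vmdk: str,
                          disk_type: int = PREALLOCATED) -> List[str]:
    """Build the argument list that converts source_vmdk into target_vmdk."""
    return ["vmware-vdiskmanager", "-r", source_vmdk,
            "-t", str(disk_type), target_vmdk]


def convert_disk(source_vmdk: str, target_vmdk: str) -> None:
    """Run the conversion; raises CalledProcessError if the tool fails."""
    subprocess.run(build_convert_command(source_vmdk, target_vmdk), check=True)
```

The conversion writes a new disk rather than changing the old one in place, so the target filesystem needs room for the full preallocated size, and the VM should be shut down first.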
We had a momentary lapse of reason and thought that the VM might run better on a Windows VMware Server host. I moved the 100 GB preallocated disk version of the VM to a Windows XP Pro workstation running VMware Server. The error occurred nearly every time, so we abandoned that ill-conceived path.
Next, I thought that either reiserfs couldn’t cut it, or VMware Server couldn’t cut it. Since I had just received a new workstation from the vendor, I configured it with OpenSUSE 10.2 and formatted the disk in ext3. I also built an ESX Server 3 evaluation server in Engineering, on an IBM x334 pizza box connected to a fibre-channel SAN. I copied the 100 GB VM to both my new workstation and the ESX Server.
On my workstation, the moderately demanding test query ran 50 times in a row without failing before I gave up on it. That suggests ext3 works better than reiserfs as an underlying filesystem for VMware Server, at least when the VM you are hosting is Windows 2003 Server with SQL Server 2005.
On the ESX server, the query also worked every time, which was fully expected.
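The repeated test runs above are easy to automate. Here is a minimal Python sketch of a harness that calls a query function a fixed number of times and counts how many runs fail. The names are made up for illustration, and run_query stands in for whatever actually submits the test query to SQL Server and raises an exception on an error such as 823 or 5242.

```python
from typing import Callable


def count_failures(run_query: Callable[[], None], runs: int = 50) -> int:
    """Call run_query `runs` times and return how many calls raised an exception.

    run_query is a hypothetical callable that executes the test query and
    raises on any SQL Server error.
    """
    failures = 0
    for _ in range(runs):
        try:
            run_query()
        except Exception:
            # Any error counts as one failed run; keep going to get a rate.
            failures += 1
    return failures
```

Running the same harness on each host configuration gives failure counts that can be compared directly, instead of impressions like “about once in 15 or 20 runs”.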
Finally, I decided that even if the ext3 and preallocated disk fix didn’t fix the problem 100% of the time, it was worth applying it to the production system, so that the problems would be reduced during training. It would also buy us time to decide whether or not to buy ESX Server for about $10,000 CDN including one year of support.
I shut down the production Vision database server after hours. I moved the existing non-preallocated VM off the SAN LUN that it used for production. Then I reformatted the reiserfs partition on the SAN LUN to ext3. After the format I was surprised to find that the available space was smaller than it was with reiserfs. I had to resize the SAN LUN up a few gigabytes to allow me to convert the non-preallocated disk to a fully preallocated one back on the SAN LUN. After the resize, I recreated the ext3 partition and used vmware-vdiskmanager to convert the non-preallocated disk to a preallocated one. The VM booted and ran fine after the disk conversion.
On the converted production VM, all errors appear to have ceased and performance may have improved slightly as well. We have decided to proceed to deployment on VMware Server using this configuration.
Take Away Points
- The problem referenced in this article occurs on SQL Server 2000 and SQL Server 2005. We discovered this after the fact while working on something else.
- It is a good idea to run VMware Server on Linux, not on a Windows host.
- It is a good idea to use ext3 instead of reiserfs as the filesystem to store your virtual machines. Other Linux filesystems might be suitable as well, but were not tested.
- Filesystems formatted with ext3 use more space for overhead than reiserfs.
- VMware Server is similar in performance to VMware ESX server for Windows 2003 virtual machines running SQL Server under light to moderate loads.
- In the future I will try very hard to not have to move a 100 GB virtual machine all over my network. It takes a long time to repeatedly move 100 GB worth of files from system to system. (duh!)
- When troubleshooting large virtual machines, it is great to have lots and lots of fast storage nearby on the network. Speculative changes are much less hair-raising if you have lots of room to back up your virtual machines.
- You can do awesome stuff with virtual machines that you just can’t even consider unless you have lab hardware coming out your ears and an army of lab monkeys to help you.
My new Sun Ultra 20 M2 workstation arrived yesterday. It came preinstalled with Solaris, which, as awesome as it is, I am not able to use. I need to be able to run VMware Server for my work, which means Linux, not Solaris (at least right now). I booted Solaris and poked around a bit just to satisfy curiosity, and then popped in my OpenSUSE 10.2 DVD and started the install.
I’m not running yet, so you can guess that there were some challenges. The workstation came with an NVidia Quadro FX 560 dual head video card. James has the same box, and he managed to install SLED 10 without any challenges, so I figured OpenSUSE 10.2 would be no problem. Unfortunately, that wasn’t the case. I started with the 32-bit version. The install went fine, but then at first boot, Xwindows started and showed a blank screen. I thought I maybe needed to use the other port on the Quadro FX 560. That didn’t work.
Then I looked at the supported OS list for the Ultra 20, and it listed SLES 9 SP3 and very explicitly stated 64-bit version only. I decided to try the 64-bit version even though I had had some stability issues with it previously. I repeated the install, with the same result. Then just because James had gotten SLED 10 working I tried the 64-bit version of that. It worked fine, and the video was nice and smooth in Xwindows. I scratched my head for a while, and then came to the conclusion that the proprietary NVidia X driver was at fault. That driver is installed automatically in SLED, but in OpenSUSE it isn’t because it’s proprietary.
I’m retrying the OpenSUSE 10.2 64-bit install right now and at the end I’m going to manually install the proprietary NVidia driver and see if that works.
I started actually trying to talk to SQL Server with Novell Identity Manager today. We want to be able to push login IDs into Deltek Vision’s database from eDirectory, with password synchronization. Step 1 was to get IDM3 talking to a test schema in the database. After some futzing, I got the jTDS JDBC driver installed and connected to the SQL Server 2005 database. One thing that the IDM driver documentation for JDBC fails to mention is that you have to restart Identity Manager after you stick the JDBC driver jar file into the classes directory in order for it to work.
We use Solaris with ZFS as a target for disk-to-disk backup. We mirror two storage servers in two locations the hard way at present, using rsync and many scripts to do replication. With this new OpenSolaris project, we can now set up two Solaris machines and use the StorageTek Availability Suite pieces to maintain real-time mirrors of the filesystems, in a very flexible way. That’s wicked cool. I read about this on Ben Rockwood’s blog.
We’re doing our Deltek Vision deployment right around the time of Brainshare this year. Since I built all the infrastructure for Vision, I get to hang around while the thing gets rolled out. We’re sending James and Stuart instead. Bastards.