My Home Lab vSAN Capacity Disk Upgrade

Well, I have upgraded the capacity of the vsanDatastore in my home lab.  Before the upgrade, my home lab (powered by three HP DL360 G7 hosts running ESXi 6.0 Update 3) contained one disk group per host, each consisting of a 120GB Kingston HyperX 3K SSD (link) for cache as well as three HP-branded 300GB 10K SAS drives for capacity.  The idea was to replace the 300GB 10K SAS disks with newly purchased 600GB 10K SAS disks, thereby doubling my capacity.

According to the VMware HCL, ESXi 6.0 is the last version supported on the G7 series, so I wanted to remain there for the bulk of my testing.  This also allows me to use the HP-customized ISO instead of crossing my fingers and using the vanilla VMware ISO.

The capacity disk upgrade price was within the proper WAF (Wife Acceptance Factor) range thanks to the deal I found online.  I managed to slurp up ten of the 600GB SAS drives for a nice price of $30 each (I decided to include a spare).  I also wanted to get the most money possible when I sold my 300GB drives, so I bought caddies compatible with the G7 for ~$5 each.  That way, I could sell the 300GB drives with the HP caddies.  I ended up selling the 300GB drives for $30 each, as well!  Not bad for an upgrade!

Now you may notice that my cache-to-capacity ratio before the upgrade was about 13.5% (well within the VMware guidelines for cache-to-capacity ratio calculations).  Due to budget constraints, and the amount of load I place on my lab, I decided to stick with my existing cache disks, understanding that the new ratio deviates a bit from VMware's recommendations.  Here is a great link discussing recommended cache-to-capacity disk ratio design.

Designing vSAN Disk groups – All Flash Cache Ratio Update
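The before-and-after numbers work out like this (a quick sketch using the nominal drive sizes from above; formatted capacities run slightly lower, which is where the ~13.5% figure in the post comes from):

```python
# Nominal sizes from the post; real usable capacity is a bit lower.
CACHE_GB = 120            # Kingston HyperX 3K SSD per disk group
DISKS_PER_GROUP = 3       # capacity disks per host / disk group

def cache_ratio(capacity_disk_gb: int) -> float:
    """Cache-to-capacity ratio for one disk group, as a percentage."""
    return 100 * CACHE_GB / (DISKS_PER_GROUP * capacity_disk_gb)

print(f"Before (300GB disks): {cache_ratio(300):.1f}%")  # ~13.3%
print(f"After  (600GB disks): {cache_ratio(600):.1f}%")  # ~6.7%
```

Doubling the capacity disks while keeping the same cache SSD cuts the ratio in half, which is exactly the deviation from the guideline I accepted here.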

When you are working in your lab, there are times you need to deviate, because the aforementioned WAF metric may fall below the recommended range.  There are other areas where my lab does not follow common VMware recommendations, such as:

  • My RAID controller (a P410i controller) is not on the vSAN HCL
  • My cache disks are not on the vSAN HCL
  • My capacity disks are not on the vSAN HCL
  • My vSAN cluster is only 3 nodes
  • My disks are using RAID-0 instead of HBA-mode (my controllers do not support HBA-mode)
  • My vSAN VMkernel adapters are bound to 1Gbps NICs instead of 10Gbps

But, this is not production (other than for my family), so I decided to proceed anyway.

I had a conversation online with Jase McCarty (@jasemccarty on Twitter), who works in Technical Marketing for the Storage and Availability Business Unit (SABU) at VMware, and he asked me to give him a call in between working on his Jeep.  It’s not often that folks at that level in any organization are that willing to help, especially with a lab environment!

I had a few questions on how he would recommend I proceed, since I am running a 3-node cluster.  You see, VMware doesn’t recommend running a 3-node cluster because it limits some of your flexibility in maintenance operations.  I wanted to place each host in maintenance mode, choose full data migration to create a second copy of the data on the surviving hosts, blow away the disk group on the host in maintenance mode, replace the 300GB drives with the 600GB drives, create a new disk group, and repeat.  It didn’t work that way: with only three hosts and FTT=1, there is no spare host to receive the evacuated copies, so a full data migration cannot complete.
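The arithmetic behind that 3-node limitation can be sketched out: a vSAN RAID-1 mirrored object with FailuresToTolerate=n needs 2n+1 fault domains (n+1 data replicas plus n witness components), each on a distinct host. This is a simplified model that ignores free capacity and per-disk placement:

```python
def can_fully_evacuate(total_hosts: int, ftt: int = 1) -> bool:
    """Can one host enter maintenance mode with 'full data migration'?

    A RAID-1 object with FailuresToTolerate=n needs its 2n+1
    components (n+1 replicas + n witnesses) on distinct hosts, so the
    hosts left behind must number at least 2n+1.
    Simplified: capacity and disk placement are ignored.
    """
    hosts_remaining = total_hosts - 1
    return hosts_remaining >= 2 * ftt + 1

# A 3-node cluster leaves only 2 hosts behind -- full migration fails:
print(can_fully_evacuate(3))  # False
print(can_fully_evacuate(4))  # True
```

This is why a fourth node (or accepting “ensure accessibility” and a window of reduced redundancy) is the usual answer for maintenance in small clusters.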

I asked Jase what he would recommend.  He said the best way to go about it would be to identify each of the drives, remove one drive at a time from the disk group on a host, eject the drive from the host, replace it with a 600GB drive, and add it back to the disk group.  That would work well… if I were running in HBA-mode.  For those of you who don’t know, running vSAN in HBA-mode allows you to hot-plug drives.  When vSAN is configured, you do not create a RAID array on your controller to aggregate the capacity disks in a vSAN disk group; vSphere wants to work with these disks as separate entities.  If your controller cannot run in HBA-mode, there is another option: you can configure each disk as its own RAID-0 array.  But what happens when you pull a disk out of a RAID-0 array?  The array fails.  That is exactly what happens when you hot-eject a disk.  This was going to be a lot of work for me due to my RAID-0 disk configuration.

I was pretty sure I knew what I needed to do.  It meant I would need to put a host into “ensure accessibility” maintenance mode, pull a disk out of the vSAN disk group, restart the host, enter the array configuration utility, delete the failed RAID-0 array, create a new RAID-0 array on the new 600GB drive, boot the host back up, and add the new disk to the vSAN disk group… three times per host, for three hosts.

That was a bit too much for the WAF metric.  I decided to delete all my virtual desktops and svMotion my VMs to a FreeNAS iSCSI system that I have running in my lab.  Then, I deleted all my vSAN disk groups, rebooted one host at a time, inserted all my new capacity drives, created three new RAID-0 arrays per host, and created my disk groups again.

Here is my disk group setup after the upgrade:

The moral of the story: when you are designing a vSAN deployment, you do not always need to follow VMware’s recommendations to a tee.  Just know that if you don’t, it might make things much more difficult than if you did.



  1. Peter Hendrickx

    Hey Matt,
    I’m also on the same road with a vSAN cluster on HP DL380 G7s with the P410i storage controller.
    I have an issue where the vSAN datastore is really slow and doesn’t allow any operation on it.
    Did you use specific firmware versions for the controller? (I have them all upgraded with the latest PSP from HP.)
    I used the ESXi 6.0 build 3620759 customized for HP.

  2. Matt Heldstab

    Hello, Peter. Sorry for taking a while to get back to you. Has your vSAN datastore ever performed up to par, or is this how it has always been?

    There is a newer feature called vSAN performance diagnostics. Within the vSphere Web Client, under the vSAN cluster object, go to Monitor and then vSAN. From there, Performance Diagnostics will show you whether a configuration issue is preventing you from achieving the maximum throughput or IOPS, or the lowest latency, for your cluster.

    You need to enable the Customer Experience Improvement Program to enable this feature.

    Otherwise, you can always use the HCIBench Fling to benchmark your cluster (assuming you can get VMs installed at all).


