Assessing virtual disk performance within a guest virtual machine hosted on Hyper-V can be a challenging exercise, as there are usually a number of items in the stack that could be causing issues.  Generally you will have your storage device, the storage fabric, the physical server hardware, and potentially shared physical resources (such as a blade chassis).  Then there is the configuration of the presented storage within Hyper-V, and other software configuration such as firmware.  Obviously there could be more or fewer layers depending on your setup, but the point is that there is a lot to look at when troubleshooting disk performance within a virtual machine.

I started looking into how to investigate disk performance issues after implementing a very large Hyper-V cluster.  At first everything was fine, but once the number of virtual machines started to ramp up we noticed a number of I/O errors in the event logs of virtual machines such as SQL servers.  It wasn’t isolated to SQL servers, though; we also saw I/O errors in the logs on our SCOM management servers during busy periods, so it clearly wasn’t a SQL issue (even though we still teased the DBAs about it).

This led me to look at the workloads on the servers that were generating the I/O errors.  Nothing was demanding particularly high reads or writes from disk, and certainly nothing near what the physical storage device could provide.  So somewhere in the stack something was causing potential performance issues, but first I needed to validate that it was a performance issue, and not an OS issue or something else.

In order to test the performance of the VHD(X) disks within the guest, I used the SQLIO tool from Microsoft.  This tool is great for determining the I/O capacity of a disk, and provides a detailed output of the collected data.  To get a complete picture of performance, eight tests were run with different configurations, all against the same 20GB DAT file (a sketch of the kind of command line used is shown after the list):

  1. 8K random writes on a dynamic disk
  2. 8K random reads on a dynamic disk
  3. 64K random writes on a dynamic disk
  4. 64K random reads on a dynamic disk
  5. 8K random writes on a fixed disk
  6. 8K random reads on a fixed disk
  7. 64K random writes on a fixed disk
  8. 64K random reads on a fixed disk
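
For reference, the SQLIO runs were driven from the command line inside the guest.  The thread counts, queue depths, durations and drive letter below are illustrative rather than the exact values used, but an 8K random write run and a 64K random read run against the 20GB DAT file would look something like this:

    sqlio -kW -t4 -s120 -o8 -frandom -b8 -LS E:\testfile.dat
    sqlio -kR -t4 -s120 -o8 -frandom -b64 -LS E:\testfile.dat

The first line performs random 8K writes (-kW -b8) and the second random 64K reads (-kR -b64); -t sets the number of threads, -o the outstanding I/Os per thread, -s the duration in seconds, and -LS captures latency statistics.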

The output from each test is displayed below:

  • SQLIO_8K_Random_Dynamic_Write
  • SQLIO_8K_Random_Dynamic_Read
  • SQLIO_64K_Random_Dynamic_Write
  • SQLIO_64K_Random_Dynamic_Read
  • SQLIO_8K_Random_Static_Write
  • SQLIO_8K_Random_Static_Read
  • SQLIO_64K_Random_Static_Write
  • SQLIO_64K_Random_Static_Read

The first thing you may notice about the output above is that the IOs/sec and MBs/sec are significantly better on the dynamic disks than on the static disks.  The difference isn’t minor either, it is huge: random 64K reads on a dynamic disk returned almost 37,000 IOs/sec and almost 2,300 MBs/sec (the two figures are at least internally consistent, since 37,000 x 64KB is roughly 2,300 MB/s), whereas the same test on a static disk returned 1,865 IOs/sec and 116 MBs/sec.  Looking at the histogram data across all the outputs, it also looked far more consistent on the dynamic disks than on the fixed disks.  This all seemed wrong, though: it is always stated that fixed disks provide the best performance, Microsoft recommends only fixed disks for critical workloads, and some of the numbers returned by the test were beyond the capability of the underlying, contended storage.  So why was I getting these test results?

In order to see whether these results meant anything with a real workload, I built a test server that mirrored a production server running a SQL workload, but used nothing but dynamic disks for the SQL data.  When I started the workload everything functioned as normal at first, and then the same I/O errors started appearing in the event log that we had seen on the production system.  This showed that dynamic disks were clearly no quicker than static disks in practice, but it still didn’t explain the results of the SQLIO tests, and the performance issues were becoming even more confusing.

This led me to try to understand the architecture of the different VHD(X) types, to see if that could help me make sense of the test results compared to real-world use.  After searching Microsoft resources online, I really couldn’t find much detail on how virtual hard disks work, so I contacted Microsoft to see if they could provide any answers.  After many weeks of working with Microsoft we still had no answers, and even they were stumped by what was being seen in terms of performance, until we eventually got an explanation from someone who knew how the disks work in far more detail.

The reason for the odd results from the SQLIO tool was the type of file being used and the way dynamic disks optimise writes.  Because I was using a blank 20GB DAT file to run the tests, the dynamic disk was able to optimise the writes: where a write’s payload is entirely zero, it only updates the metadata to mark the block as being in the zero state, rather than actually allocating the block.  As the DAT file didn’t contain ‘real’ data, the bytes in the majority of write payloads were all zero, so no actual block allocation was occurring, which is what produced the huge IOs/sec and MBs/sec numbers against the dynamic disks.  Because dynamic disks don’t allocate all of a disk’s blocks up front, they avoid this work wherever they can, and so many of the reported IOs never, in effect, touch the physical disk.  The simplified sketch below illustrates the idea.
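
To make that concrete, here is a minimal, deliberately simplified sketch in Python.  This is not the real VHDX on-disk format; the 2MB block size, class and names are purely illustrative.  The point is simply that an all-zero write can be satisfied by flipping the block’s state in the allocation table, while a write containing real data forces an allocation.

    # Simplified sketch of a dynamic disk's block allocation table (BAT).
    # NOT the real VHDX format; block size and structure are illustrative.

    BLOCK_SIZE = 2 * 1024 * 1024  # illustrative block size

    class DynamicDiskSketch:
        def __init__(self):
            self.bat = {}              # block index -> "zero" or "data"
            self.allocated_bytes = 0   # space actually consumed in the VHD(X) file

        def write_block(self, block_index, payload):
            if not any(payload):
                # Every byte in the payload is zero: the write is satisfied by a
                # metadata update marking the block as being in the zero state.
                # No data block is allocated, so the reported IO never really
                # touches the physical disk.
                self.bat[block_index] = "zero"
            else:
                # The payload contains real data, so a block has to be
                # allocated and written out.
                self.bat[block_index] = "data"
                self.allocated_bytes += BLOCK_SIZE

    disk = DynamicDiskSketch()
    disk.write_block(0, bytes(BLOCK_SIZE))       # all-zero payload -> metadata only
    disk.write_block(1, b"\x01" * BLOCK_SIZE)    # real payload -> block allocated
    print(disk.bat)                              # {0: 'zero', 1: 'data'}
    print(disk.allocated_bytes)                  # 2097152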

On the flip side, no such optimisation can be made with static disks, as the blocks are allocated up front, so the metadata cannot simply be updated.  To prove this out, I performed a simple copy/paste test against a new, non-expanded dynamic disk and a new fixed disk.  The first test involved copying the 20GB DAT file used previously, and, as expected, it wrote to the dynamic disk considerably quicker than to the fixed disk, because it wasn’t actually writing anything.  For the second test I used a 5GB video file, as this would contain no all-zero payloads, and this confirmed the dynamic disk behaviour explained above: writing to the static disk was now quicker than writing to the dynamic disk!
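
If you want a benchmark file that a dynamic disk cannot short-circuit in this way, one option is to fill the test file with random bytes before running the tests, so that no write payload is all zeros.  A minimal sketch (the path and sizes are hypothetical) might look like this:

    # Create a test file full of random bytes so a dynamic disk cannot
    # satisfy the writes with metadata-only updates.
    import os

    CHUNK_SIZE = 4 * 1024 * 1024           # write the file in 4MB chunks
    FILE_SIZE = 5 * 1024 * 1024 * 1024     # 5GB, matching the video file above

    with open(r"E:\random_testfile.dat", "wb") as test_file:
        remaining = FILE_SIZE
        while remaining > 0:
            chunk = os.urandom(min(CHUNK_SIZE, remaining))
            test_file.write(chunk)
            remaining -= len(chunk)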

It was great to finally understand the test results from my performance investigations.  Although this didn’t give me an answer as to what was causing the disk performance issues on my virtual machines, it did show that it wasn’t an issue with Hyper-V and that the problem lay elsewhere.  This forced investigations into other areas of the physical stack, where it was discovered that the firmware for the network adapters on the blades hosting Hyper-V was out of date; once it was updated, all disk performance issues in the Hyper-V guests were resolved.

Although the fix in the end was simply a firmware update, I’m very pleased that I went through the investigative steps: not only was I able to prove to other areas of IT that it was not Hyper-V causing the performance issues, I also got a much better understanding of VHD(X) disks… which, hopefully, I have shared clearly 🙂

Finally, more information from Microsoft on performance tuning can be found at https://msdn.microsoft.com/en-us/library/windows/hardware/dn529133.aspx.

David