I saw this post by my respected colleague Duncan Epping at Yellow-Bricks, and it prompted a comment. As soon as I was three paragraphs in I figured “this really should be a blog post”.
Within the next 24hrs, I got pinged from three continents by trusted sources on the same question – IO scaling and VDI. Over the last week I’ve been deeply engaged with 3 massive VDI deployment projects that are struggling with this point.
I’ve got an interesting (and I think insanely fortunate!) perspective/visibility into what’s going on in VDI. I’m not claiming that it’s better/worse than anyone else – but I get to see VDI projects around the globe, in all sorts of verticals, and at all scales (100’s of desktops up to 10’s of thousands and even hundreds of thousands) as my team of people partner with customers.
Duncan/Richard G asked a good question – why isn’t there more View-oriented blogging? IMHO, View blogging is rare because VDI architectures in general are complex (not that any individual element is complex, but rather the end-to-end picture and client lifecycle of patching and A/V can be complex) and very, very variable from customer to customer (workload type, client type, connectivity type, connection management type, virtualization layer config, storage type/config all vary wildly). This means it takes a bit more time to get your head wrapped around it all – but also suggests that a ton of value could come from more open dialog via blogs.
So… Let’s do some View posts 🙂 Read on….
The first whitepaper Duncan pointed out (here) nailed a core issue that I see (one of many in a VDI project design) – the IOps challenge.
This is the third most common cause I see stopping VDI projects from getting started. The most common, traditionally, is “client experience” – which is a function of network, remote graphics protocol, and use cases. The second is “TCO” – which is rooted in the fact that people are used to capex-oriented TCO models with VMware (on server virtualization), while client virtualization is capex-neutral at best and only shines when the elements of opex and information security/availability are factored in.
That said, scaling IO effectively is the number one cause I see derailing VDI projects as they scale up (into the “thousands of users”) – not in PoC, but in production. (Admittedly, I’m sure I see a “storage centric” world-view.)
Don’t understand? Don’t feel bad, people don’t naturally think about it… So think of this:
Q: How many hard drives do 1000 desktops have? A: 1000.
Q: When you virtualize those 1000 desktops… will you use a shared storage (FC/iSCSI/FCoE SAN or NAS) config with 1000 drives? A: of course not.
In a nutshell, that’s the core question.
There are loads of things that help, and of course that trivializes the question, but that paints the problem in a way people seem to understand. Why does that trivialize it? Well – the math isn’t that simple.
Things that help with this IOps scaling problem:
- duty cycle of desktop (not everyone does the same thing at the same time) – though it’s VERY important to look at things you do that DO affect every desktop at the same time, and consider changing them (patch/AV are the big ones).
- The disks in laptops/desktops are generally 2-3x slower on random IO than those used in enterprise arrays.
- cache architecture of the shared storage array buffers burst writes, and does read caching (every array does this differently, but they all do it to varying extents) – but remember that in the end, you need to commit the write volume fast enough, otherwise write cache is guaranteed to overflow, no matter what.
- the VMware and storage layers do some minimal coalescing of IO
- you can mitigate “simultaneous boot” effect through a variety of means at the connection manager – eg. VMware View can stage client boot/logon behavior
The author of the first whitepaper also was bang on – initially we expected 90/10 read/write ratios at customers, but in practice we’re seeing more along the lines of 50/50 read/write. We’re also seeing IO per client that ranges more towards the high end of the band than the low end. Ergo, if I had a penny for every customer who said “design it for 2-5 IOps per user”, and then complained when it turned out to be 8-20 IOps per user… Well, I’d be a rich man.
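To make that sizing gap concrete, here is a minimal Python sketch of the arithmetic. The function name and the 1000-desktop population are illustrative assumptions; the per-user IOps bands and read/write splits are the ones quoted above. It only looks at host-side IOps and ignores caching and any backend effects.

```python
def frontend_iops(users, iops_per_user, read_fraction):
    """Return (total, reads, writes) host-side IOps for a desktop population."""
    total = users * iops_per_user
    reads = total * read_fraction
    writes = total - reads
    return total, reads, writes


USERS = 1000  # assumed population, for illustration only

# What customers often ask for: top of the 2-5 IOps band at 90/10 read/write.
planned = frontend_iops(USERS, 5, 0.90)
# What tends to show up in practice: top of the 8-20 IOps band at 50/50.
observed = frontend_iops(USERS, 20, 0.50)

print("planned  (total, reads, writes):", planned)
print("observed (total, reads, writes):", observed)
print("write IOps growth factor:", observed[2] / planned[2])
```

In this worst-case comparison the total load is 4x the plan, but the write load is roughly 20x the plan, and writes are the part the array eventually has to commit to the backend.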
Are you using VDI in production – if so, I’d love to see your comments on what you’re seeing in practice for your clients re: client IO profile…. please comment!
Array caching models help and are part of the picture. It’s important to understand where and how.
Read cache, once loaded and in steady state, is always full. Bigger is generally better, and it can help with cacheable reads. Read cache performance declines gradually as the cache fall-through time (the time data stays in cache) shrinks under memory pressure.
Conversely, write cache starts empty, gets filled as I/O comes in to the array, and is drained (destaged) by the backend spindles. This helps by absorbing bursts, and allowing the arrays to try to coalesce the write I/Os. They also differ in the effect when they fill. When write cache fills, it has a big and instantaneous performance drop – as all of a sudden the host is directly exposed to the latency and performance envelope of the backend spindles.
Remember that write cache protects you against bursts – but every storage array needs to be able to “sink” (drain the write cache) at a rate that is greater than the sustained write IO workload. Otherwise cache destage (i.e. write commit on the backend spindles, as if the array had no cache) becomes the gating performance envelope.
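A rough way to picture the write-cache point is a simple fill/drain model. The sketch below is a deliberately simplified assumption (constant rates, an initially empty cache, no coalescing), and the cache size and MB/s figures are made up for illustration, not measurements from any array.

```python
def seconds_until_write_cache_full(cache_mb, incoming_mb_s, destage_mb_s):
    """Time for an initially empty write cache to fill, or None if it never does.

    Simplified model: constant incoming write rate, constant destage rate.
    """
    net_fill_rate = incoming_mb_s - destage_mb_s
    if net_fill_rate <= 0:
        # The backend drains at least as fast as writes arrive.
        return None
    return cache_mb / net_fill_rate


# Hypothetical numbers: 4 GB of write cache, backend can destage 400 MB/s.
print(seconds_until_write_cache_full(4096, 300, 400))  # sustained load under the sink rate
print(seconds_until_write_cache_full(4096, 500, 400))  # sustained load over the sink rate
```

In the second case the cache buys well under a minute before the host sees backend latency directly, which is why a short PoC burst can look fine while a sustained production workload does not.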
I hate customer PoCs (by EMC or anyone) that are structured to avoid this point – because they prove little to nothing IMO. The challenge is that generating big, and realistic client workloads in a customer PoC (and in our reference architecture work) is exceptionally difficult.
There are several very important things you can do to mitigate extra guest IO:
- guest OS alignment is important, and reduces extra backend IO caused by stripe crossings (which induce extra IOs).
- avoid vswap at all costs in this use case. People think that memory density will be their bottleneck in the economic model. It can be, but just as often it’s the storage. In general, configure guest mem = reserved mem.
- avoid guest-level swapping if at all possible (don’t undersize the amount of configured VM memory) – follow the guest OS guidelines.
- any time you are able to move the user data (“My Documents”) out of the guest and into a NAS device, it’s a huge win (in many dimensions – capacity efficiency, smaller VMs). This is not an option for some use cases (for example, it doesn’t work easily for “check in/out” use cases).
- Disable automated AV updates
- Disable boot optimization
- Disable system restore
In my experience (and my team’s experience) working on many VDI projects (with all sorts of configs), at larger scales the problem of economic IO scaling becomes very hard. I’m seeing some customers deploy Atlantis (and similar approaches) to increase scaling and decrease IO density (through distributed caching on the ESX hosts), though these often break the encapsulation model (everything is a trade-off).
Other things to think of…
VDI can generate BIG workloads fast.
Let’s say once again that your peak workload is 12 IOps per client, and you have 15,000 desktops you want to virtualize. That’s a total of 180,000 IOps, which is a very, very large workload for common storage configurations. It would hammer a large CX4, for example. You would need to carefully scale out all the aspects of the design, and consider it just like you would consider the system design for a MASSIVE database. Can it be done? Of course – but there’s a reason why the “what’s the single ESX host maximum IOPs” test at the vSphere 4 launch (365,000 IOPs) was backended by 3 CX4-960s with 30 solid-state disks. That’s a whackload of IO.
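The arithmetic is trivial, but it is worth writing down because the inputs are the things you should be measuring. A minimal sketch using only the numbers from the paragraph above (the variable names are mine):

```python
PEAK_IOPS_PER_CLIENT = 12
DESKTOPS = 15_000
VSPHERE4_LAUNCH_TEST_IOPS = 365_000  # single-host maximum test cited above

total_iops = PEAK_IOPS_PER_CLIENT * DESKTOPS
share_of_launch_test = total_iops / VSPHERE4_LAUNCH_TEST_IOPS

print("aggregate peak IOps:", total_iops)
print(f"share of the launch-test figure: {share_of_launch_test:.0%}")
```

Put another way, this one VDI deployment at peak is about half of what it took three CX4-960s and 30 solid-state disks to deliver in that test.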
Protocol factors to consider….
If you’re using Fibre-Channel – every FC target port on an array has port limits (measured in I/Os) – usually around the 3000 IOps mark. You need to make sure that you have enough front-end port connectivity. Does this sound weird? It’s not. Let’s say you have a massive V-Max able to chew up and spit out IOps out the ying-yang, but the vSphere cluster is using 4 FA ports (front end array interfaces). Well, you have a total “ingest” of around 12000 IOps. If you have users generating 12 IOps, that means once you’re at 1000 clients, you’re going to saturate the front-end ports from an IOps standpoint. You’d better make sure you are using lots of front-end ports, and are load-balancing across all of them.
NFS has some advantages here, but also some disadvantages that need consideration. Whereas steady-state IO is mostly IOps gated, the periods of peak IO (patching/AV) are bandwidth (MBps) gated workloads. You need to remember the “one active vmkernel interface per NFS datastore” rule of the NFSv3 support in vSphere 4. You can of course use 10GbE with NFS datastores today (which we support). Remember though, the NFS servers usually have backend interfaces (which also have IOps limits) which are used to connect the NFS server with the disks that support the filesystems. As an FYI – an updated Celerra vSphere NFS best practices doc is coming soon. While the Celerra does fine with 4K IOs, it has an internal 8K allocation size on the UxFS filesystem it uses, so if the NTFS guest volume uses an 8K allocation size, performance is even better.
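The front-end port math from the Fibre-Channel example can be sketched like this. The function names are hypothetical, and the 3000 IOps per port figure is the rough rule of thumb quoted above, not a spec for any particular array; check the real limit for yours.

```python
def max_clients(ports, iops_per_port, iops_per_client):
    """Clients supported before the front-end ports saturate on IOps."""
    return (ports * iops_per_port) // iops_per_client


def ports_needed(clients, iops_per_client, iops_per_port):
    """Minimum front-end ports for a client count (integer ceiling division)."""
    total_iops = clients * iops_per_client
    return (total_iops + iops_per_port - 1) // iops_per_port


IOPS_PER_PORT = 3000   # rule-of-thumb figure from the text
IOPS_PER_CLIENT = 12   # peak per-client figure from the text

print(max_clients(4, IOPS_PER_PORT, IOPS_PER_CLIENT))        # the 4 FA port example
print(ports_needed(15_000, IOPS_PER_CLIENT, IOPS_PER_PORT))  # the 15,000 desktop example
```

This assumes perfectly even load-balancing across ports, which you rarely get in practice, so treat the port count as a floor rather than a target.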
So… What are we doing about it?
Clearly caches can help a lot with the read portion of the workload, and write caches can absorb spikes and help a little on the backend (decoupling the host from the backend disk IO). Remember to assume that cache is non-existent (or negligible) for your write workloads (you must avoid the write-cache full effect). And larger cache is generally better (though not a panacea).
On the EMC side, we think that EFD (enterprise-class solid state) is an important part of the answer. The author of the whitepaper is right in terms of the consumer SSD life span, but there are enterprise SSDs which can sustain the same duty cycle and lifetime as any enterprise magnetic media. Over time, this will apply even to consumer SSDs.
BUT – they are not currently a simple answer – as with View 4, currently you can’t put a base replica on one datastore, and the linked clones on another. This means you would need to use EFDs for the entire datastore, which makes them less viable (though they can help) with the current $/GB of EFDs. It also means that if you’re not using View Composer, you can leverage them, but only if the entire datastore is on EFD.
We’ve used all the methods above in the current View 4 VMware/Cisco/EMC reference architecture to get to a $750/client end-to-end cost (everything – including Microsoft VECD licensing, but it doesn’t include the client hardware).
For those interested, of the $750/client end-to-end cost:
- it assumed 2000 users on the total config
- 4GB dimms were superior on an economic breakpoint when compared with 8GB dimms on the UCS blades.
- The breakdown was:
  - 26% on the storage
  - 49% on the servers/network
  - 21% on the VMware software (vSphere 4 Enterprise Plus, the full View package)
  - 3% on VECD and incidentals
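For anyone who wants the breakdown in dollars, here is a small sketch that converts the percentages above into per-client and whole-config figures. The category labels are shortened by me; the $750, the 2000 users, and the percentages are from the list above. Note that the percentages as quoted add up to 99%, presumably due to rounding, so the dollar figures will not sum to exactly $750.

```python
COST_PER_CLIENT = 750  # USD, end-to-end, excluding client hardware
USERS = 2000

BREAKDOWN_PCT = {
    "storage": 26,
    "servers/network": 49,
    "VMware software": 21,
    "VECD and incidentals": 3,
}

per_client = {name: COST_PER_CLIENT * pct / 100 for name, pct in BREAKDOWN_PCT.items()}

for name, dollars in per_client.items():
    print(f"{name:22s} ${dollars:7.2f}/client  ${dollars * USERS:12,.0f} total")

print("percentages sum to:", sum(BREAKDOWN_PCT.values()))
print("whole config:", COST_PER_CLIENT * USERS)
```

On these numbers the storage works out to a little under $200 per client, which gives a sense of how much room there is to spend on spindles, cache, or EFDs before storage dominates the cost.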
The initial doc was worked on prior to the View 4 GA (and people who are close to View and VMware know that right before the GA, it stretched out a couple of days) – which meant it needed to use pre-GA versions of View 4 on pre-release versions of vSphere 4 Update 1. Between now and the end of December we’re working on an update based on all the GA elements, which will include many more detailed findings.
To try to make the economics even better, we’re working on a couple things. This was an extensive topic of discussion with the View team when I was at VMware two weeks ago for our EMC/VMware QBR and QTR.
- vStorage APIs for Array Integration (VAAI) – “hardware accelerated locking” – this will help with VM density per VMFS datastore (to match the “VMs per datastore” model of NFS for customers using block models), as well as the VAAI fast/full copy hardware-offloaded copy/move. On the NAS side of things, over time we’re working on pNFS support in vSphere and in our GA NFS platforms for scale-out NAS.
- also vStorage APIs for write same/write zero (we demoed this at VMworld) will reduce some of the I/O from the host to the array by eliminating some duplicate I/Os and zeros.
- FAST v2 (block-level autotiering) will help in the sense that a datastore (on NFS or VMFS) will be able to be “blended”, with EFDs supporting “hot” blocks/portions of files, and large, slow SATA/SAS being used for ones that are “cold” – this results in the best of the $/IOps of EFDs and the $/GB of large magnetic media.
- Future versions of View and View Composer have some things that are explicitly targeted to help with the various things I noted (decoupling base replicas and linked clones, creative stuff about vswap/guest swap handling)
- Working on tools to more easily capture a given client workload (think of a “Capacity Planner for desktops”)
- Working on tools to more accurately model client workloads at large scale