
The input arguments had entities that did not belong to the same datacenter.

Symptoms

While trying to vMotion a VM (in my case a cross-vCenter vMotion from a 6.5 system to a 6.7 system), the error “The input arguments had entities that did not belong to the same datacenter.” occurred. Other VMs were migrating fine.


Cause

The CD Drive in the source VM was mapped to (although not connected to) a Content Library ISO file.

Solution

Point the source VM CD drive at “Client Device” and retry the vMotion.


RTO With Cohesity @ vRetreat

How Cohesity’s Approach to VM Backup Affects the Recovery Time Objective

This week I attended another vRetreat online, this time featuring data management vendor Cohesity, whom I saw presenting at the (in-person) event last year. These are great events, and the small panel of delegates works well in the virtual format.

One thing that stood out to me in their presentation was the focus on the Recovery Time Objective (RTO)- in essence how long it takes to recover from an incident. In this post I will briefly discuss how I understand the definition of RTO before looking at how the Cohesity products work to keep this time down when working with Virtual Machines.

Recovery Time Objective

There’s plenty of material out on the interwebs which will explain RTO in great detail, but I’m taking the definition to be:

the expected length of time between an incident occurring and users being able to work normally again

As this diagram shows, the time can be split into a number of notable sections; I’ve chosen the following three:

RTO

  1. Discovering the Incident. How long is it before we notice something is broken? Do we have to wait for a user to contact the service desk, or do we have responsive monitoring and alerting in place?
  2. Starting the Restore. How long does it take to actually start the restore operation? Is there a clear process to be followed? There might be internal decisions to be made as to whether to kick off a backup restore or attempt an in-place repair. Does somebody need to physically power on some equipment or find and load some tapes before a backup restore can commence?
  3. The Restore Operation. How long does it take between “Go” being pushed on the restore console and the service being usable again?

You’ll notice there’s also a fourth section on the diagram- the “Tidy Up”. This is all those processes that need to happen after the user is working again to get the system back into a normal state. This might include things like tidying up the original (broken) copies of the VM, returning a backup tape to the library, or investigating the root cause. In any of these cases, I’ve put this step outside the RTO because, by the definition above, the users are already working normally again.
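The three phases above (with the tidy-up deliberately excluded) can be sketched as a simple sum. The durations below are made-up illustrative figures, not measurements from any product:

```python
# Sketch: RTO as the sum of the three phases above. The tidy-up phase is
# deliberately excluded because users are already working by then.
# All durations are illustrative, in minutes.

def rto_minutes(discovery: float, start_restore: float, restore: float) -> float:
    """Total RTO is everything between the incident and users working again."""
    return discovery + start_restore + restore

# Example: 30 min to notice, 45 min of decisions and tape-loading, 60 min restore.
total = rto_minutes(discovery=30, start_restore=45, restore=60)
print(total)  # 135 minutes; tidy-up time does not count towards this figure
```

The point of writing it out is that any of the three terms can dominate- shaving an hour off the restore itself achieves little if nobody notices the incident for a day.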

Ransomware Detection

Recovery from ransomware attacks seems to be the current favoured feature pushed by backup vendors, and Cohesity are no exception. Their take here is that because the Cohesity Data Platform handles all the backups, it sees all the data, and this position in the data flow gives the rest of the Cohesity stack an opportunity to spot both when an unusual number of files have been changed and when files suddenly can’t be indexed because they’ve been encrypted.

Tied with an alerting mechanism, this helps address our question in point 1 above- “Can we discover the incident quickly?”. The sooner someone in IT is aware that a ransomware infection has happened, the quicker a response can be started.

Additionally, regular point-in-time snapshot backups make it easier to spot when the infection started (or, if not the point of infection, at least when the malware started acting), and the more granular the snapshots, the less data is potentially lost between a backup and the incident. But we’re straying into RPO, not RTO, there.
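As a hedged illustration of that granularity point (all timestamps below are invented, not from any Cohesity system), finding the last clean snapshot before a known infection time is a simple search over the snapshot timeline, and the snapshot interval bounds the data at risk:

```python
# Sketch: given sorted snapshot timestamps and an estimated infection time,
# find the most recent snapshot taken strictly before the infection.
from bisect import bisect_left
from datetime import datetime, timedelta

def last_clean_snapshot(snapshot_times, infection_time):
    """Return the latest snapshot before the infection, or None if there isn't one."""
    i = bisect_left(snapshot_times, infection_time)
    return snapshot_times[i - 1] if i > 0 else None

# Hourly snapshots through the day; say the malware started acting at 14:37.
base = datetime(2020, 7, 9)
snaps = [base + timedelta(hours=h) for h in range(24)]
infected = base + timedelta(hours=14, minutes=37)
print(last_clean_snapshot(snaps, infected))  # the 14:00 snapshot- at most 37 minutes of changes sit between it and the incident
```

With hourly snapshots the worst case is a full hour of changes lost; halve the interval and you halve that window- which is exactly the RPO trade-off mentioned above.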

Starting Restore

Most of the time when responding to a major incident and orchestrating a restore operation the user interface will be key to assessing the situation and bringing services back online. Cohesity offers a clean and tidy web-based UI, complete with the now-obligatory Dark Mode.


Whilst the platform isn’t going to make those go/no-go decisions on kicking off a restore, it can influence them. Because the restores are so quick (as we’ll see shortly), the discussion on whether to repair or restore might favour the latter. It’s also possible to bring up the VMs in a network-disconnected state without touching the production systems, so that once any discussions are complete the restore is even quicker (or, if the repair option is chosen, the restore can simply be cancelled).

Restoring User Service

Once recovery is started in Cohesity Data Protect, an NFS datastore is created on the Data Platform- the VMDK is already there, so there is no need to spend time moving blocks across the network. The NFS datastore is mounted within vCenter and the VM registered; at this point the VM can be powered on and the users can get working again.

Once service has been restored, the longer process of putting the VM files back where they belong is achieved with the hypervisor’s own Storage vMotion technology (the fourth step above). Applications remain available throughout, and once the Cohesity datastore has been emptied, it is unmounted from vCenter.
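A back-of-the-envelope sketch of why this ordering matters. The figures and function names below are invented for illustration and are not Cohesity numbers:

```python
# Sketch comparing user-visible downtime (illustrative figures, in minutes):
# a traditional restore copies all the blocks back to production storage before
# powering the VM on, while an instant recovery mounts the backup as an NFS
# datastore, powers the VM on, and moves the blocks later with Storage vMotion
# while users are already working.

def traditional_restore_downtime(copy_minutes: float, boot_minutes: float) -> float:
    return copy_minutes + boot_minutes  # users wait for the full data copy

def instant_recovery_downtime(mount_minutes: float, boot_minutes: float) -> float:
    return mount_minutes + boot_minutes  # the data copy happens afterwards

print(traditional_restore_downtime(copy_minutes=90, boot_minutes=5))  # 95
print(instant_recovery_downtime(mount_minutes=2, boot_minutes=5))     # 7
```

The copy time scales with VM size and network speed, but the mount time doesn’t- which is why this approach shortens the “Restore Operation” slice of the RTO so dramatically for large VMs.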

As this slide extract from the Cohesity presentation shows, one of their big selling points is this quick recovery process. Notice how the “Recover data to target storage device” is positioned after the User access is restored.


Thanks to Patrick Redknap and the Cohesity team for hosting this informative event, and I look forward to the next one. For more information about Cohesity, check out their website: https://www.cohesity.com/

Please read my standard Declaration/Disclaimer and, before rushing out to buy anything, bear in mind that this article is based on a sales discussion at a sponsored event rather than a POC or production installation. I wasn’t paid to write this article or offered any payment, although Cohesity did sponsor a prize draw for delegates at the event.

Datrium @ vRetreat May 2020

Last week I received an invitation to the latest in the vRetreat series of events. These events bring together IT vendors and a selected group of tech bloggers- usually in venues like football clubs and racetracks, but in the current circumstances we were forced online. The second of the two briefings at the May 2020 event came from Datrium.

To paraphrase their own words, Datrium was founded to take the complicated world of Disaster Recovery and make it simpler and more reliable- they call this DR-as-a-Service. The focus of this vRetreat presentation was their ability to protect an on-premises VMware virtual environment using a VMware Cloud on AWS Software-Defined Data Centre (SDDC) as the DR target.

These days the idea of backing up VMs to a cloud storage provider and then being able to quickly restore them is fairly commonplace in the market. Datrium, however, take this a step further, integrating the VMware-on-AWS model to reduce RTO while also ensuring reliability by enabling easy, automated test restores.

When Disaster Strikes

In the event of a disaster, Datrium promises a 1-click failover to the DR site through its ControlShift SaaS portal. One of the great benefits here is that the DR site- or at least the compute side of it- doesn’t exist until that failover is initiated. This means the business isn’t paying for hardware to sit idly by just in case there’s a disaster.

The backup data is pushed up to “cheap” AWS storage, and at the point the failover runbook is activated, a vSphere cluster is spun up and the storage is mounted directly as an NFS datastore. VMs can then be powered on as soon as the hosts come online- with Datrium handling any required changes to IP addresses etc.

Whilst the system is running in this DR state, changes are monitored so that when the on-premises environment is restored, failback only requires the delta of changes to be synchronised back from the cloud. At that point the VMware environment on AWS is removed until the next time one is required.
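As a rough, hedged illustration of why a delta-only failback is attractive- the dataset sizes and link speed below are invented for the sake of the arithmetic, not Datrium figures:

```python
# Sketch: failing back only the delta written while running in DR, versus
# re-seeding the full dataset over the same link. Purely illustrative figures.

def sync_hours(gigabytes: float, mbit_per_sec: float) -> float:
    """Transfer time for `gigabytes` of data over a `mbit_per_sec` link."""
    seconds = (gigabytes * 8 * 1000) / mbit_per_sec  # GB -> megabits (decimal)
    return seconds / 3600

full_dataset_gb = 10_000   # the whole protected estate
delta_gb = 200             # changes made while running in the cloud
link_mbps = 1000

print(round(sync_hours(full_dataset_gb, link_mbps), 1))  # 22.2 hours for a full copy
print(round(sync_hours(delta_gb, link_mbps), 1))         # 0.4 hours for the delta
```

In practice compression and deduplication would change these numbers, but the shape of the comparison holds: syncing only the delta turns a day-long failback into something that fits in a maintenance window.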


Testing – Practice Makes Perfect

This ability to spin up and decommission the entire DR site on demand enables realistic testing to be performed without risk to the production workloads. Test restores can be run, and workload-specific checks performed against the restored environment, but the SDDC built on AWS only exists for the duration of the test.

The Datrium platform contains runbooks, and these are not restricted to disaster events- they can also be used to automate testing. The system will, on a schedule, spin up some or all of the VMware environment in a temporary SDDC, run the specified tests, then shut down and destroy the test infrastructure when complete. The results of this testing are compiled into an audit report.

Conclusion

As I’ve alluded to at the top of this post, there are plenty of “Backup” and “DR” products out there servicing Enterprise IT and leveraging the public cloud to do so. Of those, I think Datrium is worth considering particularly if you are focussed on protecting a vSphere environment with a short RTO, and are interested in using VMware on AWS as a DR solution but not that keen on the not-insubstantial costs of running that DR SDDC 24/7.

Please read my standard Declaration/Disclaimer and, before rushing out to buy anything, bear in mind that this article is based on a sales discussion at a sponsored event rather than a POC or production installation. I wasn’t paid to write this article or offered any payment, aside from being entered into a prize draw of delegates to win a chair (I was not a winner).

VMworld Europe 2019 - Day 2 Keynote highlights

The Wednesday General Session at VMworld Europe is usually where VMware puts the meat onto the bones of the Tuesday announcements and this year was no exception. Here’s a quick rundown of my highlights.

Executive VP Ray O’Farrell kicked off proceedings with a video of a near-future environment where a person is making use of futuristic apps, devices, and transport- a storyline which was then tied in to the new VMware announcements. Following on from the success of Elastic Sky Pizza in 2017, attendees were introduced to the latest fictitious company- Tanzu Tees- who must be opening a European branch following their success at VMworld US in August.

The Keynote was divided into four sections to follow this theme- “Build and Run”, “Connect and Protect”, “Manage” and “Experience”. This split the hour into 10-15 minute sections and showed the breadth of today’s VMware portfolio.

Less than 7 minutes into the show and we were already diving into product demos, with Joe Baguley brought in to show an application framework being built out with Spring Initializr, deployed to a Bitnami catalogue with Project Galleon, and made available in the VMware Cloud Marketplace.

The second demo showed off the new Tanzu Mission Control managing Kubernetes clusters across vSphere, AWS, VMware Cloud, Azure, and Google Cloud- all on one screen. A key feature here was the ability to apply policies across all these different platforms from one consistent interface- no need to dive into 3, 4, or 5 different workflows, each with their own GUI, CLI, and API components to deal with.

A demo of Project Pacific followed this. I’ve heard lots of people say how much they appreciated these demonstrations and being able to see what the products actually look like as slide decks can only take you so far.

In this third demo we saw the vSphere Client we all know managing Kubernetes clusters alongside VMs and container pods- all natively within ESX. VMware are already using this technology in-house- currently creating and destroying 800,000 containers weekly- a number which is growing.

Moving on to the “Connect and Protect” section, Ray was joined onstage by Marcos Hernandez, who had more demos. The first of these looked at the NSX Intelligence features- picking up risks, threats, and vulnerabilities surfaced by the new Distributed IDS/IPS technology in NSX, then applying recommended firewall rules to remediate the faults.

Marcos’s second demo looked at how Carbon Black Cloud Workload adds another layer of protection to the application- spotting known vulnerabilities and locations in the infrastructure where encryption wasn’t implemented. The demo included a simulated hack on the Tanzu Tees application and showed how Carbon Black and AppDefense detected the intrusion attempt.

The “Manage” segment brought Purnima Padmanabhan to the stage. Wavefront was the first product up here, collecting metrics from the components of the Tanzu Tees apps and drilling down into individual microservices to diagnose performance problems- in this demo identifying a specific SQL query which was the root cause.

Project Magna was next up in the demonstrations- this uses AI and ML to optimise application performance- in this example by modifying cache size based on the current workload on the storage device.

CloudHealth was used by Tanzu Tees to analyse the usage of the components of the applications and recommend right-sizing of VMs and produce budget alerts to help proactively manage cloud spend.

The final section- “Experience”- was led by Shikha Mittal, who continued the demo-heavy theme by showing how Horizon Virtual Desktop sites can be created on both AWS and Azure clouds and use on-premises-style images alongside Microsoft’s Windows Virtual Desktop deployments of Windows 10.

VMware Workspace One was shown managing a variety of end-user devices and connecting to Carbon Black to spot deviations from usual device behaviour- for example, malicious logins and potentially compromised endpoints. Again, VMware uses this internally for their 60,000 endpoints across the globe.

The new CTO of VMware, Greg Lavender, closed out the presentations talking through some of the forward-looking activities of his office including using Bitfusion appliances to provide GPU resources across a network thus sharing a pool of GPU resources amongst a CPU-only ESX infrastructure.

In summary, this was a session full of product demonstrations- definitely worth a watch, even if just to pick out the bits relevant to you. You can now tune into the full keynote (1 hour) on YouTube.
