Data centre without hassle

Here you will find basic information about Nutanix hyperconverged infrastructure - the modern and optimal solution for data centers. Use the following links for quick reference:

A general introduction to hyperconverged infrastructure

Traditional three-layer infrastructure has been refined over the years, but the limitations built into its design can no longer be overcome even by the fastest NVMe drives or SANs. In the past, we separated data from processors and then kept adding complexity around the resulting bottleneck. What if we put the data back into the servers and removed the need for a SAN altogether? It sounds almost heretical after so many years spent developing the three-layer model. The fundamental barrier to deploying hyperconverged infrastructure (HCI) is precisely this inherent resistance to change. Once we get past it, the benefits of HCI become fully apparent.

A simple, effective and flexible solution
Hyperconverged infrastructure is simple in both principle and architecture. Think of it as a group of servers connected to a LAN. A group means anything from three servers up to as many as your server room floor can hold. Unlike the classic three-layer architecture, these servers also contain disks, i.e. the data storage itself. The hardware is therefore ordinary commodity hardware, which brings two major advantages: low cost and easy availability. What makes these ordinary servers highly available, durable, highly scalable, versatile and efficient is the Nutanix software layer. It gives you a fully featured enterprise solution with everything you are used to from a modern datacenter. Almost everything! You will have to do without downtime during upgrades and hot spots when expanding capacity.

Figure: The Nutanix principle

The versatility of hyperconverged infrastructure comes from the flexibility with which a cluster can be built and expanded. It can be a mini-cluster of 3 servers with a total of 3 processors or, at the other extreme, a monster certified to run SAP HANA. It can equally serve as infrastructure for VDI, database servers, PACS or anything else. Incidentally, even giants like Google or Facebook use hyperconverged solutions.

Nutanix as a leader
Among other things, the Nutanix software layer offers the greatest freedom in the choice of HW platform and hypervisor. If you opt for Nutanix's integrated native hypervisor, you save on VMware or Hyper-V licenses. However, if you want to use a different hypervisor, there is nothing to stop you. Whether we like it or not, hyperconverged infrastructure is the future of data centers. And why wouldn't you want it?

Benefits of hyperconverged infrastructure

  • Low TCO compared to traditional infrastructure
  • High level of automation and low management and human resource requirements
  • Substantial savings in server room space, cooling costs and electricity consumption
  • Unlimited scalability, flexibility and incremental growth of the solution
  • The convenience and speed of cloud services for on-premise deployments
  • Robust solution with native support for DR scenarios and replication
  • Versatile solution in terms of both size and workload
  • Hundreds of VDIs or nearly a million IOPS in a 2U appliance
  • Freedom of choice of hardware platform and hypervisor

Reference case study and video with customer

One of our reference installations is the infrastructure for operating the PACS systems at Motol University Hospital. For the customer, this is a vital and critically important application on which human lives literally depend. The solution built on the Nutanix platform met the demanding technical and performance requirements while keeping acquisition and operating costs low. From a technical perspective, it consists of two separate clusters with high disk capacity in two locations, which ensures high availability. In addition to reliability, the customer greatly appreciated the low management requirements of the solution, the quality of the Nutanix Acropolis hypervisor, the seamless scalability and the versatility of use for systems other than PACS. You can see what the Head of the IS Department at FN Motol, Ing. Martin Voříšek, has to say about it in the following video:

Testing - Proof of Concept

Of course, our customers can test the offered solution directly in their own environment and possibly on their own data. If you are interested in Nutanix and are considering the purchase of a hyperconverged infrastructure, the implementation of a proof of concept (POC) has several major advantages. In addition to demonstrating the fact that Nutanix really works as promised, a POC can be used to refine the sizing and clearly validate the optimizations and savings from using hyperconverged infrastructure.

The actual POC is typically done by renting a 2U appliance with four nodes forming a cluster. We will help you deploy your applications on this cluster, test the required scenarios, measure performance and then evaluate the data after a few weeks of operation. We always tailor the test configuration and the POC flow to the customer's requirements as much as possible, so that the POC runs exactly according to your scenario. Do you need to expand datacenter capacity, are your servers and disk arrays nearing end of life or are you experiencing outages or performance issues? Don't hesitate to contact us and we'll show you that it can be easier and better.

Technical details

Basic technical principles of the Nutanix hyperconverged solution

Every production data center environment must contain two key elements: the infrastructure functions themselves and the management and control functions. While the requirements for the operational part are usually clearly defined, the management and control environment often struggles with unclear requirements and is also fundamentally affected by fragmentation. This is partly because the individual functions evolve dynamically, and partly because of the need to integrate technologies and processes from different vendors and to work with their often different philosophies and approaches to each function.

Like the three-layer architecture, the Nutanix hyperconverged solution includes an operational layer, AOS, and a management and control layer, Nutanix Prism. Because both layers are developed and designed in a coordinated manner, each function is designed to be manageable and to make sense within existing processes and workflows. As a result, not only can both layers be used efficiently, but many complex processes and functions can be completely automated.

From our experience in running data centers, we see an interesting contradiction. When a traditional infrastructure is operated and expanded, it is taken for granted that computing capacity grows with the number and size of the physical resources, yet the administration and management layer usually does not expand automatically. Installing a new appliance with new features often means changing workflows and reconfiguring the existing management and monitoring environment. In Nutanix Prism, however, the management and control layer is fully distributed and grows automatically as the infrastructure grows. Thus, individual management elements are not overloaded, and when functionality is extended, the capabilities of other parts of the management layer are expanded automatically as well. Examples include the automatic expansion of virtual network capabilities when a SW or HW firewall is implemented, or the automatic expansion of virtual machine management capabilities when an H/A environment is deployed. Another welcome consequence of this philosophy is that the Nutanix environment is managed just as seamlessly regardless of the size of the infrastructure.

For a basic understanding of how this all actually works, we need to describe the fundamental element of the Nutanix hyperconverged solution: the CVM, or Control Virtual Machine. This is a dedicated virtual machine that is responsible for all Nutanix functionality, such as control and operation of distributed storage, distributed networking, administration and management. In fact, the CVM contains all the logic, functionality and management except for the hypervisor layer. This architectural decision is the key to why Nutanix can run under other hypervisors such as VMware, Hyper-V and Xen as well as its own AHV hypervisor. For the same reason, Nutanix is not limited to x86_64 processors, but can also run on IBM Power processors.

Another advantage of the CVM approach is that this virtual machine has everything it needs to run, including management and monitoring tools. As a result, the infrastructure is able to absorb very drastic hardware failures while still using the remaining resources efficiently. It often happens that not the whole HW block (server) fails, but only part of the memory, some of the disks, a controller, etc. If, for example, a disk controller fails, Nutanix is able to keep using that HW at least as a computing resource, with data provided by other parts of the infrastructure. Importantly, these reconfigurations occur automatically without the need for operator intervention. This brief introduction to the CVM is important for making the overall philosophy of the Nutanix environment clearer.


Features for modern data center operations

The operation of traditional infrastructures (three-layer, monolithic, etc.) has demonstrated over time the necessity and usefulness of advanced data center features, and hyperconverged infrastructure can leverage and further develop them. These include data storage functions, snapshots, synchronous and asynchronous replication, data cloning, network security functions, segmentation, failover protection, data redundancy, RAID, storage tiering, etc. Because the Nutanix environment contains all layers of the infrastructure and has sufficient information about the traffic, it can dynamically change individual parameters and automatically generate complex chains of events.

A good example is a situation where one virtual machine negatively affects the entire infrastructure, so it is advantageous to relocate it. In a traditional infrastructure, you need to deal with replicas, replication pairs, data coherence and consistency, scripts for reversing replication pairs, etc. In a Nutanix environment with replication set up, we just choose where we want to migrate to. The system automatically starts a cascade of processes at the storage, networking and virtual machine level, at the end of which a final snapshot is added to the last replica and the VM starts in the other location. Even with asynchronous replication, a planned migration therefore involves zero data loss, and all the steps are generated and chained automatically.

Another interesting scenario is an unexpected change in the load pattern of entire groups of VMs, e.g. VDI, where individual groups of machines load the datastore chaotically. From the disk subsystem's perspective, this is random reading, where there is nothing to optimize. However, Nutanix is able to recognize that it is a group of VMs with similar workloads, so a single read operation can serve the entire group. The situation is further eased by Nutanix's ability to chain long series of clones and snapshots without a performance penalty, which reduces disk capacity requirements and greatly speeds up read operations for groups of similar VMs.

Core elements of Nutanix hyperconverged infrastructure

  • AOS Distributed Storage
  • AHV virtualization - Nutanix hypervisor (or third party hypervisors)
  • Scale Out Storage services - Distributed data or object services.
  • Advanced Virtual Networking - Nutanix Flow - Virtual distributed networking services.
  • Other interesting features

AOS Distributed Storage

An absolutely fundamental and essential element for running a hyperconverged infrastructure is distributed storage, which allows you to combine the available disk capacity into a single software-defined storage pool that provides space for individual virtual machines, applications, and the hypervisor. The disk storage must support all of the advanced features mentioned above while fully supporting VMware® vSphere, Microsoft® Hyper-V, Citrix® hypervisors. Let's describe the essential technologies that enable distributed storage to operate in an enterprise environment.

Support for different hypervisors and transport protocols

Different hypervisors use different transport protocols such as iSCSI, NFS and SMB for their operation. Nutanix can support all of them, not only for the virtualization operation itself, but it is also able to provide data services for other systems, making it possible to use Nutanix as a distributed disk array.

Flexible redundancy

Nutanix allows you to choose the level of redundancy both for the Nutanix cluster as a whole and for individual VMs. The level of redundancy is defined by the replication factor. A factor of 2 is the default and is usually sufficient; it means that all data is stored in two copies. For particularly critical systems, it is advisable to choose a replication factor of 3, with which the system is able to absorb 2 simultaneous HW failures. Since the data is stored in at least two copies, a redundant path can be defined to each data object. This makes it possible to access the data even in case of partial or total unavailability or failure of a compute node (server). This functionality is crucial for two types of events:

  • Scheduled Updates or Scheduled Outages. In this case, the remaining part of the compute cluster will take over the operation without failure.
  • Unplanned outages and failures. Again, the rest of the infrastructure will take over operation, but Nutanix will begin recalculating and distributing additional redundant data to meet the two or three redundant copies requirement. Since the rest of the infrastructure is involved in the rebuild, Nutanix can absorb and repair this type of outage very quickly.
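
The relationship between the replication factor, the number of tolerated failures and the usable capacity can be illustrated with a little arithmetic. The following Python sketch is a simplification: the function names and the 120 TB raw capacity are hypothetical, and the calculation ignores metadata, space reserved for rebuilds and any savings from compression or deduplication.

```python
def tolerated_failures(replication_factor):
    """Copies that can be lost while one readable copy still remains."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


def usable_capacity_tb(raw_capacity_tb, replication_factor):
    """Space left for unique data when every block is stored RF times."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_capacity_tb / replication_factor


RAW_TB = 120.0  # hypothetical raw capacity of the whole cluster
for rf in (2, 3):
    print("RF", rf,
          "tolerates", tolerated_failures(rf), "failure(s),",
          "usable capacity", usable_capacity_tb(RAW_TB, rf), "TB")
```

The trade-off is visible immediately: a higher replication factor buys tolerance of one more simultaneous failure at the price of a smaller share of usable capacity.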

Data integrity check

During operation, Nutanix checks for data corruption at the HW level. A HW failure of the disk subsystem is not always obvious. Occasionally, random bit-flip errors (bit rot) occur and surface during read operations. Because Nutanix checks data integrity not only at each read but also continuously in the background, it is able to detect these errors and correct them, while isolating the faulty data segments from operation. This technology protects against creeping data corruption, which is often detected only when the damage is beyond repair and reaches back further than the retention of the backup systems.
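
The principle of verifying data on every read and repairing it from a healthy copy can be sketched as follows. This is only an illustration, not Nutanix's implementation: the ChecksummedStore class, the use of CRC32 and the two in-memory replicas are all assumptions made for the example.

```python
import zlib


class ChecksummedStore:
    """Toy store that keeps two replicas of every block plus a checksum."""

    def __init__(self):
        # Each replica maps block_id -> (data, checksum recorded at write time).
        self._replicas = [{}, {}]

    def write(self, block_id, data):
        checksum = zlib.crc32(data)
        for replica in self._replicas:
            replica[block_id] = (bytearray(data), checksum)

    def flip_bit(self, replica_index, block_id):
        """Simulate bit rot by flipping one bit in one replica."""
        data, _ = self._replicas[replica_index][block_id]
        data[0] ^= 0x01

    def is_healthy(self, replica_index, block_id):
        data, checksum = self._replicas[replica_index][block_id]
        return zlib.crc32(data) == checksum

    def read(self, block_id):
        """Return the first copy matching its checksum and repair the rest."""
        for replica in self._replicas:
            data, checksum = replica[block_id]
            if zlib.crc32(data) == checksum:
                good = bytes(data)
                for other in self._replicas:
                    other[block_id] = (bytearray(good), checksum)
                return good
        raise IOError("no healthy replica left for block %r" % (block_id,))


store = ChecksummedStore()
store.write("block-1", b"patient scan")
store.flip_bit(0, "block-1")            # silent corruption on the first replica
print(store.read("block-1"))            # served from the healthy second replica
print(store.is_healthy(0, "block-1"))   # the damaged copy has been rewritten
```

A continuous background check would simply call the same verification for every block on a schedule instead of waiting for a read.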

Availability Domains

In large data centers, entire blocks of HW, such as entire racks or multi-node chassis, can and do fail. The cause need not be a power failure; it can be a simple connectivity failure and so on. Drastically increasing the number of redundant copies would be highly inefficient. A simple and effective defense against these types of outages is the use of availability domains. Availability domains are definitions that allow Nutanix to determine where a particular data object is located, so that redundant copies are never written to the same availability domain. This feature is sometimes also called rack/block awareness. It allows Nutanix to absorb the simultaneous failure of an entire block or rack without data loss or an outage.
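
The rule that redundant copies must never share an availability domain can be sketched in a few lines of Python. The function name, the rack and node names and the simple "first node in each domain" selection are hypothetical; a real placement algorithm would also weigh capacity and load.

```python
from typing import Dict, List


def place_replicas(nodes_by_domain: Dict[str, List[str]],
                   replication_factor: int) -> List[str]:
    """Pick one node from each of `replication_factor` distinct domains."""
    domains = sorted(d for d, nodes in nodes_by_domain.items() if nodes)
    if len(domains) < replication_factor:
        raise ValueError(
            "need at least %d availability domains, found %d"
            % (replication_factor, len(domains)))
    # One copy per domain, so losing a whole domain loses at most one copy.
    return [sorted(nodes_by_domain[d])[0] for d in domains[:replication_factor]]


cluster = {
    "rack-a": ["node-1", "node-2"],
    "rack-b": ["node-3"],
    "rack-c": ["node-4"],
}
print(place_replicas(cluster, 2))
```

With this rule in place, the failure of a whole rack can remove at most one copy of any data object, which is why the number of copies does not have to be increased.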

The Nutanix Distributed Storage fabric includes many features to optimize performance.

Data Locality

Data locality is a unique feature that improves disk subsystem performance parameters quite dramatically, while requiring no special hardware resources. The architectural reasoning is quite simple. The fastest disk operations are always the local ones because they do not carry the latency of the entire infrastructure. In addition, Nutanix not only has all the information about the data, but also about the location of the virtual machine, and is able to proactively colocate the data so that all data is locally available to the virtual machine. This feature works automatically and in case of moving a VM to another node/server, Data Locality automatically moves the data to the corresponding compute node.


Automatic disk tiering

Automatic disk tiering eliminates the need to define and manually manage the usage of each disk tier. It is therefore possible, and even advantageous, to combine compute nodes with different speeds or sizes of disk subsystem, or to use hybrid configurations such as NVMe and SSD or SSD and HDD. In addition, the NVMe and SSD tier is used evenly, so that hotspots (places with excessive disk wear) are not created and excessive SSD wear does not occur.

Automatic Disk Balancing

Since Nutanix allows the use of different combinations of nodes with different capacities and speeds, Automatic Disk Balancing ensures even data distribution even in such a heterogeneous configuration without the need for manual rebalancing.

Shadow Clones

Shadow clone technology dramatically improves performance for systems that are defined as clones of a Master Image, typically Citrix MCS Master VMs or VMware View replica disks. In these cases, a single read operation can serve large groups of VMs. Using shadow clones also allows virtually unlimited chaining of clones without penalizing disk subsystem performance. Shadow clones also dramatically increase the efficiency of disk subsystem utilization.

Disk capacity optimization

Deduplication
Nutanix offers two options to deduplicate data and increase application speed.
The first option is Inline Deduplication, which prevents duplicate data in the cache and SSD layer.
The other option is global post process deduplication, which reduces duplication in the capacity tier and thus significantly improves the utilization potential.
For duplicate identification and fingerprinting, Nutanix uses hardware-accelerated SHA-1 hashing, so the performance impact is minimal and in most cases the space savings actually speed things up.
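
Fingerprint-based deduplication can be illustrated with a small Python sketch. Only the use of SHA-1 for fingerprinting comes from the text above; the DedupStore class, the 4 KiB chunk size and the in-memory dictionaries are assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy store that keeps each unique chunk only once."""

    def __init__(self, chunk_size=4096):  # chunk size is an assumption
        self._chunk_size = chunk_size
        self._chunks = {}   # SHA-1 fingerprint -> chunk bytes
        self._files = {}    # file name -> ordered list of fingerprints

    def write(self, name, data):
        fingerprints = []
        for offset in range(0, len(data), self._chunk_size):
            chunk = data[offset:offset + self._chunk_size]
            fingerprint = hashlib.sha1(chunk).hexdigest()
            # A chunk whose fingerprint is already known is not stored again.
            self._chunks.setdefault(fingerprint, chunk)
            fingerprints.append(fingerprint)
        self._files[name] = fingerprints

    def read(self, name):
        return b"".join(self._chunks[fp] for fp in self._files[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self._chunks.values())


store = DedupStore()
image = b"A" * 4096 + b"B" * 4096      # stand-in for a VM disk image
store.write("vm-1", image)
store.write("vm-2", image)             # identical content, no new chunks
print(store.stored_bytes(), "bytes stored for", 2 * len(image), "bytes written")
```

The same idea explains why deduplication can speed up reads: identical chunks requested by many VMs are served from one stored copy.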

Compression
Compression, like deduplication, can run inline or as a post-process, with the same benefits. Deduplication and compression can be combined, and both are designed so that using one does not prevent the use of the other.

Erasure Coding EC-X
The EC-X algorithm provides significant space savings. As the number of nodes increases, the achievable saving of disk capacity grows, up to a maximum of 75% of the gross capacity of the disks used. The primary purpose of EC-X is to eliminate the capacity cost of duplicating data for the replication factor. EC-X resembles RAID technology in its characteristics, but does not slow down write operations.
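
The RAID-like idea behind erasure coding, storing parity instead of full copies, can be shown with simple XOR parity. This sketch does not describe the EC-X algorithm itself: the single XOR parity block, the strip of three data blocks and the function names are illustrative assumptions only.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def usable_fraction(data_blocks, parity_blocks):
    """Share of the stored blocks that holds unique data."""
    return data_blocks / (data_blocks + parity_blocks)


# A hypothetical strip: three data blocks on three nodes, parity on a fourth.
strip = [b"node", b"data", b"strp"]
parity = xor_blocks(strip)

# If the node holding strip[1] fails, its block is rebuilt from the others.
rebuilt = xor_blocks([strip[0], strip[2], parity])
print(rebuilt == strip[1])

# Two full copies keep half of the space for unique data; a strip with
# parity keeps more, which is where the space saving comes from.
print(usable_fraction(1, 1), usable_fraction(3, 1))
```

Longer strips need more nodes, which matches the statement above that the achievable saving grows with the number of nodes.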

AHV virtualization
AHV virtualization is a native virtualization environment that is part of the Nutanix solution. Of course, other supported hypervisors can also be used, but AHV can be used in the Nutanix environment without additional licensing requirements while offering all the important features needed to run data centers operating in even the most stringent modes.

Basic virtual machine operations go without saying, but Nutanix offers other interesting features as well.

Image Management - Central management of installation media and master image
Nutanix is not limited to its own formats, but also allows the use of external formats such as .raw, .vhd, .vmdk, .vdi and .qcow2. This greatly reduces the need for manual conversion of existing Master Images and saves considerable time and effort when defining multiple versions of Master Images for different hypervisors.

Acropolis Dynamic Scheduling (ADS)
ADS continuously evaluates status and performance characteristics and automatically assesses where it is most advantageous to place new VMs or installation media. ADS can also monitor anomalous VM behavior and use AI to proactively rebalance VM placement. ADS naturally respects the Affinity and AntiAffinity rule settings.

Affinity and AntiAffinity rules
Because Nutanix allows you to combine compute nodes with different hardware configurations, it can be advantageous to have a virtual machine prefer, or be locked to, a specific HW type or specific nodes. The reasons can be licensing, capacity (e.g., specific nodes are designed to run SAP HANA), or hardware (e.g., some nodes contain GPUs that the VM needs to run). Affinity rules allow this locking.
AntiAffinity rules are useful when it is not desirable to have two VMs running on the same compute node. The reason may be functional, where two VMs interfere with each other, or architectural, where placing them together would reduce availability.
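
How a scheduler might combine both kinds of rules when looking for a node can be sketched as follows. The function, the rule structures and the VM and node names are all hypothetical and serve only to illustrate the logic; they do not reflect how rules are configured in Nutanix.

```python
def allowed_nodes(vm, nodes, affinity, anti_affinity, placement):
    """Return the nodes on which `vm` may run.

    affinity:      vm name -> set of nodes the VM is locked to (absent = any)
    anti_affinity: list of sets of VMs that must not share a node
    placement:     vm name -> node on which that VM currently runs
    """
    candidates = set(affinity.get(vm, nodes))
    for group in anti_affinity:
        if vm in group:
            for other in group - {vm}:
                # Exclude the node already used by a conflicting VM.
                candidates.discard(placement.get(other))
    return sorted(candidates)


nodes = ["node-1", "node-2", "node-3"]
affinity = {"render-vm": {"node-2", "node-3"}}      # only these nodes have GPUs
anti_affinity = [{"render-vm", "db-replica"}]       # keep these two apart
placement = {"db-replica": "node-2"}

print(allowed_nodes("render-vm", nodes, affinity, anti_affinity, placement))
```

A VM without any rules is simply allowed on every node, so the rules only narrow the choice where they are actually defined.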

Live Migration
Live Migration allows you to move a VM on the fly to a different compute node. This feature is particularly important during HW upgrades or maintenance, which can then be performed without impacting operations. In addition, with synchronous replication, live migration is also possible to a completely different Nutanix cluster. Thus, DR-driven failover and failback can be performed on the fly without VM downtime.

Cross-Hypervisor Migration
Nutanix DFS allows you to migrate between ESXi and AHV. This feature is interesting if you decide to keep your existing hypervisor in the primary site, for example, while the DR site can run Nutanix AHV.

Automatic HA
In case of a HW failure of one node, Nutanix will automatically start the affected VMs on another node. With optimal N+1 allocation, everything happens automatically. However, Nutanix can also handle situations where the cluster is overallocated and the HW resources of the remaining nodes do not allow all VMs to run. For such cases, it is possible to define critical systems and reserve resources so that at least these critical systems keep running. This feature is particularly advantageous if the cluster runs critical systems alongside, e.g., test and development environments that are less critical.
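
The idea of keeping critical systems running on an overallocated cluster can be sketched as a simple restart plan. The function, the memory-only capacity model and the VM names are hypothetical simplifications, not a description of the actual HA logic.

```python
def plan_restart(vms, free_memory_gb):
    """Decide which VMs to restart after a node failure.

    vms: list of (name, memory_gb, is_critical) tuples.
    Critical VMs are considered first; the rest start only if space remains.
    """
    started, skipped = [], []
    for name, memory_gb, is_critical in sorted(
            vms, key=lambda vm: (not vm[2], vm[0])):
        if memory_gb <= free_memory_gb:
            started.append(name)
            free_memory_gb -= memory_gb
        else:
            skipped.append(name)
    return started, skipped


affected = [
    ("test-env", 32, False),
    ("database", 64, True),
    ("pacs", 48, True),
]
started, skipped = plan_restart(affected, free_memory_gb=120)
print("restart:", started)
print("wait for capacity:", skipped)
```

In this example the two critical VMs fit into the remaining capacity, while the test environment waits until resources are available again.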

AHV Turbo
AHV Turbo combines two techniques to speed up I/O operations.
If the guest fully supports VirtIO PCI, Nutanix allows the virtualized I/O stack layer to be bypassed. This significantly reduces the latency of I/O operations.
The second feature is multichannel operation. To illustrate the problem simply, suppose we have a single VM running a database that processes small, random segments of data, while a backup runs in parallel and processes few but very large data blocks. If only one I/O queue (channel) is available, the backup process will completely overwhelm that channel and the database will virtually stop.
Multiple channels reduce this problem substantially, and AHV Turbo automatically defines the optimal number of channels according to the number of vCPUs. These two technologies significantly speed up disk subsystem response times, and their benefit is especially noticeable with NVMe disks.

RDMA
Remote Direct Memory Access (RDMA) allows you to leverage the RDMA over Converged Ethernet (RoCEv2) protocol to significantly reduce TCP/IP latencies for write operations by having RDMA write redundant data directly to the other node. RDMA technology not only significantly reduces latency, but also dramatically reduces the load on CVM for write operations.

Data Locality, AHV Turbo, and RDMA technologies have a significant positive impact on throughput and latency of the entire environment and enable the full potential of NVMe, Intel Optane, etc.

vNUMA
Modern Intel processor-based servers have separate memory banks assigned to each processor. If one processor needs to service a memory block that is not assigned to it, it must ask another processor to handle the operation. The negative effects of Non Uniform Memory Access (NUMA) technology are not noticeable in normal cases, but extremely large virtual machines can have serious problems with individual processors overloading each other with memory operations. To prevent this from happening, Nutanix includes the vNUMA feature. This technology allows the virtual machine to respect the physical architecture and avoid individual processors negatively interacting with each other. This technology is crucial for running extremely large VMs such as SAP HANA.

GPU support
Modern compute-intensive systems often use GPUs to offload compute workloads. Nutanix supports a wide range of GPUs. A GPU can be assigned to a VM physically, or it can be virtualized and shared, with individual virtual GPUs (vGPUs) assigned to multiple systems, which can reduce the number of physically deployed GPUs and dramatically improve their utilization.


Scale Out Storage services

Nutanix Files
In virtual environments, it is often advantageous to use disk storage not only as block storage, but also as NAS. Nutanix offers the option to use the easy-to-configure Nutanix Files service, which supports SMB 2.1 and NFS v4, and its main advantages are easy configuration and expansion, convenient management and an automatic balancer.

Nutanix Objects
Similar to Nutanix Files, Nutanix Objects is an automatically defined and load-balanced object storage service that is compatible with the S3 standard and accessed through REST APIs.
Both systems have the advantage of easy management, easy deployment, and proven reference operation in petabyte data environments.


Advanced Virtual Networking

In order to securely and reliably connect individual systems to each other and to the surrounding infrastructure, Nutanix uses vSwitch technology fully integrated into the management and workflow environments of Nutanix AOS and Nutanix Prism. Using vSwitch technology allows for easy setup and management of virtual segments.

Nutanix Flow
In some cases, it may be appropriate to supplement segmentation with application security features, traffic detection, and security enhancements using third-party solutions. Nutanix Flow addresses all of these issues by enabling the management and deployment of application firewall and microsegmentation at both the VM and individual service level.
If communication between VMs and services is not sufficiently documented, Nutanix Flow can run in permissive mode and document the communication flows itself. Once the mapping is complete, it is very easy to create application rules and rules for communication with the surrounding infrastructure. In addition, Nutanix Flow includes an API that enables deeper cooperation with existing supported security solutions.


Features to ensure high availability and site protection

Nutanix offers several strategies to ensure site-wide protection:

Asynchronous replication
Nutanix offers the ability to replicate to a remote Nutanix cluster, or alternatively to the Nutanix cloud, at intervals ranging from days down to 1 hour. The advantages are very simple deployment, low data link requirements and minimal load on the production environment.

Near Sync replication
Near sync replication is very similar to asynchronous replication, with replication times on the order of minutes. It can be advantageously used for replication on links that are fast but have very high latency.

Asynchronous and Near Sync replication offer major advantages: very simple implementation, easy monitoring and a clear orientation on the virtual machine. Complete consistency groups are defined per VM and also support pre/post scripts if the application requires them. Another very welcome feature is that there is no data loss in a managed transition, because all data is replicated up to the last point before the VM starts in the other site. Async and near sync replicas can be defined in either a 1:1 or a 1:Many configuration. Failover and failback are equivalent operations, so transitions to and from DR are not a problem. VMs can be recovered from both local and remote replicas, so it is advantageous to use, for example, a slower site with higher capacity as storage with a much longer retention.


Synchronous replication

Synchronous Replication / Stretched Cluster is more infrastructure intensive, but brings interesting tools for managing high availability for mission-critical environments. In particular:

- Moving to an environment with a different network configuration
VMs can be prepared to start in an environment that does not match the IP scheme of the primary site.

- Testing DR transition during operation
The test starts the complete environment in the DR site, but on a separate network segment. Thus, it is possible to test the DR transition function and VM functionality during operation and then cancel the test environment.

- Start-up sequence
The definition of the startup sequence specifies the order in which the individual machines are stopped and started.
The sequence can define a parallel, serial, or combined plan for starting an entire group of VMs.

- Tertiary site. Witness
If automatic failover is required, a system at a tertiary site is highly recommended. Stretched Cluster technology brings with it the danger that, if the sites are disconnected from each other, both sites will be evaluated as active. This situation is called split brain and can cause catastrophic data inconsistency. The tertiary site and the Witness determine unambiguously which site is available and will carry the production load. The Witness VM is very small, uncomplicated and supports all available hypervisors.

- Live VM Migration between clusters
Similar to Live VM migration between nodes, it is possible to migrate VMs between sites and perform Failover without downtime.

- Recovery points
Although the purpose of synchronous replication is to write immediately to both sites, Nutanix periodically creates recovery points on both sites for security reasons. Thus, the transition is not necessarily to the last point in time, but it is possible to transition to a specific time - Point in time recovery. It is thus possible to recover the last correct data.
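
The role of the Witness in preventing split brain, described in the list above, can be sketched with a minimal arbitration model. The Witness class, the function and the "first site to ask wins" rule are assumptions chosen for illustration; they are not a description of the actual Nutanix Witness protocol.

```python
class Witness:
    """Toy arbiter at a third site: grants the active role to one site only."""

    def __init__(self):
        self._active_site = None

    def request_active_role(self, site):
        # The first site to ask after a link failure wins; others are refused.
        if self._active_site is None:
            self._active_site = site
        return self._active_site == site


def may_serve_production(site, peer_reachable, witness, witness_reachable=True):
    """Decide whether a site may keep running the production load."""
    if peer_reachable:
        return True      # normal operation, nothing to arbitrate
    if not witness_reachable:
        return False     # isolated site: stay passive rather than risk split brain
    return witness.request_active_role(site)


witness = Witness()
# The link between the sites fails; both sites still reach the Witness.
print("site A active:", may_serve_production("A", False, witness))
print("site B active:", may_serve_production("B", False, witness))
```

Without the third party, each site would only know that the other is unreachable and both could decide to become active.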

Other interesting features

Update management system - Nutanix LifeCycle Manager (LCM)
Nutanix LCM handles complete update and firmware management. LCM automatically detects the different BIOS, firmware and component versions and automatically defines the update procedure. LCM also automatically checks that the versions are consistent with each other, and checks whether Nutanix is loaded beyond the recommended threshold and therefore whether the update can be performed without failure.
LCM also detects and automatically resolves dependencies, so that, for example, it will not allow an AOS upgrade on an unsupported firmware version. If multiple versions need to be upgraded, LCM automatically creates a rolling update. LCM also alerts you when support for the currently running versions is about to end.

Comprehensive periodic Nutanix Cluster Check (NCC)
NCC periodically checks the operation and settings of the entire environment. It detects hidden infrastructure failures, suboptimal environment settings, and deficiencies in security settings. Every test that is evaluated as an error, security risk, or unexpected value refers to a clear troubleshooting procedure and a specific Nutanix KB number with additional information.

Contact
Nutanix Expert

Tomáš Osička
Infrastructure System Engineer