What causes performance issues with Flash Storage — and how to fix them

“For modern data centers that rely on server virtualization, flash storage has proven to be revolutionary in terms of performance gains.”

For over ten years, NAND flash has been a primary component of data storage. Early on, users quickly adopted flash in the form of USB thumb drives. These days, storage admins in enterprise data centers run storage arrays built on SSDs and proprietary flash.

In the never-ending quest for better-performing storage, flash in ever denser form factors has seen high adoption. In fact, for modern data centers that rely on server virtualization, flash storage has proven to be revolutionary in terms of performance gains.

Modern data reduction techniques such as inline deduplication and inline compression, along with server virtualization APIs such as VAAI, help flash-based arrays deliver faster, more reliable, and feature-rich storage on commodity hardware: standard servers and SSDs.

What do storage pros need to know about flash?

In this blog post, we'll identify what storage pros need to know about performance issues with flash storage. Specifically, we'll cover the following:

  • What does performance mean in the world of flash storage?
  • How is performance measured — what are key performance characteristics?
  • How do you troubleshoot performance issues in a solution? How do you know whether the issue is storage performance, SAN/NAS (network) performance, or the server/hypervisor?

What does storage performance mean?

In layman’s terms, faster performance means moving more IO between storage and server in less time than the previous storage system could. In the days of spinning-disk storage, IO performance was limited by the number of spindles, so customers had to buy a large number of spindles to achieve a higher number of IOPS (Input/Output Operations per Second).

Flash storage has completely changed what is required to deliver IOPS. Now storage admins can choose a smaller array and still get ample IOPS in a smaller form factor, plus data reduction features such as inline dedupe and compression for the 5+ year life of the storage array. Typical performance parameters for flash storage arrays are listed in the next section.

Additionally, applications spend fewer CPU cycles accessing faster flash storage, which reduces their overall CPU requirements. This is a real advantage when an application is licensed per CPU core or socket: users can run the same application on fewer cores and shrink their license bill. The savings from buying fewer cores can be reinvested in faster flash storage.

Performance Parameters

Datasheets for flash storage arrays usually specify the following per controller:

  1. Number of CPU cores/sockets and clock speed
  2. Amount of DRAM on the controller
  3. Number/types of IO ports (e.g., 16 Gbps FC, 10 GbE)
  4. Number of SSDs in the base enclosure and per-SSD capacity (raw terabytes)
  5. Number of SSD shelves the controller can support (total raw terabytes)
  6. IO operations per second at an 8K block size with a 70/30 or 80/20 read/write ratio (usually at sub-millisecond latency)
    Note: Some vendors do not publicly publish performance numbers on their datasheets.
  7. Data reduction via deduplication and compression; the reduction ratio is commonly 3:1 to 5:1 depending on the dataset (a rough capacity sketch follows this list).
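To make the data reduction math concrete, here is a minimal sketch of how raw capacity and a reduction ratio combine into an effective-capacity estimate. The function name and figures are hypothetical, and RAID and spare overhead are ignored for simplicity.

```python
def effective_capacity_tb(raw_tb: float, reduction_ratio: float) -> float:
    """Estimate effective capacity from raw capacity and a data
    reduction ratio (deduplication + compression combined).
    RAID/spare overhead is ignored in this simplified sketch."""
    return raw_tb * reduction_ratio

# Hypothetical example: 20 TB raw in the base enclosure at 4:1 reduction
print(effective_capacity_tb(20, 4.0))  # -> 80.0 effective TB
```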

From the datasheet items above, a few key performance parameters stand out; they appear on the performance dashboard of every flash array.

Top Performance Parameters for Flash Storage
  1. Block size (e.g., 8K):
    This can differ by application.
  2. IOPS:
    Input/Output Operations per second.
  3. Latency:
    The average round-trip time to read or write a block of IO, usually measured in microseconds or milliseconds.
  4. Bandwidth/throughput (computed from block size and IOPS):
    Bandwidth = IOPS x IO size
    Bandwidth is traditionally measured in megabytes per second (MB/s) for Fibre Channel and megabits per second (Mb/s) for Ethernet networks. A small worked sketch of this calculation follows the list.
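As a quick illustration of the formula above, here is a minimal sketch that converts IOPS and block size into MB/s and Mb/s. The function names and the 100,000 IOPS figure are hypothetical.

```python
def bandwidth_mb_per_s(iops: float, block_size_kb: float) -> float:
    """Bandwidth = IOPS x IO size, returned in megabytes per second."""
    return iops * block_size_kb / 1024.0

def mb_to_mbit(mb_per_s: float) -> float:
    """Convert MB/s to Mb/s (1 byte = 8 bits)."""
    return mb_per_s * 8

# Hypothetical example: 100,000 IOPS at an 8K block size
mb_s = bandwidth_mb_per_s(100_000, 8)
print(f"{mb_s:.0f} MB/s ~= {mb_to_mbit(mb_s):.0f} Mb/s")  # ~781 MB/s, ~6,250 Mb/s
```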

In this blog, we won't dive into SSD technologies such as TLC or 3D NAND. For performance issues with flash storage, we will focus on the aspects that make flash perform better than typical legacy storage systems.

“As long as an SSD lasts five years…and the SSDs are supported under warranty, the type really should not matter.”

How does flash storage perform better than legacy systems?

NAND has a much shorter wear life than its predecessor, the spinning disk. SSD vendors typically use different levels of over-provisioning at the same capacity point to differentiate enterprise and consumer drives. Array vendors, in turn, use different NAND types, for example eMLC and cMLC, and cite different reasons why one type is better than another. The reality is that as long as an SSD lasts five years (or whatever the array vendor claims) and the SSDs are supported under warranty, the type really should not matter.

SSD technology continues to improve: SSDs get faster, denser, and cheaper every year. At the same time, the storage software that runs in the controller for performance and efficiency features such as dedupe and compression keeps getting smarter. As that software matures to accommodate the reduced wear life of newer, denser SSDs, it is what ensures an SSD lasts five or more years.

Writes carry a penalty in MLC flash: a page cannot simply be overwritten in place, so the block containing it must be erased before data can be rewritten. Storage software works around this by collecting multiple writes and committing them to flash as an aligned burst, as sketched below. In addition, features such as inline deduplication and compression let the array write less to flash for duplicate blocks and compressible data.
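Here is a minimal, hypothetical sketch of that write-coalescing idea: small writes accumulate in a buffer and are flushed as one large, aligned segment. Real array firmware is far more sophisticated; the class name and segment size below are illustrative only.

```python
class CoalescingWriteBuffer:
    """Hypothetical sketch: accumulate small writes and flush them to
    flash as one aligned segment rather than rewriting pages in place."""

    def __init__(self, segment_size: int = 1 << 20):  # illustrative 1 MiB segment
        self.segment_size = segment_size
        self.pending: list[bytes] = []
        self.pending_bytes = 0

    def write(self, block: bytes) -> None:
        self.pending.append(block)
        self.pending_bytes += len(block)
        if self.pending_bytes >= self.segment_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        segment = b"".join(self.pending)  # one large, sequential write
        self._write_segment_to_flash(segment)
        self.pending.clear()
        self.pending_bytes = 0

    def _write_segment_to_flash(self, segment: bytes) -> None:
        # Placeholder for the actual media write in a real controller.
        print(f"flushing {len(segment)} bytes as one aligned segment")
```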

Segment-Cleaning/Garbage collection

Arrays with inline dedupe and compression enabled typically run a log-structured file system, and the RAID subsystem continuously scrubs that filesystem to identify stale/dirty blocks and determine which segments can be cleaned and returned to the pool for new writes. A simplified sketch of segment selection follows.
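The sketch below shows, in simplified form, how a segment cleaner might pick its victims: segments whose live-data ratio has fallen below a threshold are cheapest to clean because little data has to be copied forward. The data structure, threshold, and numbers are hypothetical.

```python
def pick_segments_to_clean(segments: dict, live_threshold: float = 0.3) -> list:
    """Hypothetical segment-cleaning pass for a log-structured layout.
    `segments` maps segment id -> (live_blocks, total_blocks)."""
    victims = []
    for seg_id, (live_blocks, total_blocks) in segments.items():
        live_ratio = live_blocks / total_blocks
        if live_ratio < live_threshold:
            victims.append((live_ratio, seg_id))
    # Clean the emptiest segments first.
    return [seg_id for _, seg_id in sorted(victims)]

# Hypothetical pool: segment 1 is still mostly live, 0 and 2 are mostly stale
pool = {0: (10, 256), 1: (200, 256), 2: (60, 256)}
print(pick_segments_to_clean(pool))  # -> [0, 2]
```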

What causes performance issues — and how to troubleshoot them:

The following three tables summarize what causes performance issues — and how to solve them. Click here to download a tip sheet on performance tuning and troubleshooting for flash storage arrays.

In the three tables below, performance issues can be identified and troubleshot as follows:

  1. Inside the storage array
  2. Between the array and the server
  3. Inside the server

1)  Performance Issues Inside the Storage Array
The most common symptom is the CPU inside the array controller running at 100%. The controller CPU has cores assigned to multiple tasks; each task, and the action to take, is described below.

Protocol IO

Typically, when running small-block IO (4K, 8K), every block read or written consumes CPU cycles at the protocol layer.

If the CPU cores allocated to protocol handling are running at 100%, the array cannot ingest IO any faster. As a result, latency increases, which means lower read and write performance. This typically indicates that you need more CPU: you are running the array controllers beyond their capacity to deliver sub-millisecond latency for the IO sent to them.

Dedupe and compression computation

This is usually not an area of concern. Most arrays are designed to benefit when the dataset can be reduced, since a reduced dataset means less data to write to flash.

Segment Cleaning/GC

As the log-structured filesystem (the pool) runs more than 70% full, fragmentation becomes significant and the number of free segments available for landing writes shrinks.

Segment cleaning (GC) then has to do more work, scrubbing dirty blocks across multiple segments to determine which segments can be cleaned and returned to the free pool. This scrubbing typically hurts write performance at higher pool capacities and also drives up CPU usage on the controllers. Expect write latencies to be higher than normal above 70% pool capacity.

Degraded RAID group

If there is an SSD failure in the array, the array could have reduced performance until the failed SSD is replaced and the rebuild is complete.

Unaligned write IO from application

Some SSD-based arrays have variable-block log-structured file systems that allow them to ingest blocks of any size and append them to an aligned RAID-stripe write.

Arrays that don't support this feature create LUNs with a fixed block size, usually selected at LUN creation time. In that case, if the application does not issue aligned writes to the LUNs, the amount of IO the array handles doubles for each partial write, which can overrun the CPU allocated for protocol IO and cause higher latency. Unaligned IO also takes an additional toll on SSD wear leveling and may consume more of the drives' over-provisioned capacity.

Modern operating systems, such as Windows Server 2012 R2 and VMware vSphere, take care of partition alignment, but some older operating systems, such as Windows Server 2003 and older Linux distributions or filesystems, do not align unless partitions and filesystems are created with specific flags. A quick way to check alignment on Linux is sketched below.
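As one way to spot unaligned partitions on a Linux host, the sketch below reads a partition's starting sector from sysfs and checks whether it falls on a 4 KiB boundary. It is a minimal sketch, assuming the kernel reports the start in 512-byte sectors and that the sysfs path exists on your distribution; the disk and partition names are hypothetical.

```python
from pathlib import Path

def partition_start_bytes(disk: str, part: str) -> int:
    """Read a partition's starting sector from Linux sysfs and convert
    to bytes (the kernel reports the start in 512-byte sectors)."""
    start_sector = int(Path(f"/sys/block/{disk}/{part}/start").read_text())
    return start_sector * 512

def is_aligned(disk: str, part: str, boundary: int = 4096) -> bool:
    """True if the partition starts on the given boundary (default 4 KiB)."""
    return partition_start_bytes(disk, part) % boundary == 0

# Hypothetical example on a Linux host:
# print(is_aligned("sda", "sda1"))
```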

2)  Performance Issues Between the Array and the Server

The plumbing between the servers and the array is typically 10GbE Ethernet or 8/16 Gbps Fibre Channel.

Having enough bandwidth/throughput* between the server HBA/CNA and the array target adapters is essential to achieving optimal performance. The following table provides the bandwidth achievable with two HBAs or two NICs per controller.

Tuning your SAN means having enough buffer-to-buffer credits on FC ports, fewer than 3-4 hops, and single-initiator, multi-target zoning. For Ethernet networks, it means proper MTU settings with jumbo frames enabled, port fast disabled, static routes, and so on, for a lossless Ethernet network carrying iSCSI or NFS traffic. Network troubleshooting is a chapter of its own and beyond the scope of this blog.

Always remember that network troubleshooting is a very important part of performance monitoring and tuning.

Line rate                            MB/s                     Mb/s      GB/s
2x 10GbE links on one controller     2,560 (1,280 per link)   20,480    2.5
2x FC8G links on one controller      1,740 (870 per link)     13,920    1.69

*The formula for calculating effective bandwidth required is given in a previous section of this blog.
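Tying the formula and the table together, here is a minimal sketch that checks whether a hypothetical workload fits within the line rates listed above. The workload figures and function names are illustrative.

```python
# Line rates from the table above, in MB/s per controller (two links each).
LINE_RATE_MB_S = {"2x 10GbE": 2_560, "2x FC8G": 1_740}

def required_bandwidth_mb_s(iops: float, block_size_kb: float) -> float:
    """Effective bandwidth required: IOPS x IO size (see formula above)."""
    return iops * block_size_kb / 1024.0

def links_with_headroom(iops: float, block_size_kb: float) -> dict:
    """Which link configurations can carry this workload?"""
    need = required_bandwidth_mb_s(iops, block_size_kb)
    return {name: need <= rate for name, rate in LINE_RATE_MB_S.items()}

# Hypothetical workload: 150,000 IOPS at 8K needs roughly 1,172 MB/s
print(links_with_headroom(150_000, 8))  # both configurations have headroom
```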

3)  Performance Issues Inside the Server

Server

Servers with four or more sockets have PCIe buses attached to specific sockets, and DRAM with affinity to those sockets. It is recommended that you do NOT round-robin across ports that have affinity to different sides of the NUMA interconnect, because the adapters then flood interrupt queues on the remote CPUs and cause IO contention.

In other words, put all your adapters into PCIe buses that have affinity to sockets on the same side of the NUMA interconnect. On Linux, let irqbalance handle interrupts. Hyper-V and vSphere take care of this automatically when VMs are created.

OS multipathing: follow the best practices provided by your array vendor for multipathing and other tunables (e.g., udev rules on Linux). Windows provides perfmon for performance analysis; Linux provides tools such as iostat, vmstat, and sar to monitor performance. A small sketch of checking device latency with iostat follows.
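As a rough example of putting iostat to work, the sketch below runs a single extended report and pulls out the await columns per device. It is a minimal sketch, assuming sysstat is installed; because column names containing "await" vary across sysstat versions, they are located from the header row rather than hard-coded.

```python
import subprocess

def device_wait_times():
    """Hypothetical sketch: run `iostat -x 1 1` (sysstat) and return the
    await columns per device. Column names differ across sysstat
    versions, so they are located by inspecting the header row."""
    out = subprocess.run(["iostat", "-x", "1", "1"],
                         capture_output=True, text=True, check=True).stdout
    rows = [line.split() for line in out.splitlines() if line.strip()]
    header = next(r for r in rows if r[0] in ("Device", "Device:"))
    wait_cols = [i for i, col in enumerate(header) if "await" in col.lower()]
    stats = {}
    for row in rows[rows.index(header) + 1:]:
        if len(row) == len(header):
            stats[row[0]] = [float(row[i]) for i in wait_cols]
    return stats

# Hypothetical usage on a Linux host with sysstat installed:
# print(device_wait_times())
```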

Hypervisor Tuning

There are infodocs available from VMware on tuning HBA settings. Additionally, the Tegile vSphere plugin will tune the Tegile array automatically for you.

ESXtop, a utility on vSphere, is an exceptional tool for analyzing latency at the hypervisor level; it also exposes VAAI statistics so you can determine whether VAAI is doing its job.

InGuest tuning

There are certain guidelines available to tune the guest OS for lower CPU use, for example, using PVSCSI instead of LSI Logic for SCSI controllers in a VMware virtual machine.

On Linux, udev rules and creating filesystems at specific offsets help produce aligned IO.

The same tools available in the host operating system are also available inside the guest: perfmon for Windows, and iostat, vmstat, and sar for Linux, to monitor guest performance.

Expect significant performance gains with flash storage

Flash storage has revolutionized primary storage in today's data centers by providing faster, cheaper storage. SSDs are getting denser, better, and cheaper every year and are expected to displace spinning disk entirely. If you use the above guidelines to troubleshoot performance issues, you can expect vast improvements in your storage performance that are sure to translate into an improved bottom line for your company.

If you want to know what Tegile flash storage can do for performance in your data center, we invite you to request a demo today.

