Who’s on First – Compress or Dedupe?

Most new-generation storage arrays come with data reduction technologies. For all-flash storage arrays, inline data compression and dedupe are table-stakes features. Once in a while, inquisitive minds ask: should the data be compressed first and then deduped, or the other way round? The answer, as with most non-trivial questions, is that it depends.

Let’s begin with a couple of basics. With most compression algorithms, compressing two identical data blocks yields two identical compressed blocks. Dedupe involves calculating and comparing cryptographic checksums for data blocks; the data block itself remains unchanged.

There are two aspects to data reduction – 1) amount of data reduction, and 2) overall compute efficiency for data reduction.

As far as the amount of data reduction in storage arrays is concerned, compressing first or deduping first does not change the overall amount of data reduction.

  • Consider three data blocks “AAAAAAAA”, “BBBBBBBB” and “AAAAAAAA”.
  • Let’s assume that compressing block “AAAAAAAA” yields “XXXX” and compressing block “BBBBBBBB” yields “YYYY”.
  • Let’s also assume that the checksum for “XXXX” is “C1” and the checksum for “YYYY” is “C2”.
  • After the three blocks go through compression and dedupe, one copy of “XXXX” and one copy of “YYYY” will be stored (with the requisite metadata updates), regardless of whether the blocks are compressed or deduped first. Identical blocks compress to identical output, so the duplicate “AAAAAAAA” is caught either before or after compression.
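The three-block example above can be sketched in a few lines of Python. This is only an illustration, not how any particular array implements it: `zlib` and SHA-256 stand in for whatever compressor and checksum a real array uses, and the function names `compress_then_dedupe` and `dedupe_then_compress` are made up for this sketch.

```python
import hashlib
import zlib


def compress_then_dedupe(blocks):
    """Compress every block, then keep one copy per checksum of the compressed bytes."""
    store = {}
    for block in blocks:
        compressed = zlib.compress(block)
        store.setdefault(hashlib.sha256(compressed).hexdigest(), compressed)
    return store


def dedupe_then_compress(blocks):
    """Keep one copy per checksum of the raw bytes, then compress the survivors."""
    unique = {}
    for block in blocks:
        unique.setdefault(hashlib.sha256(block).hexdigest(), block)
    return {key: zlib.compress(raw) for key, raw in unique.items()}


blocks = [b"AAAAAAAA", b"BBBBBBBB", b"AAAAAAAA"]
compress_first = compress_then_dedupe(blocks)
dedupe_first = dedupe_then_compress(blocks)

# The checksum keys differ (one order hashes compressed bytes, the other raw
# bytes), but the stored payloads are the same two compressed blocks.
print(sorted(compress_first.values()) == sorted(dedupe_first.values()))
print(len(compress_first), len(dedupe_first))
```

Either order ends up storing the same two compressed payloads; only the metadata (which bytes the checksum was computed over) differs.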

Before we address the question of CPU efficiency for data reduction, we should consider two different approaches to dedupe: 1) use a strong cryptographic checksum and trust a match outright, or 2) use an initial weak checksum to find candidate duplicates, followed by a post-process strong checksum (or a byte compare) to confirm them.
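To make the difference between the two approaches concrete, here is a minimal sketch. It assumes SHA-256 as the strong checksum and CRC-32 as the weak one, and the class names `StrongDedupeStore` and `WeakDedupeStore` are hypothetical. For brevity the weak-checksum variant confirms a match with a byte compare at write time; a real array may defer that confirmation to a post-process pass.

```python
import hashlib
import zlib


class StrongDedupeStore:
    """Approach 1: a matching strong checksum (SHA-256 here) is treated as a duplicate."""

    def __init__(self):
        self.blocks = {}  # strong checksum -> stored block

    def write(self, block):
        """Store the block if unseen; return True when new data was stored."""
        key = hashlib.sha256(block).digest()
        if key in self.blocks:
            return False
        self.blocks[key] = block
        return True


class WeakDedupeStore:
    """Approach 2: a cheap checksum (CRC-32 here) finds candidates; a byte compare confirms."""

    def __init__(self):
        self.buckets = {}  # weak checksum -> list of distinct blocks sharing it

    def write(self, block):
        """Store the block if unseen; return True when new data was stored."""
        candidates = self.buckets.setdefault(zlib.crc32(block), [])
        if any(existing == block for existing in candidates):
            return False  # confirmed duplicate
        candidates.append(block)  # new block, or a weak-checksum collision
        return True


writes = [b"AAAAAAAA", b"BBBBBBBB", b"AAAAAAAA"]
strong, weak = StrongDedupeStore(), WeakDedupeStore()
strong_results = [strong.write(b) for b in writes]
weak_results = [weak.write(b) for b in writes]
print(strong_results, weak_results)
```

Both stores reject the third write as a duplicate. The trade-off is where the CPU goes: the first approach pays for a strong checksum on every block, while the second pays less per block but needs a second check whenever weak checksums match.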

Tegile takes the first approach and uses a strong cryptographic checksum. Compression is more CPU-efficient than calculating strong cryptographic checksums for deduplication, and compressing first leaves fewer bytes to checksum. Hence it is more efficient for us to compress first and then dedupe the smaller set.
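The "fewer bytes to dedupe" point can be illustrated by counting how many bytes the strong checksum has to process in each order. The sketch below is not Tegile's implementation: it assumes `zlib` as the compressor, SHA-256 as the checksum, and a 4 KiB block size, and the helper `bytes_hashed` is hypothetical. It counts bytes only; it does not measure CPU time, which depends on the actual algorithms and hardware.

```python
import hashlib
import zlib


def bytes_hashed(blocks, compress_first):
    """Return (bytes fed to the strong checksum, number of unique blocks found)."""
    total = 0
    seen = set()
    for block in blocks:
        # Compress-first hashes the compressed payload; dedupe-first hashes raw data.
        payload = zlib.compress(block) if compress_first else block
        total += len(payload)
        seen.add(hashlib.sha256(payload).digest())
    return total, len(seen)


# Three highly compressible 4 KiB blocks, two of them identical.
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]

raw_bytes, raw_unique = bytes_hashed(blocks, compress_first=False)
small_bytes, small_unique = bytes_hashed(blocks, compress_first=True)

print(raw_bytes, raw_unique)      # checksum work when deduping first
print(small_bytes, small_unique)  # checksum work when compressing first
```

Both orders find the same two unique blocks, but compressing first feeds far fewer bytes to the checksum when the data compresses well. The flip side is that compress-first also compresses blocks that later turn out to be duplicates, so the net win depends on how cheap compression is relative to the checksum in a given architecture.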

The bottom line is that there are several ways to do data reduction. Within a given architecture, what matters is delivering maximum data reduction in a CPU-efficient manner, regardless of the order of data reduction operations.

When it comes to data reduction, what really matters is whether it runs inline or as a post-process job. Check out Chris Naddeo’s blog on this.
