Compress First or Hash First

All modern storage filers and arrays deploy various data reduction mechanisms, with compression and deduplication playing the dominant roles. Inline deduplication, which first gained wide use in backup and archival storage, has become one of the prominent data reduction techniques on primary storage as well.

On primary storage systems, fixed-block deduplication is more prevalent than variable-block deduplication because of latency constraints and the wide variation in data access patterns and usage models (far wider than with traditional backup). The penalty of variable-block deduplication outweighs the additional savings it would reap.

For systems that use variable-block chunking (and hash computation), compression, if enabled, is the next logical step after chunking and hashing. If the data were compressed before it is chunked and hashed, the deduplication rate would suffer.
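To make that ordering concrete, here is a minimal sketch of a variable-block pipeline, assuming a toy rolling-sum chunker and zlib standing in for whatever compressor the array actually uses; nothing here reflects a specific product. The raw data is chunked and hashed first, and only chunks that are not already in the store get compressed. Compressing the stream up front would shift the content-defined chunk boundaries and break the matching.

# Minimal sketch of a variable-block (content-defined) dedupe pipeline.
# The chunker is a toy rolling-sum boundary detector, not a production CDC
# algorithm; the point is only the ordering: chunk and hash the raw data
# first, then compress the unique chunks.
import hashlib
import zlib  # stand-in for the store's real compressor

def chunk(data, mask=0x3FF, window=48, min_size=2048, max_size=16384):
    """Yield variable-size chunks using a toy rolling-sum boundary test."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling = rolling + byte - (data[i - window] if i >= window else 0)
        size = i + 1 - start
        if ((rolling & mask) == mask and size >= min_size) or size >= max_size:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

def ingest(data, store):
    """Hash each raw chunk; compress only chunks not already in the store."""
    for piece in chunk(data):
        digest = hashlib.sha256(piece).hexdigest()
        if digest not in store:                   # dedupe decision on raw bytes
            store[digest] = zlib.compress(piece)  # compression comes last
    return store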

That isn't the case with fixed-block deduplication, though: the order does not affect the data reduction rate, because two identical fixed-size blocks compress to identical output, so their hashes match whether the blocks are hashed before or after compression. Some primary-storage vendors may try to convince you otherwise, but it doesn't.

To clear things up, I've included a performance characterization of hash computation and compression, measured on a real system shipped by Tegile (table below). Assuming everything else is equal, the efficiency of performing the data reduction, i.e., the time it takes, should be the determining factor in whether to compress first or hash first. When deduplication is done first, fewer bytes need to be compressed, in proportion to the hit rate. Similarly, when the system compresses the data first, fewer net bytes need to be hashed, in proportion to the compression rate.

The following attempts to gauge the impact of compressing first versus hashing first:

CF – Compress First
HF – Hash First
c – cost of compression per unit
h – cost of hashing per unit
X – number of units of data to process

For a concrete comparison, I've used real numbers: X is 1000 units, the sha256 figure is used for h, and the LZ4 figure is used for c (see the table at the end of this post). The number in brackets is the time to process those units; the lower the time, the better the approach. A short script that reproduces all eight cases follows the case list.

Case 1: 0% Dedupe, 0% Compression

CF: X*c + X*h [4760 ns]
HF: X*h + X*c [4760 ns]
Case Verdict: CF and HF are equal

Case 2: 0% Dedupe, 50% Compression

CF: X*c + X/2*h [2940 ns]
HF: X*h + X*c [4760 ns]
Case Verdict: CF is better

Case 3: 50% Dedupe, 50% Compression

CF: X*c + X/2*h [2940 ns]
HF: X*h + X/2*c [4200 ns]
Case Verdict: CF is better

Case 4: 50% Dedupe, 0% Compression

CF: X*c + X*h [4760 ns]
HF: X*h + X/2*c [4200 ns]
Case Verdict: HF is better

Case 5: 80% Dedupe, 0% Compression

CF: X*c + X*h [4760 ns]
HF: X*h + X/5*c [3864 ns]
Case Verdict: HF is better

Case 6: 80% Dedupe, 25% Compression

CF: X*c + 3X/4*h [3850 ns]
HF: X*h + X/5*c [3864 ns]
Case Verdict: CF is better

Case 7: 80% Dedupe, 50% Compression

CF: X*c + X/2*h [2940 ns]
HF: X*h + X/5*c [3864 ns]
Case Verdict: CF is better

Case 8: 80% Dedupe, 75% Compression

CF: X*c + X/4*h [2030 ns]
HF: X*h + X/5*c [3864 ns]
Case Verdict: CF is better
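The arithmetic behind these cases is easy to check in code. Below is a minimal Python sketch that plugs in the measured per-unit costs from the table at the end of this post (3.64 ns for sha256, 1.12 ns for LZ4) and recomputes both totals for each dedupe/compression pair; the variable and function names are mine, not anything from a shipping system.

# Recompute the CF/HF timings above from the measured per-unit costs.
H = 3.64   # sha256 cost per unit, ns (from the table below)
C = 1.12   # LZ4 cost per unit, ns (from the table below)
X = 1000   # units of data

def compress_first(compression_rate):
    # Compress all X units, then hash whatever survives compression.
    # (In this model the CF cost does not depend on the dedupe rate.)
    return X * C + X * (1 - compression_rate) * H

def hash_first(dedupe_rate):
    # Hash all X units, then compress only the unique (non-deduped) data.
    # (In this model the HF cost does not depend on the compression rate.)
    return X * H + X * (1 - dedupe_rate) * C

cases = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 0.0),
         (0.8, 0.0), (0.8, 0.25), (0.8, 0.5), (0.8, 0.75)]
for dedupe, compression in cases:
    cf, hf = compress_first(compression), hash_first(dedupe)
    winner = 'CF' if cf < hf else 'HF' if hf < cf else 'tie'
    print(f"{dedupe:.0%} dedupe, {compression:.0%} compression: "
          f"CF {cf:.0f} ns, HF {hf:.0f} ns -> {winner}")

Running it reproduces the bracketed timings above, including the near-tie of Case 6 (3850 ns versus 3864 ns).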

As you can see from these eight scenarios, if the data can be compressed at all, it is best to compress first, because compression is much faster than the hash computation. In the comparison above, even when the data is 80% deduplicable, compressing first comes out ahead as long as the compression rate is 25% or better.

But does that conclusion hold if the system performs a quick weak checksum first and then a full compare as a post-process? The answer is not so straightforward. As with many non-trivial things in life, it depends.

In that scheme, whenever the weak checksum matches an existing block, a full data compare is needed. If the data needed for the comparison is not in cache (which is likely, given the extremely random distribution of hashed blocks), the compare incurs additional I/O and latency, which makes the worst case highly nondeterministic. In addition, the number of matches grows with the pool of data, requiring more memory comparisons. The real impact is therefore hard to characterize; it depends largely on the data type and the state of the system.
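To illustrate that path, here is a toy, in-memory sketch of a weak-checksum index with full-compare verification; read_block() stands in for what would be an extra disk read on a real array when the candidate block is not cached, and the class layout is purely illustrative.

# Toy sketch of dedupe with a weak checksum plus full-compare verification.
import zlib

class DedupeStore:
    """Weak-checksum index with byte-for-byte verification (all in memory)."""
    def __init__(self):
        self.blocks = {}     # addr -> raw block bytes ("on disk")
        self.cache = {}      # addr -> raw block bytes that happen to be cached
        self.index = {}      # weak checksum -> [addr, ...]
        self.next_addr = 0

    def read_block(self, addr):
        # On a real array this is an extra I/O when the block is not cached.
        return self.blocks[addr]

    def write(self, data):
        weak = zlib.adler32(data)                  # cheap checksum first
        for addr in self.index.get(weak, []):      # every match needs verifying
            candidate = self.cache.get(addr)
            if candidate is None:
                candidate = self.read_block(addr)  # cache miss: added latency
                self.cache[addr] = candidate
            if candidate == data:                  # full byte-for-byte compare
                return addr                        # true duplicate, reference it
        addr = self.next_addr                      # collision or new data
        self.next_addr += 1
        self.blocks[addr] = data
        self.index.setdefault(weak, []).append(addr)
        return addr

The cost sits in the loop: every block whose weak checksum happens to match has to be fetched and compared, so the penalty grows with both the match rate and how much of the candidate data is out of cache.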

What's more, the choice between compressing first and deduplicating first mostly affects CPU consumption, and since most enterprise storage systems today have plenty of CPU, the advantage, if any, is even less pronounced.

Compression and Hash Computation Performance Characterization

System configuration
Intel® Xeon® CPU E5-2470 v2 @ 2.40GHz with 98 GB of memory. Timing was measured with rdtsc counters and calibrated in nanoseconds (hrtime).

A total of 50 GB of data was processed in 4 KB chunks. The random data was first generated in a 16 GB buffer, which was rolled over when the end was reached. Processing was done in a single thread.

Algorithm                Time per byte (ns)    Throughput (MB/s)
sha256 (SSE optimized)   3.64                  274
sha512 (SSE optimized)   2.53                  394
LZ4 compression          1.12*                 892*

* Compression speed depends somewhat on the data being compressed, so results will vary with the data profile. In this test, random data was used.
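For anyone who wants to reproduce the general shape of these numbers, a simple single-threaded harness along the following lines will do. It uses Python's hashlib and the third-party lz4 package (pip install lz4), with perf_counter_ns in place of raw rdtsc, and the buffer and total sizes are scaled down for a quick run, so the absolute figures will not match the table; the ratio between hashing and compression cost is what matters.

# Rough single-threaded timing harness for per-byte hash vs. compression cost.
import hashlib
import os
import time
import lz4.frame

CHUNK = 4096
TOTAL = 256 * 1024 * 1024           # scale up toward 50 GB for a serious run
buf = os.urandom(16 * 1024 * 1024)  # random source buffer, rolled over

def run(label, fn):
    processed, off, start = 0, 0, time.perf_counter_ns()
    while processed < TOTAL:
        if off + CHUNK > len(buf):
            off = 0                 # roll the buffer over at the end
        fn(buf[off:off + CHUNK])
        off += CHUNK
        processed += CHUNK
    elapsed = time.perf_counter_ns() - start
    print(f"{label}: {elapsed / processed:.2f} ns/byte, "
          f"{processed / elapsed * 1e9 / 2**20:.0f} MB/s")

run("sha256", lambda b: hashlib.sha256(b).digest())
run("sha512", lambda b: hashlib.sha512(b).digest())
run("lz4", lambda b: lz4.frame.compress(b))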

