Thursday, September 6, 2007

What Is the Difference Between Data Deduplication, File Deduplication, and Data Compression?

Data deduplication is one of the hottest topics in storage. eWEEK IT expert W. Curtis Preston, vice president of Data Protection for Glasshouse Technologies, explains how it differs from other storage technologies.


Q: Can you explain the differences between compression, file deduplication and data deduplication?
A: All of these products fit into an overall market and technical concept, which is capacity optimization or data reduction. This refers to a broad group of products that seek to reduce the amount of data that has to be stored. Roughly speaking, you can rank these techniques by the amount of data reduction they yield. Compression might typically get you a 2-to-1 reduction. File deduplication, which is commonly known as content addressable storage or CAS, might yield a 3-to-1 or 4-to-1 reduction. But data deduplication—which is deduplication at the level of individual disk blocks or "chunks" rather than entire files—can often give you a 20-to-1 reduction or better, depending on the type of data. Remember, we're talking about the aggregate reduction in the total amount of data stored on your backup storage device, not necessarily the reduction in any particular file or block, which can vary considerably.

Q: Why is data deduplication so much more effective in reducing data than file deduplication?
A: Data deduplication examines all your data on the block level and eliminates redundant blocks. So obviously it will take care of entire files that are redundant, but unlike file deduplication it will also eliminate the redundant pieces that occur when many slightly different versions of the same file are created by users or by applications like Microsoft Exchange. If users have been e-mailing back and forth a PowerPoint file while making minor changes, you can end up storing 10 or 20 files whose content is 95 percent identical. Data deduplication will catch that.
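To make the block-level idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works: the 4 KB fixed block size, the use of SHA-256 to identify blocks, and the names `BlockStore`, `split_blocks`, and `make_versions` are all assumptions made for the example. Commercial products often use variable-size chunks, because with fixed-size blocks an insertion near the start of a file shifts every later block boundary.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut data into consecutive fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockStore:
    """Toy block-level deduplicating store keyed by SHA-256 digest (hypothetical)."""

    def __init__(self) -> None:
        self.blocks = {}  # digest -> block content, stored once
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if this content has not been seen before.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        self.files[name] = recipe

    def get(self, name: str) -> bytes:
        """Rebuild a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())


def make_versions():
    """Two 20-block files that differ only in their last block (95 percent identical)."""
    base = b"".join(bytes([i]) * BLOCK_SIZE for i in range(20))
    edited = base[:-BLOCK_SIZE] + b"\xff" * BLOCK_SIZE
    return base, edited
```

Storing both versions from `make_versions()` in one `BlockStore` keeps 21 unique blocks instead of 40, and `get()` still returns each version intact. A file-level scheme would see two different files and keep both in full.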

Q: When should you use data deduplication and when should you use file deduplication?
A: A very short answer would be that file deduplication is often used for backup solutions in so-called ROBO environments (remote office, branch office). Data deduplication can be used either in the data center itself, as a software function installed on the intelligent disk target, or on the backup client side in a ROBO environment.

Q: Who are some of the more commonly used data deduplication vendors?
A: There are plenty of vendors, because data deduplication is a very hot area these days, especially now that the VTL (virtual tape library) vendors are getting involved. There are Avamar (acquired by EMC), Symantec PureDisk, Asigra, Data Domain, Diligent Technologies, FalconStor, Sepaton and Quantum. Network Appliance has a product in beta.

Q: Who are some of the more commonly used file deduplication or content addressable storage vendors?
A: EMC has the Centera product line. Then there is Archivas (recently acquired by Hitachi Data Systems) and Caringo.

Q: What accounts for the difference in yield between compression and file deduplication?
A: With compression you are using an algorithm to reduce the size of a particular file by eliminating redundant bits within it. But if your users or applications have stored the same file multiple times, then no matter how good your compression method is, your backup storage will end up with multiple copies of the compressed file. File deduplication goes a step further and eliminates these redundant copies, storing only one. So it gives you more reduction than compression alone.
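The contrast can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `compress_each` and `file_dedup` are hypothetical helper names, `zlib` stands in for whatever compression a product uses, and SHA-256 stands in for the content address a CAS system would compute.

```python
import hashlib
import zlib


def compress_each(files: dict) -> int:
    """Bytes stored when every file is compressed on its own.

    Identical files still cost one compressed copy each.
    """
    return sum(len(zlib.compress(data)) for data in files.values())


def file_dedup(files: dict):
    """Content-addressed storage: keep one object per distinct file content.

    Returns (objects, index), where objects maps digest -> content and
    index maps file name -> digest.
    """
    objects = {}
    index = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        objects.setdefault(digest, data)  # identical content is stored once
        index[name] = digest
    return objects, index
```

If three users each save the same report, `compress_each` pays for three compressed copies, while `file_dedup` keeps a single object and three index entries that point to it. Nothing prevents compressing that single object as well.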

Q: Where does delta block optimization fit in?
A: This is another capacity optimization technique. It's used by incremental remote backup products like Connected (acquired by Iron Mountain) and EVault (acquired by Seagate). When you go to back up the most recent version of a file that has already been backed up, the software looks at it and tries to figure out which blocks are new. Then it writes only these blocks to backup and ignores the blocks in the file that haven't changed. But like compression, this technique has a shortcoming compared with file deduplication. If two users sitting in the same office have identical copies of the same file, then delta block optimization will create two identical backups instead of storing just one, as file deduplication would.
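As a rough sketch of the idea, the Python below compares a file against its own previous backup, block by block, and returns only the blocks that differ. It is not the method used by Connected or EVault; the 4 KB block size, the SHA-256 comparison, and the function names `block_digests` and `delta_blocks` are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """SHA-256 digest of each consecutive fixed-size block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def delta_blocks(previous: bytes, current: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Return {block index: block bytes} for blocks that are new or changed.

    The comparison is only against the previous backup of the same file,
    so identical files owned by different users are not detected.
    """
    old = block_digests(previous, block_size)
    changed = {}
    for index, start in enumerate(range(0, len(current), block_size)):
        block = current[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(old) or old[index] != digest:
            changed[index] = block
    return changed
```

Editing one block of a ten-block file yields a delta with a single entry. A first backup, which is `delta_blocks(b"", data)`, writes every block, and it does so separately for each user who holds the same file.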