Drilling Down Into Dedupe
Data deduplication has become one of the most widely deployed features in the storage industry, with storage vendors continuing to find new places to add the technology.
It’s a move that seems odd at first glance: with deduplication, customers can be counted on to purchase less storage capacity.
Deduplication, also called ‘dedupe,’ removes duplicate information as data is stored, backed up or archived. It can be done at the file level, where duplicate files are replaced with a marker pointing to one copy of the file, or at the subfile or byte level, where duplicate bytes of data are removed and replaced by pointers. Either way, the result is a significant decrease in storage capacity requirements.
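To make the file-level approach concrete, here is a minimal Python sketch, a hypothetical illustration rather than any vendor's implementation, that keeps one copy of each unique file keyed by a hash of its contents and records only pointers for duplicates:

    import hashlib

    class FileDedupeStore:
        """Toy file-level dedupe: one stored copy per unique content hash."""

        def __init__(self):
            self.blobs = {}     # content hash -> file bytes, stored once
            self.pointers = {}  # file name -> content hash (the 'marker' to one copy)

        def put(self, name, data):
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = data   # first time this content is seen: keep it
            self.pointers[name] = digest    # duplicates cost only a pointer

        def get(self, name):
            return self.blobs[self.pointers[name]]

    store = FileDedupeStore()
    store.put("report_v1.doc", b"quarterly numbers ...")
    store.put("report_copy.doc", b"quarterly numbers ...")   # identical content
    print(len(store.blobs), "copy kept for 2 files")          # 1 copy kept for 2 files

Subfile and byte-level dedupe apply the same hash-and-pointer idea to pieces of files rather than whole files.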
Logically, dedupe is not the kind of technology storage vendors would be expected to embrace. By adding dedupe to their storage arrays and other appliances, the vendors are actually making it possible for their customers to purchase less capacity.
However, they have no choice. Thanks to the success of pioneers in the technology such as Data Domain (acquired by EMC) and Quantum, both of which have successful lines of stand-alone dedupe appliances, other vendors need to follow suit as customers look for ways to squeeze more efficiency from their storage infrastructures.
Dedupe products can be classified in several different ways.
The first is according to where the dedupe process takes place.
Source dedupe removes duplicates before the data is sent across a LAN or WAN. This means fewer files and less data travel over the network, but backup performance can suffer because of the processing overhead of deduping at the source. With newer high-performance processors, however, this is less of an issue than it has been in the past.
Target dedupe starts the dedupe process after the data is copied onto a destination device such as a virtual tape library. This takes away any overhead related to deduping data at the source but requires more storage capacity at the target to temporarily store the entire data set.
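The network-traffic difference between the two approaches can be sketched in a few lines of Python. This is a purely illustrative example: it assumes a made-up protocol in which the backup client consults an index of chunk hashes the target already holds and ships only the chunks the target has not seen.

    import hashlib

    def chunk(data, size=4096):
        """Split data into fixed-size pieces for hashing (boundaries vary in real products)."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def source_side_backup(data, target_index, network_link):
        """Dedupe at the source: only chunks unknown to the target cross the LAN/WAN."""
        for piece in chunk(data):
            digest = hashlib.sha256(piece).hexdigest()
            if digest not in target_index:   # check the target's index before sending
                network_link.append(piece)   # new data crosses the network
                target_index[digest] = piece
            # duplicates are represented by their hash alone, so nothing else is sent

    target_index, link = {}, []
    source_side_backup(b"A" * 8192 + b"B" * 4096, target_index, link)   # first backup: 3 chunks
    source_side_backup(b"A" * 8192, target_index, link)                 # repeat data: 2 chunks
    print(len(link), "unique chunks crossed the network out of 5 processed")   # prints 2

With target dedupe, all five chunks would cross the link, and the duplicates would be discarded only after landing on the destination device.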
The second is according to when the dedupe process occurs.
With in-line dedupe, data is deduped as it is written to the device. This adds processing overhead to the write path but does not require extra storage capacity to stage incoming data.
With post-process technology, data is written to the target device in full and deduped afterward. This requires extra capacity to temporarily store the incoming files before they are deduped, but it keeps the dedupe work out of the ingest path during the backup itself.
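The timing difference can also be sketched in Python; the function names here are invented for illustration. In-line dedupe consults the index before anything is committed to disk, while post-process dedupe lets the data land in full and reclaims space in a later pass.

    import hashlib

    def fingerprint(block):
        return hashlib.sha256(block).hexdigest()

    def inline_write(block, disk, index):
        """In-line: dedupe on the write path; no staging space, but CPU cost per write."""
        key = fingerprint(block)
        index.setdefault(key, block)   # store the bytes only the first time they appear
        disk.append(key)               # the device holds unique blocks plus pointers

    def post_process_pass(staging, disk, index):
        """Post-process: data already landed in full; dedupe runs later as a batch job."""
        for block in staging:          # the staging area had to hold every incoming block
            key = fingerprint(block)
            index.setdefault(key, block)
            disk.append(key)
        staging.clear()                # capacity is reclaimed only after the pass finishes

    disk, index = [], {}
    post_process_pass([b"block-1", b"block-1", b"block-2"], disk, index)
    print(len(index), "unique blocks kept out of 3 written")   # prints 2

The trade-off is visible in the sketch: in-line pays the hashing cost on every write but never needs the staging area, while post-process needs staging capacity equal to the incoming data but keeps that work out of the backup itself.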
Dedupe is a big market. In a recent survey of IT users, research firm IDC found that more than 60 percent of respondents are either already deduping data or plan to do so in the coming year.
Vendors are implementing dedupe in several different ways.
Dedupe originally was developed using stand-alone appliances, mainly by Data Domain, which brought the technology to the mainstream storage market, and Quantum, which got dedupe technology with its acquisition of ADIC.
Mainline and second-tier storage hardware vendors have added dedupe technologies to many of their newer midrange and enterprise storage arrays.
Also, data protection software vendors including CA Technologies, CommVault and Symantec offer dedupe. Symantec in 2010 moved from a software-exclusive focus to produce its first hardware appliance based on its dedupe technology.
While dedupe is commonly done on data sent to tape or to virtual tape libraries as a way of cutting the capacity of backup data, NetApp and several smaller startup vendors are offering technology to dedupe primary storage. This use of dedupe technology is still relatively uncommon as customers are concerned about the possible performance hits to storage stemming from the dedupe process.
For many vendors, dedupe technology was the result of acquisitions.
For instance, Quantum acquired ADIC in 2006. That same year, EMC acquired dedupe software maker Avamar and followed that up three years later with its $2.1 billion acquisition of market leader Data Domain.
EMC had to wrest control of Data Domain from archrival NetApp, which gave up its own attempt to acquire the dedupe vendor after a prolonged bidding war. Since then, NetApp has focused on dedupe for its primary storage appliances.
2010 saw a flurry of activity as storage vendors went on a dedupe buying spree, including Dell’s acquisition of Ocarina Networks, EMC’s acquisition of Bus-Tech, and IBM’s acquisition of Storwize, which followed IBM’s 2008 purchase of Diligent.
Solution providers said that, despite the variety of ways customers can use dedupe technology, the stand-alone hardware appliance still has a strong appeal over software or array alternatives.
Using stand-alone appliances has the greatest potential for increasing dedupe performance, said Keith Norbie, vice president of sales at Nexus Information Systems, a Minnetonka, Minn.-based solution provider.
A dedupe appliance from a manufacturer such as Data Domain depends on processor and memory performance to dedupe data and push it through to disk, so the CPU and memory become the bottleneck, Norbie said. ‘No matter how fast the disk is, it all goes through the CPU and memory,’ he said. ‘The good news is, the faster the Intel processor is, the faster the performance.’
The performance of dedupe built into storage arrays, by contrast, depends on the number of spindles, or physical hard drives, Norbie said. ‘To increase performance, you need to add spindles, even if you already have excess capacity,’ he said.
The big choice for best dedupe performance, then, comes down to technology that is CPU-bound or spindle-bound, Norbie said. ‘And for now, CPUs are sizzling,’ he said.
Another important factor often overlooked in dedupe performance is the granularity of the process, Norbie said. Companies such as Data Domain and Quantum offer variable block size dedupe, which automatically adjusts the size of the blocks of data that can be examined for duplications.
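Variable block dedupe is generally built on some form of content-defined chunking, in which a rolling checksum over the data, rather than a fixed offset, decides where each block ends. The Python sketch below uses a deliberately simplified checksum (not Data Domain's or Quantum's actual algorithm) to show why this matters: after a small insertion at the front of a file, most chunk boundaries still fall in the same places relative to the content, so most chunks match the previous backup and dedupe away, where fixed-size blocks would all shift and miss.

    import os

    def content_defined_chunks(data, mask=0x3FF, min_size=256, max_size=8192):
        """Cut a chunk wherever a toy rolling checksum hits zero under `mask`.

        Because the boundary decision depends only on the most recent bytes,
        inserting data near the start of a file shifts only nearby chunks;
        later chunks line up with earlier backups and can be deduped."""
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            rolling = ((rolling << 1) + byte) & 0xFFFFFFFF   # cheap stand-in for a real rolling hash
            size = i - start + 1
            if ((rolling & mask) == 0 and size >= min_size) or size >= max_size:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    original = os.urandom(50_000)
    edited = b"new header " + original                 # small insertion at the front
    seen = set(content_defined_chunks(original))       # chunks from the first backup
    second = content_defined_chunks(edited)
    matches = sum(1 for c in second if c in seen)
    print(matches, "of", len(second), "chunks duplicate the first backup")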
With processing power the key factor in determining dedupe performance, the stand-alone appliance is still the primary choice, said Michael Spindler, data protection practice manager at Datalink, a Chanhassen, Minn.-based solution provider.
Spindler said solution providers should not dwell too much on the difference in performance of appliances from different vendors.
‘We’ve recently seen that Quantum, because of the increase in Intel processor performance and the number of cores in the processor, was able to double the dedupe performance over its previous models,’ he said. ‘This gives it better performance than Data Domain now. But six to eight months later, Data Domain will get faster.’
Good dedupe appliances also make it easier to automate the dedupe process, Spindler said. ‘In the midterm backup space, say 5 TB to 20 TB, the appliance really offers more of a set-it-and-forget-it process,’ he said.
Spindler said that, in his experience, variable block dedupe gives a boost in terms of capacity reduction, but not a significant one. For instance, he said, increasing the dedupe data compression rate from 10:1 to 12:1 may give back only about 4 percent of disk space.
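A quick back-of-the-envelope calculation illustrates why higher ratios give diminishing returns; the numbers below are hypothetical, and the percentage recovered depends heavily on the data set and on the baseline against which it is measured.

    def stored_tb(protected_tb, dedupe_ratio):
        """Capacity left on disk after dedupe at a given ratio."""
        return protected_tb / dedupe_ratio

    protected = 100                       # TB of backup data before dedupe (hypothetical)
    at_10 = stored_tb(protected, 10)      # 10.0 TB on disk at 10:1
    at_12 = stored_tb(protected, 12)      # about 8.3 TB on disk at 12:1
    extra = at_10 - at_12
    print(f"{extra:.1f} TB of extra space recovered, "
          f"{extra / at_10:.0%} of the deduped footprint but only "
          f"{extra / protected:.1%} of the protected data")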
Spindler also said solution providers and customers should remember, when looking at dedupe technologies, that vendors can claim anything they want about dedupe data compression.
‘Most vendors make outrageous claims, including some who say they offer up to 500:1 dedupe,’ he said. ‘But the actual data capacity recovered really depends on the data. If data is retained for very long times, customers will get the biggest ratios as there is a lot of redundancy in long-term data archives.’
Mark Teter, CTO of Advanced Systems Group, a Denver-based solution provider, said he is seeing a spark of customer interest in applying dedupe technology to primary data as well as to backup data, especially in heavily virtualized environments.
Teter cited NetApp as a leader in primary dedupe technology, which the vendor makes available free-of-charge with its primary storage arrays.
‘Just turn it on,’ Teter said. ‘NetApp has a file system that manages dedupe performance when it’s in use. It’s the one big storage vendor who is really providing a solution suitable for deduping online primary storage.’
Even so, Teter said, deduping at the primary storage level will take time to catch on. With new 2-TB SATA hard drives pushing up the capacity of newer storage arrays, customers see less of a reason to dedupe just to cut capacity requirements, he said. ‘It’s still new, and slowly moving into primary data.’