Michael Chen | Content Strategist | February 14, 2024
The data deduplication process systematically eliminates redundant copies of data and files, which can help reduce storage costs and improve version control. In an era when every device generates data and entire organizations share files, data deduplication is a vital part of IT operations. It’s also a key part of the data protection and continuity process. When data deduplication is applied to backups, it identifies and eliminates duplicate files and blocks, storing only one instance of each unique piece of information. This not only can help save money but also can help improve backup and recovery times because less data needs to be sent over the network.
Data deduplication is the process of removing identical files or blocks from databases and data storage. This can occur on a file-by-file, block-by-block, or individual byte level or somewhere in between as dictated by an algorithm. Results are often measured by what’s called a “data deduplication ratio.” After deduplication, organizations should have more free space, though just how much varies because some activities and file types are more prone to duplication than others. While IT departments should regularly check for duplicates, the benefits of frequent deduplication also vary widely and depend on several variables.
In the data deduplication process, a tool scans storage volumes for duplicate data and removes flagged instances. To find duplicates, the system compares unique identifiers, or hashes, attached to each piece of data. If a match is found, only one copy of the data is stored, and duplicates are replaced with references to the original copy.
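To make that compare-and-reference step concrete, here is a minimal Python sketch, assuming a simple file-level pass; the deduplicate function and its in-memory dictionaries are hypothetical stand-ins for a real dedupe engine's hash index and reference store.

```python
import hashlib
from pathlib import Path

def deduplicate(paths):
    """File-level dedup sketch: keep one copy per unique content hash and
    point every duplicate at the copy that was kept."""
    seen = {}        # content hash -> first path stored with that content
    references = {}  # duplicate path -> path of the retained original
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest in seen:
            references[path] = seen[digest]  # duplicate: record a reference only
        else:
            seen[digest] = path              # unique content: keep this copy
    return seen, references
```

A production tool would persist the index, verify matches, and handle failures, but the core idea of hashing each item, looking it up, and either storing it or recording a reference is the same.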
The dedupe system searches in local storage, in management tools such as data catalogs, and in data stores, scanning both structured and unstructured data.
Data deduplication can help save resources—storage space, compute power, and money. At its most basic, deduplication is about shrinking storage volumes. But when every device produces massive amounts of data and files are constantly shared among departments, duplicate data has far-reaching consequences; for example, it can slow processes, consume hardware resources, create redundancies, and sow confusion when different teams work from different copies of the same file. Deduplication can help take care of all this, which is why many organizations keep it on a regularly scheduled cadence as part of their IT maintenance strategies.
Because data deduplication is a resource-intensive data management process, timing should depend on a number of variables, including the design of the network and when employees access files. The following are the most common situations where data deduplication is used:
General-purpose file servers provide storage and services for a wide variety of data, including individual employees’ caches of files and shared departmental folders. Because these types of servers often have both a high volume of users and a diversity of user roles, many duplicate files tend to exist. Causes include backups from local hard drives, app installations, file sharing, and more.
Virtual desktop infrastructure technology provides centralized hosting and management of virtualized desktops for remote access. The issue is, virtual hard drives are often identical, containing duplicate files that eat up storage. In addition, when a high volume of users boot up their virtual machines all at once, such as at the start of the workday, the ensuing "VDI boot storm" can grind performance to a crawl, if not a halt. Deduplication can help alleviate this by using an in-memory cache for individual application resources as they're called on demand.
Backups create duplicate versions of files, for good reason. However, the same file doesn’t need to be copied over and over in perpetuity. Instead, data deduplication ensures there’s a clean backup file, with other instances in newer backup versions simply pointing to the primary file. This allows for redundancy while also optimizing resources and storage space.
Deduplication tools make for a more efficient data transfer process. Instead of doing a start-to-finish overwrite, they treat files as collections of segments, scan for the segments that have changed, and move only those. For example, if someone is receiving a new version of a very large file and the new version has just a few segments of updated code, the transfer/overwrite process can complete quickly by writing only to those segments.
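As a rough sketch of that segment-level comparison, the following Python example hashes fixed-size segments of an old and a new version of a file and reports which segments actually need to move; the 4 MB segment size and the function names are illustrative assumptions, not settings from any particular tool.

```python
import hashlib

SEGMENT_SIZE = 4 * 1024 * 1024  # illustrative 4 MB segments; real tools choose differently

def segment_hashes(data: bytes):
    """Hash each fixed-size segment of a file's contents."""
    return [hashlib.sha256(data[i:i + SEGMENT_SIZE]).hexdigest()
            for i in range(0, len(data), SEGMENT_SIZE)]

def segments_to_transfer(old: bytes, new: bytes):
    """Return indexes of segments in the new version that differ from the old
    version and therefore actually need to be sent over the network."""
    old_hashes, new_hashes = segment_hashes(old), segment_hashes(new)
    return [i for i, digest in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != digest]
```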
Archival systems are often confused with backups because they're both used for long-term data storage. But while systems generate backups for the purposes of disaster recovery and preparedness, organizations use archival systems to preserve data that's no longer in active use. Duplicates may be generated when combining storage volumes or adding new segments to an archival system. The deduplication process helps keep archives lean and efficient.
From a big-picture perspective, data deduplication tools compare files or file blocks for duplicate identifying fingerprints, also known as hashes. If duplicates are confirmed, they’re logged and eliminated. Here’s a closer look at the specific steps in the process.
Chunking refers to a deduplication process that breaks files down into segments, aka chunks. The size of these segments can be either algorithmically calculated or set using established guidelines. The benefit of chunking is that it allows for more precise deduplication, though it requires more compute resources.
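To illustrate how chunk boundaries can be algorithmically calculated, here is a toy content-defined chunker in Python; the window, mask, and size limits are arbitrary illustrative values, and real tools use stronger rolling hashes such as Rabin fingerprints.

```python
from collections import deque

def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0xFF,
                           min_size: int = 64, max_size: int = 4096):
    """Toy content-defined chunker: keep a rolling checksum over the last
    `window` bytes and cut a chunk wherever the checksum's low bits are zero,
    subject to minimum and maximum chunk sizes."""
    chunks, start, checksum = [], 0, 0
    recent = deque()
    for i, byte in enumerate(data):
        recent.append(byte)
        checksum += byte
        if len(recent) > window:
            checksum -= recent.popleft()
        size = i - start + 1
        if (size >= min_size and (checksum & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])  # boundary found: emit a chunk
            start = i + 1
            recent.clear()
            checksum = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the final chunk
    return chunks
```

Because the cut decision looks only at a small sliding window of bytes, an edit near the start of a file shifts only nearby boundaries rather than every chunk after it, which is one reason content-defined chunking can find more duplicates than fixed-size chunking, at a higher compute cost.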
When data is processed by a deduplication tool, a hashing algorithm assigns a hash to it. The hash is then checked to see if it already exists within the log of processed data. If it does, the data is categorized as a duplicate and deleted to free up storage space.
The results of the deduplication process are stored in a reference table that tracks which segments or files were removed and what they duplicated. The reference table allows for transparency and traceability while also providing a comprehensive record of which original copy each removed duplicate points to across a storage volume.
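Putting chunking, hashing, and the reference table together, a simplified dedupe store might look like the sketch below; the ChunkStore class and its dictionaries are hypothetical simplifications of a real tool's index and metadata.

```python
import hashlib

class ChunkStore:
    """Hypothetical dedupe store: unique chunks are kept once, and a reference
    table records which stored chunk each (file, position) points to."""
    def __init__(self):
        self.unique_chunks = {}    # hash -> chunk bytes, stored exactly once
        self.reference_table = {}  # (file_name, chunk_index) -> hash

    def add(self, file_name: str, chunks):
        for index, chunk in enumerate(chunks):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.unique_chunks:
                self.unique_chunks[digest] = chunk  # new data: store it
            # Whether new or duplicate, always record where this piece came from.
            self.reference_table[(file_name, index)] = digest

    def rebuild(self, file_name: str) -> bytes:
        """Reassemble a file from its references to show nothing was lost."""
        keys = sorted(k for k in self.reference_table if k[0] == file_name)
        return b"".join(self.unique_chunks[self.reference_table[k]] for k in keys)
```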
Organizations can choose from several data deduplication approaches based on what best suits their budgets, bandwidth, and redundancy needs. Where to process, when to process, how finely to process—all of these are mix-and-match variables that are used to create a customized solution for an organization.
Inline vs. post-process deduplication: with inline deduplication, duplicates are identified and removed in real time as data is written to storage; with post-process deduplication, data is written to storage first and a scheduled job scans it afterward to remove duplicates.
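As a minimal sketch of the difference, the same hash check can sit in the write path (inline) or run later as a scheduled sweep over data that was written as-is (post-process); the Volume class and method names below are illustrative only.

```python
import hashlib

class Volume:
    """Illustrative storage volume showing where the dedup check can run."""
    def __init__(self):
        self.deduped_blocks = {}  # hash -> block bytes, one copy each
        self.raw_blocks = []      # blocks written as-is, awaiting post-processing

    def write_inline(self, block: bytes):
        """Inline: check for a duplicate before the block is committed."""
        self.deduped_blocks.setdefault(hashlib.sha256(block).hexdigest(), block)

    def write_raw(self, block: bytes):
        """Post-process: accept the write immediately, dedupe later."""
        self.raw_blocks.append(block)

    def run_post_process(self):
        """Scheduled sweep that dedupes everything written without checks."""
        while self.raw_blocks:
            self.write_inline(self.raw_blocks.pop())
```

The tradeoff this illustrates: inline deduplication spends compute on every write but never stores duplicates, while post-process deduplication keeps writes fast but temporarily needs enough space to hold duplicates until the sweep runs.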
Just as editing a document removes repetitive words or phrases to make the content more concise, deduplication streamlines an organization’s data, offering potential payoffs such as lower storage costs, less bandwidth consumption, and increased backup efficiency.
When fewer files exist, organizations use less storage. That’s one of the most clear-cut benefits of data deduplication, and it extends to other systems. Companies will require less space for backups and consume fewer compute/bandwidth resources for scanning and backing up data.
Because data deduplication reduces the burden of running backups, a key by-product is faster, easier disaster recovery. Smaller backups are created more efficiently, which means fewer resources are required to pull them for recovery purposes.
With data deduplication, the footprint of backup files shrinks, leading to lower resource use during backup processes across storage space, compute, and process time. All this gives organizations added flexibility in how they schedule their backups.
The fewer the files that need to be transferred, the less bandwidth is required. Thus, data deduplication can improve network efficiency by shrinking demand in any transfer process, including transporting backups for archiving and recalling backups for disaster recovery.
Exploding data volumes have led to a rapid increase in storage spending in organizations of all sizes. Deduplication can help create cost savings by reducing the amount of storage needed for both day-to-day activities and backups or archives. Secondary cost savings come from reduced energy, compute, and bandwidth demands and fewer human resources needed to manage and troubleshoot duplicative files.
Data deduplication is an effective tool to maximize resource use and reduce costs. However, those benefits come with some challenges, many related to the compute power required for granular dedupe. The most common drawbacks and concerns related to data deduplication include the following:
Data deduplication is resource intensive, especially when performed at the block level. IT teams need to be thoughtful when scheduling and executing deduplication processes, taking into consideration available bandwidth, organizational activities and needs, the backup location, deadlines, and other factors based on their unique environments.
Hash collisions occur when two different pieces of data happen to produce the same hash value. When the deduplication process uses a block-level approach, hashes are assigned to data chunks, which raises the possibility of hash collisions that can corrupt data. Preventing problems from hash collisions involves either increasing the size of the hash table or implementing collision resolution methods, such as chaining or open addressing. Chaining involves storing multiple elements with the same hash key in a linked list or another data structure, while open addressing involves finding an alternative location within the hash table to store the colliding element. Each method has advantages and disadvantages, so IT teams need to weigh the length and complexity of the hashing algorithm against these workarounds.
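As a simplified illustration of the chaining approach described above, the sketch below keeps every distinct block that maps to the same hash in a per-bucket list and declares a duplicate only when the bytes themselves match, so a rare collision cannot silently discard unique data; the ChainedIndex class is a hypothetical teaching example, not a production design.

```python
import hashlib
from collections import defaultdict

class ChainedIndex:
    """Dedup index that resolves hash collisions by chaining: each bucket keeps
    every distinct block that happened to map to the same hash key."""
    def __init__(self):
        self.buckets = defaultdict(list)  # hash -> list of distinct block contents

    def is_duplicate(self, block: bytes) -> bool:
        digest = hashlib.sha256(block).hexdigest()
        chain = self.buckets[digest]
        if block in chain:       # byte-for-byte match: a true duplicate
            return True
        chain.append(block)      # first sighting, or a collision: keep the data
        return False
```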
No process is foolproof, and during the dedupe process, there's always the possibility of unintentionally deleting or altering data that is, in fact, unique and important. Causes of integrity issues include hash collisions; corrupted source blocks; interrupted processes from unexpected events such as disk failures or power outages; a successful cyberattack; or simple operator error. While integrity issues are rare given the quality of today's data deduplication tools and protocols, they remain a possibility and can cause serious headaches.
The deduplication process creates a new layer of metadata for change logs and the digital signatures attached to every processed block. This is called a “fingerprint file.” Not only does this metadata require storage space, but it may also create its own data integrity issues. If it becomes corrupted, for example, then the recovery process becomes significantly more challenging.
While data deduplication saves money in the long run via reduced space requirements, it does require an up-front investment. These costs include the dedupe tool itself, usually priced based on the number of records, as well as the IT staff time required to design, execute, and manage the deduplication process.
How does data deduplication work in the real world? In theory, it’s a simple data science concept: Eliminate duplicate data to reduce resource consumption and minimize errors that happen when there are several versions of a file floating around. But different sectors, industries, and even departments have unique goals and needs. Here are some common use cases.
Customer relationship management: Within a CRM system, customer records, contact info, and deals may be recorded using multiple sources, levels of detail, and formats. This leads to inconsistent data, where one manager may have a slightly different record than another; for example, if the record for a point of contact is held in multiple data repositories and only one is updated after they leave the company, some employees will likely continue to use the outdated information. Data deduplication can help ensure a single source of accurate customer information, allowing every individual and group to use the latest data to generate visualizations or run analytics.
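As a simplified, hypothetical illustration of record-level deduplication in a CRM context, the sketch below collapses contact rows that normalize to the same email address and keeps the most recently updated one; the field names and the "last update wins" rule are assumptions for the example, not how any particular CRM resolves conflicts.

```python
from datetime import date

def dedupe_contacts(records):
    """Collapse contact records that share a normalized email address, keeping
    the most recently updated row (a hypothetical 'last update wins' rule)."""
    latest = {}
    for record in records:
        key = record["email"].strip().lower()  # normalize before comparing
        if key not in latest or record["updated"] > latest[key]["updated"]:
            latest[key] = record
    return list(latest.values())

contacts = [
    {"email": "Pat@Example.com", "phone": "555-0100", "updated": date(2023, 1, 5)},
    {"email": "pat@example.com", "phone": "555-0199", "updated": date(2023, 9, 2)},
]
print(dedupe_contacts(contacts))  # one record remains, carrying the newer details
```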
Data integration: When two organizations merge, whether through an acquisition or internal reshuffling, data contained in different instances of the same application can create duplicate records. Say a larger company purchases a smaller competitor with a 40% overlap in customers, and that’s reflected in their ERP systems. Deduplication can eliminate this redundancy, freeing up storage space while also ensuring that everyone within the newly formed organization uses only the latest version of each record.
Virtual computing: When using virtual desktops, such as for testing environments or virtual access for specialized applications or internal systems, data deduplication can increase efficiency—particularly with heavy user volume. Virtual machines often contain very similar data, which makes for many duplicate versions of files. Data deduplication purges these duplicates to help ensure storage doesn’t get overrun with data generated by virtual machines.
Banking: Within a financial institution, different departments or branches may hold duplicate records of customer information. Every duplicate record is a potential entry point for criminals to steal identities, make fraudulent transactions, and perform other unlawful activities. And examining and processing duplicate data to check for fraud requires more resources. Data deduplication can help improve efficiency and security for banks and credit unions.
This is just a sampling of use cases. Any organization that creates a lot of data can benefit from deduplication.
Numerous providers offer data deduplication tools, but which is right for your organization? Teams should weigh several factors when making a short list.
The best way to resolve data deduplication problems is to minimize them in the first place. Oracle HeatWave helps with that by combining transactions, real-time analytics across data warehouses and data lakes, machine learning, and generative AI in one cloud service. HeatWave customers don’t need to duplicate data from a transactional database into a separate analytics database for analysis, which presents several benefits.
With the built-in HeatWave AutoML, customers can build, train, and explain machine learning models within HeatWave, again without the need to duplicate data into a separate machine learning service.
HeatWave GenAI provides integrated, automated, and secure GenAI with in-database large language models (LLMs); an automated, in-database vector store; scale-out vector processing; and the ability to have contextual conversations in natural language—letting customers take advantage of GenAI without AI expertise and without moving data to a separate vector database.
By eliminating data duplication across several cloud services for transactions, analytics, machine learning, and GenAI, HeatWave enables customers to simplify their data infrastructures, make faster, better-informed decisions, increase productivity, improve security, and reduce costs. Additionally, customers get the best performance and price-performance for analytics workloads.
A common example of deduplication comes from running version-based backups and archives of an organization's data. Each of these archives will contain many instances of the same untouched files. With deduplication, the backup process is streamlined: a new version of an archive is created without those duplicative files and instead contains pointers to the single source copy, allowing each file to exist within the archive without using up additional storage space.
Duplicate records needlessly eat up storage space, and that extra data consumes more resources, including transfer bandwidth and compute, during processes such as malware scans. Deduplication reduces the volume of storage space used, shrinking overall resource use, be it bandwidth or storage capacity.
Duplicates can emerge through both data duplicity and data redundancy. Data duplicity refers to situations when a user adds a duplicate file to the system themselves. Data redundancy refers to situations when databases with some overlapping files or records merge to create duplicates.
Deduplication can free up storage space for greater long-term efficiency and cost savings. However, the actual process of deduplication is resource intensive and can slow down various parts of the network, including compute performance and transfer bandwidth. This means IT departments must think strategically about scheduling deduplication.