
Yet Another Duplicate File Detector

22 Jan 2016 · CPOL · 6 min read · 25.2K views
Simple utility to scan and find duplicate files in a directory.

Updated Article & Code

http://www.codeproject.com/Articles/1106099/Yet-Another-Duplicate-File-Detector-MVVM-Pattern

Github: https://github.com/kaviteshsingh/DuplicateFileDetectorMVVM

Download DuplicateFileDetectorMVVM_exe.zip - 28.4 KB

Download source - 26.2 KB

Old Source Code

Introduction

I needed a simple, reliable utility to find duplicate files in a directory. The main requirement was that it should not rely on file names: the same content can exist in a directory under two different names, so a name-based comparison would miss duplicates created by a simple copy/paste.

Secondly, I could not use the file timestamp, because two different files can have the same timestamp. Similarly, two files can be the same size yet have completely different contents.

Instead, I compute an MD5 hash for each file, which is unique (barring the rare collision) to each file's contents. The utility scans the directory, computes the MD5 hash for each file, and saves it in an SQLite database. I then query the database to extract the list of files that share the same hash value.

Since MD5 is computed over the actual bytes of the file, two different files will produce two different hash values even if they have the same size and the same timestamp.
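The utility itself is written in C#, but the hashing idea is easy to sketch in Python with the standard `hashlib` module. Reading the file in fixed-size chunks (the 64 KB size here is an arbitrary choice) keeps memory use flat even for very large files:

```python
import hashlib

def file_md5(path, chunk_size=64 * 1024):
    """Compute the MD5 hash of a file by streaming it in chunks,
    so even large files never need to fit in memory at once."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
    return md5.hexdigest()
```

Two files are then considered duplicates exactly when `file_md5` returns the same hex digest for both.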

The utility should also let the user remove selected duplicate files.

Background

The utility is a simple WPF application that uses an SQLite database to store the file information and then work out the duplicates.

Application

Since a directory can contain many files, and calculating an MD5 hash takes a while (especially for large files), the UI needs to stay responsive. If the application performed the directory scanning and MD5 calculation on the UI thread, it would appear to freeze or hang even though it was busy doing work.

To resolve this, the simple solution is to spawn a BackgroundWorker and offload the directory scanning and MD5 calculation to it.

If the user stops the scan midway, the application finishes calculating the MD5 hash for the current file and then terminates the scan. It still populates the results with whatever duplicates it has found so far.
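The cancellation behaviour described above ("finish the current file, then stop") can be sketched language-neutrally: the worker checks a cancel flag *between* files, never in the middle of a hash, and returns whatever partial results it has. A minimal Python analogue (the C# BackgroundWorker equivalent would use `CancellationPending`; `hash_file` here stands in for any hashing function):

```python
import threading

stop_requested = threading.Event()  # set from the UI thread to request a cancel

def scan_worker(files, hash_file):
    """Hash files one at a time. If a cancel is requested while a file is
    being hashed, that file's hash is still completed before stopping."""
    results = {}
    for path in files:
        if stop_requested.is_set():
            break                      # stop *between* files, never mid-hash
        results[path] = hash_file(path)
    return results                     # partial results remain usable
```

Because the flag is only consulted at file boundaries, a cancel arriving mid-file never leaves a half-computed hash in the results.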

Why SQLite?

Once I generate the MD5 hash for a file, I need to store its attributes along with the hash value in some data structure. The simplest data structure that comes to mind is a list: .NET provides one, and I could use LINQ to query the information I needed.

For a small number of files, the .NET list was very efficient. As the number of files grew, however, the list started consuming a lot of memory. It was not a deal-breaker, but I decided to try something different.

One option was to use SQLite to store and query this information. Some might say it is overkill for what this utility does, and I might agree with them.

The main motivation for using SQLite was to get my hands dirty with a local SQL database and learn how it works with the .NET Framework. SQLite was a perfect fit: lightweight, with no installation hassles.

I first used a file-based database, but quickly realized that writing each file's information to disk caused a lot of overhead and a big performance hit. Reading more about SQLite, I found that I could create an in-memory database, similar to the .NET list implementation. That turned out to be really fast, with no disk reads or writes, and it boosted the performance considerably.

Why not use Directory.GetFiles to get the list of files in a directory and its sub-directories?

.NET provides a simple API, Directory.GetFiles, which returns all files in a directory tree, recursing into sub-directories. I chose not to use it for three main reasons:

  1. If there is a permission problem with any file, the API throws an exception, and there is no way to get the partial list of files scanned before the exception was thrown. I wanted to swallow the exception for files the application does not have rights to access, yet still continue reading the other files in the directory. A typical example is scanning the C:\Windows directory.
  2. The API only returns after it has retrieved the complete list of files from the directory and all its sub-directories. That is fine for a directory with a small number of files, but a typical scenario is scanning a directory containing a few thousand files. I needed a way to receive files as the scan progresses, descending into each sub-directory in turn. To solve this, I wrote my own Breadth/Depth First file and directory enumeration class.
  3. Since this is a UI application, it has to stay responsive: it should report scan status to the user and offer a way to cancel midway. While Directory.GetFiles is scanning a directory, I could not find a way to cancel the operation.

The DirFileEnumeration class addresses the problems above. It raises events when a file or a directory is found, which the calling class can subscribe to in order to report progress to the user, and it provides a mechanism to cancel the enumeration on request.
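The enumeration idea can be sketched as a breadth-first traversal that skips unreadable directories and hands back files as soon as they are found. This Python generator (an illustrative analogue of DirFileEnumeration, not the article's C# code) uses a queue of pending directories; yielding plays the role of the "file found" event, and permission errors are swallowed so the scan continues:

```python
import os
from collections import deque

def enumerate_files(root):
    """Breadth-first file enumeration. Files are yielded as soon as they
    are found (no waiting for the full tree), and directories the process
    cannot read are skipped instead of aborting the whole scan."""
    queue = deque([root])
    while queue:
        directory = queue.popleft()
        try:
            entries = list(os.scandir(directory))
        except (PermissionError, OSError):
            continue                    # swallow access errors, keep going
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                queue.append(entry.path)   # visit sub-directory later
            elif entry.is_file(follow_symlinks=False):
                yield entry.path           # "file found" event
```

Because it is a generator, the caller can also simply stop iterating to cancel the enumeration midway.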

SQLite Database Queries

The SQLite database is created in memory using this connection string:

string ConnectionString = "Data Source=:memory:;Version=3;New=True;";

We create the table using the query below:

CREATE TABLE FileDB(Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, Name TEXT, FullName TEXT, Hash TEXT, Size INTEGER, FileExt TEXT, Directory TEXT, LastModTime INTEGER);

Each file, with its calculated MD5 hash, is stored in the database using the query below:

INSERT INTO FileDB (Name, FullName, Hash, Size, FileExt, Directory, LastModTime)
VALUES (@Name, @FullName, @Hash, @Size, @FileExt, @Directory, @LastModTime);

Once the database is populated, we need the list of all files that share an MD5 hash. The query below does the trick, and I use a data grid to display the results.

SELECT s.id, s.Name, s.Size, s.Hash, s.Fullname 
FROM FileDB s INNER JOIN 
(SELECT Hash FROM FileDB GROUP BY Hash HAVING COUNT(*) > 1) q ON s.Hash = q.Hash 
ORDER BY s.Hash DESC;
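The whole pipeline (in-memory database, the table DDL, the insert, and the self-join that keeps only repeated hashes) can be exercised end-to-end with Python's built-in `sqlite3` module; `":memory:"` here plays the role of the `Data Source=:memory:` connection string, and the sample rows are invented for illustration:

```python
import sqlite3

# In-memory database: the sqlite3 analogue of "Data Source=:memory:".
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE FileDB(
    Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    Name TEXT, FullName TEXT, Hash TEXT, Size INTEGER,
    FileExt TEXT, Directory TEXT, LastModTime INTEGER)""")

rows = [  # (Name, FullName, Hash, Size, FileExt, Directory, LastModTime)
    ("a.txt", "/tmp/a.txt", "abc", 5, ".txt", "/tmp", 0),
    ("b.txt", "/tmp/b.txt", "abc", 5, ".txt", "/tmp", 0),  # duplicate of a.txt
    ("c.txt", "/tmp/c.txt", "def", 9, ".txt", "/tmp", 0),  # unique
]
con.executemany(
    "INSERT INTO FileDB (Name, FullName, Hash, Size, FileExt, Directory, LastModTime) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Only hashes that occur more than once survive the inner GROUP BY,
# so the join filters the table down to the duplicate files.
duplicates = con.execute(
    "SELECT s.Name FROM FileDB s INNER JOIN "
    "(SELECT Hash FROM FileDB GROUP BY Hash HAVING COUNT(*) > 1) q "
    "ON s.Hash = q.Hash ORDER BY s.Hash DESC").fetchall()
```

Running this leaves only `a.txt` and `b.txt` in `duplicates`, since they share the hash `abc`.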

DataGrid View

The utility provides a simple grid view to display files with the same hash. The user can Ctrl+left-click to select multiple items and delete them in one go.

The ideal control for this would probably be a TreeView with checkboxes. I am keeping that as an enhancement, because I wanted to get the first cut out with all the functionality in place.

 

Improvements

This utility does what it claims: find duplicate files using their MD5 signatures. That said, there are a few things that could be improved and enhanced:

  1. The columns in the DataGrid view are obtained directly from the public properties of the class. This could be changed so that a DataTemplate is used to add or remove fields.
  2. Instead of a DataGrid, I feel a TreeView would have been the better choice. An extended TreeView with checkboxes, with each node grouped by hash value, would present the information in a more organised manner. In the interest of time, however, I used the simplest option available.

History

First version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.
