Tip/Trick

IEqualityComparer<FileInfo> using MD5 Hash

28 May 2018 | CPOL | 1 min read

Introduction

The presented code snippet compares two given files through an implementation of IEqualityComparer<FileInfo>.

The comparison does the following:

- Compare the two FileInfo references with the equality operator (==). FileInfo does not overload this operator, so the check only succeeds if both arguments are the same instance (or both are null)
- If exactly one of the objects is null, they can't represent the same file, so false is returned
- If both objects have the same file path, true is returned since they refer to the same file
- If the file sizes differ, the files can't be the same either, so false is returned
- As a last resort, the MD5 hashes of both files are compared.

Please keep in mind that MD5 hashing is an expensive operation, which makes this approach unsuitable for comparing many files or large files. The code presented here was initially intended to be used within an integration test. If you need to compare a lot of files (or very large ones), you may prefer to write your own implementation. You may want to start by reading this Stack Overflow discussion, though.
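
One possible starting point for such an implementation is a direct byte comparison that stops at the first difference instead of hashing both files completely. The following is only a sketch: the class name FileContentComparer, the method AreEqual and the 64 KB block size are assumptions made for illustration and are not part of the original snippet.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original snippet.
public static class FileContentComparer
{
    public static bool AreEqual(FileInfo x, FileInfo y)
    {
        if (x == null) throw new ArgumentNullException(nameof(x));
        if (y == null) throw new ArgumentNullException(nameof(y));

        // Files of different size can never have the same content.
        if (x.Length != y.Length)
        {
            return false;
        }

        const int bufferSize = 64 * 1024; // assumed block size, tune as needed
        var bufferX = new byte[bufferSize];
        var bufferY = new byte[bufferSize];

        using (var streamX = x.OpenRead())
        using (var streamY = y.OpenRead())
        {
            while (true)
            {
                int readX = ReadBlock(streamX, bufferX);
                int readY = ReadBlock(streamY, bufferY);

                if (readX != readY)
                {
                    return false; // one file ended earlier than the other
                }

                if (readX == 0)
                {
                    return true; // both streams ended without a difference
                }

                for (int i = 0; i < readX; i++)
                {
                    if (bufferX[i] != bufferY[i])
                    {
                        return false; // stop at the first differing byte
                    }
                }
            }
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep
    // reading until the buffer is full or the stream ends.
    private static int ReadBlock(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
            {
                break;
            }
            total += read;
        }
        return total;
    }
}
```

Unlike a hash comparison, this approach never reads past the first differing block, which matters most when large files differ early.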

Background

Please keep in mind that this implementation reads the complete contents of both files to create the MD5 hash for each of them. If you're trying to compare very large files, this may slow down your application considerably.
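
When many files have to be compared with each other, one way to limit this cost is to hash every file at most once and to skip hashing for files whose size is unique. The sketch below illustrates that idea. DuplicateFileFinder, FindDuplicates and GetMd5Hex are hypothetical names introduced here, and equal MD5 hashes are treated as "same content", which ignores the (unlikely) case of a hash collision.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, not part of the original snippet.
public static class DuplicateFileFinder
{
    // Returns groups of files that share both size and MD5 hash.
    public static List<List<FileInfo>> FindDuplicates(IEnumerable<FileInfo> files)
    {
        var result = new List<List<FileInfo>>();

        // Group by size first; a file with a unique size needs no hash at all.
        var sizeGroups = files.GroupBy(f => f.Length).Where(g => g.Count() > 1);

        foreach (var sameSize in sizeGroups)
        {
            // GroupBy evaluates the key once per element, so each
            // remaining file is hashed exactly once.
            var hashGroups = sameSize
                .GroupBy(f => GetMd5Hex(f.FullName))
                .Where(g => g.Count() > 1);

            foreach (var sameHash in hashGroups)
            {
                result.Add(sameHash.ToList());
            }
        }

        return result;
    }

    // Streams the file through MD5 and returns the hash as hexadecimal text.
    private static string GetMd5Hex(string filePath)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(filePath))
        {
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }
}
```

Calling DuplicateFileFinder.FindDuplicates(new DirectoryInfo(path).GetFiles()) would return one list per set of files with identical size and hash; files without a duplicate do not appear in the result.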

The code

C#
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

/// <summary>
/// An <see cref="IEqualityComparer{T}"/> for files using <see cref="FileInfo"/>
/// </summary>
public class FileMd5EqualityComparer : IEqualityComparer<FileInfo>
{
    /// <summary>
    /// See <see cref="IEqualityComparer{T}.Equals(T, T)"/>
    /// </summary>
    public bool Equals(FileInfo x, FileInfo y)
    {
        // FileInfo does not overload ==, so this is a reference
        // comparison. It is also true if both parameters are null.
        if (x == y)
        {
            return true;
        }

        // If only one of the parameters is null, they can't be
        // the same. The case where both are null is handled above.
        if (x == null || y == null)
        {
            return false;
        }

        // If both file paths are the same, the
        // files must be the same.
        if (x.FullName == y.FullName)
        {
            return true;
        }

        // The files can't be equal if they don't
        // have the same size
        if (x.Length != y.Length)
        {
            return false;
        }

        // At last, compare the MD5 of the files.
        var md5X = GetMd5(x.FullName);
        var md5Y = GetMd5(y.FullName);

        return md5X == md5Y;
    }

    /// <summary>
    /// See <see cref="IEqualityComparer{T}.GetHashCode(T)"/>
    /// </summary>
    public int GetHashCode(FileInfo obj)
    {
        if (obj == null)
        {
            throw new ArgumentNullException(nameof(obj));
        }

        // Equal files have the same size, so the size yields equal
        // hash codes for equal files. FileInfo.GetHashCode() differs
        // per instance, which would break Distinct(), HashSet<T>
        // and Dictionary<TKey, TValue>.
        return obj.Length.GetHashCode();
    }

    /// <summary>
    /// Returns the MD5 of the file at <paramref name="filePath"/>
    /// as hexadecimal string
    /// </summary>
    private static string GetMd5(string filePath)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(filePath))
        {
            // Decoding the raw hash bytes as text can be lossy;
            // a hexadecimal representation keeps every byte.
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }
}
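
A note on turning hash bytes into a string for comparison: decoding them with a text encoding can map different byte sequences to the same text, whereas a hexadecimal representation keeps every byte distinguishable. The sketch below demonstrates this with UTF-8, which is what Encoding.Default returns on .NET Core and later (on .NET Framework it is the system's ANSI code page, so the effect depends on the platform). The class HashStringDemo and its methods are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the original snippet.
public static class HashStringDemo
{
    // Compares two byte sequences after decoding them as UTF-8 text.
    // Bytes that are not valid UTF-8 become the replacement
    // character U+FFFD, so information is lost.
    public static bool EqualAsText(byte[] a, byte[] b)
    {
        return Encoding.UTF8.GetString(a) == Encoding.UTF8.GetString(b);
    }

    // Compares two byte sequences via their hexadecimal
    // representation, which is unique for each sequence.
    public static bool EqualAsHex(byte[] a, byte[] b)
    {
        return BitConverter.ToString(a) == BitConverter.ToString(b);
    }
}
```

For the two different single-byte inputs { 0xFF } and { 0xFE }, EqualAsText returns true because both decode to U+FFFD, while EqualAsHex returns false ("FF" versus "FE").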

History

2018-05-25 Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Switzerland

Comments and Discussions

Question: GetHashCode for object equality comparison
Member 951124, 30-May-18 4:23

Question: Why determine hash...
User 11060979, 25-May-18 12:43

Answer: Re: Why determine hash...
Marco Bertschi, 28-May-18 4:19

Suggestion: MD5 is expensive and slow in this special case!
Peter BCKR, 25-May-18 2:18
For an IEqualityComparer, calculating MD5 hashes for every comparison is a very expensive operation, for several reasons.

1. You must compute the full hash for every comparison; there is no way to stop as early as possible. For example, if you compare a 2 KB file with a 2 GB file, you compute both full hashes although these files can NEVER be the same. Or, if you compare two files of the same size where the first one is a PDF and the second one is an image, you could stop after the first few bytes of the header.

2. Computing cryptographic hashes is a very expensive operation in terms of CPU instructions (compared to a simple byte-by-byte comparison).

3. There is no way to parallelise the computation for very large files in case you have a fast RAID array or SSDs.


Hashes like MD5 are very useful for deduplication or for comparing many files (precompute the hashes and compare them), but it's not a good idea to compute hashes for every comparison.

I suggest using a rolling compare algorithm like the following (which is fast and allows future optimizations such as parallelisation):

C#
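// Files of different size can never be equal.
if (x.Length != y.Length)
  return false;

using (var fs1 = x.OpenRead())
using (var fs2 = y.OpenRead())
{
  fs1.Position = 0;
  fs2.Position = 0;

  const int bytesToRead = sizeof(long);
  // long instead of int, so that very large files don't overflow the counter
  var iterations = (long)Math.Ceiling((double)x.Length / bytesToRead);
  var fb1 = new byte[bytesToRead];
  var fb2 = new byte[bytesToRead];
  for (long i = 0; i < iterations; i++)
  {
    // Assumes Read returns a full chunk except at the end of the file.
    // A shorter final chunk leaves identical leftover bytes in both
    // buffers (zeros or the previous, matching chunk).
    fs1.Read(fb1, 0, bytesToRead);
    fs2.Read(fb2, 0, bytesToRead);

    // Compare eight bytes at once
    if (BitConverter.ToInt64(fb1, 0) != BitConverter.ToInt64(fb2, 0))
      return false;
  }
}
return true;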


modified 25-May-18 8:26am.

General: Re: MD5 is expensive and slow in this special case!
Marco Bertschi, 28-May-18 4:18
