31. Finding duplicate files

One way to determine whether two files are identical is to compare them byte-by-byte. The following method uses this approach:

// Return true if the files are identical.
public static bool FilesAreIdentical(FileInfo fileinfo1,
    FileInfo fileinfo2)
{
    byte[] bytes1 = File.ReadAllBytes(fileinfo1.FullName);
    byte[] bytes2 = File.ReadAllBytes(fileinfo2.FullName);
    if (bytes1.Length != bytes2.Length) return false;
    for (int i = 0; i < bytes1.Length; i++)
        if (bytes1[i] != bytes2[i]) return false;
    return true;
}

This method uses the File class's ReadAllBytes method to read the two files into byte arrays. If the arrays have different lengths, then the files are not identical, so the method returns false.

If the arrays have the same lengths, then the method loops through the arrays' bytes. If two corresponding bytes are different, the method again returns false.

If the method finishes its loop, all of the arrays' bytes are the same, so the method returns true.
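
Because ReadAllBytes loads each file into memory in its entirety, the method above can be costly for very large files. The following is a minimal sketch of a chunked comparison, not code from the example solution. The class name FileComparer, the method name FilesAreIdenticalStreamed, and the 64 KB default buffer size are all assumptions made for illustration.

```csharp
using System.IO;

public static class FileComparer
{
    // Hypothetical helper (not part of the example solution):
    // compare two files in fixed-size chunks instead of loading
    // each file into memory all at once.
    public static bool FilesAreIdenticalStreamed(
        FileInfo fileinfo1, FileInfo fileinfo2,
        int bufferSize = 64 * 1024)
    {
        // Files with different lengths cannot be identical.
        if (fileinfo1.Length != fileinfo2.Length) return false;

        using (FileStream stream1 = fileinfo1.OpenRead())
        using (FileStream stream2 = fileinfo2.OpenRead())
        {
            byte[] buffer1 = new byte[bufferSize];
            byte[] buffer2 = new byte[bufferSize];
            while (true)
            {
                int count1 = FillBuffer(stream1, buffer1);
                int count2 = FillBuffer(stream2, buffer2);

                // One stream ended before the other.
                if (count1 != count2) return false;

                // Both streams ended without a mismatch.
                if (count1 == 0) return true;

                for (int i = 0; i < count1; i++)
                    if (buffer1[i] != buffer2[i]) return false;
            }
        }
    }

    // Read until the buffer is full or the stream ends.
    // Stream.Read may return fewer bytes than requested.
    private static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}
```

This sketch returns as soon as it finds a mismatch, so two large files that differ near the beginning are rejected after reading only the first chunk of each.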

Unfortunately, this method can be slow, particularly if the files are large or if the directory holds many files. If the directory holds N files, then the program would have to use the method to compare N × (N - 1) / 2 pairs of files. For example, if N is 1,000, then the program would have to make 499,500 comparisons. If each comparison is slow, this could take a while.

Fortunately, you can reduce the number of comparisons that you need to make by first comparing the files' sizes. If two files have the same size, they may not be identical, so you still need to compare them byte-by-byte, but if two files have different sizes, then you know for certain that they are different.

That's the approach taken by the example solution. It finds the files in the directory, sorts them by size, finds groups of files with matching sizes, and then examines those groups more closely to see which files are identical.

The following GetSameSizedFiles method searches a directory for groups of files that have the same sizes:

// Return lists of files with the same sizes.
// If a file is the only one of its size, do not include it.
public static List<List<FileInfo>> GetSameSizedFiles(
    this DirectoryInfo dirinfo)
{
    // Get the directory's files.
    FileInfo[] fileinfos = dirinfo.GetFiles();

    // Get the file sizes.
    long[] filesizes = new long[fileinfos.Length];
    for (int i = 0; i < fileinfos.Length; i++)
        filesizes[i] = fileinfos[i].Length;

    // Sort by file size.
    Array.Sort(filesizes, fileinfos);

    // Find groups of files with the same sizes.
    List<List<FileInfo>> groups = new List<List<FileInfo>>();
    int num = 1;
    while (num < fileinfos.Length)
    {
        if (fileinfos[num].Length != fileinfos[num - 1].Length)
            // No match. Move on to the next size.
            num++;
        else
        {
            // We have a match. Make a list of files with this size.
            List<FileInfo> files = new List<FileInfo>();
            groups.Add(files);
            files.Add(fileinfos[num - 1]);
            long length = fileinfos[num - 1].Length;
            while ((num < fileinfos.Length) &&
                (fileinfos[num].Length == length))
            {
                files.Add(fileinfos[num++]);
            }
        }
    }
    return groups;
}

The method uses the DirectoryInfo class's GetFiles method to get the directory's files. It then creates an array holding the files' lengths and uses Array.Sort to sort the files by their sizes.

Next, the code creates a list of lists of FileInfo objects named groups. It then loops through the files looking for files that have the same size. Because the files are sorted by their sizes, any files with matching sizes will be adjacent in the fileinfos array.

When the code finds two adjacent files that have the same size, it creates a list named files and adds it to the list of lists named groups. It then adds those two files, plus any files that follow with the same size, to the files list.

When it has finished examining all of the files, the method returns the groups list.
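
The same grouping can also be expressed with LINQ. The following sketch is an alternative formulation, not the example solution's code. The class name SizeGrouping and the method name GetSameSizedFilesLinq are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SizeGrouping
{
    // Hypothetical LINQ variant of GetSameSizedFiles.
    // Return lists of files with the same sizes, omitting
    // any file that is the only one of its size.
    public static List<List<FileInfo>> GetSameSizedFilesLinq(
        this DirectoryInfo dirinfo)
    {
        return dirinfo.GetFiles()
            // Collect files that share a length.
            .GroupBy(fileinfo => fileinfo.Length)
            // Keep only sizes shared by two or more files.
            .Where(group => group.Count() > 1)
            // List the groups from smallest size to largest.
            .OrderBy(group => group.Key)
            .Select(group => group.ToList())
            .ToList();
    }
}
```

This version trades the explicit sort-and-scan loop for a hash-based grouping. Whether it is faster depends on the directory, so treat it as a readability alternative rather than an optimization.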

After calling the GetSameSizedFiles method, the program has lists of files with matching sizes. It still needs to examine files with matching sizes to see which are truly identical. The following GetIdenticalFiles method does that:

// Return lists of identical files.
public static List<List<FileInfo>> GetIdenticalFiles(
    this DirectoryInfo dirinfo)
{
    // Get lists of files that have the same sizes.
    List<List<FileInfo>> sameSizedFiles = dirinfo.GetSameSizedFiles();

    // Make a list to hold identical file lists.
    List<List<FileInfo>> results = new List<List<FileInfo>>();
    if (sameSizedFiles.Count == 0) return results;

    foreach (List<FileInfo> sizeGroup in sameSizedFiles)
    {
        while (sizeGroup.Count > 1)
        {
            // Make a list for the first file and
            // remove that file from the size list.
            List<FileInfo> identicalGroup = new List<FileInfo>();
            FileInfo fileinfo1 = sizeGroup[0];
            identicalGroup.Add(fileinfo1);
            sizeGroup.RemoveAt(0);

            // See if any other files should be in this group.
            // Loop backward so removals do not shift unvisited items.
            for (int i = sizeGroup.Count - 1; i >= 0; i--)
            {
                if (FilesAreIdentical(fileinfo1, sizeGroup[i]))
                {
                    // The files are identical.
                    // Add the new one to the identical list.
                    identicalGroup.Add(sizeGroup[i]);
                    sizeGroup.RemoveAt(i);
                }
            }

            // Keep this group only if it holds more than one file.
            if (identicalGroup.Count > 1) results.Add(identicalGroup);
        }
    }

    // Return the identical groups.
    return results;
}

This method calls the GetSameSizedFiles method to get the same-sized file lists. It then creates a new list of lists named results to hold the final lists of identical files.

Next, the code loops through the lists of same-sized files. For each size list, the program enters a loop that continues until that list holds fewer than two files.

Inside the loop, the code saves the first item in the fileinfo1 variable, adds it to a new identicalGroup list, and removes it from the size list. The method then loops through the remaining files in the size list and compares each one byte-by-byte to the file that it just removed. If a file is identical to the removed file, then the code adds it to the identicalGroup list and removes it from the size list.

After it has finished looking for files that are identical to fileinfo1, the code examines the identical file list. If that list contains more than one file, the code adds the list to the results. If the list contains only one file, then the method discards it.

When it finishes examining all of the size lists, the method returns its results.
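
The grouping loop does not depend on files specifically. To illustrate the remove-then-scan pattern in isolation, the following sketch applies it to any list and any equivalence test. The class name Grouping, the method name GroupEquivalent, and the areEquivalent parameter are hypothetical and are not part of the example solution.

```csharp
using System;
using System.Collections.Generic;

public static class Grouping
{
    // Hypothetical generic version of the grouping loop.
    // Return groups of two or more equivalent items.
    public static List<List<T>> GroupEquivalent<T>(
        List<T> items, Func<T, T, bool> areEquivalent)
    {
        // Work on a copy so the caller's list is not modified.
        List<T> remaining = new List<T>(items);
        List<List<T>> results = new List<List<T>>();

        while (remaining.Count > 1)
        {
            // Start a new group with the first remaining item.
            T first = remaining[0];
            remaining.RemoveAt(0);
            List<T> group = new List<T> { first };

            // Loop backward so removals do not shift unvisited items.
            for (int i = remaining.Count - 1; i >= 0; i--)
            {
                if (areEquivalent(first, remaining[i]))
                {
                    group.Add(remaining[i]);
                    remaining.RemoveAt(i);
                }
            }

            // Discard groups that hold only one item.
            if (group.Count > 1) results.Add(group);
        }
        return results;
    }
}
```

For example, calling GroupEquivalent on the words apple, Apple, pear, APPLE, Pear, and fig with a case-insensitive comparison should produce one group of three items and one group of two, with fig discarded. Each item is compared only against the first member of a candidate group, which is valid because the test is assumed to be a true equivalence (if A matches B and A matches C, then B matches C).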

The final interesting piece in the example solution is the following ProcessFiles method:

// Process the files.
private void ProcessFiles()
{
    DirectoryInfo dirinfo = new DirectoryInfo(directoryTextBox.Text);
    List<List<FileInfo>> groups = dirinfo.GetIdenticalFiles();

    if (groups.Count == 0)
        filesTreeView.Nodes.Add("No identical files");
    else
    {
        char label = 'A';
        foreach (List<FileInfo> group in groups)
        {
            // Create a branch for this group.
            TreeNode branch =
                filesTreeView.Nodes.Add(label++.ToString());

            // Add the files.
            foreach (FileInfo fileinfo in group)
            {
                // Display the file's name.
                TreeNode node = branch.Nodes.Add(fileinfo.Name);

                // Save the FileInfo in case we want it later.
                node.Tag = fileinfo;
            }
        }
        filesTreeView.ExpandAll();
    }
}

This method creates a DirectoryInfo object for the directory entered by the user and then calls that object's GetIdenticalFiles extension method. If the result contains no groups of identical files, the program displays a message inside its TreeView control.

Otherwise, if there are groups of identical files, the code loops through them. For each group, the method adds a branch to the TreeView. It then loops through the FileInfo objects in the group and adds their file names to the new branch.

The code also saves the files' FileInfo objects in the new nodes' Tag properties in case the program needs them later. For example, you could modify the program to let the user right-click a file's node to delete that file. The program would use the node's Tag property to determine which file should be deleted.

Download the FindDuplicateFiles example solution to see additional details.
