Delete Duplicate Image Files
Suppose you have 30 thousand image files in many directories, and you want to delete the duplicates.
Here's a Perl script that solves the problem.
How to Use It
To see help:
perl del_dup.pl --help

To find dup files in a dir:
perl del_dup.pl dirpath

To find dup files in several dirs:
perl del_dup.pl dirpath1 dirpath2 dirpath3 …

To delete dup files:
perl del_dup.pl --delete dirpath
or
perl del_dup.pl --delete dirpath1 dirpath2 …

When there are duplicate files, the first one found (in the order the dirs are given) is preserved; the others are deleted.

Note: the options --help and --delete must be the first argument.
A file is considered a duplicate of another only if the two files' contents are exactly identical. If you have 2 images and one is a scaled version of the other, they are not considered identical.
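This notion of equality is byte-for-byte identity, which Perl's standard File::Compare module provides. A minimal sketch (the temp files are just for illustration):

```perl
use strict;
use warnings;
use File::Compare;
use File::Temp qw(tempfile);

# two files with identical bytes, and one whose content differs
my ($fh1, $f1) = tempfile(); print $fh1 "hello\n"; close $fh1;
my ($fh2, $f2) = tempfile(); print $fh2 "hello\n"; close $fh2;
my ($fh3, $f3) = tempfile(); print $fh3 "HELLO\n"; close $fh3;

# compare() returns 0 only when the two files are byte-for-byte identical
print "f1 vs f2: ", (compare($f1, $f2) == 0 ? "same" : "different"), "\n"; # prints "f1 vs f2: same"
print "f1 vs f3: ", (compare($f1, $f3) == 0 ? "same" : "different"), "\n"; # prints "f1 vs f3: different"
```

Note that a scaled or re-encoded copy of an image has different bytes, so it fails this test even if it looks the same.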
I have used this script regularly over the years on some 30 thousand image files. For over 30k files, it runs in under a minute.
Sample Output
perl del_dup.pl --delete C:\Users\h3\Pictures\keyboard "C:\Users\h3\Pictures\keyboard - Copy"
Adding dir to check: C:\Users\h3\Pictures\keyboard
Adding dir to check: C:\Users\h3\Pictures\keyboard - Copy
There are a total of 32 files examed.
==============================
There are 16 unique file size.
==============================
---------------------
These following files are identical:
C:\Users\h3\Pictures\keyboard/windowslogo.gif
C:\Users\h3\Pictures\keyboard - Copy/windowslogo.gif

C:\Users\h3\Pictures\keyboard/DSC_1108.jpg
C:\Users\h3\Pictures\keyboard - Copy/DSC_1108.jpg

C:\Users\h3\Pictures\keyboard/ms-sidewinder-x6-gaming-keyboard-full.jpg
C:\Users\h3\Pictures\keyboard - Copy/ms-sidewinder-x6-gaming-keyboard-full.jpg

C:\Users\h3\Pictures\keyboard/g510.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510.jpg

C:\Users\h3\Pictures\keyboard/71Uvd2tZOZL._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/71Uvd2tZOZL._AA1500_.jpg

C:\Users\h3\Pictures\keyboard/g510 red.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 red.jpg

C:\Users\h3\Pictures\keyboard/ms x4.jpg
C:\Users\h3\Pictures\keyboard - Copy/ms x4.jpg

C:\Users\h3\Pictures\keyboard/81fuOEG-2lL._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/81fuOEG-2lL._AA1500_.jpg

C:\Users\h3\Pictures\keyboard/g110.jpg
C:\Users\h3\Pictures\keyboard - Copy/g110.jpg

C:\Users\h3\Pictures\keyboard - Copy/81hTgnd037L._AA1500_.jpg
C:\Users\h3\Pictures\keyboard/81hTgnd037L._AA1500_ - Copy.jpg
C:\Users\h3\Pictures\keyboard - Copy/81hTgnd037L._AA1500_ - Copy.jpg
C:\Users\h3\Pictures\keyboard/81hTgnd037L._AA1500_.jpg

C:\Users\h3\Pictures\keyboard/lenovo_thinkpad_usb_trackpoint_keyboard-2.jpg
C:\Users\h3\Pictures\keyboard - Copy/lenovo_thinkpad_usb_trackpoint_keyboard-2.jpg

C:\Users\h3\Pictures\keyboard/g19.jpg
C:\Users\h3\Pictures\keyboard - Copy/g19.jpg

C:\Users\h3\Pictures\keyboard/g510 yellow - Copy.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 yellow - Copy.jpg

C:\Users\h3\Pictures\keyboard/g510 green.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 green.jpg

==============================
There are 16 redundant files, totaling 2396674 bytes.
The following files (if any) will be deleted (if you used the “--delete” option):
C:\Users\h3\Pictures\keyboard - Copy/71Uvd2tZOZL._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/81fuOEG-2lL._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/81hTgnd037L._AA1500_ - Copy.jpg
C:\Users\h3\Pictures\keyboard - Copy/81hTgnd037L._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/DSC_1108.jpg
C:\Users\h3\Pictures\keyboard - Copy/g110.jpg
C:\Users\h3\Pictures\keyboard - Copy/g19.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 green.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 red.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 yellow - Copy.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510.jpg
C:\Users\h3\Pictures\keyboard - Copy/lenovo_thinkpad_usb_trackpoint_keyboard-2.jpg
C:\Users\h3\Pictures\keyboard - Copy/ms x4.jpg
C:\Users\h3\Pictures\keyboard - Copy/ms-sidewinder-x6-gaming-keyboard-full.jpg
C:\Users\h3\Pictures\keyboard - Copy/windowslogo.gif
C:\Users\h3\Pictures\keyboard/81hTgnd037L._AA1500_.jpg
File deletion done (if any)!
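The speed comes from the fact that the script compares file sizes before file contents (hence the “16 unique file size” line above): two files can only be identical if they have the same byte size, so the expensive content comparison is done only within groups of same-sized files. A minimal sketch of that grouping idea (the helper name group_by_size is mine, not the script's):

```perl
use strict;
use warnings;
use File::Find;

# Group file paths by byte size; only files of equal size can possibly
# be duplicates, so content comparison is needed only within each group.
sub group_by_size {
    my @dirs = @_;
    my %by_size;
    find({ no_chdir => 1,
           wanted   => sub { push @{ $by_size{ -s $_ } }, $_ if -f $_ } },
         @dirs);
    # sizes with more than one file are the duplicate candidates
    return grep { scalar @$_ > 1 } values %by_size;
}
```

For a typical photo collection, almost all sizes are unique, so very few byte comparisons remain after this step.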
Code
# -*- coding: utf-8 -*-
# perl
# delete files that are identical.
# http://xahlee.info/python/delete_dup_files.html
# Copyright 2005, 2022 by Xah Lee. All rights reserved.
# created: 2003-05
# version: 2011-07-08 added a -d and -h option. Expanded the inline doc. Other small cleanups.

##################################################

use strict;
use File::Find;
use Data::Dumper qw(Dumper);
$Data::Dumper::Indent=1;
use File::Compare;

# use Combo114 qw(combo reduce merge);
# use Genpair114 qw(parti genpair);

##################################################

# combo(n) returns a collection with elements of pairs that is all
# possible combinations of 2 things from n. For example, combo(4) returns
# {'3,4' => ['3',4],'1,2' => [1,2],'1,3' => [1,3],'1,4' => [1,4],'2,3' => ['2',3],'2,4' => ['2',4]};
# Each pair ($i,$j) returned must have $i < $j.
# Hash form is returned instead of array for this program.
sub combo {
    my $max = shift;
    my %hh=();
    for my $j ( 1 .. $max ) {
        for my $i ( 1 .. $j-1 ) {
            $hh{"$i,$j"} = [$i, $j];
        }
    }
    return \%hh;
}

# old implementation with lesser algorithm
# sub combo ($) { my $max=$_[0]; my %hh=(); for (my $j=1; $j < $max; ++$j) { for (my $i=1; $i <= $max; ++$i) { my $m = (($i+$j)-1)%$max+1; if ($i < $m){ $hh{"$i,$m"}=[$i,$m];}}} return \%hh;}

=pod

reduce( $pairings, $a_pair) returns the first argument with some pairs deleted.

Detail: we have n things, represented by numbers 1 to n. Some of these
are identical. We want to partition the range of numbers 1 to n so that
identical ones are grouped together.

To begin comparison, we generate a list of pairings that is all
possible pairings of numbers 1 to n. (Of course order does not matter,
and the pairing does not contain repetitions.) This is the first
argument to reduce. We go thru this pairings list one by one and do
comparisons, removing each pair once it has been compared. However,
more pairs can be removed if we find a pair identical.

For example, suppose we know that 2 and 4 are identical, and the
pairing list contains (2,3) and (4,3); one of them can be deleted,
because now 2 and 4 are the same thing. (We do this because we expect
the comparison operation to be expensive.)

reduce( $pairings, $a_pair) returns a reduced $pairings, knowing that
$a_pair are identical.

The first argument $pairings must be in the form of a hash:
{'1,5' => [1,5],'3,5' => [3,5],'2,4' => [2,4],'4,5' => [4,5],'1,3' => [1,3],'2,5' => [2,5],'1,2' => [1,2],'3,4' => [3,4],'2,3' => [2,3],'1,4' => [1,4]}
(Note that keys are strings of the pairs separated by a comma.)

$a_pair is a reference to a list of the form [$a,$b].

For example, if the input is the hash given above, and 2,3 are
identical, then these pairs will be deleted: 3,4 1,3 3,5. (Different
pairs may be deleted if the hash's pairs are given in a different
order, i.e. 3,4 instead of 4,3.)

The return value is a reference to a hash.

=cut

sub reduce ($$) {
    my %hh= %{$_[0]}; # e.g. {'1,2'=>[1,2],'5,6'=>[5,6],...}
    my ($j1,$j2)=($_[1]->[0],$_[1]->[1]); # e.g. [3,4]
    delete $hh{"$j1,$j2"};
    foreach my $k (keys %hh) {
        $k=~m/^(\d+),(\d+)$/;
        my ($k1,$k2)=($1,$2);
        if ($k1==$j1) { if ($j2 < $k2) { delete $hh{"$j2,$k2"}; } else { delete $hh{"$k2,$j2"}; } }
        if ($k2==$j1) { if ($k1 < $j2) { delete $hh{"$k1,$j2"}; } else { delete $hh{"$j2,$k1"}; } }
    }
    return \%hh;
}

=pod

merge($pairings) takes a list of pairs, each pair indicating the
sameness of the two indexes. Returns a partitioned list of same
indexes.

For example, merge( [ [1,2], [2,4], [5,6] ] ); means 1 and 2 are the
same, 2 and 4 are the same, therefore 1==2==4. The result returned is
[[4,2,1],[6,5]]; (ordering of the returned list and sublists is not
specified/important.)

=cut

sub merge($) {
    my @pairings = @{$_[0]}; # @pairings is, e.g. ([a,b], [c,d],...)
    my @interm; # array of hashes. For each hash, keys are numbers, values are dummy 'x'.

    # chop the first value of @pairings into @interm
    $interm[0]={$pairings[0][0]=>'x'};
    ${interm[0]}{$pairings[0][1]}='x';
    shift @pairings;

  N1: for my $aPair (@pairings) {
        for my $aGroup (@interm) {
            if (exists ${$aGroup}{$aPair->[0]}) { ${$aGroup}{$aPair->[1]}='x'; next N1; }
            if (exists ${$aGroup}{$aPair->[1]}) { ${$aGroup}{$aPair->[0]}='x'; next N1; }
        }
        push @interm, {$aPair->[0]=>'x'};
        ${interm[-1]}{$aPair->[1]}='x';
    }

    my @fin = shift @interm;

  N2: for my $group (@interm) {
        for my $newcoup (@fin) {
            foreach my $k (keys %$group) {
                if (exists ${$newcoup}{$k}) {
                    map { ${$newcoup}{$_}='x'} (keys %$group);
                    next N2;
                }
            }
        }
        push @fin, $group;
    }
    return map {[keys (%$_)]} @fin;
}

# ssss---------------------------------------------------

=pod

parti(aList, equalFunc)

Given a sortable and sorted list aList of n elements, we want to
return a list that is a range of numbers from 1 to n, partitioned by
the predicate function of equivalence equalFunc. (A predicate function
is a function that takes two arguments and returns either True or
False.)

Example:
parti([
 ['x','x','x','1'],
 ['x','x','x','2'],
 ['x','x','x','2'],
 ['x','x','x','2'],
 ['x','x','x','3'],
 ['x','x','x','4'],
 ['x','x','x','5'],
 ['x','x','x','5']],
 sub {$_[0]->[3] == $_[1]->[3]} )
returns
[[1],['2','3','4'],['5'],['6'],['7','8']];

Note, a mathematical aspect: there are certain mathematical
constraints on a function that checks equivalence. That is to say, if
a==b, then b==a. If a==b and b==c, then a==c. And, a==a. If an
equivalence function does not satisfy these, it is inconsistent and
basically gives meaningless results.

Note: this parti function requires the input to be sortable and sorted.

=cut

sub parti($$) {
    my @li = @{$_[0]};
    my $sameQ = $_[1];
    my @tray=(1);
    my @result;
    for (my $i=1; $i <= ((scalar @li)-1); $i++) {
        if (&$sameQ($li[$i-1], $li[$i])) {push @tray, $i+1}
        else {push @result, [@tray]; @tray=($i+1);}
    }
    push @result, [@tray];
    return \@result;
}

=pod

genpair($partiSet)

Given a list that is a set partitioned into subsets, generate a list
of all possible pairings of elements in any two subsets.

Example: genpair([[1],['2','3','4'],['5'],['6'],['7','8']]) returns:
[[1,'2'],[1,'3'],[1,'4'],[1,'5'],[1,'6'],[1,'7'],[1,'8'],['2','5'],['3','5'],['4','5'],['2','6'],['3','6'],['4','6'],['2','7'],['2','8'],['3','7'],['3','8'],['4','7'],['4','8'],['5','6'],['5','7'],['5','8'],['6','7'],['6','8']];

Actually this program returns a reference to a hash. The keys are of
the form "3,7".

=cut

sub genpair ($) {
    my $partiSet = $_[0]; # e.g. [[1],['2','3','4'],['5'],['6'],['7','8']];
    my %result;
    for (my $head =0; $head <= ((scalar @$partiSet)-2); $head++ ) {
        for (my $tail = $head+1; $tail <= ((scalar @$partiSet)-1); $tail++ ) {
            foreach my $ii (@{$partiSet->[$head]}) {
                foreach my $jj (@{$partiSet->[$tail]}) {
                    $result{"$ii,$jj"}= [$ii,$jj];
                }
            }
        }
    }
    return \%result;
}

##################################################
# arguments

my $helpText = q{
To find dup files in a folder:
perl del_dup.pl dirpath

To find dup files in several folders:
perl del_dup.pl dirpath1 dirpath2 dirpath3 ...

To delete dup files:
perl del_dup.pl -d dirpath
or
perl del_dup.pl -d dirpath1 dirpath2 ...

When there are duplicate files, the first one found (in the order the dir is given) is preserved, the others are deleted.

To see this help again:
perl del_dup.pl -h

Note: the options -h and -d must be first argument.
};

if (not defined $ARGV[0]) {die qq{No argument received. \n $helpText};}
if ($ARGV[0] eq q{-h}) {print $helpText; exit 0;}

my $debugModeQ = 0;

my $septor1 = q{==============================} . "\n";
my $septor2 = q{---------------------} . "\n";
my $septor3 = q{=-=-=-=-=-=-=-=-=-=-=-=-=-} . "\n";

my $deleteModeQ = $ARGV[0] eq q{-d} ? 1 : 0;

# print "total num of argv:", scalar(@ARGV), "\n";
# print "argv dump:", Dumper \@ARGV;

my @dirsToCheck = ();
for (my $i=0; $i < scalar(@ARGV); ++$i) {
    # print "i is:", $i, "\n";
    if (( $ARGV[$i] ne q{-d})) {
        if (-d $ARGV[$i]) {
            print "Adding dir to check: $ARGV[$i]\n";
            push @dirsToCheck, $ARGV[$i];
        } else {
            print "Not a dir, skipped: $ARGV[$i]\n";
        }
    }
}

##################################################
# get all files and put them in an array
# each element has the form [dir path, file name, file size, integer of dir order]

my @fileList = ();
for (my $i=0; $i < scalar(@dirsToCheck); ++$i) {
    sub wanted() {
        my $fileName = $_;
        my $dirName = $File::Find::dir;
        if ( -f $File::Find::name
             # && $fileName =~ m@\.jpg$@i # check only jpg files
           ) {
            push @fileList, [$dirName, $fileName, -s "$dirName/$fileName", $i];
        }
    }
    if (-d $dirsToCheck[$i]) { find(\&wanted, $dirsToCheck[$i]); }
}

#@fileList = sort {$a->[2] <=> $b->[2]} @fileList;
@fileList = sort {$a->[2] <=> $b->[2] || $a->[3] <=> $b->[3] } @fileList;

print qq[There are a total of ${\(scalar @fileList)} files examed.\n];
# print Dumper \@fileList;
print $septor1;

################################################################
# then partition the file list by file size.
# final result is an array @pfl, each element is another array, and each element is a file element, like this:
# ([ [ dir, name, size, dir index], ...], ...)

my @pfl; # partitioned file list, same-sized files are gathered together.

# partition the list into sublists of same-sized files
my @tray = @fileList[0];
for (my $i=1; $i < (scalar @fileList); $i++) {
    if ( (($fileList[$i-1]->[2]) == ($fileList[$i]->[2])) ) { push @tray, $fileList[$i] }
    else {push @pfl, [@tray];@tray=($fileList[$i]);}
}
push @pfl, [@tray];
undef @fileList;

print "There are ${\(scalar @pfl)} unique file size.\n";

if ($debugModeQ) {
    print "These following file groups have identical size:\n";
    foreach my $group (@pfl) {
        for my $f (@$group) {print "$f->[0]/$f->[1]\n";};
        print "\n";
    }
}

print $septor1;

################################################################
# fine-tune the partition to be identical files instead of same-sized.
# for a list of files (of same size), we want to group together identical files.
# steps:
# find its length. from now on work with indexes.
# generate the comparison list (all possible pairings of any file from different dirs)
# do comparison; if hit, reduce the comparison list, and put the pair in idTwins list.
# repeat the above until comparison list is empty.
# link the identical twins (of pairs) into groups of the same index.
# make the partition. (turn indexes back to file element [dir, name, size])

my @idTwins=(); # e.g. ([1,3],[7,9],[3,2],...)
# files are represented by indexes. In @idTwins, grouped together are indexes of the same file. i.e. files 1 3 are the same, 7 9 are the same, etc.

my @dfl=(); # partitioned duplicate file list

foreach my $filegroup (@pfl) {
    next if 1 == scalar @$filegroup;
    @idTwins=();
    my $pairings = ((scalar @dirsToCheck) > 1 ?
                    genpair parti( $filegroup , sub {$_[0]->[3] == $_[1]->[3] }) :
                    combo scalar @$filegroup );

    if ($debugModeQ){
        print $septor2;
        print 'number of same sized: ', scalar @$filegroup, "\n";
        print 'pairings: ';
        foreach my $nnn (keys %$pairings) {print "$nnn | "; };
        print "\n";
    }

    while (0 < scalar keys %$pairings) {
        my $pairKey = [keys %$pairings]->[0];
        my $f1 = "$filegroup->[$pairings->{$pairKey}->[0]-1]->[0]/$filegroup->[$pairings->{$pairKey}->[0]-1]->[1]"; # $f1 is the full file path
        my $f2 = "$filegroup->[$pairings->{$pairKey}->[1]-1]->[0]/$filegroup->[$pairings->{$pairKey}->[1]-1]->[1]";
        if ($debugModeQ) {
            print $septor3;
            print 'current pairings pool: ';
            foreach my $nnn (keys %$pairings) {print "$nnn | "; };
            print "\n";
            print 'current id twins: ';
            foreach my $nnn (@idTwins) {print "@$nnn | "; };
            print "\n";
            print 'f1 is: ', $f1, ' ',$pairings->{$pairKey}->[0], "\n";
            print 'f2 is: ', $f2, ' ', $pairings->{$pairKey}->[1],"\n";
        }
        if (compare($f1,$f2) == 0) {
            if ($debugModeQ) { print "Test result: is same \n";}
            push @idTwins, $pairings->{$pairKey};
            $pairings = reduce $pairings, $pairings->{$pairKey};
        } else {
            if ($debugModeQ) {print "Test result: not same \n"}
            delete $pairings->{$pairKey}
        }
    }

    if ($debugModeQ) {
        print $septor2;
        print 'id twins: ', Dumper \@idTwins; print "\n";
        print 'id groups: ', Dumper [merge \@idTwins]; print "\n";
    }

    if (0 < scalar @idTwins) {
        foreach my $indexGroup (merge \@idTwins) {
            push @dfl, [map {$filegroup->[$_-1];} @$indexGroup];
        }
    }
};

undef @pfl;

print $septor2;
print "These following files are identical:\n";
foreach my $group (@dfl) {
    for my $f (@$group) {print "$f->[0]/$f->[1]\n";};
    print "\n";
}
print $septor1;

################################################################
# sort the sublists in dfl according to the order of given dirs
# first given dir first

my @sfl = ();
foreach my $mery (@dfl) {push @sfl, [sort {$a->[3] <=> $b->[3]} @$mery];};
undef @dfl;

my $summ=0;
for my $bun (@sfl) {$summ += (scalar @$bun) -1};
my $sizz=0;
for my $bun (@sfl) {$sizz += ((scalar @$bun) -1) * $bun->[0]->[2]};

print "There are $summ redundant files, totaling $sizz bytes.\n";
# print Dumper \@sfl;

my @toDelete; # full file paths; to be deleted
for my $bunch (@sfl) {shift @$bunch; for my $f (@$bunch) {push @toDelete, "$f->[0]/$f->[1]"}};
@toDelete = sort @toDelete;
undef @sfl;
# print Dumper \@toDelete;

print "The following files (if any) will be deleted (if you used the -d option):\n";
foreach my $goner (@toDelete) {print "$goner\n";}

if ( $deleteModeQ == 1) {
    unlink @toDelete;
    print "File deletion done (if any).\n";
}

__END__
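The script above finds duplicates by pairwise byte comparison within same-sized groups. A common alternative, which is not what this script does, is to hash each file's content and group files by digest, so each file is read once regardless of group size. A minimal sketch using the standard Digest::MD5 module (the helper name content_digest is mine):

```perl
use strict;
use warnings;
use Digest::MD5;

# Return a hex digest of a file's bytes. Files with equal digests are,
# with overwhelming probability, identical, so duplicates can be found
# by grouping paths by digest in a single pass.
sub content_digest {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}
```

The trade-off: hashing reads every candidate file in full, while pairwise comparison can stop at the first differing byte, but hashing scales better when many files share a size.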