Delete Duplicate Image Files
Suppose you have 30 thousand image files in many directories, and you want to delete the duplicates.
Here's a Perl script that solves the problem.
How to Use It
To see help:
perl del_dup.pl --help

To find dup files in a dir:
perl del_dup.pl dirpath

To find dup files in several dirs:
perl del_dup.pl dirpath1 dirpath2 dirpath3 …

To delete dup files:
perl del_dup.pl --delete dirpath
or
perl del_dup.pl --delete dirpath1 dirpath2 …

When there are duplicate files, the first one found (in the order the dirs are given) is preserved; the others are deleted.

Note: the options --help and --delete must be the first argument.
A file is considered a duplicate of another only if the two files' contents are exactly identical. If you have 2 images and one is a scaled version of the other, they are not considered identical.
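This notion of equality is byte-for-byte identity, which Perl's standard File::Compare module provides. A minimal sketch (the temp files are just for illustration):

```perl
use strict;
use warnings;
use File::Compare;
use File::Temp qw(tempfile);

# two files with identical bytes, and one whose content differs
my ($fh1, $f1) = tempfile(); print $fh1 "hello\n"; close $fh1;
my ($fh2, $f2) = tempfile(); print $fh2 "hello\n"; close $fh2;
my ($fh3, $f3) = tempfile(); print $fh3 "HELLO\n"; close $fh3;

# compare() returns 0 only when the two files are byte-for-byte identical
print "f1 vs f2: ", (compare($f1, $f2) == 0 ? "same" : "different"), "\n"; # prints "f1 vs f2: same"
print "f1 vs f3: ", (compare($f1, $f3) == 0 ? "same" : "different"), "\n"; # prints "f1 vs f3: different"
```

Note that a scaled or re-encoded copy of an image has different bytes, so it fails this test even if it looks the same.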
I have used this script regularly over the years on some 30 thousand image files. For over 30k files, it runs in under a minute.
Sample Output
perl del_dup.pl --delete C:\Users\h3\Pictures\keyboard "C:\Users\h3\Pictures\keyboard - Copy"
Adding dir to check: C:\Users\h3\Pictures\keyboard
Adding dir to check: C:\Users\h3\Pictures\keyboard - Copy
There are a total of 32 files examed.
==============================
There are 16 unique file size.
==============================
---------------------
These following files are identical:
C:\Users\h3\Pictures\keyboard/windowslogo.gif
C:\Users\h3\Pictures\keyboard - Copy/windowslogo.gif

C:\Users\h3\Pictures\keyboard/DSC_1108.jpg
C:\Users\h3\Pictures\keyboard - Copy/DSC_1108.jpg

C:\Users\h3\Pictures\keyboard/ms-sidewinder-x6-gaming-keyboard-full.jpg
C:\Users\h3\Pictures\keyboard - Copy/ms-sidewinder-x6-gaming-keyboard-full.jpg

C:\Users\h3\Pictures\keyboard/g510.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510.jpg

C:\Users\h3\Pictures\keyboard/71Uvd2tZOZL._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/71Uvd2tZOZL._AA1500_.jpg

C:\Users\h3\Pictures\keyboard/g510 red.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 red.jpg

C:\Users\h3\Pictures\keyboard/ms x4.jpg
C:\Users\h3\Pictures\keyboard - Copy/ms x4.jpg

C:\Users\h3\Pictures\keyboard/81fuOEG-2lL._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/81fuOEG-2lL._AA1500_.jpg

C:\Users\h3\Pictures\keyboard/g110.jpg
C:\Users\h3\Pictures\keyboard - Copy/g110.jpg

C:\Users\h3\Pictures\keyboard - Copy/81hTgnd037L._AA1500_.jpg
C:\Users\h3\Pictures\keyboard/81hTgnd037L._AA1500_ - Copy.jpg
C:\Users\h3\Pictures\keyboard - Copy/81hTgnd037L._AA1500_ - Copy.jpg
C:\Users\h3\Pictures\keyboard/81hTgnd037L._AA1500_.jpg

C:\Users\h3\Pictures\keyboard/lenovo_thinkpad_usb_trackpoint_keyboard-2.jpg
C:\Users\h3\Pictures\keyboard - Copy/lenovo_thinkpad_usb_trackpoint_keyboard-2.jpg

C:\Users\h3\Pictures\keyboard/g19.jpg
C:\Users\h3\Pictures\keyboard - Copy/g19.jpg

C:\Users\h3\Pictures\keyboard/g510 yellow - Copy.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 yellow - Copy.jpg

C:\Users\h3\Pictures\keyboard/g510 green.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 green.jpg

==============================
There are 16 redundant files, totaling 2396674 bytes.
The following files (if any) will be deleted (if you used the “--delete” option):
C:\Users\h3\Pictures\keyboard - Copy/71Uvd2tZOZL._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/81fuOEG-2lL._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/81hTgnd037L._AA1500_ - Copy.jpg
C:\Users\h3\Pictures\keyboard - Copy/81hTgnd037L._AA1500_.jpg
C:\Users\h3\Pictures\keyboard - Copy/DSC_1108.jpg
C:\Users\h3\Pictures\keyboard - Copy/g110.jpg
C:\Users\h3\Pictures\keyboard - Copy/g19.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 green.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 red.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510 yellow - Copy.jpg
C:\Users\h3\Pictures\keyboard - Copy/g510.jpg
C:\Users\h3\Pictures\keyboard - Copy/lenovo_thinkpad_usb_trackpoint_keyboard-2.jpg
C:\Users\h3\Pictures\keyboard - Copy/ms x4.jpg
C:\Users\h3\Pictures\keyboard - Copy/ms-sidewinder-x6-gaming-keyboard-full.jpg
C:\Users\h3\Pictures\keyboard - Copy/windowslogo.gif
C:\Users\h3\Pictures\keyboard/81hTgnd037L._AA1500_.jpg
File deletion done (if any)!
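The speed comes from the fact that the script compares file sizes before file contents (hence the “16 unique file size” line above): two files can only be identical if they have the same byte size, so the expensive content comparison is done only within groups of same-sized files. A minimal sketch of that grouping idea (the helper name group_by_size is mine, not the script's):

```perl
use strict;
use warnings;
use File::Find;

# Group file paths by byte size; only files of equal size can possibly
# be duplicates, so content comparison is needed only within each group.
sub group_by_size {
    my @dirs = @_;
    my %by_size;
    find({ no_chdir => 1,
           wanted   => sub { push @{ $by_size{ -s $_ } }, $_ if -f $_ } },
         @dirs);
    # sizes with more than one file are the duplicate candidates
    return grep { scalar @$_ > 1 } values %by_size;
}
```

For a typical photo collection, almost all sizes are unique, so very few byte comparisons remain after this step.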
Code
# -*- coding: utf-8 -*-
# perl
# delete files that are identical.
# http://xahlee.info/python/delete_dup_files.html
# Copyright 2005, 2022 by Xah Lee. All rights reserved.
# created: 2003-05
# version: 2011-07-08 added a -d and -h option. Expanded the inline doc. Other small cleanups.

##################################################

use strict;
use File::Find;
use Data::Dumper qw(Dumper);
$Data::Dumper::Indent=1;
use File::Compare;

# use Combo114 qw(combo reduce merge);
# use Genpair114 qw(parti genpair);

##################################################

# combo(n) returns a collection with elements of pairs that is all
# possible combinations of 2 things from n. For example, combo(4) returns
# {'3,4' => ['3',4],'1,2' => [1,2],'1,3' => [1,3],'1,4' => [1,4],'2,3' => ['2',3],'2,4' => ['2',4]};
# Each pair ($i,$j) returned must have $i < $j.
# Hash form is returned instead of array for this program.
sub combo {
    my $max = shift;
    my %hh=();
    for my $j ( 1 .. $max ) {
        for my $i ( 1 .. $j-1 ) {
            $hh{"$i,$j"} = [$i, $j];
        }
    }
    return \%hh;
}

# old implementation with lesser algorithm
# sub combo ($) { my $max=$_[0]; my %hh=(); for (my $j=1; $j < $max; ++$j) { for (my $i=1; $i <= $max; ++$i) { my $m = (($i+$j)-1)%$max+1; if ($i < $m){ $hh{"$i,$m"}=[$i,$m];}}} return \%hh;}

=pod

reduce( $pairings, $a_pair) returns the first argument with some pairs deleted.

Detail: we have n things, represented by numbers 1 to n. Some of these
are identical. We want to partition the range of numbers 1 to n so that
identical ones are grouped together.

To begin comparison, we generate a list of pairings that is all
possible pairings of numbers 1 to n. (Of course order does not matter,
and the pairing does not contain repetitions.) This is the first
argument to reduce. We go thru this pairings list one by one and do
comparisons, removing each pair once it has been compared. However,
more pairs can be removed if we find a pair identical.

For example, suppose we know that 2 and 4 are identical, and the
pairing list contains (2,3) and (4,3); one of them can be deleted,
because now 2 and 4 are the same thing. (We do this because we expect
the comparison operation to be expensive.)

reduce( $pairings, $a_pair) returns a reduced $pairings, knowing that
$a_pair are identical.

The first argument $pairings must be in the form of a hash:
{'1,5' => [1,5],'3,5' => [3,5],'2,4' => [2,4],'4,5' => [4,5],'1,3' => [1,3],'2,5' => [2,5],'1,2' => [1,2],'3,4' => [3,4],'2,3' => [2,3],'1,4' => [1,4]}
(Note that keys are strings of the pairs separated by a comma.)

$a_pair is a reference to a list of the form [$a,$b].

For example, if the input is the hash given above, and 2,3 are
identical, then these pairs will be deleted: 3,4 1,3 3,5. (Different
pairs may be deleted if the hash's pairs are given in a different
order, i.e. 3,4 instead of 4,3.)

The return value is a reference to a hash.

=cut

sub reduce ($$) {
    my %hh= %{$_[0]}; # e.g. {'1,2'=>[1,2],'5,6'=>[5,6],...}
    my ($j1,$j2)=($_[1]->[0],$_[1]->[1]); # e.g. [3,4]
    delete $hh{"$j1,$j2"};
    foreach my $k (keys %hh) {
        $k=~m/^(\d+),(\d+)$/;
        my ($k1,$k2)=($1,$2);
        if ($k1==$j1) { if ($j2 < $k2) { delete $hh{"$j2,$k2"}; } else { delete $hh{"$k2,$j2"}; } }
        if ($k2==$j1) { if ($k1 < $j2) { delete $hh{"$k1,$j2"}; } else { delete $hh{"$j2,$k1"}; } }
    }
    return \%hh;
}

=pod

merge($pairings) takes a list of pairs, each pair indicating the
sameness of the two indexes. Returns a partitioned list of same
indexes.

For example, merge( [ [1,2], [2,4], [5,6] ] ); means 1 and 2 are the
same, 2 and 4 are the same, therefore 1==2==4. The result returned is
[[4,2,1],[6,5]]; (ordering of the returned list and sublists is not
specified/important.)

=cut

sub merge($) {
    my @pairings = @{$_[0]}; # @pairings is, e.g. ([a,b], [c,d],...)
    my @interm; # array of hashes. For each hash, keys are numbers, values are dummy 'x'.

    # chop the first value of @pairings into @interm
    $interm[0]={$pairings[0][0]=>'x'};
    ${interm[0]}{$pairings[0][1]}='x';
    shift @pairings;

  N1: for my $aPair (@pairings) {
        for my $aGroup (@interm) {
            if (exists ${$aGroup}{$aPair->[0]}) { ${$aGroup}{$aPair->[1]}='x'; next N1; }
            if (exists ${$aGroup}{$aPair->[1]}) { ${$aGroup}{$aPair->[0]}='x'; next N1; }
        }
        push @interm, {$aPair->[0]=>'x'};
        ${interm[-1]}{$aPair->[1]}='x';
    }

    my @fin = shift @interm;

  N2: for my $group (@interm) {
        for my $newcoup (@fin) {
            foreach my $k (keys %$group) {
                if (exists ${$newcoup}{$k}) {
                    map { ${$newcoup}{$_}='x'} (keys %$group);
                    next N2;
                }
            }
        }
        push @fin, $group;
    }
    return map {[keys (%$_)]} @fin;
}

# ssss---------------------------------------------------

=pod

parti(aList, equalFunc)

Given a sortable and sorted list aList of n elements, we want to
return a list that is a range of numbers from 1 to n, partitioned by
the predicate function of equivalence equalFunc. (A predicate function
is a function that takes two arguments and returns either True or
False.)

Example:
parti([
 ['x','x','x','1'],
 ['x','x','x','2'],
 ['x','x','x','2'],
 ['x','x','x','2'],
 ['x','x','x','3'],
 ['x','x','x','4'],
 ['x','x','x','5'],
 ['x','x','x','5']],
 sub {$_[0]->[3] == $_[1]->[3]} )
returns
[[1],['2','3','4'],['5'],['6'],['7','8']];

Note, a mathematical aspect: there are certain mathematical
constraints on a function that checks equivalence. That is to say, if
a==b, then b==a. If a==b and b==c, then a==c. And, a==a. If an
equivalence function does not satisfy these, it is inconsistent and
basically gives meaningless results.

Note: this parti function requires the input to be sortable and sorted.

=cut

sub parti($$) {
    my @li = @{$_[0]};
    my $sameQ = $_[1];
    my @tray=(1);
    my @result;
    for (my $i=1; $i <= ((scalar @li)-1); $i++) {
        if (&$sameQ($li[$i-1], $li[$i])) {push @tray, $i+1}
        else {push @result, [@tray]; @tray=($i+1);}
    }
    push @result, [@tray];
    return \@result;
}

=pod

genpair($partiSet)

Given a list that is a set partitioned into subsets, generate a list
of all possible pairings of elements in any two subsets.

Example: genpair([[1],['2','3','4'],['5'],['6'],['7','8']]) returns:
[[1,'2'],[1,'3'],[1,'4'],[1,'5'],[1,'6'],[1,'7'],[1,'8'],['2','5'],['3','5'],['4','5'],['2','6'],['3','6'],['4','6'],['2','7'],['2','8'],['3','7'],['3','8'],['4','7'],['4','8'],['5','6'],['5','7'],['5','8'],['6','7'],['6','8']];

Actually this program returns a reference to a hash. The keys are of
the form "3,7".

=cut

sub genpair ($) {
    my $partiSet = $_[0]; # e.g. [[1],['2','3','4'],['5'],['6'],['7','8']];
    my %result;
    for (my $head =0; $head <= ((scalar @$partiSet)-2); $head++ ) {
        for (my $tail = $head+1; $tail <= ((scalar @$partiSet)-1); $tail++ ) {
            foreach my $ii (@{$partiSet->[$head]}) {
                foreach my $jj (@{$partiSet->[$tail]}) {
                    $result{"$ii,$jj"}= [$ii,$jj];
                }
            }
        }
    }
    return \%result;
}

##################################################
# arguments

my $helpText = q{
To find dup files in a folder:
perl del_dup.pl dirpath

To find dup files in several folders:
perl del_dup.pl dirpath1 dirpath2 dirpath3 ...

To delete dup files:
perl del_dup.pl -d dirpath
or
perl del_dup.pl -d dirpath1 dirpath2 ...

When there are duplicate files, the first one found (in the order the dir is given) is preserved, the others are deleted.

To see this help again:
perl del_dup.pl -h

Note: the options -h and -d must be first argument.
};

if (not defined $ARGV[0]) {die qq{No argument received. \n $helpText};}
if ($ARGV[0] eq q{-h}) {print $helpText; exit 0;}

my $debugModeQ = 0;

my $septor1 = q{==============================} . "\n";
my $septor2 = q{---------------------} . "\n";
my $septor3 = q{=-=-=-=-=-=-=-=-=-=-=-=-=-} . "\n";

my $deleteModeQ = $ARGV[0] eq q{-d} ? 1 : 0;

# print "total num of argv:", scalar(@ARGV), "\n";
# print "argv dump:", Dumper \@ARGV;

my @dirsToCheck = ();
for (my $i=0; $i < scalar(@ARGV); ++$i) {
    # print "i is:", $i, "\n";
    if (( $ARGV[$i] ne q{-d})) {
        if (-d $ARGV[$i]) {
            print "Adding dir to check: $ARGV[$i]\n";
            push @dirsToCheck, $ARGV[$i];
        } else {
            print "Not a dir, skipped: $ARGV[$i]\n";
        }
    }
}

##################################################
# get all files and put them in an array
# each element has the form [dir path, file name, file size, integer of dir order]

my @fileList = ();
for (my $i=0; $i < scalar(@dirsToCheck); ++$i) {
    sub wanted() {
        my $fileName = $_;
        my $dirName = $File::Find::dir;
        if ( -f $File::Find::name
             # && $fileName =~ m@\.jpg$@i # check only jpg files
           ) {
            push @fileList, [$dirName, $fileName, -s "$dirName/$fileName", $i];
        }
    }
    if (-d $dirsToCheck[$i]) { find(\&wanted, $dirsToCheck[$i]); }
}

#@fileList = sort {$a->[2] <=> $b->[2]} @fileList;
@fileList = sort {$a->[2] <=> $b->[2] || $a->[3] <=> $b->[3] } @fileList;

print qq[There are a total of ${\(scalar @fileList)} files examed.\n];
# print Dumper \@fileList;
print $septor1;

################################################################
# then partition the file list by file size.
# final result is an array @pfl, each element is another array, and each element is a file element, like this:
# ([ [ dir, name, size, dir index], ...], ...)

my @pfl; # partitioned file list, same-sized files are gathered together.

# partition the list into sublists of same-sized files
my @tray = @fileList[0];
for (my $i=1; $i < (scalar @fileList); $i++) {
    if ( (($fileList[$i-1]->[2]) == ($fileList[$i]->[2])) ) { push @tray, $fileList[$i] }
    else {push @pfl, [@tray];@tray=($fileList[$i]);}
}
push @pfl, [@tray];
undef @fileList;

print "There are ${\(scalar @pfl)} unique file size.\n";

if ($debugModeQ) {
    print "These following file groups have identical size:\n";
    foreach my $group (@pfl) {
        for my $f (@$group) {print "$f->[0]/$f->[1]\n";};
        print "\n";
    }
}

print $septor1;

################################################################
# fine-tune the partition to be identical files instead of same-sized.
# for a list of files (of same size), we want to group together identical files.
# steps:
# find its length. from now on work with indexes.
# generate the comparison list (all possible pairings of any file from different dirs)
# do comparison; if hit, reduce the comparison list, and put the pair in idTwins list.
# repeat the above until comparison list is empty.
# link the identical twins (of pairs) into groups of the same index.
# make the partition. (turn indexes back to file element [dir, name, size])

my @idTwins=(); # e.g. ([1,3],[7,9],[3,2],...)
# files are represented by indexes. In @idTwins, grouped together are indexes of the same file. i.e. files 1 3 are the same, 7 9 are the same, etc.

my @dfl=(); # partitioned duplicate file list

foreach my $filegroup (@pfl) {
    next if 1 == scalar @$filegroup;
    @idTwins=();
    my $pairings = ((scalar @dirsToCheck) > 1 ?
                    genpair parti( $filegroup , sub {$_[0]->[3] == $_[1]->[3] }) :
                    combo scalar @$filegroup );

    if ($debugModeQ){
        print $septor2;
        print 'number of same sized: ', scalar @$filegroup, "\n";
        print 'pairings: ';
        foreach my $nnn (keys %$pairings) {print "$nnn | "; };
        print "\n";
    }

    while (0 < scalar keys %$pairings) {
        my $pairKey = [keys %$pairings]->[0];
        my $f1 = "$filegroup->[$pairings->{$pairKey}->[0]-1]->[0]/$filegroup->[$pairings->{$pairKey}->[0]-1]->[1]"; # $f1 is the full file path
        my $f2 = "$filegroup->[$pairings->{$pairKey}->[1]-1]->[0]/$filegroup->[$pairings->{$pairKey}->[1]-1]->[1]";
        if ($debugModeQ) {
            print $septor3;
            print 'current pairings pool: ';
            foreach my $nnn (keys %$pairings) {print "$nnn | "; };
            print "\n";
            print 'current id twins: ';
            foreach my $nnn (@idTwins) {print "@$nnn | "; };
            print "\n";
            print 'f1 is: ', $f1, ' ',$pairings->{$pairKey}->[0], "\n";
            print 'f2 is: ', $f2, ' ', $pairings->{$pairKey}->[1],"\n";
        }
        if (compare($f1,$f2) == 0) {
            if ($debugModeQ) { print "Test result: is same \n";}
            push @idTwins, $pairings->{$pairKey};
            $pairings = reduce $pairings, $pairings->{$pairKey};
        } else {
            if ($debugModeQ) {print "Test result: not same \n"}
            delete $pairings->{$pairKey}
        }
    }

    if ($debugModeQ) {
        print $septor2;
        print 'id twins: ', Dumper \@idTwins; print "\n";
        print 'id groups: ', Dumper [merge \@idTwins]; print "\n";
    }

    if (0 < scalar @idTwins) {
        foreach my $indexGroup (merge \@idTwins) {
            push @dfl, [map {$filegroup->[$_-1];} @$indexGroup];
        }
    }
};

undef @pfl;

print $septor2;
print "These following files are identical:\n";
foreach my $group (@dfl) {
    for my $f (@$group) {print "$f->[0]/$f->[1]\n";};
    print "\n";
}
print $septor1;

################################################################
# sort the sublists in dfl according to the order of given dirs
# first given dir first

my @sfl = ();
foreach my $mery (@dfl) {push @sfl, [sort {$a->[3] <=> $b->[3]} @$mery];};
undef @dfl;

my $summ=0;
for my $bun (@sfl) {$summ += (scalar @$bun) -1};
my $sizz=0;
for my $bun (@sfl) {$sizz += ((scalar @$bun) -1) * $bun->[0]->[2]};

print "There are $summ redundant files, totaling $sizz bytes.\n";
# print Dumper \@sfl;

my @toDelete; # full file paths; to be deleted
for my $bunch (@sfl) {shift @$bunch; for my $f (@$bunch) {push @toDelete, "$f->[0]/$f->[1]"}};
@toDelete = sort @toDelete;
undef @sfl;
# print Dumper \@toDelete;

print "The following files (if any) will be deleted (if you used the -d option):\n";
foreach my $goner (@toDelete) {print "$goner\n";}

if ( $deleteModeQ == 1) {
    unlink @toDelete;
    print "File deletion done (if any).\n";
}

__END__
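The script above finds duplicates by pairwise byte comparison within same-sized groups. A common alternative, which is not what this script does, is to hash each file's content and group files by digest, so each file is read once regardless of group size. A minimal sketch using the standard Digest::MD5 module (the helper name content_digest is mine):

```perl
use strict;
use warnings;
use Digest::MD5;

# Return a hex digest of a file's bytes. Files with equal digests are,
# with overwhelming probability, identical, so duplicates can be found
# by grouping paths by digest in a single pass.
sub content_digest {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}
```

The trade-off: hashing reads every candidate file in full, while pairwise comparison can stop at the first differing byte, but hashing scales better when many files share a size.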