This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.
From: Bill Wohler <wohler@newt.com>
Subject: Removing duplicate messages
Newsgroups: comp.mail.mh
Date: Sun, 07 Dec 2003 00:31:54 GMT
Organization: Newt Software

I finally got around to deleting duplicate messages in my corpus of
messages. Since I had thousands of duplicates, and some were across
folders, I embellished the Perl script in the MH FAQ somewhat to
accomplish my goals. Here it is for your enjoyment.

[...]

===

From: Bill Wohler <wohler@newt.com>
Subject: Re: Removing duplicate messages
Date: Mon, 08 Dec 2003 01:30:01 GMT

Ken Yap found a small bug in mhfinddup that gave it gas on folders with
a trailing + (like c++). Here is version 1.2 which fixes that problem.

#!/usr/bin/perl -w
#
# $Id: mhfinddup,v 1.2 2003/12/07 18:49:37 wohler Exp $

=head1 NAME

mhfinddup - find duplicate messages

=head1 SYNOPSIS

mhfinddup [options] [folder ...]

=head1 DESCRIPTION

B<mhfinddup> finds and removes duplicate MH messages in the folders
listed on the command line (default: current folder). By default, you
deal with duplicate messages interactively. You can either remove the
duplicate, not remove the duplicate, or view the original and
duplicate message before deciding.

If you use the B<-msgid> option to B<send>, then you probably don't
want to list any F<+outbox> folders if you are using the
B<--no-same-folder> option and you want to preserve your sent messages
as well as your messages to mailing lists.

Note that if you specify one or more folders, or if you use the
B<--all> option, B<mhfinddup> recursively descends the given folders.

=head1 CONTEXT

Context is per B<flist>(1). That is, if F<+folder> is given, it will
become the current folder. If multiple folders are given, the last one
specified will become the current folder.

=head1 OPTIONS

=over 4

=item --all

Look for duplicates in all folders. If any folders are specified, this
option is ignored.

=item --debug

Turn on debugging messages.

=item --help

Display the usage of this command.

=item --list

List duplicated messages.

=item --no-same-folder

Since it is common to use C<refile -link> to file a message in
multiple folders, this script doesn't consider messages in different
folders to be duplicates. Specify this option to list or remove
duplicates across folders.

=item --rmm

Remove messages non-interactively. Use with care! For safety, the
B<--list> option takes precedence if specified and is a good option to
use before using B<--rmm>.

=item --version

Display program version.

=back

=head1 RETURN VALUE

Returns 0 if all is well; non-zero otherwise.

=head1 EXAMPLES

=over 0

=item mhfinddup

Interactively remove duplicates from the current folder.

=item mhfinddup --all --list --no-same-folder

List all duplicates, regardless of whether they are in different
folders.

=item mhfinddup --rmm +lists

Remove all duplicates in F<+lists>, recursively.

=back

=head1 SEE ALSO

B<rmm>(1), B<mhl>(1), B<scan>(1)

=head1 VERSION

$Revision: 1.2 $

=head1 AUTHOR

Bill Wohler <wohler at newt.com>

Copyright (c) 2003 Newt Software. All rights reserved.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, you can find it at
http://www.gnu.org/copyleft/gpl.html or write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

=head1 METHODS

=cut

# Packages and pragmas.
use Getopt::Long;
use strict;

# Constants.
my $cmd;                                # name by which command called
($cmd = $0) =~ s|^\./||;                # ...minus the leading ./
my $ver = '$Revision: 1.2 $';           # program version with CVS noise
$ver =~ s/\$//g;                        # strip dollar signs
$ver =~ s/Revision://;                  # strip CVS keyword
$ver =~ s/\s//g;                        # strip whitespace

# Variables (may be overridden by arguments).
my $all = 0;                            # look in all folders
my $debug = 0;                          # verbose mode
my $help = 0;                           # display usage
my $version = 0;                        # display version
my $list = 0;                           # list duplicates
my $no_same_folder = 0;                 # consider duplicates across folders
my $rmm = 0;                            # remove duplicates without asking

# Constants.
my $mhl = "/usr/lib/mh/mhl";

# Parse command line.
# The use of the posix_default option is to ensure that folders like +a are
# not confused with --all. I'd really prefer to set prefix_pattern to "(--|-)"
# so that abbreviations of options can be used without being confused with
# folders, but I couldn't make it so.
my %opts;
Getopt::Long::Configure("pass_through", "posix_default");
GetOptions('all'            => \$all,
           'debug'          => \$debug,
           'help'           => \$help,
           'list'           => \$list,
           'no-same-folder' => \$no_same_folder,
           'rmm'            => \$rmm,
           'version'        => \$version,
           ) or usage();

show_version() if ($version);
usage() if ($help || int(@ARGV) != int(map(/^\+/, @ARGV)));

my @folders = expand_folders(@ARGV);
print("Expanded " . join(" ", @ARGV) . " into\n" .
      join("\n", @folders) . "\n") if ($debug);

print("Scanning for duplicate messages...\n");
my %msgs;
foreach my $folder (sort @folders) {
    open (SCAN, "MHCONTEXT=/dev/null scan +$folder -format '%(msg) %{message-id}'|");
    while (<SCAN>) {
        if (my ($msg, $msgid) = /^(\d+) (<.*>)$/) {
            if ($msgs{$msgid}) {
                $msgs{$msgid} =~ m|^\+(.*)/(\d+)$|;
                my($f, $m) = ($1, $2);
                if ($folder eq $f || $no_same_folder) {
                    handle_dup($f, $m, $folder, $msg);
                }
            } else {
                $msgs{$msgid} = "+$folder/$msg";
            }
        }
    }
    close(SCAN);
}

sub expand_folders {
    my @folders = @_;

    print("Getting list of folders...");
    open(FOLDERS, "flist -recurse " .
         (($all == 1 && @folders == 0) ? "-all" : join(" ", @folders)) . "|")
        or die("Could not determine folders\n");
    @folders = ();
    chomp(my $current_folder = `mhparam Current-Folder`);
    $current_folder = quotemeta($current_folder);
    while (<FOLDERS>) {
        chomp;
        my ($folder, $a, $b, $c, $d, $e, $f, $g, $count) = split;
        if ($folder =~ /^$current_folder\+$/) {
            $folder =~ s/\+$//;         # remove current folder indication
        }
        next if ($count == 0);
        push(@folders, $folder);
    }
    close(FOLDERS);
    print("done\n");

    return(@folders);
}

sub handle_dup {
    my($f1, $m1, $f2, $m2) = @_;
    my $ans;

  repeat:
    print("+$f2/$m2 duplicate of +$f1/$m1");
    if ($list) {
        print("\n");
    } else {
        if ($rmm) {
            $ans = "y";
            print("\n");
        } else {
            print(", remove? [Yns?] ");
            chomp($ans = <STDIN>);
        }
        if ($ans eq "y" || $ans eq "") {
            system("rmm +$f2 $m2");
        } elsif ($ans eq "s") {
            system("$mhl `mhpath +$f1 $m1` `mhpath +$f2 $m2`");
            goto repeat;
        } elsif ($ans eq "?") {
            print("y, remove message (default)\n" .
                  "n, don't remove message\n" .
                  "s, show messages\n" .
                  "?, show this message\n");
            goto repeat;
        }
    }
}
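
===

For anyone who wants to try the matching rule without an MH setup, here is a minimal sketch of the scanning loop above with the scan(1) output replaced by an in-memory list. The find_dups() helper, its arguments, and the sample folders and Message-IDs are all hypothetical and not part of mhfinddup; the sketch only mirrors the rule that the first message seen with a given Message-ID counts as the original, and that a later match is reported only if it is in the same folder or cross-folder matching is requested.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of mhfinddup).
# $records is a reference to a list of [folder, message number, message-id]
# entries in scan order. Returns a list of
# [orig_folder, orig_msg, dup_folder, dup_msg] entries.
sub find_dups {
    my ($records, $across_folders) = @_;
    my %first_seen;     # message-id => [folder, msg] of first occurrence
    my @dups;

    foreach my $rec (@$records) {
        my ($folder, $msg, $msgid) = @$rec;
        if (my $orig = $first_seen{$msgid}) {
            my ($f, $m) = @$orig;
            # Same rule as the script: report only same-folder matches
            # unless cross-folder matching was requested.
            push(@dups, [$f, $m, $folder, $msg])
                if ($folder eq $f || $across_folders);
        } else {
            $first_seen{$msgid} = [$folder, $msg];
        }
    }
    return @dups;
}

# Made-up sample data standing in for scan output.
my @records = (
    ["inbox", 1, '<a@example.com>'],
    ["inbox", 2, '<b@example.com>'],
    ["inbox", 3, '<a@example.com>'],
    ["lists", 1, '<b@example.com>'],
);

foreach my $across (0, 1) {
    print("across folders: $across\n");
    foreach my $dup (find_dups(\@records, $across)) {
        my ($f1, $m1, $f2, $m2) = @$dup;
        print("+$f2/$m2 duplicate of +$f1/$m1\n");
    }
}
```

With the sample data, only +inbox/3 is reported in the default mode; +lists/1 is reported as well once cross-folder matching is on, which corresponds to giving --no-same-folder to the real script.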