comp.mail.mh-script_for_deletion_of_duplicate_mh_messages

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



From: Bill Wohler <wohler@newt.com>
Subject: Removing duplicate messages
Newsgroups: comp.mail.mh
Date: Sun, 07 Dec 2003 00:31:54 GMT
Organization: Newt Software

I finally got around to deleting duplicate messages in my corpus of
messages. Since I had thousands of duplicates, and some were across
folders, I embellished the perl script in the MH FAQ somewhat to
accomplish my goals.

Here it is for your enjoyment.

[...]
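
For readers without the script to hand, the core idea can be sketched on its own. The version 1.2 script later in this thread keys each message on its Message-ID and remembers where the ID was first seen. The sketch below is an illustration of that idea, not the author's code: find_dups and the in-memory record layout are hypothetical stand-ins for the scan output.

```perl
use strict;
use warnings;

# find_dups(\@records, $across_folders) is a hypothetical helper.
# Each record is [folder, message number, Message-ID].
# Returns a list of [dup folder, dup msg, original folder, original msg].
sub find_dups {
    my ($records, $across_folders) = @_;

    my %first;                  # Message-ID => [folder, msg] of first sighting
    my @dups;
    foreach my $rec (@$records) {
        my ($folder, $msg, $msgid) = @$rec;
        if (my $orig = $first{$msgid}) {
            my ($f, $m) = @$orig;
            # Same folder always counts; other folders only on request.
            push(@dups, [$folder, $msg, $f, $m])
                if ($folder eq $f || $across_folders);
        } else {
            $first{$msgid} = [$folder, $msg];
        }
    }
    return @dups;
}
```

Only the first sighting of an ID is remembered, so with $across_folders false a second copy in another folder is neither reported nor recorded.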

===

From: Bill Wohler <wohler@newt.com>
Subject: Re: Removing duplicate messages
Date: Mon, 08 Dec 2003 01:30:01 GMT

Ken Yap found a small bug in mhfinddup that gave it gas on folders
with a trailing + (like c++). Here is version 1.2 which fixes that
problem.
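
The ambiguity comes from flist marking the current folder with a trailing +, which looks the same as a folder whose own name ends in +. Version 1.2 strips the marker only when the name is the current folder followed by +. A minimal standalone sketch of that check follows; strip_current_marker is a hypothetical name used for illustration and does not appear in the script.

```perl
use strict;
use warnings;

# strip_current_marker($name, $current) is a hypothetical helper.
# $name is a folder name as printed by flist, $current the current folder.
sub strip_current_marker {
    my ($name, $current) = @_;

    # quotemeta keeps a + in the folder name from acting as a regex quantifier.
    my $quoted = quotemeta($current);

    # Strip one trailing + only if the rest is exactly the current folder.
    $name =~ s/\+$// if ($name =~ /^$quoted\+$/);
    return $name;
}
```

With inbox as the current folder, inbox+ becomes inbox while c++ is left alone.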

#!/usr/bin/perl -w
#
# $Id: mhfinddup,v 1.2 2003/12/07 18:49:37 wohler Exp $

=head1 NAME

mhfinddup - find duplicate messages

=head1 SYNOPSIS

mhfinddup [options] [folder ...]

=head1 DESCRIPTION

B<mhfinddup> finds and removes duplicate MH messages in the folders listed on
the command line (default: current folder). By default, you deal with
duplicate messages interactively. You can either remove the duplicate, not
remove the duplicate, or view the original and duplicate message before
deciding.

If you use the B<-msgid> option to B<send>, your sent copy of a message and
the copy that comes back from a mailing list share a Message-ID. To preserve
both, do not list any F<+outbox> folders when you use the
B<--no-same-folder> option.

Note that if you specify one or more folders, or if you use the B<--all>
option, B<mhfinddup> recursively descends the given folders.

=head1 CONTEXT

Context is per B<flist>(1). That is, if F<+folder> is given, it will become
the current folder. If multiple folders are given, the last one specified will
become the current folder.

=head1 OPTIONS

=over 4

=item --all

Look for duplicates in all folders. If any folders are specified, this option
is ignored.

=item --debug

Turn on debugging messages.

=item --help

Display the usage of this command.

=item --list

List duplicated messages.

=item --no-same-folder

Since it is common to use C<refile -link> to file a message in multiple
folders, this script doesn't consider messages in different folders to be
duplicates. Specify this option to list or remove duplicates across folders.

=item --rmm

Remove messages non-interactively. Use with care! For safety, the B<--list>
option takes precedence if specified and is a good option to use before using
B<--rmm>.

=item --version

Display program version.

=back

=head1 RETURN VALUE

Returns 0 if all is well; non-zero otherwise.

=head1 EXAMPLES

=over 0

=item mhfinddup

Interactively remove duplicates from the current folder.

=item mhfinddup --all --list --no-same-folder

List all duplicates, regardless of whether they are in different folders.

=item mhfinddup --rmm +lists

Remove all duplicates in F<+lists>, recursively.

=back

=head1 SEE ALSO

B<rmm>(1), B<mhl>(1), B<scan>(1)

=head1 VERSION

$Revision: 1.2 $

=head1 AUTHOR

Bill Wohler <wohler at newt.com>

Copyright (c) 2003 Newt Software. All rights reserved.

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, you can find it at
http://www.gnu.org/copyleft/gpl.html or write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.

=head1 METHODS

=cut

# Packages and pragmas.
use Getopt::Long;

use strict;

# Constants.
my $cmd;                                # name by which command called
($cmd = $0) =~ s|^\./||;                # ...minus the leading ./
my $ver = '$Revision: 1.2 $';		# program version with CVS noise
$ver =~ s/\$//g;                        # strip dollar signs
$ver =~ s/Revision://;                  # strip CVS keyword
$ver =~ s/\s//g;                        # strip whitespace

# Variables (may be overridden by arguments).
my $all = 0;				# look in all folders
my $debug = 0;				# verbose mode
my $help = 0;				# display usage
my $version = 0;			# display version
my $list = 0;				# list duplicates
my $no_same_folder = 0;			# consider duplicates across folders
my $rmm = 0;				# remove duplicates without asking

# Constants.
my $mhl = "/usr/lib/mh/mhl";

# Parse command line.
# The use of the posix_default option is to ensure that folders like +a are
# not confused with --all. I'd really prefer to set prefix_pattern to "(--|-)"
# so that abbreviations of options can be used without being confused with
# folders, but I couldn't make it so.
my %opts;
Getopt::Long::Configure("pass_through", "posix_default");
GetOptions('all'		=> \$all,
	   'debug'		=> \$debug,
	   'help'		=> \$help,
	   'list'		=> \$list,
	   'no-same-folder'	=> \$no_same_folder,
	   'rmm'		=> \$rmm,
	   'version'		=> \$version,
	  ) or usage();

show_version() if ($version);
# Every remaining argument must be a folder (that is, start with +).
usage() if ($help || grep(!/^\+/, @ARGV));

my @folders = expand_folders(@ARGV);
print("Expanded " . join(" ", @ARGV) . " into\n" . join("\n", @folders) . "\n")
    if ($debug);

print("Scanning for duplicate messages...\n");
my %msgs;
foreach my $folder (sort @folders) {
    open (SCAN,
	  "MHCONTEXT=/dev/null scan +$folder -format '%(msg) %{message-id}'|")
	or die("Could not scan +$folder\n");
    while (<SCAN>) {
	if (my ($msg, $msgid) = /^(\d+) (<.*>)$/) {
	    if ($msgs{$msgid}) {
		$msgs{$msgid} =~ m|^\+(.*)/(\d+)$|;
		my($f, $m) = ($1, $2);
		if ($folder eq $f || $no_same_folder) {
		    handle_dup($f, $m, $folder, $msg);
		}
	    } else {
		$msgs{$msgid} = "+$folder/$msg";
	    }
	}
    }
    close(SCAN);
}

sub expand_folders {
    my @folders = @_;

    print("Getting list of folders...");
    open(FOLDERS,
	 "flist -recurse "
	  . (($all == 1 && @folders == 0) ? "-all" : join(" ", @folders))
	  . "|")
	or die("Could not determine folders\n");
    @folders = ();
    chomp(my $current_folder = `mhparam Current-Folder`);
    $current_folder = quotemeta($current_folder);
    while (<FOLDERS>) {
	chomp;
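	# The folder name is the first field and the message count the last;
	# the number of fields in between can vary with the flist output.
	my @fields = split;
	my ($folder, $count) = @fields[0, -1];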
	if ($folder =~ /^$current_folder\+$/) {
	    $folder =~ s/\+$//; # remove current folder indication
	}
	next if ($count == 0);
	push(@folders, $folder);
    }
    close(FOLDERS);
    print("done\n");

    return(@folders);
}

sub handle_dup {
    my($f1, $m1, $f2, $m2) = @_;

    my $ans;

 repeat:
    print("+$f2/$m2 duplicate of +$f1/$m1");

    if ($list) {
	print("\n");
    } else {
	if ($rmm) {
	    $ans = "y";
	    print("\n");
	} else {
	    print(", remove? [Yns?] ");
	    $ans = <STDIN>;
	    $ans = "n" unless (defined($ans)); # treat end of input as "no"
	    chomp($ans = lc($ans));
	}

	if ($ans eq "y" || $ans eq "") {
	    system("rmm +$f2 $m2");
	} elsif ($ans eq "s") {
	    system("$mhl `mhpath +$f1 $m1` `mhpath +$f2 $m2`");
	    goto repeat;
	} elsif ($ans eq "?") {
	    print("y, remove message (default)\n" .
		  "n, don't remove message\n" .
		  "s, show messages\n" .
		  "?, show this message\n");
	    goto repeat;
	}
    }
}
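
The script as archived here calls usage() and show_version() but the text does not include their definitions. The following is a hypothetical sketch of what such helpers might look like, not the author's code: the message wording, the exit statuses, and the usage_text and version_text helpers are all assumptions. The option summaries are taken from the POD and variable comments above.

```perl
use strict;
use warnings;

# Stand-ins for the script's own $cmd and $ver globals (values assumed).
my $cmd = 'mhfinddup';
my $ver = '1.2';

# Build the usage text from the options documented in the POD.
sub usage_text {
    my ($name) = @_;
    return join("\n",
        "Usage: $name [options] [folder ...]",
        "  --all             look for duplicates in all folders",
        "  --debug           turn on debugging messages",
        "  --help            display this usage",
        "  --list            list duplicated messages",
        "  --no-same-folder  consider duplicates across folders",
        "  --rmm             remove duplicates without asking",
        "  --version         display program version",
        "");
}

# Build the one-line version banner.
sub version_text {
    my ($name, $version) = @_;
    return "$name version $version\n";
}

# Print usage to standard error and exit non-zero (exit status assumed).
sub usage {
    print STDERR usage_text($cmd);
    exit(1);
}

# Print the version and exit successfully (exit status assumed).
sub show_version {
    print version_text($cmd, $ver);
    exit(0);
}
```

Keeping the text in separate functions means it can be checked without the process exiting.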

