sfpug-perl_reading_writing_large_files

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



From: Michael Ellery <mellery@sagent.com>
To: "'sfpug@sf.pm.org'" <sfpug@sf.pm.org>
Subject: [sf-perl] Reading writing/large files
Date: Wed, 18 Oct 2000 09:55:57 -0700


Below is a code snippet I'm using to read and modify a large (~1.5G) postscript
file.  I'm creating a copy here, so I realize that I'll need more than 2x the
original file size in disk space in order to be able to do this.  In
any event, I'd appreciate any tips people might have on making this simple
script run faster.  My initial thought is to get rid of the lexical $line
in favor of $_, but I don't know if that really makes much difference.  I
figure someone out there must be using perl to read through gigantic weblog
files and must have dealt with some of these issues before....

any tips are appreciated.

#################################
# Main 
my $fh  = new FileHandle;
my $fho = new FileHandle;

$fh->open("$InFile") || croak "Unable to open input file $InFile.\n";
$fho->open(">$OutFile") || croak "Unable to open output file $OutFile.\n";

LINES: while (my $line = <$fh>) {
  print $fho $line; 
  
  if ($line =~ /^\s*\%\%EndProlog/i) {
    print $fho '%%DocumentMedia: cover 612 792 75 blue topsecret'."\n";
    print $fho '%%+ section 612 792 75 yellow'."\n";
    print $fho '%%+ body 612 792 75 white'."\n";
    print $fho '%%EndComments'."\n";
  }
}

$fho->close;
$fh->close;
#################################

===

Date: Wed, 18 Oct 2000 19:55:28 -0700 (PDT)
From: "Matthew D. P. K. Strelchun-Lanier" <matt@lanier.org>
To: "'sfpug@sf.pm.org'" <sfpug@sf.pm.org>
Subject: Re: [sf-perl] Reading writing/large files

the first thing i'd do is to stop using the FileHandle module.  there is
no reason to use it here.

get rid of the 'new FileHandle' lines, and change your open lines to:

open (my $fh, "<$InFile") ...

everything else works out the same.
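applied to the original snippet, the change looks something like the sketch
below.  (the in.ps/out.ps names and the stand-in input are just for the demo;
michael's script gets its names from $InFile/$OutFile.  lexical filehandles
in open() need perl 5.6 or later.)

```perl
#!/usr/bin/perl -w
use strict;

# hypothetical file names for the demo
my $InFile  = "in.ps";
my $OutFile = "out.ps";

# build a tiny stand-in input so the sketch runs on its own
open(my $make, ">$InFile") or die "can't create $InFile: $!\n";
print $make "%!PS-Adobe-3.0\n%%EndProlog\nshowpage\n";
close $make;

# lexical filehandles in place of FileHandle objects
open(my $fh,  "<$InFile")  or die "Unable to open input file $InFile.\n";
open(my $fho, ">$OutFile") or die "Unable to open output file $OutFile.\n";

while (my $line = <$fh>) {
  print $fho $line;
  if ($line =~ /^\s*\%\%EndProlog/i) {
    print $fho '%%DocumentMedia: cover 612 792 75 blue topsecret', "\n";
  }
}

close $fho;
close $fh;
```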

the reason that your method is so slow is that it's calling an object
method for every darn line, which isn't what perl is fastest at.

your suggestion to use the default variable rather than an explicit
variable is not quite correct.  though i don't have a version of perl with
-D compiled in to look at the parsed opcode tree, i speculate that the
only difference you'd see is in the initial compilation phase and no
difference at run time.

===

Date: Wed, 18 Oct 2000 21:08:05 -0700 (PDT)
From: David Lowe <dlowe@pootpoot.com>
To: Michael Ellery <mellery@sagent.com>
Subject: Re: [sf-perl] Reading writing/large files

Michael et al. -

I have some suggestions, but first, I think what really needs to be done
is to re-think things a bit.  Your biggest performance win, I imagine,
would be to read much larger chunks of the file into memory and work with
those, rather than going back out to disk repeatedly.  (What size buffer
does <> use, internally?  Can it be increased?)
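One way to read bigger chunks while keeping the <> idiom is to set $/ to a
reference to an integer, which makes each read return a fixed-length record
instead of a line (see perlvar; available since 5.005).  A sketch, using a
synthetic 100K file in place of the real input - note this loses line
boundaries, so the %%EndProlog match would need to handle a marker straddling
two records:

```perl
#!/usr/bin/perl -w
use strict;

# hypothetical demo file; a real run would point at the 1.5G postscript
my $file = "big.ps";
open(my $make, ">$file") or die "can't create $file: $!\n";
print $make "x" x 100_000;
close $make;

open(my $fh, "<$file") or die "can't open $file: $!\n";
my $chunks = 0;
{
  local $/ = \65536;              # each <> now returns a 64K record
  while (my $chunk = <$fh>) {
    $chunks++;                    # process the chunk here
  }
}
close $fh;
print "read $chunks chunks\n";    # 100_000 bytes at 64K per read -> 2 chunks
```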

I can see a few very small changes that could be made to (slightly)
improve performance:

1. Get rid of the FileHandle module; it imports (under 5.6, at least) the
IO::* family, as well as incurring a dereference operation at every
use.  Use bare filehandles for the best Perl can do.  (Matt's suggestion
of a scalar filehandle, while faster than FileHandle, still causes a
dereference at every use of the object - try:
    open(my $foo, "foo.pl");
    print "$foo\n";
to see that $foo becomes a reference to a GLOB... also, this only works
under 5.6.0 and later.)

2. It is slightly faster to pass in multiple strings to print than to
concatenate into one string.  It should also be slightly faster to call
print only once, rather than several times.

3. (Doubtfully) it could be worthwhile to explicitly compile the regexp
only once, using qr//.  Benchmark it!

Switching to $_ should not make a (positive) difference.

Here's what I get, with these changes - it should be a bit faster,
assuming the input is gigantic.  Try it & see:

--- begin
open(FH,  "$InFile")   || croak "Unable to open input file $InFile.\n";
open(FHO, ">$OutFile") || croak "Unable to open output file $OutFile.\n";

my $pattern = qr/^\s*\%\%EndProlog/i; # compile the regexp

while (my $line = <FH>) {
  if ($line =~ $pattern) {  
    print FHO $line, '%%DocumentMedia: cover 612 792 75 blue topsecret', "\n",
              '%%+ section 612 792 75 yellow', "\n",
              '%%+ body 612 792 75 white', "\n",
              '%%EndComments', "\n";
  } else {
    print FHO $line;
  }
}
--- end

===

Date: Thu, 19 Oct 2000 01:10:36 -0700 (PDT)
From: Quinn Weaver <qweaver@vovida.com>
To: "'sfpug@sf.pm.org'" <sfpug@sf.pm.org>
Subject: Re: [sf-perl] Reading writing/large files

Excellent suggestions, Matt and David.  The regex is the thing that jumped
out at me immediately, but the filehandle dereference is nasty as well.
Objects. :P

On Wed, 18 Oct 2000, David Lowe wrote:
> 3. (Doubtfully) it could be worthwhile to explicitly compile the regexp

I suspect this is a bigger win than you think.  As it stands, the regex
is being compiled on every trip around the loop--that's once per line of
input!

My meager addition is that you might want to study() the target string
before matching against it.  Larry only knows what it does (or maybe Ilya),
but it's supposed to help sometimes.  Sometimes the ways of Perl are
inscrutable. ;)

===

From: John Nolan <jpnolan@sonic.net>
Subject: Re: [sf-perl] Reading writing/large files
To: sfpug@sf.pm.org
Date: Thu, 19 Oct 2000 06:41:27 -0700 (PDT)


> > 3. (Doubtfully) it could be worthwhile to explicitly compile the regexp
> 
> I suspect this is a bigger win than you think.  As it stands, the regex
> is being compiled on every trip around the loop--that's once per line of
> input!


This is actually not true.  The original regex did not contain
any variables, so it will not be recompiled at run time in any case.

===

Date: Thu, 19 Oct 2000 14:04:14 -0700 (PDT)
From: David Lowe <dlowe@pootpoot.com>
To: John Nolan <jpnolan@sonic.net>
Subject: Re: [sf-perl] Reading writing/large files


Interestingly enough, a quick benchmark seems to show that, at least under
these circumstances, pre-compiling using qr// slows things down slightly.  
My benchmark program is attached (requires a /usr/share/dict/words word
list).  Adding an /o flag to the inline regexp (the other way to
pre-compile a regexp in Perl) didn't change the performance significantly
(this backs up John's statement that regexps without variables are already
only compiled once by perl...)

So, I take it back: under these circumstances, using qr// actually slows
your loop down a tad.

My guess is that the overhead comes from looking up the contents of a
scalar variable.  Any other reasons this might slow things down?
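The attached benchmark isn't reproduced here, but a minimal comparison along
the same lines might look like the sketch below, using the standard Benchmark
module and a synthetic line list in place of /usr/share/dict/words (the list
contents and iteration counts are made up for illustration):

```perl
#!/usr/bin/perl -w
use strict;
use Benchmark qw(timethese);

# synthetic stand-in for a real word list: 2000 hits, 2000 misses
my @lines = ("%%EndProlog\n", "some other line\n") x 2000;

my $pat = qr/^\s*\%\%EndProlog/i;   # pre-compiled once

timethese(200, {
  # inline regexp, compiled once by perl since it has no variables
  'inline' => sub {
    my $hits = 0;
    $hits++ for grep { /^\s*\%\%EndProlog/i } @lines;
  },
  # pre-compiled qr// object looked up through a scalar each match
  'qr' => sub {
    my $hits = 0;
    $hits++ for grep { $_ =~ $pat } @lines;
  },
});
```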

Take it easy...

===

the rest of The Pile (a partial mailing list archive)

doom@kzsu.stanford.edu