This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.
From: Michael Ellery <mellery@sagent.com>
To: "'sfpug@sf.pm.org'" <sfpug@sf.pm.org>
Subject: [sf-perl] Reading writing/large files
Date: Wed, 18 Oct 2000 09:55:57 -0700

Below is a code snippet I'm using to read and modify a large (~1.5G)
postscript file. I'm creating a copy here, so I realize that I'll need
something more than 2x original file size of disk space in order to be
able to do this. In any event, I'd appreciate any tips people might have
on making this simple script run faster. My initial thoughts are to get
rid of the lexical $line in favor of $_, but I don't know if that really
makes much difference. I figure someone out there must be using perl to
read through gigantic weblog files and must have dealt with some of these
issues before.... any tips are appreciated.

#################################
# Main
my $fh = new FileHandle;
my $fho = new FileHandle;
$fh->open("$InFile") || croak "Unable to open input file $InFile.\n";
$fho->open(">$OutFile") || croak "Unable to open output file $OutFile.\n";

LINES: while (my $line = <$fh>) {
    print $fho $line;
    if ($line =~ /^\s*\%\%EndProlog/i) {
        print $fho '%%DocumentMedia: cover 612 792 75 blue topsecret'."\n";
        print $fho '%%+ section 612 792 75 yellow'."\n";
        print $fho '%%+ body 612 792 75 white'."\n";
        print $fho '%%EndComments'."\n";
    }
}
$fho->close;
$fh->close;
#################################

===
Date: Wed, 18 Oct 2000 19:55:28 -0700 (PDT)
From: "Matthew D. P. K. Strelchun-Lanier" <matt@lanier.org>
To: "'sfpug@sf.pm.org'" <sfpug@sf.pm.org>
Subject: Re: [sf-perl] Reading writing/large files

the first thing i'd do is to stop using the FileHandle module. there is
no reason to use it here. get rid of the 'new FileHandle' lines, and
change your open lines to:

    open (my $fh, "<$InFile") ...

everything else works out the same. the reason that your method is so
slow is that it's calling an object method for every darn line, which
isn't what perl's fastest at.
your suggestion to use the default variable rather than an explicit
variable is not quite correct. though i don't have a version of perl with
-D compiled in to look at the parsed opcode tree, i speculate that the
only difference you'd see is in the initial compilation phase and no
difference at run time.

===
Date: Wed, 18 Oct 2000 21:08:05 -0700 (PDT)
From: David Lowe <dlowe@pootpoot.com>
To: Michael Ellery <mellery@sagent.com>
Subject: Re: [sf-perl] Reading writing/large files

Michael et al. -

I have some suggestions, but first, I think what really needs to be done
is to re-think things a bit. Your biggest performance win, I imagine,
would be to read much larger chunks of the file into memory and work with
those, rather than going back out to disk repeatedly. (What size buffer
does <> use, internally? Can it be increased?)

I can see a few very small changes that could be made to (slightly)
improve performance:

1. Get rid of the FileHandle module; it imports (under 5.6, at least) the
   IO::* family, as well as incurring a dereference operation at every
   use. Use bare filehandles for the best Perl can do. (Matt's suggestion
   of a scalar filehandle, while faster than FileHandle, still causes a
   dereference at every use of the object - try:

       open(my $foo, "foo.pl"); print "$foo\n";

   to see that $foo becomes a reference to a GLOB... also, this only
   works under 5.6.0 and later.)

2. It is slightly faster to pass in multiple strings to print than to
   concatenate into one string. It should also be slightly faster to call
   print only once, rather than several times.

3. (Doubtfully) it could be worthwhile to explicitly compile the regexp
   only once, using qr//. Benchmark it!

Switching to $_ should not make a (positive) difference.

Here's what I get, with these changes - it should be a bit faster,
assuming the input is gigantic.
Try it & see:

--- begin
open(FH, "$InFile") || croak "Unable to open input file $InFile.\n";
open(FHO, ">$OutFile") || croak "Unable to open output file $OutFile.\n";

my $pattern = qr/^\s*\%\%EndProlog/i; # compile the regexp

while (my $line = <FH>) {
    if ($line =~ $pattern) {
        print FHO $line,
            '%%DocumentMedia: cover 612 792 75 blue topsecret', "\n",
            '%%+ section 612 792 75 yellow', "\n",
            '%%+ body 612 792 75 white', "\n",
            '%%EndComments', "\n";
    } else {
        print FHO $line;
    }
}
--- end

===
Date: Thu, 19 Oct 2000 01:10:36 -0700 (PDT)
From: Quinn Weaver <qweaver@vovida.com>
To: "'sfpug@sf.pm.org'" <sfpug@sf.pm.org>
Subject: Re: [sf-perl] Reading writing/large files

Excellent suggestions, Matt and David. The regex is the thing that jumped
out at me immediately, but the filehandle dereference is nasty as well.
Objects. :P

On Wed, 18 Oct 2000, David Lowe wrote:
> 3. (Doubtfully) it could be worthwhile to explicitly compile the regexp

I suspect this is a bigger win than you think. As it stands, the regex
is being compiled on every trip around the loop--that's once per line of
input!

My meager addition is that you might want to study() the regex before
compiling it. Larry only knows what it does (or maybe Ilya), but it's
supposed to help sometimes. Sometimes the ways of Perl are inscrutable. ;)

===
From: John Nolan <jpnolan@sonic.net>
Subject: Re: [sf-perl] Reading writing/large files
To: sfpug@sf.pm.org
Date: Thu, 19 Oct 2000 06:41:27 -0700 (PDT)

> > 3. (Doubtfully) it could be worthwhile to explicitly compile the regexp
>
> I suspect this is a bigger win than you think. As it stands, the regex
> is being compiled on every trip around the loop--that's once per line of
> input!

This is actually not true. The original regex did not contain any
variables, so it will not be recompiled at run time in any case.

===
Date: Thu, 19 Oct 2000 14:04:14 -0700 (PDT)
From: David Lowe <dlowe@pootpoot.com>
To: John Nolan <jpnolan@sonic.net>
Subject: Re: [sf-perl] Reading writing/large files

Interestingly enough, a quick benchmark seems to show that, at least
under these circumstances, pre-compiling using qr// slows things down
slightly. My benchmark program is attached (requires a
/usr/share/dict/words word list). Adding an o flag to the inline regexp
(the other way to pre-compile a regexp in Perl) didn't change the
performance significantly (this backs up John's statement that regexps
without variables are already only compiled once by perl...)

So, I take it back: under these circumstances, using qr// actually slows
your loop down a tad. My guess is that the overhead comes from looking up
the contents of a scalar variable. Any other reasons this might slow
things down?

Take it easy...

===
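
A minimal sketch of how the inline-regexp vs. qr// comparison discussed in
this thread could be reproduced with the standard Benchmark module. This is
not the benchmark program mentioned in the last message, which does not
appear in this text. The synthetic input lines, the line count, and the sub
names count_inline and count_qr are assumptions made for illustration, and
results will likely vary with the perl version and the input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic input (an assumption): many non-matching lines plus one match,
# held in memory so that disk I/O does not dominate the timing.
my @lines = (("gsave 0 0 moveto show grestore\n") x 10_000, "%%EndProlog\n");

# The pattern from the thread, precompiled once with qr//.
my $pattern = qr/^\s*\%\%EndProlog/i;

# Count matching lines with the regexp written inline.
sub count_inline {
    my $n = 0;
    for my $line (@lines) {
        $n++ if $line =~ /^\s*\%\%EndProlog/i;
    }
    return $n;
}

# Count matching lines with the precompiled regexp object.
sub count_qr {
    my $n = 0;
    for my $line (@lines) {
        $n++ if $line =~ $pattern;
    }
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'inline'      => \&count_inline,
    'precompiled' => \&count_qr,
});
```

Both subs do the same work, so any difference in the printed rates comes
from how the regexp is supplied. Timing the match in isolation like this
says nothing about the cost of reading a 1.5G file, which would need to be
measured separately.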