perl_cgi_ampserand_encoding_routines

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.

Subject: [sf-perl] yet one more challenge...
From: Josh Rabinowitz <joshr@joshr.com>
Date: Mon, 3 Jul 2000 09:51:29 -0700

Hey SFPUGgers:

OK, here's another challange, this time a real world one.

Suppose you had a perl script which took ~35 seconds to execute, and 
you profiled it with DProf, and found that a full 40% of the 
program's execution was spent in the function 
CGI::Util::simple_escape().  In CGI.pm v2.68 that function is 
declared as:

sub simple_escape {
   return unless defined (my $toencode = shift);
   $toencode =~ s{(.)}{
               if    ($1 eq '<')                            { '&lt;'    }
               elsif ($1 eq '>')                            { '&gt;'    }
               elsif ($1 eq '&')                            { '&amp;'   }
               elsif ($1 eq '"')                            { '&quot;'  }
               elsif ($1 eq "\x8b")                         { '&#139;'  }
               elsif ($1 eq "\x9b")                         { '&#155;'  }
               else                                         { $1        }
        }gsex;
   $toencode;
}

OK, here's the challenge:

Provide a snippet of code to be executed from the main package that 
overrides the above CGI::Util::simple_escape() method with a much 
faster, yet still readable and maintable version.  The 'winner' will 
be the person who submits the code that is both clearest and fastest 
in my real-world application, where that function is currently called 
14273 times, always with strings shorter than 255 characters.  I can 
provide a test driver if people want that for their testing :)

===

Subject: Re: [sf-perl] yet one more challenge...
From: John Nolan <jpnolan@sonic.net>
Date: Mon, 3 Jul 2000 10:25:17 -0700 (PDT)

Hmm.  My copy of CGI.pm 2.56 has a function called escapeHTML().   
I didn't do any tests, but this one looks like it would be 
a lot faster:

   sub escapeHTML {
       my ($self,$toencode) = self_or_default(@_);
       return undef unless defined($toencode);
       return $toencode if ref($self) && $self->{'dontescape'};
   
       $toencode=~s/&/&amp;/g;
       $toencode=~s/\"/&quot;/g;
       $toencode=~s/>/&gt;/g;
       $toencode=~s/</&lt;/g;
       return $toencode;
   }


===



Subject: Re: [sf-perl] yet one more challenge...
From: Josh Rabinowitz <joshr@joshr.com>
Date: Mon, 3 Jul 2000 10:41:14 -0700

Dear John;

Thanks! but...  The simple_escape() function I'm asking about, while 
similar to escapeHTML(), is not identical and is actually declared in 
the CGI::Util.pm library, not the CGI.pm library.

On my system, it's in the file /usr/lib/perl5/5.00503/CGI/Util.pm 
(that's why I included the full function definition in my original 
post).

Also, the contest states to "Provide a snippet of code to be executed 
from the main package that overrides the above 
CGI::Util::simple_escape() method "-- so this still needs some way to 
override the proper function without modifying my perl libraries in 
/usr/lib/perl5/...

===


Subject: Re: [sf-perl] yet one more challenge...
From: "Brad Greenlee" <brad@isyndicate.com>
Date: Mon, 3 Jul 2000 10:52:07 -0700

Wow, that's pretty damn inefficient. Here's a benchmark I did. simple_escape
is the original simple_escape subroutine. simple_escape2 is my version (see
below):

Benchmark: timing 200 iterations of simple_escape, simple_escape2...
simple_escape: 75 wallclock secs (75.44 usr +  0.09 sys = 75.53 CPU)
simple_escape2:  2 wallclock secs ( 1.50 usr +  0.01 sys =  1.51 CPU)

sub simple_escape2 {
  return unless defined (my $toencode = shift);
  $toencode =~ s{<}{&lt;}gso;
  $toencode =~ s{>}{&gt;}gso;
  $toencode =~ s{&}{&amp;}gso;
  $toencode =~ s{\"}{&quot;}gso;
  $toencode =~ s{\x8b}{&#139;}gso;
  $toencode =~ s{\x9b}{&#155;}gso;
  $toencode;
}

===

Subject: Re: [sf-perl] yet one more challenge...
From: Josh Rabinowitz <joshr@joshr.com>
Date: Mon, 3 Jul 2000 11:02:56 -0700


Dear Brad (and SFPUGgers):
Cool (and simple enough!)

 --> Now, how do we replace/override the original 
CGI::Util::simple_escape() with your simple_escape2() without 
modifying any of the distributed perl libraries?  <--

Also Brad, I'm curious what test driver you used to benchmark: Can 
you post that, too?  If 200 iterations take 75 seconds, I'm guessing 
you're passing in longgggg strings... (which is unlike my real world 
scenario below)

And can anyone do even better than Brad's simple_escape2()?  :}

===

Subject: Re: [sf-perl] yet one more challenge...
From: Josh Rabinowitz <joshr@joshr.com>
Date: Mon, 3 Jul 2000 11:51:44 -0700

Oh yeah, another criteria:

The output of your replacement function should always be equivalent 
to the output of  CGI::Util::simple_escape().  Brad's is not (note 
incorrect handling of replacement of ampersand character - the line 
"$toencode =~ s{&}{&amp;}gso;" needs to be the first search&replace 
in the function). :)

===

Subject: Re: [sf-perl] yet one more challenge...
From: Kevin Cureton <kevinc@pdi.com>
Date: Mon, 3 Jul 2000 11:46:53 -0700

Now, how do we replace/override the original 
> CGI::Util::simple_escape() with your simple_escape2() without 
> modifying any of the distributed perl libraries?  <--

If I remember correctly, this will override the simple_escape
function (along with the scalar value, array value, etc.) with
the definitions for simple_escape2.

*CGI::Util::simple_escape = *main::simple_escape2;

(this assumes simple_escape2 is in the main namespace).

===

Subject: Re: [sf-perl] yet one more challenge...
From: "Brad Greenlee" <brad@isyndicate.com>
Date: Mon, 3 Jul 2000 12:19:24 -0700

sorry for the multitude of postings, but it's a slow day here]

Here's another way to do it, FYI. Slightly slower than the fixed (so that
&amp; is done first) simple_escape2, but perhaps more elegant. It's also
faster if there's a good chance that there is nothing to escape in your
string.

This does match the output of CGI::Util::simple_escape, btw.

my %replacement_hash = (
                    '<' => '&lt;',
                    '>' => '&gt;',
                    '&' => '&amp;',
                    '"' => '&quot;',
                    '\x8b' => '&#139;',
                    '\x9b' => '&#155;');

sub simple_escape3 {
   return unless defined (my $toencode = shift);
   $toencode =~ s/([<>&"\x8b\x9b])/$replacement_hash{$1}/gs;
   $toencode;
}

Here's a test with a 26 character string. Note that the times for
simple_escape2 are faster now that I fixed the error that Josh caught (do
the &amp; substitutions first):

Benchmark: timing 500000 iterations of simple_escape2, simple_escape3...
simple_escape2: 17 wallclock secs (15.92 usr +  0.00 sys = 15.92 CPU)
simple_escape3: 19 wallclock secs (17.57 usr +  0.00 sys = 17.57 CPU)


Here's one with a 18,724 character string:

Benchmark: timing 5000 iterations of simple_escape2, simple_escape3...
simple_escape2: 28 wallclock secs (28.00 usr +  0.00 sys = 28.00 CPU)
simple_escape3: 44 wallclock secs (43.86 usr +  0.00 sys = 43.86 CPU)



===

Subject: Re: [sf-perl] yet one more challenge...
From: Josh Rabinowitz <joshr@joshr.com>
Date: Mon, 3 Jul 2000 12:12:39 -0700

Thanks Brad, Kevin!
I think the last unsolved point is:

 --> Q: Can we override just the method CGI::Util::simple_escape, and 
not the scalar, array, hash, etc with the same name? <--

Brad wrote:
>As for overriding CGI::Util::simple_escape, just explicitly import the
>subroutines you want from CGI::Util, and leave out simple_escape, then
>include your own sub simple_escape.

But I never actually call CGI::Util::simple_escape() myself, it's 
always called from inside CGI.pm, so I don't think your method of 
'sidestepping' CGI::Util::simple_escape() will work for this case.  I 
want our new version to replace the CGI::Util::simple_escape() in ALL 
cases for that run, not just for calls I explicitly make to it from 
my script ('cause I don't call it directly at all!).

According to Kevin Cureton:
>If I remember correctly, this will override the simple_escape
>function (along with the scalar value, array value, etc.) with
>the definitions for simple_escape2.
>    *CGI::Util::simple_escape = *main::simple_escape2;

THANKS AGAIN!!!

===

Subject: Re: [sf-perl] yet one more challenge...
From: Subaquaneous MonkeyBot <shift8@dnai.com>
Date: Mon, 3 Jul 2000 12:49:19 -0700 (PDT)



On Mon, 3 Jul 2000, Brad Greenlee wrote:

> [sorry for the multitude of postings, but it's a slow day here]
> 
> Here's another way to do it, FYI. Slightly slower than the fixed (so that
> &amp; is done first) simple_escape2, but perhaps more elegant. It's also
> faster if there's a good chance that there is nothing to escape in your
> string.
> 
> This does match the output of CGI::Util::simple_escape, btw.
> 
> my %replacement_hash = (
>                     '<' => '&lt;',
>                     '>' => '&gt;',
>                     '&' => '&amp;',
>                     '"' => '&quot;',
>                     '\x8b' => '&#139;',
>                     '\x9b' => '&#155;');
> 
> sub simple_escape3 {
>    return unless defined (my $toencode = shift);
>    $toencode =~ s/([<>&"\x8b\x9b])/$replacement_hash{$1}/gs;
>    $toencode;
> }

neat!  i was about to post almost identical code.  i use
something similar to this as an event handler to map matched
text to handler subroutines.

things like this illustrate the need for explicit ordering
of hash keys.

then you could say:

sub simpsons_escapade {
        return unless defined (my $text = shift);

        # hash for substitutions
        my %map = (
                '&'     =>      '&amp;',  # note the order
                '<'     =>      '&lt;',
                '>'     =>      '&gt;',
                '"'     =>      '&quot;',
                "\x8b"  =>      '&#139;',
                "\x9b"  =>      '&#155;',
        );

        # do substitutions
        my $baddie;
        foreach $baddie (keys %map) {
                $text =~ s/$baddie/$map{$baddie}/gs;
        $text;
}

===

Subject: Re: [sf-perl] yet one more challenge...
From: Josh Rabinowitz <joshr@joshr.com>
Date: Mon, 3 Jul 2000 21:07:48 -0700

SFPUGgers:
Thanks again for everyone's input.   The script now runs in approx 25 
seconds instead of 35, and CGI::Util::simple_escape() takes 4% of the 
total run time instead of %40. Hodehum!

I did this, combining Brad Greenlee and Kevin Cureton's contributions:

>  # replace CGI::Util::simple_escape() with our faster version.
>sub my_simple_escape {
>             return unless defined (my $toencode = shift);
>             $toencode =~ s{&}{&amp;}gs;
>             $toencode =~ s{<}{&lt;}gs;
>             $toencode =~ s{>}{&gt;}gs;
>             $toencode =~ s{\"}{&quot;}gs;
>             $toencode =~ s{\x8b}{&#139;}gs;
>             $toencode =~ s{\x9b}{&#155;}gs;
>             $toencode;
>}
>*CGI::Util::simple_escape = *main::my_simple_escape;

===


Subject: Re: [sf-perl] yet one more challenge...
From: "Paul Makepeace" <Paul.Makepeace@realprogrammers.com>
Date: Thu, 6 Jul 2000 00:29:31 -0700

Just as an afterthought since I missed the boat here,

> >*CGI::Util::simple_escape = *main::my_simple_escape;

is better written

*CGI::Util::simple_escape = \&my_simple_escape;

or else you'll copy the entire glob, clobbering/eximporting any scalars,
arrays, etc that might've been defined in CGI::Util. This way you only hack
on *CGI::Util::simple_parse{CODE}

> >             $toencode =~ s{&}{&amp;}gs;
> >             $toencode =~ s{<}{&lt;}gs;
> >             $toencode =~ s{>}{&gt;}gs;
> >             $toencode =~ s{\"}{&quot;}gs;
> >             $toencode =~ s{\x8b}{&#139;}gs;

A nice idiom here is for(),

for ($toencode) {
    s{&}{&amp;}gs;
    # etc
}

FWIW, if there are more than a few regexes like that, sometimes study() can
help. I tried it here but it actually slowed it down a bit. On larger inputs
or more complex regexes it can be worthwhile though. Suck it and see.

I don't know why the /s modifier is being used on these regexes. There's no
requirement for . to match a newline. Copy'n'paste?.

FWIW, I tried a C-esque hack,

my @replacement_array;
$replacement_array[ord] = $replacement_hash{$_} foreach keys
%replacement_hash;

and then

$toencode =~ s/([<>&"\x8b\x9b])/$replacement_array[ord $1]/g;

but this slowed it down slightly (8%) over the hash version (where cost of
perlops become dominant over processing the regex I guess).

Finally, one thing that actually did (minisculey, ~1%) improve the speed was
using qr//. As in:
my $lt = qr/</;
s{$lt}{&lt;}g;
# etc

This could be a bigger win for more/complex regexes. At the very least make
an amusing map / eval sub building project :)

===


Subject: Re: [sf-perl] yet one more challenge...
From: "Paul Makepeace" <Paul.Makepeace@realprogrammers.com>
Date: Mon, 24 Jul 2000 22:15:01 -0700

Well done for following up on this and convincing Lincoln!

I'm curious as to why these weird things *still* remain:

> sub my_escapeHTML {
> my ($self,$toencode) = CGI::self_or_default(@_);
> return undef unless defined($toencode);
> return $toencode if ref($self) && $self->{'dontescape'};
> if (uc $self->{'.charset'} eq 'ISO-8859-1') {
> $toencode =~ s{&}{&amp;}gso;
> $toencode =~ s{<}{&lt;}gso;
> $toencode =~ s{>}{&gt;}gso;
> $toencode =~ s{"}{&quot;}gso;
> $toencode =~ s{\x8b}{&#139;}gso;
> $toencode =~ s{\x9b}{&#155;}gso;

Why /so? There are no .'s being matched, so why the /s? There are no
variables so why the /o? It might seem trivial but this kind of copy'paste /
Cargo Cult stuff perpetuates myths about the dimmer recesses of Perl
(something the language hardly needs) and really confuses people trying to
learn from modules.

> } else {
> $toencode =~ s/(.)/'&#'.ord($1).';'/gsex;

Why the /x?

The for ($toencode) {
    s///;
    s///;
}

construct is faster too, not to mention more readable, IMO.

If a patch is going in, it's nice to make the best of it...

===

Subject: RE: [sf-perl] yet one more challenge...
From: Joshua Rabinowitz <JoshR@cnet.com>
Date: Tue, 25 Jul 2000 11:27:05 -0700

Yeah, what Paul said.  :)

RE: the /gso, I had (probably wrongly) assumed that to compiling the regex
only once would make it faster.  Also, though it _shouldn't_ contain a
newline, the search /s will handle newlines in querystrings correctly.

As for the /sex, that's actually in the CGI.pm as it stands now, if I'm not
mistaken... in other words, I didn't change that.  But it's pretty cool when
'sex' has a special meaning in a language (other than english, I mean). 

Also, for the record, I don't think Lincoln is actually taking my patch...
just backing simepl_escape() and escapeHTML() out to the way they had been
prior to Tom Christianson's recomendation of using for /(.)/.

joshr


===

Subject: Re: [sf-perl] yet one more challenge...
From: "David E. Wheeler" <David@Wheeler.net>
Date: Tue, 25 Jul 2000 12:19:18 -0700

Joshua Rabinowitz wrote:
> 
> Yeah, what Paul said.  :)
> 
> RE: the /gso, I had (probably wrongly) assumed that to compiling the regex
> only once would make it faster.

Only if you're matching on the contents of a variable, not when you're
matching literal characters.

> Also, though it _shouldn't_ contain a newline, the search /s will handle
> newlines in querystrings correctly.

A single character will never match a newline.
 
> As for the /sex, that's actually in the CGI.pm as it stands now, if I'm not
> mistaken... in other words, I didn't change that.  But it's pretty cool when
> 'sex' has a special meaning in a language (other than english, I mean).

Weird. Probably doesn't hurt any.

> Also, for the record, I don't think Lincoln is actually taking my patch...
> just backing simepl_escape() and escapeHTML() out to the way they had been
> prior to Tom Christianson's recomendation of using for /(.)/.
> 
> joshr

Untested:

sub my_escapeHTML {
    my ($self, $toencode) = CGI::self_or_default(@_);
    return unless defined($toencode);
    return $toencode if ref($self) && $self->{dontescape};
    if (uc $self->{'.charset'} eq 'ISO-8859-1') {
        for ($toencode) {
	    s/&/&amp;/g;
	    s/</&lt;|g;
	    s/>/&gt;|g;
            s/"/&quot;/g;
	    s/\x8b/&#139;/g;
	    s/\x9b/&#155;/g;
        }
    } else {
        $toencode =~ s/(.)/'&#'.ord($1).';'/ge;
    }
    return $toencode;
the rest of The Pile (a partial mailing list archive)
doom@kzsu.stanford.edu