modperl_memory_usage_vs_speedycgi

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



Date: Thu, 21 Dec 2000 14:21:10 +0800
To: mod_perl list <modperl@apache.org>
From: Gunther Birznieks <gunther@extropia.com>
Subject: Fwd: [speedycgi] Speedycgi scales better than mod_perl with
  scripts that contain un-shared memory

FYI --

Sam just posted this to the speedycgi list just now.

>X-Authentication-Warning: www.newlug.org: majordom set sender to 
>owner-speedycgi@newlug.org using -f
>To: speedycgi@newlug.org
>Subject: [speedycgi] Speedycgi scales better than mod_perl with scripts 
>that contain un-shared memory
>Date: Wed, 20 Dec 2000 20:18:37 -0800
>From: Sam Horrocks <sam@daemoninc.com>
>Sender: owner-speedycgi@newlug.org
>Reply-To: speedycgi@newlug.org
>
>Just a point in speedy's favor, for anyone interested in performance tuning
>and scalability.
>
>A lot of mod_perl performance tuning involves trying to keep from creating
>"un-shared" memory - that is memory that a script uses while handling
>a request that is private to that script.  All perl scripts use some
>amount of un-shared memory - anything derived from user-input to the
>script via queries or posts for example has to be un-shared because it
>is unique to that run of that script.
>
>You can read all about mod_perl shared memory issues at:
>
>     http://perl.apache.org/guide/performance.html#Sharing_Memory
>
>The underlying problem in mod_perl is that apache likes to spread out
>web requests to as many httpd's, and therefore as many mod_perl interpreters,
>as possible using an LRU selection process for picking httpd's.  For
>static web-pages where there is almost zero un-shared memory, the selection
>process doesn't matter much.  But when you load in a perl script with
>un-shared memory, it can really bog down the server.
>
>In SpeedyCGI's case, all perl memory is un-shared because there's no
>parent to pre-load any of the perl code into memory.  It could benefit
>somewhat from reducing this amount of un-shared memory if it had such
>a feature, but the fact that SpeedyCGI chooses backends using an MRU
>selection process means that it is much less prone to problems that
>un-shared memory can cause.
>
>I wanted to see how this played out in real benchmarks, so I wrote the
>following test script that uses un-shared memory:
>
>use CGI;
>$x = 'x' x 50000;       # Use some un-shared memory (*not* a memory leak)
>my $cgi = CGI->new();
>print $cgi->header();
>print "Hello ";
>print "World";
>
>I then ran ab to benchmark how well mod_speedycgi did versus mod_perl
>on this script.  When using no concurrency ("ab -c 1 -n 10000")
>mod_speedycgi and mod_perl come out about the same.  However, by
>increasing the concurrency level, I found that mod_perl performance drops
>off drastically, while mod_speedycgi does not.  In my case at about level
>100, the rps number drops by 50% and the system starts paging to disk
>while using mod_perl, whereas the mod_speedycgi numbers stay at about
>the same level.
>
>The problem is that at a high concurrency level, mod_perl is using lots
>and lots of different perl-interpreters to handle the requests, each
>with its own un-shared memory.  It's doing this due to its LRU design.
>But with SpeedyCGI's MRU design, only a few speedy_backends are being used
>because as much as possible it tries to use the same interpreter over and
>over and not spread out the requests to lots of different interpreters.
>Mod_perl is using lots of perl-interpreters, while speedycgi is only using
>a few.  mod_perl is requiring that lots of interpreters be in memory in
>order to handle the requests, whereas speedy only requires a small number
>of interpreters to be in memory.  And this is where the paging comes in -
>at a high enough concurrency level, mod_perl starts using lots of memory
>to hold all of those interpreters, eventually running out of real memory
>and at that point it has to start paging.  And when the paging starts,
>the performance really nose-dives.
>
>With SpeedyCGI, at the same concurrency level, the total memory
>requirements for all the interpreters are much much smaller.  Eventually
>under a large enough load and with enough un-shared memory, SpeedyCGI
>would probably have to start paging too.  But due to its design the point
>at which SpeedyCGI will start doing this is at a much higher level than
>with mod_perl.

===

Date: Thu, 21 Dec 2000 02:01:48 -0600
Message-ID: <20001221020150-r01010600-10f24e81@10.0.0.2>
From: "Ken Williams" <ken@forum.swarthmore.edu>
To: "mod_perl list" <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with  scripts that contain un-shared memory

Well then, why doesn't somebody just make an Apache directive to control how
hits are divvied out to the children?  Something like 

  NextChild most-recent
  NextChild least-recent
  NextChild (blah...)

but more well-considered in name.  Not sure whether a config directive
would do it, or whether it would have to be a startup command-line
switch.  Or maybe a directive that can only happen in a startup config
file, not a .htaccess file.


===

Date: Thu, 21 Dec 2000 00:41:18 -0800
From: Perrin Harkins <perrin@primenet.com>
Reply-To: perrin@primenet.com
To: Gunther Birznieks <gunther@extropia.com>, Sam Horrocks <sam@daemoninc.com>
CC: mod_perl list <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

Gunther Birznieks wrote:
> Sam just posted this to the speedycgi list just now.
[...]
> >The underlying problem in mod_perl is that apache likes to spread out
> >web requests to as many httpd's, and therefore as many mod_perl interpreters,
> >as possible using an LRU selection process for picking httpd's.

Hmmm... this doesn't sound right.  I've never looked at the code in
Apache that does this selection, but I was under the impression that the
choice of which process would handle each request was an OS dependent
thing, based on some sort of mutex.

Take a look at this: http://httpd.apache.org/docs/misc/perf-tuning.html

Doesn't that appear to be saying that whichever process gets into the
mutex first will get the new request?  In my experience running
development servers on Linux it always seemed as if the requests
would continue going to the same process until a request came in when
that process was already busy.

As I understand it, the implementation of "wake-one" scheduling in the
2.4 Linux kernel may affect this as well.  It may then be possible to
skip the mutex and use unserialized accept for single socket servers,
which will definitely hand process selection over to the kernel.

> >The problem is that at a high concurrency level, mod_perl is using lots
> >and lots of different perl-interpreters to handle the requests, each
> >with its own un-shared memory.  It's doing this due to its LRU design.
> >But with SpeedyCGI's MRU design, only a few speedy_backends are being used
> >because as much as possible it tries to use the same interpreter over and
> >over and not spread out the requests to lots of different interpreters.
> >Mod_perl is using lots of perl-interpreters, while speedycgi is only using
> >a few.  mod_perl is requiring that lots of interpreters be in memory in
> >order to handle the requests, whereas speedy only requires a small number
> >of interpreters to be in memory.

This test - building up unshared memory in each process - is somewhat
suspect since in most setups I've seen, there is a very significant
amount of memory being shared between mod_perl processes.  Regardless,
the explanation here doesn't make sense to me.  If we assume that each
approach is equally fast (as Sam seems to say earlier in his message)
then it should take an equal number of speedycgi and mod_perl processes
to handle the same concurrency.

That leads me to believe that what's really happening here is that
Apache is pre-forking a bit over-zealously in response to a sudden surge
of traffic from ab, and thus has extra unused processes sitting around
waiting, while speedycgi is avoiding this situation by waiting for
someone to try and use the processes before forking them (i.e. no
pre-forking).  The speedycgi way causes a brief delay while new
processes fork, but doesn't waste memory.  Does this sound like a
plausible explanation to folks?

This is probably all a moot point on a server with a properly set
MaxClients and Apache::SizeLimit that will not go into swap.  I would
expect mod_perl to have the advantage when all processes are
fully-utilized because of the shared memory.  It would be cool if
speedycgi could somehow use a parent process model and get the shared
memory benefits too.  Speedy seems like it might be more attractive to
ISPs, and it would be nice to increase interoperability between the two
projects.

===
Date: Thu, 21 Dec 2000 08:40:47 +0000 (GMT)
From: Matt Sergeant <matt@sergeant.org>
To: Ken Williams <ken@forum.swarthmore.edu>
cc: mod_perl list <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with 
 scripts that contain un-shared memory

On Thu, 21 Dec 2000, Ken Williams wrote:

> Well then, why doesn't somebody just make an Apache directive to control how
> hits are divvied out to the children?  Something like
>
>   NextChild most-recent
>   NextChild least-recent
>   NextChild (blah...)
>
> but more well-considered in name.  Not sure whether a config directive
> would do it, or whether it would have to be a startup command-line
> switch.  Or maybe a directive that can only happen in a startup config
> file, not a .htaccess file.

Probably nobody wants to do it because Apache 2.0 fixes this "bug".

===

Date: Thu, 21 Dec 2000 19:38:45 +0800
To: Sam Horrocks <sam@daemoninc.com>, perrin@primenet.com
From: Gunther Birznieks <gunther@extropia.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl
  withscripts that contain un-shared memory 
Cc: mod_perl list <modperl@apache.org>

I think you could actually make speedycgi even better for shared memory 
usage by creating a special directive which would indicate to speedycgi to 
preload a series of modules. And then to tell speedy cgi to do forking of 
that "master" backend preloaded module process and hand control over to 
that forked process whenever you need to launch a new process.

Then speedy would potentially have the best of both worlds.

Sorry I cross posted your thing. But I do think it is a problem of mod_perl 
also, and I am happily using speedycgi in production on at least one 
commercial site where mod_perl could not be installed so easily because of 
infrastructure issues.

I believe your mechanism of round robining among MRU perl interpreters is 
actually also accomplished by ActiveState's PerlEx (based on 
Apache::Registry but using multithreaded IIS and pool of Interpreters). A 
method similar to this will be used in Apache 2.0 when Apache is 
multithreaded and therefore can control within program logic which Perl 
interpreter gets called from a pool of Perl interpreters.

It just isn't so feasible right now in Apache 1.0 to do this. And sometimes 
people forget that mod_perl came about primarily for writing handlers in 
Perl not as an application environment although it is very good for the 
latter as well.

I think SpeedyCGI needs more advocacy from the mod_perl group because put 
simply speedycgi is way easier to set up and use than mod_perl and will 
likely get more PHP people using Perl again. If more people rely on Perl 
for their fast websites, then you will get more people looking for more 
power, and by extension more people using mod_perl.

Whoops... here we go with the advocacy thing again.

===


To: modperl@apache.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with  scripts that contain un-shared memory
From: Joe Schaefer <joe@sunstarsys.com>
Date: 21 Dec 2000 09:53:06 -0500


[ Sorry for accidentally spamming people on the
  list.  I was ticked off by this "benchmark",
  and accidentally forgot to clean up the reply 
  names.  I won't let it happen again :(  ]

Matt Sergeant <matt@sergeant.org> writes:

> On Thu, 21 Dec 2000, Ken Williams wrote:
> 
> > Well then, why doesn't somebody just make an Apache directive to control how
> > hits are divvied out to the children?  Something like
> >
> >   NextChild most-recent
> >   NextChild least-recent
> >   NextChild (blah...)
> >
> > but more well-considered in name.  Not sure whether a config directive
> > would do it, or whether it would have to be a startup command-line
> > switch.  Or maybe a directive that can only happen in a startup config
> > file, not a .htaccess file.
> 
> Probably nobody wants to do it because Apache 2.0 fixes this "bug".
> 

KeepAlive On

:)

All kidding aside, the problem with modperl is memory consumption, 
and to use modperl seriously, you currently have to code around 
that (preloading commonly used modules like CGI, or running it in 
a frontend/backend config similar to FastCGI.)  FastCGI and modperl
are fundamentally different technologies.  Both have the ability
to accelerate CGI scripts;  however, modperl can do quite a bit
more than that. 

Claimed benchmarks that are designed to exploit this memory issue 
are quite silly, especially when the actual results are never 
revealed. It's overzealous advocacy or FUD, depending on which 
side of the fence you are sitting on.

===

To: Gunther Birznieks <gunther@extropia.com>
Cc: modperl@apache.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with  scripts that contain un-shared memory
From: Joe Schaefer <joe@sunstarsys.com>
Date: 21 Dec 2000 10:37:28 -0500

Gunther Birznieks <gunther@extropia.com> writes:

> But instead he crafted an experiment to show that in this particular case 
> (and some applications do satisfy this case) SpeedyCGI has a particular 
> benefit.

And what do I have to do to repeat it? Unlearn everything in Stas'
guide?

> 
> This is why people use different tools for different jobs -- because 
> architecturally they are designed for different things. SpeedyCGI is 
> designed in a different way from mod_perl. What I believe Sam is saying is 
> that there is a particular real-world scenario where SpeedyCGI likely has 
> better performance benefits than mod_perl.

Sure, and that's why some people use it.  But to say

"Speedycgi scales better than mod_perl with  scripts that contain un-shared memory"

is to me quite similar to saying

"SUV's are better than cars since they're safer to drive drunk in."

> 
> Discouraging the posting of experimental information like this is where the 
> FUD will lie. This isn't an advertisement in ComputerWorld by Microsoft or 
> Oracle, it's a posting on a mailing list. Open for discussion.

Maybe I'm wrong about this, but I didn't see any mention of the 
apparatus used in his experiment.  I only saw what you posted,
and your post had only anecdotal remarks of results without
detailing any config info.

I'm all for free and open discussions because they can
point to interesting new ideas.  However, some attempt at 
full disclosure (comments on the config used are at least as important
as anecdotal remarks about the results) is
necessary so objective opinions can be formed.

===

Date: Thu, 21 Dec 2000 23:24:43 +0800
To: Joe Schaefer <joe@sunstarsys.com>, modperl@apache.org
From: Gunther Birznieks <gunther@extropia.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl
  with  scripts that contain un-shared memory

At 09:53 AM 12/21/00 -0500, Joe Schaefer wrote:

>[ Sorry for accidentally spamming people on the
>   list.  I was ticked off by this "benchmark",
>   and accidentally forgot to clean up the reply
>   names.  I won't let it happen again :(  ]

Not sure what you mean here. Some people like the duplicate reply names 
especially as the mod_perl list is still a bit slow on responding. I know I 
prefer to see replies to my messages ASAP and they tend to come faster if I 
am CCed on the list.

>All kidding aside, the problem with modperl is memory consumption,
>and to use modperl seriously, you currently have to code around
>that (preloading commonly used modules like CGI, or running it in
>a frontend/backend config similar to FastCGI.)  FastCGI and modperl
>are fundamentally different technologies.  Both have the ability
>to accelerate CGI scripts;  however, modperl can do quite a bit
>more than that.
>
>Claimed benchmarks that are designed to exploit this memory issue
>are quite silly, especially when the actual results are never
>revealed. It's overzealous advocacy or FUD, depending on which
>side of the fence you are sitting on.

I think I get your point on the first paragraph. But the 2nd paragraph is 
odd. Are you classifying the original post as being overzealous advocacy or 
FUD? I don't think I would classify it as such.

I could see it bordering on FUD if there was one benchmark which Sam 
produced and he just posted "SpeedyCGI is faster than mod_perl" without 
providing any details.

But instead he crafted an experiment to show that in this particular case 
(and some applications do satisfy this case) SpeedyCGI has a particular 
benefit.

This is why people use different tools for different jobs -- because 
architecturally they are designed for different things. SpeedyCGI is 
designed in a different way from mod_perl. What I believe Sam is saying is 
that there is a particular real-world scenario where SpeedyCGI likely has 
better performance benefits than mod_perl.

Discouraging the posting of experimental information like this is where the 
FUD will lie. This isn't an advertisement in ComputerWorld by Microsoft or 
Oracle, it's a posting on a mailing list. Open for discussion.

===   

Date: Thu, 21 Dec 2000 11:11:03 -0500
To: "mod_perl list" <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with  scripts that contain un-shared memory

>>>>> "KW" == Ken Williams <ken@forum.swarthmore.edu> writes:

KW> Well then, why doesn't somebody just make an Apache directive to
KW> control how hits are divvied out to the children?  Something like

If memory serves, mod_perl 2.0 uses a most-recently-used strategy
to pull perl interpreters from the thread pool.  It sounds to me like
with apache 2.0 in thread-mode and mod_perl 2.0 you get the same
effect as using the proxy front end that we currently need.

===
Date: Thu, 21 Dec 2000 11:06:54 -0600
From: "Keith G. Murphy" <keithmur@mindspring.com>
To: mod_perl list <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts 
 that contain un-shared memory

Perrin Harkins wrote:
>
[cut]
> 
> Doesn't that appear to be saying that whichever process gets into the
> mutex first will get the new request?  In my experience running
> development servers on Linux it always seemed as if the requests
> would continue going to the same process until a request came in when
> that process was already busy.
> 
Is it possible that the persistent connections utilized by HTTP 1.1 just
made it look that way?  That would happen if the clients were MSIE.

Even recent Netscape browsers only use 1.0, IIRC.

(I was recently perplexed by differing performance between MSIE and NS
browsers hitting my system until I realized this.)

===
To: perrin@primenet.com
cc: Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>
cc: speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Thu, 21 Dec 2000 02:50:28 -0800

 > Gunther Birznieks wrote:
 > > Sam just posted this to the speedycgi list just now.
 > [...]
 > > >The underlying problem in mod_perl is that apache likes to spread out
 > > >web requests to as many httpd's, and therefore as many mod_perl interpreters,
 > > >as possible using an LRU selection process for picking httpd's.
 > 
 > Hmmm... this doesn't sound right.  I've never looked at the code in
 > Apache that does this selection, but I was under the impression that the
 > choice of which process would handle each request was an OS dependent
 > thing, based on some sort of mutex.
 > 
 > Take a look at this: http://httpd.apache.org/docs/misc/perf-tuning.html
 > 
 > Doesn't that appear to be saying that whichever process gets into the
 > mutex first will get the new request?

 I would agree that whichever process gets into the mutex first will get
 the new request.  That's exactly the problem I'm describing.  What you
 are describing here is first-in, first-out behaviour which implies LRU
 behaviour.

 Processes 1, 2, 3 are running.  1 finishes and requests the mutex, then
 2 finishes and requests the mutex, then 3 finishes and requests the mutex.
 So when the next three requests come in, they are handled in the same order:
 1, then 2, then 3 - this is FIFO or LRU.  This is bad for performance.
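
 Here's a toy illustration of the difference (a sketch of my own, not
 SpeedyCGI or Apache code): with an idle-worker queue, FIFO/LRU selection
 cycles through every worker, while LIFO/MRU selection keeps re-using the
 same one:

 my @idle = (1 .. 10);                  # ten idle workers after startup

 sub next_worker {
     my ($policy) = @_;
     # LRU/FIFO takes the worker that has been idle longest;
     # MRU/LIFO takes the one that finished most recently.
     return $policy eq 'lru' ? shift @idle : pop @idle;
 }

 sub worker_done { push @idle, $_[0] }  # a finished worker rejoins the tail

 for my $policy ('lru', 'mru') {
     my @save = @idle;
     my %touched;
     for (1 .. 100) {                   # 100 sequential requests
         my $id = next_worker($policy);
         $touched{$id}++;
         worker_done($id);
     }
     @idle = @save;
     print "$policy touched ", scalar(keys %touched), " distinct workers\n";
 }

 Run it and LRU touches all 10 workers while MRU touches only 1.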

 > In my experience running
 > development servers on Linux it always seemed as if the requests
 > would continue going to the same process until a request came in when
 > that process was already busy.

 No, they don't.  They go round-robin (or LRU as I say it).

 Try this simple test script:

 use CGI;
 my $cgi = CGI->new;
 print $cgi->header();
 print "mypid=$$\n";

 With mod_perl you constantly get different pids.  With mod_speedycgi you
 usually get the same pid.  This is a really good way to see the LRU/MRU
 difference that I'm talking about.

 Here's the problem - the mutex in apache is implemented using a lock
 on a file.  It's left up to the kernel to decide which process to give
 that lock to.
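 In rough outline that serialized-accept pattern looks like this (a sketch
 under assumptions of my own - the port, lock-file path and child count are
 made up, and apache may use fcntl rather than flock on some platforms):

 use IO::Socket::INET;
 use Fcntl qw(:flock);

 my $listen = IO::Socket::INET->new(LocalPort => 8080, Listen => 128,
                                    ReuseAddr => 1) or die "listen: $!";

 for (1 .. 5) {                     # pre-fork five children
     defined(my $pid = fork) or die "fork: $!";
     next if $pid;                  # parent keeps forking
     # each child opens its own handle on the lock file, so flocks from
     # different children really do exclude one another
     open my $lock, '>', '/tmp/accept.lock' or die "lock: $!";
     while (1) {
         flock($lock, LOCK_EX);     # the "mutex": kernel picks which child wins
         my $client = $listen->accept;
         flock($lock, LOCK_UN);
         next unless $client;
         print $client "HTTP/1.0 200 OK\r\n\r\nmypid=$$\n";
         close $client;
     }
 }
 1 while wait != -1;                # parent just reaps; children loop forever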

 Now, if you're writing a unix kernel and implementing this file locking code,
 what implementation would you use?  Well, this is a general purpose thing -
 you have 100 or so processes all trying to acquire this file lock.  You could
 give out the lock randomly or in some ordered fashion.  If I were writing
 the kernel I would give it out in a round-robin fashion (or the
 least-recently-used process as I referred to it before).  Why?  Because
 otherwise one of those processes may starve waiting for this lock - it may
 never get the lock unless you do it in a fair (round-robin) manner.

 The kernel doesn't know that all these httpd's are exactly the same.
 The kernel is implementing a general-purpose file-locking scheme and
 it doesn't know whether one process is more important than another.  If
 it's not fair about giving out the lock a very important process might
 starve.

 Take a look at fs/locks.c (I'm looking at linux 2.3.46).  In there is the
 comment:

 /* Insert waiter into blocker's block list.
  * We use a circular list so that processes can be easily woken up in
  * the order they blocked. The documentation doesn't require this but
  * it seems like the reasonable thing to do.
  */
 static void locks_insert_block(struct file_lock *blocker, struct file_lock *waiter)

 > As I understand it, the implementation of "wake-one" scheduling in the
 > 2.4 Linux kernel may affect this as well.  It may then be possible to
 > skip the mutex and use unserialized accept for single socket servers,
 > which will definitely hand process selection over to the kernel.

 If the kernel implemented the queueing for multiple accepts using a LIFO
 instead of a FIFO and apache used this method instead of file locks,
 then that would probably solve it.

 Just found this on the net on this subject:
    http://www.uwsg.iu.edu/hypermail/linux/kernel/9704.0/0455.html
    http://www.uwsg.iu.edu/hypermail/linux/kernel/9704.0/0453.html

 > > >The problem is that at a high concurrency level, mod_perl is using lots
 > > >and lots of different perl-interpreters to handle the requests, each
 > > >with its own un-shared memory.  It's doing this due to its LRU design.
 > > >But with SpeedyCGI's MRU design, only a few speedy_backends are being used
 > > >because as much as possible it tries to use the same interpreter over and
 > > >over and not spread out the requests to lots of different interpreters.
 > > >Mod_perl is using lots of perl-interpreters, while speedycgi is only using
 > > >a few.  mod_perl is requiring that lots of interpreters be in memory in
 > order to handle the requests, whereas speedy only requires a small number
 > > >of interpreters to be in memory.
 > 
 > This test - building up unshared memory in each process - is somewhat
 > suspect since in most setups I've seen, there is a very significant
 > amount of memory being shared between mod_perl processes.

 My message and testing concern un-shared memory only.  If all of your memory
 is shared, then there shouldn't be a problem.

 But a point I'm making is that with mod_perl you have to go to great
 lengths to write your code so as to avoid unshared memory.  My claim is that
 with mod_speedycgi you don't have to concern yourself as much with this.
 You can concentrate more on the application and less on performance tuning.

 > Regardless,
 > the explanation here doesn't make sense to me.  If we assume that each
 > approach is equally fast (as Sam seems to say earlier in his message)
 > then it should take an equal number of speedycgi and mod_perl processes
 > to handle the same concurrency.

 I don't assume that each approach is equally fast under all loads.  They
 were about the same at concurrency level 1, but at higher concurrency
 levels they weren't.

 I am saying that since SpeedyCGI uses MRU to allocate requests to perl
 interpreters, it winds up using a lot fewer interpreters to handle the
 same number of requests.

 On a single-CPU system of course at some point all the concurrency has
 to be serialized. mod_speedycgi and mod_perl take different approaches
 before getting to that point.  mod_speedycgi tries to use as
 small a number of unix processes as possible, while mod_perl tries to
 use a very large number of unix processes.

 > That leads me to believe that what's really happening here is that
 > Apache is pre-forking a bit over-zealously in response to a sudden surge
 > of traffic from ab, and thus has extra unused processes sitting around
 > waiting, while speedycgi is avoiding this situation by waiting for
 > someone to try and use the processes before forking them (i.e. no
 > pre-forking).  The speedycgi way causes a brief delay while new
 > processes fork, but doesn't waste memory.  Does this sound like a
 > plausible explanation to folks?

 I don't think it's pre-forking.  When I ran my tests I would always run
 them twice, and take the results from the second run.  The first run
 was just to "prime the pump".

 I tried reducing MinSpareServers, and this did help mod_perl get a higher
 concurrency number, but it would still run into a wall where speedycgi
 would not.
 
 > This is probably all a moot point on a server with a properly set
 > MaxClients and Apache::SizeLimit that will not go into swap.

 Please let me know what you think I should change.  So far my
 benchmarks only show one trend, but if you can tell me specifically
 what I'm doing wrong (and it's something reasonable), I'll try it.

 I don't think SizeLimit is the answer - my process isn't growing.  It's
 using the same 50k of un-shared memory over and over.

 I believe that with speedycgi you don't have to lower the MaxClients
 setting, because it's able to handle a larger number of clients, at
 least in this test.  In other words, if with mod_perl you had to turn
 away requests, but with mod_speedycgi you did not, that would just
 prove that speedycgi is more scalable.

 Now you could tell me "don't use unshared memory", but that's outside
 the bounds of the test.   The whole test concerns unshared memory.
 
 > I would
 > expect mod_perl to have the advantage when all processes are
 > fully-utilized because of the shared memory.

 Maybe.  There must be a benchmark somewhere that would show off
 mod_perl's advantages in shared memory.  Maybe a 100,000 line perl
 program or something like that - it would have to be something where
 mod_perl is using *lots* of shared memory, because keep in mind that
 there are still going to be a whole lot fewer SpeedyCGI processes than
 there are mod_perl processes, so you would really have to go overboard
 in the shared-memory department.

 > It would be cool if speedycgi could somehow use a parent process
 > model and get the shared memory benefits too.

 > Speedy seems like it
 > might be more attractive to ISPs, and it would be nice to increase
 > interoperability between the two projects.

 Thanks.  And please, I'm not trying to start a speedy vs mod_perl war.
 My original message was only to the speedycgi list, but now that it's
 on mod_perl I think I have to reply there too.

 But, there is a need for a little good PR on speedycgi's side, and I
 was looking for that.  I would rather just see mod_perl fixed if that's
 possible.  But the last time I brought up this issue (maybe a year ago)
 I was unable to convince the people on the mod_perl list that this
 problem even existed.

===
Date: Thu, 21 Dec 2000 21:16:08 +0100 (CET)
From: Stas Bekman <stas@stason.org>
To: Sam Horrocks <sam@daemoninc.com>
Cc: Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 

Folks, your discussion is not short of wrong statements that
can be easily proved wrong, but I don't find it useful. Instead
please read:

http://perl.apache.org/~dougm/modperl_2.0.html#new

To quote the most relevant part:

"With 2.0, mod_perl has much better control over which
PerlInterpreters are used for incoming requests. The
interpreters are stored in two linked lists, one for
available interpreters, one for busy. When needed to handle a
request, one is taken from the head of the available list
and put back into the head of the list when done. This means
if you have, say, 10 interpreters configured to be cloned at
startup time, but no more than 5 are ever used concurrently,
those 5 continue to reuse Perls allocations, while the other
5 remain much smaller, but ready to go if the need arises."
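
A minimal sketch of that head-of-list reuse (my own illustration, not the
mod_perl 2.0 source):

my @avail = map { +{ id => $_, used => 0 } } 1 .. 10;  # 10 cloned interpreters
my @busy;

sub interp_get { my $i = shift @avail; push @busy, $i; return $i }
sub interp_put { my ($i) = @_; @busy = grep { $_ != $i } @busy; unshift @avail, $i }

for (1 .. 1000) {                    # never more than 5 in use at once
    my @in_use = map { interp_get() } 1 .. 5;
    $_->{used}++ for @in_use;
    interp_put($_) for @in_use;
}
printf "interpreter %2d handled %4d requests\n", $_->{id}, $_->{used} for @avail;

Interpreters 1-5 stay warm and handle everything; 6-10 are never touched.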

Of course you should read the rest.

So the moment mod_perl 2.0 hits the shelves, this possible
benefit of speedycgi over mod_perl becomes irrelevant. I
think this more or less summarizes this thread.

And Gunther, nobody is trying to stop people from expressing their
opinions here, it's just that different people express their
feelings in different ways, that's the way the open list
goes... :) so please keep on forwarding things that you find
interesting. I don't think anybody here is relieved when
you are busy and not posting, as you seem to suggest -- I
believe that your posts are very interesting and you
shouldn't discourage yourself from keeping on doing
that. Those who don't like your posts don't have to read
them.

Hope you are all having fun and getting ready for the
holidays :) I'm going to buy my ski equipment soonish!

===

To: Stas Bekman <stas@stason.org>
cc: Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Thu, 21 Dec 2000 14:34:46 -0800

 > Folks, your discussion is not short of wrong statements that can be easily
 > proved wrong, but I don't find it useful.

 I don't follow.  Are you saying that my conclusions are wrong, but
 you don't want to bother explaining why?
 
 Would you agree with the following statement?

    Under apache-1, speedycgi scales better than mod_perl with
    scripts that contain un-shared memory 

===

Date: Fri, 22 Dec 2000 08:45:25 +0800
To: Stas Bekman <stas@stason.org>, Sam Horrocks <sam@daemoninc.com>
From: Gunther Birznieks <gunther@extropia.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Cc: mod_perl list <modperl@apache.org>, speedycgi@newlug.org

At 09:16 PM 12/21/00 +0100, Stas Bekman wrote:
[much removed]

>So the moment mod_perl 2.0 hits the shelves, this possible benefit
>of speedycgi over mod_perl becomes irrelevant. I think this more or less
>summarizes this thread.

I think you are right about the summarization. However, I
also think it's unfair for people here to pin too many hopes
on mod_perl 2.0.

First Apache 2.0 has to be fully released. It's still in
Alpha! Then, mod_perl 2.0 has to be released. I haven't seen
any realistic timelines that indicate to me that these will
be released and stable for production use in only a few
months time. And Apache 2.0 has been worked on for years.  I
first saw a talk on Apache 2.0's architecture at the first
ApacheCon 2 years ago! To be fair, back then they were using
Mozilla's NPR which I think they learned from, threw away,
and rewrote from scratch after all (to become APR). But
still, the point is that it's been a long time and probably
will be a while yet.

Who in their right mind would pin their business or
production database on the hope that mod_perl 2.0 comes out
in a few months? I don't think anyone would. Sam has a
solution that works now, that is open source, and that provides
benefits for some types of web applications that mod_perl and
apache are not as efficient at.

As people interested in Perl, we should be embracing these
alternatives not telling people to wait for new versions of
software that may not come out soon.

If there is a problem with mod_perl advocacy, it's precisely
that it is too mod_perl-centric. Mod_perl is a niche crowd
with a high learning curve. I think the technology
mod_perl offers is great, but as has been said before, the
problem is that people are going to PHP, away from Perl. If
more people had easier ways to implement their simple
apps in Perl and still be as fast as PHP, fewer people would
go to PHP.

Those Perl people would eventually discover mod_perl's power
as they require it, and then they would take the step to
"upgrade" to the power of handlers away from the "missing
link".

But without that "missing link" to make things easy for
people to move from PHP to Perl, then Perl will miss
something very crucial to maintaining its standing as the
"defacto language for Web applications".

3 years ago, I think it would be accurate to say Perl apps
drove 95% of the dynamic web. Sadly, I believe (anecdotally)
that this is no longer true.

SpeedyCGI is not "THE" missing link, but I see it as a
crucial part of this link between newbies and mod_perl. This
is why I believe that mod_perl and its documentation should
have a section (even if tiny) on this stuff, so that people
will know that if they find mod_perl too hard, there
are alternatives that are less powerful, yet provide at
least enough power to beat PHP.

I also see SpeedyCGI as already being more ISP-friendly than
mod_perl for hosting casual users of Perl. Different apps use a
different backend engine
by default. So the problem with virtual hosts screwing each
other over by accident is gone for the casual user. There
is still room for improvement (e.g. memory is likely
still an issue with different backends)...

Anyway, these are just my feelings. I really shouldn't be
spending time on posting this as I have some deadlines to
meet. But I felt they were still important points to make
that I think some people may be potentially missing here. :)

===

Date: Fri, 22 Dec 2000 01:48:47 +0100 (CET)
From: Stas Bekman <stas@stason.org>
To: Sam Horrocks <sam@daemoninc.com>
Cc: Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
In-Reply-To: <14214.977438086@daemonweb.daemoninc.com>

On Thu, 21 Dec 2000, Sam Horrocks wrote:

>  > Folks, your discussion is not short of wrong statements that can be easily
>  > proved wrong, but I don't find it useful.
> 
>  I don't follow.  Are you saying that my conclusions are wrong, but
>  you don't want to bother explaining why?
>  
>  Would you agree with the following statement?
> 
>     Under apache-1, speedycgi scales better than mod_perl with
>     scripts that contain un-shared memory 


I don't know. It's easy to give a simple example and claim to be better.
So far, whoever has tried to show by benchmarks that they are better has
most often been proved wrong, since the technologies in question have so many
features that I believe no benchmark will prove any of them absolutely
superior or inferior. That's why I said that trying to claim that your grass
is greener is doomed to fail if someone has time on their hands to prove you
wrong. Well, we don't have this time.

Therefore I'm not trying to prove you wrong or right. The point of
Gunther's original forward was to show things that mod_perl may need to
adopt to make it better. Doug already explained in his paper that the MRU
approach has already been implemented in mod_perl-2.0. You can read it in
the link I attached and the passage I quoted.

So your conclusions about MRU are correct and we have it implemented
already (well very soon now :). I apologize if my original reply was
misleading.

I'm not saying that benchmarks are bad. What I'm saying is that it's
very hard to benchmark things which are different. You benefit the most
from benchmarking when you take the initial code/product, benchmark
it, then try to improve the code and benchmark again to see whether it
gave you any improvement. That's the area where benchmarks rule and
they are fair, because you test the same thing. You can read more
of my rambling about benchmarks in the guide.

So if you find some cool features in other technologies that mod_perl
might adopt and benefit from, don't hesitate to tell the rest of the gang.

----

Something that I'd like to comment on:

I find it a bad practice to quote one sentence from a person's post and
follow up on it. Someone from the list has sent me this email (SB> == me):

SB> I don't find it useful

and follow up. Why not use a single letter:

SB> I

and follow up? It's so much easier to flame on things taken out of their
context.

It's not the first time that people have done this to each other on the list; I
think I've done it too. So please be more careful when taking things out of
context. Thanks a lot, folks!

===

To: Gunther Birznieks <gunther@extropia.com>
cc: speedycgi@newlug.org
cc: perrin@primenet.com, mod_perl list <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Thu, 21 Dec 2000 16:56:54 -0800

I've put your suggestion on the todo list.  It certainly wouldn't hurt to
have that feature, though I think memory sharing becomes a much much smaller
issue once you switch to MRU scheduling.

At the moment I think SpeedyCGI has more pressing needs though - for
example multiple scripts in a single interpreter, and an NT port.


 > I think you could actually make speedycgi even better for shared memory 
 > usage by creating a special directive which would indicate to speedycgi to 
 > preload a series of modules. And then to tell speedy cgi to do forking of 
 > that "master" backend preloaded module process and hand control over to 
 > that forked process whenever you need to launch a new process.
 > 
 > Then speedy would potentially have the best of both worlds.
 > 
 > Sorry I cross posted your thing. But I do think it is a problem of mod_perl 
 > also, and I am happily using speedycgi in production on at least one 
 > commercial site where mod_perl could not be installed so easily because of 
 > infrastructure issues.
 > 
 > I believe your mechanism of round robining among MRU perl interpreters is 
 > actually also accomplished by ActiveState's PerlEx (based on 
 > Apache::Registry but using multithreaded IIS and pool of Interpreters). A 
 > method similar to this will be used in Apache 2.0 when Apache is 
 > multithreaded and therefore can control within program logic which Perl 
 > interpreter gets called from a pool of Perl interpreters.
 > 
 > It just isn't so feasible right now in Apache 1.0 to do this. And sometimes 
 > people forget that mod_perl came about primarily for writing handlers in 
 > Perl not as an application environment although it is very good for the 
 > latter as well.
 > 
 > I think SpeedyCGI needs more advocacy from the mod_perl group because put 
 > simply speedycgi is way easier to set up and use than mod_perl and will 
 > likely get more PHP people using Perl again. If more people rely on Perl 
 > for their fast websites, then you will get more people looking for more 
 > power, and by extension more people using mod_perl.
 > 
 > Whoops... here we go with the advocacy thing again.
 > 
 > Later,
 >     Gunther
 > 
 > At 02:50 AM 12/21/2000 -0800, Sam Horrocks wrote:
 > [Sam's earlier reply to Perrin quoted here in full; trimmed -- see that message above.]
===

To: speedycgi@newlug.org
cc: Gunther Birznieks <gunther@extropia.com>,
        mod_perl list <modperl@apache.org>
cc: Stas Bekman <stas@stason.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Thu, 21 Dec 2000 17:26:39 -0800

I really wasn't trying to work backwards from a benchmark.  It was
more of an analysis of the design, and the benchmarks bore it out.
It's sort of like coming up with a theory in science - if you can't get
any experimental data to back up the theory, you're in big trouble.
But if you can at least point out the existence of some experiments
that are consistent with your theory, it means your theory may be true.

The best would be to have other people do the same tests and see if they
see the same trend.  If no-one else sees this trend, then I'd really
have to re-think my analysis.

Another way to look at it - as you say below, MRU is going to be in
mod_perl-2.0.  And what is the reason for that?  If there's no performance
difference between LRU and MRU, why would the author bother to switch
to MRU?  So, I'm saying there must be some benchmarks somewhere that
point out this difference - if there weren't any real-world difference,
why bother even implementing MRU?

I claim that my benchmarks point out this difference between MRU over
LRU, and that's why my benchmarks show better performance on speedycgi
than on mod_perl.

===


From: Perrin Harkins <perrin@primenet.com>
To: Sam Horrocks <sam@daemoninc.com>
Cc: Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Date: Thu, 21 Dec 2000 17:38:37 -0800 (PST)

Hi Sam,

>  Processes 1, 2, 3 are running.  1 finishes and requests the mutex, then
>  2 finishes and requests the mutex, then 3 finishes and requests the mutex.
>  So when the next three requests come in, they are handled in the same order:
>  1, then 2, then 3 - this is FIFO or LRU.  This is bad for performance.

Thanks for the explanation; that makes sense now.  So, I was right that
it's OS dependent, but most OSes use a FIFO approach which leads to LRU
selection in the mutex.

Unfortunately, I don't see that being fixed very simply, since it's not
really Apache doing the choosing.  Maybe it will be possible to do
something cool with the wake-one stuff in Linux 2.4 when that comes out.

By the way, how are you doing it?  Do you use a mutex routine that works
in LIFO fashion?

>  > In my experience running
>  > development servers on Linux it always seemed as if the requests
>  > would continue going to the same process until a request came in when
>  > that process was already busy.
> 
>  No, they don't.  They go round-robin (or LRU as I say it).

Keith Murphy pointed out that I was seeing the result of persistent HTTP
connections from my browser.  Duh.

>  But a point I'm making is that with mod_perl you have to go to great
>  lengths to write your code so as to avoid unshared memory.  My claim is that
>  with mod_speedycgi you don't have to concern yourself as much with this.
>  You can concentrate more on the application and less on performance tuning.

I think you're overstating the case a bit here.  It's really easy to take
advantage of shared memory with mod_perl - I just add a 'use Foo' to my
startup.pl!  It can be hard for newbies to understand, but there's nothing
difficult about implementing it.  I often get 50% or more of my
application shared in this way.  That's a huge savings.
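
For the curious, a minimal startup.pl along these lines (the path and the
application module name are just placeholders) shows the idea:

# startup.pl -- pulled in by the parent httpd, e.g. via
#   PerlRequire /usr/local/apache/conf/startup.pl
# Everything compiled here lives in the parent and is shared
# copy-on-write by every child.
use strict;
use CGI ();
CGI->compile(':all');     # precompile CGI.pm's autoloaded methods as well
use DBI ();
# use My::App;            # placeholder for your real application modules
1;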

>  I don't assume that each approach is equally fast under all loads.  They
>  were about the same at concurrency level 1, but at higher concurrency
>  levels they weren't.

Well, certainly not when mod_perl started swapping...

Actually, there is a reason why MRU could lead to better performance (as
opposed to just saving memory): caching of allocated memory.  The first
time Perl sees lexicals it has to allocate memory for them, so if you
re-use the same interpreter you get to skip this step and that should give
some kind of performance benefit.

>  I am saying that since SpeedyCGI uses MRU to allocate requests to perl
>  interpreters, it winds up using a lot fewer interpreters to handle the
>  same number of requests.

What I was saying is that it doesn't make sense for one to need fewer
interpreters than the other to handle the same concurrency.  If you have
10 requests at the same time, you need 10 interpreters.  There's no way
speedycgi can do it with fewer, unless it actually makes some of them
wait.  That could be happening, due to the fork-on-demand model, although
your warmup round (priming the pump) should take care of that.

>  I don't think it's pre-forking.  When I ran my tests I would always run
>  them twice, and take the results from the second run.  The first run
>  was just to "prime the pump".

That seems like it should do it, but I still think you could only have
more processes handling the same concurrency on mod_perl if some of the
mod_perl processes are idle or some of the speedycgi requests are waiting.

>  > This is probably all a moot point on a server with a properly set
>  > MaxClients and Apache::SizeLimit that will not go into swap.
> 
>  Please let me know what you think I should change.  So far my
>  benchmarks only show one trend, but if you can tell me specifically
>  what I'm doing wrong (and it's something reasonable), I'll try it.

Try setting MinSpareServers as low as possible and setting MaxClients to a
value that will prevent swapping.  Then set ab for a concurrency equal to
your MaxClients setting.
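
For example (the numbers are purely illustrative and the URL is a placeholder
for whatever script you're benchmarking):

    # httpd.conf
    MinSpareServers 1
    MaxClients      40      # whatever keeps the box out of swap

    # then benchmark at exactly that concurrency:
    ab -c 40 -n 10000 http://localhost/perl/hello.cgi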

>  I believe that with speedycgi you don't have to lower the MaxClients
>  setting, because it's able to handle a larger number of clients, at
>  least in this test.

Maybe what you're seeing is an ability to handle a larger number of
requests (as opposed to clients) because of the performance benefit I
mentioned above.  I don't know how hard ab tries to make sure you really
have n simultaneous clients at any given time.

>  In other words, if with mod_perl you had to turn
>  away requests, but with mod_speedycgi you did not, that would just
>  prove that speedycgi is more scalable.

Are the speedycgi+Apache processes smaller than the mod_perl
processes?  If not, the maximum number of concurrent requests you can
handle on a given box is going to be the same.

>  Maybe.  There must be a benchmark somewhere that would show off
>  mod_perl's advantages in shared memory.  Maybe a 100,000 line perl
>  program or something like that - it would have to be something where
>  mod_perl is using *lots* of shared memory, because keep in mind that
>  there are still going to be a whole lot fewer SpeedyCGI processes than
>  there are mod_perl processes, so you would really have to go overboard
>  in the shared-memory department.

Well, I get tons of use out of shared memory without even trying.  If you
can find a way to implement it in speedycgi, I think it would be very
beneficial to your users.

>  I would rather just see mod_perl fixed if that's
>  possible.

Because this has more to do with the OS than Apache and is already fixed
in mod_perl 2, I doubt anyone will feel like messing with it before that
gets released.  Your experiment demonstrates that the MRU approach has
value, so I'll be looking forward to trying it out with mod_perl 2.

===

Date: Thu, 21 Dec 2000 22:39:50 -0600
From: "Ken Williams" <ken@forum.swarthmore.edu>
To: "Perrin Harkins" <perrin@primenet.com>
Cc: "mod_perl list" <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

perrin@primenet.com (Perrin Harkins) wrote:
>Hi Sam,
[snip]
>>  I am saying that since SpeedyCGI uses MRU to allocate requests to perl
>>  interpreters, it winds up using a lot fewer interpreters to handle the
>>  same number of requests.
>
>What I was saying is that it doesn't make sense for one to need fewer
>interpreters than the other to handle the same concurrency.  If you have
>10 requests at the same time, you need 10 interpreters.  There's no way
>speedycgi can do it with fewer, unless it actually makes some of them
>wait.

Well, there is one way, though it's probably not a huge factor.  If
mod_perl indeed manages the child-farming in such a way that too much
memory is used, then each process might slow down as memory becomes
scarce, especially if you start swapping.  Then if each request takes
longer, your child pool is more saturated with requests, and you might
have to fork a few more kids.

So in a sense, I think you're both correct.  If "concurrency" means the
number of requests that can be handled at once, both systems are
necessarily (and trivially) equivalent.  This isn't a very useful
measurement, though; a more useful one is how many children (or perhaps
how much memory) will be necessary to handle a given number of incoming
requests per second, and with this metric the two systems could perform
differently.

===

To: Ken Williams <ken@forum.swarthmore.edu>
Cc: mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Date: Thu, 21 Dec 2000 22:07:10 -0800 (PST)
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

On Thu, 21 Dec 2000, Ken Williams wrote:
> So in a sense, I think you're both correct.  If "concurrency" means
> the number of requests that can be handled at once, both systems are
> necessarily (and trivially) equivalent.  This isn't a very useful
> measurement, though; a more useful one is how many children (or
> perhaps how much memory) will be necessary to handle a given number of
> incoming requests per second, and with this metric the two systems
> could perform differently.

Yes, well put.  And that actually brings me back around to my original
hypothesis, which is that once you reach the maximum number of
interpreters that can be run on the box before swapping, it no longer
makes a difference if you're using LRU or MRU.  That's because all
interpreters are busy all the time, and the RAM for lexicals has already
been allocated in all of them.  At that point, it's a question of which
system can fit more interpreters in RAM at once, and I still think
mod_perl would come out on top there because of the shared memory.  Of
course most people don't run their servers at full throttle, and at less
than total saturation I would expect speedycgi to use less RAM and
possibly be faster.

So I guess I'm saying exactly the opposite of the original assertion:
mod_perl is more scalable if you define "scalable" as maximum requests per
second on a given machine, but speedycgi uses fewer resources at less than
peak loads which would make it more attractive for ISPs and other people
who use their servers for multiple tasks.

This is all hypothetical and I don't have time to experiment with it until
after the holidays, but I think the logic is correct.

===

From: "Jeremy Howard" <jh_lists@fastmail.fm>
To: "Perrin Harkins" <perrin@primenet.com>, "Sam Horrocks" <sam@daemoninc.com>
Cc: "Gunther Birznieks" <gunther@extropia.com>, "mod_perl list" <modperl@apache.org>, <speedycgi@newlug.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Fri, 22 Dec 2000 17:38:19 +1100

Perrin Harkins wrote:

> What I was saying is that it doesn't make sense for one to need fewer
> interpreters than the other to handle the same concurrency.  If you have
> 10 requests at the same time, you need 10 interpreters.  There's no way
> speedycgi can do it with fewer, unless it actually makes some of them
> wait.  That could be happening, due to the fork-on-demand model, although
> your warmup round (priming the pump) should take care of that.

I don't know if Speedy fixes this, but one problem with mod_perl v1 is that
if, for instance, a large POST request is being uploaded, this takes a whole
perl interpreter while the transaction is occurring. This is at least one
place where a Perl interpreter should not be needed.

Of course, this could be overcome if an HTTP Accelerator is used that takes
the whole request before passing it to a local httpd, but I don't know of
any proxies that work this way (AFAIK they all pass the packets as they
arrive).

===

Date: Fri, 22 Dec 2000 07:51:47 +0000 (GMT)
From: Matt Sergeant <matt@sergeant.org>
To: Sam Horrocks <sam@daemoninc.com>
cc: Stas Bekman <stas@stason.org>, Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>, <speedycgi@newlug.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

On Thu, 21 Dec 2000, Sam Horrocks wrote:

>  > Folks, your discussion is not short of wrong statements that can be easily
>  > proved, but I don't find it useful.
>
>  I don't follow.  Are you saying that my conclusions are wrong, but
>  you don't want to bother explaining why?
>
>  Would you agree with the following statement?
>
>     Under apache-1, speedycgi scales better than mod_perl with
>     scripts that contain un-shared memory

NO!

When you can write a trans handler or an auth handler with speedy, then I
might agree with you. Until then I must insist you add "mod_perl
Apache::Registry scripts" or something to that effect.

===


Date: Fri, 22 Dec 2000 11:18:32 -0600
From: "Keith G. Murphy" <keithmur@mindspring.com>
To: mod_perl list <modperl@apache.org>
CC: Perrin Harkins <perrin@primenet.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

Perrin Harkins wrote:

> Keith Murphy pointed out that I was seeing the result of persistent HTTP
> connections from my browser.  Duh.
> 
I must mention that, having seen your postings here over a long period,
anytime I can make you say "duh", my week is made.  Maybe the whole
month.

That issue can be confusing.  It was especially so for me when IE did
it, and Netscape did not...

Let's make everyone switch to IE, and mod_perl looks good again!  :-b

===

To: "Jeremy Howard" <jh_lists@fastmail.fm>
Cc: modperl@apache.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
From: Joe Schaefer <joe@sunstarsys.com>
Date: 22 Dec 2000 22:17:06 -0500

"Jeremy Howard" <jh_lists@fastmail.fm> writes:

> Perrin Harkins wrote:
> > What I was saying is that it doesn't make sense for one to need fewer
> > interpreters than the other to handle the same concurrency.  If you have
> > 10 requests at the same time, you need 10 interpreters.  There's no way
> > speedycgi can do it with fewer, unless it actually makes some of them
> > wait.  That could be happening, due to the fork-on-demand model, although
> > your warmup round (priming the pump) should take care of that.

A backend server can realistically handle multiple frontend requests, since
the frontend server must stick around until the data has been delivered
to the client (at least that's my understanding of the lingering-close
issue that was recently discussed at length here). Hypothetically speaking,
if a "FastCGI-like"[1] backend can deliver its content faster than the 
apache (front-end) server can "proxy" it to the client, you won't need as 
many to handle the same (front-end) traffic load.

As an extreme hypothetical example, say that over a 5 second period you
are barraged with 100 modem requests that typically would take 5s each to 
service.  This means (sans lingerd :) that at the end of your 5 second 
period, you have 100 active apache children around.

But if new requests during that 5 second interval were only received at 
20/second, and your "FastCGI-like" server could deliver the content to
apache in one second, you might only have forked 50-60 "FastCGI-like" new 
processes to handle all 100 requests (forks take a little time :).

Moreover, an MRU design allows the transient effects of a short burst 
of abnormally heavy traffic to dissipate quickly, and IMHO that's its 
chief advantage over LRU.  To return to this hypothetical, suppose 
that immediately following this short burst, we maintain a sustained 
traffic of 20 new requests per second. Since it takes 5 seconds to 
deliver the content, that amounts to a sustained concurrency level 
of 100. The "Fast-CGI like" backend may have initially reacted by forking 
50-60 processes, but with MRU only 20-30 processes will actually be 
handling the load, and this reduction would happen almost immediately 
in this hypothetical.  This means that the remaining transient 20-30 
processes could be quickly killed off or _moved to swap_ without adversely 
affecting server performance.

Again, this is all purely hypothetical - I don't have benchmarks to
back it up ;)

> I don't know if Speedy fixes this, but one problem with mod_perl v1 is that
> if, for instance, a large POST request is being uploaded, this takes a whole
> perl interpreter while the transaction is occurring. This is at least one
> place where a Perl interpreter should not be needed.
> 
> Of course, this could be overcome if an HTTP Accelerator is used that takes
> the whole request before passing it to a local httpd, but I don't know of
> any proxies that work this way (AFAIK they all pass the packets as they
> arrive).

I posted a patch to modproxy a few months ago that specifically 
addresses this issue.  It has a ProxyPostMax directive that changes 
its behavior to a store-and-forward proxy for POST data (it also enabled 
keepalives on the browser-side connection if they were enabled on the 
frontend server.)

It does this by buffering the data to a temp file on the proxy before 
opening the backend socket.  It's straightforward to make it buffer to 
a portion of RAM instead - if you're interested I can post another patch 
that does this also, but it's pretty much untested.


[1] I've never used SpeedyCGI, so I've refrained from specifically discussing 
    it. Also, a mod_perl backend server using Apache::Registry can be viewed as 
    "FastCGI-like" for the purpose of my argument.

===

Date: Sat, 23 Dec 2000 11:28:18 +0800
To: Joe Schaefer <joe@sunstarsys.com>, "Jeremy Howard" <jh_lists@fastmail.fm>
From: Gunther Birznieks <gunther@extropia.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

At 10:17 PM 12/22/2000 -0500, Joe Schaefer wrote:
>"Jeremy Howard" <jh_lists@fastmail.fm> writes:
>
>[snipped]
>I posted a patch to modproxy a few months ago that specifically
>addresses this issue.  It has a ProxyPostMax directive that changes
>its behavior to a store-and-forward proxy for POST data (it also enabled
>keepalives on the browser-side connection if they were enabled on the
>frontend server.)
>
>It does this by buffering the data to a temp file on the proxy before
>opening the backend socket.  It's straightforward to make it buffer to
>a portion of RAM instead- if you're interested I can post another patch
>that does this also, but it's pretty much untested.
Cool! Are these patches now incorporated in the core mod_proxy if we 
download it off the web? Or do we troll through the mailing list to find 
the patch?

(Similar question about the forwarding of remote user patch someone posted 
last year).

===

From: "Jeremy Howard" <jh_lists@fastmail.fm>
To: <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Sat, 23 Dec 2000 15:36:52 +1100

Joe Schaefer wrote:
> "Jeremy Howard" <jh_lists@fastmail.fm> writes:

> > I don't know if Speedy fixes this, but one problem with
> > mod_perl v1 is that if, for instance, a large POST
> > request is being uploaded, this takes a whole perl
> > interpreter while the transaction is occurring. This is
> > at least one place where a Perl interpreter should not
> > be needed.

> > Of course, this could be overcome if an HTTP Accelerator
> > is used that takes the whole request before passing it
> > to a local httpd, but I don't know of any proxies that
> > work this way (AFAIK they all pass the packets as they
> > arrive).

> I posted a patch to modproxy a few months ago that
> specifically addresses this issue.  It has a ProxyPostMax
> directive that changes its behavior to a
> store-and-forward proxy for POST data (it also enabled
> keepalives on the browser-side connection if they were
> enabled on the frontend server.)

FYI, this patch is at:

  http://www.mail-archive.com/modperl@apache.org/msg11072.html

===

Date: Fri, 22 Dec 2000 23:57:36 -0800 (PST)
From: Ask Bjoern Hansen <ask@valueclick.com>
To: Sam Horrocks <sam@daemoninc.com>
cc: Stas Bekman <stas@stason.org>, Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts
 that contain un-shared memory 

On Thu, 21 Dec 2000, Sam Horrocks wrote:

>  > Folks, your discussion is not short of wrong statements that can be easily
>  > proved, but I don't find it useful.
> 
>  I don't follow.  Are you saying that my conclusions are wrong, but
>  you don't want to bother explaining why?
>  
>  Would you agree with the following statement?
> 
>     Under apache-1, speedycgi scales better than mod_perl with
>     scripts that contain un-shared memory 

Maybe; but for one thing the feature set seems to be very different,
as others have pointed out. Second, the test that was originally
quoted didn't have much to do with reality and showed that whoever
made it didn't have much experience with setting up real-world
high-traffic systems with mod_perl.


===

Date: Sat, 23 Dec 2000 16:27:34 +0000 (GMT)
From: Nigel Hamilton <nigel@e1mail.com>
To: speedycgi@newlug.org
cc: Sam Horrocks <sam@daemoninc.com>, Stas Bekman <stas@stason.org>, Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales a better benchmark

Hi,	
	I think some of the 'threatened' replies to this thread speak
volumes - more than any benchmark could.

	Sam has come up with a cool technology .... it will help bridge
the technology adoption gap between traditional perl CGI and mod_perl -
especially for ISPs.

	Well done Sam!

===

From: Sam Horrocks 
            ((?))
To: Perrin Harkins <perrin@primenet.com>
cc: Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Thu, 04 Jan 2001 04:56:34 -0800

Sorry for the late reply - I've been out for the holidays.

 > By the way, how are you doing it?  Do you use a mutex routine that works
 > in LIFO fashion?

 Speedycgi uses separate backend processes that run the perl interpreters.
 The frontend processes (the httpd's that are running mod_speedycgi)
 communicate with the backends, sending over the request and getting the output.

 Speedycgi uses some shared memory (an mmap'ed file in /tmp) to keep track
 of the backends and frontends.  This shared memory contains the queue.
 When backends become free, they add themselves at the front of this queue.
 When the frontends need a backend they pull the first one from the front
 of this list.
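
 In plain perl terms the queue discipline looks something like this (just a
 sketch - the real list lives in that mmap'ed file):

    use strict;
    my @queue;                     # front of the list = index 0

    sub backend_done  { my ($b) = @_; unshift @queue, $b }   # free: add at the front
    sub frontend_pick { shift @queue }                        # need: take from the front

    backend_done($_) for qw(b1 b2 b3);   # b3 freed most recently
    print frontend_pick(), "\n";         # prints "b3" - LIFO/MRU selection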

 > 
 > >  I am saying that since SpeedyCGI uses MRU to allocate requests to perl
 > >  interpreters, it winds up using a lot fewer interpreters to handle the
 > >  same number of requests.
 > 
 > What I was saying is that it doesn't make sense for one to need fewer
 > interpreters than the other to handle the same concurrency.  If you have
 > 10 requests at the same time, you need 10 interpreters.  There's no way
 > speedycgi can do it with fewer, unless it actually makes some of them
 > wait.  That could be happening, due to the fork-on-demand model, although
 > your warmup round (priming the pump) should take care of that.

 What you say would be true if you had 10 processors and could get
 true concurrency.  But on single-cpu systems you usually don't need
 10 unix processes to handle 10 requests concurrently, since they get
 serialized by the kernel anyways.  I'll try to show how mod_perl handles
 10 concurrent requests, and compare that to mod_speedycgi so you can
 see the difference.

 For mod_perl, let's assume we have 10 httpd's, h1 through h10,
 when the 10 concurrent requests come in.  h1 has acquired the mutex,
 and h2-h10 are waiting (in order) on the mutex.  Here's how the cpu
 actually runs the processes:

    h1 accepts
    h1 releases the mutex, making h2 runnable
    h1 runs the perl code and produces the results
    h1 waits for the mutex

    h2 accepts
    h2 releases the mutex, making h3 runnable
    h2 runs the perl code and produces the results
    h2 waits for the mutex

    h3 accepts
    ...

 This is pretty straightforward.  Each of h1-h10 run the perl code
 exactly once.  They may not run exactly in this order since a process
 could get pre-empted, or blocked waiting to send data to the client,
 etc.  But regardless, each of the 10 processes will run the perl code
 exactly once.

 Here's the mod_speedycgi example - it too uses httpd's h1-h10, and they
 all take turns running the mod_speedycgi frontend code.  But the backends,
 where the perl code is, don't have to all be run fairly - they use MRU
 instead.  I'll use b1 and b2 to represent 2 speedycgi backend processes,
 already queued up in that order.

 Here's a possible speedycgi scenario:

    h1 accepts
    h1 releases the mutex, making h2 runnable
    h1 sends a request to b1, making b1 runnable

    h2 accepts
    h2 releases the mutex, making h3 runnable
    h2 sends a request to b2, making b2 runnable

    b1 runs the perl code and sends the results to h1, making h1 runnable
    b1 adds itself to the front of the queue

    h3 accepts
    h3 releases the mutex, making h4 runnable
    h3 sends a request to b1, making b1 runnable

    b2 runs the perl code and sends the results to h2, making h2 runnable
    b2 adds itself to the front of the queue

    h1 produces the results it got from b1
    h1 waits for the mutex

    h4 accepts
    h4 releases the mutex, making h5 runnable
    h4 sends a request to b2, making b2 runnable

    b1 runs the perl code and sends the results to h3, making h3 runnable
    b1 adds itself to the front of the queue

    h2 produces the results it got from b2
    h2 waits for the mutex

    h5 accepts
    h5 releases the mutex, making h6 runnable
    h5 sends a request to b1, making b1 runnable

    b2 runs the perl code and sends the results to h4, making h4 runnable
    b2 adds itself to the front of the queue

 This may be hard to follow, but hopefully you can see that the 10 httpd's
 just take turns using b1 and b2 over and over.  So, the 10 concurrent
 requests end up being handled by just two perl backend processes.  Again,
 this is simplified.  If the perl processes get blocked, or pre-empted,
 you'll end up using more of them.  But generally, the LIFO will cause
 SpeedyCGI to sort-of settle into the smallest number of processes needed for
 the task.

 The difference between the two approaches is that the mod_perl
 implementation forces unix to use 10 separate perl processes, while the
 mod_speedycgi implementation sort-of decides on the fly how many
 different processes are needed.
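
 A toy way to see the difference in code (just a sketch, not speedycgi itself;
 it assumes, as in the trace above, that the oldest in-flight request has
 finished by the time two newer ones have started):

    use strict;
    use warnings;

    my $requests = 20;

    # mod_perl-style: requests rotate over a fixed pool of 10 pre-forked
    # interpreters, so all 10 get touched (and stay paged in).
    my %lru_used;
    $lru_used{ $_ % 10 }++ for 0 .. $requests - 1;

    # speedycgi-style: free backends queue at the front, and a new one is
    # forked only when none are free.
    my (@free, @busy, $forked, %mru_used);
    for (1 .. $requests) {
        my $b = @free ? shift @free : 'b' . ++$forked;   # reuse the front one
        $mru_used{$b}++;
        push @busy, $b;
        unshift @free, shift @busy if @busy > 1;         # oldest request finishes
    }

    printf "round-robin touched %d interpreters, MRU touched %d\n",
        scalar keys %lru_used, scalar keys %mru_used;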

 > >  Please let me know what you think I should change.  So far my
 > >  benchmarks only show one trend, but if you can tell me specifically
 > >  what I'm doing wrong (and it's something reasonable), I'll try it.
 > 
 > Try setting MinSpareServers as low as possible and setting MaxClients to a
 > value that will prevent swapping.  Then set ab for a concurrency equal to
 > your MaxClients setting.

 I previously had set MinSpareServers to 1 - it did help mod_perl get
 to a higher level, but didn't change the overall trend.

 I found that setting MaxClients to 100 stopped the paging.  At concurrency
 level 100, both mod_perl and mod_speedycgi showed similar rates with ab.
 Even at higher levels (300), they were comparable.

 But, to show that the underlying problem is still there, I then changed
 the hello_world script and doubled the amount of un-shared memory.
 And of course the problem then came back for mod_perl, although speedycgi
 continued to work fine.  I think this shows that mod_perl is still
 using quite a bit more memory than speedycgi to provide the same service.

 > >  I believe that with speedycgi you don't have to lower the MaxClients
 > >  setting, because it's able to handle a larger number of clients, at
 > >  least in this test.
 > 
 > Maybe what you're seeing is an ability to handle a larger number of
 > requests (as opposed to clients) because of the performance benefit I
 > mentioned above.
 
 I don't follow.
 
 > I don't know how hard ab tries to make sure you really
 > have n simultaneous clients at any given time.

 I do know that the ab "-c" option does seem to have an effect on the
 tests I've been running.

 > >  In other words, if with mod_perl you had to turn
 > >  away requests, but with mod_speedycgi you did not, that would just
 > >  prove that speedycgi is more scalable.
 > 
 > Are the speedycgi+Apache processes smaller than the mod_perl
 > processes?  If not, the maximum number of concurrent requests you can
 > handle on a given box is going to be the same.

 The size of the httpds running mod_speedycgi, plus the size of speedycgi
 perl processes is significantly smaller than the total size of the httpd's
 running mod_perl.

 The reason for this is that only a handful of perl processes are required by
 speedycgi to handle the same load, whereas mod_perl uses a perl interpreter
 in all of the httpds.

===

To: speedycgi@newlug.org
cc: Ken Williams <ken@forum.swarthmore.edu>, mod_perl list <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Thu, 04 Jan 2001 05:03:26 -0800

I don't agree.  SpeedyCGI handles the same load with a whole lot fewer
perl interpreters, thus reducing the memory requirements significantly.
See my previous post for a more detailed explanation.

 > On Thu, 21 Dec 2000, Ken Williams wrote:
 > > So in a sense, I think you're both correct.  If "concurrency" means
 > > the number of requests that can be handled at once, both systems are
 > > necessarily (and trivially) equivalent.  This isn't a very useful
 > > measurement, though; a more useful one is how many children (or
 > > perhaps how much memory) will be necessary to handle a given number of
 > > incoming requests per second, and with this metric the two systems
 > > could perform differently.
 > 
 > Yes, well put.  And that actually brings me back around to my original
 > hypothesis, which is that once you reach the maximum number of
 > interpreters that can be run on the box before swapping, it no longer
 > makes a difference if you're using LRU or MRU.  That's because all
 > interpreters are busy all the time, and the RAM for lexicals has already
 > been allocated in all of them.  At that point, it's a question of which
 > system can fit more interpreters in RAM at once, and I still think
 > mod_perl would come out on top there because of the shared memory.  Of
 > course most people don't run their servers at full throttle, and at less
 > than total saturation I would expect speedycgi to use less RAM and
 > possibly be faster.
 > 
 > So I guess I'm saying exactly the opposite of the original assertion:
 > mod_perl is more scalable if you define "scalable" as maximum requests per
 > second on a given machine, but speedycgi uses fewer resources at less than
 > peak loads which would make it more attractive for ISPs and other people
 > who use their servers for multiple tasks.
 > 
 > This is all hypothetical and I don't have time to experiment with it until
 > after the holidays, but I think the logic is correct.
 > 

===

To: "Jeremy Howard" <jh_lists@fastmail.fm>
cc: "Perrin Harkins" <perrin@primenet.com>, "Gunther Birznieks" <gunther@extropia.com>, "mod_perl list" <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Thu, 04 Jan 2001 05:20:43 -0800

This is planned for a future release of speedycgi, though there will
probably be an option to set a maximum number of bytes that can be
buffered before the frontend contacts a perl interpreter and starts
passing over the bytes.

Currently you can do this sort of acceleration with script output if you
use the "speedy" binary (not mod_speedycgi), and you set the BufsizGet option
high enough so that it's able to buffer all the output from your script.
The perl interpreter will then be able to detach and go handle other
requests while the frontend process waits for the output to drain.

 > Perrin Harkins wrote:
 > > What I was saying is that it doesn't make sense for one to need fewer
 > > interpreters than the other to handle the same concurrency.  If you have
 > > 10 requests at the same time, you need 10 interpreters.  There's no way
 > > speedycgi can do it with fewer, unless it actually makes some of them
 > > wait.  That could be happening, due to the fork-on-demand model, although
 > > your warmup round (priming the pump) should take care of that.
 > >
 > I don't know if Speedy fixes this, but one problem with mod_perl v1 is that
 > if, for instance, a large POST request is being uploaded, this takes a whole
 > perl interpreter while the transaction is occurring. This is at least one
 > place where a Perl interpreter should not be needed.
 > 
 > Of course, this could be overcome if an HTTP Accelerator is used that takes
 > the whole request before passing it to a local httpd, but I don't know of
 > any proxies that work this way (AFAIK they all pass the packets as they
 > arrive).

===

From: "Les Mikesell" <lesmikesell@home.com>
To: "Perrin Harkins" <perrin@primenet.com>, "Sam Horrocks" <sam@daemoninc.com>
Cc: "Gunther Birznieks" <gunther@extropia.com>, "mod_perl list" <modperl@apache.org>, <speedycgi@newlug.org>
References: <18795.978612994@daemonweb.daemoninc.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Thu, 4 Jan 2001 08:24:08 -0600

"Sam Horrocks" <sam@daemoninc.com> wrote: ((???))


>  > Are the speedycgi+Apache processes smaller than the mod_perl
>  > processes?  If not, the maximum number of concurrent requests you can
>  > handle on a given box is going to be the same.
>
>  The size of the httpds running mod_speedycgi, plus the size of speedycgi
>  perl processes is significantly smaller than the total size of the httpd's
>  running mod_perl.

That would be true if you only ran one mod_perl'd httpd, but can you
give a better comparison to the usual setup for a busy site where
you run a non-mod_perl lightweight front end and let mod_rewrite
decide what is proxied through to the larger mod_perl'd backend,
letting apache decide how many backends you need to have
running?

>  The reason for this is that only a handful of perl processes are required by
>  speedycgi to handle the same load, whereas mod_perl uses a perl interpreter
>  in all of the httpds.

I always see at least a 10-1 ratio of front-to-back end httpd's when serving
over the internet.   One effect that is difficult to benchmark is that clients
connecting over the internet are often slow and will hold up the process
that is delivering the data even though the processing has been completed.
The proxy approach provides some buffering and allows the backend
to move on more quickly.  Does speedycgi do the same?

===

Date: Thu, 4 Jan 2001 17:15:35 +0100
From: Roger Espel Llima <espel@iagora.net>
To: Jeremy Howard <jh_lists@fastmail.fm>
Cc: modperl@apache.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

"Jeremy Howard" <jh_lists@fastmail.fm> wrote:
> A backend server can realistically handle multiple frontend requests, since
> the frontend server must stick around until the data has been delivered
> to the client (at least that's my understanding of the lingering-close
> issue that was recently discussed at length here). 

I won't enter the {Fast,Speedy}-CGI debates, having never played
with these, but the picture you're painting about delivering data to
the clients is just a little bit too bleak.

With a frontend/backend mod_perl setup, the frontend server sticks
around for a second or two as part of the lingering_close routine,
but it doesn't have to wait for the client to finish reading all the
data.  Fortunately enough, spoonfeeding data to slow clients is
handled by the OS kernel.

===

Date: Thu, 04 Jan 2001 20:47:22 -0800
From: Perrin Harkins <perrin@primenet.com>
To: Sam Horrocks <sam@daemoninc.com>
CC: mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

Hi Sam,

I think we're talking in circles here a bit, and I don't want to
diminish the original point, which I read as "MRU process selection is a
good idea for Perl-based servers."  Your tests showed that this was
true.

Let me just try to explain my reasoning.  I'll define a couple of my
base assumptions, in case you disagree with them.

- Slices of CPU time doled out by the kernel are very small - so small
that processes can be considered concurrent, even though technically
they are handled serially.
- A set of requests can be considered "simultaneous" if they all arrive
and start being handled in a period of time shorter than the time it
takes to service a request.

Operating on these two assumptions, I say that 10 simultaneous requests
will require 10 interpreters to service them.  There's no way to handle
them with fewer, unless you queue up some of the requests and make them
wait.

I also say that if you have a top limit of 10 interpreters on your
machine because of memory constraints, and you're sending in 10
simultaneous requests constantly, all interpreters will be used all the
time.  In that case it makes no difference to the throughput whether you
use MRU or LRU.

>  What you say would be true if you had 10 processors and could get
>  true concurrency.  But on single-cpu systems you usually don't need
>  10 unix processes to handle 10 requests concurrently, since they get
>  serialized by the kernel anyways.

I think the CPU slices are smaller than that.  I don't know much about
process scheduling, so I could be wrong.  I would agree with you if we
were talking about requests that were coming in with more time between
them.  Speedycgi will definitely use fewer interpreters in that case.

>  I found that setting MaxClients to 100 stopped the paging.  At concurrency
>  level 100, both mod_perl and mod_speedycgi showed similar rates with ab.
>  Even at higher levels (300), they were comparable.

That's what I would expect if both systems have a similar limit of how
many interpreters they can fit in RAM at once.  Shared memory would help
here, since it would allow more interpreters to run.

By the way, do you limit the number of SpeedyCGI processes as well?  It
seems like you'd have to, or they'd start swapping too when you throw
too many requests in.

>  But, to show that the underlying problem is still there, I then changed
>  the hello_world script and doubled the amount of un-shared memory.
>  And of course the problem then came back for mod_perl, although speedycgi
>  continued to work fine.  I think this shows that mod_perl is still
>  using quite a bit more memory than speedycgi to provide the same service.

I'm guessing that what happened was you ran mod_perl into swap again. 
You need to adjust MaxClients when your process size changes
significantly.

>  > >  I believe that with speedycgi you don't have to lower the MaxClients
>  > >  setting, because it's able to handle a larger number of clients, at
>  > >  least in this test.
>  >
>  > Maybe what you're seeing is an ability to handle a larger number of
>  > requests (as opposed to clients) because of the performance benefit I
>  > mentioned above.
> 
>  I don't follow.

When not all processes are in use, I think Speedy would handle requests
more quickly, which would allow it to handle n requests in less time
than mod_perl.  Saying it handles more clients implies that the requests
are simultaneous.  I don't think it can handle more simultaneous
requests.

>  > Are the speedycgi+Apache processes smaller than the mod_perl
>  > processes?  If not, the maximum number of concurrent requests you can
>  > handle on a given box is going to be the same.
> 
>  The size of the httpds running mod_speedycgi, plus the size of speedycgi
>  perl processes is significantly smaller than the total size of the httpd's
>  running mod_perl.
> 
>  The reason for this is that only a handful of perl processes are required by
>  speedycgi to handle the same load, whereas mod_perl uses a perl interpreter
>  in all of the httpds.

I think this is true at lower levels, but not when the number of
simultaneous requests gets up to the maximum that the box can handle. 
At that point, it's a question of how many interpreters can fit in
memory.  I would expect the size of one Speedy + one httpd to be about
the same as one mod_perl/httpd when no memory is shared.  With sharing,
you'd be able to run more processes.
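
(To put made-up numbers on it: if each interpreter needs 10MB, of which 6MB is
shared, then 512MB of RAM fits roughly 512/10 = 51 completely unshared
interpreters, but about (512-6)/4 = 126 interpreters that share those 6MB.)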

===

To: Roger Espel Llima <espel@iagora.net>, modperl@apache.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
From: Joe Schaefer <joe+apache@sunstarsys.com>
Date: 05 Jan 2001 00:53:05 -0500

Roger Espel Llima <espel@iagora.net> writes:

> "Jeremy Howard" <jh_lists@fastmail.fm> wrote:

I'm pretty sure I'm the person whose words you're quoting here,
not Jeremy's.

> > A backend server can realistically handle multiple frontend requests, since
> > the frontend server must stick around until the data has been delivered
> > to the client (at least that's my understanding of the lingering-close
> > issue that was recently discussed at length here). 
> 
> I won't enter the {Fast,Speedy}-CGI debates, having never played
> with these, but the picture you're painting about delivering data to
> the clients is just a little bit too bleak.

It's a "hypothetical", and I obviously exaggerated the numbers to show
the advantage of a front/back end architecture for "comparative benchmarks" 
like these.  As you well know, the relevant issue is the percentage of time 
spent generating the content relative to the entire time spent servicing 
the request.  If you don't like seconds, rescale it to your favorite 
time window.

> With a frontend/backend mod_perl setup, the frontend server sticks
> around for a second or two as part of the lingering_close routine,
> but it doesn't have to wait for the client to finish reading all the
> data.  Fortunately enough, spoonfeeding data to slow clients is
> handled by the OS kernel.

Right- relative to the time it takes the backend to actually 
create and deliver the content to the frontend, a second or
two can be an eternity.  

===

From: Sam Horrocks <sam@daemoninc.com>
To: "Les Mikesell" <lesmikesell@home.com>
cc: "Perrin Harkins" <perrin@primenet.com>,
        "Gunther Birznieks" <gunther@extropia.com>,
        "mod_perl list" <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Fri, 05 Jan 2001 04:28:59 -0800

 > >  > Are the speedycgi+Apache processes smaller than the mod_perl
 > >  > processes?  If not, the maximum number of concurrent requests you can
 > >  > handle on a given box is going to be the same.
 > >
 > >  The size of the httpds running mod_speedycgi, plus the size of speedycgi
 > >  perl processes is significantly smaller than the total size of the httpd's
 > >  running mod_perl.
 > 
 > That would be true if you only ran one mod_perl'd httpd, but can you
 > give a better comparison to the usual setup for a busy site where
 > you run a non-mod_perl lightweight front end and let mod_rewrite
 > decide what is proxied through to the larger mod_perl'd backend,
 > letting apache decide how many backends you need to have
 > running?

 The fundamental differences would remain the same - even in the mod_perl
 backend, the requests will be spread out over all the httpd's that are
 running, whereas speedycgi would tend to use fewer perl interpreters
 to handle the same load.

 But with this setup, the mod_perl backend could probably be set to run
 fewer httpds because it doesn't have to wait on slow clients.  And the
 fewer httpd's you run with mod_perl the smaller your total memory.

 > >  The reason for this is that only a handful of perl processes are required by
 > >  speedycgi to handle the same load, whereas mod_perl uses a perl interpreter
 > >  in all of the httpds.
 > 
 > I always see at least a 10-1 ratio of front-to-back end httpd's when serving
 > over the internet.   One effect that is difficult to benchmark is that clients
 > connecting over the internet are often slow and will hold up the process
 > that is delivering the data even though the processing has been completed.
 > The proxy approach provides some buffering and allows the backend
 > to move on more quickly.  Does speedycgi do the same?

 There are plans to make it so that SpeedyCGI does more buffering of
 the output in memory, perhaps eliminating the need for a caching frontend
 webserver.  It works now only for the "speedy" binary (not mod_speedycgi)
 if you set the BufsizGet value high enough.

 Of course you could add a caching webserver in front of the SpeedyCGI server
 just like you do with mod_perl now.  So yes you can do the same with
 speedycgi now.

===

To: perrin@primenet.com
cc: mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Sat, 06 Jan 2001 04:32:34 -0800
From: Sam Horrocks <sam@daemoninc.com>

 > Let me just try to explain my reasoning.  I'll define a couple of my
 > base assumptions, in case you disagree with them.
 > 
 > - Slices of CPU time doled out by the kernel are very small - so small
 > that processes can be considered concurrent, even though technically
 > they are handled serially.

 Don't agree.  You're equating the model with the implementation.
 Unix processes model concurrency, but when it comes down to it, if you
 don't have more CPU's than processes, you can only simulate concurrency.

 Each process runs until it either blocks on a resource (timer, network,
 disk, pipe to another process, etc), or a higher priority process
 pre-empts it, or it's taken so much time that the kernel wants to give
 another process a chance to run.

 > - A set of requests can be considered "simultaneous" if they all arrive
 > and start being handled in a period of time shorter than the time it
 > takes to service a request.

 That sounds OK.

 > Operating on these two assumptions, I say that 10 simultaneous requests
 > will require 10 interpreters to service them.  There's no way to handle
 > them with fewer, unless you queue up some of the requests and make them
 > wait.

 Right.  And that waiting takes place:

    - In the mutex around the accept call in the httpd

    - In the kernel's run queue when the process is ready to run, but is
      waiting for other processes ahead of it.

 So, since there is only one CPU, then in both cases (mod_perl and
 SpeedyCGI), processes spend time waiting.  But what happens in the
 case of SpeedyCGI is that while some of the httpd's are waiting,
 one of the earlier speedycgi perl interpreters has already finished
 its run through the perl code and has put itself back at the front of
 the speedycgi queue.  And by the time that Nth httpd gets around to
 running, it can re-use that first perl interpreter instead of needing
 yet another process.

 This is why it's important that you don't assume that Unix is truly
 concurrent.

 > I also say that if you have a top limit of 10 interpreters on your
 > machine because of memory constraints, and you're sending in 10
 > simultaneous requests constantly, all interpreters will be used all the
 > time.  In that case it makes no difference to the throughput whether you
 > use MRU or LRU.

 This is not true for SpeedyCGI, because of the reason I give above.
 10 simultaneous requests will not necessarily require 10 interpreters.

 > >  What you say would be true if you had 10 processors and could get
 > >  true concurrency.  But on single-cpu systems you usually don't need
 > >  10 unix processes to handle 10 requests concurrently, since they get
 > >  serialized by the kernel anyways.
 > 
 > I think the CPU slices are smaller than that.  I don't know much about
 > process scheduling, so I could be wrong.  I would agree with you if we
 > were talking about requests that were coming in with more time between
 > them.  Speedycgi will definitely use fewer interpreters in that case.

 This url:

    http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html

 says the default timeslice is 210ms (1/5th of a second) for Linux on a PC.
 There's also lots of good info there on Linux scheduling.

 > >  I found that setting MaxClients to 100 stopped the paging.  At concurrency
 > >  level 100, both mod_perl and mod_speedycgi showed similar rates with ab.
 > >  Even at higher levels (300), they were comparable.
 > 
 > That's what I would expect if both systems have a similar limit of how
 > many interpreters they can fit in RAM at once.  Shared memory would help
 > here, since it would allow more interpreters to run.
 > 
 > By the way, do you limit the number of SpeedyCGI processes as well?  it
 > seems like you'd have to, or they'd start swapping too when you throw
 > too many requests in.

 SpeedyCGI has an optional limit on the number of processes, but I didn't
 use it in my testing.

 > >  But, to show that the underlying problem is still there, I then changed
 > >  the hello_world script and doubled the amount of un-shared memory.
 > >  And of course the problem then came back for mod_perl, although speedycgi
 > >  continued to work fine.  I think this shows that mod_perl is still
 > >  using quite a bit more memory than speedycgi to provide the same service.
 > 
 > I'm guessing that what happened was you ran mod_perl into swap again. 
 > You need to adjust MaxClients when your process size changes
 > significantly.

 Right, but this also points out how difficult it is to get mod_perl
 tuning just right.  My opinion is that the MRU design adapts more
 dynamically to the load.

 > >  > >  I believe that with speedycgi you don't have to lower the MaxClients
 > >  > >  setting, because it's able to handle a larger number of clients, at
 > >  > >  least in this test.
 > >  >
 > >  > Maybe what you're seeing is an ability to handle a larger number of
 > >  > requests (as opposed to clients) because of the performance benefit I
 > >  > mentioned above.
 > > 
 > >  I don't follow.
 > 
 > When not all processes are in use, I think Speedy would handle requests
 > more quickly, which would allow it to handle n requests in less time
 > than mod_perl.  Saying it handles more clients implies that the requests
 > are simultaneous.  I don't think it can handle more simultaneous
 > requests.

 Don't agree.

 > >  > Are the speedycgi+Apache processes smaller than the mod_perl
 > >  > processes?  If not, the maximum number of concurrent requests you can
 > >  > handle on a given box is going to be the same.
 > > 
 > >  The size of the httpds running mod_speedycgi, plus the size of speedycgi
 > >  perl processes is significantly smaller than the total size of the httpd's
 > >  running mod_perl.
 > > 
 > >  The reason for this is that only a handful of perl processes are required by
 > >  speedycgi to handle the same load, whereas mod_perl uses a perl interpreter
 > >  in all of the httpds.
 > 
 > I think this is true at lower levels, but not when the number of
 > simultaneous requests gets up to the maximum that the box can handle. 
 > At that point, it's a question of how many interpreters can fit in
 > memory.  I would expect the size of one Speedy + one httpd to be about
 > the same as one mod_perl/httpd when no memory is shared.  With sharing,
 > you'd be able to run more processes.

 I'd agree that the size of one Speedy backend + one httpd would be the
 same or even greater than the size of one mod_perl/httpd when no memory
 is shared.  But because the speedycgi httpds are small (no perl in them)
 and the number of SpeedyCGI perl interpreters is small, the total memory
 required is significantly smaller for the same load.

===

Date: Sat, 06 Jan 2001 13:35:01 -0800
From: Perrin Harkins <perrin@primenet.com>
Reply-To: perrin@primenet.com
To: Sam Horrocks <sam@daemoninc.com>
CC: mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

Sam Horrocks wrote:
>  Don't agree.  You're equating the model with the implementation.
>  Unix processes model concurrency, but when it comes down to it, if you
>  don't have more CPU's than processes, you can only simulate concurrency.
[...]
>  This url:
> 
>     http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html
> 
>  says the default timeslice is 210ms (1/5th of a second) for Linux on a PC.
>  There's also lots of good info there on Linux scheduling.

Thanks for the info.  This makes much more sense to me now.  It sounds
like using an MRU algorithm for process selection is automatically
finding the sweet spot in terms of how many processes can run within the
space of one request and coming close to the ideal of never having
unused processes in memory.  Now I'm really looking forward to getting
MRU and shared memory in the same package and seeing how high I can
scale my hardware.

===

Date: Sat, 06 Jan 2001 16:46:30 -0500
From: Buddy Lee Haystack <haystack@email.rentzone.org>
To: perrin@primenet.com
Cc: Sam Horrocks <sam@daemoninc.com>, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

Does this mean that mod_perl's memory hunger will be curbed in
the future using some of the neat tricks in Speedycgi?

Perrin Harkins wrote:
> 
> Sam Horrocks wrote:
> >  Don't agree.  You're equating the model with the implementation.
> >  Unix processes model concurrency, but when it comes down to it, if you
> >  don't have more CPU's than processes, you can only simulate concurrency.
> [...]
> >  This url:
> >
> >     http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html
> >
> >  says the default timeslice is 210ms (1/5th of a second) for Linux on a PC.
> >  There's also lots of good info there on Linux scheduling.
> 
> Thanks for the info.  This makes much more sense to me now.  It sounds
> like using an MRU algorithm for process selection is automatically
> finding the sweet spot in terms of how many processes can run within the
> space of one request and coming close to the ideal of never having
> unused processes in memory.  Now I'm really looking forward to getting
> MRU and shared memory in the same package and seeing how high I can
> scale my hardware.

===

Date: Sat, 06 Jan 2001 13:47:51 -0800
From: Perrin Harkins <perrin@primenet.com>
To: haystack@email.rentzone.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

Buddy Lee Haystack wrote:

> Does this mean that mod_perl's memory hunger will be curbed
> in the future using some of the neat tricks in Speedycgi?

Yes.  The upcoming mod_perl 2 (running on Apache 2) will use MRU to
select threads.  Doug demoed this at ApacheCon a few months back.

===

From: "Les Mikesell" <lesmikesell@home.com>
To: <perrin@primenet.com>, "Sam Horrocks" <sam@daemoninc.com>
Cc: "mod_perl list" <modperl@apache.org>, <speedycgi@newlug.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Sat, 6 Jan 2001 15:56:44 -0600

"Sam Horrocks" <sam@daemoninc.com> wrote: 

>  Right, but this also points out how difficult it is to get mod_perl
>  tuning just right.  My opinion is that the MRU design adapts more
>  dynamically to the load.

How would this compare to apache's process management when
using the front/back end approach?

>  I'd agree that the size of one Speedy backend + one httpd would be the
>  same or even greater than the size of one mod_perl/httpd when no memory
>  is shared.  But because the speedycgi httpds are small (no perl in them)
>  and the number of SpeedyCGI perl interpreters is small, the total memory
>  required is significantly smaller for the same load.

Likewise, it would be helpful if you would always make the comparison
to the dual httpd setup that is often used for busy sites.   I think it must
really boil down to the efficiency of your IPC vs. access to the full
apache environment.

===

Date: Sat, 06 Jan 2001 14:08:56 -0800
From: Joshua Chamas <joshua@chamas.com>
To: Sam Horrocks <sam@daemoninc.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

Sam Horrocks wrote:

>  Don't agree.  You're equating the model with the implementation.
>  Unix processes model concurrency, but when it comes down to it, if you
>  don't have more CPU's than processes, you can only simulate concurrency.

Hey Sam, nice module.  I just installed your SpeedyCGI for a good 'ol
HelloWorld benchmark & it was a snap, well done.  I'd like to add to the 
numbers below that a fair benchmark would be between mod_proxy in front 
of a mod_perl server and mod_speedycgi, as it would be a similar memory 
saving model ( this is how we often scale mod_perl )... both models would
end up forwarding back to a smaller set of persistent perl interpreters.

However, I did not do such a benchmark, so SpeedyCGI loses out a
bit for the extra layer it has to do :(   This is based on the 
suite at http://www.chamas.com/bench/hello.tar.gz, but I have not
included the speedy test in that yet.

 -- Josh

Test Name                      Test File  Hits/sec   Total Hits Total Time sec/Hits   Bytes/Hit  
------------                   ---------- ---------- ---------- ---------- ---------- ---------- 
Apache::Registry v2.01 CGI.pm  hello.cgi   451.9     27128 hits 60.03 sec  0.002213   216 bytes  
Speedy CGI                     hello.cgi   375.2     22518 hits 60.02 sec  0.002665   216 bytes  

Apache Server Header Tokens
---------------------------
(Unix)
Apache/1.3.14
OpenSSL/0.9.6
PHP/4.0.3pl1
mod_perl/1.24
mod_ssl/2.7.1

===

To: speedycgi@newlug.org
cc: perrin@primenet.com, "mod_perl list" <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Sat, 06 Jan 2001 14:37:34 -0800
From: Sam Horrocks <sam@daemoninc.com>

 > >  Right, but this also points out how difficult it is to get mod_perl
 > >  tuning just right.  My opinion is that the MRU design adapts more
 > >  dynamically to the load.
 > 
 > How would this compare to apache's process management when
 > using the front/back end approach?

 Same thing applies.  The front/back end approach does not change the
 fundamentals.

 > >  I'd agree that the size of one Speedy backend + one httpd would be the
 > >  same or even greater than the size of one mod_perl/httpd when no memory
 > >  is shared.  But because the speedycgi httpds are small (no perl in them)
 > >  and the number of SpeedyCGI perl interpreters is small, the total memory
 > >  required is significantly smaller for the same load.
 > 
 > Likewise, it would be helpful if you would always make the comparison
 > to the dual httpd setup that is often used for busy sites.   I think it must
 > really boil down to the efficiency of your IPC vs. access to the full
 > apache environment.

 The reason I don't include that comparison is that it's not fundamental
 to the differences between mod_perl and speedycgi or LRU and MRU that
 I have been trying to point out.  Regardless of whether you add a
 frontend or not, the mod_perl process selection remains LRU and the
 speedycgi process selection remains MRU.

===

To: Joshua Chamas <joshua@chamas.com>
cc: perrin@primenet.com, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Date: Sat, 06 Jan 2001 15:58:27 -0800
From: Sam Horrocks <sam@daemoninc.com>

A few things:

    - In your results, could you add the speedycgi version number (2.02),
      and the fact that this is using the mod_speedycgi frontend.
      The fork/exec frontend will be much slower on hello-world so I don't
      want people to get the wrong idea.  You may want to benchmark
      the fork/exec version as well.

    - You may be able to eke out a little more performance by setting
      MaxRuns to 0 (infinite).  This is set for mod_speedycgi using the
      SpeedyMaxRuns directive, or on the command-line using "-r0" (there's
      a small sketch after this list).  This setting is similar to the
      MaxRequestsPerChild setting in apache.

    - My tests show mod_perl/speedy much closer than yours do, even with
      MaxRuns at its default value of 500.  Maybe you're running on
      a different OS than I am - I'm using Redhat 6.2.  I'm also running
      one rev lower of mod_perl in case that matters.
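
For what it's worth, that looks something like this (the shebang form assumes
the usual speedy convention of putting SpeedyCGI options after a "--"):

    # mod_speedycgi, in httpd.conf:
    SpeedyMaxRuns 0

    # or for the speedy binary, on the script's #! line:
    #!/usr/bin/speedy -- -r0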


 > Hey Sam, nice module.  I just installed your SpeedyCGI for a good 'ol
 > HelloWorld benchmark & it was a snap, well done.  I'd like to add to the 
 > numbers below that a fair benchmark would be between mod_proxy in front 
 > of a mod_perl server and mod_speedycgi, as it would be a similar memory 
 > saving model ( this is how we often scale mod_perl )... both models would
 > end up forwarding back to a smaller set of persistent perl interpreters.
 > 
 > However, I did not do such a benchmark, so SpeedyCGI loses out a
 > bit for the extra layer it has to do :(   This is based on the 
 > suite at http://www.chamas.com/bench/hello.tar.gz, but I have not
 > included the speedy test in that yet.
 > 
 >  -- Josh
 > 
 > Test Name                      Test File  Hits/sec   Total Hits Total Time sec/Hits   Bytes/Hit  
 > ------------                   ---------- ---------- ---------- ---------- ---------- ---------- 
 > Apache::Registry v2.01 CGI.pm  hello.cgi   451.9     27128 hits 60.03 sec  0.002213   216 bytes  
 > Speedy CGI                     hello.cgi   375.2     22518 hits 60.02 sec  0.002665   216 bytes  
 > 
 > Apache Server Header Tokens
 > ---------------------------
 > (Unix)
 > Apache/1.3.14
 > OpenSSL/0.9.6
 > PHP/4.0.3pl1
 > mod_perl/1.24
 > mod_ssl/2.7.1

===

From: "Les Mikesell" <lesmikesell@home.com>
To: <speedycgi@newlug.org>, "Sam Horrocks" <sam@daemoninc.com>
Cc: <perrin@primenet.com>, "mod_perl list" <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Sat, 6 Jan 2001 23:10:02 -0600

"Sam Horrocks" <sam@daemoninc.com> wrote:

> > >  Right, but this also points out how difficult it is to get mod_perl
>  > >  tuning just right.  My opinion is that the MRU design adapts more
>  > >  dynamically to the load.
>  >
>  > How would this compare to apache's process management when
>  > using the front/back end approach?
>
>  Same thing applies.  The front/back end approach does not change the
>  fundamentals.

It changes them drastically in the world of slow internet connections,
but perhaps not much in artificial benchmarks or LAN use.   I think
you can reduce the problem to:

     How much time do you spend in non-perl apache code vs. how
     much time  you spend in perl code.
and the solution to:
    Only use the memory footprint of perl for the minimal time it is needed.

If your I/O is slow and your program complexity minimal, the bulk of
the wall-clock time is spent in i/o wait by non-perl apache code.  Using
a front-end proxy greatly reduces this time (and correspondingly the
ratio of time spent in non-perl code) for the backend where it matters
because you are tying up a copy of perl in memory.     Likewise, increasing
the complexity of the perl code will reduce this ratio, reducing the
potential for saving memory regardless of what you do, so benchmarking
a trivial perl program will likely be misleading.
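
To put rough numbers on that ratio (these figures are hypothetical, just to
illustrate the arithmetic, not measurements from anyone's server):

#!/usr/bin/perl
use strict;

my $perl_ms  = 50;      # hypothetical CPU time spent in perl per request
my $io_ms    = 2000;    # hypothetical time trickling output to a slow client
my $load_rps = 20;      # hypothetical steady request rate

# Little's-law style estimate of how many perl-sized processes stay busy:
my $no_proxy   = $load_rps * ($perl_ms + $io_ms) / 1000;  # backend does its own i/o
my $with_proxy = $load_rps *  $perl_ms           / 1000;  # front end buffers the output

printf "perl-sized processes tied up: %.0f without a proxy, %.0f with one\n",
       $no_proxy, $with_proxy;     # prints 41 and 1 with these numbers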

>  > >  I'd agree that the size of one Speedy backend + one httpd would be the
>  > >  same or even greater than the size of one mod_perl/httpd when no memory
>  > >  is shared.  But because the speedycgi httpds are small (no perl in them)
>  > >  and the number of SpeedyCGI perl interpreters is small, the total memory
>  > >  required is significantly smaller for the same load.
>  >
>  > Likewise, it would be helpful if you would always make the comparison
>  > to the dual httpd setup that is often used for busy sites.   I think it must
>  > really boil down to the efficiency of your IPC vs. access to the full
>  > apache environment.
>
>  The reason I don't include that comparison is that it's not fundamental
>  to the differences between mod_perl and speedycgi or LRU and MRU that
>  I have been trying to point out.  Regardless of whether you add a
>  frontend or not, the mod_perl process selection remains LRU and the
>  speedycgi process selection remains MRU.

I don't think I understand what you mean by LRU.   When I view the
Apache server-status with ExtendedStatus On,  it appears that
the backend server processes recycle themselves as soon as they
are free instead of cycling sequentially through all the available
processes.   Did you mean to imply otherwise or are you talking
about something else?

===
Date: Sat, 06 Jan 2001 23:51:34 -0800
From: Joshua Chamas <joshua@chamas.com>
To: Sam Horrocks <sam@daemoninc.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts  that contain un-shared memory

Sam Horrocks wrote:
> 
> A few things:
> 
>     - In your results, could you add the speedycgi version number (2.02),
>       and the fact that this is using the mod_speedycgi frontend.

The version numbers are gathered at runtime, so for mod_speedycgi,
this would get picked up if you registered it in the Apache server
header that gets sent out.  I'll list the test as mod_speedycgi.

>       The fork/exec frontend will be much slower on hello-world so I don't
>       want people to get the wrong idea.  You may want to benchmark
>       the fork/exec version as well.
> 

If it's slower then what's the point :)  If mod_speedycgi is the faster
way to run it, then that should be good enough, no?  If you would like 
to contribute that test to the suite, please do so.

>     - You may be able to eke out a little more performance by setting
>       MaxRuns to 0 (infinite).  This is set for mod_speedycgi using the
>       SpeedyMaxRuns directive, or on the command-line using "-r0".
>       This setting is similar to the MaxRequestsPerChild setting in apache.
> 

Will do.

>     - My tests show mod_perl/speedy much closer than yours do, even with
>       MaxRuns at its default value of 500.  Maybe you're running on
>       a different OS than I am - I'm using Redhat 6.2.  I'm also running
>       one rev lower of mod_perl in case that matters.
> 

I'm running the same thing, RH 6.2, I don't know if the mod_perl rev 
matters, but what often does matter is that I have 2 CPUs in my box, so 
my results often look different from other people's.

===

Date: Mon, 08 Jan 2001 09:50:22 -0600
From: "Keith G. Murphy" <keithmur@mindspring.com>
To: mod_perl list <modperl@apache.org>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts  that contain un-shared memory

Les Mikesell wrote:
>
[cut] 
> 
> I don't think I understand what you mean by LRU.   When I view the
> Apache server-status with ExtendedStatus On,  it appears that
> the backend server processes recycle themselves as soon as they
> are free instead of cycling sequentially through all the available
> processes.   Did you mean to imply otherwise or are you talking
> about something else?
> 
Be careful here.  Note my message earlier in the thread about the
misleading effect of persistent connections (HTTP 1.1).

Perrin Harkins noted in another thread that it had fooled him as well as
me.

Not saying that's what you're seeing, just take it into account. 
(Quick-and-dirty test: run Netscape as the client browser; do you still
see the same thing?)

===

Date: Sun, 14 Jan 2001 12:40:00 +0800
To: Sam Horrocks <sam@daemoninc.com>, perrin@primenet.com
From: Gunther Birznieks <gunther@extropia.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Cc: mod_perl list <modperl@apache.org>, speedycgi@newlug.org

I have just gotten around to reading this thread I've been saving for a 
rainy day. Well, it's not rainy, but I'm finally getting to it. Apologies 
to those who hate it when people don't snip their reply mails, but I am 
including it so that the entire context is not lost.

Sam (or others who may understand Sam's explanation),

I am still confused by this explanation of MRU helping when there are 10 
processes serving 10 requests at all times. I understand MRU helping when 
the processes are not at max, but I don't see how it helps when they are at 
max utilization.

It seems to me that if the wait is the same for mod_perl backend processes 
and speedyCGI processes, that it doesn't matter if some of the speedycgi 
processes cycle earlier than the mod_perl ones because all 10 will always 
be used.

I did read and reread (once) the snippets about modeling concurrency and 
the HTTP waiting for an accept.. But I still don't understand how MRU helps 
when all the processes would be in use anyway. At that point they all have 
an equal chance of being called.

Could you clarify this with a simpler example? Maybe 4 processes and a 
sample timeline of what happens to those when there are enough requests to 
keep all 4 busy all the time for speedyCGI and a mod_perl backend?

===

To: Gunther Birznieks <gunther@extropia.com>
cc: perrin@primenet.com, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Wed, 17 Jan 2001 03:19:46 -0800
From: Sam Horrocks <sam@daemoninc.com>

I think the major problem is that you're assuming that just because
there are 10 constant concurrent requests, that there have to be 10
perl processes serving those requests at all times in order to get
maximum throughput.  The problem with that assumption is that there
is only one CPU - ten processes cannot all run simultaneously anyways,
so you don't really need ten perl interpreters.

I've been trying to think of better ways to explain this.  I'll try to
explain with an analogy - it's sort-of lame, but maybe it'll give you
a mental picture of what's happening.  To eliminate some confusion,
this analogy doesn't address LRU/MRU, nor waiting on other events like
network or disk i/o.  It only tries to explain why you don't necessarily
need 10 perl-interpreters to handle a stream of 10 concurrent requests
on a single-CPU system.

You own a fast-food restaurant.  The players involved are:

    Your customers.  These represent the http requests.

    Your cashiers.  These represent the perl interpreters.

    Your cook.  You only have one.  This represents your CPU.

The normal flow of events is this:
    
    A cashier gets an order from a customer.  The cashier goes and
    waits until the cook is free, and then gives the order to the cook.
    The cook then cooks the meal, taking 5-minutes for each meal.
    The cashier waits for the meal to be ready, then takes the meal and
    gives it to the customer.  The cashier then serves another customer.
    The cashier/customer interaction takes a very small amount of time.

The analogy is this:

    An http request (customer) arrives.  It is given to a perl
    interpreter (cashier).  A perl interpreter must wait for all other
    perl interpreters ahead of it to finish using the CPU (the cook).
    It can't serve any other requests until it finishes this one.
    When its turn arrives, the perl interpreter uses the CPU to process
    the perl code.  It then finishes and gives the results over to the
    http client (the customer).

Now, say in this analogy you begin the day with 10 customers in the store.
At each 5-minute interval thereafter another customer arrives.  So at time
0, there is a pool of 10 customers.  At time +5, another customer arrives.
At time +10, another customer arrives, ad infinitum.

You could hire 10 cashiers in order to handle this load.  What would
happen is that the 10 cashiers would fairly quickly get all the orders
from the first 10 customers simultaneously, and then start waiting for
the cook.  The 10 cashiers would queue up.  Cashier #1 would put in the
first order.  Cashiers 2-10 would wait their turn.  After 5-minutes,
cashier number 1 would receive the meal, deliver it to customer #1, and
then serve the next customer (#11) that just arrived at the 5-minute mark.
Cashier #1 would take customer #11's order, then queue up and wait in
line for the cook - there will be 9 other cashiers already in line, so
the wait will be long.  At the 10-minute mark, cashier #2 would receive
a meal from the cook, deliver it to customer #2, then go on and serve
the next customer (#12) that just arrived.  Cashier #2 would then go and
wait in line for the cook.  This continues on through all the cashiers
in order 1-10, then repeating, 1-10, ad infinitum.

Now even though you have 10 cashiers, most of their time is spent
waiting to put in an order to the cook.  Starting with customer #11,
all customers will wait 50-minutes for their meal.  When customer #11
comes in he/she will immediately get to place an order, but it will take
the cashier 45-minutes to wait for the cook to become free, and another
5-minutes for the meal to be cooked.  Same is true for customer #12,
and all customers from then on.

Now, the question is, could you get the same throughput with fewer
cashiers?  Say you had 2 cashiers instead.  The 10 customers are
there waiting.  The 2 cashiers take orders from customers #1 and #2.
Cashier #1 then gives the order to the cook and waits.  Cashier #2 waits
in line for the cook behind cashier #1.  At the 5-minute mark, the first
meal is done.  Cashier #1 delivers the meal to customer #1, then serves
customer #3.  Cashier #1 then goes and stands in line behind cashier #2.
At the 10-minute mark, cashier #2's meal is ready - it's delivered to
customer #2 and then customer #4 is served.  This continues on with the
cashiers trading off between serving customers.

Does the scenario with two cashiers go any more slowly than the one with
10 cashiers?  No.  When the 11th customer arrives at the 5-minute mark,
what he/she sees is that customer #3 is just now putting in an order.
There are 7 other people there waiting to put in orders.  Customer #11 will
wait 40 minutes until he/she puts in an order, then wait another 10 minutes
for the meal to arrive.  Same is true for customer #12, and all others arriving
thereafter.

The only difference between the two scenarios is the number of cashiers,
and where the waiting is taking place.  In the first scenario, each customer
puts in their order immediately, then waits 50 minutes for it to arrive.
In the second scenario each customer waits 40 minutes to put in
their order, then waits another 10 minutes for it to arrive.

What I'm trying to show with this analogy is that no matter how many
"simultaneous" requests you have, they all have to be serialized at
some point because you only have one CPU.  Either you can serialize them
before they get to the perl interpreter, or afterward.  Either way you
wait on the CPU, and you get the same throughput.
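
To see the arithmetic behind this, here's a small perl sketch (mine, not
part of the original mail) that follows customer #11 through both scenarios:

#!/usr/bin/perl
# One cook (the CPU) makes meals strictly one at a time, 5 minutes each.
# Customers 1-10 are waiting at time 0 and another arrives every 5 minutes,
# so customer #11 walks in at t=5.  His meal is the 11th one cooked either
# way; the number of cashiers (perl interpreters) only moves the waiting
# around, it doesn't change the throughput.
use strict;

my $cook_time = 5;                                       # minutes per meal
for my $cashiers (10, 2) {
    my $order_placed = (11 - $cashiers) * $cook_time;    # when a cashier is free for him
    my $meal_ready   = 11 * $cook_time;                  # 11th meal out of the kitchen
    printf "%2d cashiers: order taken at t=%2d, meal ready at t=%2d, wait = %d minutes\n",
           $cashiers, $order_placed, $meal_ready, $meal_ready - 5;
}
# Both lines report a 50-minute wait; only where the waiting happens differs.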

Does that help?
===

Date: Wed, 17 Jan 2001 23:05:13 +0800
To: Sam Horrocks <sam@daemoninc.com>
From: Gunther Birznieks <gunther@extropia.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 

I guess as I get older I start to slip technically. :) This helps me a bit, 
but it doesn't really help me understand the final argument (that MRU is 
still going to help on a fully loaded system).

With some modification, I guess I am thinking that the cook is really the 
OS and the CPU is really the oven. But the hamburgers on an Intel oven have 
to be timesliced instead of left to cook and then after it's done the next 
hamburger is put on.

So if we think of meals as Perl requests, the reality is that not all meals 
take the same amount of time to cook. A quarter pounder surely takes longer 
than your typical paper thin McDonald's Patty.

The fact that a customer requests a meal that takes longer to cook than 
another one is relatively random. In fact in the real world, it is likely 
to be random. This means that it's possible for all 10 meals to be cooking 
but the 3rd meal gets done really fast, so another customer gets time 
sliced to use the oven for their meal -- which might be a long meal.

In your testing, perhaps the problem is that you are benchmarking with a 
homogeneous process. So of course you are seeing this behavior that makes 
it look like serializing 10 connections is just the same wait as time 
slicing them and therefore an MRU algorithm works better (of course it 
works better, because you keep releasing the systems in order)...

But in the world where the 3rd or 5th or 6th process may finish sooner and 
release sooner than others, then an MRU algorithm doesn't matter. And 
actually a process that finishes in 10 seconds shouldn't have to wait until 
a process that takes 30 seconds to complete has finished.

And all 10 interpreters are in use at the same time, serving all requests 
and randomly popping off the queue and starting again where no MRU or LRU 
algorithm will really help. It's all the same.

Anyway, maybe I am still not really getting it. Even with the fast food 
analogy. Maybe it is time to throw in the network time and other variables 
that seemed to make a difference in Perrin's understanding of how you were 
approaching the explanation.

I am now curious -- on a fully loaded system of max 10 processes, did you 
see that SpeedyCGI scaled better than mod_perl on your benchmarks? Or are 
we still just speculating?


===

Date: Wed, 17 Jan 2001 11:08:09 -0500
From: Buddy Lee Haystack <haystack@email.rentzone.org>
To: Sam Horrocks <sam@daemoninc.com>
Cc: Gunther Birznieks <gunther@extropia.com>, perrin@primenet.com, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

I have a wide assortment of queries on a site, some of which
take several minutes to execute, while others execute in
less than one second. If I understand this analogy correctly,
I'd be better off with the current incarnation of mod_perl
because there would be more cashiers around to serve the
"quick cups of coffee" that many customers request at my
diner.

Is this correct?


===

To: Gunther Birznieks <gunther@extropia.com>
cc: perrin@primenet.com, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Wed, 17 Jan 2001 15:37:00 -0800
From: Sam Horrocks <sam@daemoninc.com>

 > I guess as I get older I start to slip technically. :) This helps me a bit, 
 > but it doesn't really help me understand the final argument (that MRU is 
 > still going to help on a fully loaded system).
 > 
 > With some modification, I guess I am thinking that the cook is really the 
 > OS and the CPU is really the oven. But the hamburgers on an Intel oven have 
 > to be timesliced instead of left to cook and then after it's done the next 
 > hamburger is put on.
 > 
 > So if we think of meals as Perl requests, the reality is that not all meals 
 > take the same amount of time to cook. A quarter pounder surely takes longer 
 > than your typical paper thin McDonald's Patty.
 > 
 > The fact that a customer requests a meal that takes longer to cook than 
 > another one is relatively random. In fact in the real world, it is likely 
 > to be random. This means that it's possible for all 10 meals to be cooking 
 > but the 3rd meal gets done really fast, so another customer gets time 
 > sliced to use the oven for their meal -- which might be a long meal.

I don't like your mods to the analogy, because they don't model how
a CPU actually works.  Even if the cook == the OS and the oven == the
CPU, the oven *must* work on tasks sequentially.  If you look at the
assembly language for your Intel CPU you won't see anything about it
doing multi-tasking.  It does adds, subtracts, stores, loads, jumps, etc.
It executes code sequentially.  You must model this somewhere in your
analogy if it's going to be accurate.

So I'll modify your analogy to say the oven can only cook one thing at
a time.  Now, what you could do is have the cook take one of the longer
meals (the 10 minute meatloaf) out of the oven in order to cook something
small, then put the meatloaf back later to finish cooking.  But the oven
does *not* cook things in parallel.  Remember that things have
to cook for a very long time before they get timesliced -- 210ms is a
long time for a CPU, and that's the default timeslice on a Linux PC.

If we say the oven cooks things sequentially, it doesn't really change
the overall results that I had in the previous example.  The cook just
puts things in the oven sequentially, in the order in which they were
received from the cashiers - this represents the run queue in the OS.
But the cashiers still sit there and wait for the meals from the cook,
and the cook just stands there waiting for the oven to cook meals
sequentially.

 > In your testing, perhaps the problem is that you are benchmarking with a 
 > homogeneous process. So of course you are seeing this behavior that makes 
 > it look like serializing 10 connections is just the same wait as time 
 > slicing them and therefore an MRU algorithm works better (of course it 
 > works better, because you keep releasing the systems in order)...
 > 
 > But in the world where the 3rd or 5th or 6th process may finish sooner and 
 > release sooner than others, then an MRU algorithm doesn't matter. And 
 > actually a process that finishes in 10 seconds shouldn't have to wait until 
 > a process that takes 30 seconds to complete has finished.

No, homogeneity (or the lack of it) wouldn't make a difference.  Those 3rd,
5th or 6th processes run only *after* the 1st and 2nd have finished using
the CPU.  And at that point you could re-use those interpreters that 1 and 2
were using.

 > And all 10 interpreters are in use at the same time, serving all requests 
 > and randomly popping off the queue and starting again where no MRU or LRU 
 > algorithm will really help. It's all the same.

If in both the MRU/LRU case there were exactly 10 interpreters busy at
all times, then you're right it wouldn't matter.  But don't confuse
the issues - 10 concurrent requests do *not* necessarily require 10
concurrent interpreters.  The MRU has an effect on the way a stream of 10
concurrent requests is handled, and MRU results in those same requests
being handled by fewer interpreters.

 > Anyway, maybe I am still not really getting it. Even with the fast food 
 > analogy. Maybe it is time to throw in the network time and other variables 
 > that seemed to make a difference in Perrin's understanding of how you were 
 > approaching the explanation.

Please again take a look at the first analogy.  The CPU can't do multi-tasking.
Until that gets straightened out, I don't think adding more to the analogy
will help.

Also, I think the analogy is about to break - that's why I put in extra
disclaimers at the top.  It was only intended to show that 10 concurrent
requests don't necessarily require 10 perl interpreters in order to
achieve maximum throughput.

 > I am now curious -- on a fully loaded system of max 10 processes, did you 
 > see that SpeedyCGI scaled better than mod_perl on your benchmarks? Or are 
 > we still just speculating?

It is actually possible to benchmark.  Given the same concurrent load
and the same number of httpds running, speedycgi will use fewer perl
interpreters than mod_perl.  This will usually result in speedycgi
using less RAM, except under light loads, or if the amount of shared
memory is extremely large.  If the total amount of RAM used by the
mod_perl interpreters is high enough, your system will start paging,
and your performance will nosedive.  Given the same load speedycgi will
just maintain the same performance because it's using less RAM.

The thing is that if you know ahead of time what your load is going
to be in the benchmark, you can reduce the number of httpd's so that
mod_perl handles it with the same number of interpreters as speedycgi
does.  But how realistic that is in the real world, I don't know.
With speedycgi it just sort of adapts to the load automatically.  Maybe
it would be possible to come up with a better benchmark that varies the
load to show how speedycgi adapts better.

Here are my results (perl == mod_perl, speedy == mod_speedycgi):

*
* Benchmarking perl
*
  3:05pm  up 5 min,  3 users,  load average: 0.04, 0.26, 0.15
This is ApacheBench, Version 1.3
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-1999 The Apache Group, http://www.apache.org/

Benchmarking localhost (be patient)...
                                                                           
Server Software:        Apache/1.3.9
Server Hostname:        localhost
Server Port:            80

Document Path:          /perl/hello_world
Document Length:        11 bytes

Concurrency Level:      300
Time taken for tests:   30.022 seconds
Complete requests:      2409
Failed requests:        0
Total transferred:      411939 bytes
HTML transferred:       26499 bytes
Requests per second:    80.24
Transfer rate:          13.72 kb/s received

Connnection Times (ms)
              min   avg   max
Connect:        0   572 21675
Processing:    30  1201  8301
Total:         30  1773 29976
This is ApacheBench, Version 1.3
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-1999 The Apache Group, http://www.apache.org/

Benchmarking localhost (be patient)...
                                                                           
Server Software:        Apache/1.3.9
Server Hostname:        localhost
Server Port:            80

Document Path:          /perl/hello_world
Document Length:        11 bytes

Concurrency Level:      300
Time taken for tests:   41.872 seconds
Complete requests:      524
Failed requests:        0
Total transferred:      98496 bytes
HTML transferred:       6336 bytes
Requests per second:    12.51
Transfer rate:          2.35 kb/s received

Connnection Times (ms)
              min   avg   max
Connect:       70  1679  8864
Processing:   300  7209 14728
Total:        370  8888 23592
*
* Benchmarking speedy
*
  3:14pm  up 3 min,  3 users,  load average: 0.14, 0.31, 0.15
This is ApacheBench, Version 1.3
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-1999 The Apache Group, http://www.apache.org/

Benchmarking localhost (be patient)...
                                                                           
Server Software:        Apache/1.3.9
Server Hostname:        localhost
Server Port:            80

Document Path:          /speedy/hello_world
Document Length:        11 bytes

Concurrency Level:      300
Time taken for tests:   30.175 seconds
Complete requests:      6135
Failed requests:        0
Total transferred:      1060713 bytes
HTML transferred:       68233 bytes
Requests per second:    203.31
Transfer rate:          35.15 kb/s received

Connnection Times (ms)
              min   avg   max
Connect:        0   179  9122
Processing:    12   341  5710
Total:         12   520 14832
This is ApacheBench, Version 1.3
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-1999 The Apache Group, http://www.apache.org/

Benchmarking localhost (be patient)...
                                                                           
Server Software:        Apache/1.3.9
Server Hostname:        localhost
Server Port:            80

Document Path:          /speedy/hello_world
Document Length:        11 bytes

Concurrency Level:      300
Time taken for tests:   30.327 seconds
Complete requests:      7034
Failed requests:        0
Total transferred:      1221795 bytes
HTML transferred:       78595 bytes
Requests per second:    231.94
Transfer rate:          40.29 kb/s received

Connnection Times (ms)
              min   avg   max
Connect:        0   237  9336
Processing:   215   405 12012
Total:        215   642 21348


Here's the hello_world script:

#!/usr/bin/speedy
## mod_perl/cgi program; iis/perl cgi; iis/perl isapi cgi
use CGI;
$x = 'x' x 65536;       # 64K of un-shared memory in each interpreter
my $cgi = CGI->new();
print $cgi->header();
print "Hello ";
print "World";



Here's the script I used to run the benchmarks:

#!/bin/sh
# $1 selects the test directory: "perl" or "speedy" (run after a reboot)

which=$1

echo "*"
echo "* Benchmarking $which"
echo "*"

uptime                                  # record the load average before the run
httpd                                   # start apache
sleep 5                                 # give the children time to start up
# two 30-second ApacheBench runs at a concurrency level of 300
ab -t 30 -c 300 http://localhost/$which/hello_world
ab -t 30 -c 300 http://localhost/$which/hello_world


Before running each test, I rebooted my system.  Here's the software
installed:

angel: {139}# rpm -q -a |egrep -i 'mod_perl|speedy|apache'
apache-1.3.9-4
speedycgi-2.02-1
apache-devel-1.3.9-4
speedycgi-apache-2.02-1
mod_perl-1.21-2

Here are some relevant parameters from my httpd.conf:

MinSpareServers 8
MaxSpareServers 20
StartServers 10
MaxClients 150
MaxRequestsPerChild 10000
SpeedyMaxRuns 0





===

To: haystack@email.rentzone.org
cc: Gunther Birznieks <gunther@extropia.com>, perrin@primenet.com, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Wed, 17 Jan 2001 15:43:18 -0800
From: Sam Horrocks <sam@daemoninc.com>


 > I have a wide assortment of queries on a site, some of
 >which take several minutes to execute, while others
 >execute in less than one second. If I understand this
 >analogy correctly, I'd be better off with the current
 >incarnation of mod_perl because there would be more
 >cashiers around to serve the "quick cups of coffee" that
 >many customers request at my diner.


There is no coffee.  Only meals.  No substitutions. :-)

If we added coffee to the menu it would still have to be prepared by the cook.
Remember that you only have one CPU, and all the perl interpreters large and
small must gain access to that CPU in order to run.

===

Date: Wed, 17 Jan 2001 15:55:52 -0800 (PST)
From: Perrin Harkins <perrin@primenet.com>
To: Sam Horrocks <sam@daemoninc.com>
cc: Gunther Birznieks <gunther@extropia.com>, mod_perl list <modperl@apache.org>, speedycgi@newlug.org
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts  that contain un-shared memory 
On Wed, 17 Jan 2001, Sam Horrocks wrote:
> If in both the MRU/LRU case there were exactly 10 interpreters busy at
> all times, then you're right it wouldn't matter.  But don't confuse
> the issues - 10 concurrent requests do *not* necessarily require 10
> concurrent interpreters.  The MRU has an effect on the way a stream of 10
> concurrent requests is handled, and MRU results in those same requests
> being handled by fewer interpreters.

On a side note, I'm curious about how Apache decides that child
processes are unused and can be killed off.  The spawning of new processes
is pretty aggressive on a busy server, but if the server reaches a steady
state and some processes aren't needed they should be killed off.  Maybe
no one has bothered to make that part very efficient since in normal
circumstances most users would prefer to have extra processes waiting
around than not have enough to handle a surge and have to spawn a whole
bunch.

===

Date: Thu, 18 Jan 2001 03:02:11 +0100
To: Sam Horrocks <sam@daemoninc.com>
From: Christian Jaeger <christian.jaeger@sl.ethz.ch>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

Hello Sam and others

If I haven't overlooked something, nobody so far has really mentioned fastcgi. I'm 
asking myself why you reinvented the wheel. I summarize the 
differences I see:

+ perl scripts are more similar to standard CGI ones than with 
FastCGI (downside: see next point)
- it seems you can't control the request loop yourself
+ protocol is more free than the one of FastCGI (is it?)
- protocol isn't widespread (almost standard) like the one of FastCGI
- seems only to support perl (so far)
- doesn't seem to support external servers (on other machines) like 
FastCGI (does it?)

Question: does speedycgi run a separate interpreter for each script, 
or is there one process loading and calling several perl scripts? If 
it's a separate process for each script, then mod_perl is sure to use 
less memory.

As far as I understand, IF you can collect several scripts together into 
one interpreter and IF you do preforking, I don't see essential 
performance related differences between mod_perl and speedy/fastcgi 
if you set up mod_perl with the proxy approach. With mod_perl the 
protocol to the backends is http, with speedy it's speedy and with 
fastcgi it's the fastcgi protocol. (The difference between mod_perl 
and fastcgi is that fastcgi uses a request loop, whereas mod_perl has 
it's handlers (sorry, I never really used mod_perl so I don't know 
exactly).)

I think it's a pity that during the last years there was so little 
interest/support for fastcgi and now that should change with 
speedycgi. But why not, if the stuff that people develop can run on 
both and speedy is/becomes better than fastcgi.

I'm developing a web application framework (called 'Eile', you can 
see some outdated documentation on testwww.ethz.ch/eile, I will 
release a new much better version soon) which currently uses fastcgi. 
If I can get it to run with speedycgi, I'll be glad to release it 
with support for both protocols. I haven't looked very closely at it 
yet. One of the problems seems to be that I really depend on 
controlling the request loop (initialization, preforking etc. all have 
to be done before the application begins serving requests, and I'm 
also controlling exits of children myself). If you're interested in 
helping me solve these issues please contact me privately. The main 
advantages of Eile concerning resources are a) one 
process/interpreter runs dozens of 'scripts' (called page-processing 
modules), and you don't have to dispatch requests to each of them 
yourself, and b) my new version does preforking.

===

To: mod_perl list <modperl@apache.org>
From: Stephen Anderson
<Stephen.Anderson@energis-squared.com>
Subject: RE: Fwd: [speedycgi] Speedycgi scales better than
mod_perl withsc
Date: Thu, 18 Jan 2001 11:20:49 -0000


Sam Horrocks [mailto:sam@daemoninc.com] wrote:

>  > With some modification, I guess I am thinking that the 
>  > cook is really the 
>  > OS and the CPU is really the oven. But the hamburgers on 
>  > an Intel oven have 
>  > to be timesliced instead of left to cook and then after 
>  > it's done the next 
>  > hamburger is put on.
>  > 
>  > So if we think of meals as Perl requests, the reality is 
>  > that not all meals 
>  > take the same amount of time to cook. A quarter pounder 
>  > surely takes longer 
>  > than your typical paper thin McDonald's Patty.

[snip]

> 
> I don't like your mods to the analogy, because they don't model how
> a CPU actually works.  Even if the cook == the OS and the oven == the
> CPU, the oven *must* work on tasks sequentially.  If you look at the
> assembly language for your Intel CPU you won't see anything about it
> doing multi-tasking.  It does adds, subtracts, stores, loads, 
> jumps, etc.
> It executes code sequentially.  You must model this somewhere in your
> analogy if it's going to be accurate.

( I think the analogies have lost their usefulness....)

This doesn't affect the argument, because the core of it is that:

a) the CPU will not completely process a single task all at once; instead,
it will divide its time _between_ the tasks
b) tasks do not arrive at regular intervals
c) tasks take varying amounts of time to complete

Now, if (a) were true but (b) and (c) were not, then, yes, it would have the
same effective result as sequential processing. Tasks that arrived first
would finish first. In the real world however, (b) and (c) are usually true,
and it becomes practically impossible to predict which task handler (in this
case, a mod_perl process) will complete first.

Similarly, because of the non-deterministic nature of computer systems,
Apache doesn't service requests on an LRU basis; you're comparing SpeedyCGI
against a straw man. Apache's servicing algorithm approaches randomness, so
you need to build a comparison between forced-MRU and random choice.

(Note I'm not saying SpeedyCGI _won't_ win....just that the current
comparison doesn't make sense)

Thinking about it, assuming you are, at some time, servicing requests
_below_ system capacity, SpeedyCGI will always win in memory usage, and
probably have an edge in handling response time. My concern would be, does
it offer _enough_ of an edge? Especially bearing in mind, if I understand,
you could end up running anywhere up to 2x as many processes (n Apache handlers + n
script handlers)?

> No, homogeneity (or the lack of it) wouldn't make a 
> difference.  Those 3rd,
> 5th or 6th processes run only *after* the 1st and 2nd have 
> finished using
> the CPU.  And at that poiint you could re-use those 
> interpreters that 1 and 2
> were using.

This, if you'll excuse me, is quite clearly wrong. See the above argument,
and imagine that tasks 1 and 2 happen to take three times as long to
complete as task 3, and you should see that they could all end up being in
the scheduling queue together. Perhaps you're considering tasks which are
too small to take more than 1 or 2 timeslices, in which case, you're much
less likely to want to accelerate them.


[snipping obscenely long quoted thread 8-)]


Stephen.
===
To: speedycgi@newlug.org
From: Sam Horrocks <sam@daemoninc.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Thu, 18 Jan 2001 20:38:48 -0800

 > This doesn't affect the argument, because the core of it is that:
 > 
 > a) the CPU will not completely process a single task all at once; instead,
 > it will divide its time _between_ the tasks
 > b) tasks do not arrive at regular intervals
 > c) tasks take varying amounts of time to complete
 > 
 > Now, if (a) were true but (b) and (c) were not, then, yes, it would have the
 > same effective result as sequential processing. Tasks that arrived first
 > would finish first. In the real world however, (b) and (c) are usually true,
 > and it becomes practically impossible to predict which task handler (in this
 > case, a mod_perl process) will complete first.

 I'll agree with (b) and (c) - I ignored them to keep my analogy as simple
 as possible.  Again, the goal of my analogy was to show that a stream of
 10 concurrent requests can be handled with the same throughput with a lot
 fewer than 10 perl interpreters.  (b) and (c) don't really have an effect
 on that - they don't control the order in which processes arrive and get
 queued up for the CPU.

 I won't agree with (a) unless you qualify it further - what do you claim
 is the method or policy for (a)?

 There's only one run queue in the kernel.  The first task ready to run is put
 at the head of that queue, and anything arriving afterwards waits.  Only
 if that first task blocks on a resource or takes a very long time, or
 a higher priority process becomes able to run due to an interrupt is that
 process taken out of the queue.

 It is inefficient for the unix kernel to be constantly switching
 very quickly from process to process, because it takes time to do
 context switches.  Also, unless the processes share the same memory,
 some amount of the processor cache can get flushed when you switch
 processes because you're changing to a different set of memory pages.
 That's why it's best for overall throughput if the kernel keeps a single
 process running as long as it can.

 > Similarly, because of the non-deterministic nature of computer systems,
 > Apache doesn't service requests on an LRU basis; you're comparing SpeedyCGI
 > against a straw man. Apache's servicing algorithm approaches randomness, so
 > you need to build a comparison between forced-MRU and random choice.

 Apache httpd's are scheduled on an LRU basis.  This was discussed early
 in this thread.  Apache uses a file-lock for its mutex around the accept
 call, and file-locking is implemented in the kernel using a round-robin
 (fair) selection in order to prevent starvation.  This results in
 incoming requests being assigned to httpd's in an LRU fashion.
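
 To make that concrete, here's a toy pre-forking server (my sketch, not
 Apache's source; the port, pool size and lock file are arbitrary) that
 serializes accept() with a file lock in the same way.  If the kernel
 grants the lock round-robin as described, the child that has been idle
 longest is the one that gets the next request - i.e. LRU:

#!/usr/bin/perl
use strict;
use IO::Socket;
use Fcntl ':flock';

my $listen = IO::Socket::INET->new(
    LocalPort => 8080, Proto => 'tcp', Listen => 128, Reuse => 1,
) or die "listen: $!";
open(MUTEX, ">/tmp/accept.lock") or die "lock file: $!";

for (1 .. 5) {                      # pre-fork a small pool of children
    next if fork;                   # parent just keeps forking
    while (1) {
        flock(MUTEX, LOCK_EX);      # wait our turn in the kernel's lock queue
        my $client = $listen->accept;
        flock(MUTEX, LOCK_UN);      # the next child in line moves into accept()
        next unless $client;
        print $client "HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nHello World\n";
        close($client);
    }
}
wait for 1 .. 5;                    # parent just waits on the children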

 Once the httpd's get into the kernel's run queue, they finish in the
 same order they were put there, unless they block on a resource, get
 timesliced or are pre-empted by a higher priority process.

 > Thinking about it, assuming you are, at some time, servicing requests
 > _below_ system capacity, SpeedyCGI will always win in memory usage, and
 > probably have an edge in handling response time. My concern would be, does
 > it offer _enough_ of an edge? Especially bearing in mind, if I understand,
 > you could end runing anywhere up 2x as many processes (n Apache handlers + n
 > script handlers)?

 Try it and see.  I'm sure you'll run more processes with speedycgi, but
 you'll probably run a whole lot fewer perl interpreters and need less ram.
 
 Remember that the httpd's in the speedycgi case will have very little
 un-shared memory, because they don't have perl interpreters in them.
 So the processes are fairly indistinguishable, and the LRU isn't as 
 big a penalty in that case.

 This is why the original designers of Apache thought it was safe to
 create so many httpd's.  If they all have the same (shared) memory,
 then creating a lot of them does not have much of a penalty.  mod_perl
 applications throw a big monkey wrench into this design when they add
 a lot of unshared memory to the httpd's.

 > > No, homogeneity (or the lack of it) wouldn't make a 
 > > difference.  Those 3rd,
 > > 5th or 6th processes run only *after* the 1st and 2nd have 
 > > finished using
 > > the CPU.  And at that poiint you could re-use those 
 > > interpreters that 1 and 2
 > > were using.
 > 
 > This, if you'll excuse me, is quite clearly wrong. See the above argument,
 > and imagine that tasks 1 and 2 happen to take three times as long to
 > complete than 3, and you should see that that they could all end being in
 > the scheduling queue together. Perhaps you're considering tasks which are
 > too small to take more than 1 or 2 timeslices, in which case, you're much
 > less likely to want to accelerate them.

 So far, to keep things fairly simple, I've assumed each request takes less
 than one timeslice to run.  A timeslice is fairly long on a linux pc (210ms).

 But say they take two slices, and interpreters 1 and 2 get pre-empted and
 go back into the queue.  So then requests 5/6 in the queue have to use
 other interpreters, and you expand the number of interpreters in use.
 But still, you'll wind up using the smallest number of interpreters
 required for the given load and timeslice.  As soon as those 1st and
 2nd perl interpreters finish their run, they go back to the beginning
 of the queue, and the 7th/8th or later requests can then use them, etc.
 Now you have a pool of maybe four interpreters, all being used on an MRU
 basis.  But it won't expand beyond that set unless your load goes up or
 your program's CPU time requirements increase beyond another timeslice.
 MRU will ensure that whatever the number of interpreters in use, it
 is the lowest possible, given the load, the CPU-time required by the
 program and the size of the timeslice.
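
 A toy way to see that claim (my own sketch, not code from either package):
 if the requests really are serialized so that each interpreter is free
 again before the next request is handed out, the only question left is
 which idle interpreter gets picked - and that alone decides how many
 distinct interpreters (and how much un-shared memory) ever get touched:

#!/usr/bin/perl
# MRU picks the most recently freed interpreter (a stack); LRU picks the
# one that has been idle longest (a queue).
use strict;

my @mru = (1 .. 10);           # ten idle interpreters, as in the examples above
my @lru = (1 .. 10);
my (%touched_mru, %touched_lru);

for my $request (1 .. 100) {
    my $m = shift @mru;  $touched_mru{$m}++;  unshift @mru, $m;  # back on top
    my $l = shift @lru;  $touched_lru{$l}++;  push    @lru, $l;  # to the back
}
printf "MRU touched %d interpreter(s), LRU touched %d\n",
       scalar keys %touched_mru, scalar keys %touched_lru;       # prints 1 and 10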

===

To: speedycgi@newlug.org
From: Sam Horrocks <sam@daemoninc.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than
mod_perl withscripts that contain un-shared memory 
Date: Thu, 18 Jan 2001 21:30:28 -0800

 > Hello Sam and others
 > 
 > If I haven't overseen, nobody so far really mentioned fastcgi. I'm 
 > asking myself why you reinvented the wheel. I summarize the 
 > differences I see:
 >
 > + perl scripts are more similar to standard CGI ones than with 
 > FastCGI (downside: see next point)

 Agree.

 > - it seems you can't control the request loop yourself

 Yes, but what do you do with this control in fastcgi?   Maybe you can
 do the same thing in speedycgi in a different way?

 > + protocol is more free than the one of FastCGI (is it?)

 I'm not sure what you mean by "more free".

 > - protocol isn't widespread (almost standard) like the one of FastCGI

 Correct.  The speedycgi protocol has changed many times, and is documented
 only in the C files.

 > - seems only to support perl (so far)
 > - doesn't seem to support external servers (on other machines) like 
 > FastCGI (does it?)

 Correct.  Correct.

 I'll add the following plusses for speedycgi:

 + Starts up/shuts down perl processes automatically depending on the load.
   Users don't have to get involved at all in starting/stopping processes.
 + Assigns requests to processes on an MRU basis

 Don't know if these are also true now for fastcgi.

 > Question: does speedycgi run a separate interpreter for each script, 
 > or is there one process loading and calling several perl scripts?

 Currently one process == one script.  I'm almost done with a version
 that allows multiple scripts in one process.

 > If it's a separate process for each script, then mod_perl is sure to use
 > less memory.

 Depends on the number of scripts you have running at once.  And you
 have to factor in the whole LRU/shared-memory problem in mod_perl,
 which is where this thread originally started.

 > As far I understand, IF you can collect several scripts together into 
 > one interpreter and IF you do preforking, I don't see essential 
 > performance related differences between mod_perl and speedy/fastcgi 
 > if you set up mod_perl with the proxy approach.

 There is the way speedy assigns requests to handlers that is different.
 Plus speedy runs the perl processes outside the web-server.  And speedy
 has a CGI-only mode totally outside the webserver.
 
 > I think it's a pity that during the last years there was such little 
 > interest/support for fastcgi and now that should change with 
 > speedycgi. But why not, if the stuff that people develop can run on 
 > both and speedy is/becomes better than fastcgi.

 I think people should use whichever one is best for their application.
 I'm trying to make speedy as good as possible given the time I can
 put into it.  And I'm trying to communicate to people what it can do.
 Beyond that it's up to people to decide which one they want to use.

 As I've mentioned on the speedycgi list, I'd like to see some sort of
 persistent-perl API developed so that people could write to that API,
 then run their script under different persistent-perl environments
 without changes.  I think that would be better than porting scripts
 and modules to every persistent perl environment out there.

 > I'm developing a web application framework (called 'Eile', you can 
 > see some outdated documentation on testwww.ethz.ch/eile, I will 
 > release a new much better version soon) which currently uses fastcgi. 
 > If I can get it to run with speedycgi, I'll be glad to release it 
 > with support for both protocols. I haven't looked very close at it 
 > yet. One of the problems seems to be that I really depend on 
 > controlling the request loop (initialization, preforking etc all have 
 > to be done before the application begins serving requests, and I'm 
 > also controlling exits of childs myself). If you're interested to 
 > help me solving these issues please contact me privately. The main 
 > advantages of Eile concerning resources are a) one 
 > process/interpreter runs dozens of 'scripts' (called page-processing 
 > modules), and you don't have to dispatch requests to each of them 
 > yourself, and b) my new version does preforking.

 I think I can help - I'll send a private message.
===
To: <speedycgi@newlug.org>, "Sam Horrocks"
<sam@daemoninc.com>
From: "Les Mikesell" <lesmikesell@home.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Fri, 19 Jan 2001 00:13:56 -0600

"Sam Horrocks" <sam@daemoninc.com> wrote:

>  There's only one run queue in the kernel.  THe first task
>  ready to run is put at the head of that queue, and
>  anything arriving afterwards waits.  Only if that first
>  task blocks on a resource or takes a very long time, or a
>  higher priority process becomes able to run due to an
>  interrupt is that process taken out of the queue.

Note that any I/O request that isn't completely handled by buffers will
trigger the 'blocks on a resource' clause above, which means that
jobs doing any real work will complete in an order determined by
something other than the cpu and not strictly serialized.  Also, most
of my web servers are dual-cpu so even cpu bound processes may
complete out of order.

>  > Similarly, because of the non-deterministic nature of computer systems,
>  > Apache doesn't service requests on an LRU basis; you're comparing SpeedyCGI
>  > against a straw man. Apache's servicing algorithm approaches randomness, so
>  > you need to build a comparison between forced-MRU and random choice.
>
>  Apache httpd's are scheduled on an LRU basis.  This was discussed early
>  in this thread.  Apache uses a file-lock for its mutex around the accept
>  call, and file-locking is implemented in the kernel using a round-robin
>  (fair) selection in order to prevent starvation.  This results in
>  incoming requests being assigned to httpd's in an LRU fashion.

But, if you are running a front/back end apache with a small number
of spare servers configured on the back end there really won't be
any idle perl processes during the busy times you care about.  That
is, the  backends will all be running or apache will shut them down
and there won't be any difference between MRU and LRU (the
difference would be which idle process waits longer - if none are
idle there is no difference).

>  Once the httpd's get into the kernel's run queue, they finish in the
>  same order they were put there, unless they block on a resource, get
>  timesliced or are pre-empted by a higher priority process.

Which means they don't finish in the same order if (a) you have
more than one cpu, (b) they do any I/O (including delivering the
output back which they all do), or (c) some of them run long enough
to consume a timeslice.

>  Try it and see.  I'm sure you'll run more processes with speedycgi, but
>  you'll probably run a whole lot fewer perl interpreters and need less ram.

Do you have a benchmark that does some real work (at least a dbm
lookup) to compare against a front/back end mod_perl setup?

>  Remember that the httpd's in the speedycgi case will have very little
>  un-shared memory, because they don't have perl interpreters in them.
>  So the processes are fairly indistinguishable, and the LRU isn't as
>  big a penalty in that case.
>
>  This is why the original designers of Apache thought it was safe to
>  create so many httpd's.  If they all have the same (shared) memory,
>  then creating a lot of them does not have much of a penalty.  mod_perl
>  applications throw a big monkey wrench into this design when they add
>  a lot of unshared memory to the httpd's.

This is part of the reason the front/back end  mod_perl configuration
works well, keeping the backend numbers low.  The real win when serving
over the internet, though, is that the perl memory is no longer tied
up while delivering the output back over frequently slow connections.

===

To: Sam Horrocks <sam@daemoninc.com>
From: Perrin Harkins <perrin@primenet.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts 
Date: Fri, 19 Jan 2001 01:52:26 -0800

Sam Horrocks wrote:
>  say they take two slices, and interpreters 1 and 2 get pre-empted and
>  go back into the queue.  So then requests 5/6 in the queue have to use
>  other interpreters, and you expand the number of interpreters in use.
>  But still, you'll wind up using the smallest number of interpreters
>  required for the given load and timeslice.  As soon as those 1st and
>  2nd perl interpreters finish their run, they go back at the beginning
>  of the queue, and the 7th/ 8th or later requests can then use them, etc.
>  Now you have a pool of maybe four interpreters, all being used on an MRU
>  basis.  But it won't expand beyond that set unless your load goes up or
>  your program's CPU time requirements increase beyond another timeslice.
>  MRU will ensure that whatever the number of interpreters in use, it
>  is the lowest possible, given the load, the CPU-time required by the
>  program and the size of the timeslice.

You know, I had a brief look through some of the SpeedyCGI code yesterday,
and I think the MRU process selection might be a bit of a red herring. 
I think the real reason Speedy won the memory test is the way it spawns
processes.

If I understand what's going on in Apache's source, once every second it
has a look at the scoreboard and says "less than MinSpareServers are
idle, so I'll start more" or "more than MaxSpareServers are idle, so
I'll kill one".  It only kills one per second.  It starts by spawning
one, but the number spawned goes up exponentially each time it sees
there are still not enough idle servers, until it hits 32 per second. 
It's easy to see how this could result in spawning too many in response
to sudden load, and then taking a long time to clear out the unnecessary
ones.

In contrast, Speedy checks on every request to see if there are enough
backends running.  If there aren't, it spawns more until there are as
many backends as queued requests.  That means it never overshoots the
mark.
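
A rough simulation of the difference (just my reading of the two policies
described above, with made-up load numbers; not code from Apache or
SpeedyCGI):

#!/usr/bin/perl
# Time moves in 1-second steps; $need is how many busy interpreters the
# offered load requires during that second.
use strict;

my @need_per_sec = ((40) x 6, (2) x 4);     # hypothetical spike, then quiet

# Apache-style: once per second look at the scoreboard; if fewer than
# MinSpareServers are idle spawn 1, 2, 4 ... up to 32, if more than
# MaxSpareServers are idle kill just one.
my ($apache, $rate, $min_spare, $max_spare) = (10, 1, 8, 20);

# Speedy-style: spawn on demand until backends == queued requests, so the
# pool follows the high-water mark (idle-backend timeouts ignored here).
my $speedy = 0;

for my $need (@need_per_sec) {
    my $idle = $apache - $need;
    if    ($idle < $min_spare) { $apache += $rate; $rate = $rate >= 32 ? 32 : $rate * 2 }
    elsif ($idle > $max_spare) { $apache--; $rate = 1 }
    else                       { $rate = 1 }

    $speedy = $need if $need > $speedy;
    printf "need %2d   apache pool %3d   speedy backends %3d\n",
           $need, $apache, $speedy;
}
# With these numbers apache overshoots to 73 children and then sheds them one
# per second, while speedy never goes past the 40 that were actually needed.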

Going back to your example up above, if Apache actually controlled the
number of processes tightly enough to prevent building up idle servers,
it wouldn't really matter much how processes were selected.  If after
the 1st and 2nd interpreters finish their run they went to the end of
the queue instead of the beginning of it, that simply means they will
sit idle until called for instead of some other two processes sitting
idle until called for.  If the systems were both efficient enough about
spawning to only create as many interpreters as needed, none of them
would be sitting idle and memory usage would always be as low as
possible.

I don't know if I'm explaining this very well, but the gist of my theory
is that at any given time both systems will require an equal number of
in-use interpreters to do an equal amount of work, and the differentiator
between the two is Apache's relatively poor estimate of how many
processes should be available at any given time.  I think this theory
matches up nicely with the results of Sam's tests: when MaxClients
prevents Apache from spawning too many processes, both systems have
similar performance characteristics.

There are some knobs to twiddle in Apache's source if anyone is
interested in playing with it.  You can change the frequency of the
checks and the maximum number of servers spawned per check.  I don't
have much motivation to do this investigation myself, since I've already
tuned our MaxClients and process size constraints to prevent problems
with our application.

===

To: "Les Mikesell" <lesmikesell@home.com>
From: Sam Horrocks <sam@daemoninc.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory 
Date: Fri, 19 Jan 2001 03:47:21 -0800

 > >  There's only one run queue in the kernel.  The first task ready to run is put
 > >  at the head of that queue, and anything arriving afterwards waits.  Only
 > >  if that first task blocks on a resource or takes a very long time, or
 > >  a higher priority process becomes able to run due to an interrupt is that
 > >  process taken out of the queue.
 > 
 > Note that any I/O request that isn't completely handled by buffers will
 > trigger the 'blocks on a resource' clause above, which means that
 > jobs doing any real work will complete in an order determined by
 > something other than the cpu and not strictly serialized.  Also, most
 > of my web servers are dual-cpu so even cpu bound processes may
 > complete out of order.

 I think it's much easier to visualize how MRU helps when you look at one
 thing running at a time.  And MRU works best when every process runs
 to completion instead of blocking, etc.  But even if the process gets
 timesliced, blocked, etc, MRU still degrades gracefully.  You'll get
 more processes in use, but still the numbers will remain small.

 > >  > Similarly, because of the non-deterministic nature of computer systems,
 > >  > Apache doesn't service requests on an LRU basis; you're comparing SpeedyCGI
 > >  > against a straw man. Apache's servicing algorithm approaches randomness, so
 > >  > you need to build a comparison between forced-MRU and random choice.
 > >
 > >  Apache httpd's are scheduled on an LRU basis.  This was discussed early
 > >  in this thread.  Apache uses a file-lock for its mutex around the accept
 > >  call, and file-locking is implemented in the kernel using a round-robin
 > >  (fair) selection in order to prevent starvation.  This results in
 > >  incoming requests being assigned to httpd's in an LRU fashion.
 > 
 > But, if you are running a front/back end apache with a small number
 > of spare servers configured on the back end there really won't be
 > any idle perl processes during the busy times you care about.  That
 > is, the  backends will all be running or apache will shut them down
 > and there won't be any difference between MRU and LRU (the
 > difference would be which idle process waits longer - if none are
 > idle there is no difference).

 If you can tune it just right so you never run out of ram, then I think
 you could get the same performance as MRU on something like hello-world.

 > >  Once the httpd's get into the kernel's run queue, they finish in the
 > >  same order they were put there, unless they block on a resource, get
 > >  timesliced or are pre-empted by a higher priority process.
 > 
 > Which means they don't finish in the same order if (a) you have
 > more than one cpu, (b) they do any I/O (including delivering the
 > output back which they all do), or (c) some of them run long enough
 > to consume a timeslice.
 > 
 > >  Try it and see.  I'm sure you'll run more processes with speedycgi, but
 > >  you'll probably run a whole lot fewer perl interpreters and need less ram.
 > 
 > Do you have a benchmark that does some real work (at least a dbm
 > lookup) to compare against a front/back end mod_perl setup?

 No, but if you send me one, I'll run it.

===

To: "'Sam Horrocks'" <sam@daemoninc.com>, speedycgi@newlug.org
From: Stephen Anderson <Stephen.Anderson@energis-squared.com>
Subject: RE: Fwd: [speedycgi] Speedycgi scales better than
mod_perl with scripts that contain un-shared memory
Date: Fri, 19 Jan 2001 12:09:35 -0000

>  > This doesn't affect the argument, because the core of it is that:
>  > 
>  > a) the CPU will not completely process a single task all at once; instead,
>  > it will divide its time _between_ the tasks
>  > b) tasks do not arrive at regular intervals
>  > c) tasks take varying amounts of time to complete
>  > 
[snip]

>  I won't agree with (a) unless you qualify it further - what do you claim
>  is the method or policy for (a)?

I think this has been answered ... basically, resource conflicts (including
I/O), interrupts, long running tasks, higher priority tasks, and, of course,
the process yielding, can all cause the CPU to switch processes (which of
these qualify depends very much on the OS in question).

This is why, despite the efficiency of single-task running, you can usefully
run more than one process on a UNIX system. Otherwise, if you ran a single
Apache process and had no traffic, you couldn't run a shell at the same time
- Apache would consume practically all your CPU in its select() loop 8-)

>  Apache httpd's are scheduled on an LRU basis.  This was discussed early
>  in this thread.  Apache uses a file-lock for its mutex around the accept
>  call, and file-locking is implemented in the kernel using a round-robin
>  (fair) selection in order to prevent starvation.  This results in
>  incoming requests being assigned to httpd's in an LRU fashion.

I'll apologise, and say, yes, of course you're right, but I do have a query:

There are (IIRC) 5 methods that Apache uses to serialize requests:
fcntl(), flock(), Sys V semaphores, uslock (IRIX only) and Pthreads
(reliably only on Solaris). Do they _all_ result in LRU?
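
For concreteness, the flock()-style case looks roughly like the toy
pre-forking accept loop below (just an illustration of the mechanism, not
Apache's actual source; the port and lock file path are made up). Each child
takes an exclusive lock, accepts one connection, releases the lock, and the
question is which waiting child the kernel hands the lock to next:

    use strict;
    use Fcntl qw(:flock);
    use IO::Socket::INET;

    my $listen = IO::Socket::INET->new(LocalPort => 8428, Listen => 128,
                                       ReuseAddr => 1) or die "listen: $!";

    for my $n (1 .. 5) {                   # pre-fork a few children
        next if fork;                      # parent keeps forking
        open my $mutex, '>', '/tmp/accept.lock' or die "lock: $!";
        while (1) {
            flock($mutex, LOCK_EX) or die "flock: $!";   # serialize accept()
            my $conn = $listen->accept;
            flock($mutex, LOCK_UN);
            next unless $conn;
            print $conn "HTTP/1.0 200 OK\r\n\r\nhandled by child $$\r\n";
            close $conn;
        }
    }
    sleep;                                 # parent just sits here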

>  Remember that the httpd's in the speedycgi case will have very little
>  un-shared memory, because they don't have perl interpreters in them.
>  So the processes are fairly indistinguishable, and the LRU isn't as 
>  big a penalty in that case.


Yessss...._but_, interpreter for interpreter, won't the equivalent speedycgi
have roughly as much unshared memory as the mod_perl? I've had a lot of
(dumb) discussions with people who complain about the size of
Apache+mod_perl without realising that the interpreter code's all shared,
and that, with pre-loading, a lot of the perl code can be too. While I _can_ see
speedycgi having an advantage (because it's got a much better overview of
what's happening, and can intelligently manage the situation), I don't think
it's as large as you're suggesting. I think this needs to be intensively
benchmarked to answer that....

>  other interpreters, and you expand the number of interpreters in use.
>  But still, you'll wind up using the smallest number of interpreters
>  required for the given load and timeslice.  As soon as those 1st and
>  2nd perl interpreters finish their run, they go back at the beginning
>  of the queue, and the 7th/8th or later requests can then use them, etc.
>  Now you have a pool of maybe four interpreters, all being used on an MRU
>  basis.  But it won't expand beyond that set unless your load goes up or
>  your program's CPU time requirements increase beyond another timeslice.
>  MRU will ensure that whatever the number of interpreters in use, it
>  is the lowest possible, given the load, the CPU-time required by the
>  program and the size of the timeslice.

Yep...no arguments here. SpeedyCGI should result in fewer interpreters.
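
A trivial way to see that effect in isolation (purely an illustration, not
mod_perl or SpeedyCGI code) is to simulate a fixed pool of ten workers fed
strictly serial requests and count how many distinct workers each selection
policy ends up touching:

    use strict;

    sub run_pool {
        my ($policy, $requests) = @_;
        my @idle = (1 .. 10);                # ten pre-spawned workers
        my %touched;
        for (1 .. $requests) {
            my $w = $policy eq 'MRU' ? pop @idle : shift @idle;
            $touched{$w}++;                  # this worker's memory is now "hot"
            push @idle, $w;                  # finishes before the next arrives
        }
        return scalar keys %touched;
    }

    printf "MRU touched %d of 10 workers, LRU touched %d of 10\n",
        run_pool('MRU', 1000), run_pool('LRU', 1000);

MRU keeps coming back to one worker while LRU cycles through all ten, and
with overlapping requests the MRU figure grows only to roughly the
concurrency level - the "pool of maybe four interpreters" described above.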


I will say that there are a lot of convincing reasons to follow the
SpeedyCGI model rather than the mod_perl model, but I've generally thought
that the increase in that kind of performance that can be obtained is
sufficiently minimal as to not warrant the extra layer... thoughts, anyone?

Stephen.

===

To: <modperl@apache.org>
From: Matt Sergeant <matt@sergeant.org>
Subject: RE: Fwd: [speedycgi] Speedycgi scales better than
mod_perl with scripts that contain un-shared memory
Date: Fri, 19 Jan 2001 12:14:45 +0000 (GMT)

There seems to be a lot of talk here, and analogies, and zero real-world
benchmarking.

Now it seems to me from reading this thread, that speedycgi would be
better where you run 1 script, or only a few scripts, and mod_perl might
win where you have a large application with hundreds of different URLs
with different code being executed on each. That may change with the next
release of speedy, but then lots of things will change with the next major
release of mod_perl too, so it's irrelevant until both are released.

And as well as that, speedy still suffers (IMHO) in that it still follows the
CGI scripting model, whereas mod_perl offers a much more flexible
environment and a feature-rich API (the Apache API). What's more, I could
never build something like AxKit in speedycgi, without resorting to hacks
like mod_rewrite to hide nasty URLs. At least that's my conclusion from
first appearances.

Either way, both solutions have their merits. Neither is going to totally
replace the other.

What I'd really like to do though is sum up this thread in a short article
for take23. I'll see if I have time on Sunday to do it.

===

To: perrin@primenet.com
From: Sam Horrocks <sam@daemoninc.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than
mod_perl with scripts that contain un-shared memory
Date: Fri, 19 Jan 2001 04:53:04 -0800

 > You know, I had a brief look through some of the SpeedyCGI code yesterday,
 > and I think the MRU process selection might be a bit of a red herring. 
 > I think the real reason Speedy won the memory test is the way it spawns
 > processes.

 Please take a look at that code again.  There's no smoke and mirrors,
 no red-herrings.  Also, I don't look at the benchmarks as "winning" - I
 am not trying to start a mod_perl vs speedy battle here.  Gunther wanted
 to know if there were "real benchmarks", so I reluctantly put them up.

 Here's how SpeedyCGI works (this is from version 2.02 of the code):

    When the frontend starts, it tries to quickly grab a backend from
    the front of the be_wait queue, which is a LIFO.  This is in
    speedy_frontend.c, get_a_backend() function.

    If there aren't any idle be's, it puts itself onto the fe_wait queue.
    Same file, get_a_backend_hard().
    
    If this fe (frontend) is at the front of the fe_wait queue, it
    "takes charge" and starts looking to see if a backend needs to be
    spawned.  This is part of the "frontend_ping()" function.  It will
    only spawn a be if no other backends are being spawned, so only
    one backend gets spawned at a time.

    Every frontend in the queue drops into a sigsuspend and waits for an
    alarm signal.  The alarm is set for 1-second.  This is also in
    get_a_backend_hard().

    When a backend is ready to handle code, it goes and looks at the fe_wait
    queue and if there are fe's there, it sends a SIGALRM to the one at
    the front, and sets the sent_sig flag for that fe.  This is done in
    speedy_group.c, speedy_group_sendsigs().

    When a frontend wakes on an alarm (either due to a timeout, or due to
    a be waking it up), it looks at its sent_sig flag to see if it can now
    grab a be from the queue.  If so it does that.  If not, it runs various
    checks then goes back to sleep.

 In most cases, you should get a be from the lifo right at the beginning
 in the get_a_backend() function.  Unless there aren't enough be's running,
 or something is killing them (bad perl code), or you've set the
 MaxBackends option to limit the number of be's.
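
 As a rough Perl-flavoured sketch of just the fast path and the lazy
 spawning (the fe_wait/SIGALRM hand-off is collapsed away, and the variable
 names simply mirror the description above - the real logic is the C in
 speedy_frontend.c):

     use strict;

     my @be_wait;               # idle backends, used as a LIFO
     my $max_backends = 4;      # stand-in for the MaxBackends option
     my $spawned = 0;

     sub get_a_backend {
         return pop @be_wait if @be_wait;   # fast path: most recently freed be
         return undef if $spawned >= $max_backends;
         return ++$spawned;                 # slow path: spawn one be at a time
     }

     for my $req (1 .. 8) {
         my $be = get_a_backend();
         printf "request %d -> backend %d (%d spawned so far)\n",
             $req, $be, $spawned;
         push @be_wait, $be;                # request done; be back on the LIFO
     }

 Run serially like this, it spawns a single backend and then re-uses it for
 every request, which is the LIFO behaviour described above.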


 > If I understand what's going on in Apache's source, once every second it
 > has a look at the scoreboard and says "less than MinSpareServers are
 > idle, so I'll start more" or "more than MaxSpareServers are idle, so
 > I'll kill one".  It only kills one per second.  It starts by spawning
 > one, but the number spawned goes up exponentially each time it sees
 > there are still not enough idle servers, until it hits 32 per second. 
 > It's easy to see how this could result in spawning too many in response
 > to sudden load, and then taking a long time to clear out the unnecessary
 > ones.
 > 
 > In contrast, Speedy checks on every request to see if there are enough
 > backends running.  If there aren't, it spawns more until there are as
 > many backends as queued requests.
 
 Speedy does not check on every request to see if there are enough
 backends running.  In most cases, the only thing the frontend does is
 grab an idle backend from the lifo.  Only if there are none available
 does it start to worry about how many are running, etc.

 > That means it never overshoots the mark.

 You're correct that speedy does try not to overshoot, but mainly
 because there's no point in overshooting - it just wastes swap space.
 But that's not the heart of the mechanism.  There truly is a LIFO
 involved.  Please read that code again, or run some tests.  Speedy
 could overshoot by far, and the worst that would happen is that you
 would get a lot of idle backends sitting in virtual memory, which the
 kernel would page out, and then at some point they'll time out and die.
 Unless of course the load increases to a point where they're needed,
 in which case they would get used.

 If you have speedy installed, you can manually start backends yourself
 and test.  Just run "speedy_backend script.pl &" to start a backend.
 If you start lots of those on a script that says 'print "$$\n"', then
 run the frontend on the same script, you will still see the same pid
 over and over.  This is the LIFO in action, reusing the same process
 over and over.
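
 Spelled out, assuming the frontend ends up installed as the "speedy"
 binary (adjust for your own install):

     # script.pl - the one-liner from above
     print "$$\n";

     # then from a shell:
     #   speedy_backend script.pl &   start an idle backend by hand
     #   speedy_backend script.pl &   start another
     #   speedy script.pl             prints the same pid every run, because
     #   speedy script.pl             the LIFO hands back the most recently
     #                                used backend each time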

===

To: Sam Horrocks <sam@daemoninc.com>
From: Perrin Harkins <perrin@primenet.com>
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than
mod_perl with scripts that contain un-shared memory
Date: Fri, 19 Jan 2001 16:00:52 -0800 (PST)

On Fri, 19 Jan 2001, Sam Horrocks wrote:
>  > You know, I had a brief look through some of the SpeedyCGI code yesterday,
>  > and I think the MRU process selection might be a bit of a red herring. 
>  > I think the real reason Speedy won the memory test is the way it spawns
>  > processes.
> 
>  Please take a look at that code again.  There's no smoke and mirrors,
>  no red-herrings.

I didn't mean that MRU isn't really happening, just that it isn't the
reason why Speedy is running fewer interpreters.

>  Also, I don't look at the benchmarks as "winning" - I
>  am not trying to start a mod_perl vs speedy battle here.

Okay, but let's not be so polite that we fail to acknowledge when someone
is onto a better way of doing things.  Stealing good ideas
from other projects is a time-honored open source tradition.

>  Speedy does not check on every request to see if there are enough
>  backends running.  In most cases, the only thing the frontend does is
>  grab an idle backend from the lifo.  Only if there are none available
>  does it start to worry about how many are running, etc.

Sorry, I got a lot of the details about what Speedy is doing wrong.
However, it still sounds like it has a more efficient approach than
Apache in terms of managing process spawning.

>  You're correct that speedy does try not to overshoot, but mainly
>  because there's no point in overshooting - it just wastes swap space.
>  But that's not the heart of the mechanism.  There truly is a LIFO
>  involved.  Please read that code again, or run some tests.  Speedy
>  could overshoot by far, and the worst that would happen is that you
>  would get a lot of idle backends sitting in virtual memory, which the
>  kernel would page out, and then at some point they'll time out and die.

When you spawn a new process it starts out in real memory, doesn't
it?  Spawning too many could use up all the physical RAM and send a box
into swap, at least until it managed to page out the idle
processes.  That's what I think happened to mod_perl in this test.

>  If you start lots of those on a script that says 'print "$$\n"', then
>  run the frontend on the same script, you will still see the same pid
>  over and over.  This is the LIFO in action, reusing the same process
>  over and over.

Right, but I don't think that explains why fewer processes are running.  
Suppose you start 10 processes, and then send in one request at a time,
and that request takes one time slice to complete.  If MRU works
perfectly, you'll get process 1 over and over again handling the requests.  
LRU will use process 1, then 2, then 3, etc.  But both of them have 9
processes idle and one in use at any given time.  The 9 idle ones should
either be killed off, or ideally never have been spawned in the first
place.  I think Speedy does a better job of preventing unnecessary process
spawning.

One alternative theory is that keeping the same process busy instead of
rotating through all 10 means that the OS can page out the other 9 and
thus use less physical RAM.

Anyway, I feel like we've been putting you on the spot, and I don't want
you to feel obligated to respond personally to all the messages on this
thread.  I'm only still talking about it because it's interesting and I've
learned a couple of things about Linux and Apache from it.  If I get the
chance this weekend, I'll try some tests of my own.

- Perrin

===
