modperl_heavy_server_load_probs

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



To: modperl@apache.org
From: Justin <jb@dslreports.com>
Subject: the edge of chaos
Date: Wed, 3 Jan 2001 22:25:04 -0500

Hi, and happy new year!

My modperl/mysql setup does not degrade gracefully when it reaches
and pushes past maximum pages per second :-) if you could plot
throughput, it rises to a ceiling, then collapses to half or less,
then slowly recovers .. rinse and repeat .. during the collapses,
nobody but really patient people gets anything .. most page
production is wasted: it goes from modperl-->modproxy-->/dev/null

I know exactly why .. it is because of a long virtual
"request queue" enabled by the front end .. people "leave the
queue" but their requests do not .. pressing STOP on the browser
does not seem to signal mod_proxy to cancel its pending request,
or modperl to cancel its work, if it has started .. (in fact, if
things get really bad, you can even find much of your backend stuck
in an "R" state, waiting for the Apache Timeout variable
to tick down to zero..)
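
(The only hook I know of on the mod_perl side is the connection's
aborted flag - a sketch follows, with a hypothetical page generator -
but the flag only trips after a write to the socket fails, and with
mod_proxy in the middle that write succeeds anyway:)

  # content handler sketch: bail out early if the client hit STOP
  # (httpd.conf: SetHandler perl-script / PerlHandler My::Page)
  package My::Page;
  use strict;
  use Apache::Constants qw(OK);

  sub handler {
      my $r = shift;
      $r->send_http_header('text/html');
      foreach my $chunk (build_page_chunks($r)) {  # hypothetical generator
          $r->print($chunk);
          # aborted only becomes true after a failed write to the
          # client socket; a front-end proxy keeps reading, so in
          # the fe/be setup this rarely fires
          return OK if $r->connection->aborted;
      }
      return OK;
  }
  1;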

Any thoughts on solving this? Am I wrong in wishing that STOP
would function through all the layers?

thanks
-Justin
===
To: Justin <jb@dslreports.com>
From: Jeff Sheffield <jsheffie@buzzard.kdi.com>
Subject: Re: the edge of chaos
Date: Wed, 3 Jan 2001 23:57:17 -0600

this is not the solution...
but it could be a band-aid until you find one:
set the MaxClients number lower.

# Limit on total number of servers running, i.e., limit on the number
# of clients who can simultaneously connect --- if this limit is ever
# reached, clients will be LOCKED OUT, so it should NOT BE SET TOO LOW.
# It is intended mainly as a brake to keep a runaway server from taking
# the system with it as it spirals down...
#
MaxClients 150

===
To: jeff@jspot.org
From: Justin <jb@dslreports.com>
Subject: Re: the edge of chaos
Date: Thu, 4 Jan 2001 02:39:10 -0500

Yep, I am familiar with MaxClients .. there are two backend servers
of 10 modperl processes each (MaxClients=StartServers=10). That's sized
about right. They can all pump away at the same time, doing about
20 pages per second. The problem comes when they are asked to do
21 pages per second :-)

There is one frontend mod_proxy .. it currently has MaxClients set
to 120 processes (it doesn't serve images) .. the actual number
in use near peak output varies from 60 to 100, depending on the
mix of clients using the system. KeepAlive is *off* on that
(again, since it doesn't serve images).

When things get slow on the back end, the front end can fill with
120 *requests* .. all queued for the 20 available modperl slots ..
(at about 20 pages per second, draining a 120-deep queue takes some
six seconds before a new request even gets started) .. hence long
queues for service, which means nobody gets anything, which means a
dead site. I don't mind performance limits; I just don't like the
idea that pushing beyond 100% (which can even happen when one of the
evil site hoovers hits you) results in site death.

So dropping MaxClients on the front end means you get clogged
up with slow readers instead, so that isn't an option..

-Justin

===
To: Justin <jb@dslreports.com>
From: "G.W. Haywood" <ged@www.jubileegroup.co.uk>
Subject: Re: the edge of chaos
Date: Thu, 4 Jan 2001 15:34:42 +0000 (GMT)

Hi there,

On Thu, 4 Jan 2001, Justin wrote:

> So dropping MaxClients on the front end means you get clogged
> up with slow readers instead, so that isn't an option..

Try looking for Randall's posts in the last couple of weeks.  He has
some nice stuff you might want to have a play with.  Sorry, I can't
remember the thread but if you look in Geoff's DIGEST you'll find it.

Thanks again Geoff!

73,
Ged.

===
To: "'G.W. Haywood'" <ged@www.jubileegroup.co.uk>,
From: Geoffrey Young <gyoung@laserlink.net>
Subject: RE: the edge of chaos
Date: Thu, 4 Jan 2001 11:06:35 -0500 

> -----Original Message-----
> From: G.W. Haywood [mailto:ged@www.jubileegroup.co.uk]
> Sent: Thursday, January 04, 2001 10:35 AM
> To: Justin
> Cc: modperl@apache.org
> Subject: Re: the edge of chaos
> 
> 
> Hi there,
> 
> On Thu, 4 Jan 2001, Justin wrote:
> 
> > So dropping MaxClients on the front end means you get clogged
> > up with slow readers instead, so that isn't an option..
> 
> Try looking for Randall's posts in the last couple of weeks.  He has
> some nice stuff you might want to have a play with.  Sorry, I can't
> remember the thread but if you look in Geoff's DIGEST you'll find it.

I think you mean this:
http://forum.swarthmore.edu/epigone/modperl/phoorimpjun

and this thread:
http://forum.swarthmore.edu/epigone/modperl/zhayflimthu

(which is actually a response to Justin :)

> 
> Thanks again Geoff!

glad to be of service :)

===
To: modperl@apache.org
From: Vivek Khera <khera@kciLink.com>
Subject: Re: the edge of chaos
Date: Thu, 4 Jan 2001 11:10:25 -0500

>>>>> "J" == Justin  <jb@dslreports.com> writes:

J> When things get slow on the back end, the front end can fill with
J> 120 *requests* .. all queued for the 20 available modperl slots ..
J> hence long queues for service, which means nobody gets anything,

You simply don't have enough horsepower to serve your load, then.

Your options are: get more RAM, get a faster CPU, make your application
smaller by sharing more code (pretty much whatever else is in the
tuning docs), or split your load across multiple machines.

If your front ends are doing nothing but buffering the pages for the
mod_perl backends, then you probably need to lower the ratio of
frontends to back ends from your 6 to 1 to something like 3 to 1.

===
To: Geoffrey Young <gyoung@laserlink.net>
From: Justin <jb@dslreports.com>
Subject: Re: the edge of chaos
Date: Thu, 4 Jan 2001 17:55:54 -0500

Hi,
Thanks for the links! But I wasn't sure what in the first link
was useful for this problem, and the vacuum-bots discussion
is really a different topic.
I'm not talking about vacuum-bot load. This is real-world load.

Practical experiments (ok - the live site :) convinced me that
the well-recommended modperl setup of fe/be suffers from failure
and much wasted page production when load rises just a little
above *maximum sustainable throughput* ..

If you want to see what happens to actual output when this
happens, check this gif:
   http://www.dslreports.com/front/eth0-day.gif
From 11am to 4pm (in the jagged middle section delineated by
the red bars) I was madly doing SQL server optimizations to
get my head above water .. just before 11am, response time
was sub-second. (That whole day represents about a million
pages.) Minutes after 11am, response time rose fast to 10-20 seconds,
and few people would wait that long; they just hit stop ..
(which doesn't provide my server any relief from their request).

By 4pm I'd got the SQL server able to cope with the current load,
and everything was fine after that..

This is all moot if you never plan to get anywhere near max
throughput .. nevertheless .. as a business, if incoming load
does rise (hopefully because of press), I'd rather lose 20% of
visitors to a "sluggish" site than lose 100% of visitors
because the site is all but dead..

I received a helpful recommendation to look into "lingerd" ...
that would seem to be one approach to solving this issue .. but a
lingerd setup is quite different from the popular recommendations.

===
To: modperl@apache.org
From: Justin <jb@dslreports.com>
Subject: Re: the edge of chaos
Date: Thu, 4 Jan 2001 18:07:31 -0500

I need more horsepower. Yes, I'd agree with that!

However... which web solution would you prefer:

A. (ideal)
load equals horsepower:
  all requests serviced in <=250ms
load slightly more than horsepower:
  linear falloff in response time, as a function of % overload

..or..

B. (modperl+front end)
load equals horsepower:
  all requests serviced in <=250ms
sustained load *slightly* more than horsepower:
  site too slow to be usable by anyone, few seeing pages

Don't all benchmarks (of disks, webservers, and so on)
keep increasing load well past optimal levels,
to check that there are no nasty surprises out there?

===

To: Justin <jb@dslreports.com>, modperl@apache.org
From: ___cliff rayman___ <cliff@genwax.com>
Subject: Re: the edge of chaos
Date: Thu, 04 Jan 2001 15:28:22 -0800

i see 2 things here: a classic queuing problem, and the fact
that swapping to disk is thousands of times slower than serving
from ram.

if you receive 100 requests per second but only have the
ram to serve 99, then swapping to disk occurs, which slows
down the entire system.  the next second comes and 100 new
requests come in, plus the 1 you had in the queue that did not
get serviced in the previous second.  after a little while,
your memory requirements start to soar, lots of swapping is
occurring, and requests are coming in at a higher rate than can
be serviced by an ever-slowing machine.  this leads to a rapid
downward spiral.  you must have enough ram to service all the apache
processes that are allowed to run at one time.  it's been my experience
that once swapping starts to occur, the whole thing is going to spiral
downward very quickly.  you either need to add more ram, to service
the number of apache processes that need to be running simultaneously,
or you need to reduce MaxClients and let apache turn away requests.
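
as a back-of-envelope (every number below is a made-up assumption;
measure your own per-child unshared rss with top or ps):

  #!/usr/bin/perl
  # rough MaxClients sizing: never let apache outgrow physical ram
  my $total_ram_mb = 512;   # physical ram in the box
  my $reserved_mb  = 128;   # os, mysql, buffer cache, everything else
  my $per_child_mb = 12;    # unshared rss of one mod_perl child
  my $max_clients  = int(($total_ram_mb - $reserved_mb) / $per_child_mb);
  print "MaxClients $max_clients\n";   # prints: MaxClients 32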

P.S. used your service several times with good results! (and
no waiting) thanks!

===

To: Justin <jb@dslreports.com>
From: Perrin Harkins <perrin@primenet.com>
Subject: Re: the edge of chaos
Date: Thu, 04 Jan 2001 15:38:21 -0800

Justin wrote:
> Thanks for the links! But I wasn't sure what in the first link
> was useful for this problem, and the vacuum-bots discussion
> is really a different topic.
> I'm not talking about vacuum-bot load. This is real-world load.
> 
> Practical experiments (ok - the live site :) convinced me that
> the well-recommended modperl setup of fe/be suffers from failure
> and much wasted page production when load rises just a little
> above *maximum sustainable throughput* ..

The fact that mod_proxy doesn't disconnect from the backend server when
the client goes away is definitely a problem.  I remember some
discussion about this before but I don't think there was a solution for
it.

I think Vivek was correct in pointing out that your ultimate problem is
the fact that your system is not big enough for the load you're
getting.  If you can't upgrade your system to safely handle the load,
one approach is to send some people away when the server gets too busy
and provide decent service to the ones you do allow through.  You can
try lowering MaxClients on the proxy to help with this.  Then any
requests going over that limit will get queued by the OS and you'll
never see them if the person on the other end gets tired of waiting and
cancels.  It's tricky though, because you don't want a bunch of slow
clients to tie up all of your proxy processes.

It's easy to adapt the existing mod_perl throttling handlers to send a
short static "too busy" page when there are more than a certain number
of concurrent requests on the site.  It's better to do this on the
proxy side, though, so maybe mod_throttle could do it for you.
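
A minimal sketch of the mod_perl side (the package name and threshold
are made up, and it keys on the system load average rather than a true
concurrent-request count, which is harder to get at from a handler):

  package My::Throttle;
  use strict;
  use Apache::Constants qw(DECLINED);

  # httpd.conf: PerlAccessHandler My::Throttle
  sub handler {
      my $r = shift;
      # 1-minute load average; Linux-specific, fail open if unreadable
      open LOAD, "/proc/loadavg" or return DECLINED;
      my $line = <LOAD>;
      close LOAD;
      my ($load) = split ' ', $line;
      return DECLINED if $load < 8;   # threshold is an assumption - tune it
      # short static page instead of letting the request join the queue
      $r->custom_response(503, "Too busy - please try again in a minute.");
      return 503;                     # 503 Service Unavailable
  }
  1;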

- Perrin
===
To: <perrin@primenet.com>, "Justin" <jb@dslreports.com>
From: "Ed Park" <epark@athenahealth.com>
Subject: RE: the edge of chaos
Date: Thu, 4 Jan 2001 19:48:43 -0500

A few thoughts:

In analyzing a few spikes on our site over the last few days, a clear
pattern has emerged: the database spikes, and the database spikes induce
a corresponding spike on the mod_perl server about 2-6 minutes later
(because mod_perl requests start queuing up). This is exacerbated by the
fact that as the site slows down, folks start double- and triple-clicking
on links and buttons, which of course just makes things much worse.

This has a few ramifications. If your pages are not homogeneous in database
usage (i.e., some pages are much heavier than others), then throttling by
number of connections or throttling based on webserver load doesn't help
that much. You need to throttle based on database server load. This requires
some sort of mechanism whereby the webserver can sample the load on the
database server and throttle accordingly. Currently, we just mount a common
NFS fileserver, sample every minute, and restart the webserver if db load is
too high, which works OK.
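
A sketch of such a sampler, run from cron every minute (the DSN,
credentials, paths, and the choice of MySQL's Threads_running as the
load metric are all assumptions):

  #!/usr/bin/perl
  # cron: * * * * * /usr/local/bin/sample-db-load
  # writes the current db load to a file on the shared NFS mount
  use strict;
  use DBI;

  my $dbh = DBI->connect('dbi:mysql:database=test;host=dbhost',
                         'monitor', 'secret', { RaiseError => 1 });
  # Threads_running is one crude proxy for "how busy is the db"
  my (undef, $running) =
      $dbh->selectrow_array(q{SHOW STATUS LIKE 'Threads_running'});
  $dbh->disconnect;

  open OUT, "> /mnt/shared/db_load.tmp" or die "write: $!";
  print OUT "$running\n";
  close OUT;
  # a rename within one directory is atomic, so readers never
  # see a half-written file
  rename "/mnt/shared/db_load.tmp", "/mnt/shared/db_load"
      or die "rename: $!";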

The best course of action, though, is to tune your database, homogenize your
pages, and buy a bigger box, which we're doing.

-Ed

===
To: Ed Park <epark@athenahealth.com>
From: Perrin Harkins <perrin@primenet.com>
Subject: Re: the edge of chaos
Date: Thu, 04 Jan 2001 19:08:33 -0800

Ed Park wrote:
> If your pages are not homogeneous in database
> usage (i.e., some pages are much heavier than others), then throttling by
> number of connections or throttling based on webserver load doesn't help
> that much. You need to throttle based on database server load. This requires
> some sort of mechanism whereby the webserver can sample the load on the
> database server and throttle accordingly. Currently, we just mount a common
> NFS fileserver, sample every minute, and restart the webserver if db load is
> too high, which works OK.

You could also just use an access handler that turns new requests away
with a simple "too busy" page when the database is hosed, rather than
actually restarting the server.
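
For instance, keyed on the file a minutely sampler writes to the shared
mount (the path, threshold, and package name are made up):

  package My::DBGate;
  use strict;
  use Apache::Constants qw(DECLINED);

  # httpd.conf: PerlAccessHandler My::DBGate
  sub handler {
      my $r = shift;
      # read the value the sampler wrote; fail open if it's missing
      open LOAD, "/mnt/shared/db_load" or return DECLINED;
      chomp(my $running = <LOAD>);
      close LOAD;
      return DECLINED if $running < 40;   # threshold is an assumption
      $r->custom_response(503, "Database busy - please try again shortly.");
      return 503;
  }
  1;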

- Perrin
===
To: "Justin" <jb@dslreports.com>, "Geoffrey Young"
<gyoung@laserlink.net>
From: "Les Mikesell" <lesmikesell@home.com>
Subject: Re: the edge of chaos
Date: Thu, 4 Jan 2001 23:37:57 -0600
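
("Bump up the listen queue" is the ListenBacklog directive in Apache
1.3; the numbers below are made-up examples, and the kernel may
silently cap the backlog at SOMAXCONN:)

  # front-end httpd.conf sketch: fewer processes, deeper kernel queue
  MaxClients    40
  ListenBacklog 1024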

===
To: modperl@apache.org
From: siberian <siberian@siberian.org>
Subject: Re: the edge of chaos
Date: Thu, 4 Jan 2001 21:41:14 -0800 (PST)

On Thu, 4 Jan 2001, Les Mikesell wrote:

> 
> ----- Original Message ----- 
> From: "Justin" <jb@dslreports.com>
> To: "Geoffrey Young" <gyoung@laserlink.net>
> Cc: <modperl@apache.org>
> Sent: Thursday, January 04, 2001 4:55 PM
> Subject: Re: the edge of chaos
> 
> 
> > 
> > Practical experiments (ok - the live site :) convinced me that
> > the well-recommended modperl setup of fe/be suffers from failure
> > and much wasted page production when load rises just a little
> > above *maximum sustainable throughput* ..
> 
> It doesn't take much math to realize that if you continue to try to
> accept connections faster than you can service them, the machine
> is going to die, and as soon as you load the machine to the point
> that you are swapping/paging memory to disk, the time to service
> a request will skyrocket.  Tune down MaxClients on both the
> front and back end httpd's to what the machine can actually
> handle, and bump up the listen queue if you want to try to let
> the requests connect and wait for a process to handle them.  If
> you aren't happy with the speed the machine can realistically
> produce, get another one (or more) and let the front end proxy
> to the other(s) running the backends.
> 


On this thread, here is a question.

Given a scenario with two machines serving web requests:

Is it better to have one machine doing proxy and one machine doing
mod_perl, or is it generally better to have each machine running both
a proxy and a mod_perl?

Lame question - I've never benchmarked the differences; just curious
what some of you think.

John


From: "Justin" <jb@dslreports.com>
To: "Geoffrey Young" <gyoung@laserlink.net>
Cc: <modperl@apache.org>
Sent: Thursday, January 04, 2001 4:55 PM
Subject: Re: the edge of chaos


> 
> Practical experiments (ok - the live site :) convinced me that 
> the well recommended modperl setup of fe/be suffer from failure
> and much wasted page production when load rises just a little
> above *maximum sustainable throughput* ..

It doesn't take much math to realize that if you continue to try to
accept connections faster than you can service them, the machine
is going to die, and as soon as you load the machine to the point
that you are swapping/paging memory to disk the time to service
a request will skyrocket.   Tune down MaxClients on both the
front and back end httpd's to what the machine can actually
handle and bump up the listen queue if you want to try to let
the requests connect and wait for a process to handle them.  If
you aren't happy with the speed the machine can realistically
produce, get another one (or more) and let the front end proxy
to the other(s) running the backends.

     Les Mikesell
         lesmikesell@home.com



===
To: "siberian" <siberian@siberian.org>, <modperl@apache.org>
From: "Les Mikesell" <lesmikesell@home.com>
Subject: Re: the edge of chaos
Date: Fri, 5 Jan 2001 08:07:47 -0600

"siberian" <siberian@siberian.org> wrote:

> On this thread, here is a question.
> 
> Given a scenario with two machines serving web requests:
> 
> Is it better to have one machine doing proxy and one machine doing
> mod_perl, or is it generally better to have each machine running both
> a proxy and a mod_perl?
> 
> Lame question - I've never benchmarked the differences; just curious
> what some of you think.

I haven't done any real testing either, although I now have a site spread
over 4 different machines.  My approach was to start by moving
the slowest mod_perl programs to a different backend-only box,
then moving more to somewhat balance the load, then moving
the SQL server off to a different box.  The 4th box is actually the
front end for several smaller sites plus the backend for a few
jobs from the main site.  The machines all have copies of all
the software, so a slight reconfiguration could allow running
without any one of them at slightly less capacity.  Right now
everything is controlled by mod_rewrite on the front-end box,
but the next update is going to be to put a hardware load balancer
in front of them.  Without hard numbers, I would guess that
running the front/back ends on the same box is slightly faster at
lower loads, but completely splitting the front/back ends would
handle a slightly higher load before melting down, due to
a little better memory sharing and disk cache handling when
each box only runs one version.
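
(A sketch of what "controlled by mod_rewrite" can look like on the
front-end box; the hostnames and paths are made up:)

  # front-end httpd.conf: proxy the dynamic paths to a backend box
  RewriteEngine On
  RewriteRule   ^/app/(.*)$   http://backend1.internal:8080/app/$1   [P,L]
  # [P] hands the request to mod_proxy; everything else stays local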

      Les Mikesell
        lesmikesell@home.com


===
To: modperl@apache.org
From: Vivek Khera <khera@kciLink.com>
Subject: Re: the edge of chaos
Date: Fri, 5 Jan 2001 13:56:47 -0500

>>>>> "J" == Justin  <jb@dslreports.com> writes:

J> I received a helpful recommendation to look into "lingerd" ...
J> that would seem to be one approach to solving this issue .. but a
J> lingerd setup is quite different from the popular recommendations.

I think that's mostly because lingerd is so new.  I'm sure that as
people experiment with it, we will see it incorporated into the docs
and recommended setups, if it holds up.
===
To: Rick Myers <rik@sumthin.nu>
From: Justin <jb@dslreports.com>
Subject: Re: the edge of chaos (URL correction)
Date: Fri, 5 Jan 2001 18:55:16 -0500

On Thu, Jan 04, 2001 at 06:10:09PM -0500, Rick Myers wrote:
> On Jan 04, 2001 at 17:55:54 -0500, Justin twiddled the keys to say:
> > 
> > If you want to see what happens to actual output when this
> > happens, check this gif:
> >    http://www.dslreports.com/front/eth0-day.gif
> 
> You sure about this URL? I get a 404...

My bad, it is:
  www.dslreports.com/front/example.gif
Sorry to those curious enough to check the URL out.

===

the rest of The Pile (a partial mailing list archive)

doom@kzsu.stanford.edu