modperl_beware_bad_robots

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



To: modperl@apache.org
From: Justin <jb@dslreports.com>
Subject: experience on modperl-killing "vacuum bots"
Date: Wed, 20 Dec 2000 23:56:51 -0500

Hi again,

Tracing down periods of unusual modperl overload, I've
found it is usually caused by someone using an aggressive
site-mirroring tool of some kind.

The Stonehenge Throttle module (a lifesaver) was useful
for catching the really evil ones that masquerade as a
real browser .. although the version I grabbed did need
tweaking: when you get hit really hard, the determination
that yes, it is that spider again, involved a long read
loop over a rapidly growing fingerprint-of-doom file, to
the point where deciding it was the same evil spider was
taking quite a long time per hit! (Some of the really
nasty ones can hit you with 1000s of requests per minute!)

Sleeping to delay the client as it reached the soft limit
was also bad news for modperl: every sleeping request ties
up an entire heavyweight child process for the duration.

So I changed it to be more brutal about the number of
requests per time frame and bytes read per time frame, and
also to black-list the md5 of the IP/useragent combination
for longer when that does happen. Matching on the
IP/useragent combo rather than just the IP is necessary to
avoid blocking a big proxy on one IP, such as those used
in some large companies and some telco ISPs.
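
A rough sketch of the key-building (names are made up;
$r is the mod_perl 1.x request object, and Digest::MD5 is
the standard module):

    use Digest::MD5 qw(md5_hex);

    # fingerprint a client by IP *and* User-Agent, so one
    # abusive client behind a big shared proxy doesn't get
    # the whole proxy black-listed
    sub client_key {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;
        my $ua = $r->header_in('User-Agent') || '';
        return md5_hex("$ip:$ua");   # short fixed-length key
    }

    # on a repeat offence, ban the key for longer, e.g.
    #   $banned_until{ client_key($r) } = time + $ban_secs;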

By filtering error_logs over time, I've assembled a list
of nasties that have triggered the throttle repeatedly.

The trouble is, the throttle can take some time to wake
up, which means a bot can still floor your server for very
short periods..
So I also simply outright ban these user agents:

(EmailSiphon)|(LinkWalker)|(WebCapture)|(w3mir)|
(WebZIP)|(Teleport Pro)|(PortalBSpider)|(Extractor)|
(Offline Explorer)|(WebCopier)|(NetAttache)|(iSiloWeb)|
(eCatch)|(ecila)|(WebStripper)|(Oxxbot)|(MuscatFerret)|
(AVSearch)|(MSIECrawler)|(SuperBot 2.4)

Nasty little collection, huh..

MSIECrawler is particularly annoying. I think that one
shows up when somebody uses one of the Bill Gates IE5
"ideas": save for offline viewing, or something.
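
Wiring the ban up as a mod_perl access handler might look
something like this (a minimal sketch of one way to do it;
the module name is made up, and it assumes mod_perl 1.x):

    package My::BanRobots;    # hypothetical name
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    # the same pattern list, as one compiled regex
    my $bad_ua = qr/EmailSiphon|LinkWalker|WebCapture|w3mir|
                    WebZIP|Teleport\ Pro|PortalBSpider|Extractor|
                    Offline\ Explorer|WebCopier|NetAttache|iSiloWeb|
                    eCatch|ecila|WebStripper|Oxxbot|MuscatFerret|
                    AVSearch|MSIECrawler|SuperBot\ 2\.4/x;

    sub handler {
        my $r  = shift;
        my $ua = $r->header_in('User-Agent') || '';
        # refuse before the request does any real work
        return FORBIDDEN if $ua =~ $bad_ua;
        return OK;
    }
    1;

and then in httpd.conf:

    PerlAccessHandler My::BanRobots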

Anyway.. hope this is helpful the next time your modperl
server gets so busy that you have to wait 10 seconds just
to get a server-status URL to return.

This also made me think that it would be nice to design a
setup that reserves 1 or 2 modperl processes for serving
(say) the home page .. that way, when the site gets jammed
up, at least new visitors get a reasonably fast home page
to look at (perhaps including an alert warning of slow
response further in..). That is better than them coming in
from a news article or search engine and getting no
response at all.
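
One possible shape for that, sketched as Apache 1.3-style
config (all port numbers and sizes made up): run a tiny
second modperl backend reserved for the home page, and
have the front end proxy only "/" to it:

    # front-end httpd.conf fragment
    RewriteEngine on
    # send only the home page to the reserved backend
    RewriteRule ^/$ http://localhost:8082/ [P,L]

    # reserved backend (a separate, tiny httpd):
    Port 8082
    MinSpareServers 1
    MaxSpareServers 2
    MaxClients      2    # the "reserved" modperl processes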

It would also be nice for mod_proxy to have a better way
of controlling the timeout on fetches from the backend,
and the page to show in case the timeout occurs.. has
anyone done something here? Then after (say) 10 seconds,
mod_proxy could show a pretty page explaining that due to
the awesome success of your product/service, the website
is busy, and please try again very soon :-) [we should be
so lucky]. At the moment, what happens under load is that
mod_proxy queues the request (via the tcp listen queue) ..
the user might give up and press stop or reload (mod_proxy
does not seem to notice this) and thus queue up another
request via another front end, and pretty soon there is a
10 second page backlog for everyone and loads of useless
requests filling the queue..
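
The closest I can think of with stock Apache 1.3 is to
shorten the front end's Timeout and hang a friendly page
off the proxy's error status.. whether that catches every
kind of stall, I'm not sure. Roughly (values made up):

    Timeout 10                     # may also bound the proxied fetch
    ErrorDocument 502 /busy.html   # "busy, try again soon" page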

-Justin

===

To: Justin <jb@dslreports.com>
From: merlyn@stonehenge.com (Randal L. Schwartz)
Subject: Re: experience on modperl-killing "vacuum bots"
Date: 20 Dec 2000 23:25:03 -0800

>>>>> "Justin" == Justin  <jb@dslreports.com> writes:

Justin> So I also simply outright ban these user agents:

Justin> (EmailSiphon)|(LinkWalker)|(WebCapture)|(w3mir)|
Justin> (WebZIP)|(Teleport Pro)|(PortalBSpider)|(Extractor)|
Justin> (Offline Explorer)|(WebCopier)|(NetAttache)|(iSiloWeb)|
Justin> (eCatch)|(ecila)|(WebStripper)|(Oxxbot)|(MuscatFerret)|
Justin> (AVSearch)|(MSIECrawler)|(SuperBot 2.4)

Here's my list, after running Stonehenge::Throttle for probably
longer than you have... :)

	    or m{Offline Explorer/} # bad robot!
	    or m{www\.gozilla\.com} # bad robot!
	    or m{pavuk-}	# bad robot!
	    or m{ExtractorPro}	# bad robot!
	    or m{WebCopier}	# bad robot!
	    or m{MSIECrawler}	# bad robot!
	    or m{WebZIP}	# bad robot!
	    or m{Teleport Pro}	# bad robot!
	    or m{NetAttache/}	# bad robot!
	    or m{gazz/}		# bad robot!
	    or m{geckobot}	# bad robot!
	    or m{nttdirectory}	# bad robot!
	    or m{Mister PiX}	# bad robot!
	    or m{ia_archiver}	# bad robot!
	    or m{DIIbot/}	# bad robot!
	    or m{WhizBang!}	# bad robot!
	    or m{WebCopy/}	# bad robot!
	    or m{WebStripper/}	# bad robot!
	    or m{EmailSiphon}	# bad robot!
	    or m{AlkalineBOT}	# bad robot! (in Perl!)

That last one is nasty.  A Perl bot that basically sucks
pages down at full speed.  Evil.
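
Each m{} above is testing $_, so the enclosing code is
presumably shaped something like this (a guess at the
shape, not Randal's actual code):

    local $_ = $r->header_in('User-Agent') || '';
    if (0
        or m{Offline Explorer/}   # bad robot!
        or m{WebZIP}              # bad robot!
        # ... and so on through the list above ...
       ) {
        # treat as a bad robot: throttle hard or refuse
    }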

===
