To: modperl@apache.org
From: Justin <jb@dslreports.com>
Subject: experience on modperl-killing "vacuum bots"
Date: Wed, 20 Dec 2000 23:56:51 -0500

Hi again,

Tracing down periods of unusual modperl overload, I've found it is
usually caused by someone using an aggressive site-mirror tool of
some kind.

The Stonehenge Throttle module (a lifesaver) was useful to catch the
really evil ones that masquerade as a real browser, although the
version I grabbed did need to be tweaked: when you get hit really
hard, the determination that yes, it is that spider again, involved a
long read loop over a rapidly growing fingerprint-of-doom file, to
the point where deciding it was the same evil spider was taking quite
a long time per hit! (Some real nasty ones can hit you with thousands
of requests per minute.) Also, sleeping to delay the reader as it
reached the soft limit was bad news for modperl. So I changed it to
be more brutal about the number of requests per time frame and bytes
read per time frame, and to blacklist the MD5 of the IP/user-agent
combination for longer when that does happen. Matching on the
IP/user-agent combo rather than just the IP is necessary to avoid
blocking a big proxy on one IP, which is how some large companies and
some telco ISPs operate.

In filtering error_logs over time, I've assembled a list of nasties
that have triggered the throttle repeatedly. The trouble is, the
throttle can take some time to wake up, which can still floor your
server for very short periods. So I also simply outright ban these
user agents:

    (EmailSiphon)|(LinkWalker)|(WebCapture)|(w3mir)|
    (WebZIP)|(Teleport Pro)|(PortalBSpider)|(Extractor)|
    (Offline Explorer)|(WebCopier)|(NetAttache)|(iSiloWeb)|
    (eCatch)|(ecila)|(WebStripper)|(Oxxbot)|(MuscatFerret)|
    (AVSearch)|(MSIECrawler)|(SuperBot 2.4)

Nasty little collection, huh? MSIECrawler is particularly annoying; I
think that is what you get when somebody uses one of the Bill Gates
IE5 "ideas": save for offline viewing, or something.

Anyway, hope this is helpful next time your modperl server gets so
busy you have to wait 10 seconds just to get a server-status URL to
return.

This also made me think that perhaps it would be nice to design a
setup that reserved one or two modperl processes for serving (say)
the home page. That way, when the site gets jammed up, at least new
visitors get a reasonably fast home page to look at (perhaps
including an alert warning about slow response lower down). That is
better than them coming in from a news article or search engine and
getting no response at all.

It would also be nice for mod_proxy to have a better way of
controlling the timeout on fetching from the backend, and the page to
show in case a timeout occurs. Has anyone done something here? Then
after 10 seconds (say) mod_proxy could show a pretty page explaining
that, due to the awesome success of your product/service, the website
is busy, and please try again very soon :-) [we should be so lucky].
At the moment, what happens under load is that mod_proxy seems to
queue the request up (via the TCP listen queue); the user might give
up and press stop or reload (mod_proxy does not seem to know this)
and thus queue up another request via another front end, and pretty
soon there is a 10-second page backlog for everyone and loads of
useless requests filling the queue.

-Justin
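The fingerprint blacklisting Justin describes could look roughly like
the sketch below. This is an illustration, not his actual patch: the
helper names and the one-hour ban length are assumptions, and a real
version needs state shared across Apache children (files or a dbm),
whereas the hash here is private to each process.

    use strict;
    use Digest::MD5 qw(md5_hex);

    use constant BAN_SECONDS => 3600;   # assumed ban length

    # fingerprint => epoch time when the ban expires.
    # NOTE: per-process only; production code wants shared storage.
    my %blacklist;

    # Key on IP *and* User-Agent, so one big proxy IP
    # (a large company, a telco ISP) isn't banned wholesale.
    sub fingerprint {
        my ($ip, $ua) = @_;
        return md5_hex("$ip\0$ua");
    }

    sub ban {
        my ($ip, $ua) = @_;
        $blacklist{ fingerprint($ip, $ua) } = time() + BAN_SECONDS;
    }

    sub is_banned {
        my ($ip, $ua) = @_;
        my $fp = fingerprint($ip, $ua);
        my $expires = $blacklist{$fp} or return 0;
        if (time() > $expires) {   # ban has lapsed; forget it
            delete $blacklist{$fp};
            return 0;
        }
        return 1;
    }

A throttle handler would call is_banned() early and refuse the
request, and call ban() when the requests-per-interval or
bytes-per-interval counters blow past the hard limit.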
===

To: Justin <jb@dslreports.com>
From: merlyn@stonehenge.com (Randal L. Schwartz)
Subject: Re: experience on modperl-killing "vacuum bots"
Date: 20 Dec 2000 23:25:03 -0800

>>>>> "Justin" == Justin <jb@dslreports.com> writes:

Justin> So I also simply outright ban these user agents:

Justin>     (EmailSiphon)|(LinkWalker)|(WebCapture)|(w3mir)|
Justin>     (WebZIP)|(Teleport Pro)|(PortalBSpider)|(Extractor)|
Justin>     (Offline Explorer)|(WebCopier)|(NetAttache)|(iSiloWeb)|
Justin>     (eCatch)|(ecila)|(WebStripper)|(Oxxbot)|(MuscatFerret)|
Justin>     (AVSearch)|(MSIECrawler)|(SuperBot 2.4)

Here's my list, after running Stonehenge::Throttle for probably
longer than you have... :)

        or m{Offline Explorer/}    # bad robot!
        or m{www\.gozilla\.com}    # bad robot!
        or m{pavuk-}               # bad robot!
        or m{ExtractorPro}         # bad robot!
        or m{WebCopier}            # bad robot!
        or m{MSIECrawler}          # bad robot!
        or m{WebZIP}               # bad robot!
        or m{Teleport Pro}         # bad robot!
        or m{NetAttache/}          # bad robot!
        or m{gazz/}                # bad robot!
        or m{geckobot}             # bad robot!
        or m{nttdirectory}         # bad robot!
        or m{Mister PiX}           # bad robot!
        or m{ia_archiver}          # bad robot!
        or m{DIIbot/}              # bad robot!
        or m{WhizBang!}            # bad robot!
        or m{WebCopy/}             # bad robot!
        or m{WebStripper/}         # bad robot!
        or m{EmailSiphon}          # bad robot!
        or m{AlkalineBOT}          # bad robot! (in Perl!)

That last one is nasty: a Perl bot that basically sucks full speed.
Evil.

===
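For reference, here is a minimal sketch of how either ban list might
be wired up as a mod_perl 1.x access handler. The module name and the
abbreviated pattern are illustrative, not either poster's actual
configuration.

    package My::BlockBots;
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    # Abbreviated; extend with the full lists above.
    my $BAD_UA = qr{
        EmailSiphon | WebZIP | Teleport\ Pro | Offline\ Explorer |
        WebCopier   | MSIECrawler | WebStripper/ | AlkalineBOT
    }x;

    sub handler {
        my $r  = shift;
        my $ua = $r->header_in('User-Agent') || '';
        if ($ua =~ $BAD_UA) {
            $r->log_reason("banned vacuum bot: $ua", $r->uri);
            return FORBIDDEN;
        }
        return OK;
    }

    1;

Enabled in httpd.conf with:

    PerlAccessHandler My::BlockBots

Returning FORBIDDEN here costs one cheap 403 instead of a full
modperl page, which is the point of banning these agents outright
rather than waiting for the throttle to wake up.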