Increase the accuracy of api stats and decrease your server load: block bad bots



prosperent brian
05-13-2011, 08:51 AM
The majority of the requests sent to the API come from junk bots and scrapers trying to steal your content. Blocking these bad bots has several advantages. It increases the accuracy of your API reports because you aren't sending us as many junk requests, and it lowers your server load so you can handle more legitimate traffic. This ultimately means you can build more sites on each server and earn more money. That is the end goal after all ;).

There are a few methods to handle the bad bots, so we will go over a couple here. First is blocking them by user agent in the .htaccess file. The .htaccess file is a per-directory config file that your web server checks before sending out a requested page. In this case, we look at the requesting client's user agent, which identifies the web browser they are using or the name of the bot. By blocking bots in the .htaccess, we prevent the web server from having to send out data to junk bots. This isn't as effective as some of the other methods we will talk about, since a bot can send any user agent it likes, but it is dead simple to implement. All you have to do is edit the .htaccess file in the root directory of your API web site and add the following lines:


# Return 403 Forbidden to any client whose user agent matches one of the patterns below
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Save the file. If you put these rules in the Apache conf files instead of .htaccess, restart Apache afterwards, either with "service httpd restart" from the shell (without the quotes) or via your control panel software. Again, this is the least effective method, but it is better than nothing. Next we will talk about setting up an actual bot trap.
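
If you want to sanity check a user agent against the list before going live, here is a rough Python sketch of how the rules above match. It only covers a handful of the patterns, and the function name is made up for illustration. It assumes the same behavior as the rules: patterns starting with ^ only match at the beginning of the user agent, while the HTTrack and Indy Library rules ([NC], no ^) match anywhere and ignore case.

```python
import re

# A small sample of the names from the .htaccess list above (not the full list).
# These correspond to the rules written with a leading ^ and no [NC] flag.
ANCHORED = ["BlackWidow", "Teleport Pro", "WebZIP", "Wget", "Zeus"]

# These correspond to the two rules written without ^ and with [NC].
UNANCHORED_NOCASE = ["HTTrack", "Indy Library"]


def is_blocked(user_agent: str) -> bool:
    """Return True if the user agent would hit one of the sampled rules."""
    for name in ANCHORED:
        # re.match only matches at the start of the string, like ^ in the rule
        if re.match(re.escape(name), user_agent):
            return True
    for name in UNANCHORED_NOCASE:
        # re.search matches anywhere; IGNORECASE mirrors the [NC] flag
        if re.search(re.escape(name), user_agent, re.IGNORECASE):
            return True
    return False
```

For example, is_blocked("Wget/1.12 (linux-gnu)") is True, but a user agent that merely contains "Wget" somewhere after "Mozilla/5.0" slips through the anchored rule, which is one reason this method is the weakest of the three.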

prosperent brian
05-13-2011, 08:57 AM
For the next step, building an actual bot trap, we have a good tutorial over at: http://danielwebb.us/software/bot-trap/

This covers setting up a fake directory on your server that is disallowed by your robots.txt file. When a bad bot comes in, it will often ignore your robots.txt file and request the directory anyway. When it does, it will get added to a bad bot list, which you can then look over and block via your .htaccess file or firewall.
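
To give a feel for the idea, here is a minimal Python sketch that builds the bad bot list from an access log. This is not the script from the tutorial above, just an illustration. It assumes your log is in the common/combined log format and that the trap directory is called /bot-trap/ (a hypothetical name; use whatever you disallowed in robots.txt). The function names are also made up.

```python
import re

# Hypothetical trap directory; it should be listed under Disallow in robots.txt
TRAP_PATH = "/bot-trap/"

# Common/combined log format: client IP first, request line inside quotes
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|POST|HEAD) (\S+)')


def trapped_ips(log_lines):
    """Return the sorted, de-duplicated IPs that requested the trap directory."""
    ips = set()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m and m.group(2).startswith(TRAP_PATH):
            ips.add(m.group(1))
    return sorted(ips)


def htaccess_deny_lines(ips):
    """Format the IPs as Apache 2.2-style deny lines.

    Apache 2.4 uses "Require not ip ..." instead, so adjust for your version.
    """
    return [f"Deny from {ip}" for ip in ips]
```

You would pass it the lines of your access log, e.g. trapped_ips(open(path)), then review the result before blocking anything, since a legitimate visitor could land in the trap by accident.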

prosperent brian
05-13-2011, 09:07 AM
The final method is by far the best. It employs a bot trap like the second method, but it goes further by blocking the bots at the firewall level. Blocking a bot at the firewall means they never even make it into your server to make a request to the web server. This cuts the bandwidth down, automates the entire process, and will have the most dramatic effect on server load since no pages are ever even requested. I HIGHLY suggest using this third option if you can. Here is a step-by-step tutorial: http://www.rubyrobot.org/article/protect-your-web-server-from-spambots
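
As a rough sketch of the firewall step (not taken from the tutorial above), the Python below turns a list of trapped IPs into iptables commands. It only builds the command strings and does not run anything. It assumes a Linux box using iptables with the standard INPUT chain, and the function name is made up for illustration.

```python
import ipaddress


def iptables_drop_commands(addresses):
    """Build one iptables DROP command per IPv4 address or CIDR range.

    Invalid input raises ValueError so a typo never ends up in the firewall.
    IPv6 entries are skipped because iptables only handles IPv4.
    """
    commands = []
    for raw in addresses:
        net = ipaddress.ip_network(raw.strip(), strict=False)
        if net.version != 4:
            continue
        # -I inserts at the top of the chain so the drop is checked first
        commands.append(f"iptables -I INPUT -s {net} -j DROP")
    return commands
```

Print the commands and look them over before running them as root, and make sure your own IP is not in the list.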

My last suggestion, and one you should look at using no matter which option you choose, is to install the Apache module mod_security. It monitors requests in real time and blocks bad requests as they come in. It is fully configurable and super easy to install. Here is a tutorial for using it with CentOS, for example: http://www.cyberciti.biz/faq/rhel-fedora-centos-httpd-mod_security-configuration/

AcidRaZor
05-13-2011, 09:39 AM
Man, when you drop knowledge-bombs you obliterate! Thanks for the info!

AcidRaZor
05-13-2011, 09:56 AM
"My last suggestion, and one you should look at using no matter which option you choose, is to install the Apache module mod_security. It monitors requests in real time and blocks bad requests as they come in. It is fully configurable and super easy to install. Here is a tutorial for using it with CentOS, for example: http://www.cyberciti.biz/faq/rhel-fedora-centos-httpd-mod_security-configuration/"

Is the default .conf file for mod_security fine to run or is there anything you suggest we turn on/tweak?

lhw455
05-13-2011, 04:13 PM
If we're using Zaphod's MFPMu script, should we just add it to the bottom of the .htaccess?

monalisa
05-14-2011, 03:29 AM
This has helped a lot to reduce junk traffic and server loads across all my servers.

toykilla
05-14-2011, 05:46 AM
Do you really have to restart Apache when you edit a .htaccess file?

monalisa
05-14-2011, 06:43 AM
"Do you really have to restart Apache when you edit a .htaccess file?"

If the rules are in the apache conf files then YES, otherwise if they are in .htaccess files NO.

garydubbs
05-14-2011, 03:06 PM
brian will you please stop adding to my to do list :)

prosperent brian
05-14-2011, 09:02 PM
Haha, on it :).

For the others that asked, I'll go into more detail on the config options for mod_security and such on Monday :)

AcidRaZor
05-15-2011, 02:49 AM
Thanks Brian!

Hoops
05-15-2011, 08:50 AM
Just thought I would add another option for cleaning up your traffic. In the past I have blocked whole IP address ranges on an eBay-related site, and I saw my stats and clicks return to normal levels.

Here's a good starting point to find the IP addresses that you may want to block: http://www.wizcrafts.net/chinese-blocklist.html

Of course, you may need to add exceptions if you do business with people in the countries listed.
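
If you end up with a long list of ranges, it can help to merge adjacent and overlapping ones before loading them anywhere, so you have fewer entries to manage. Here is a small Python sketch using the standard ipaddress module. It assumes the ranges are IPv4 and written in CIDR notation, and the function name is just for illustration.

```python
import ipaddress


def collapse_ranges(cidrs):
    """Merge adjacent and overlapping CIDR ranges into the shortest list.

    All entries must be the same IP version (IPv4 assumed here).
    """
    nets = [ipaddress.ip_network(c.strip(), strict=False) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]
```

For example, the two halves 192.0.2.0/25 and 192.0.2.128/25 come back as the single range 192.0.2.0/24, and any smaller range already covered by a bigger one is dropped.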

prosperent brian
05-15-2011, 09:19 AM
If you are blocking countries or ip ranges I would do it at the firewall level. That way, they aren't even hitting your web server and consuming resources :)

AcidRaZor
05-15-2011, 09:28 AM
"If you are blocking countries or ip ranges I would do it at the firewall level. That way, they aren't even hitting your web server and consuming resources :)"

Won't adding too many IPs slow down the server? I keep reading that.

Hoops
05-15-2011, 09:36 AM
"If you are blocking countries or ip ranges I would do it at the firewall level. That way, they aren't even hitting your web server and consuming resources :)"

That makes a lot of sense. I guess I'll need to learn how to make changes to my vps firewall - assuming I have one... :confused: