Spotting Spam Bots

In your Amazon AWS console, locate the machine that is causing your performance issues and look at its monitoring graphs. You should be able to spot traffic and server load peaks; some might last for extended periods of time.

OK, now look at the times when the load peaks occur and note one of them. Since we know that server load bursts are caused by excessive traffic, let's take a look at the Apache log files. Log files can be quite big - up to 1 GB in size - so searching them by hand can be quite tedious. Luckily we can use grep, the Linux/Unix command-line tool for searching for strings in files. OK, but how do we find a spammer? Remember the load peak times you noted earlier - they are our hint!
The Apache web server uses the Common Log Format in its log files. One of the fields that format includes is a timestamp, so every logged request records the time of day it occurred.
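To make the timestamp field concrete, here is what a single Common Log Format entry looks like (a made-up example with a hypothetical IP and request) - the part in square brackets is what we will grep against:

```
203.0.113.5 - - [01/Dec/2014:14:23:42 +0000] "GET /index.html HTTP/1.1" 200 2326
```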

So we are ready to narrow down our search.

cd /var/log/apache2/
grep "01/Dec/2014:14" access.log

The above searches the access log for a pattern. In this example we are looking for all requests that happened on the 1st of December 2014 between 14:00 and 14:59, i.e. during the 2 o'clock hour in the afternoon.
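The filtering step can be tried out safely on a small sample log. The entries below are made up (hypothetical IPs and paths), but they have the same timestamp format as a real Apache log, so the same grep pattern works:

```shell
# Create a tiny sample access log with made-up entries.
cat > /tmp/access_sample.log <<'EOF'
203.0.113.5 - - [01/Dec/2014:13:59:01 +0000] "GET / HTTP/1.1" 200 512
203.0.113.7 - - [01/Dec/2014:14:05:10 +0000] "GET /page HTTP/1.1" 200 1024
203.0.113.7 - - [01/Dec/2014:14:41:55 +0000] "GET /page HTTP/1.1" 200 1024
203.0.113.9 - - [01/Dec/2014:15:02:33 +0000] "GET / HTTP/1.1" 200 512
EOF

# All requests during the 14:00-14:59 hour - only the two 14:xx lines match.
grep "01/Dec/2014:14" /tmp/access_sample.log
```

The same idea extends to narrower windows: for example, adding the minutes to the pattern ("01/Dec/2014:14:41") would isolate a single minute of traffic.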

Now take a look at the data on your screen. You may get many lines of output - it all depends on how much traffic your site is getting.

Even though spam robots try to hide themselves as much as possible, many of them still identify themselves. Normally their identity shows up in the User-Agent string, which is also logged.
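Rather than eyeballing individual lines, you can summarize the User-Agent strings and see which ones dominate. This sketch assumes the combined log format, where the User-Agent is the sixth double-quote-delimited field; the log entries below are made up for demonstration:

```shell
# Sample combined-format log with a made-up browser and a bot User-Agent.
cat > /tmp/access_ua.log <<'EOF'
203.0.113.5 - - [01/Dec/2014:14:05:10 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/33.0"
203.0.113.7 - - [01/Dec/2014:14:06:12 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"
203.0.113.7 - - [01/Dec/2014:14:06:13 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"
EOF

# Split each line on double quotes; field 6 is the User-Agent.
# Count occurrences and list the most frequent agents first.
awk -F'"' '{print $6}' /tmp/access_ua.log | sort | uniq -c | sort -rn
```

On a real server you would point the awk command at /var/log/apache2/access.log; a bot sitting at the top of this list with an outsized request count is a strong suspect.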

What is a user agent?

In the usual case - since we are on the web - this is the browser from which the site is being accessed. The user agent string would then look like "Mozilla/5.0 (Linux; U; Android 4.1.2; en-gb; GT-I8190N Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30", which is a Mozilla-compatible browser on an Android phone.

Now try to spot something that does not look like a normal web browser. In our case it was AhrefsBot that was causing massive disruptions on our site.

According to their website you can block them via an entry in the robots.txt file; however, a little more googling led me to this resource: AhrefsBot - SEO Spybots, which suggested blocking this particular spammer at the IP level.
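For reference, an IP-level block can be done with iptables. The address below is hypothetical - substitute the one you actually find in your logs:

```shell
# Drop all traffic from the offending address (hypothetical IP).
# Requires root; a rule added this way does not survive a reboot unless saved.
iptables -A INPUT -s 203.0.113.7 -j DROP
```

Note that well-behaved crawlers change IPs over time, so the robots.txt route is usually worth trying first; the firewall rule is the heavier hammer.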

grep -c "AhrefsBot" access.log

In the command above I am using the -c argument, which prints the number of matching lines instead of the matches themselves.
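Once you know the bot's User-Agent, the next useful step is finding which source IPs it comes from - those are the addresses you would block. A small sketch, again on a made-up sample log:

```shell
# Sample log: two bot requests from one IP, one normal request from another.
cat > /tmp/access_bot.log <<'EOF'
203.0.113.5 - - [01/Dec/2014:14:05:10 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/33.0"
203.0.113.7 - - [01/Dec/2014:14:06:12 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"
203.0.113.7 - - [01/Dec/2014:14:06:13 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"
EOF

# Keep only the bot's lines, print the first field (the client IP),
# and de-duplicate to get the list of distinct offending addresses.
grep "AhrefsBot" /tmp/access_bot.log | awk '{print $1}' | sort -u
```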