Today I worked on a small Perl project: collecting statistics on search engine spiders from an access_log. First I looked for a list of IP addresses of search engine spiders (or bots). I found a site that has a long list of IP addresses for each search engine.
I downloaded each list of search engine spider IP addresses and saved them all into the same directory as text files. The Perl program reads and processes all text files in this directory. Each file contains zero or more quads (4 numbers separated by dots) and zero or more triples (partial addresses, 3 numbers separated by dots). I stored all quads as keys in one hash and all triples as keys in another, and used the name of the file without the extension as the value associated with each key, e.g. google for google.txt.
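The loading step could be sketched roughly as follows. This is a minimal sketch, not the original program: the sub name load_spider_lists is made up for illustration, and it assumes each file holds one address per line.

```perl
use strict;
use warnings;

# Hypothetical helper: read every *.txt file in $dir and return two hash
# references, one for full addresses (quads) and one for partial
# addresses (triples). Each maps the address to the file name without
# its extension, e.g. "google" for google.txt.
sub load_spider_lists {
    my ($dir) = @_;
    my (%quad, %triple);

    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @entries = readdir $dh;
    closedir $dh;

    for my $entry (sort @entries) {
        next unless $entry =~ /^(.+)\.txt$/;
        my $name = $1;    # copy before the next match overwrites $1

        open my $fh, '<', "$dir/$entry" or die "Cannot open $dir/$entry: $!";
        while (my $line = <$fh>) {
            # Assumption: one address per line, surrounding whitespace allowed.
            if ($line =~ /^\s*(\d+\.\d+\.\d+\.\d+)\s*$/) {
                $quad{$1} = $name;
            }
            elsif ($line =~ /^\s*(\d+\.\d+\.\d+)\s*$/) {
                $triple{$1} = $name;
            }
        }
        close $fh;
    }
    return (\%quad, \%triple);
}
```

Calling my ($quad, $triple) = load_spider_lists('spiders'); would then give the two lookup hashes, where 'spiders' stands for whatever directory holds the downloaded lists.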
Next, the Perl program reads an access_log line by line and extracts the quad (the leading run of non-space characters on an access_log line) and the user agent. It checks if the quad is in the quad hash, and if it is, it fetches the value, does some bookkeeping, and moves on to the next line in the log. If the quad is not in the quad hash, the last dot and number are dropped from the quad, turning it into a triple, and the triple hash is checked. If the triple is in that hash, the value is fetched, some bookkeeping is done, and the program moves to the next line in the log. And if the triple is not found, the program also moves to the next line in the access_log.
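The two-step lookup described above might look something like this. It is only a sketch: the sub name spider_for_line is hypothetical, it takes the quad and triple hashes as references, and it leaves out the user agent extraction and the bookkeeping.

```perl
use strict;
use warnings;

# Hypothetical helper: return the spider name for one access_log line,
# or undef when neither the quad nor the triple is known.
sub spider_for_line {
    my ($line, $quad, $triple) = @_;

    # The address is the leading run of non-space characters.
    my ($ip) = $line =~ /^(\S+)/ or return;

    # First try the full address.
    return $quad->{$ip} if exists $quad->{$ip};

    # Drop the last dot and number to turn the quad into a triple.
    (my $prefix = $ip) =~ s/\.\d+$//;
    return $triple->{$prefix};
}
```

In the main loop, a defined return value would be the point to do the bookkeeping, for example $count{$name}++, before moving on to the next line.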
When the end of the access_log is reached, the file is closed and the program calculates percentages and prints the results of those calculations and the counts obtained in the bookkeeping step.
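The final step could be sketched like this, assuming the bookkeeping produced a hash of hit counts per spider plus a total number of log lines. The sub name report and the output layout are my own choices here, not necessarily what the original program prints.

```perl
use strict;
use warnings;

# Hypothetical helper: turn per-spider hit counts into report lines with
# the count and the percentage of all hits, busiest spider first.
sub report {
    my ($count, $total) = @_;
    my @lines;
    for my $name (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
        # Guard against an empty log to avoid dividing by zero.
        my $pct = $total ? 100 * $count->{$name} / $total : 0;
        push @lines, sprintf "%-12s %8d %6.2f%%", $name, $count->{$name}, $pct;
    }
    return @lines;
}
```

With the counts in %count and the number of lines read in $total, print "$_\n" for report(\%count, $total); would print one line per spider.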
I fed the program a part of the October 2006 access_log for this site, and here are some results:
I have no idea why Yahoo! considers it necessary to cause 2.33% of all hits on my site, nor why Yahoo! Slurp is almost 10 times more active than MSNbot, and more than 10 times more active than Googlebot. It looks like, for the big 3, less is indeed better.
Anyway, over 3% of the hits this site gets are caused by search engine spiders. It wouldn't amaze me if all the malware hitting this site (email harvesting bots, comment spam bots, script kiddie tools), added to this 3%, brought the total to 10%.