Search engines send out what are called
spiders, crawlers or robots to visit a site
and gather its web pages. These robots leave traces
behind in the site's access logs.
How to identify a spider?
Those from the major search engines can sometimes be identified from their host names. These often incorporate part of the search engine's name or the company's name. For example, one of WebCrawler's host names is spidey.webcrawler.com.
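As a rough illustration of the host name check, here is a minimal Python sketch. The function name and the fragment list are assumptions made for this example; only the WebCrawler host name comes from the text above, and the second host name is invented.

```python
# Host-name fragments to look for. Only "webcrawler" comes from the text;
# add fragments for other search engines as you learn them.
KNOWN_FRAGMENTS = ("webcrawler",)


def looks_like_search_engine_host(hostname, fragments=KNOWN_FRAGMENTS):
    """Return True if any known fragment appears in the host name."""
    host = hostname.lower()
    return any(fragment in host for fragment in fragments)


print(looks_like_search_engine_host("spidey.webcrawler.com"))
print(looks_like_search_engine_host("dialup42.example.net"))  # invented host
```

A substring match like this is only a hint. It misses spiders whose host names do not resolve or do not mention the search engine at all.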
A better way of spotting spiders is to look
for their agent names, or what some people
call browser names. Spiders have their own
names, just like browsers.
For example, Netscape identifies
itself by saying Mozilla. Alta Vista's spider
says Scooter, while HotBot's spider is named
Slurp.
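The agent name check might be sketched in Python as follows. This assumes the access log uses the Combined Log Format, where the agent name is the last quoted field on each line. The helper names and the sample log lines (including the version numbers) are invented for illustration.

```python
import re

# Agent-name fragments mentioned in the text; extend the list as names change.
SPIDER_AGENTS = ("scooter", "slurp")

# Assumption: Combined Log Format, so the agent is the last quoted field.
AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def agent_of(log_line):
    """Return the agent name from a log line, or None if none is found."""
    match = AGENT_FIELD.search(log_line)
    return match.group(1) if match else None


def is_known_spider(log_line):
    """Return True if the line's agent name contains a known spider name."""
    agent = agent_of(log_line)
    if agent is None:
        return False
    return any(name in agent.lower() for name in SPIDER_AGENTS)


# Invented sample lines, not taken from a real log.
spider_line = ('crawler.example.com - - [10/Oct/1999:13:55:36 -0700] '
               '"GET /index.html HTTP/1.0" 200 2326 "-" "Scooter/2.0"')
browser_line = ('dialup42.example.net - - [10/Oct/1999:13:56:01 -0700] '
                '"GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0"')

print(agent_of(spider_line), is_known_spider(spider_line))
print(agent_of(browser_line), is_known_spider(browser_line))
```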
Some resources for getting a list of host
and agent names for the major search engines
are listed below. However, it's useful to know how
to spot any robot, because names can change
and new robots can appear. The principles of
spotting spiders remain the same.
The Best Clue: robots.txt
This is a file that tells robots what they may and may not index within a site. Not all spiders follow the robots.txt convention, but most do. Anything requesting this file is almost certainly a spider, robot or an agent.
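To make the convention concrete, the sketch below shows how a well-behaved robot might consult the file, using Python's standard urllib.robotparser module. The robots.txt contents, the agent name ExampleBot and the URLs are all invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt contents, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite robot asks before fetching each page.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/notes.html")

print(allowed)  # True: nothing disallows /index.html
print(blocked)  # False: /private/ is disallowed for all agents
```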
By reviewing requests for robots.txt, we can usually spot spiders from the major search engines by their host names, which in turn tells us their latest agent names. It is surprising to note how many smaller search engines, personal agents and other robots are also accessing the site.
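One possible way to run this review is sketched below in Python. It assumes the access log uses the Combined Log Format. The function name and the sample lines are invented for illustration, and a real log may need a more forgiving pattern.

```python
import re
from collections import Counter

# Assumption: Combined Log Format, that is host, identity, user, [time],
# "request", status, size, "referrer", "agent".
LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>[^\s"]+)[^"]*" '
    r'\S+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_requesters(lines):
    """Count (host, agent) pairs that requested /robots.txt."""
    seen = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("path").split("?")[0] == "/robots.txt":
            seen[(match.group("host"), match.group("agent"))] += 1
    return seen


# Invented sample lines, not taken from a real log.
sample = [
    'crawler.example.com - - [10/Oct/1999:13:55:36 -0700] '
    '"GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"',
    'dialup42.example.net - - [10/Oct/1999:13:56:01 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0"',
]

for (host, agent), count in robots_txt_requesters(sample).items():
    print(host, agent, count)
```

Each pair in the result gives a host name together with the agent name it reported, which is the pairing described above.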