Playing in Googlebot's Sandbox with Slurp, Teoma, & MSNbot - Spiders Display Differing Personalities

There has been endless webmaster speculation and worry about the so-called "Google Sandbox" - the indexing time delay for new domain names - rumored to last for at least 45 days from the date of first "discovery" by Googlebot. This recognized listing delay came to be called the "Google Sandbox effect."

Ruminations on the algorithmic elements of this sandbox time delay have ranged widely since the indexing delay was first noticed in spring of 2004. Some believe it to be an issue of one single element of good search engine optimization such as linking campaigns. Link building has been the focus of most discussion, but others have focused on the possibility of size of a new site or internal linking structure or just specific time delays as most relevant algorithmic elements.

Rather than contribute to this speculation and further muddy the Sandbox, we'll be looking at a case study of a site on a new domain name, established May 11, 2005 and the specific site structure, submissions activity, external and internal linking. We'll see how this plays out in search engine spider activity vs. indexing dates at the top four search engines.

Ready? We'll give dates and crawler action in daily lists and see how this all plays out on this single new site over time.

* May 11, 2005 - Basic text on large site posted on newly purchased domain name and going live by day's end. Search friendly structure implemented with text linking making full discovery of all content possible by robots. Homepage updated with 10 new text content pages added daily. Submitted site at Google's "Add URL" submission page.
* May 12 - 14 - No visits by Slurp, MSNbot, Teoma or Google. (Slurp is Yahoo's spider and Teoma is from Ask Jeeves.) Posted link on WebSite101 to new domain.
* May 15 - Googlebot arrives and eagerly crawls 245 pages on new domain after looking for, but not finding the robots.txt file. Oooops! Gotta add that robots.txt file!
* May 16 - Googlebot returns for 5 more pages and stops. Slurp greedily gobbles 1480 pages and 1892 bad links! Those bad links were caused by our email masking meant to keep out bad bots. How ironic Slurp likes these.
* May 17 - Slurp finds 1409 more masking links & only 209 new content pages. MSNbot visits for the first time and asks for robots.txt 75 times during the day, but leaves when it finds that file missing! Finally get around to adding robots.txt by day's end & stop Slurp crawling email masking links and let MSNbot know it's safe to come in!
* May 23 - Teoma spider shows up for the first time and crawls 93 pages. Site gets slammed by BecomeBot, a spider that hits a page every 5 to 7 seconds and strains our resources with 2409 rapid fire requests for pages. Added BecomeBot to robots.txt exclusion list to keep 'em out.
* May 24 - MSNbot has stopped showing up for a week since finding the robots.txt file missing. Slurp is showing up every few hours looking at robots.txt and leaving again without crawling anything now that it is excluded from the email masking links. BecomeBot appears to be honoring the robots.txt exclusion but asks for that file 109 times during the day. Teoma crawls 139 more pages.
* May 25 - We realize that we need to re-allocate server resources and database design and this requires changes to URLs, which means all previously crawled pages are now bad links! Implement subdomains and wonder what now? Slurp shows up and finds thousands of new email masking links as the robots.txt was not moved to new directory structures. Spiders are getting error pages upon new visits. Scampering to put out fires after wide-ranging changes to site, we miss this for a week. Spider action is spotty for 10 days until we fix robots.txt.
* June 4 - Teoma returns and crawls 590 pages! No others.
* June 5 - Teoma returns and crawls 1902 pages! No others.
* June 6 - Teoma returns and crawls 290 pages. No others.
* June 7 - Teoma returns and crawls 471 pages. No others.
* June 8 - 14 - Odd spider behavior, looking at robots.txt only.
* June 15 - Slurp gets thirsty, gulps 1396 pages! No others.
* June 16 - Slurp still thirsty, gulps 1379 pages! No others.

So we'll take a break here at the 5 weeks point and take note of the very different behavior of the top crawlers. Googlebot visits once and looks at a substantial number of pages but doesn't return for over a month. Slurp finds bad links and seems addicted to them as it stops crawling good pages until it is told to lay off the bad liquor, er that is links, by getting robots.txt to slap Slurp to its senses. MSNbot visits looking for that robots.txt and won't crawl any pages until told what NOT to do by the robots.txt file. Teoma just crawls like crazy, takes breaks, then comes back for more.

This behavior may imitate the differing personalities of the software engineers who designed them. Teoma is tenacious and hard working. MSNbot is timid and needs instruction and some reassurance it is doing the right thing, picks up pages slowly and carefully. Slurp has an addictive personality and performs erratically on a random schedule. Googlebot takes a good long look and leaves. Who knows whether it will be back and when.

Now let's look at indexing by each engine. As of this writing on July 7, each engine shows differing indexing behavior as well. Google shows no pages indexed although it crawled 250 pages nearly two months ago. Yahoo has three pages indexed in a clear aging routine that doesn't list any of the nearly 8,000 pages it has crawled to date (not all itemized above). MSN has 187 pages indexed while crawling fewer pages than any of the others. Ask Jeeves has crawled more pages to date than any search engine, yet has not indexed a single page.

Each of the engines will show the number of pages indexed if you use the query operator "site:publish101.com" without the quotes. MSN 187 pages, Ask none, Yahoo 3 pages, Google none.

The daily activity not listed in the three weeks since June 16 above has not varied dramatically, with Teoma crawling a bit more than other engines, Slurp erratically up and down and MSN slowly gathering 30 to 50 pages daily. Google is absent.

Linking campaign has been minimal with posts to discussion lists, a couple of articles and some blog activity. Looking back over this time it is apparent that a listing delay is actually quite sensible from the view of the search engines.

Our site restructuring and bobbled robots.txt implementation seems to have abruptly stalled crawling, but the indexing behavior of each engine displays distinctly differing policy by each major player.

The sandbox is apparently not just Google's playground, but it is certainly tiresome after nearly two months. I think I'd like to leave for home, have some lunch and take a nap now.

Back to class before we leave for the day, kiddies. What did we learn today? Watch early crawler activity and be certain to implement robots.txt early and adjust often for bad bots.

Oh yes, and the sandbox belongs to all search engines.
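
For readers who want to act on the robots.txt lesson, here is a minimal sketch of how a robots.txt file might be checked before it goes live, using Python's standard urllib.robotparser module. The rules are an illustration only, not the file used on the case study site: the /mail/ path standing in for the email masking links, the www.example.com host and the allowed() helper are all assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one aggressive bot entirely and keep
# every other spider away from an assumed email-masking directory (/mail/).
ROBOTS_TXT = """\
User-agent: BecomeBot
Disallow: /

User-agent: *
Disallow: /mail/
"""

SITE = "http://www.example.com"  # placeholder host, not the real site

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the given spider may fetch the path under these rules."""
    return parser.can_fetch(agent, SITE + path)


if __name__ == "__main__":
    for agent, path in [
        ("BecomeBot", "/index.html"),
        ("Slurp", "/mail/contact?id=1"),
        ("Slurp", "/articles/page1.html"),
        ("msnbot", "/articles/page1.html"),
    ]:
        verdict = "allowed" if allowed(agent, path) else "blocked"
        print(agent, path, verdict)
```

A check like this only shows what a well-behaved spider should do with the file. Whether a given crawler honors it still has to be confirmed in the server logs, as the daily lists above show.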