There has been endless webmaster speculation and worry about the so-called "Google Sandbox" - the indexing time delay for new domain names - rumored to last for at least 45 days from the date of first "discovery" by Googlebot. This recognized listing delay came to be called the "Google Sandbox effect."

Ruminations on the algorithmic elements of this sandbox time delay have ranged widely since the indexing delay was first noticed in spring of 2004. Some believe it to be an issue of one single element of good search engine optimization, such as linking campaigns. Link building has been the focus of most discussion, but others have pointed to the size of a new site, its internal linking structure, or simply a fixed time delay as the most relevant algorithmic elements.

Rather than contribute to this speculation and further muddy the Sandbox, we'll be looking at a case study of a site on a new domain name, established May 11, 2005, along with its specific site structure, submissions activity, and external and internal linking. We'll see how this plays out in search engine spider activity vs. indexing dates at the top four search engines.

Ready? We'll give dates and crawler action in daily lists and see how this all plays out on this single new site over time.

* May 11, 2005 - Basic text on large site posted on newly purchased domain name and going live by day's end. Search-friendly structure implemented with text linking, making full discovery of all content possible by robots. Homepage updated with 10 new text content pages added daily. Submitted site at Google's "Add URL" submission page.
* May 12 - 14 - No visits by Slurp, MSNbot, Teoma or Google. (Slurp is Yahoo's spider and Teoma is from Ask Jeeves.) Posted link on WebSite101 to new domain.
* May 15 - Googlebot arrives and eagerly crawls 245 pages on new domain after looking for, but not finding, the robots.txt file. Oooops! Gotta add that robots.txt file!
* May 16 - Googlebot returns for 5 more pages and stops. Slurp greedily gobbles 1480 pages and 1892 bad links! Those bad links were caused by our email masking, meant to keep out bad bots. How ironic that Slurp likes these.
* May 17 - Slurp finds 1409 more masking links & only 209 new content pages. MSNbot visits for the first time and asks for robots.txt 75 times during the day, but leaves when it finds that file missing! Finally get around to adding robots.txt by day's end to stop Slurp crawling email masking links and let MSNbot know it's safe to come in!
* May 23 - Teoma spider shows up for the first time and crawls 93 pages. Site gets slammed by BecomeBot, a spider that hits a page every 5 to 7 seconds and strains our resources with 2409 rapid-fire requests for pages. Added BecomeBot to robots.txt exclusion list to keep 'em out.
* May 24 - MSNbot has stopped showing up for a week since finding the robots.txt file missing. Slurp is showing up every few hours, looking at robots.txt and leaving again without crawling anything now that it is excluded from the email masking links. BecomeBot appears to be honoring the robots.txt exclusion but asks for that file 109 times during the day. Teoma crawls 139 more pages.
* May 25 - We realize that we need to re-allocate server resources and change the database design, and this requires changes to URLs, which means all previously crawled pages are now bad links! Implement subdomains and wonder what now? Slurp shows up and finds thousands of new email masking links, as the robots.txt was not moved to the new directory structures. Spiders are getting error pages upon new visits. Scampering to put out fires after wide-ranging changes to site, we miss this for a week. Spider action is spotty for 10 days until we fix robots.txt.
* June 4 - Teoma returns and crawls 590 pages! No others.
* June 5 - Teoma returns and crawls 1902 pages! No others.
* June 6 - Teoma returns and crawls 290 pages. No others.
* June 7 - Teoma returns and crawls 471 pages. No others.
* June 8 - 14 - Odd spider behavior, looking at robots.txt only.
* June 15 - Slurp gets thirsty, gulps 1396 pages! No others.
* June 16 - Slurp still thirsty, gulps 1379 pages! No others.

So we'll take a break here at the five-week point and take note of the very different behavior of the top crawlers. Googlebot visits once and looks at a substantial number of pages but doesn't return for over a month. Slurp finds bad links and seems addicted to them, as it stops crawling good pages until it is told to lay off the bad liquor, er, that is, links, by getting robots.txt to slap Slurp to its senses. MSNbot visits looking for that robots.txt and won't crawl any pages until told what NOT to do by the robots.txt file. Teoma just crawls like crazy, takes breaks, then comes back for more.

This behavior may imitate the differing personalities of the software engineers who designed them. Teoma is tenacious and hard working. MSNbot is timid, needs instruction and some reassurance it is doing the right thing, and picks up pages slowly and carefully. Slurp has an addictive personality and performs erratically on a random schedule. Googlebot takes a good long look and leaves. Who knows whether it will be back and when.

Now let's look at indexing by each engine. As of this writing on July 7, each engine shows differing indexing behavior as well. Google shows no pages indexed, although it crawled 250 pages nearly two months ago. Yahoo has three pages indexed, in a clear aging routine that doesn't list any of the nearly 8,000 pages it has crawled to date (not all itemized above). MSN has 187 pages indexed while crawling fewer pages than any of the others. Ask Jeeves has crawled more pages to date than any search engine, yet has not indexed a single page.

Each of the engines will show the number of pages indexed if you use the query operator "site:publish101.com" without the quotes. MSN 187 pages, Ask none, Yahoo 3 pages, Google none.

The daily activity not listed above, in the three weeks since June 16, has not varied dramatically, with Teoma crawling a bit more than other engines, Slurp erratically up and down, and MSN slowly gathering 30 to 50 pages daily. Google is absent.

Linking campaign has been minimal, with posts to discussion lists, a couple of articles and some blog activity. Looking back over this time, it is apparent that a listing delay is actually quite sensible from the view of the search engines.

Our site restructuring and bobbled robots.txt implementation seem to have abruptly stalled crawling, but the indexing behavior of each engine displays a distinctly different policy for each major player.

The sandbox is apparently not just Google's playground, but it is certainly tiresome after nearly two months. I think I'd like to leave for home, have some lunch and take a nap now.

Back to class before we leave for the day, kiddies. What did we learn today? Watch early crawler activity, and be certain to implement robots.txt early and adjust often for bad bots.

Oh yes, and the sandbox belongs to all search engines.
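
For readers who want to check a robots.txt before the spiders do, here is a minimal sketch using Python's standard urllib.robotparser module. It is an illustration only, not the file used on the case-study site: the /mail/ path is a hypothetical stand-in for the email masking links (the real path is not given above), and the helper names build_parser and allowed are made up for this example. The rules shut BecomeBot out entirely and keep every other spider away from the masking links.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration. "/mail/" is an assumed path
# standing in for the email masking links; the real path is not known.
ROBOTS_TXT = """\
User-agent: BecomeBot
Disallow: /

User-agent: *
Disallow: /mail/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, agent: str, path: str) -> bool:
    """Report whether the named spider may fetch the given path."""
    return parser.can_fetch(agent, path)


parser = build_parser(ROBOTS_TXT)

# Print what each spider is permitted to fetch under these rules.
for agent in ("BecomeBot", "Slurp", "msnbot", "Googlebot"):
    for path in ("/articles/page1.html", "/mail/contact"):
        verdict = "allowed" if allowed(parser, agent, path) else "blocked"
        print(f"{agent:10} {path:22} {verdict}")
```

A check like this only shows what a well-behaved spider should do with the file. Whether a given bot actually honors it still has to be confirmed in the server logs, and the file has to sit at the root of each host, which is exactly what tripped us up when we moved to subdomains.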