One Million Pages of WebmasterWorld Dropped by Google as Forum Bans Bots

WebmasterWorld, the top internet forum and best-known discussion site for website owners, has been dropped entirely from Google! A site with over a million pages, seeing over 2 million page views a month, just disappeared from search engines! How often have you been searching for the answer to issues affecting your web site when you found a thread in the WebmasterWorld forums in the top search results? You won't see WebmasterWorld in search results again until this bot ban is reversed.

The following URL actually picks up in the middle of the "FOO" forum discussion, which runs over 40 pages (at the time of this writing), but the page it leads to has a nice recap of the issues, covering much of the previous 23 pages of discussion.

Site owner Brett Tabke is being grilled, toasted and roasted by forum members for requiring logins (and assigning cookies) for all visitors and effectively locking out all search engine spiders. One big issue is the lack of effective site search now that you can't use a "site:WebmasterWorld.com" query to find WebmasterWorld info on specific issues with a Google search. Tabke is being slammed for not having an effective site search function in place before getting the site dropped.

WebmasterWorld has been entirely removed from Google after Tabke decided to use robots.txt to block all spiders with a universal blocking of all crawlers.

User-agent: *
Disallow: /

He has stated that this is due to rogue bots clogging and slowing site performance, scraping and re-using content, and searching for web reputation on individual companies within forum comments. I have a similar problem at my site on a much smaller scale. Crawlers can request pages at excessive rates that slow site performance for visitors. I've instituted a "Crawl-delay" for Yahoo and MSN, but rogue bots don't follow robots.txt instructions. (Google is more polite and requests pages at a more leisurely rate.)

I can't say I completely understand the WebmasterWorld action to ban all bots, or whether it will achieve what Tabke is after, but it sure is creating a buzz in search engine circles. Lots of new links to WebmasterWorld will be generated by this extreme action and then, when search engine spiders are once again allowed in by the robots.txt file, the site is likely to get re-indexed by all the engines in its entirety.

That will certainly be a heavy crawl schedule, with the top search engines re-indexing over a million pages, further loading the server and slowing the site for visitors. Perhaps Tabke plans a phased re-crawl by allowing Googlebot to index the site first, then Slurp (Yahoo), then MSN bot, then Teoma. It could be that he's created more work for himself in managing that re-crawl.

When this happens, there'll be thousands of new links from all the buzz and the many articles discussing the bot ban, which will lead to WebmasterWorld becoming even more popular. Many have suggested the extreme move of banning all crawlers was simply a plan to gain public relations value, and links, but somehow I doubt it. Tabke claims the bot ban was done in a moment of frustration after his IP address ban list grew to over 4000 and management of rogue bots became a 10 hour a week job.

Barry Schwartz of Search Engine Roundtable interviewed Tabke after his dramatic decision to ban all bots. That interview clarifies much confusion, but still doesn't fully justify the dramatic move that effectively drops over one million pages from Google.

Web reputation crawlers are partially at play here as well. Corporations looking for online commentary about their company, both positive and negative, use web reputation services which crawl the web with reputation bots (crawling mostly blogs and news stories) looking for comments about their clients that may harm or help them. This may be of value to those corporations, but it needlessly slows site performance to no advantage for webmasters. If a site owner has trashed a company on their blog, they certainly don't want the "Web Reputation Police" crawling their content in order to sue them for libel.

Rogue bots are a serious problem, but they simply can't be controlled with robots.txt. Tabke said himself that even the cookies and login are useless against serious scraper bots, as the bot owner simply has to take the bot through the login manually, which assigns it a cookie, then let it loose within the forums to continue scraping away automatically once past the gate. Rogue bots don't follow robots.txt instructions.

I've often wondered why anyone would go to such lengths to steal content and re-use it elsewhere, when it is unlikely to help them in any substantial way. Everyone knows that content is freely available at several article marketing archives, but the rogue bot programmers seek out content that ranks highly first - and fail to realize that there are multiple reasons for those high rankings. Those include off-page factors like quality, relevant, inbound, one-way links from highly ranked blogs and industry news sites. The bad boys out there stealing content won't get those inbound links - OR the high rankings on the sites where they've posted that scraped content.

Article archives experience scraper bots too. Bot programmers would rather write a bot program that collects content for them (to automatically dump it into another site) than carefully choose relevant work to post in sensible hierarchies of useful content. Automated scrape and dump laziness. What other reasons would you have for scraping free articles?

The other reason for scraping content would be to plaster it up across Adsense and Yahoo Publisher Network (YPN) sites as content to attract advertisements, in the hope of clickthroughs from visitors seeking valuable keyword phrases that generate contextual ads worth more to those webmasters. This convoluted thinking results in sites that don't end up ranking very well and don't generate much income for the lazy, bot-programming nerds who create those types of sites.

There are several software and cloaking packages available to lazy webmasters that claim to gather keyword-phrase-based content from across the web via bots and scrapers, then publish that content to "mini-webs" automatically, with no work on your part required. Those pages are cloaked automatically, against search engine best practices, and then Adsense and YPN ads are plastered over those automatically created pages, yes, you guessed it - automatically. Serious search engine sp*m, cloaked, so search engines don't know.

One last reason for content scrapers is to find content to use on blogs. In the latest craze, scraped content is used to fill fake blogs (also known as Spam Blogs or Splogs), which then ping the blog search services to notify them of new posts. Newly scraped content is constantly added to the blogs, and the pinging suggests that the blog is prolific and should be highly ranked. This is closely related to, and promoted by, the above-mentioned article scrapers. This is the latest type of spam being combated by search engines. It seems that search engine sp*m is just as serious as emailed sp*m.

Good luck to WebmasterWorld's effort to ban those rogue bots and scrapers!
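
For readers who want to see what the robots.txt rules discussed in this article mean to a well-behaved crawler, here is a minimal Python sketch using the standard library's urllib.robotparser. It is only an illustration, not anything WebmasterWorld or the search engines actually run. The first rule set is the universal block quoted in the article; the Crawl-delay value of 10, the short user-agent tokens, the example.com URL, and the "phase one" file that lets only Googlebot back in are all my own assumptions.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text):
    """Parse robots.txt text the way a polite crawler would read it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The universal block quoted in the article.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Assumed example: slow down Yahoo (Slurp) and MSN, allow everything else.
# The delay of 10 seconds is a made-up value for illustration.
CRAWL_DELAY = """\
User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 10

User-agent: *
Disallow:
"""

# Assumed example of the first step of a phased re-crawl:
# Googlebot is let back in, every other crawler stays blocked.
PHASE_ONE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Placeholder URL, not a real forum thread.
PAGE = "http://www.example.com/forum/thread.htm"

blocked = load_rules(BLOCK_ALL)
delayed = load_rules(CRAWL_DELAY)
phased = load_rules(PHASE_ONE)

if __name__ == "__main__":
    print("Block all, Googlebot allowed?", blocked.can_fetch("Googlebot", PAGE))
    print("Crawl-delay seen by Slurp:", delayed.crawl_delay("Slurp"))
    print("Crawl-delay seen by Googlebot:", delayed.crawl_delay("Googlebot"))
    print("Phase one, Googlebot allowed?", phased.can_fetch("Googlebot", PAGE))
    print("Phase one, Slurp allowed?", phased.can_fetch("Slurp", PAGE))
```

Keep in mind that this only models a crawler that chooses to obey the file. A rogue bot never asks these questions in the first place, which is exactly the problem described above.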