The top internet forum and best known discussion site for website owners, WebmasterWorld has been dropped entirely from Google! A site with over a million pages seeing over 2 million page views a month just disappeared from search engines! How often have you been searching for the answer to issues affecting your web site when you found a thread in WebmasterWorld forums in the top search results? Never again will you see WebmasterWorld in search results until this bot ban is reversed.

The following URL actually picks up in the middle of the "FOO" forum discussion that runs over 40 pages (at the time of this writing), but there is a nice recap of issues leading the page there, covering much of the previous 23 pages of discussion.

Site owner Brett Tabke is being grilled, toasted and roasted by forum members for requiring logins (and assigning cookies) for all visitors and effectively locking out all search engine spiders. One big issue is the lack of effective site search now that you can't use a "site:WebmasterWorld.com" query to find WebmasterWorld info on specific issues with a Google search. Tabke is being slammed for not having an effective site search function in place before getting the site dropped.

WebmasterWorld has been entirely removed from Google after Tabke decided to use robots.txt to block all spiders with a universal blocking of all crawlers.

    User-agent: *
    Disallow: /

He has stated that this is due to rogue bots clogging and slowing site performance, scraping and re-using content, and searching for web reputation on individual companies within forum comments. I've a similar problem at my site on a much smaller scale. Crawlers can request pages at excessive rates that slow site performance for visitors. I've instituted a "Crawl-delay" for Yahoo and MSN, but rogue bots don't follow robots.txt instructions. (Google is more polite and requests pages at a more leisurely rate.)

Can't say I completely understand the WebmasterWorld action to ban all bots, or if it will achieve what Tabke is after, but it sure is creating a buzz in search engine circles. Lots of new links to WebmasterWorld will be generated by this extreme action and then, when access to search engine spiders is once again allowed from the robots.txt file, the site is likely to get re-indexed by all the engines once again in its entirety.

That will certainly be a heavy crawl schedule, with the top search engines re-indexing over a million pages, further loading the server and slowing the site for visitors. Perhaps Tabke plans a phased re-crawl by allowing Googlebot to index the site first, then Slurp (Yahoo), then MSN bot, then Teoma. It could be that he's created more work for himself in managing that re-crawl.

When this happens, there'll be thousands of new links from all the buzz and many articles discussing the bot ban, which will lead to WebmasterWorld becoming even more popular. Many have suggested the extreme move of banning all crawlers was simply a plan to gain public relations value, and links, but somehow I doubt it. Tabke claims the bot ban was done in a moment of frustration after his IP address ban list grew to over 4000 and management of rogue bots became a 10 hour a week job.

Barry Schwartz of SEO Roundtable interviewed Tabke after his dramatic decision to ban all bots. That interview clarifies much confusion, but still doesn't fully justify the dramatic move that effectively drops over one million pages from Google.

Web reputation crawlers are partially at play here as well. Corporations looking for online commentary about their company, both positive and negative, use web reputation services which crawl the web with reputation bots (crawling mostly blogs and news stories) looking for comments about their clients that may harm or help them. This may be of value to those corporations, but it needlessly slows site performance to no advantage for webmasters. If a site owner has trashed a company on their blog, they certainly don't want the "Web Reputation Police" crawling their content in order to sue them for libel.

Rogue bots are a serious problem, but they simply can't be controlled with robots.txt. Tabke said himself that even the cookies and login are useless against serious scraper bots, since the bot owner need only take the bot through the login manually, which assigns it a cookie, then let it loose within the forums to continue scraping automatically once past the gate. Rogue bots don't follow robots.txt instructions.

I've often wondered why anyone would go to such lengths to steal content and re-use it elsewhere, when it is unlikely to help them in any substantial way. Everyone knows that content is freely available at several article marketing archives, but the rogue bot programmers seek out content that ranks highly first - and fail to realize that there are multiple reasons for those high rankings. Those include off-page factors like quality, relevant, inbound, one-way links from highly ranked blogs and industry news sites. The bad boys out there stealing content won't get those inbound links - OR the high rankings on the sites where they've posted that scraped content.

Article archives experience scraper bots too. Bot programmers would rather write a bot program that collects content for them (to automatically dump it into another site) than carefully choose relevant work to post in sensible hierarchies of useful content. Automated scrape and dump laziness. What other reasons would you have for scraping free articles?

The other reason for scraping content would be to plaster it up across Adsense and Yahoo Publisher Network (YPN) sites as content to attract advertisements, in the hope of clickthroughs from visitors seeking valuable keyword phrases that generate contextual ads worth more to those webmasters. This convoluted thinking results in sites that don't end up ranking very well and don't generate much income for those lazy, bot programming nerds that create those types of sites.

There are several software and cloaking packages available to lazy webmasters that claim to gather keyword-phrase-based content from across the web via bots and scrapers, then publish that content to "mini-webs" automatically, with no work on your part required. Those pages are cloaked automatically, against search engine best practices, and then Adsense and YPN ads are plastered over those automatically created pages, yes, you guessed it - automatically. Serious search engine sp*m, cloaked, so search engines don't know.

One last reason for content scrapers is to find content to use on blogs, in the latest craze of filling fake blogs (also known as Spam Blogs or Splogs) with content, then pinging the blog search services to notify them of new posts. Newly scraped content is constantly added to the blogs, and the pinging suggests that the blog is prolific and should be highly ranked. This is closely related to, and promoted by, the above mentioned article scrapers. This is the latest type of spam that is being combatted by search engines. It seems that search engine sp*m is just as serious as emailed sp*m.

Good luck to WebmasterWorld's effort to ban those rogue bots and scrapers!