Feed aggregators and robot exclusion

The world of feed aggregators has been compared to the HTML Internet of 1994 by Scott Rosenberg and others. We are starting to consume and make sense of this new data, but there are currently no well-defined methods or implementations of selective consumption. If I publish content, it is instantly available to feed aggregators and search companies with no restraints on its usage, regardless of licensing or robot preferences. If Major League Baseball launched a weblog for the private, non-commercial use of its audience, there would be nothing stopping companies from adding to or supplementing that content without the consent of the publisher. As aggregators and live search companies develop business models around your content, is there a need for methods of defining selective exclusion?

It’s just an HTTP request

A feed is repackaged content for alternate consumption. A feed aggregator is just another web browser. Web browsers such as Internet Explorer and Firefox present your markup free of advertising, but other services that alter or supplement that data may upset some people (remember Smart Tags?).
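At the protocol level there really is nothing special about the request. Here is a rough sketch in Python of an aggregator doing exactly what a browser does; the feed URL and the User-Agent string are hypothetical, and the User-Agent header is the only way the client introduces itself:

import urllib.request

# Hypothetical feed URL and aggregator name, for illustration only.
request = urllib.request.Request(
    "http://example.com/index.xml",
    headers={"User-Agent": "ExampleAggregator/1.0"},
)
with urllib.request.urlopen(request) as response:
    feed = response.read()  # the same bytes a web browser would receive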

Introduction of robots.txt

In 1994 the robots exclusion standard was introduced to allow site administrators to declare content off-limits to all or some crawlers. If you want to exclude a search engine such as Google or Yahoo! from indexing all or some of your site's pages, you add a few lines to the domain.tld/robots.txt file and the search engine should obey.

User-agent: ia_archiver
Disallow: /

The above code excludes the Internet Archive from crawling your site and storing your content. You could similarly exclude Google or any other search engine from crawling the content of your site. If the Internet Archive cached just your home page, it would still be in violation of the robots exclusion standard.

User agents of feed aggregators could be defined in the robots.txt file to include or exclude use of site content.
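For example, a publisher who wanted to keep one particular aggregator away from a feed could add something like the following; the user agent name and feed path here are hypothetical:

User-agent: ExampleFeedReader
Disallow: /index.xml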

Introduction of robots meta tag

In 1996 the robots meta tag was introduced as a method of defining robot preferences on a page-by-page basis. The robots meta tag allowed large domains with multiple authors to define their willingness to be crawled.

<meta name="robots" content="noindex,follow" />

The meta tag defined above tells a crawler not to index the contents of the given page but allows the crawler to follow its links. I use this meta tag on the front page of my weblog to instruct crawlers to grab the individual entry pages and not my main page. Robots meta tags are useful for large hosted publishing sites such as LiveJournal or Blogger that cannot define the preferences of millions of members inside the robots.txt file.
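As a rough sketch of how an HTML crawler might honor the tag (a real crawler would be far more careful about malformed markup and multiple meta elements), the check could look something like this in Python:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any robots meta tag on a page."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))

parser = RobotsMetaParser()
parser.feed('<meta name="robots" content="noindex,follow" />')
index_page = "noindex" not in parser.directives    # False for this page
follow_links = "nofollow" not in parser.directives  # True: links may be crawled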

The robots meta tag is useful only for live search engines that index HTML. Feed aggregators and feed-only search engines would not see this robots meta tag.

Current usage

YahooFeedSeeker, the feed engine behind My Yahoo!, is currently the only aggregator requesting my robots.txt file. Mikel Maron’s The World as a Blog requests my robots.txt before including my content in his application. HTTrack requests my robots.txt file before storing a copy of my content.
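A well-behaved aggregator only needs a few lines to make that check before each fetch. A minimal sketch in Python, assuming a hypothetical user agent and feed URL:

import urllib.robotparser

robots = urllib.robotparser.RobotFileParser()
robots.set_url("http://example.com/robots.txt")  # hypothetical site
robots.read()  # fetch and parse the exclusion rules

# Only fetch and store the feed if the rules allow this user agent.
if robots.can_fetch("ExampleAggregator", "http://example.com/index.xml"):
    print("allowed: fetch the feed")
else:
    print("disallowed: skip this site")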

What to do?

Are robot exclusion standards enough for the world of weblogs? In most cases an aggregator requests only individual files rather than traversing links. The main issue to me is excluding certain user agents from accessing my content. In centralized cases such as live search companies and online aggregators, an IP block does the trick. In decentralized cases such as client applications, things become more difficult, requiring rewrite conditions and mod_rewrite.
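As a sketch of both approaches in an Apache .htaccess file, where the IP address and the user agent name are hypothetical, the server-side blocks look something like this:

# Centralized services: block by address (192.0.2.1 is a placeholder)
Order allow,deny
Allow from all
Deny from 192.0.2.1

# Client applications: block by user agent (ExampleAggregator is a placeholder)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^ExampleAggregator [NC]
RewriteRule .* - [F]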

Assuming developers who store content offline would play by the rules and follow such a protocol, what is the best way to define a standard method of feed and HTML usage for companies such as Technorati, Feedster, Bloglines, Ranchero, and NewsGator?