Feed exclusion using categories

Many current and future feed publishers create content that is targeted at individuals for personal use and is not meant for widespread consumption. You may have a customized feed from Netflix, FeedBurner, or WordPress.com to track your movie queue, subscriber count, or blog stats, respectively. Some feeds offer privacy through obfuscated URLs, and others rely on a one-time token exchange at the time of subscription. Now that online aggregators share their back ends with search and other methods of open discovery, how can a feed publisher opt out of a public index?

One solution using existing element sets may be to overload the category element in RSS 2.0 and Atom 1.0. Using the domain attribute in RSS or the scheme attribute in Atom, a publisher can indicate the type of data communicated at either the feed level or the individual item level. The first line below is the RSS form and the second is the Atom form.

<category domain="http://www.robotstxt.org/wc/meta-user.html">noindex</category>
<category term="noindex" scheme="http://www.robotstxt.org/wc/meta-user.html" />

According to the RSS 2.0 and Atom specifications, the domain and scheme attribute values identify the “categorization” in use, so this use case seems to fall within the specified use. Multiple values can be specified using multiple category elements.
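To show how a subscription agent might read these values, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function names robots_values and is_noindex are my own for illustration and are not part of any specification or library. It only looks at category elements that are direct children of the element you pass in (a channel, item, feed, or entry).

```python
import xml.etree.ElementTree as ET

# The domain/scheme value proposed in this post.
ROBOTS_SCHEME = "http://www.robotstxt.org/wc/meta-user.html"
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def robots_values(element):
    """Collect robots directives from the direct category children of
    an RSS channel/item or an Atom feed/entry element.
    (Hypothetical helper, named for this example only.)"""
    values = set()
    # RSS 2.0 form: <category domain="...">noindex</category>
    for cat in element.findall("category"):
        if cat.get("domain") == ROBOTS_SCHEME and cat.text:
            values.add(cat.text.strip().lower())
    # Atom 1.0 form: <category term="noindex" scheme="..." />
    for cat in element.findall(ATOM_NS + "category"):
        if cat.get("scheme") == ROBOTS_SCHEME and cat.get("term"):
            values.add(cat.get("term").strip().lower())
    return values


def is_noindex(element):
    """True if the element carries a noindex category."""
    return "noindex" in robots_values(element)
```

An aggregator could call is_noindex on the channel or feed element first, then on each item or entry, and skip public indexing for anything that returns True.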

A subscription agent could also check the publishing site’s robots.txt and the meta robots value of the feed’s alternate HTML page for a more complete picture. Some aggregators take the position that because a feed is requested on behalf of a user and not by a spider, they should not need to check these extra locations. Adding robot exclusion to the feed itself seems like the most reliable way to operate.
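For the robots.txt part of that check, Python's standard urllib.robotparser module can do the matching. The sketch below assumes the agent has already downloaded the robots.txt text; the function name allowed_by_robots_txt, the user agent string, and the example.com URLs are placeholders of mine. It does not cover the meta robots value in the alternate HTML.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots_txt(robots_txt, user_agent, feed_url):
    """Return True if robots_txt permits user_agent to fetch feed_url.
    (Hypothetical helper; robots_txt is the already-fetched file body.)"""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access
    # happens inside this function.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, feed_url)


# Placeholder rules and URLs for illustration only.
EXAMPLE_ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
private_ok = allowed_by_robots_txt(
    EXAMPLE_ROBOTS_TXT, "ExampleAggregator", "http://example.com/private/feed.xml"
)
public_ok = allowed_by_robots_txt(
    EXAMPLE_ROBOTS_TXT, "ExampleAggregator", "http://example.com/public/feed.xml"
)
```

An agent might treat a feed as excluded from the public index if either this check fails or the feed carries a noindex category.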

What do you think?
