Amazon DevCon

Amazon is hosting DevCon this week for its software development team. If you would like to join the developer chat you may have a chance to ask questions of Guido van Rossum and other speakers on today’s schedule. The web services team plans to post video and possibly audio, but the summaries are an interesting enough read regardless.

Some excerpts:

  • Joel Spolsky: “[P]eople won’t understand where their emotional reactions are coming from. Use this info in real life. Do same thing with software, put people in control, good emotional response, good physical feel, remind them of mom. Make it pretty, get good initial first reaction.”
  • Eric Neustadter: “In just under two months, users spent over 69 million hours playing Halo 2 on Xbox Live.”
  • Rael Dornfest: “A hacker is a tinkerer, not a bad guy. An experimenter, take stuff apart and see what happens. Put on a brave front, pop the top, see what happens. Unlike MacGyver, we usually don’t make things blow up.”
  • Michael Tiemann: “Developer #388 on Apache to get to 100%, top 20 guys are 80%. Most proprietary projects top off at 30-35 people. Consequence is that the marginal activity which ends up as bug fixes, downstream products, and so forth happens for OSS, not for proprietary.”

Bill Joy joins Kleiner, Perkins, Caufield & Byers

Bill Joy is the newest partner at Kleiner, Perkins, Caufield & Byers.

John Doerr of KPCB said “It’s our tradition every year end to ask Bill what innovations, what important ideas are just over the horizon. Last month we agreed we should work together.” Doerr added, “Whether the innovation is in internet web services, software, architectures, energy, material science, info/life sciences – or entirely new fields – Bill’s insights and relationships are respected and valued.”

Big news!

Feed aggregators and robot exclusion

The world of feed aggregators has been compared to the HTML Internet of 1994 by Scott Rosenberg and others. We are starting to consume and make sense of this new data, but there are currently no well-defined methods or implementations of selective consumption. If I publish content it is instantly available to feed aggregators and search companies with no restraints on its usage, regardless of licensing and robot preferences. If Major League Baseball launched a weblog for the private, non-commercial use of its audience, there would be nothing stopping companies from adding to or supplementing that content without the consent of the publisher. As aggregators and live search companies develop business models around your content, is there a need for methods of defining selective exclusion?

It’s just an HTTP request

A feed is repackaged content for alternate consumption. A feed aggregator is just another web browser. Web browsers such as Internet Explorer and Firefox present your markup free of advertising, but other services that alter or supplement that data may upset some people (remember Smart Tags?).

Introduction of robots.txt

In 1994 the robots exclusion standard was introduced to allow site administrators to declare content off-limits to all or some crawlers. If you want to exclude a search engine such as Google or Yahoo! from indexing all or some of your site’s pages, you add a few lines to the domain.tld/robots.txt file and the search engine should obey.

User-agent: ia_archiver
Disallow: /

The above code excludes the Internet Archive from crawling your site and storing your content. You could similarly exclude Google or any other search engine from crawling the content of your site. If the Internet Archive cached just your home page, it would still be in violation of the robots exclusion standard.

User agents of feed aggregators could be defined in the robots.txt file to include or exclude use of site content.
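
As a minimal sketch, a publisher could single out a specific aggregator by the user agent it sends, assuming that aggregator requests and honors robots.txt. YahooFeedSeeker (mentioned in the current usage section below) is used here purely as an illustration:

User-agent: YahooFeedSeeker
Disallow: /

The lines above ask that one aggregator to stay away from the entire site while leaving every other user agent unaffected; swapping Disallow: / for specific paths would limit the exclusion to parts of the site.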

Introduction of robots meta tag

In 1996 the robots meta tag was introduced as a method of defining robot preferences on a page-by-page basis. The robots meta tag allowed large domains with multiple authors to define their willingness to be crawled.

<meta name="robots" content="noindex,follow" />

The meta tag defined above tells a crawler not to index the contents of the given page but allows the crawler to follow its links. I use this meta tag on the front page of my weblog to instruct crawlers to grab the individual entry pages and not my main page. Robots meta tags are useful for large hosted publishing sites such as LiveJournal or Blogger that cannot define the preferences of millions of members inside the robots.txt file.

The robots meta tag is useful only for live search engines that index HTML. Feed aggregators and feed-only search engines would not see this robots meta tag.

Current usage

YahooFeedSeeker, the feed engine behind My Yahoo!, is currently the only aggregator requesting my robots.txt file. Mikel Maron’s The World as a Blog requests my robots.txt before including my content in his application. HTTrack requests my robots.txt file before storing a copy of my content.

What to do?

Are robot exclusion standards enough for the world of weblogs? In most cases there are only requests to individual files and not a link traversal. The main issue to me is excluding certain user agents from accessing my content. In centralized cases such as live search companies and online aggregators, an IP block does the trick. In decentralized cases such as client applications, things become more difficult, requiring rewrite conditions and mod_rewrite.
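
As a rough sketch of what those rewrite conditions might look like in an Apache .htaccess file, assuming the client application sends a recognizable User-Agent header (the “ExampleFeedReader” string is a hypothetical placeholder, not a real product):

# Return 403 Forbidden to requests from a matching feed reader user agent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ExampleFeedReader [NC]
RewriteRule .* - [F]

Of course nothing stops a client application from changing its User-Agent string, so this only keeps out well-behaved software.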

Assuming developers storing content offline would play by the rules and follow such a protocol, what is the best way to define a standard method of feed and HTML usage for companies such as Technorati, Feedster, Bloglines, Ranchero, and NewsGator?

Dear Technorati: Play well with others

This week’s announcements from Technorati have been mixed with endorsements and a message that Technorati may work better with one weblog platform than another.

The Technorati tags help file contains an inline advertisement for TypePad. I could understand featuring Flickr and del.icio.us as the most prominent current uses of tags by the online community, but endorsing a weblog hosting company in your help file stinks of paid placement.

Adding rel="tag" to any link should be enough to build a tag library for links off the link text. Technorati instead grabs the last part of the URL after the “/” and treats it as a post tag. I was hoping for a decentralized del.icio.us implementation.

Claiming your weblog using the Blogger or MetaWeblog API after discovering the availability of such services via RSD is cool, and may have been specially built for Tucows, but Movable Type, TypePad, Blogger, and other weblog platforms can use the feature just as well.

As you continue to grow and figure out how to make money, please remember that the users and content producers you rely on publish with a variety of tools, and they may look to you for guidance on how to create more structured content for your benefit or to send some business your way.

If you feel strongly enough, the TechnoratiBot visits from IP address 209.237.230.104. If you ever wanted to block Technorati, adding “Deny from 209.237.230.104” to your .htaccess file would do the trick. I see no requests from Technorati for robots.txt, so you cannot exclude it through the robots exclusion standard, and Technorati does not seem to obey the robots meta tag. Currently an IP block is the only way to stop Technorati from indexing your content.

Update: I chatted with Bradley and Derek from Technorati this morning. The TypePad link was inserted as a tip to get people started with weblogs. The link has been removed from the staging server instance of the page. Excluding pages from Technorati indexing was not on their radar since they figured a ping is an explicit request to crawl. A new bug was created to look at ways to exclude crawlers from indexing your content, such as the robots meta tag. Hosted services such as Blogger, LiveJournal, and Six Apart cannot define robot exclusions in robots.txt for their millions of users and stay under the file size limit of about 50 kilobytes.

Technorati launches Technorati Tags

Technorati launched Technorati Tags, a new, decentralized method of categorizing posts that also integrates with popular online communities del.icio.us and Flickr.

Browse tags

The front page displays a sampling of current popular tags sorted by UTF value. A tag’s popularity is expressed through font size: the more popular tags appear larger.

Tag page

Technorati tag page for iPod screenshot

Each tag has a page aggregating photos from Flickr, weblog posts indexed by Technorati, and del.icio.us links. Each page contains ten photographs, twenty posts, and fifteen links. Tag pages live at http://www.technorati.com/tag/ followed by the tag of your choice (no spaces).

Join the game of tag

How can you be sure your weblog posts are included in the Technorati tag index? If you already use Flickr or del.icio.us, Technorati will add the tags on your public photos and links to its index. If you would like to configure your weblog for inclusion in the Technorati tags index, you can check a few things.

Your RSS feed and Technorati tags

Your RSS feed should have a category value for each item. I added a Technorati tags domain attribute to each category element: <category domain="http://www.technorati.com/tag/">[tagname]</category>.

If you use Movable Type, your RSS feed uses the MTEntryCategory template tag, the leaf node of your primary category. If you use WordPress, categories are included by default in your RSS and Atom feeds.
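
Put together, a single item in an RSS 2.0 feed carrying a tag might look like the minimal sketch below; the title, link, and “music” tag are placeholder values:

<item>
  <title>An example post</title>
  <link>http://example.org/archives/an-example-post</link>
  <category domain="http://www.technorati.com/tag/">music</category>
</item>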

Your Atom feed and Technorati tags

The new Atom format specification has a category construct, but for version 0.3 there is no built-in support for categories. Technorati is most likely using the Dublin Core subject element for this purpose.

If you use WordPress, the Dublin Core subject element is already defined. If you use Movable Type and would like to add categories to your Atom 0.3 feed, you need to do a little work.

To add categories to your Atom feed using the Dublin Core subject element, you must first declare the namespace in the feed element and then add a dc:subject element to each entry; a combined sketch follows the two steps below.

  1. Change your feed element.
    <feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
  2. Inside each entry, between <entry> and </entry>, add <dc:subject>[tagname]</dc:subject>.
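
Put together, a minimal sketch of the result might look like the following; the entry details are placeholders, and other required Atom 0.3 elements such as id, issued, and modified are omitted for brevity:

<feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>An example weblog</title>
  <entry>
    <title>An example post</title>
    <link rel="alternate" type="text/html" href="http://example.org/archives/an-example-post" />
    <dc:subject>music</dc:subject>
  </entry>
</feed>
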
Any link can be a tag

To tag your links just add rel="tag" to any link on the main page of your weblog. Example: <a href="http://www.apple.com/ipod/" rel="tag">music</a> would associate a tag of “music” with Apple’s iPod page.

Kevin Marks of Technorati, the man behind the crawler, tells me the example above would tag your post with “ipod” and not “music.” Kevin says if you want to tag a link, use del.icio.us.

Can more groups come and play?

Hopefully Technorati is open to working with other sites utilizing tags for user-defined taxonomies. Buzznet has buzzwords and 43 Things has its lifestyle tags that could be integrated into the Technorati tag ecosystem.

Marissa Mayer on Google user experience

Marissa Mayer of Google spoke at PARC on Tuesday night. Marissa is the product manager for Google.com and was formerly the technical lead for the user-interface team. Alan Williamson provides a good summary of the event.

Some interesting bits of information:

  • The Google copyright statement was added to the bottom of the home page as an end-of-page marker after users kept waiting, expecting more page content to load.
  • If at least 20% of people use a feature, Google will include it in the full site. At least 5% of users need to use an advanced feature before it is included in Google’s advanced search page.
  • “I’m Feeling Lucky” is hardly ever used, but users view the button as a comfort and a part of the Google experience so it has not been removed.
  • Gmail designers discovered there were approximately six types of e-mail users. Gmail was designed around these six usage cases and used within Google for two years prior to its public announcement.