Exclusive: Google to offer feed API

Google plans to offer a feed reader API to allow third-party developers to build new views of feed data on top of Google’s backend. The new APIs will include synchronization, feed-level and item-level tagging, per-item read and unread status, as well as rich media enclosure and metadata handling. Google Reader PM Jason Shellen and engineer Chris Wetherell both confirmed Google’s plans after I posted my reverse-engineering analysis of the Google Reader backend.

The new APIs will allow aggregator developers to build new views and interactions on top of Google’s data. Google has at least two additional Google Reader views running on current development builds.

Google may offer public access to the feed API as early as next month. Shellen said the team wants to nail a few more bugs before publicly making the service available to the world.

Hopefully the Google team is considering offering API hosting and processing similar to Alexa’s platform. Hosting personal homepage widgets on Google Base is a good start, but imagine if developers could interact with data via JavaScript on the same domain as the service!

Google Desktop would be an ideal first implementation of the new APIs, centralizing Google’s feed retrieval and reducing load on individual servers. Google’s feed grabber, FeedFetcher, currently collects content for Google Reader and the Google personalized homepage.

Google’s new offering competes directly with NewsGator’s synchronization APIs but is easier to code against (no SOAP required). Google currently does not have the same reach across devices as NewsGator, but an easy-to-use API from the guys who brought you the Blogger API and “Blog This!” might really shake up the feed aggregator ecosystem.

WordPress 2.0

WordPress 2.0 is now available for download from the newly redesigned WordPress.org. The new release includes many behind-the-scenes changes as well as some front-end AJAX goodness. A Subversion update is the best way to upgrade your existing installation.

My favorite new features:

  • Improved user permissions that allow you to select a role instead of a number.
  • Better importers. WordPress importers can log in to other blogging services and suck out your data, comments and all!
  • Abstracted data layer allows future support of various databases and makes plugin development a bit easier.
  • Rich post authoring through a WYSIWYG interface and drag-and-drop post components. You can even add new categories directly from the posting interface. Users may actually prefer to author their posts through the default interface instead of a desktop tool.
  • Photo attachments are generated as sub-pages, each complete with its own comments and tracking.
  • Persistent cache. Frequent database queries are cached to disk allowing for faster response times, especially on high-traffic sites.

WordPress 2.0 also includes the Akismet anti-spam plugin in the default install.

Google Reader API

Google Reader is an online feed aggregator with heavy use of JavaScript and pretty quick loading of the latest feed data from around the web. Google’s AJAX front-end styles back-end data published in the Atom syndication format. The data technologies powering Google Reader can easily be used and extended by third-party feed aggregators for use in their own applications. I will walk you through the (previously) undocumented Google Reader API.

Update 10:40 p.m.: Jason Shellen, PM of Google Reader, called me to let me know that Google built its feed API first and the Google Reader application second as a demonstration of what could be done with the underlying data. Jason confirmed my documentation below is very accurate and Google plans to release a feed API “soon,” perhaps within the next month! Google Reader engineer Chris Wetherell has also confirmed the API in the comments below.

A reliable feed parser managed by a third party lowers the barrier to entry of new aggregator developers. Google and its team of engineers and server clusters can handle the hard work of understanding feeds in various formats and states of validation, allowing developers to focus on the interaction experience and other differentiating features. You can also retrieve and synchronize feed subscription lists with an established user base that could be in the millions, providing a better experience for users on multiple devices and platforms. Google Reader’s “lens” provides only one view of the available data.

Google Reader users are assigned a 20-digit user ID used throughout Google’s feed system. No separate session setup is required to access this member-specific data: user-specific data is accessible using the google.com cookie named “SID.”

Feed retrieval

/reader/atom/feed/

Google converts all feed data to Atom regardless of its original publication format. All RSS post content appears in the summary element. Unlike the My Yahoo! backend, I found no additional metadata indicating whether a feed contains full posts, but Google does publish content data where available.

You may request any feed from the Google Reader system by appending the feed’s URL to the /reader/atom/feed/ endpoint shown above.

You may specify the total number of feed entries to retrieve using the n parameter. The default number of feed items returned is 20 (n=20).
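As a sketch of that URL structure (the endpoint and n parameter are from my observations above; percent-encoding the appended feed URL is my assumption, not documented behavior):

```python
from urllib.parse import quote

# Endpoint observed in Google Reader traffic; percent-encoding of the
# appended feed URL is an assumption based on my testing.
READER_BASE = "http://www.google.com/reader/atom"

def feed_url(feed, n=20):
    """Build a Google Reader Atom URL for a feed, capping the number
    of returned entries with the n parameter (service default: 20)."""
    return "%s/feed/%s?n=%d" % (READER_BASE, quote(feed, safe=""), n)

url = feed_url("http://www.example.com/index.xml", n=50)
```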

Google strips off all the data it does not render in Reader. Stripped data includes namespaced data such as Apple iTunes podcast data and Yahoo! Media RSS, additional author data such as e-mail and home URL, and even copyright data.

Subscription list

/reader/atom/user/[user id]/pref/com.google/subscriptions

Google Reader’s feed subscription list contains a user’s current feed subscriptions as well as past deleted subscriptions. Each feed is contained in an entry complete with feed URL, published and updated dates, and user-specific tags, if present. Current subscriptions are categorized as a reading list state. You may request the full list of feeds by setting the complete parameter to true.

Here is a copy of my Google Reader subscription list with my user ID zeroed out. I am not subscribed to my RSS feed (index.xml) and I have added tags to my Atom feed. Each listed feed contains an author element which appears to be empty regardless of declarations within the original feed. Perhaps Google plans to add some feed claiming services, but its own Google blog has no affiliated author information.

Reading list

/reader/atom/user/[user id]/state/com.google/reading-list

My favorite feature of the Google Reader backend is direct access to a stream of unread entries across all subscribed feeds. Google will output the latest in a “river of news” style data view.

Here is a sample from my limited subscription set. You may specify the total number of entries you would like Google to return using the n parameter — the default is 20 (n=20).
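A hedged sketch of such a request, assuming the SID cookie described earlier is all the authentication required and that the paths hang off www.google.com:

```python
import urllib.request

def reading_list_request(user_id, sid, n=20):
    """Build a request for a user's river-of-news reading list,
    attaching the google.com "SID" cookie for authentication."""
    url = ("http://www.google.com/reader/atom/user/%s"
           "/state/com.google/reading-list?n=%d" % (user_id, n))
    req = urllib.request.Request(url)
    req.add_header("Cookie", "SID=" + sid)
    return req

req = reading_list_request("01234567890123456789", "example-sid", n=50)
```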

Read items only

/reader/atom/user/[user id]/state/com.google/read

You can retrieve a listing of read items from Google Reader. This could be useful if, for example, you want to analyze the last 100 items a user has read to pull out trends or enable complete search. You may adjust the number of items retrieved using the n parameter; the default is 20 (n=20).

Reading list by tag

/reader/atom/user/[user id]/label/[tag]

You may also view a list of recently published entries limited to feeds of a certain tag. If you have tagged multiple feeds as “marketing” you might want to request just the latest river of news for those marketing feeds. The returned feed contains both read and unread items. Read items are categorized as read (state/com.google/read) if you would like to hide them from view. The number of returned results may be adjusted using the n parameter.

Starred items only

/reader/atom/user/[user id]/state/com.google/starred

Google Reader users can flag an item with a star. These flagged items are exposed as a list of entries with feed URL, tags, and published/updated times included. You may specify the total number of tagged entries to return using the n parameter — the default value is 20 (n=20).

Google treats starred items as a special type of tag and the output therefore matches the tag reading list.

Add or delete subscriptions

/reader/api/0/edit-subscription

You may add any feed to your Google Reader list using the Google Reader API via an HTTP POST.

  • /reader/api/0/edit-subscription — base URL
  • ac=[“subscribe” or “unsubscribe”] — requested action
  • s=feed%2F[feed URL] — your requested subscription
  • T=[command token] — expiring token issued by Google. Obtain your token at /reader/api/0/token.
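Putting those parameters together, a minimal sketch of the POST body (the parameter names are from the list above; standard form encoding is my assumption):

```python
from urllib.parse import urlencode

def edit_subscription_body(feed, action, token):
    """Form-encoded body for POST /reader/api/0/edit-subscription.

    action is "subscribe" or "unsubscribe"; token is the expiring
    value issued by /reader/api/0/token.
    """
    assert action in ("subscribe", "unsubscribe")
    return urlencode({"ac": action, "s": "feed/" + feed, "T": token})

body = edit_subscription_body("http://example.com/atom.xml",
                              "subscribe", "tok123")
```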

Add tags

/reader/api/0/edit-tag

You may also add tags to any feed or individual item via an HTTP POST.

  • /reader/api/0/edit-tag — base URL
  • s=feed%2F[feed URL] — the feed URL you would like to tag
  • i=[item id] — the item ID presented in the feed. Optional and used to tag individual items.
  • a=user%2F[user ID]%2Flabel%2F[tag] — requested action: add a tag to the feed, item, or both.
  • a=user%2F[user ID]%2Fstate%2Fcom.google%2Fstarred — flag or star a post.
  • T=[special scramble] — three pieces of information about the user to associate with the new tag. Security unknown and therefore unpublished.
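A sketch of the tagging body built from those parameters. Since the real “T” scramble is undocumented, the token handling below simply mirrors the edit-subscription call and is an assumption:

```python
from urllib.parse import urlencode

def edit_tag_body(user_id, tag, feed, item_id=None, token="TOKEN"):
    """Form-encoded body for POST /reader/api/0/edit-tag.

    Tags a whole feed, or a single item when item_id is given. The
    token value is a placeholder; the real "T" parameter is an
    undocumented scramble of user information.
    """
    params = {
        "s": "feed/" + feed,                       # feed to tag
        "a": "user/%s/label/%s" % (user_id, tag),  # add-label action
        "T": token,
    }
    if item_id is not None:
        params["i"] = item_id                      # tag one item only
    return urlencode(params)

body = edit_tag_body("01234567890123456789", "marketing",
                     "http://example.com/atom.xml")
```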

Conclusion

It is possible to build your own feed reader on top of Google’s data with targeted server calls. You can power an application both online and offline using Google as your backend and focus on building new experiences on top of the data. Advanced functionality is available with a numeric Google ID and some variable tweaks.

Google has built the first application on top of this data API, the Google Reader lens, and judging from their choice of URLs the lens may not be Google’s last application built on this data set. I like the openness of the data calls and think the Google Reader APIs are simple enough to bootstrap a few new applications, whether within Google or from third-party developers.

Update: The Google Feed API now provides official Google endpoints for most of the data explained in this 2005 post.

Bloglines 3.0

Bloglines logo

Bloglines switched its user-agent from “Bloglines 2.1” to “Bloglines 3.0-rho” on Tuesday afternoon. My guess is that “rho” are the initials of a Bloglines search engineer who pushed his test code live without changing the user-agent, but there could be some big changes coming in the near future.

If you are a FeedBurner user you may have noticed a large drop in your subscriber statistics over the past day (hat tip: jasonspage). I just heard back from FeedBurner CEO Dick Costolo and learned that FeedBurner had previously ignored this Bloglines user-agent because it was just a development test; they will be changing their site code this morning to reflect the new behavior of Bloglines.


Correcting Kottke

Popular blogger Jason Kottke recently posted an entry criticizing blog search companies for the incompleteness of their results compared to his internal search tool powered by Movable Type. I happen to know both Movable Type and blog search pretty well, so I decided to dig into the data and see where search engines might have missed the mark in the interest of improving quality. I found that Jason’s criticisms were a bit unfounded yet may still alter the perceptions of many people who are heavily influenced by what they read on his blog.

Jason found more results searching his installation of Movable Type 3.15 than he was able to find using many search engines. I manually checked every page on Jason Kottke’s Movable Type install for mention of the word “Freakonomics” and found some disconnects between what was presented to Jason in his Movable Type search results page and what is presented to the world at large, including search engines.

Jason’s installation of Movable Type is located at Yoink.org. I searched all blogs on his Movable Type installation for “Freakonomics” over the past 6 months (Update: Jason has since deactivated public search). I chose 6 months because Technorati has only been indexing feeds since June and I wanted a good base for comparison.

Movable Type returned Jason’s most recent blog post as well as 9 posts from his link blog.

  1. The economics of sex… posted on December 12. The term “Freakonomics” appears nowhere in the entire source code of the page.
  2. Profile by Michael Lewis of Mike Leach posted on December 7. There is a link to freakonomics.com near the end of the post but the word “Freakonomics” appears nowhere in the post text.
  3. A pair of Boston economists… posted on December 5. “Freakonomics” appears nowhere in the entire source code of the page.
  4. …People who don’t clean up after their dogs.. posted on October 7. “Freakonomics” appears nowhere in the entire source code of the page.
  5. Unique Planned Parenthood pledge drive posted on September 19. There is a link to freakonomics.com at the end of the post but the word “Freakonomics” appears nowhere in the post text.
  6. Oakland A’s are rolling posted on August 16. There is a link to freakonomics.com near the end of the post but the word “Freakonomics” appears nowhere in the text of the post.
  7. Crime fell because of rap music posted on August 9. There is a link to freakonomics.com in the post but the word “Freakonomics” appears nowhere in the text of the post.
  8. Where did all the crack go posted on August 8. There is a link to freakonomics.com at the beginning of the post but the word “Freakonomics” appears nowhere in the text of the post.
  9. Economics of poker written on July 18. The word “Freakonomics” appears nowhere in the entire source code of the page.

4 out of the 9 posts surfaced by Movable Type’s search functionality contained no mention of “Freakonomics” anywhere in the output post. The word “Freakonomics” may occur somewhere in a field not output to the final page, such as keywords, excerpt, or extended entry, but there is no content that anyone could expect a search engine to match for the desired query. Jay Allen wrote the search engine built into Movable Type and I’m sure he could answer any questions about your individual install.

5 out of the 9 posts contain a link URL partially represented by the search term. A search engine could pull out “freakonomics” from the URL if it chooses, and a query term contained in a URL is one factor used to rank results in large search engines such as Google. Technorati tries to optimize its various search indexes by limiting search possibilities: if you are searching for a link, a query analyzer should only look through the list of available links, not keywords; if you are looking for a keyword, a query analyzer should throw away link data and search only against the words in a post.

I am not sure where The New York Times sourced its data but it didn’t come through me.

If you have any questions about “what they are telling us is actually true” and would like some answers for your own posts or research you can contact me to find out more about how search works. I’m a big fan of researched blog posts and adding more original and thoughtful content into the world.

Update 12/22: Jason updated his post based on this new information. I e-mailed him last night with a link to the post, an alert that his search interface was publicly accessible, and an invitation to continue the conversation. He wishes I had just stuck to an e-mail instead of a full post, but I don’t see it as “airing dirty laundry.” Thousands of people would read his post while he was asleep, and I had a chance to TrackBack and provide some extra information for people viewing the web page and believing all search engines suck.


WordPress developers get corporate

Automattic

This morning the lead developers of WordPress unveiled a new corporate entity based on the popular open-source blogging platform. The new company, Automattic, employs WordPress lead developers Ryan Boren and Matt Mullenweg, contributing developer Donncha O’Caoimh, and Andy Skelton.

The new company will provide WordPress consulting services and develop and maintain services such as hosted blogging site WordPress.com and antispam tool Akismet.

Automattic is a Delaware corporation founded on March 28, 2005, and previously mentioned on this blog as “WordPress Inc.” in late March. The company’s official launch is meant to coincide with the release of WordPress 2.0 later this month.


My Yahoo! feed API

Powered by My Yahoo!

Yahoo! has developed a backend infrastructure that can be easily deployed across various applications online or on the desktop with full synchronization and feed parsing handled on its servers. Developers could tap into the Yahoo! backend and develop new feed-aware applications quickly and easily on a robust platform already used by millions of users. Yahoo! just needs to publicize the code and make sessions a bit easier, but I reverse-engineered their code and I’ll give you a primer.

Aggregator developers spend a lot of time dealing with issues such as proper parsing, feed storage, and at later stages providing synchronization between online applications or other desktops for a seamless reading experience across multiple devices. Yahoo! has the infrastructure behind the scenes to power a services-based feed aggregator on any platform based on a (previously) undocumented My Yahoo! API.

You must first submit login credentials to Yahoo! and receive a few cookies. Yahoo! will also generate a web services session ID that will regenerate in a nice XML message when it expires. Requests are handled by a pool of API servers located at api[0-3].my.mud.yahoo.com.

Individual feed data

/rss/Content/V3.0/getFeedData

Yahoo! serves feed data to its applications through a format built on top of RSS 2.0. You can pass the API parameters such as the maximum number of items to retrieve, your desired date format, the level of processed content to return, and the requested ordering of your results. Yahoo! returns a fully processed and cleaned-up feed with extra metadata.
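A sketch of a getFeedData query. The endpoint is from the observed traffic, but the parameter names below (numItems, dateFormat, sortOrder) are hypothetical stand-ins for the options described above, since Yahoo! has not published the real names:

```python
from urllib.parse import urlencode

def get_feed_data_query(feed, max_items=25, date_format="rfc822",
                        order="desc"):
    """Query string for /rss/Content/V3.0/getFeedData.

    "url", "numItems", "dateFormat", and "sortOrder" are illustrative
    guesses at the parameter names, not Yahoo!'s documented API.
    """
    return "/rss/Content/V3.0/getFeedData?" + urlencode({
        "url": feed,
        "numItems": max_items,
        "dateFormat": date_format,
        "sortOrder": order,
    })

q = get_feed_data_query("http://example.com/index.xml", max_items=10)
```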

Yahoo! tracks when the feed and each individual post were created, modified, and updated. They add information such as whether the feed contains the full content of my posts and whether I am a podcaster. Rich media such as images or audio content is contained in separate content elements. Yahoo! even creates a special server-side unique identifier for each post, saving aggregators a lot of headaches.

Here is the locally stored Yahoo! output of my blog’s RSS feed compared to my actual feed.

You might want to check out the live feed data for my blog after you have logged in to Yahoo!.

All an independent developer would have to do is style the returned clean feed data and only deal with one data format plus some custom elements. A lot easier than typical development.

Synchronization

/ymws?m=SetMetaData

Yahoo! synchronizes its list of feeds between applications using SOAP messages. Each client application appears to be assigned an identifier and a version number as well as unique user information tied to a session.

Here is the My Yahoo! synchronization SOAP message with my personal account information removed. Requests are processed using mod_gsoap.

Entire feeds may be marked as read using a resetUnseen element inside a SOAP message. Yahoo! does send some communication back to the server when each post is selected in the Yahoo! Mail view but the individual IDs do not correlate with the internal post ID. Given Yahoo!’s current interfaces synchronizing read status at a per-item level might not make much sense.

Yahoo!’s mail servers can serve as an example endpoint.

Adding and deleting feeds

/rss/Subscriptions/V3.0/addUserSubs

It is possible to add individual subscriptions using an individual API call with a client property, feed URL, and a web service ID.

Here is an example query adding Scripting News to your Yahoo! aggregation space.

/rss/Subscriptions/V3.0/delUserSubs

You can similarly remove the same feed. Here is an example query removing Scripting News from your Yahoo! aggregation space.
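A sketch covering both calls. The addUserSubs and delUserSubs endpoints are from the observed traffic; “client,” “url,” and “wssid” are hypothetical names for the client property, feed URL, and web service ID the calls require:

```python
from urllib.parse import urlencode

def subscription_call(action, feed, wssid, client="my-aggregator"):
    """Query path for adding or removing a feed subscription.

    The endpoints are observed; the parameter names are placeholders,
    not Yahoo!'s documented names.
    """
    assert action in ("addUserSubs", "delUserSubs")
    query = urlencode({"client": client, "url": feed, "wssid": wssid})
    return "/rss/Subscriptions/V3.0/%s?%s" % (action, query)

add = subscription_call("addUserSubs",
                        "http://www.scripting.com/rss.xml", "ws123")
remove = subscription_call("delUserSubs",
                           "http://www.scripting.com/rss.xml", "ws123")
```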

Conclusion

I think it is possible to construct a new aggregator interface completely powered by Yahoo! on the backend and extended by individual developers for the best user experience and ease-of-use. Developers could also store data locally in a desktop aggregator for online access or to allow better search over time.

I think Yahoo! should develop a desktop aggregator powered by its APIs. In my limited testing I believe it is possible to build a Yahoo! desktop aggregator for Mac OS X using Cocoa, WebKit, and CoreData. I’ve been meaning to talk to the RSS folks at Yahoo! about desktop client possibilities and there’s nothing like a lengthy blog post detailing my current progress to get the conversation started.

New Technorati search results, profile features

Technorati released a redesigned search results page and member profiles tonight, including some features I’ve been wanting for a long time. You can find the official announcement on the Technorati weblog and I will share my personal thoughts and favorites below.

Personal tag cloud

Technorati personal tag cloud

Technorati now displays a personal tag cloud for each member profile! You can now glance at a blogger’s profile and get a pretty good idea about his or her most blogged about topics. Tim Appnel’s Tags.App plugin for Movable Type displays some similar tag visualizations but now anyone on any blog platform can visualize their topical focus.

Yes, I want this component to be one of the options bloggers can select as part of their Technorati JavaScript embed, but that takes just a little more time and it’s best to get the goodies out the door and iterate. The top tags per blog have been available via the Technorati API for months, but this is the first time they have been exposed on the main site.

New search results format

New Technorati search result

Search result excerpts are now a bit longer and meant to welcome the newbie by hiding away some of the more advanced options. Above I hovered over the magnifying glass to reveal a scoped search option for the word “soccer” from a mommy blogger. The talk bubble icon displays the number of inbound links to that blog.

I miss being able to glance at a person’s link count on the search results page, but I think there may be better ways to represent the same data to quickly communicate worthwhile information. I like the eBay stars program used to communicate a feedback rating, and perhaps some similar tiny graphics could be applied to Technorati link counts, but with more meaningful colors.

Keyword search charts

Technorati charts

It’s now easy to visualize keyword use trends over the past 30 days with keyword search charts. The chart above shows the use of the word “Santa” in blog posts over the last 30 days. I’ve been using these charts to follow spikes around product announcements around the industry to see if the excitement and interest in a product remains or quickly fades. Right now I am watching the search result for “Ronaldinho,” a Brazilian who was just named best soccer player in the world for the second year in a row.

Advanced charts

You can easily construct your own charts in various sizes using a standard URL structure:

http://technorati.com/chart/[keyword]?size=[size]

Charts are output in PNG format. The size of the chart varies from a default of s (small, or 180×150 pixels) to xl (extra large, or 700×500 pixels). Publishing tools could easily add a Technorati chart to graph trends over time.
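A small helper for that URL structure. Whether keywords need percent-encoding, and which sizes exist between s and xl, are my assumptions:

```python
from urllib.parse import quote

def chart_url(keyword, size="s"):
    """URL for a Technorati keyword-trend chart PNG.

    Only the "s" (180x150) and "xl" (700x500) sizes are confirmed;
    percent-encoding of multi-word keywords is an assumption.
    """
    return "http://technorati.com/chart/%s?size=%s" % (quote(keyword),
                                                       size)

url = chart_url("Santa")
```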

Enjoy! We’ve promised the Technorati ops team not to release any new features until after Christmas to keep our servers running a bit more predictably than normal.

iTunes Movable Type podcast tip

I like to output as much data as possible for feed aggregators, especially when one feed aggregator accounts for over 50% of my podcast listeners. I was frustrated that my podcast did not have a published duration until I realized there was a pretty simple solution: MTEntryKeywords.

I almost never use the keywords field, yet it is already built-in as a storage option for every post. Perfect! No plugins necessary. Here’s how you do it.

  1. Add your podcast’s duration (mm:ss) as your entry keywords.
  2. Add a line to your RSS 2.0 feed:
    <itunes:duration><$MTEntryKeywords$></itunes:duration>

The time column will now be populated in iTunes for all of your podcasts.
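If you know your episode length in seconds, a small helper (hypothetical, not part of Movable Type) can produce the mm:ss string to paste into the keywords field:

```python
def itunes_duration(seconds):
    """Format an episode length in seconds as the mm:ss string
    expected in the keywords field for <itunes:duration>."""
    minutes, secs = divmod(seconds, 60)
    return "%d:%02d" % (minutes, secs)

print(itunes_duration(150))
```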