Google Mondrian: web-based code review and storage


Guido van Rossum unveiled his first Google project, Mondrian, tonight during a Python tech talk at the Google campus in Mountain View. Mondrian is a web-based code review system built on top of a Perforce and BigTable backend with a Python-powered front-end. Mondrian is a pretty impressive system and is currently in use across Google.

Shared Development Environment

Google uses a company-wide Perforce depot with almost no developer branches. Each developer has their own NFS workspace readable by anyone in the company, including automated processes. An administrative process takes snapshots of each developer workspace including local development environments accessed over SSH. Files within these snapshots can be compared to checked-in data, encrypted, and archived.

Previous methods of review

Before Mondrian, code review was conducted largely over e-mail using Google command-line wrappers built on top of Perforce. A developer could initiate a code review from within the g4 mail tool, which would fire off an e-mail and begin a review thread. When the developer received a response of “looks good to me,” or lgtm for short, they could proceed to check in the change. Changes could be compared using tkdiff.

Design-level reviews are often conducted by e-mailing around Word documents or editing a team wiki. Recently some design reviews have moved onto an internal version of Google Docs.

Web-based collaboration meets code review

Mondrian code review

The Mondrian tool creates a much better workflow by creating task-specific dashboards, in-line commenting, well-tracked statistics, and more. The application is built on top of Python open source libraries such as the Django framework, smtpd.py mail service, and the wsgiref web server software.

Code reviews can be initiated and completed from within the Mondrian interface. A developer requests a review from another user or a group of users to kick off the process. Each invited reviewer can add comments directly underneath a line of code or reference the entire file. You can request and diff the file against previous versions as well. It’s a pretty slick interface, lightly highlighting each line of code as you hover, and popping open a comment box in response to a double-click. Comments can be saved as a draft and shared at a later time.

Putting the entire code review process online means you never have to worry about referencing the most recent version of a file or losing e-mails. Mondrian captures every outgoing e-mail related to the workflow, looks for key data such as revision numbers, and updates a to-do list accordingly.

More on BigTable

Mondrian uses BigTable as backend storage for user data. More specifically, it’s used to store:

  • Change metadata such as a description or list of files
  • Comments entered through the web interface or via e-mail
  • Encrypted file snapshots taken from user workspaces
  • Per-user data such as active changes or last view dates

Summary

The Mondrian web code review system is pretty impressive. Guido estimates he has spent about 25% of his work time on the project since joining Google in December 2005. Mondrian served as Guido’s introduction to Google technologies and processes with the help of a few other Googlers treating it as a side-project. The application is so deeply intertwined with Google technologies it’s not likely to be available as open source until Subversion and a backend such as SQLite can be supported.

Guido’s full talk, including a demo of Mondrian, is available on Google Video.

Feed publishing best practices

Web feed syndication is made up of two base vocabularies: RSS 2.0 and the Atom Syndication Format. These base vocabularies are extended using namespaces to create a common set of expressions for your web feed data. In this post I’ll walk through some best practices for publishers syndicating their data via web feeds.

Should I use RSS or Atom?

The RSS 2.0 syndication format has been around for about four years and over that time it has been used by web publishers large and small to represent their data for syndication. The New York Times publishes its top stories via RSS to deliver updates to readers with appropriate viewing software. NPR distributes audio attachments commonly referred to as “podcasts” using RSS enclosures to iTunes and other specialized subscription programs.

The Atom Syndication Format was released in December 2005 under the standardization process of the Internet Engineering Task Force (IETF). A few popular uses include Google GData for API responses, FeedBurner resyndication, and Six Apart blogging products.

Choosing RSS or Atom for feed syndication is a bit like selecting GIF or JPEG as your image format: publishers have preferences for the best representation of the original data, but most renderers support both. There are a few easy answers, however. If you syndicate audio or video in your feed, RSS offers more reliable compatibility across deployed players. If you would like to use your feed as a lightweight API or present data for government consumption, Atom should be your format of choice.

Extended vocabularies

RSS and Atom take advantage of XML to express data not included in their base vocabularies. A number of groups and companies have authored namespace extensions to represent a variety of data. Here’s a look at some of the more popular namespace expressions:

Dublin Core metadata
The Dublin Core namespace might be used to specify an author name, a contributor, or copyrights to an individual feed item. Many Dublin Core elements are better expressed using Atom base elements.
Comments
Comment feeds and counts can be included with a feed item. Slash and Well-Formed Web namespaces are popular additions to RSS while Atom feeds may use Atom Threading Extensions.
Photo, audio, and video
Publishers may add more information about media enclosures using Yahoo! Media RSS or the iTunes podcast namespace. Yahoo! Media RSS lets a publisher describe multiple available data types, such as MP3 and AAC. The iTunes namespace enhances your listings within the iTunes Store.
Search results
OpenSearch expresses search results and related data for consumption by search aggregators and the built-in search features of Internet Explorer 7 and Firefox 2.
Creative Commons
The Creative Commons namespace declares license data inside an RSS feed. Atom publishers can use the rights element instead.
Geographical coordinates
Publishers can express latitude and longitude coordinates using the W3C Basic Geo vocabulary. A geotagged set of photos might be syndicated with coordinates, or a traffic conditions feed might publish a corresponding location.
Item pricing
The Buy.com product module uses a specialized namespace for pricing, thumbnail image, text-only description, and SKU.
Weather conditions
Yahoo! Weather publishes weather forecast data using a specialized namespace. The National Weather Service uses Digital Weather Markup Language.
Forums
The Jive Forums namespace covers forum data such as total message counts and individual threads.
Calendar
The Google Calendar namespace is one way of expressing calendar data.
List formatting
Microsoft’s Simple List Extensions define a unique ordering of feed items such as a Top 10 list or upcoming movies in your rental queue.
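
As a rough illustration of how namespace extensions sit alongside the base vocabulary, here is a minimal Python sketch that reads a Dublin Core author and W3C Basic Geo coordinates out of an RSS 2.0 item using only the standard library. The sample feed, its URLs, and the function name read_extended_items are made up for this example; a production consumer would more likely lean on a feed parsing library.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for Dublin Core and the W3C Basic Geo vocabulary.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}

# Hypothetical feed content, invented for illustration only.
SAMPLE_RSS = """<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <channel>
    <title>Example Photos</title>
    <link>http://example.com/</link>
    <description>Geotagged photos</description>
    <item>
      <title>Golden Gate Bridge</title>
      <link>http://example.com/photos/1</link>
      <dc:creator>Jane Example</dc:creator>
      <geo:lat>37.8199</geo:lat>
      <geo:long>-122.4783</geo:long>
    </item>
  </channel>
</rss>"""


def read_extended_items(rss_text):
    """Return title, author, and coordinates for each item in an RSS feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        # Extension elements are looked up by prefix through NAMESPACES.
        lat = item.findtext("geo:lat", namespaces=NAMESPACES)
        lon = item.findtext("geo:long", namespaces=NAMESPACES)
        items.append({
            "title": item.findtext("title"),
            "author": item.findtext("dc:creator", namespaces=NAMESPACES),
            "lat": float(lat) if lat is not None else None,
            "long": float(lon) if lon is not None else None,
        })
    return items


if __name__ == "__main__":
    for entry in read_extended_items(SAMPLE_RSS):
        print(entry)
```

Items that omit an extension element simply come back with None for that field, which is how a consumer unaware of a namespace would effectively see them.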

Avoid confusion of tongues

Paul Gustave Dore Confusion of Tongues

Given the range of expression available in both the base vocabularies and the widely deployed extended namespaces, a new feed publisher would be well served by sticking to these vocabularies where possible. Just as the color value “cyan” may mean nothing to a color picker with a limited vocabulary, your data might never be parsed or understood by feed parsers if you become overly inventive.

Most feed readers don’t actually walk the XML of each feed themselves. They rely on feed parser libraries to handle feed errors, normalize similar markup across different publication formats, and retrieve remote files from your server. A parser such as Universal Feed Parser contains built-in support for over 40 namespaces and attempts to normalize the various ways of expressing title, author name, etc. A newly invented namespace is less likely to be supported by these intermediate libraries than existing methods of data definition.
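
To give a feel for the kind of normalization these libraries perform, here is a deliberately tiny Python sketch that returns entry titles from either an RSS 2.0 or an Atom document. It is not how Universal Feed Parser is implemented; the function name entry_titles and both sample feeds are assumptions made for this example, and real libraries cope with far messier input.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this namespace; RSS 2.0 core elements have none.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample documents, invented for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example</title>
    <item><title>Hello</title></item>
  </channel>
</rss>"""

SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry><title>Hello</title></entry>
</feed>"""


def entry_titles(feed_text):
    """Return the entry titles of an RSS 2.0 or Atom feed as a list."""
    root = ET.fromstring(feed_text)
    if root.tag == ATOM + "feed":
        return [entry.findtext(ATOM + "title")
                for entry in root.findall(ATOM + "entry")]
    if root.tag == "rss":
        return [item.findtext("title") for item in root.iter("item")]
    raise ValueError("unrecognized feed format: %s" % root.tag)


if __name__ == "__main__":
    print(entry_titles(SAMPLE_RSS))
    print(entry_titles(SAMPLE_ATOM))
```

Both sample feeds yield the same list, so calling code never needs to know which format the publisher chose.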

Here’s a sampling of some of the popular feed parsing libraries by programming language:

Windows/C#
Windows RSS Platform
Apple Leopard/Cocoa
Apple Syndication Platform (unreleased)
Python
Universal Feed Parser
PHP
Magpie
Java
Rome
Perl
XML::FeedPP
Ruby
Simple RSS

Check for errors

Once you’ve published your feed you’ll want to check for XML and feed errors. Some parsers are more liberal than others, but a single error could result in users of specific services not receiving your latest updates.

You can check your files for errors with Feed Validator or the W3C Feed Validation Service. You can program web services directly against the W3C interface, or you can download the feed validator code for local use.
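
Before sending a feed to a full validator, a quick local well-formedness check can catch the most damaging class of error. The Python sketch below covers XML well-formedness only, not the feed-level rules a validator applies; the function name find_xml_error and the sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def find_xml_error(feed_text):
    """Return None if feed_text is well-formed XML, else a short description."""
    try:
        ET.fromstring(feed_text)
    except ET.ParseError as error:
        # ParseError.position holds the (line, column) of the failure.
        line, column = error.position
        return "line %d, column %d: %s" % (line, column, error)
    return None


if __name__ == "__main__":
    good = "<rss version='2.0'><channel><title>A &amp; B</title></channel></rss>"
    # An unescaped ampersand is a common publishing mistake.
    bad = "<rss version='2.0'><channel><title>A & B</title></channel></rss>"
    print(find_xml_error(good))
    print(find_xml_error(bad))
```

A feed that passes this check can still fail validation, for example by omitting a required element, so it complements the validators rather than replacing them.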

Feed marketing

Once you’ve published a feed using well-understood element sets and valid markup you’ll want to be sure the world can find your latest updates. Aggregators and search engines support ping notifications, a quick way of letting a service know it should visit your website and/or feed and discover new updates.

Ping

Most ping servers accept update notifications delivered via XML-RPC, using the weblogUpdates.ping method to send a website title and website URL and/or weblogUpdates.extendedPing to send the same data plus a feed URL. You can send notification updates to a variety of sources for quick inclusion in a search index or feed aggregator. Below are just a few popular ping endpoints serving a general audience:

Google
http://blogsearch.google.com/ping/RPC2
Yahoo!
http://api.my.yahoo.com/RPC2
http://ping.blo.gs/
NewsGator
http://services.newsgator.com/ngws/xmlrpcping.aspx
Bloglines
http://www.bloglines.com/ping
Technorati
http://rpc.technorati.com/rpc/ping
VeriSign
http://rpc.weblogs.com/RPC2
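
A ping is a small XML-RPC call, and the following Python sketch shows roughly what one looks like using the standard library's xmlrpc.client module. It covers only weblogUpdates.ping with a site title and URL; the helper names build_ping_request and send_ping and the example site are assumptions for illustration, and send_ping presumes the chosen endpoint is reachable and accepts this method.

```python
import xmlrpc.client


def build_ping_request(site_title, site_url):
    """Return the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps(
        (site_title, site_url), methodname="weblogUpdates.ping"
    )


def send_ping(endpoint, site_title, site_url):
    """Send the ping to an endpoint and return the server's response.

    This performs a network request; the response format depends on the server.
    """
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(site_title, site_url)


if __name__ == "__main__":
    # Print the request body without contacting any server.
    print(build_ping_request("Example Blog", "http://example.com/"))
```

Building the payload separately from sending it makes it easy to inspect exactly what a ping server would receive before pointing the code at a live endpoint.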

Create new subscriptions

A few search services restrict their index to user feed subscriptions. If you’re not already a user, create a new account and subscribe to your feed, adding notes and tags where appropriate. Be sure to cover popular online aggregators such as My Yahoo!, Google Reader, Bloglines, etc.

These additional actions give your feed a few extra importance points, since at least one user cares enough about the data to subscribe.

Claim your site, claim your feed

Some search services allow a publisher to verify their website and/or feed for more frequent updates, statistics tracking, or highlighted search results listings. You’ll likely have to place a specially issued code within a web page or feed to prove your account has the ability to edit the site you would like to claim. Here are a few search services that offer author claiming:

Local Resources

This blog post is meant to serve as a general overview of the worldwide market for feed publishers. My views are skewed towards blogs published in English inside the United States. If you publish content in other languages or focused on a particular national audience, research the integration opportunities available with those specific services.

Summary

Feed publishing is a pretty busy space! Millions of customers are ready to receive regularly delivered content updates, either through their feed aggregator or through a search engine. Structured data delivered in easily digestible chunks is a good thing.

Feeds can serve many purposes, from lightweight APIs and data interchange formats to news updates. Each use has an intended audience and possible extended audience, and creating well described data in commonly understood data formats will extend your distribution reach and allow the many parsers and feed interfaces already present on the web to begin remixing your data in new ways for custom delivery and interpretation.

Declaring alternate web content for searchability and discoverability

Web authors may declare alternate versions of a single web page, exposing additional languages available or various file formats. HTML documents express these relationships using the link element in the document header.

Alternate language

Wikipedia main language offerings

A single Wikipedia article about “search” might have alternate representations and translations, such as “buscar” in Spanish, “suche” in German, “rechercher” in French, etc. A search engine or web browser software can discover the availability of these alternate document versions if declared by the publisher.

<link title="Arabic" href="http://ar.example.com/" rel="alternate" hreflang="ar" type="text/html" charset="ISO-8859-6" />

The example markup above advertises an alternate version of example.com available in Arabic expressed in the ISO character set 8859-6. If a user capable of reading Arabic arrives at the page they can now take appropriate action.

Alternate format

The HTML specification also allows publishers to associate alternate file formats with a web page. A publisher might declare alternate versions of the page available in plain text, PDF, or a web feed format such as RSS or Atom.

<link title="Print Me" href="http://example.com/index.pdf" rel="alternate" media="print" type="application/pdf" />

Modern browsers take advantage of these alternate file format declarations, lighting up a special icon when a web feed is discovered. Internet Explorer 7, Firefox 2, and Opera 9 advertise the availability of a web feed corresponding to the viewed web page.
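
For a sense of how a browser or crawler might discover these declarations, here is a minimal Python sketch built on the standard library's html.parser module. It collects every link element whose rel includes "alternate" and then filters for the RSS and Atom media types. The class and function names, the sample page, and the feed URL are assumptions made for this example, not a description of how any particular browser works.

```python
from html.parser import HTMLParser

# Media types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

# Hypothetical page header, invented for illustration only.
SAMPLE_PAGE = """<html><head>
<link title="Arabic" href="http://ar.example.com/" rel="alternate" hreflang="ar" type="text/html" />
<link title="Print Me" href="http://example.com/index.pdf" rel="alternate" media="print" type="application/pdf" />
<link title="Updates" href="http://example.com/feed.atom" rel="alternate" type="application/atom+xml" />
<link href="http://example.com/style.css" rel="stylesheet" type="text/css" />
</head><body></body></html>"""


class AlternateLinkParser(HTMLParser):
    """Collect the attributes of every <link rel="alternate"> element."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "alternate" in rel_values:
            self.alternates.append(attributes)


def find_alternates(html_text):
    """Return a list of attribute dicts, one per alternate link."""
    parser = AlternateLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.alternates


def find_feeds(html_text):
    """Return the href of each alternate link that advertises a web feed."""
    return [
        link["href"]
        for link in find_alternates(html_text)
        if link.get("href") and (link.get("type") or "").lower() in FEED_TYPES
    ]


if __name__ == "__main__":
    print(find_feeds(SAMPLE_PAGE))
```

The same list of alternates also carries hreflang and media attributes, so a consumer could use it to offer translations or a print version in addition to feeds.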

Internet Explorer 7 web feed highlight

The ease-of-use and availability of these new feed discovery tools will convert website visitors into website subscribers, strengthening each user’s relationships with your content.

This post is part 1 of 2 of a 15-minute feed syndication best practices presentation from WebmasterWorld PubCon 2006 in Las Vegas. Part 2, Feed publishing best practices, is much longer.

The Spam Farms of the Social Web

Blogs and other social media tools have changed the publishing landscape over the past few years, making it easier than ever to share information with the world. The ease of use and focused attention of the medium has also helped create new opportunities for spammers to automatically generate content, buy links, and get noticed by search engines and other points of aggregation. In this post I will break down the operations of one spam network utilizing social media technologies such as WordPress, Digg, del.icio.us, and more to climb the search results and generate revenue through ads and affiliate programs.

Last weekend I noticed a Digg submission about weight loss tips had climbed the site’s front page, earning a covetous position in the top 5 technology stories of the moment. The 13 sure-fire tips were authored by “Dental Geek” and posted to the “Discount Dental Plan” category on his WordPress blog. Scanning the sidebar links and adjacent content it was obvious this content was out of place on a page optimized for dental insurance. The webmaster of i-dentalresources.com had inserted some Digg bait, seeded a few social bookmarking services, and waited for links and page views to roll in, creating a new node in a spam farm fueled by high-paying affiliate programs and identity collection for resale.

eBizzSol portfolio snapshot

The spammer’s domain is managed by eBizzSol, a company with fake domain registration information including the address block of a Christian church in Fullerton, California. The dental site is registered to an address in Dhaka, the capital of Bangladesh. Based on the broken English I’ve found on the network’s sites an offshore base of operations would not surprise me. eBizzSol mentions about 200 sites in its portfolio, including real estate, mortgage, casinos, and more. They even advertise a content generation service for SEOs offering six blog posts a month for $75 optimized for specific keywords, including guarantees for blog directory and ping submissions. There are other sources of content generation available for hire online, creating a flow of content republished across a target category optimized for specific terms.

Follow the money

Why would someone want to create a site optimized for dental services? A search engine such as Google or Yahoo! discovers the site, indexes its pages, and starts including its content in search results for targeted keywords. Web searchers associate search engine rank with authority on a subject such as lowering an insurance premium or mortgage, and each resulting action can generate a large amount of money. This particular site collects $40 or more per dental plan sold through a dental plan reseller, targets specific keywords of value, and boasts on its pages of search engine index inclusion in “just a few hours.”

The dental terms targeted cost up to $18 a click, offering incentives for top organic search conversion. Below is a price estimate from Google for keyword targeting in the United States.

Google AdWords pricing

Search term              CPC ($)
teeth whitening          18.66
sedation dentistry       12.80
cosmetic dentistry       12.76
dental plans              9.78
dental implant            6.85
pediatric dentist         6.77
discount dental plans     5.93
oral surgery              4.95
braces                    3.39
cavity                    1.88

Gathering links

Directories

Yahoo! directory pricing

This webmaster bought links from the Yahoo! directory, the Microsoft Small Business Directory, Business.com, and a few others, placing a link to their site within targeted categories. They are cheaper than the $1000 links purchased on sites such as the W3C, but these listings are often just as spammy.

Virality

The article link was submitted to Digg by a user who joined Digg last month yet is already ranked in the top 150. The story received over 900 Diggs and is currently buried. A newly minted user posted to Reddit, posted to Newsvine, and posted to del.icio.us using the same name on each service. Seeding and voting up the content worked, as the blog post made its way to the top story listings on each social news service.

As of this evening the spam site has 353 inlinks from 212 external pages, mostly due to its viral marketing efforts on social networks. Some social bookmarking users include their bookmarked links in their blog sidebar, creating additional direct links throughout their entire site in addition to the original bookmarking service location. The spam network had successfully spread a piece of content throughout multiple user communities, and onto individual blogs in the process.

Summary

Certain topics are especially well suited for baiting the technology-oriented crowds of social news and bookmarking sites. Stories focused on Apple, Firefox, Google, Nintendo, history of computers, top X lists, or the target social site itself are common baiting practices used to attract attention and place a new content node on the map. Opportunists will continue to jump into new networks of influence and promote their own sites, gathering search engine juice even when the brief blip of attention has passed and the crowd moves on to another story of the moment.

World of Warcraft female human with shovel

I believe social media accounts are currently available for rent or for sale, rewarding active users with paid placements or account resells in much the same way as a World of Warcraft character might be resold on eBay. Social media sites and search engines need to stay on top of this new form of content creation, continually analyzing data and scrubbing out the dirt. Sites overrun with web spam quickly lose their utility and might be banned from search engines.

Social media sites continue to change the way we interact with data but expect more activity and content shaping in the future from marketers targeting the social media space for a quick link injection.

Social network marketing, spam, and gaming

I spent the last few days among webmasters at the PubCon conference, where most conversations were focused on marketing yourself online to humans and search engines. The 2000 attendees focused on ranking themselves as high as possible in search engine result pages and driving site traffic. Methods of achieving these goals cover a full spectrum of white hat to black. Social networking and crowdsourcing sites are new focuses of the search engine marketing sector, taking advantage of loose editing and account creation restrictions to boost a site’s visibility.

Social networking and e-commerce

Should every item in your product catalog have a MySpace profile? A few retailers think so, and mentioned creating automated processes to create new accounts on sites such as MySpace and Vox. If a user wants to add Tickle Me Elmo Extreme to his or her friend list it might just be a profile created by a shopping comparison site, toy merchant, or an affiliate. Toymakers such as Mattel are likely not policing their brand on sites such as MySpace, leaving some opportunity for others to produce the content and gather links, affiliate fees, and more.

Most web publishers aren’t making a cent and would be happy to take a few dollars in exchange for a link. That’s the opinion of a few new companies and webmasters specializing in buying links on weblogs and hobby sites on the web. A few dollars might buy a link on a recipe site, making sure every mention of “sharp knife” points to a specific product. Marketers who pay a little more might buy a link in a blog plugin or theme. One consultant mentioned local trade associations are really easy to “buy off.”

The links are distributed across the web, look almost natural, and are tougher for a search engine to spot as purchased. A sponsorship such as Bizrate’s placement on CPAN pages is one example, but the success of blog placement in search engine results creates cheaper and more distributed points of purchase.

Gaming Digg

Digg was a popular topic of discussion in the hallways, with lots of stories about how sites can tap into Digg’s huge audience and secure a few choice links and good traffic. Some marketers create a story aimed at the Digg audience, such as the top 10 reasons Mac users love The Daily Show, and with the appropriate submitters and human or bot-powered voting it rises towards the top. A few search engine marketing consultants are promoting their account status and influence on Digg to clients. User-powered content is a popular target, and some of the techniques used are pretty clever and advanced.

Summary

There is a lot of activity in the social networking and user-generated content space from marketers and spammers. New services need to pay attention to a variety of attack vectors and patch holes and vulnerabilities quickly to stay relevant, useful, and performing well. I’ve summarized some of the already public and well-discussed vectors of exploitation, but there are a lot more advanced methods skewing search and discovery on today’s social web that I won’t be blogging about.

EmTrace WidgetStation


A Korean company specializing in smartphone development is releasing a hardware device next year focused on widgets. The WidgetStation from Emtrace Technologies has both a mono and a color LCD and receives content updates over Ethernet and/or USB connections. It’s a mini computer with an ARM processor, NAND flash memory for local storage, and RAM.

The mono LCD is designed for long-term display items such as a clock or weather while the color LCD displays built-in and customizable content from the Internet or your desktop, including support for audio playback.

Emtrace’s experience developing for smartphones in a mobile-heavy culture such as Korea should give it a leg up in this emerging market of widget hardware producers. Competitors include PortalPlayer Preface, Chumby, and Ambient Devices. Akihabaranews loves the WidgetStation, which has already won a CES Innovations award in the Personal Electronics category.

Will people use a dedicated hardware device for widgets? I think so. I’ve eyed Internet-enabled photo frames, digital audio players, weather stations, atomic clocks, and more for my own personal use, but price and bulk usually keep me away. Combining functions on one device remotely configurable from the Internet makes a lot of sense and could be pretty popular.

Google Personalized Homepage for your domain

Google Apps for Your Domain Start Page

Users of Google Apps for Your Domain can now add a homepage with a custom set of configured gadgets for their users. The new feature lets companies configure mail messages, calendar data, specialized web feeds, and more as their employees’ portal to the web. The group customization feature was previously only available to large partnerships such as Dell and Gateway.

The Apps for Your Domain program launched in August and includes custom branded and custom addressable access to Google Mail, Talk, Calendar, webpage creation, and now your own start page to bring it all together. The Google search box unites most of these services, bringing users closer to Google’s advertisements.

Speaking at PubCon on Tuesday

I’ll be in Las Vegas early next week speaking at PubCon on feed syndication best practices. The session takes place from 10:15-11:30 a.m. on Tuesday if you are attending the search conference.

I have not been to Las Vegas in a few years so I’ll be checking out new pieces of grandeur at the Wynn, new Caesars Palace, Treasure Island, etc.

Hopefully there will be lots of search geeks in attendance leading to interesting conversations.

Business Plan Archive studies companies of the late 90s

I’ve been sucked in to the Business Plan Archive created by the University of Maryland and George Mason. They’ve archived business plans, investor pitches, term sheets, screenshots, mockups, and even old business cards. Here’s a sample company description from an online job site:

The company will introduce a new network of online recruiting sites that will utilize the latest advances in internet technology to provide online interaction between employers and job seekers.

The company was seeking a $26 million Series A in mid-1998. Another company I found was after $10 million for a sports site and ended up with $500,000 in convertible notes. I’m having fun clicking around through interesting company names.

Today’s Wall Street Journal cites new research from economist David A. Kirsch focused on every business plan submitted to a single East Coast venture firm between the Netscape IPO in August 1995 and the Nasdaq peak in March 2000. Nearly half of the 1100 companies studied were still in business in 2004. The study shows even more startups could have been created during that time by focusing on small markets instead of a “grow big fast” mentality, but VCs often passed on the smaller deals.

Update: I found promotional and training videos from a few companies and even logo merchandise a.k.a. swag.

Fox Interactive should host a MySpace conference

Yesterday’s Widgets Live! conference provided an overview of an industry, but I think there is enough interest in the social networking space to warrant a separate conference. I think Fox Interactive Media and Adobe should partner to create a MySpace conference in the first quarter of 2007 focused on integrating your content, brand, or products on MySpace. The event would cover topics such as the development of widgets and the right and wrong way to engage a social media community, help create new SpringWidgets, and outline ways to work with Fox Interactive Media for continued success.

There are currently lots of developers creating embeddable content for the MySpace community. A few products such as YouTube are in direct competition with similar products such as MySpace Video, but there is a long tail of content, such as rubbing a Buddha’s belly for good luck or counting down the days until school’s out, that will be developed by multiple outside companies and help make MySpace a success. Comments from parent company News Corp execs such as Peter Chernin make these developers feel about as welcome as a fakester profile on Friendster.

FIM could do a really good thing and directly engage that community, shaping the content and quality present on its network. Adobe is an ideal partner since Flash is the preferred embed of MySpace and the basis of SpringWidgets. Host it in the winter when people are excited to leave the snow and come to southern California. Fox owns a few venues in town, which should make planning a lot easier.