Blog – Page 5 – Niall Kennedy

Google App Engine for developers

On Monday Google launched Google App Engine, a hosted dynamic runtime environment for Python web applications inside Google’s geo-distributed architecture. Google App Engine is the latest in a series of Google-hosted application environments and the first publicly-available dynamic runtime and storage environment based on large-scale propriety computing systems.

Google App Engine lets any Python developer execute CGI-driven Web applications, store its results, and serve static content from a fault-tolerant geo-distributed computing grid built exclusively for modern Web applications. I met with the App Engine’s team leads on Monday morning for an in-depth overview of the product, its features, and its limitations. Google has been working on the Google App Engine since at least March 2006 and has only just begun revealing some of its features. In this post I will summarize Google App Engine from a developer’s point of view, outline its major features, and examine pitfalls for developers and startups interested in deploying web applications on Google’s servers.

What is Google App Engine?

Google App Engine is a proprietary virtualized computing suite covering the major common components of a modern web application: dynamic runtime, persistent storage, static file serving, user management, external web requests, e-mail communication, service monitoring, and log analysis. The Google App Engine product offers a single hosted production web server stack hosted on Google’s custom-designed computers and datacenters distributed around the world.

Google App Engine is a managed hosting environment with a tightly managed stack running in a machine-independent environment. It simplifies the deployment and management of your web application software stack while constraining you to a specific stack. When I start a new web development project today I have to first setup a tiered system to effectively handle site growth:

Purchase dedicated servers or virtualized slices. Estimate necessary CPU, memory, disk space, etc. at each tier.
Configure a web server for dynamic content. Install Python and its eggs, Apache HTTPd and extra modules such as modwsgi. Configure and tweak each. Open appropriate ports. Listen.
Setup a MySQL database server and choose the appropriate storage engine. Configure MySQL, add users, add permissions. Tweak and optimize.
Add an in-memory caching layer for frequently accessed dynamic content.
Monitor your uptime and resource utilization with Ganglia and/or other tools on each machine.
Serve static files such as JavaScript, CSS, and images from a specialized serving environment such as Amazon’s Simple Storage Service.
Turn your static server into an origin server for a CDN with points of presence close to your website’s users.
Connect each piece of the stack, keep its software updated to avoid security vulnerabilities, and hopefully respond to all website requests in less than a second.
Dedicate work hours and expertise to all the above. Hire outside assistance if needed.
Don’t go broke trying.

Your tiers will expand as your new web application gains popularity. Your single-server tiers become load-balanced services, message bus broadcasts and listeners, and distributed cache arrays at scale. You’ll probably spend time rearchitecting your application at each stage of growth to incorporate for these new resource demands if you can afford the time, expertise, and effort.

Google App Engine is a new and interesting solution for Python developers interested in adding features, not servers. Google spends hundreds of millions of dollars developing its custom infrastructure with 12-volt power supplies tapped into a hydro-electric dam next door and fat fiber pipes owned by local governments carrying requests and responses to their proper home. Google’s physical infrastructure is vast array of highly optimized web machines, and we’ll now be able to see how such infrastructure performs across more generic applications on App Engine.

Freemium hosting model

Google App Engine is a “freemium” business model offering basic features for free with paid upsells available for application developers exceeding approximately 5 million pageviews a month. This resource quota approximately matches the Google Analytics 5 million pageview limit. Google Analytics customers may currently exceed this limit if they maintain an active AdWords account with a daily advertising budget of $1 or more. The Google App Engine team plans to introduce pricing and service level agreements for additional resources, priced in a pay-as-you-go marginal resource structure, once the product leaves its limited 10,000-person preview period later this year.

Quota Type	Limit / day
HTTP requests	650,000
Bandwidth In	9.77 GB
Bandwidth Out	9.77 GB
CPU megacycles	200 million
E-mails	2,000
Datastore calls	2.5 million
External URL requests	160,000

Google publishes these quotas and provides administrative monitoring tools. The quotas are just a guideline as Google may cut off access to your application if you receive a traffic spike of an unspecified duration. The Google App Engine quota page specifies:

If your application sustains very heavy traffic for too long, it is possible to see quota denials even though your 24-hour limit has not yet been reached.

Google App Engine already failed the Techcrunch effect and appears the platform is currently unable to handle referral traffic loads from a popular blog or news site typically associated with a product launch. The traffic spike cutoffs make me think twice about hosting anything of value on App Engine.

The team

The Google team behind App Engine has a long history in developer services. Team members include some of the top Python experts in the world, financial transaction specialists, and developer tool builders.

Python creator Guido van Rossum wrote the App Engine SDK and ported the Python runtime and Django framework for the new environment. Google App Engine is Guido’s first full-time project at Google after his Noogler project Mondrian.
Technical lead Kevin Gibbs previously worked on the the SashXB Linux development toolset and multiple RPC projects at IBM before he created Google Suggest in 2004.
Developer Ryan Barrett wrote the BigTable datastore implementation and related APIs. Previously Ryan was tech lead on Moneta, Google’s transaction processing platform and customer data store.
Product lead Paul McDonald has worked on Google Checkout, AdWords, and a Web-based IDE named Mashup Editor (all strong candidates for App Engine inclusion).
Product manager Peter Koomen has previously authored papers on natural language search and semantic analysis.

The list above is just a sampling of the full-team behind App Engine.

Feature limitations

Google App Engine is not without its faults. Applications cannot currently expand beyond the quota’s ceiling. It’s still unclear how an application will dynamically scale on App Engine once it leaves the farm leagues, and at what cost.

A few major issues include:

Static files are limited to 1 MB. App Engine does not support partial content requests (Accept-Ranges).
Cron jobs and other long-life processes are not permitted.
Applications are not uniquely identifiable by IP address, leading to a lack of identification for external communications. Applications may suffer from bad neighbor penalties from API providers upset at another app on the service.
No SSL support. No IP address complicates signing, but port 443 is open for requests. You can rely on Google services (and branding) for trusted login and possibly future payments.
No image processing. Python Imaging Library relies on C, and is therefore not a possible App Engine module.
Google user accounts. Site visitors are very aware of your choice in web hosts each time they attempt to logon to your application. I feel like this flow makes your application seem less professional, but may be a reasonable trade-off. Google will store your user data and potentially mine its data for better ad targeting.

Summary

Overall I am quite impressed with Google App Engine and its potential to remove operations management and systems administration from my task list. I am not confident in Google App Engine as a hosting solution for any real business while the host is in preview stage but those concerns may be alleviated once the product is ready for real customers and real service-level agreements.

Python developers have just been granted a few superpowers for future projects. As an existing Python and Django developer I know how difficult it can be to find a managed hosting provider with modern Python support. Many hosts are years behind, running Python 2.3. I am excited App Engine already features the programming tools I use every day, with a few modifications for their proprietary systems. App Engine should introduce more developers to Python and the Django framework and hopefully cause other web hosts to provide better Python support as well.

iPhone web app performance

The Exceptional Performance group at Yahoo! just released a detailed performance analysis of web applications on the iPhone. Yahoo! analyzed the full capabilities of the iPhone’s Safari browser including browser cache and transfer speeds.

Cache persistence

The Safari browser on iPhone allocates memory from the shared system memory but does not save web content into persistent storage. Any cached objects (CSS, JavaScript, images, etc.) are removed from memory on reboot.

Optimal component size

Safari for iPhone will only cache files 25 KB or smaller served using the Expires explicit expiration time or Cache-Control max-age directive HTTP headers. Safari decodes the file before saving it cache, meaning your total unzipped file size must squeeze under the 25 KB ceiling to hit the cache. Components already in cache are only replaced by new cacheable components using the least recently used algorithm.

Safari for iPhone is able to cache a maximum of 19 external components, placing a maximum cache limit at around 475 KB.

Download speed

Yahoo! found typical download iPhone download speeds vary from 82 kbps to 150 kbps when connected to a GSM cellular data network. Wi-Fi connections over an 802.11b/g networks obviously speed up the experience, but pages should assume cellular data load times when designing for a compelling user experience.

Summary

Web applications built for the iPhone’s Safari browser need to specifically target web performance these small devices and special cache rules. Desktop browser best practices such as zipped components and combined files for CSS and JavaScript may be too bloated for the Safari mobile browser. A few tips:

Limit cacheable components to a decompressed size of 25 KB or less
Limit yourself to 19 or less cached components
Minify CSS and JavaScript for slimmer file weights.
Use CSS sprites to combine multiple small images into a shared image under 25 KB

Sniff browser history for improved user experience

The social web has filled our websites with too much third-party clutter as we figure out the best way to integrate content with the favorite sites and preferences of our visitors. Intelligent websites should tune-in to the content preferences of their visitors, tailoring a specific experience based on each visitor’s favorite sites and services across the social web. In this post I will teach you how to mine the rich treasure trove of personalization data sitting inside your visitor’s browser history for deep personalization experiences.

I first blogged about this technique almost two years ago but I will now provide even more details and example implementations.

Evaluate links on a page

Web browsers store a list of web pages in local history for about a week by default. Your browsing history improves your browsing experience by autocompleting a URL in your address bar, helping you search for previously viewed content, or coloring previously visited links on a page. Link coloring, or more generally applying special CSS properties to a :visited link, is a DOM-accessible page state and a useful method of comparing a known set of links against a visitor’s browser history for improved user experience.

New Site
Visited site

A web browser such as Firefox or Internet Explorer will load the current user’s browser history into memory and compare each link (anchor) on the page against the user’s previous history. Previously visited links receive a special CSS pseudo-class distinction of :visited and may receive special styling.

<style type="text/css">
ul#test li a:visited{color:green !important}
</style>
<ul id="test">
  <li><a href="http://example.com/">Example</a></li>
</ul>

The example above defines a list of test links and applies custom CSS to any visited link within the set. Your site’s JavaScript code can request each link within the test unordered list and evaluate its visited state.

Test a known set of links

Any website can test a known set of links against the current visitor’s browser history using standard JavaScript.

Place your set of links on the page at load or dynamically using the DOM access methods.
Attach a special color to each visited link in your test set using finely scoped CSS.
Walk the evaluated DOM for each link in your test set, comparing the link’s color style against your previously defined value.
Record each link that matches the expected value.
Customize content based on this new information (optional).

Each link needs to be explicitly specified and evaluated. The standard rules of URL structure still apply, which means we are evaluating a distinct combination of scheme, host, and path. We do not have access to wildcard or regex definitions of a linked resource.

In less geeky terms we need to take into account all the different ways a particular resource might be referenced. We might need to check the http and https versions of the page, with and without a www. prefix to more thoroughly evaluate active use of a particular website and its pages.

I group my tests into sets of URLs with the most likely matches placed at the beginning of the set. I evaluate each link in the set until I find a match thereby exhausting positive indicators of site activity while prioritizing the data scan.

Live demos and examples

Sniffing a visitor’s browser history has good and evil implications. An advertiser can determine if you visited Audi’s website lately, drill down on exact Audi models, and offer related information without ever placing code on the Audi website. I have been scanning the browser history of my site visitors for the past few months and I have coded a few examples to show benevolent uses for improved user experience.

Online aggregators

Clusters of feed subscription buttons clutter our websites, displaying tiny banner ads for online aggregators of little use to most of our site visitors. My blog checks a known list of online aggregators against the current visitor’s browser history and adds a targeted feed subscription button for increased conversion. A Google Reader user will see an “Add to Google button” and a Netvibes user will see an “Add to Netvibes” button without cluttering up the interface. I insert direct links to each site’s feed handlers to help convert the current visitor into a long-term subscriber.

Once I match a particular service I could also check to see if the current visitor is already subscribed to my feed. I would simply need to run a second test against the data retrieval URL, such as feedid=1234, to match web traffic with subscriber numbers.

Visit my live example of link scanning popular online feed aggregators for a demo and the applicable code.

Social Bookmarks

I like to see my latest blog posts spread all over the web thanks to social bookmarking sites and other methods of content filtering and annotation. Most sites spray a group of tiny service icons near their blog posts and hope a visitor recognizes the 16 pixel square and takes action. Suck. There has to be a better way.

I can scan a current visitor’s browser history to determine an active presence on one or more bookmarking sites. Once I determine the current visitor is also a Digg user I can show live data from Digg.com to prompt a specific action such as submitting a story or voting for content. I can create a much better user experience for 3 services I know my visitor actively uses instead of spraying 50 sites across the page.

Visit my live example of link scanning popular social bookmarking sites for a demo and the applicable code.

OpenID providers

OpenID is an increasingly popular single sign-on method and centralized identity service. OpenID lets a member of your site sign-on using a username and password from a growing list of OpenID providers including your instant messenger, web portal, blog host, or telephone company account. Visitors signing up for your site or service shouldn’t have to know anything about OpenID, federated identities, or other geeky things, but should be able to easily discover they can sign-in with a service they already use and trust every day.

I can scan a list of sign-in endpoints for a list of OpenID providers and only present my site visitor with options actually relevant to their everyday web usage. Prompting a user to sign-in to your service with their WordPress.com account should be much more effective than an input field sporting an OpenID icon. Link scanning for active usage should increase new member sign-ups, reduce support costs due to yet another username and password, and make your members happy.

Visit my live example of link scanning current OpenID providers for a demo and applicable code.

Mapping services

Online mapping services have changed the way we interact with location data. Need to get to 123 Main Street? Not a problem, I’ll just send that data over to your favorite mapping service to help you find your way.

I can scan a visitor’s browser history to determine their favorite mapping service. Perhaps she is most comfortable with MapQuest, Google Maps, or Yahoo. Or maybe she uses a Garmin GPS unit and would prefer a direct sync with that specialized service. Determining my visitors’ favorite mapping tool helps me deliver a valuable visualization or link I know they prefer.

Visit my live example of link scanning map API providers for a demo and applicable code.

Summary

Websites should take advantage of the full capabilities of modern browsers to deliver a compelling user experience. Built-in capabilities such as XMLHttpRequest took years of implementation before finding its asynchronous groove in data-heavy websites. I hope we can similarly probe other latent useful features to improve the social web through more personalized and responsive experiences.

I have been the browser history of my website visitors for the past few months to gracefully enhance adding my Atom feed to their favorite feed reader. Easily recognized branding such as “Add to My Yahoo” has yielded much higher conversion rates than a simple Atom link with a minimal effect on page load performance. Dynamically checking for active usage of 50 or so aggregators allows me to extend my total test list and promote an obscure tool that might never make the cut for permanent on-screen real estate.

How will your site utilize your visitor’s browser history for a more custom user experience? How will you connect data in new ways once you have concrete knowledge of the new feature developments that will be most useful to your visitors’ online lifestyle?

Data interchange for the social web

Data portability is only useful if outside systems can comprehend the exported data. Well-described and interoperable data sets open new possibilities for context-aware social applications, importing your friends, photos, or genetic markup from an existing system into your current tool of choice. In this post I will discuss website best practices for exporting portable, descriptive data sets in the name of data portability. This post builds upon user authorization concepts covered in my last post.

Expressing data between two unrelated systems is difficult at best. You need a shared set of vocabulary to explain even the basic data points (time, person, etc.). Good data exports will want to represent as much data as possible with the least probable data loss.

NASA launched the Voyager 1 spacecraft into space in September 1977 with a set of golden records onboard. These records communicate small pieces of human knowledge to any intelligent life that may discover our small explorer. The graphic above is humanity’s attempt at data interoperability, teaching alien explorers the proper positioning of an included stylus over a record rotating once every 3.6 seconds (time is expressed as the fundamental transition of the hydrogen atom). Thankfully web developers do not have to worry about interoperability with so many unknown measures, but your data could just as easily lost and never played back for other worlds to hear.

Identify exportable data

The first step in data export is identifying the unique pieces of information you would like to package and ship outside your walls. What information might be useful to a user seeking to backup or otherwise export his or her data? How would you like to import such data back into your own website?

Pictured above is a list of messages stored in Gmail. One message is part of a continuing conversation or thread, another message is flagged, and two messages have custom labels. A typical e-mail system might just export a list of raw messages but could possibly lose key data such as a flagged state or labels/tags.

Research existing data standards

Data interoperability is not a new concept and your current challenges may be easily solved by existing certified and de-facto standards. Standards increase the chances your data will be consumed, processed, and understood by others. You could invent an entirely new dialect and vocabulary to describe your information but you will be much more successful at disseminating data if you are easily interpreted.

Standards organizations have spent years analyzing the essential elements and interoperability requirements of many common forms of data. Below are just a few standard data formats for elements of the social web.

People, Places, and Things: vCard; xNAL; KML; LDAP
Events: iCalendar
News articles: Atom Syndication Format; News Industry Text Format
Human DNA: NCBI homo sapien genome build 36.2, FASTA.

Each data markup has a specific set of required data intended for a specific audience or interpreter. Google Maps prefers a feed of business listings and locations in xNAL while Google Earth prefers KML for example. Bloggers output news articles in Atom for consumption by a specific set of tools, while mainstream publications mark up their stories in a news industry format for increased granularity. Some formats may not be applicable if your product does not store all the required types of data (i.e. you know their name but not their hometown). Your company will need to select a target output format based on expected external use and how your information might map onto a format’s required elements.

Extend where appropriate

Each format supports extended namespaces for custom data not covered by the base vocabulary. A member’s favorite food or soccer club is not an essential component of an international standards body but can easily be extended with your own custom namespace where appropriate.

The same rules of data loss apply to custom namespaces: custom definitions are more likely to be missed while common namespaces are more easily understood. Extended namespaces may already be in active use by a big company or a coalition, increasing your chances of data visibility. An AOL Instant Messenger screenname is defined as “X-AIM” in a vCard context for example, where the X- represents an extension element.

Summary

Data portability and interoperability on the social web continues to be a hot topic. While there are PR benefits for first-movers I expect there will not be widespread adoption until portable data has a remote consumer. Startups with limited resources will need to see a possible consuming service for their exported data before carving out part of their product cycle for the new feature. I think data portability is a great project for this summer’s interns, providing deep exposure to data complexity and the industry as a whole while balancing proper authentication and privacy concerns.

Data Portability, Authentication, and Authorization

The social web is booming, signing up new users and generating new pieces of unique content at a steady clip. A recurring theme of the social web is “data portability,” the ability to change providers without leaving behind accumulated contacts and content. Most nodes of the social web agree data portability is a good thing, but the exact process of authentication, authorization, and transport of a given user and his or her data is still up in the air. In this post I will take a deeper look at the current best practices of the social Web from the point of view of its major data hubs. We will take a detailed look at the right and wrong ways to request user data from social hubs large and small, and outline some action items for developers and business people interested in data portability and interoperability done right.

General issues

Friends, photographs, and other objects of meaning are essential parts of the social web. We’re much more inclined to physically move from one city to the next if our friends, furniture, and clothes come along with us. The interconnectedness of the digitized social web makes the moving process much simpler: we can lift friends from one location into another, clone your digital photographs, and match your blog or diary entries to the structure of your new social home. Each of these digital movers represent what we generally call “social network portability” or, more generically, “data portability.”

Social networks accelerate interactions and your general sense of happiness in your new home through automated pieces of software designed to help you move data, or simply mine its content, from some of the most popular sites and services on the Web. These access paths are roughly equivalent to a new physical location setting up easy transit routes between some of the largest cities to help fuel new growth.

Your e-mail inbox is currently the most popular way to construct social context in an entirely new location. Site such as Facebook request your login credentials for a large online hub such as Google, Yahoo!, or Microsoft to impersonate you on each network and read all data which may be relevant to the social network such as a list of e-mail correspondents. Every day social network users hand over working user names and passwords for other websites and hope the new service does the right thing with such sensitive information. Trusted brands don’t like external sites collecting sensitive login information from their users and want to prevent a repeat of the phishing scams faced by PayPal and others. There is a better way to request sensitive data on behalf of a user, limited to a specific task, and with established forms of trust and identity.

Use the front door

Google, Yahoo!, and Microsoft all support web-based authentication by third parties requesting data on behalf of an active user. The Google Authentication Proxy interface (AuthSub), Yahoo! Browser-Based Authentication, and Microsoft’s Windows Live ID Web Authentication issue a security token to third-party requesters once a user has approved data access. This token can allow one-time or repeated access and is the preferred method of interaction for today’s large data hubs. The OAuth project is a similar concept to web-based third-party authentication systems of the large Internet portals, and may be a common form of third-party access in the future.

Supporting websites provide limited account access to a registered entity after receiving authorization from a specific user. The user can typically view a list of previously authorized third parties and revoke access at any time. The third-party retains access to a particular account even after the user changes his or her password.

Imagine if you could give your local grocery store access to just your kitchen, but not hand over the keys to your entire house. A delivery person would be automatically scanned upon arrival, compared against a registry, and granted access to the kitchen if yo previously assigned them access. You could revoke their access to your kitchen at any time, but they never have access to your jewelry box or other non-essential functions within your house.

Identify yourself

Third-party applications requesting access should first register with the target service for accurate identification and tracking. Applications receive an identification key for future communications connected to a base set of permissions required to accomplish your task (e.g. read only or read/write). A registered application can complete a few extra steps for added user trust and less user-facing warning messages.

State your intentions

Your application or web service should focus on a specific task such as retrieving a list of contacts from an online address book. Your authentication requests should specify this scope and required permissions (e.g. read only) when you request a user’s permission to access his or her data.

An application declaring scope lets users know you are only interested in a single scan of their e-mail and you will not have access to their credit card preferences, stored home address, or the ability to send e-mails from their account. Not requesting full account access in the form of a username and a password creates better trust from the user and the user’s existing service(s).

Provide secure transport

How will you transport my user’s data back to your servers? Did you bring an armored car with your company’s logo prominently displayed on the side or will my data sit in the back of your borrowed pick-up truck? Requesting applications should transport user data over secure communications channels to prevent eavesdropping and forged messages. Registered and verified secured communications will result in less user-facing warning messages of mistrust, and secure certificates are relatively inexpensive. Large portals such as Google or Microsoft will bump your communications (and privileges) to mutual authentication if you are capable.

Register an SSL/TLS certificate for your website to enable secure transport and further identify yourself. Certificates vary in cost and complexity from a free self-signed cert to paid certificates from a major provider with extended validation and server-gated cryptography. Google and Yahoo! use 256-bit keys. Windows Live and Facebook use 128-bit keys.

Summary

Data authorization is the first step in data portability. Emerging standards such as OAuth combined with established access methods from Internet giants provide specialized access for third-parties acting on behalf of another user. Sites interested in importing data from other services should take note of these best practices and prepare their services for intelligent interchange.

Upgrade your Google Analytics tracker

Google released a new version of its Google Analytics tracking code in December after a two-month limited beta. The new Google Analytics tracker is a complete rewrite of JavaScript inherited from the Urchin acquisition in 2005 and the first time the two products have been officially decoupled. The existing version of Google Analytics tracker, urchin.js, has been deprecated but should continue to function until the end of 2008. Google will only roll out new features on the new ga.js tracker. If you currently track website statistics using Google Analytics you should upgrade your templates to take advantage of the new libraries.

What changed?

The new Google Analytics tracker supports proper JavaScript namespacing and more intuitive configuration methods (e.g. _setDomainName instead of _udn). My tests show about a 100 ms faster execution even with a 24% increase (1514 bytes) in file size (ga.js is also minified).

The new tracking code makes advanced features a lot more accessible. You can now track a page on multiple Google Analytics accounts, which should help user generated content sites integrate their author’s Google Analytics IDs alongside the company’s own tracking account. The new event tracker lets you group a set of on-page related actions such as clicking a drop-down menu or typing a search query (very useful for widgets). Ecommerce tracking is now a lot more readable. You can read about all the tracker changes in the Google Analytics migration guide PDF.

Implementation

Switching your site tracker is pretty simple. Trackers are now created as objects and configured before the page is tracked.

<script type="text/javascript" src="http://www.google-analytics.com/ga.js"></script>
<script type="text/javascript">
var pageTracker=_gat._getTracker('UA-XXXXXX-X');
pageTracker._initData();
pageTracker._trackPageview();
</script>

That’s it. You are now running the new Google Analytics tracker. You’ll need to swap in your Analytics account and profile IDs, which should be pretty easy to spot in your existing code.

Summary

Google Analytics tracking code is completely rewritten for faster on-page behavior that plays well with others. The old tracker will be deprecated within a year, and new features are only available to users running the new code. Existing Google Analytics users should swap out their tracking code to take full advantage of this free stats tool.

FeedDemon and NetNewsWire are now free

NewsGator is giving away desktop feed readers FeedDemon, NetNewsWire, and NewsGator Inbox. The company hopes to regain any loss of revenue from its desktop business with new enterprise sales leads and better attention metadata. The company announced the change in pricing in a press release today and a blog post by founder Greg Reinacker.

NewsGator’s desktop feed readers previously cost about $30 each and faced some commoditization through feed reading software bundled with modern operating systems, office suites, or competitive open-source solutions. Windows client FeedDemon needs to compete with feed reading capabilities built-in to Windows Vista and Internet Explorer 7 or open-source clients such as RSS Bandit. Apple client NetNewsWire competes with Mail.app in Leopard and open-source freeware such as Vienna. NewsGator Inbox competes directly with Outlook 2007. Online competitors such as Google Reader are starting to deliver desktop-like speeds in an always up-to-date, always available model.

NewsGator differentiates its desktop client offerings from the competition through the NewsGator Online hub. Each client filters its requests for feed data through the centralized online service and synchronizes each user’s list of subscriptions, read/unread items, shared snippets, and more. NewsGator plans to use the extended user base available via its free clients to fine-tune relevancy and other metrics available through uniquely identifiable attention data.

[B]y using your data, in combination with aggregate data from other users, we can deliver a better experience for everyone. And that’s a good thing – both for us and for you.

Each desktop application can also sync with a local activity hub NewsGator is selling within enterprises. They hope free tools will infiltrate corporate America to generate new sales leads and internal advocates for bigger licensing fees.

Summary

NewsGator’s move to free is an interesting risk for a changing business. Competitors such as Attensa do not have a similar strength in the desktop client space, and NewsGator will continue to worry about Microsoft shipping an update to SharePoint that could shake up their enterprise market. In the mean time thousands of consumers will be able to download quality software for free, and the small desktop clients can continue developing cool new features funded by enterprise usage.

Update: Nick Bradbury, creator of FeedDemon, shares his thoughts on the freebies on his blog.

Google processes over 20 petabytes of data per day

Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. The average MapReduce job ran across approximately 400 machines in September 2007, crunching approximately 11,000 machine years in a single month. These are just some of the facts about the search giant’s computational processing infrastructure revealed in an ACM paper by Google Fellows Jeffrey Dean and Sanjay Ghemawat.

Twenty petabytes (20,000 terabytes) per day is a tremendous amount of data processing and a key contributor to Google’s continued market dominance. Competing search storage and processing systems at Microsoft (Dyrad) and Yahoo! (Hadoop) are still playing catch-up to Google’s suite of GFS, MapReduce, and BigTable.

MapReduce statistics for different months
	Aug. 2004	Mar. 2006	Sep. 2007
Number of jobs (1000s)	29	171	2,217
Avg. completion time (secs)	634	874	395
Machine years used	217	2,002	11,081
`map` input data (TB)	3,288	52,254	403,152
`map` output data (TB)	758	6,743	34,774
`reduce` output data (TB)	193	2,970	14,018
Avg. machines per job	157	268	394
Unique implementations
`map`	395	1,958	4,083
`reduce`	269	1,208	2,418

Google processes its data on a standard machine cluster node consisting two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link. This type of machine costs approximately $2400 each through providers such as Penguin Computing or Dell or approximately $900 a month through a managed hosting provider such as Verio (for startup comparisons).

The average MapReduce job runs across a $1 million hardware cluster, not including bandwidth fees, datacenter costs, or staffing.

Summary

The January 2008 MapReduce paper provides new insights into Google’s hardware and software crunching processing tens of petabytes of data per day. Google converted its search indexing systems to the MapReduce system in 2003, and currently processes over 20 terabytes of raw web data. It’s some fascinating large-scale processing data that makes your head spin and appreciate the years of distributed computing fine-tuning applied to today’s large problems.

MacSB Macworld dinner

I am once again organizing a dinner gathering during the Macworld conference for Mac small business owners and developers. This year’s MacSB Macworld dinner will take place on Tuesday, January 15, starting at 6.p.m. at Chaat Cafe in San Francisco. We will discuss the latest keynote announcements, plan future iPhone applications, and eat Indian food.

Chaat Café is located at 320 3rd Street (corner of 3rd and Folsom) in downtown San Francisco, one block from Macworld and the Moscone conference center. The restaurant has free Wi-Fi and power outlets near some tables, so bring your laptop to show off your latest creations. You will order food and drink individually near the restaurant entrance and pay only for what you personally eat or drink (typically less than $10). Metered parking is free after 6 p.m. or you may park in the building’s parking garage (enter on 3rd Street) with two hours of validated parking if you choose to drive.

Past MacSB gatherings in 2007 and in 2006 have been good opportunities to reflect on the changing Mac software market, share tips with like-minded small business owners, or attend group therapy as Apple just annihilated your product with their own software release. Mac fans are welcome to come out and meet the independent developers of some of their favorite apps.

I have warned the restaurant staff to expect a big crowd but you can help make things run a bit smoother by leaving an RSVP in the comments of this post or on Upcoming.org.

Add to iCal

What is Google App Engine?

Freemium hosting model

The team

Feature limitations

Summary

Cache persistence

Optimal component size

Download speed

Summary

Evaluate links on a page

Test a known set of links

Live demos and examples

Online aggregators

Social Bookmarks

OpenID providers

Mapping services

Summary

Identify exportable data

Research existing data standards

Extend where appropriate

Summary

General issues

Use the front door

Identify yourself

State your intentions

Provide secure transport

Summary

What changed?

Implementation

Summary

Summary

Summary

What is a Facebook Page?

The Fakester Problem

Permission-based inclusion

Local business authorization is expensive or impossible

Big brand, many managers

The techie friend

Protecting your company and brand

Summary