January 2008 Archives

  1. Jan 29

    Data interchange for the social web

    Data portability is only useful if outside systems can comprehend the exported data. Well-described and interoperable data sets open new possibilities for context-aware social applications, importing your friends, photos, or genetic markup from an existing system into your current tool of choice. In this post I will discuss website best practices for exporting portable, descriptive data sets in the name of data portability. This post builds upon user authorization concepts covered in my last post.

    Expressing data between two unrelated systems is difficult at best. You need a shared vocabulary to explain even the basic data points (time, person, etc.). A good data export represents as much data as possible while minimizing the chance of data loss.

    Voyager golden record cover

    NASA launched the Voyager 1 spacecraft into space in September 1977 with a set of golden records onboard. These records communicate small pieces of human knowledge to any intelligent life that may discover our small explorer. The graphic above is humanity's attempt at data interoperability, teaching alien explorers the proper positioning of an included stylus over a record rotating once every 3.6 seconds (time is expressed in units of the fundamental transition of the hydrogen atom). Thankfully web developers do not have to worry about interoperability with so many unknown measures, but your data could just as easily be lost and never played back for other worlds to hear.

    Identify exportable data

    The first step in data export is identifying the unique pieces of information you would like to package and ship outside your walls. What information might be useful to a user seeking to back up or otherwise export his or her data? How would you like to import such data back into your own website?

    Google Mail message listing sample

    Pictured above is a list of messages stored in Gmail. One message is part of a continuing conversation or thread, another message is flagged, and two messages have custom labels. A typical e-mail system might just export a list of raw messages but could possibly lose key data such as a flagged state or labels/tags.
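
    A minimal sketch of an export record that keeps this state alongside each message. JSON is used purely for illustration, and the field names (threadId, flagged, labels) and sample messages are hypothetical rather than Gmail's actual export format.

```javascript
// Hypothetical message records; the field names are illustrative assumptions.
const inbox = [
  { id: "m1", threadId: "t1", subject: "Re: Dinner plans", flagged: false, labels: [] },
  { id: "m2", threadId: "t2", subject: "Quarterly report", flagged: true, labels: [] },
  { id: "m3", threadId: "t3", subject: "Flight confirmation", flagged: false, labels: ["travel"] },
  { id: "m4", threadId: "t4", subject: "Invoice 1042", flagged: false, labels: ["receipts", "work"] },
];

// Export every message with its thread, flag, and label state intact.
function exportMessages(messages) {
  const records = messages.map((m) => ({
    id: m.id,
    threadId: m.threadId,
    subject: m.subject,
    flagged: Boolean(m.flagged),
    labels: [...m.labels],
  }));
  return JSON.stringify({ version: 1, messages: records }, null, 2);
}

// Import the same structure, defaulting any state a simpler export may lack.
function importMessages(json) {
  const parsed = JSON.parse(json);
  return parsed.messages.map((m) => ({
    ...m,
    flagged: m.flagged === true,
    labels: Array.isArray(m.labels) ? m.labels : [],
  }));
}

const roundTripped = importMessages(exportMessages(inbox));
```

    An export that dropped the flagged and labels fields would still import cleanly here, which is exactly how this kind of data loss goes unnoticed.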

    Research existing data standards

    Data interoperability is not a new concept, and your current challenges may be easily solved by existing certified and de facto standards. Standards increase the chances your data will be consumed, processed, and understood by others. You could invent an entirely new dialect and vocabulary to describe your information, but you will be much more successful at disseminating data if it is easily interpreted.

    Standards organizations have spent years analyzing the essential elements and interoperability requirements of many common forms of data. Below are just a few standard data formats for elements of the social web.

    People, Places, and Things:
        vCard
        xNAL
        KML
        LDAP
    Events:
        iCalendar
    News articles:
        Atom Syndication Format
        News Industry Text Format
    Human DNA:
        NCBI Homo sapiens genome build 36.2, FASTA

    Each data markup has a specific set of required data intended for a specific audience or interpreter. Google Maps prefers a feed of business listings and locations in xNAL, for example, while Google Earth prefers KML. Bloggers output news articles in Atom for consumption by a specific set of tools, while mainstream publications mark up their stories in a news industry format for increased granularity. Some formats may not be applicable if your product does not store all the required types of data (e.g. you know a member's name but not his or her hometown). Your company will need to select a target output format based on expected external use and how your information might map onto a format's required elements.
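
    One way to test that mapping is to list a format's required elements and check which ones your records cannot supply. The sketch below assumes the required elements as I understand the specifications (FN and N for vCard 3.0, and id, title, and updated for an Atom entry); the lowercase keys, the missingElements helper, and the sample member are hypothetical.

```javascript
// Required elements per format, to the best of my reading of the specs:
// vCard 3.0 requires FN and N; an Atom entry requires id, title, and updated.
const requiredElements = {
  vcard: ["fn", "n"],
  atomEntry: ["id", "title", "updated"],
};

// Return the required elements a record cannot supply for a given format.
function missingElements(format, record) {
  const required = requiredElements[format];
  if (!required) {
    throw new Error("Unknown format: " + format);
  }
  return required.filter((key) => {
    const value = record[key];
    return value === undefined || value === null || value === "";
  });
}

// Hypothetical member record: a display name is stored, a structured name is not.
const member = { fn: "Jane Doe" };
const gaps = missingElements("vcard", member);

console.log(gaps); // the elements you would need to collect or derive before exporting
```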

    Extend where appropriate

    Each format supports extended namespaces for custom data not covered by the base vocabulary. A member's favorite food or soccer club is unlikely to be part of an international standard's core vocabulary, but you can describe it in your own custom namespace where appropriate.

    The same rules of data loss apply to custom namespaces: custom definitions are more likely to be missed while common namespaces are more easily understood. Extended namespaces may already be in active use by a big company or a coalition, increasing your chances of data visibility. An AOL Instant Messenger screen name is defined as "X-AIM" in a vCard context, for example, where the X- prefix marks an extension element.
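
    A short sketch of what such an export might look like: a vCard 3.0 card carrying the widely used X-AIM extension next to a made-up one. The buildVCard helper, the sample person, and the X-EXAMPLE-FAVORITE-FOOD property are hypothetical; only X-AIM comes from the text above.

```javascript
// Escape characters that have special meaning inside vCard text values.
function escapeValue(text) {
  return String(text)
    .replace(/\\/g, "\\\\")
    .replace(/\n/g, "\\n")
    .replace(/,/g, "\\,")
    .replace(/;/g, "\\;");
}

// Build a small vCard 3.0 card; lines end with CRLF.
function buildVCard(person) {
  const lines = [
    "BEGIN:VCARD",
    "VERSION:3.0",
    "N:" + escapeValue(person.familyName) + ";" + escapeValue(person.givenName) + ";;;",
    "FN:" + escapeValue(person.givenName + " " + person.familyName),
  ];
  if (person.aim) {
    // Common extension property: likely to be understood by other consumers.
    lines.push("X-AIM:" + escapeValue(person.aim));
  }
  if (person.favoriteFood) {
    // Made-up extension property: likely to be ignored by other consumers.
    lines.push("X-EXAMPLE-FAVORITE-FOOD:" + escapeValue(person.favoriteFood));
  }
  lines.push("END:VCARD");
  return lines.join("\r\n") + "\r\n";
}

const card = buildVCard({
  givenName: "Jane",
  familyName: "Doe",
  aim: "janedoe2008",
  favoriteFood: "pad thai; extra spicy",
});

console.log(card);
```

    A consumer that has never seen the made-up property can skip it and still read the name and screen name, which is the trade-off described above.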

    Summary

    Data portability and interoperability on the social web continue to be hot topics. While there are PR benefits for first movers, I expect there will not be widespread adoption until portable data has a remote consumer. Startups with limited resources will need to see a possible consuming service for their exported data before carving out part of their product cycle for the new feature. I think data portability is a great project for this summer's interns, providing deep exposure to data complexity and the industry as a whole while balancing proper authentication and privacy concerns.

  2. Jan 21

    Data Portability, Authentication, and Authorization

    The social web is booming, signing up new users and generating new pieces of unique content at a steady clip. A recurring theme of the social web is "data portability," the ability to change providers without leaving behind accumulated contacts and content. Most nodes of the social web agree data portability is a good thing, but the exact process of authentication, authorization, and transport of a given user and his or her data is still up in the air. In this post I will take a deeper look at the current best practices of the social Web from the point of view of its major data hubs. We will take a detailed look at the right and wrong ways to request user data from social hubs large and small, and outline some action items for developers and business people interested in data portability and interoperability done right.

    General issues

    Friends, photographs, and other objects of meaning are essential parts of the social web. We're much more inclined to physically move from one city to the next if our friends, furniture, and clothes come along with us. The interconnectedness of the digitized social web makes the moving process much simpler: we can lift our friends from one location into another, clone our digital photographs, and match our blog or diary entries to the structure of our new social home. Each of these digital movers represents what we generally call "social network portability" or, more generically, "data portability."

    Social networks accelerate interactions and your general sense of happiness in your new home through automated pieces of software designed to help you move data, or simply mine its content, from some of the most popular sites and services on the Web. These access paths are roughly equivalent to a new physical location setting up easy transit routes between some of the largest cities to help fuel new growth.

    Facebook Friend Finder e-mail import

    Your e-mail inbox is currently the most popular way to construct social context in an entirely new location. Sites such as Facebook request your login credentials for a large online hub such as Google, Yahoo!, or Microsoft to impersonate you on each network and read all data that may be relevant to the social network, such as a list of e-mail correspondents. Every day social network users hand over working user names and passwords for other websites and hope the new service does the right thing with such sensitive information. Trusted brands don't like external sites collecting sensitive login information from their users and want to prevent a repeat of the phishing scams faced by PayPal and others. There is a better way to request sensitive data on behalf of a user, limited to a specific task, and with established forms of trust and identity.

    1. Use the front door
    2. Identify yourself
    3. State your intentions
    4. Provide secure transport

    Use the front door

    Google, Yahoo!, and Microsoft all support web-based authentication by third parties requesting data on behalf of an active user. The Google Authentication Proxy interface (AuthSub), Yahoo! Browser-Based Authentication, and Microsoft's Windows Live ID Web Authentication issue a security token to third-party requesters once a user has approved data access. This token can allow one-time or repeated access and is the preferred method of interaction for today's large data hubs. The OAuth project is a similar concept to web-based third-party authentication systems of the large Internet portals, and may be a common form of third-party access in the future.

    Google Accounts Access example

    Supporting websites provide limited account access to a registered entity after receiving authorization from a specific user. The user can typically view a list of previously authorized third parties and revoke access at any time. The third party retains access to a particular account even after the user changes his or her password.

    Imagine if you could give your local grocery store access to just your kitchen, but not hand over the keys to your entire house. A delivery person would be automatically scanned upon arrival, compared against a registry, and granted access to the kitchen if you previously assigned them access. You could revoke their access to your kitchen at any time, but they never have access to your jewelry box or other non-essential functions within your house.
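
    The registry in this analogy can be sketched as a small in-memory model. It is not any provider's implementation: the AccessRegistry class, its method names, and the demo token generator are all hypothetical, and a real service would issue unguessable tokens from a secure random source.

```javascript
// Hypothetical in-memory model of scoped, revocable third-party access.
class AccessRegistry {
  constructor(generateToken) {
    // generateToken must return an unguessable string in real use.
    this.generateToken = generateToken;
    this.grants = new Map(); // token -> { userId, appId, scope }
  }

  // The user approves one application for one scope and a token is issued.
  authorize(userId, appId, scope) {
    const token = this.generateToken();
    this.grants.set(token, { userId, appId, scope });
    return token;
  }

  // A token opens only the scope it was granted for.
  isAllowed(token, scope) {
    const grant = this.grants.get(token);
    return grant !== undefined && grant.scope === scope;
  }

  // The list a user reviews when deciding which third parties to keep.
  listGrants(userId) {
    return [...this.grants.values()].filter((g) => g.userId === userId);
  }

  // Revocation removes the grant; the user's password is never involved.
  revoke(token) {
    return this.grants.delete(token);
  }
}

// Demo only: a predictable token generator keeps the example readable.
let counter = 0;
const registry = new AccessRegistry(() => "demo-token-" + (++counter));

const groceryToken = registry.authorize("alice", "grocery-delivery", "kitchen");
const kitchenBefore = registry.isAllowed(groceryToken, "kitchen");
const jewelryBefore = registry.isAllowed(groceryToken, "jewelry-box");
const grantsBefore = registry.listGrants("alice").length;

registry.revoke(groceryToken);
const kitchenAfter = registry.isAllowed(groceryToken, "kitchen");
```

    Because the grant is keyed by token rather than by password, changing the password leaves the grant untouched, which matches the behavior described above.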

    Identify yourself

    Third-party applications requesting access should first register with the target service for accurate identification and tracking. Each application receives an identification key for future communications, tied to the base set of permissions required to accomplish its task (e.g. read only or read/write). A registered application can complete a few extra steps for added user trust and fewer user-facing warning messages.

    State your intentions

    Your application or web service should focus on a specific task such as retrieving a list of contacts from an online address book. Your authentication requests should specify this scope and required permissions (e.g. read only) when you request a user's permission to access his or her data.

    Google services with Gmail highlighted

    An application declaring scope lets users know you are only interested in a single scan of their e-mail and you will not have access to their credit card preferences, stored home address, or the ability to send e-mails from their account. Not requesting full account access in the form of a username and a password creates better trust from the user and the user's existing service(s).
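
    As a rough illustration, a scoped request to Google's AuthSub might be assembled like this. The endpoint, the parameter names (next, scope, secure, session), and the contacts feed scope are written from my recollection of Google's AuthSub documentation and should be treated as assumptions to verify; the return URL and the buildAuthSubRequestUrl helper are hypothetical.

```javascript
// Build the URL a user is sent to in order to approve a scoped request.
// Endpoint and parameter names are assumptions recalled from the AuthSub docs.
function buildAuthSubRequestUrl({ next, scope, secure = false, session = false }) {
  const url = new URL("https://www.google.com/accounts/AuthSubRequest");
  url.searchParams.set("next", next);       // where the user returns with a token
  url.searchParams.set("scope", scope);     // the only data feed being requested
  url.searchParams.set("secure", secure ? "1" : "0");   // registered, signed requests
  url.searchParams.set("session", session ? "1" : "0"); // one-time vs. repeated access
  return url.toString();
}

const requestUrl = buildAuthSubRequestUrl({
  next: "https://social.example.com/import/contacts", // hypothetical return URL
  scope: "http://www.google.com/m8/feeds/",            // assumed contacts feed scope
  secure: false,
  session: false, // a single scan of the address book needs only a one-time token
});

console.log(requestUrl);
```

    The request names a single feed and a one-time token, so nothing in it could be used to read mail, send messages, or reach billing details.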

    Provide secure transport

    Armored Truck

    How will you transport my users' data back to your servers? Did you bring an armored car with your company's logo prominently displayed on the side, or will my data sit in the back of your borrowed pick-up truck? Requesting applications should transport user data over secure communications channels to prevent eavesdropping and forged messages. Registered and verified secure communications will result in fewer user-facing warning messages of mistrust, and secure certificates are relatively inexpensive. Large portals such as Google or Microsoft will bump your communications (and privileges) to mutual authentication if you are capable.

    Twitter SSL certificate Firefox view

    Register an SSL/TLS certificate for your website to enable secure transport and further identify yourself. Certificates vary in cost and complexity from a free self-signed cert to paid certificates from a major provider with extended validation and server-gated cryptography. Google and Yahoo! use 256-bit keys. Windows Live and Facebook use 128-bit keys.

    Summary

    Data authorization is the first step in data portability. Emerging standards such as OAuth combined with established access methods from Internet giants provide specialized access for third parties acting on behalf of another user. Sites interested in importing data from other services should take note of these best practices and prepare their services for intelligent interchange.

  3. Jan 17

    Upgrade your Google Analytics tracker

    Google Analytics logo

    Google released a new version of its Google Analytics tracking code in December after a two-month limited beta. The new Google Analytics tracker is a complete rewrite of JavaScript inherited from the Urchin acquisition in 2005 and the first time the two products have been officially decoupled. The existing version of Google Analytics tracker, urchin.js, has been deprecated but should continue to function until the end of 2008. Google will only roll out new features on the new ga.js tracker. If you currently track website statistics using Google Analytics you should upgrade your templates to take advantage of the new libraries.

    What changed?

    The new Google Analytics tracker supports proper JavaScript namespacing and more intuitive configuration methods (e.g. _setDomainName instead of _udn). My tests show about a 100 ms faster execution even with a 24% increase (1514 bytes) in file size (ga.js is also minified).

    The new tracking code makes advanced features a lot more accessible. You can now track a page on multiple Google Analytics accounts, which should help user-generated content sites integrate their authors' Google Analytics IDs alongside the company's own tracking account. The new event tracker lets you group a set of related on-page actions such as clicking a drop-down menu or typing a search query (very useful for widgets). Ecommerce tracking is now a lot more readable. You can read about all the tracker changes in the Google Analytics migration guide PDF.

    Implementation

    Switching your site tracker is pretty simple. Trackers are now created as objects and configured before the page is tracked.

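    <script type="text/javascript" src="http://www.google-analytics.com/ga.js"></script>
    <script type="text/javascript">
    // Create a tracker object for your account; replace UA-XXXXXX-X with your own ID.
    var pageTracker = _gat._getTracker('UA-XXXXXX-X');
    // Initialize the tracker, then record this page view.
    pageTracker._initData();
    pageTracker._trackPageview();
    </script>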
    

    That's it. You are now running the new Google Analytics tracker. You'll need to swap in your Analytics account and profile IDs, which should be pretty easy to spot in your existing code.
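
    The multi-account tracking mentioned under "What changed?" might be wired up along these lines. The sketch uses only the three calls shown in the snippet above; the trackOnAccounts helper and the account IDs are hypothetical, and the tracker factory is passed in as a parameter so the function can also be exercised outside a browser.

```javascript
// Create one tracker per account ID and record the same page view on each.
// "gat" is the global _gat object that ga.js defines once it has loaded.
function trackOnAccounts(gat, accountIds) {
  return accountIds.map(function (accountId) {
    var tracker = gat._getTracker(accountId);
    tracker._initData();
    tracker._trackPageview();
    return tracker;
  });
}

// In a page that has already loaded ga.js, track for the site and the author.
if (typeof _gat !== "undefined") {
  trackOnAccounts(_gat, ["UA-XXXXXX-X", "UA-YYYYYY-Y"]);
}
```

    Keeping each account in its own tracker object is what the new namespacing makes possible; the old urchin.js globals allowed only one configuration per page.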

    Summary

    Google Analytics tracking code has been completely rewritten for faster on-page behavior that plays well with others. The old tracker is deprecated and is only expected to keep working through the end of 2008, and new features are only available to users running the new code. Existing Google Analytics users should swap out their tracking code to take full advantage of this free stats tool.

  4. Jan 09

    FeedDemon and NetNewsWire are now free

    NewsGator is giving away desktop feed readers FeedDemon, NetNewsWire, and NewsGator Inbox. The company hopes to regain any loss of revenue from its desktop business with new enterprise sales leads and better attention metadata. The company announced the change in pricing in a press release today and a blog post by founder Greg Reinacker.

    NewsGator's desktop feed readers previously cost about $30 each and faced some commoditization through feed reading software bundled with modern operating systems, office suites, or competitive open-source solutions. Windows client FeedDemon needs to compete with feed reading capabilities built into Windows Vista and Internet Explorer 7 or open-source clients such as RSS Bandit. Apple client NetNewsWire competes with Mail.app in Leopard and open-source freeware such as Vienna. NewsGator Inbox competes directly with Outlook 2007. Online competitors such as Google Reader are starting to deliver desktop-like speeds in an always up-to-date, always available model.

    NewsGator differentiates its desktop client offerings from the competition through the NewsGator Online hub. Each client filters its requests for feed data through the centralized online service and synchronizes each user's list of subscriptions, read/unread items, shared snippets, and more. NewsGator plans to use the extended user base available via its free clients to fine-tune relevancy and other metrics available through uniquely identifiable attention data.

    [B]y using your data, in combination with aggregate data from other users, we can deliver a better experience for everyone. And that’s a good thing - both for us and for you.

    Each desktop application can also sync with a local activity hub NewsGator is selling within enterprises. They hope free tools will infiltrate corporate America to generate new sales leads and internal advocates for bigger licensing fees.

    Summary

    NewsGator's move to free is an interesting risk for a changing business. Competitors such as Attensa do not have a similar strength in the desktop client space, and NewsGator will continue to worry about Microsoft shipping an update to SharePoint that could shake up their enterprise market. In the meantime thousands of consumers will be able to download quality software for free, and the small desktop clients can continue developing cool new features funded by enterprise usage.

    Update: Nick Bradbury, creator of FeedDemon, shares his thoughts on the freebies on his blog.

  5. Jan 08

    Google processes over 20 petabytes of data per day

    Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. The average MapReduce job ran across approximately 400 machines in September 2007, crunching approximately 11,000 machine years in a single month. These are just some of the facts about the search giant's computational processing infrastructure revealed in an ACM paper by Google Fellows Jeffrey Dean and Sanjay Ghemawat.

    Twenty petabytes (20,000 terabytes) per day is a tremendous amount of data processing and a key contributor to Google's continued market dominance. Competing search storage and processing systems at Microsoft (Dryad) and Yahoo! (Hadoop) are still playing catch-up to Google's suite of GFS, MapReduce, and BigTable.

    MapReduce statistics for different months

                                   Aug. 2004   Mar. 2006   Sep. 2007
    Number of jobs (1000s)                29         171       2,217
    Avg. completion time (secs)          634         874         395
    Machine years used                   217       2,002      11,081
    map input data (TB)                3,288      52,254     403,152
    map output data (TB)                 758       6,743      34,774
    reduce output data (TB)              193       2,970      14,018
    Avg. machines per job                157         268         394
    Unique implementations
      map                                395       1,958       4,083
      reduce                             269       1,208       2,418

    Google processes its data on a standard machine cluster node consisting of two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives, and a gigabit Ethernet link. Each machine of this type costs approximately $2,400 through providers such as Penguin Computing or Dell, or approximately $900 a month through a managed hosting provider such as Verio (for startup comparisons).

    The average MapReduce job runs across a $1 million hardware cluster, not including bandwidth fees, datacenter costs, or staffing.
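
    The arithmetic behind that estimate, plus a rough cross-check of the machine-years figure, is sketched below using only the September 2007 statistics and prices quoted in this post. The variable names are mine, and multiplying averages together only approximates the true total.

```javascript
// Figures for September 2007, copied from the statistics table in this post.
const sept2007 = {
  jobs: 2217 * 1000,
  avgCompletionSecs: 395,
  avgMachinesPerJob: 394,
  machineYearsReported: 11081,
};

// Hardware prices quoted in this post.
const purchasePricePerMachineUsd = 2400;
const hostedPricePerMachineUsdPerMonth = 900;

// Purchase cost of the machines behind an average job: 394 * 2,400 = 945,600.
const clusterPurchaseUsd = sept2007.avgMachinesPerJob * purchasePricePerMachineUsd;

// The same cluster rented from a managed host: 394 * 900 = 354,600 per month.
const clusterHostedUsdPerMonth =
  sept2007.avgMachinesPerJob * hostedPricePerMachineUsdPerMonth;

// Rough machine-years: jobs * seconds per job * machines per job, in years.
const SECONDS_PER_YEAR = 365 * 24 * 60 * 60;
const machineYearsEstimate =
  (sept2007.jobs * sept2007.avgCompletionSecs * sept2007.avgMachinesPerJob) /
  SECONDS_PER_YEAR;

console.log(clusterPurchaseUsd, clusterHostedUsdPerMonth, Math.round(machineYearsEstimate));
```

    The estimate lands within about two percent of the reported 11,081 machine years, so the table's averages are consistent with one another, and the purchase figure rounds to the $1 million quoted above.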

    Summary

    The January 2008 MapReduce paper provides new insights into the hardware and software Google uses to process tens of petabytes of data per day. Google converted its search indexing systems to MapReduce in 2003, and currently processes over 20 terabytes of raw web data. These are fascinating large-scale processing numbers that make your head spin and help you appreciate the years of distributed computing fine-tuning applied to today's large problems.

  6. Jan 03

    MacSB Macworld dinner

    I am once again organizing a dinner gathering during the Macworld conference for Mac small business owners and developers. This year's MacSB Macworld dinner will take place on Tuesday, January 15, starting at 6 p.m. at Chaat Café in San Francisco. We will discuss the latest keynote announcements, plan future iPhone applications, and eat Indian food.

    Chaat Cafe Google Maps

    Chaat Café is located at 320 3rd Street (corner of 3rd and Folsom) in downtown San Francisco, one block from Macworld and the Moscone conference center. The restaurant has free Wi-Fi and power outlets near some tables, so bring your laptop to show off your latest creations. You will order food and drink individually near the restaurant entrance and pay only for what you personally eat or drink (typically less than $10). Metered parking is free after 6 p.m. or you may park in the building's parking garage (enter on 3rd Street) with two hours of validated parking if you choose to drive.

    Yes we want to develop for iPhone

    Past MacSB gatherings in 2007 and in 2006 have been good opportunities to reflect on the changing Mac software market, share tips with like-minded small business owners, or attend group therapy as Apple just annihilated your product with their own software release. Mac fans are welcome to come out and meet the independent developers of some of their favorite apps.

    I have warned the restaurant staff to expect a big crowd but you can help make things run a bit smoother by leaving an RSVP in the comments of this post or on Upcoming.org.

Niall Kennedy is a web technologist in San Francisco, California in the United States. I am very interested in the world of...
