April 2009 Archives

  1. Apr14

    Google search referer changes

    Google will roll out a change to its search results pages later this week designed to better capture outbound clicks. Google search result pages will link to a gateway URL before delivering the visitor to his final destination. These gateway URLs will replace search result URLs exposed via the Referer HTTP header. Google announced the new gateway page on its Google Analytics blog, giving webmasters a few days to prepare for the change.

    What is changing?

    The Referer path for Google search results will change from /search to /url. It is still not clear which URL parameters from the search page will be passed through the gateway. The search term, q, is still preserved inside the sample URL provided by the Google Analytics blog.

    Before
    http://www.google[.sld].tld/search
    After
    http://www.google[.sld].tld/url

    Scripts, plugins, and helpers relying on a set Referer path for content highlighting or targeting will need to adjust their code as Google's change spreads throughout its data centers worldwide.
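
    A minimal sketch of such an adjustment, in Python, accepts both the old and the new path when extracting the search term. The function name and the loose host check are illustrative assumptions, and since it is still unclear which parameters the gateway passes along, q is treated as optional.

```python
from urllib.parse import urlparse, parse_qs

# Paths treated as Google search referrers: the old results page and the
# new gateway path described above.
GOOGLE_SEARCH_PATHS = {"/search", "/url"}


def google_search_term(referer):
    """Return the search term from a Google Referer URL, or None."""
    parts = urlparse(referer)
    host = (parts.hostname or "").lower()
    # Assumption: a loose prefix match standing in for
    # www.google[.sld].tld; a real filter should use a stricter host list.
    if not host.startswith("www.google."):
        return None
    if parts.path not in GOOGLE_SEARCH_PATHS:
        return None
    # q may be absent if the gateway does not pass it through.
    terms = parse_qs(parts.query).get("q")
    return terms[0] if terms else None
```

    Code that previously matched only /search can call a helper like this and fall back gracefully when the term is missing.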

    Why the change?

    Google is likely making this change to better track search actions and shield URL parameters from sites downstream. Gateway URLs dependably capture click data and reformat the information passed along to external sites.

    Search engines evaluate customer satisfaction based partly on outbound click behavior. Searchers who consistently click on the third search result may be sending Google a signal about that content's authority for a search term and therefore influencing the ranking algorithms. Traditionally such an action would be measured with a JavaScript onclick event added to the link to pass a signal back to the search engine's servers before taking the searcher to his destination. JavaScript tracking does not work on every client: it misses clients that access search results with JavaScript turned off or unavailable (e.g. through Google's APIs or a feature phone).

    The search result page URL includes detailed information needed by Google to deliver the best possible result. A search might include a location from a GPS sensor, social context drawn from a group or custom search engine parameter, or other data that should not be exposed to sites downstream. With the gateway, Google will only expose a few relevant parameters in the URLs included in a web browser's Referer headers.

    Summary

    The way your website interprets traffic from one of its top providers will change later this week. You will need to adjust scripts and check for updates to analytics software where appropriate. If you notice a huge drop in measured search referrals from Google, don't panic. Just make sure you are measuring the correct actions.

  2. Apr05

    Facebook's photo storage rewrite

    This week Facebook will complete its roll-out of a new photo storage system designed to reduce the social network's reliance on expensive proprietary solutions from NetApp and Akamai. The new large blob storage system, named Haystack, is a custom-built file system solution for the more than 850 million photos uploaded to the site each month (500 GB per day!). Jason Sobel, a former NetApp engineer, led Facebook's effort to design a more cost-effective and high-performance storage system for its unique needs. Robert Johnson, Facebook's Director of Engineering, mentioned the new storage system rollout in a Computerworld interview last week. Most of what we know about Haystack comes from a Stanford ACM presentation by Jason Sobel in June 2008. Haystack will allow Facebook to operate its massive photo archive from commodity hardware while reducing its dependence on CDNs in the United States.

    The old Facebook system

    [Figure: Facebook photo serving architecture, 2008]

    Facebook has two main types of photo storage: profile photos and photo libraries. Members upload photos to Facebook and treat the transaction as a digital archive, with very few deletions and intermittent reads. Profile photos are a per-member representation stored in multiple viewing sizes (150px, 75px, etc.). The past Facebook system relied heavily on CDNs from Akamai and Limelight to protect its origin servers from a barrage of expensive requests and to improve latency.

    Facebook profile photo access is accelerated by Cachr, an image server powered by evhttp with a memcached backing store. Cachr protects the file system from new requests for heavily-accessed files.

    The old photo storage system relied on a file handle cache placed in front of NetApp to quickly translate file name requests into an inode mapping. When a Facebook member deletes a photo, its index entry is removed but the file still exists within the backing file system. The file handle cache for Facebook photos is powered by lighttpd with a memcached storage layer to reduce load on the NetApp filers.

    No need for POSIX

    Facebook photographs are viewable by anyone in the world aware of the full asset URL. Each URL contains a profile ID, photo asset ID, requested size, and a magic hash to protect against brute-force access attempts.

    /[pvid]_[key]_[magic]_[size].jpg
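
    As an illustration only, a request path in that shape could be split into its four fields as follows. The field names are taken from the pattern above; the character classes, the sample values, and the PhotoRef type are assumptions, since the real encoding of each field is not given here.

```python
import re
from typing import NamedTuple, Optional

# Four underscore-separated fields followed by ".jpg", as in the pattern
# above. What characters each field may contain is an assumption.
PHOTO_PATH = re.compile(
    r"^/(?P<pvid>[^_/]+)_(?P<key>[^_/]+)_(?P<magic>[^_/]+)_(?P<size>[^_/.]+)\.jpg$"
)


class PhotoRef(NamedTuple):
    """Hypothetical container for the parsed URL fields."""
    pvid: str
    key: str
    magic: str
    size: str


def parse_photo_path(path: str) -> Optional[PhotoRef]:
    """Split a photo path into its fields, or return None if it does not match."""
    match = PHOTO_PATH.match(path)
    if match is None:
        return None
    return PhotoRef(**match.groupdict())
```

    A server would still need to verify the magic value before serving the file; that check is not shown because its construction is not described.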

    Traditional file systems follow the POSIX standard, which defines the metadata and access methods for each file. These file systems are designed for access control and accountability within a shared system. An Internet storage system whose files are written once and rarely deleted, with access granted to the world, has little need for such overhead. An inode in a POSIX-compliant file system typically records:

    • File length
    • Device ID
    • Storage block pointers
    • File owner
    • Group owner
    • Access rights for owner, group, and others: read, write, execute
    • Change time
    • Modification time
    • Last access time
    • Reference counts

    Only the first three items on that list matter to a photo store such as Facebook's. Its servers care where the file is located and its total length but have little concern for file owners, access rights, timestamps, or the possibility of linked references. The additional overhead of POSIX-compliant metadata storage and lookup on NetApp filers led to 3 disk I/O operations for each photo read. Facebook simply needed a fast blob store but was stuck inside a file system.

    Haystack file storage

    [Figure: Facebook Haystack diagram]

    Haystack stores photo data in 10 GB buckets with 1 MB of metadata for every GB stored. Metadata is guaranteed to be memory-resident, leading to only one disk seek for each photo. Haystack servers are built from commodity servers and disks assembled by Facebook to reduce costs associated with proprietary systems.
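
    Taken at face value, those two figures put the in-memory index at about 10 MB per bucket. A quick sketch of the arithmetic, assuming the ratio scales linearly with stored data; the function names are illustrative.

```python
BUCKET_SIZE_GB = 10      # bucket size quoted above
METADATA_MB_PER_GB = 1   # metadata ratio quoted above


def index_memory_mb(stored_gb):
    """Memory needed to keep the metadata for stored_gb of photos resident."""
    return stored_gb * METADATA_MB_PER_GB


def buckets_needed(stored_gb):
    """Number of buckets needed to hold stored_gb, rounded up."""
    return -(-stored_gb // BUCKET_SIZE_GB)  # ceiling division
```

    At the 500 GB per day quoted earlier, this works out to 50 new buckets and roughly 500 MB of additional index per day.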

    The Haystack index stores the metadata needed to find each needle, a single photo, within the haystack. Incoming requests for a given photo asset are interpreted as before, but now contain a direct reference to the storage offset containing the appropriate data.
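
    The general idea of an in-memory index over one large file can be sketched in a few lines of Python. This is not Facebook's implementation: the TinyHaystack class, its methods, and the (offset, length) index layout are assumptions chosen to show how a memory-resident index turns a read into a single seek.

```python
import os
import tempfile


class TinyHaystack:
    """Append-only blob file with an in-memory index of (offset, length)."""

    def __init__(self, path):
        # "a+b": every write is appended; reads may seek anywhere.
        self._file = open(path, "a+b")
        self._index = {}  # key -> (offset, length), held entirely in memory

    def put(self, key, data):
        """Append data to the file and record where it landed."""
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(data)
        self._file.flush()
        self._index[key] = (offset, len(data))

    def get(self, key):
        """Return the stored bytes for key, or None if the key is unknown."""
        entry = self._index.get(key)
        if entry is None:
            return None
        offset, length = entry
        self._file.seek(offset)  # the only seek needed for this read
        return self._file.read(length)

    def close(self):
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = TinyHaystack(os.path.join(tmp, "bucket.dat"))
        store.put("photo-1", b"first image bytes")
        store.put("photo-2", b"second image bytes")
        print(store.get("photo-2"))
        store.close()
```

    Because the lookup never touches the disk, the cost of a read is one seek plus the transfer of the photo itself, in contrast to the three I/O operations of the file system approach.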

    Cachr remains the first line of defense in front of Haystack lookups, quickly processing requests and loading images from memcached where appropriate. Haystack provides a fast and reliable file backing for these specialized requests.
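
    The read path described here follows a common cache-in-front-of-store pattern. A minimal sketch, assuming a plain dictionary as a stand-in for memcached and any callable as the backing store; the CachedPhotoReader class is hypothetical and does not reflect Cachr's internals.

```python
class CachedPhotoReader:
    """Serve reads from a cache first, falling back to a backing store."""

    def __init__(self, backing_get, cache=None):
        self._backing_get = backing_get  # callable: key -> bytes or None
        self._cache = {} if cache is None else cache  # stand-in for memcached
        self.backing_reads = 0  # counts how often the backing store is hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.backing_reads += 1
        data = self._backing_get(key)
        if data is not None:
            # Remember the result so repeat requests skip the backing store.
            self._cache[key] = data
        return data
```

    Heavily requested photos are then fetched from the backing store once and served from memory afterward.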

    Reduced CDN costs

    The high performance of Haystack, combined with new data center presence on the east and west coasts of the United States, reduces Facebook's reliance on costly CDNs. Facebook does not currently have the points of presence to match a specialist such as Akamai, but network latency plus Haystack's file access time should be low enough to cut CDN use in areas where Facebook already has data center assets. In markets such as Asia, where it has no foreseeable physical presence, Facebook can partner with specialized CDN operators to improve access times.

    Summary

    Facebook has invested in its own large blob storage solution to replace expensive proprietary offerings from NetApp and others. The new server structure should reduce Facebook's total cost per photo for both storage and delivery moving forward.

    Big companies don't always listen to the growing needs of application specialists such as Facebook. Yet you can always hire away their engineering talent to build you a new custom solution in-house, which is what Facebook has done.

    Facebook has hinted at releasing more details about Haystack later this month, which may include an open-source roadmap.

    Update April 30, 2009: Facebook officially announced Haystack and further details.

Niall Kennedy is a web technologist in San Francisco, California in the United States.