Friday, January 25, 2008

Response to MapReduce II.

This blog post is in reply to, and in specific points out fundamental flaws in David DeWitt & Michael Stonebraker's understanding of the MapReduce paradigm. My intent is purely to inform the public on what is really possible with MapReduce, particularly when paired with suitable highly scalable data storage such as DHTs (distributed hash tables).

I'm an experienced developer and have used Oracle, MySQL, and SqlServer professionally. Each filled a relational niche well. This post is not in any way against RDBMSs.

In response to the blog titled "MapReduce II", section No. 1, DeWitt and Stonebraker claim that a chain of three MapReduces would be required to answer the following question from two tables:
Table 1: Rankings (pageUrl, pageRank)
Table 2: UserVisits (sourceIpAddr, destinationUrl, date, adRevenue)
Question: What IP addr generated the most ad revenue during some specific week, and what was the average rank of the pages visited.

Aside from this being two questions, something typically done by RDBMSs as the cost has already been amortized over the large join, it is also an ambiguous question. I'll assume that the request for average rank of pages visited means "pages visited by that ip addr during that week". This is also a (likely) useless query, as many IP addresses are shared by large proxies such as those by AOL, or other ISPs.

Assuming for the moment (big assumption, not the entire answer) that instead of IP addresses in Table 2 we had some form of UserId, I'll say we can do this in a highly scalable way in a single MapReduce job. How:

Phase 1 (only Map Reduce, builds a useful data structure)
Map: scan all UserVisits records and map them to a key of [week-id/begin-date]:UserId
This will cause all user visits to be bucketed by week (begin date), then by revenue.

Reduce: for each set of input, sum the revenue and build a list of destinationUrls. Store the result in a highly scalable datastore that is indexed by key. There are several of these, but basically any large scale DHT with key indexing wil do. The output key will be of the form:[week-id]:RevenueDescending:UserId. Storing this result in a way addressable by this nicely constructed key allows for a single limited scan to find the max-revenue user with a single query to this auxiliary data structure.

After the Map Reduce, you can now find out all the information you need with relatively extremely cheap computation:

Now, read the list of URLs found by the query enabled by the Reduce: I.e. Do a limited (one record only) scan over the keys starting with key: "[week-you-want-id]:", and the first and only record returned will contain both the UserId of interest (answering the first question) and the entire list of destinationUrls for this user. The Aha! moment for some is to realize that the typical, and even a spammer/bot UserId will almost certainly have a very small number of UserVisits in any given week (say under 100,000). With this assumption, you can now simply iterate over the UserVisits list for this week-user bucket, and visit the Rankings table for each Url. Assuming that table is in an efficient-lookup DHT, you are now essentially done (most users of anything, even Facebook or MySpace probably have way, way under 100,000 page views per week).

If for some reason you don't agree with my 100,000 upper bound on page views per week, consider that with a two phase Map Reduce you can eliminate that step and make the final lookups O(1). This would add a phase before my Phase #1 above, that joins the two tables on URL. With the two phase approach, you can also completely reconcile having to support IP addresses (that may belong to AOL proxies and the like) in addition to UserIds. The reason, is that you now have the pageRank, and can compute the average pageRank per IP in the second phase Reducer easily. Now you simply store the average pageRank instead of the URL list, and the results become constant size per-record.

Finally, this Map Reduce has a major benefit (and cost!) above and beyond a single RDBMS query. It creates an auxiliary data result. This can be extremely useful. What if you wanted to chart the max revenue per week over time? With an RDBMS, assuming indexes were used to speed up the join and constrain the dataset by week, you would be looking at one (or many) costly queries. With the Map Reduce, the expensive computation is done once and is now available for future use. This approach of pre-computation isn't for everyone, but for high scalability it can be critical. You don't want to have to scan over these large tables often, if you can avoid it.

No comments: