Scott Watermasysk

Still Learning to Code

Qizmt – MapReduce on Windows

A very interesting announcement by the folks at MySpace:

Today, we are open-sourcing Qizmt, an internally developed framework for distributed computation created by the Data Mining team here at MySpace. Qizmt can be used for many operations that require processing large amounts of data such as collaborative filtering for recommendations and analytics.

Contrary to what many of the tech pundits are reporting, this is not a recommendation engine. It is an implementation of MapReduce written for Windows.

This opens up a nice set of options that are generally not available to developers on Windows. Heck, even Bing appears to leverage Hadoop.

The release is listed as Alpha, but contains a nice set of features:

  • Rapidly develop mapreducer jobs in C#.Net
  • Easy Do-It-Yourself Installer
  • Built-in IDE/Debugger
      • Automatically colors heap allocations in red
      • Autocomplete for rapid mapreducer development
      • Step through and debug mapreducer jobs directly on target cluster
  • From any machine in a cluster:
      • Edit mapreducer jobs
      • Debug mapreducer jobs
      • Execute mapreducer jobs
      • Administer mapreducer jobs
  • Delta-only exchange option for Mapreduce jobs
  • Configurable data-redundancy/machine level failover
  • Easily add machines to a cluster to increase processing power and capacity
  • CAC (Cluster Assembly Cache) for exposing .Net DLLs to mapreduce jobs
  • Three kinds of jobs
      • Mapreduce – Set-based logic on large amounts of data
      • Remote – For problems that don’t fit into the mapreducer mold
      • Local – For orchestrating a pipeline of Mapreducer and Remote jobs
  • Three ways to exchange data during mapreduce
      • Sorted – key/value pairs are evenly sorted across the cluster
      • Grouped – like key/value pairs make their way to same reducer but not sorted
      • Hashsorted – super fast way to sort random data
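
If the difference between the "Sorted" and "Grouped" exchange modes is not obvious, here is a rough, single-process sketch of the idea in Python. To be clear, this is not Qizmt's API (Qizmt jobs are written in C# and run across a cluster); the function names `map_words`, `exchange`, `reduce_counts`, and `run_job` are made up for illustration, and Hashsorted is not modeled at all.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def exchange(pairs, sort_keys=False):
    """Exchange step: bring all values for the same key together.

    sort_keys=False imitates a "grouped" exchange: like keys end up
    together, but the keys arrive in no particular sorted order.
    sort_keys=True imitates a "sorted" exchange: keys also arrive in order.
    (Hypothetical helper, not part of Qizmt.)
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    keys = sorted(groups) if sort_keys else list(groups)
    return [(key, groups[key]) for key in keys]


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines, sort_keys=False):
    """Run map, exchange, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return [reduce_counts(key, values) for key, values in exchange(pairs, sort_keys)]


lines = ["the quick brown fox", "the lazy dog", "the fox"]
sorted_result = run_job(lines, sort_keys=True)
grouped_result = run_job(lines)
print(sorted_result)
print(grouped_result)
```

Both runs produce the same word counts; only the order in which keys reach the reducer differs. On a real cluster that ordering is the expensive part, which is presumably why Qizmt lets you skip it when a job only needs like keys to land on the same reducer.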

You can grab the code from Google Code.