Wednesday, December 29, 2010

Spark: Big Data Computing with Scala


Spark is a new distributed computing model built on top of the Mesos cluster operating system. Both projects come out of UC Berkeley. The interesting thing about Spark is how its concept of resilient distributed datasets (RDDs), together with the underlying Scala language and its support for shipping closures around the network, delivers a scalable, high-performance model when compared to the current industry standard, i.e. Hadoop MapReduce. Spark's architecture naturally allows for efficient in-memory operation as well as caching, so iterative algorithms such as those that one often encounters in machine learning, matrix algebra, mathematical optimization, and multi-dimensional indexing can see a significant speed-up. The published performance/scale figures show that as iterations increase (in the case of, say, a logistic regression problem), Spark is 10-15 times faster in execution than a standard Hadoop MapReduce job operating over the same HDFS datasets. In-memory and cached intermediate state make a big difference in the distributed big data computing space.

A nice write-up of the logistic regression case, taken from Matei Zaharia's website, is excerpted below.

In more interesting applications, users probably need to read a data set and potentially transform it before performing calculations on it. For this purpose, Spark provides a second type of distributed dataset -- a file in the Hadoop Distributed File System (HDFS). Currently, only text files are supported. The HDFS file looks to the programmer like a collection of records (in text files, each record is a line). However, operations on it run at the nodes that contain each block of the file, as in MapReduce.
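To make the "read, then transform" step concrete, here is a minimal sketch of turning one text record (one line) into a typed value. The excerpt does not say how the data file is laid out, so the line format (a label followed by whitespace-separated feature values), the Point case class, and this parsePoint body are all assumptions made for illustration, not the author's code:

```scala
// Hypothetical record parser; the real data format is not given in the source.
object PointParsing {
  // A labelled example: feature vector x and label y (assumed to be +1 or -1).
  final case class Point(x: Array[Double], y: Double)

  // Assumed line format: "<label> <x1> <x2> ...", separated by whitespace.
  def parsePoint(line: String): Point = {
    val fields = line.trim.split("\\s+").map(_.toDouble)
    Point(x = fields.tail, y = fields.head)
  }

  def main(args: Array[String]): Unit = {
    // A local collection stands in for the lines of an HDFS text file.
    val lines = Seq("1 0.5 2.0", "-1 1.5 -0.25")
    val points = lines.map(parsePoint)
    points.foreach(p => println(s"y=${p.y} x=${p.x.mkString(",")}"))
  }
}
```

On a distributed dataset the same function would be applied to every line on the node that holds that block of the file.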

The corresponding parallel program in Spark (on top of Scala) to implement the logistic regression is the following:

// Read data file and transform it into Point objects
val spark = new SparkContext()
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)
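
For readers who want to try the update rule without a cluster, the following is a rough single-machine sketch of the same loop in plain Scala. It is not Spark code: Point, dot, and train are hypothetical names, arrays stand in for the Vector type used above, labels are assumed to be +1 or -1, and the weights start at zero rather than at random values so that the run is repeatable.

```scala
// Local sketch of the gradient loop above; no Spark, no HDFS.
object LocalLogisticRegression {
  // Feature vector x with label y (assumed to be +1.0 or -1.0).
  final case class Point(x: Array[Double], y: Double)

  // Dot product of two equal-length arrays.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  def train(points: Seq[Point], dims: Int, iterations: Int): Array[Double] = {
    // Assumption: start from zeros instead of Vector.random(D).
    var w = Array.fill(dims)(0.0)
    for (_ <- 1 to iterations) {
      // A plain local array plays the role of the accumulator.
      val gradient = Array.fill(dims)(0.0)
      for (p <- points) {
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        for (j <- 0 until dims) gradient(j) += scale * p.x(j)
      }
      // Same step as `w -= gradient.value` in the listing above.
      w = w.zip(gradient).map { case (wj, gj) => wj - gj }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(Point(Array(1.0), 1.0), Point(Array(-1.0), -1.0))
    println("Result: " + train(points, dims = 1, iterations = 10).mkString(", "))
  }
}
```

The Spark listing keeps nearly the same shape, but the inner loop over the cached points runs in parallel on the cluster nodes and the per-node contributions are summed through the accumulator.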

This is an impressive programming model and one that definitely bears watching as Hadoop MapReduce, Microsoft Dryad, and other frameworks (like Sphere and Twister) vie for dominance in the big data world that has emerged.

1 comment:

Matei said...

Thanks for the writeup about Spark, Alex! I came across it after a friend pointed me to your blog. We'll probably put together a first official release of Spark in the next month if people are interested in trying it out.