Monday, December 31, 2012

Dasein and the Experience of Wabi-Sabi


"LastDay, Gemini 68... Year of the Corporation,  2012...
Carousel Begins..."

After a decade of building fascinating projects and working with some of the best people in the industry, I have decided to retire from Microsoft. 2011-2012 was a year of trying circumstances. Perhaps the single most important one was the passing of my best friend of 30 years, Kevin Mulvihill. When you have been best friends with someone since you were both 13, their sudden and unexpected passing gives you pause to consider many things. For me, beyond the immense anguish and sadness of losing someone who knew me better than I know myself, was the tacit recognition that everyone's time is finite. It struck me with a clarity and force that I have seldom experienced. You only have an allotted span to dream, build and define something truly iconic: something that is simultaneously beautiful, useful, and unique.

This notion of human being as finite and temporally grounded goes back to my early reading of the German philosopher Martin Heidegger. For him, human being (he preferred to use the term Dasein) was defined by its historicity and cultural setting, interpretative relationship to its surroundings, finitude in the face of death, temporality, and care (Sorge). In this view, a core part of what and who we are is the set of people and ideas that we care about in a deep way. This is a vastly different foundation for human being than the ones on which many thinkers, theological and philosophical, have tried to ground it. I tend to side with Heidegger: one learns far more about the essentially human by understanding what people care about and how they relate to themselves and others as historically grounded beings.

History, culture, shared life experiences, and language form us: they are not incidental and extrinsic.

I thought a lot about some of the dreams and aspirations that two adolescents had while waiting for the Summer of 1982 to finish so that they could start as Freshmen at Bellarmine College Preparatory in San Jose, CA. We had a blast trying to imagine what our future selves would be like and what they would tell us if they could send us a message backwards down the timeline. Remember John Carpenter's film Prince of Darkness and the "subconscious, tachyonic projection" employed as a dramatic device by the Brotherhood of Sleep? What would I tell our younger selves given the opportunity?

Seek hard challenges, deep emotional connections, self-understanding, and experience the beautiful.

Everything else is dross and amounts to utterly nothing in the long term. I consider myself a Stoic, and the writings of Epictetus resonate in my adult mind in much the same way that Foucault and Heidegger did in my younger mind. As adults, we are faced with obstacles, problems, and tragedies that we experience only abstractly and intellectually when we are young. It is remarkably easy to be flippant, cynical, and sophistical when there is apparently so little at stake. Everything appears infinite and imperishable - the ego's false immortality. The picture changes dramatically when the ineluctable forces of biological time, economic realities, and the encroaching presence of senescence and decay become everyday occurrences. Things get real, to paraphrase a common expression. The Japanese have a profound aesthetic-philosophical concept: wabi-sabi. I translate it as the beauty of decay - the experience of the beautiful in the finite, impermanent, and transient. The opposite of this way of seeing is the Greek concept of beauty. Both are valid aesthetics and each focuses our attention on a different aspect of the lived experience: the temporal and material versus the formal and the perfected. Time versus form.

Human life is wabi-sabi. It is self-aware and aware of its intrinsic finitude. In our lives, we are presented with trying and wrenching circumstances and events; we have little ability to alter their facticity and inevitability. How we react, though, is entirely within our power. This is the deep thought within Stoicism: you control your reaction, not the event. These must be separated in your mind. Too often, one easily elides into the other. This is a categorical mistake. "Circumstances do not make the man, they merely reveal him to himself." (Epictetus)

For me, 2012 was a profound moment of self-revelation. The sleeper has awakened. It took mind-warping trauma and the passing of someone very close to me to make that happen. Still, I am glad that it happened. I have returned full circle to the beginning and see it truly for the first time. That reminds me of the poignant, final scene in Yukio Mishima's last novel: fate, memories, and emotions suspended in a pristine moment without catharsis.


  

Saturday, December 31, 2011

Understanding Complex Datasets through Matrix Decomposition


David Skillicorn's book on matrix decomposition techniques is superb. I especially enjoyed his coverage of non-negative matrix factorization (NNMF) techniques and eigendecomposition (i.e. spectral techniques). I would recommend the book to those interested in data mining and knowledge extraction. The techniques cover a wide range of media and are not simply restricted to relational datasets and textual documents. The treatment of PageRank is concise and articulate: it demonstrates the deep relationship between graph mining and learning techniques and the matrix decomposition techniques (SVD amongst others) that make search engines such as Google and Bing possible. As a reviewer summarized, "The author explores the deep connections between matrix decompositions and structures within graphs, relating the PageRank algorithm of Google's search engine to singular value decomposition. He also covers dimensionality reduction, collaborative filtering, clustering, and spectral analysis. With numerous figures and examples, the book shows how matrix decompositions can be used to find documents on the Internet, look for deeply buried mineral deposits without drilling, explore the structure of proteins, detect suspicious emails or cell phone calls, and more."

The link provides the complete text of Skillicorn's book in the form of a PDF.
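To make the graph-to-matrix connection concrete, here is a minimal sketch of PageRank computed by power iteration, which approximates the dominant eigenvector of a damped link matrix. It is not taken from the book; the four-page web, the function name, and the damping value of 0.85 are assumptions chosen purely for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a small link graph.

    `links` maps each page to the list of pages it links to. The result
    approximates the dominant eigenvector of the damped transition matrix.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web, invented for this example.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page "c" collects links from every other page and ends up ranked highest, while "d", which nothing links to, keeps only the random-jump share.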

Sacrifice and Transcendentals

"Sacrifice is nothing other than the production of sacred things ... religion the search for a lost intimacy. Sacrifice destroys an object's real ties of subordination; it draws the victim out of the world of utility and restores it to that of unintelligible caprice. ... The sacrificer declares, 'Intimately, I belong to the sovereign world of the gods and myths, to the world of violent and uncalculated generosity ... I withdraw you, victim, from the world in which you were and could only be reduced to the condition of a thing, having a meaning that was foreign to your intimate nature. I call you back to the intimacy of the divine world, to the profound immanence of all that is.'"

-Georges Bataille, Theory of Religion

Ecstasy of the Beautiful: Beyond Symbolic Exchange

"We have become completely absorbed by models, completely absorbed by fashion, completely absorbed by simulation. ... Our whole culture is in the process of shifting from games of competition and expression to the games of risk and vertigo. Hence, we move to the form of ecstasy. Ecstasy is that quality specific to each body that spirals in on itself until it has lost all meaning and thus radiates as pure and empty form. Fashion is the ecstasy of the beautiful: the pure and empty form of a spiraling aesthetic. Simulation is the ecstasy of the real."

-Jean Baudrillard, Fatal Strategies

Wednesday, December 14, 2011

Technologies of the Self


"My objective for more than twenty-five years has been to sketch out a history of the different ways in our culture that humans develop knowledge about themselves: economics, biology, psychiatry, medicine, and penology. The main point is not to accept this knowledge at face value but to analyze these so-called sciences as very specific truth games related to specific techniques that human beings use to understand themselves. As a context, we must understand that there are four major types of these technologies, each a matrix of practical reason: (1) technologies of production, which permit us to produce, transform, or manipulate things; (2) technologies of sign systems, which permit us to use signs, meanings, symbols, or signification; (3) technologies of power, which determine the conduct of individuals and submit them to certain ends or domination, an objectivizing of the subject; (4) technologies of the self, which permit individuals to effect by their own means or with the help of others a certain number of operations on their own bodies and souls, thoughts, conduct, and way of being, so as to transform themselves in order to attain a certain state of happiness, purity, wisdom, perfection, or immortality."

-Michel Foucault, Technologies of the Self

Tuesday, December 13, 2011

My Interview for Apache Hadoop on Windows Azure


This is a link to a video interview that I did earlier this week in preparation for the developer preview of Apache Hadoop for Windows Azure. I founded this project at Microsoft and have shepherded its architecture and engineering efforts in addition to developing the overall framework for integrating OSS into Microsoft offerings. I have been joined by a cast of tremendously talented and inventive developers. The opportunity to work with members of the open source community and especially those involved in the Apache Hadoop project(s) has been a rewarding experience. I personally view 21st century global economic markets as being transformed by two complementary and mutually amplifying trends: (1) competitive differentiation based on analytics and (2) the emergence of cooperative, crowd-sourced open development of software tools and utility-priced scale-invariant clouds. The globalization of software and cloud-delivered computational platforms is rewriting the fundamental laws of economic competitiveness. Just as the emergence of machinery accelerated the rise of industrial capitalism in the 19th century, analytics-/data-driven competition and loosely-federated open markets for software are transforming the modern corporate IT environment, and this trend is accelerating.

Hadoop is at the forefront of the trend and our service on Azure is an important step in the seamless integrated flow of scale-invariant analytics and information derivatives into classic business intelligence, data warehousing, and statistical analysis toolchains and practices.

The project, codenamed Isotope, was done to provide a rich port of Apache Hadoop onto the Windows Server and Windows Azure platforms. Cloud-scale computation on elastic services is delivered through Azure, along with the ability to easily integrate any data sources and big data workloads via an elegantly intuitive portal and interactive shell. Our developer preview is the first manifestation of the long-term roadmap that we have for bridging the worlds of self-service business intelligence and cloud-scale analytics delivered to users on any device. The elastic map reduce and analytics service will be available at www.hadooponazure.com. The summary of the talk is included: In this interview, Alexander talks to us about how big data has already had a profound impact on businesses. Big data problems require a new class of NoSQL technologies and tools like Hadoop. Hadoop has a vibrant ecosystem for processing big data using commodity hardware. Microsoft's Hadoop distribution will democratize big data tools even further by making them more accessible and easier to program for people who don't know anything about clusters. Our Isotope team has adopted the OSS development model, and will be contributing code back to the community. Hadoop on Windows Azure will be available in CTP in December, and GA will be in March.

For me, one hashtag says it all: #Slide26.

Friday, December 31, 2010

Vector Processing Languages: The Future of Big Data Analytics and Real-time Business Intelligence


When I think about the systems being designed to address the big data problem, partially a data storage, access, and query problem (10%) but mostly about analytics (90%) at scale, latency, and performance thresholds only imagined by previous generations - I immediately think of vector processing systems and elastic, on demand, computing fabrics.

One of the best systems for the vector processing aspect that I have seen is q/kdb+ from Kx Systems. Q/kdb+ was developed out of K, itself spawned by A(+), and previously founded on Kenneth Iverson's seminal language APL, so it is the exemplar vector processing language. Another vector processing language familiar to many is MathWorks' MATLAB. MATLAB, short for Matrix Laboratory, definitely shows its intellectual ancestry in APL.

Why vector processing? Is not the world to be owned by dynamic, object-oriented, and functional programming genres and languages? Why would so many hedge funds, investment banks, and savvy individual traders use systems such as q/kdb+ and others instead of these "obvious" choices proffered by large software enterprise players such as Oracle, IBM, and Microsoft? There are a number of factors militating against this that run deep within the big data movement, and I shall only summarize a few of them here.

First, nested vector data models and higher-order processing such as those provided in q/kdb+ and other vector processing systems are uniquely appropriate to processing massive, time series ordered (partitioned) datasets. Unlike relational data models, which lack intrinsic ordering (unordered set vs. ordered vector semantics - remember?), vector models are natively adapted to operating over massively sized, dimensional data with intrinsic partitioning (time). Stock trading data streams are a good example. This is evident in the financial services industry and anywhere where time series analytics (understanding what is in your data) and predictive analytics (regression, what is likely to happen in the future based on that information) are keys to competitive differentiation.
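A small sketch may help show why ordering and time partitioning matter. It uses plain Python lists as stand-ins for column vectors; the trade data, the column names, and the helper functions are all hypothetical, and this is not q/kdb+ syntax.

```python
from itertools import groupby

# Hypothetical trade columns, stored as parallel vectors in time order.
dates = ["2010-12-30", "2010-12-30", "2010-12-30", "2010-12-31", "2010-12-31"]
prices = [10.0, 10.5, 10.25, 11.0, 10.75]
sizes = [100, 200, 100, 300, 100]


def deltas(vector):
    """Tick-to-tick change; only meaningful because the vector is ordered."""
    return [later - earlier for earlier, later in zip(vector, vector[1:])]


def vwap_by_date(dates, prices, sizes):
    """Volume-weighted average price for each date partition."""
    result = {}
    rows = zip(dates, prices, sizes)
    # groupby relies on the rows already being contiguous by date.
    for day, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        notional = sum(price * size for _, price, size in group)
        volume = sum(size for _, _, size in group)
        result[day] = notional / volume
    return result


price_moves = deltas(prices)
vwap = vwap_by_date(dates, prices, sizes)
print(price_moves)
print(vwap)
```

Neither operation needs a sort or an explicit ORDER BY: the order is part of the data, and each date partition can be processed independently of the others.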

Second, vector processing engines scale extremely well since they are designed for in-memory, high-speed performance with zero wait states and blocking semantics. Most vector processing engines, such as q/kdb+, are interpreters running at extreme speeds in a single thread of control and with no complex lock managers trying to outsmart the runtime vector engine. This is in complete contradistinction to the prevailing wisdom of RDBMS systems, which were designed to solve problems of concurrent access, I/O bottlenecks, and interleaved indexing operations that simply no longer apply to the big data world. Vector processing engines are already way ahead of where streaming and in-memory tuned engines such as Streambase, Coral8, Vertica, SciDB, and others wanted to get to. I suspect that we will see old school RDBMS vendors try to reposition some of their antiquated technologies in this vector processing space. Although this may be laudable and slightly quixotic, it is still like trying to rebuild a Ferrari sports car starting with the design of a Yugo. Why, in a sane world, would you start from that design position? You would not.

Third, vector processing languages are highly amenable to parallelism and high performance concurrent execution. A key part of this is the minimal unit of operation of these languages: the nested, n-dimensional vector. This, and not the scalar, is the unit of operation. This is a subtle point: most languages are designed to operate over data one scalar value at a time. To deal with anything more complicated requires a great deal of error-prone iteration, typically nested. What requires two or more iterative looping constructs in a language such as C# or Java is simply one operator (or higher-order function) in a vector processing language. An example would be computing the average of a vector of test scores: in a q or APL-like language this becomes a simple one-liner: sum aVector / tally aVector. That is the entire program. Taking these powerful monadic and dyadic operators and partitioning them across an elastic fabric of processors, caches, and storage is simplicity itself since so little context needs to be migrated in the form of closures and execution context. This hidden support for scalable parallel processing is another of the values of vector processing languages. Few Algol-60 derived languages can boast this form of inherent parallelism. In a big data world, this is critical for success and scalability.
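The averaging example, and the claim that little context needs to travel between workers, can be sketched as follows. This is plain Python rather than q or APL, and the scores, the three-way split, and the function names are assumptions for illustration; list slices stand in for partitions held by separate workers.

```python
def average(vector):
    """Whole-vector average, the analogue of `sum aVector / tally aVector`."""
    return sum(vector) / len(vector)


def partial(vector):
    """All a worker has to send back for its partition: (sum, count)."""
    return (sum(vector), len(vector))


def combine(partials):
    """Merge the per-partition results into the global average."""
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count


scores = [88, 92, 79, 65, 95, 70]

# Stand-in for three workers, each holding one slice of the vector.
partitions = [scores[0:2], scores[2:4], scores[4:6]]
distributed = combine([partial(part) for part in partitions])

print(average(scores), distributed)
```

Each worker returns just two numbers regardless of how large its partition is, which is why operators of this kind spread so easily across a fabric of machines.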

Finally, I posit that vector processing engines, when wedded to elastic, on demand compute and storage platforms, will become the most attractive computational and modeling option for data scientists (quants) trying to do data-driven science and business. Combining a small vector processing core with a scalable, elastic fabric such as Heroku, EC2, Google, or Azure will usher in a new era of vector fabric processors. They will have all of the algorithmic sophistication of classical high performance in-memory vector engines such as q/kdb+, vhayu, and timescale but combined with the infinite CPU, storage, and scale-out of cloud computing. Being able to interface these fabric-scaled vector processors with rich visualizations of the sort one encounters in R, Excel, MATLAB, Tableau, and other real-time data graphing exploration vendors makes for a great marriage of equals. You do not have to be locked into a proprietary visualization system any more than you have to be locked into a proprietary cloud. There are too many excellent choices available. I have addressed several in my recent blogs.

This marriage of vector fabric processors and the economies of scale that cloud computing will provide should create a fundamental economic disruption both in the cloud computing vendor landscape and across many verticals: from finance to insurance to computational bio-pharmacology to retail to leisure. We are already seeing the convergence of medium-scale data warehousing workloads for analytics and map reduce computational models (AsterData, Greenplum/EMC, Netezza/IBM, etc.). This is only the harbinger of a far more systematic transformation in the way machine learning, predictive analytics, mass-scale simulation, and continuous (near-line) business process optimization are done in the next five years.

Imagine the three screens of the future analyst workstation as depicted above: (left) historical data warehousing, (middle) real-time business intelligence, and (right) predictive analytics, simulation, what-if? analysis, and regression modeling. All three screens (modes) can be incorporated in one unified platform for information and (actionable) insight with the power of vector processing engines combined with scalable cloud fabrics.

Interested parties are invited to try out q/kdb+ (free evaluation trial of the 32-bit edition) or one of the many high quality APL implementations available online. I would recommend the following APLs to try: APLX and Dyalog APL. All three are excellent offerings and will demonstrate the power of higher-order vector processing for scenarios ranging from statistical analysis, portfolio strategy optimization, risk management, and search engine optimization (SEO) to analytical CRM for eCommerce. The old cliche of back to the future may well prove to be prescient with regard to APL and its many children.

Cloud fabric plus vector processing plus scalable storage (à la MongoDB, Membase, et al.) is the equation for the future of big data analytics and continuous, online machine learning. Say goodbye to Java, C++, C#, and the old scalar-at-a-time programming languages derived from Algol-60.

Welcome back to the serene elegance, conciseness, and high performance parallelism of APL (minus the custom symbology) in the big data world.

Coltrane: The Brilliance of a Uniquely American Genre


It is hard for me to write about music: it is one of those aesthetic experiences that resist paraphrase and reduction to a linear or serial discourse. Three of my favorite musicians, of the non-electronica genre, are John Coltrane, Miles Davis, and Thelonious Monk. Coltrane, or Trane as he was sometimes referred to, has a special place in my heart. For instance, Blue Train, which is quintessentially hard bop, is one of those recordings, like Kraftwerk's Computerwelt, that I never get tired of listening to wherever I happen to be. I am not suggesting a hierarchy of masters among the three, or among the many other talented jazz musicians, West Coast and East Coast, who created the unique American art form we call jazz. It has spawned everything from acid jazz to downtempo to funk to rap.

In fact, jazz may truly be said to be the cultural and artistic capstone of American 20th century musical art. It influences every genre, especially its riotous offspring rock and roll. But to truly appreciate Coltrane, Davis, and Monk - just look to the recordings and jam sessions of the late 50s and early 60s. This is when all the templates for popular music were laid down. You do not have to enjoy them as trailblazers or genre definers but merely for what they were: brilliant musicians, composers, and artists. Art is at its most powerful and elemental when it transforms the particular and subjective into something beyond itself: when it reveals the playful beauty of our (very mammalian) species. We love to play and art is play raised to the infinite degree.

Coltrane's amazing musicianship, compositional skills, charisma, and ability to overcome the many demons of a segregated post-war American society only add to the triumph. But when it comes to the music, you really can leave all that behind - this is the place of the numinous, the ineffable, and transformational.

Do yourself a favor and listen to Blue Train, Giant Steps, and My Favorite Things. You will see what I mean about art at its apex. Do not ask me to explain it any further; I shall just refer you to Miles Davis' definition of jazz.

Bio-Inspired Artificial Intelligence: How Self and Complexity Emerge in Biological Systems


Although the connections between biological systems and computing have been drawn in many other places, I found this primer, Bio-Inspired Artificial Intelligence, especially interesting for its chapter on immunological systems. The contention that the basis of identity (self vs. other) and even intelligence has its evolutionary basis in the adaptive evolution of immune systems is just too fascinating to ignore. The authors spend considerable time working through the basics of immunological response and conceptualizing what they term shapespace. I think this is a great text to selectively dip into for this topic alone - since so much of the literature for computational biology is focused on bioinformatics (string matching), genomics, proteomics, and then a fast shift to systems biology.

I actually think the relatively primitive immunological system (pattern matching engine) is a superb place to ask some fundamental questions about what we think of as the self and how it evolves in a dynamic, evolutionary ecosystem constantly deluged by bacteria, prions, viruses, electromagnetic radiation, DNA replication errors (cancers), and even more elaborate bio-programming systems (namely: memes). I like the approach, since it posits a very small number of non-supernatural, non-metaphysical entities to account for the emergence of intelligence, collective coordination, cooperation, and the explosion of biodiversity.

As one person put it: Natural evolution is a good analogy to this method–the rules of evolution (selection, recombination/reproduction, mutation and more recently transposition) are in principle simple rules, yet over thousands of years have produced remarkably complex organisms.
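The point about simple rules can be illustrated with a minimal evolutionary algorithm. The sketch below is not from the book: the OneMax fitness function (count the 1-bits), the tournament selection scheme, and every parameter value are assumptions picked to keep the example short.

```python
import random


def evolve(genome_length=20, population_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Evolve bit strings toward all ones using three simple rules."""
    rng = random.Random(seed)
    fitness = sum  # OneMax: fitness is the number of 1-bits

    def select(population):
        # Selection: the fitter of two randomly chosen individuals wins.
        first, second = rng.sample(population, 2)
        return first if fitness(first) >= fitness(second) else second

    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best_per_generation = [max(fitness(genome) for genome in population)]

    for _ in range(generations):
        # Elitism: carry the current best genome over unchanged.
        next_population = [max(population, key=fitness)[:]]
        while len(next_population) < population_size:
            mother, father = select(population), select(population)
            # Recombination: single-point crossover.
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
        best_per_generation.append(max(fitness(genome) for genome in population))
    return best_per_generation


history = evolve()
print(history[0], "->", history[-1])
```

Nothing in the loop knows what a good genome looks like beyond the fitness score, yet the best score can only hold or improve from one generation to the next because the elite genome is always retained.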

Wednesday, December 29, 2010

Enter the Void: Sunyata and CGI in Gaspar Noe's Latest Film


Śūnyatā (शून्यता) (Sanskrit noun from the adj. śūnya: "zero, nothing") is often translated as "void" and adds the appropriate resonance (spiritual, philosophical, and aesthetic) to Gaspar Noe's latest cinematic triumph Enter the Void. This is one of the rare instances of film that actually makes the strongest possible case for the distinction between the Real(tm) and phenomenal. The effect is as profound as it is disorienting and nauseating in places. Noe has summarized his film's plot as, "the sentimentality of mammals and the shimmering vacuity of the human experience." Weaving in the Tibetan Book of the Dead, massive doses of GHB and DMT-induced Dreamtime visualization, more-real-than-real overhead shots of Tokyo at night, trademark hyper-brutalist sexuality, and a poignant tragedy to underscore the maelstrom of sensations that pull the viewer along, Noe has once again crafted a new language of visual expression. Even more brilliantly, he has taken the tools of kitsch, Hollywood CGI, worked to death in lackluster juvenilia such as Star Trek and Jurassic Park, and transformed the Real (our experience of the "void") into a hallucinogenic and shimmering tableau that never fails to disorient, fascinate, and remove the artificial irony of the sophisticated viewer.

You have no choice but to submit to Noe's art.

The link takes you to a montage of the incredible CGI that was created for Enter the Void. From the neon-drenched Tokyo to intrauterine fertilization to Cthulhu-esque higher-dimensional lifeforms floating through ceilings and walls, his CGI-rendered camera truly explores the ineffable recesses of the human experience.

This is technology in the service of a turbulent vision that turns the viewer into the disembodied, floating soul of the main character as he drifts from scene to scene awaiting reincarnation. Seldom has technology produced such a profound spiritual and emotional effect. Noe has established not only a distinctive language of expression but quite probably a new genre of generative art. CGI has finally been used to fuse sentimentality and the void.

This may be the best, inadvertent, recruiting film for Buddhism ever created.

Kling Klang: The Electronic Garden of the Musikarbeiter


Describe how Kling Klang looks like inside?

Ralf Hütter: It's an electronic garden. We like to perform electronic gardening.

The location of the building is a well kept secret?

Ralf Hütter: Sometimes somebody comes all the way from Japan. But the doors are tightly closed and we are never inside on a regular basis. We concentrate ourselves onto the music and not on what's happening outside. We call ourselves Musikarbeiter. Everything with us is about the music.

What a brilliant summary of the ethos behind the electronica. Hütter is one of the co-founders of Kraftwerk and its spokesmodel.

Insen: Microsound Engineering and Classical Music Performance


Ever since being profoundly shocked by the all-electronic soundtrack to Forbidden Planet at the age of eight and the generative-procedural audio workout of Morton Subotnick's Silver Apples of the Moon (Nonesuch was such a brilliant label for experimental electronica) - I have had a life-long love affair with computationally and electronically designed, composed, and performed music. Needless to say, Kraftwerk has been the creative influence of my life. Computerwelt, Tour de France (Single Version 1983), and Mensch-Maschine are the apex of popular electronica - their influence reverberates through the decades in ways that are too numerous to specify. From Detroit to Tokyo to London to Berlin, the Kraftwerk ethos and exacting approach to crafting electronica in a myriad of genres has influenced more people than just about anyone I can imagine.

One of the most extreme research and performance tangents of electronica has to be microsound, the mathematical exploration of tonality at the ultra-low latency range. Championed by Curtis Roads and others, microsound is a sound designer's and mathematician's aesthetic paradise. Granular synthesis was developed to explore this space, providing a much needed antidote to the clash of the titans dueling between the subtractive-analog and frequency-modulation (FM) schools of sound design. With the advent of fantastic virtual synthesizer technology (both Steinberg VSTs and software such as Reason, Ableton, Renoise and Reaktor) along with sound design laboratories like Pure Data (Pd) and MAX/MSP, the aspiring sound designer has free rein to chart the mathematical and audio realms that escape the classical chromaticism that many have been conditioned to accept as the minima and maxima of musical expression. As Jacques Attali once suggested in Noise: The Political Economy of Music, where we make the distinctions of noise versus music is a cultural and political segmentation - it is not innate in the frequency spectrum. A microsound composer like Carsten Nicolai, who typically records under the nom-de-audio of Alva Noto, is a great example of the marriage of technology, mathematics, and a fine sense of symmetry - albeit in a different dimension than one typically encounters in either the classical repertoire or in popular music - which is typically (and dismally) a strict subset of the former. Nicolai's collaborations with the talented Ryuichi Sakamoto for Vrioon, Insen, and other mash-ups of microsound and classical piano are an exemplary entry point into the realm of mathematically generated and designed audio experience. The description of the Insen release is worth reviewing.
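For readers who have not met granular synthesis, a rough sketch of the idea: build very short, windowed bursts of sound (grains) and overlap-add them into a longer texture. The code below is an illustrative toy, not how Pd, MAX/MSP, or any commercial tool implements it; the sample rate, the 20 ms grain length, the Hann window, and the pitch list are all assumptions.

```python
import math

SAMPLE_RATE = 44100  # samples per second (assumed)


def grain(frequency, duration_ms):
    """One grain: a short sine burst shaped by a Hann window."""
    n = int(SAMPLE_RATE * duration_ms / 1000)
    samples = []
    for i in range(n):
        window = 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
        tone = math.sin(2.0 * math.pi * frequency * i / SAMPLE_RATE)
        samples.append(window * tone)
    return samples


def granulate(frequencies, duration_ms=20, hop_ms=10):
    """Overlap-add one grain per frequency, each starting hop_ms after the last."""
    hop = int(SAMPLE_RATE * hop_ms / 1000)
    length = int(SAMPLE_RATE * duration_ms / 1000)
    output = [0.0] * (hop * (len(frequencies) - 1) + length)
    for index, frequency in enumerate(frequencies):
        start = index * hop
        for offset, sample in enumerate(grain(frequency, duration_ms)):
            output[start + offset] += sample
    return output


# Hypothetical four-grain texture with rising pitch (values in Hz).
texture = granulate([220.0, 330.0, 440.0, 660.0])
print(len(texture), "samples,", round(1000 * len(texture) / SAMPLE_RATE), "ms")
```

Changing grain length, hop size, or the per-grain frequencies reshapes the texture, which is where the sound designer's exploration begins.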

Insen is the second album in an ongoing collaboration between Japanese composer Ryuichi Sakamoto and German electronic artist Carsten Nicolai (here credited as Alva Noto). The album's core sound is a blend of Sakamoto's impressionist piano melodies and Nicolai's digitally processed beats and sounds. Released in 2005 by Nicolai's Raster-Noton label, it follows the duo's debut album Vrioon, which was named album of the year in 2004 by The Wire magazine.

Insen is, by my estimation, the finest of the collaborations. Aurora and Logic Moon are simply mesmerizing. You can focus on macro-structures (piano chord progressions) or on micro-structures (sound shapes and tonal textures) and still be carried along in the aesthetic experience. This ability to straddle both ends of the continuum makes Insen so noteworthy. I have enjoyed the solo efforts by Nicolai too: For, Unitxt, and For2 stand out. For (Katsushika Hokusai) is absolutely a simple, Zen joy. If you enjoy electronica and the design and performance of noise/music (generative sound is the catch-all term I have for it), try out Insen or one of the other compositions.

MongoDB: Scalable Storage and Computation without Schema


MongoDB is one of the best of the new crop of post-relational or post-schema-driven persistence engines designed for large scale cloud storage. Like CouchDB, another document-centric and schema-less cloud scale engine, MongoDB also offers several robust features: MapReduce support over its data, attribute-level indexing, auto-sharding, in-place updating, and high-availability. What MongoDB does not have is equally interesting: mandatory schema and SQL-like restrictions on data access and programming. Written in C++ and designed for multi-language access (Java first and foremost it seems), MongoDB is what I would term an instance-oriented cloud store: instances (documents) can be highly variant in structure and the system elegantly scales and continues to perform. A lot of this has to do with the manner in which the auto-sharding and replica management happens behind the scenes. This automagical behavior is reminiscent of the old(er) world of RDBMS storage that has dominated the enterprise computing space from the late 1980s through early 2000s but has jettisoned many of the sacred cows of SQL-based storage: homogeneous structure, convoluted graph-oriented operation (subqueries), and scale limitations based on the design of locking managers that were designed for I/O characteristics that simply do not obtain in the cloud space.
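To make "schema-less, sharded, attribute-indexed" less abstract, here is a toy in-memory store in plain Python. It is emphatically not the MongoDB API or its sharding algorithm: the class, its methods, the CRC-based shard choice, and the sample documents are all hypothetical and exist only to show the shape of the idea.

```python
import zlib
from collections import defaultdict


class ToyDocumentStore:
    """Toy stand-in for a sharded, schema-less document store."""

    def __init__(self, shard_count=3):
        self.shards = [dict() for _ in range(shard_count)]
        self.indexes = {}  # field name -> {value -> set of document ids}

    def _shard_for(self, doc_id):
        # Hash the id to pick a shard; callers never choose one themselves.
        return self.shards[zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)]

    def insert(self, doc_id, document):
        # Assumes doc_id is new and indexed fields hold hashable values.
        self._shard_for(doc_id)[doc_id] = document
        for field, index in self.indexes.items():
            if field in document:
                index[document[field]].add(doc_id)

    def create_index(self, field):
        index = defaultdict(set)
        for shard in self.shards:
            for doc_id, document in shard.items():
                if field in document:
                    index[document[field]].add(doc_id)
        self.indexes[field] = index

    def find(self, field, value):
        if field in self.indexes:
            ids = self.indexes[field].get(value, set())
            return [self._shard_for(doc_id)[doc_id] for doc_id in sorted(ids)]
        # No index on this field: scan every shard.
        return [document for shard in self.shards
                for document in shard.values() if document.get(field) == value]


store = ToyDocumentStore()
# Documents in the same store need not share a structure.
store.insert("u1", {"name": "first", "city": "London"})
store.insert("u2", {"name": "second", "city": "Tokyo", "tags": ["audio", "math"]})
store.insert("u3", {"name": "third", "city": "London", "papers": 1})
store.create_index("city")
londoners = store.find("city", "London")
print([document["name"] for document in londoners])
```

The three documents have three different shapes, and neither the insert path nor the query path cares; the index simply skips documents that lack the indexed field.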

I have had a chance to play with MongoDB, Membase, Hive, Cassandra, and CouchDB and can definitely say that this feels like the inevitable direction of cloud scale storage and computation (see Spark, Dryad, MapReduce, etc.). Microsoft has proprietary auto-sharding storage systems (SQL Azure and Azure XStore) which I shall write about on another occasion. All that said, schema-free storage, indexing, auto-sharding, and high performance make for a compelling offering in MongoDB.

Spark: Big Data Computing with Scala


Spark is a new distributed computing model built on top of the Mesos cluster operating system. Both projects come out of UC Berkeley. The interesting thing about Spark is how its concept of resilient distributed datasets (RDDs) and the underlying Scala language, with its support for shipping closures around the network, deliver a scalable and high-performance model when compared to the state of the industry, i.e. Hadoop MapReduce. Spark's architecture naturally allows for efficient in-memory operation as well as caching, so that iterative algorithms such as those that one often encounters in machine learning, matrix algebra, mathematical optimization, and multi-dimensional indexing can really see a significant speed-up. The published performance/scale figures show that as iterations increase (in the case of, say, a logistic regression problem), Spark is 10-15 times faster in execution than a standard Hadoop MapReduce job operating over the same HDFS datasets. In-memory and cached intermediate state make a big difference in the distributed big data computing space.

A nice write-up of the logistic regression case, taken from Matei Zaharia's website, is provided below.

In more interesting applications, users probably need to read a data set and potentially transform it before performing calculations on it. For this purpose, Spark provides a second type of distributed dataset -- a file in the Hadoop Distributed File System (HDFS). Currently, only text files are supported. The HDFS file looks to the programmer like a collection of records (in text files, each record is a line). However, operations on it run at the nodes that contain each block of the file, as in MapReduce.

The corresponding parallel program in Spark (on top of Scala) to implement the logistic regression is the following:

// Read data file and transform it into Point objects
val spark = new SparkContext()
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)
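
For readers who want to trace the arithmetic of that loop, the following is a single-machine sketch of the same update in plain JavaScript. It involves no Spark, no HDFS, and no parallelism. The helper names (dot, gradient, train), the zero starting vector (the excerpt starts from a random one), and the four toy data points are assumptions made for illustration. As in the excerpt, labels are +1 or -1 and the summed gradient is subtracted without a step size.

```javascript
// Dot product of two equal-length numeric arrays.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// One full pass over the data: the summed logistic-loss gradient at w.
// This pass is what the Spark excerpt distributes across the cluster.
function gradient(points, w) {
  const g = new Array(w.length).fill(0);
  for (const p of points) {
    // Same expression as the excerpt: (1 / (1 + exp(-y * (w . x))) - 1) * y
    const scale = (1 / (1 + Math.exp(-p.y * dot(w, p.x))) - 1) * p.y;
    for (let j = 0; j < g.length; j++) g[j] += scale * p.x[j];
  }
  return g;
}

// Repeat the pass; every iteration re-reads the same points, which is why
// keeping them cached in memory pays off as the iteration count grows.
function train(points, dims, iterations) {
  let w = new Array(dims).fill(0); // assumption: start at zero, not random
  for (let i = 0; i < iterations; i++) {
    const g = gradient(points, w);
    w = w.map((wj, j) => wj - g[j]);
  }
  return w;
}

// Hypothetical toy data: the label is the sign of the first coordinate.
const points = [
  { x: [2, 1], y: 1 },
  { x: [1, -1], y: 1 },
  { x: [-1, 1], y: -1 },
  { x: [-2, -1], y: -1 },
];

const w = train(points, 2, 20);
const predict = (x) => (dot(w, x) >= 0 ? 1 : -1);
console.log('weights:', w, 'all correct:', points.every((p) => predict(p.x) === p.y));
```

Every iteration touches every point again, so the cost of re-reading the input dominates once the data no longer fits the per-job model of MapReduce; that is the part the cache() call in the excerpt is addressing.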

This is an impressive programming model and one that definitely bears watching as Hadoop MapReduce, Microsoft Dryad, and other frameworks (like Sphere and Twister) vie for dominance in the BigData world that has emerged.

Tuesday, December 28, 2010

Node.js: Scalable Cloud Services


Sometimes it is easier to let a framework speak for itself; in this case, node.js. It combines the non-blocking design of several event-driven server architectures with the number one web programming language, JavaScript. The syntax is straightforward and easy to parse. Readability is a key adoption driver, which is why several continuation- and meta-programming-based languages and frameworks just do not fare very well. Node.js is one to watch (and use). I have excerpted some of their materials below to demonstrate the ease of use in building scalable server-based applications.

An example of a web server written in Node which responds with "Hello World" for every request.

var http = require('http');
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(8124, "127.0.0.1");
console.log('Server running at http://127.0.0.1:8124/');

To run the server, put the code into a file example.js and execute it with the node program:

% node example.js
Server running at http://127.0.0.1:8124/

Here is an example of a simple TCP server which listens on port 8124 and echoes whatever you send it:

var net = require('net');
net.createServer(function (socket) {
  socket.write("Echo server\r\n");
  socket.on("data", function (data) {
    socket.write(data);
  });
}).listen(8124, "127.0.0.1");

Node's goal is to provide an easy way to build scalable network programs. In the "hello world" web server example above, many client connections can be handled concurrently. Node tells the operating system (through epoll, kqueue, /dev/poll, or select) that it should be notified when a new connection is made, and then it goes to sleep. If someone new connects, then it executes the callback. Each connection is only a small heap allocation.

This is in contrast to today's more common concurrency model where OS threads are employed. Thread-based networking is relatively inefficient and very difficult to use. Node will show much better memory efficiency under high loads than systems which allocate 2 MB thread stacks for each connection. Furthermore, users of Node are free from worries of dead-locking the process: there are no locks. Almost no function in Node directly performs I/O, so the process never blocks. Because nothing blocks, less-than-expert programmers are able to develop fast systems.

Node is similar in design to and influenced by systems like Ruby's Event Machine or Python's Twisted. Node takes the event model a bit further: it presents the event loop as a language construct instead of as a library. In other systems there is always a blocking call to start the event loop. Typically one defines behavior through callbacks at the beginning of a script and at the end starts a server through a blocking call like EventMachine::run(). In Node there is no such start-the-event-loop call. Node simply enters the event loop after executing the input script. Node exits the event loop when there are no more callbacks to perform. This behavior is like browser JavaScript: the event loop is hidden from the user.
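
The hidden event loop described above can be observed with a few lines of script. This is a minimal sketch, not part of the excerpted Node materials; the events array and its labels are illustrative names. The callback is scheduled in the middle of the script, yet it runs only after the last line has executed, and the process exits on its own once no callbacks remain.

```javascript
// Records the order in which things actually happen.
const events = [];

events.push('script start');

// Schedule a callback. There is no "start the event loop" call anywhere:
// Node enters the loop by itself once this script has run to the end.
setTimeout(function () {
  events.push('timer callback');
  console.log(events.join(' -> '));
  // No callbacks are pending after this one, so the process exits here.
}, 0);

events.push('script end');
```

Run with the node program, the script logs the three labels with the timer callback last, even though the timer was registered before 'script end' was recorded.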