Friday, December 31, 2010
The big data problem is partly one of data storage, access, and query (perhaps 10%) but mostly one of analytics (the other 90%) at scales, latencies, and performance thresholds only imagined by previous generations. When I think about the systems being designed to address it, I immediately think of vector processing systems and elastic, on-demand computing fabrics.
One of the best systems I have seen for the vector processing side is q/kdb+ from Kx Systems. Q/kdb+ grew out of K, which itself descended from A+, which in turn was founded on Kenneth Iverson's seminal language APL, so it is the exemplar vector processing language. Another vector processing language familiar to many is MathWorks' MATLAB. MATLAB, short for Matrix Laboratory, clearly shows its intellectual ancestry in APL.
Why vector processing? Is the world not supposed to be owned by the dynamic, object-oriented, and functional programming paradigms and languages? Why do so many hedge funds, investment banks, and savvy individual traders use systems such as q/kdb+ instead of the "obvious" choices proffered by large enterprise software players such as Oracle, IBM, and Microsoft? A number of factors that run deep within the big data movement militate against those choices, and I shall summarize only a few of them here.
First, nested vector data models and higher-order processing, such as those provided in q/kdb+ and other vector processing systems, are uniquely suited to processing massive, time-ordered (partitioned) datasets. Unlike relational data models, which lack intrinsic ordering (unordered set versus ordered vector semantics, remember?), vector models are natively adapted to operating over massive, dimensional data with intrinsic partitioning (time). Stock trading data streams are a good example. This is evident in the financial services industry and anywhere that time series analytics (understanding what is in your data) and predictive analytics (regression: what is likely to happen in the future based on that information) are keys to competitive differentiation.
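To make the point concrete, here is a minimal Python sketch of a date-partitioned trade stream with a whole-column aggregate run per partition. The data, the `vwap` helper, and the table shape are all hypothetical illustrations of the idea, not kdb+'s actual storage format:

```python
from collections import defaultdict

# Hypothetical toy trade stream: (date, time, symbol, price, size) rows.
trades = [
    ("2010.12.30", "09:30:01", "IBM",  145.10, 300),
    ("2010.12.30", "09:30:02", "MSFT",  27.85, 500),
    ("2010.12.31", "09:30:01", "IBM",  146.25, 200),
    ("2010.12.31", "09:30:03", "IBM",  146.40, 100),
]

# Partition by date, as a kdb+ partitioned table does on disk; within each
# partition the rows keep their arrival order -- ordered vector semantics.
partitions = defaultdict(list)
for row in trades:
    partitions[row[0]].append(row)

def vwap(rows, sym):
    # Operate over whole columns at once rather than row by row.
    px   = [p for _, _, s, p, z in rows if s == sym]  # price column
    size = [z for _, _, s, p, z in rows if s == sym]  # size column
    return sum(p * z for p, z in zip(px, size)) / sum(size)

# Per-partition time series analytic: IBM's daily volume-weighted avg price.
daily_vwap = {d: vwap(rows, "IBM") for d, rows in sorted(partitions.items())}
```

The partitioning key (time) is free structure: a query for one day touches one partition and nothing else, which is exactly why this layout scales for market data.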
Second, vector processing engines scale extremely well because they are designed for in-memory, high-speed performance with no wait states or blocking semantics. Most vector processing engines, such as q/kdb+, are interpreters running at extreme speed in a single thread of control, with no complex lock manager trying to outsmart the runtime vector engine. This stands in complete contradistinction to the prevailing wisdom of RDBMS systems, which were designed to solve concurrent-access, I/O-bottleneck, and interleaved-indexing problems that simply no longer apply to the big data world. Vector processing engines are already well past where streaming and in-memory tuned engines such as StreamBase, Coral8, Vertica, SciDB, and others wanted to get. I suspect we will see old-school RDBMS vendors try to reposition some of their antiquated technologies in this vector processing space. Laudable, and slightly quixotic, but it is still like trying to rebuild a Ferrari sports car starting with the design of a Yugo. Why, in a sane world, would you start from that design position? You would not.
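The single-thread-of-control design can be sketched in a few lines of Python. This is an assumed illustration of the principle, not kdb+'s actual implementation: one engine thread owns the in-memory columns outright, so queries need no locks at all; they are simply serialized through a queue.

```python
import queue
import threading

# In-memory columnar table, owned exclusively by the engine thread.
columns = {"price": [10.0, 10.5, 11.0], "size": [100, 200, 300]}
requests = queue.Queue()

def engine():
    # Single thread of control: no lock manager, no contention, no
    # blocking semantics -- only this thread ever touches the data.
    while True:
        fn, reply = requests.get()
        if fn is None:          # shutdown sentinel
            break
        reply.put(fn(columns))

t = threading.Thread(target=engine)
t.start()

# A client submits a whole-column operation instead of touching data directly.
reply = queue.Queue()
requests.put((lambda c: sum(c["price"]) / len(c["price"]), reply))
avg_price = reply.get()

requests.put((None, None))
t.join()
```

Because queries arrive as whole-vector operations and execute to completion one at a time, the engine never needs the interleaved locking machinery an RDBMS carries around.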
Third, vector processing languages are highly amenable to parallelism and high-performance concurrent execution. A key part of this is the minimal unit of operation of these languages: the nested, n-dimensional vector. This, and not the scalar, is the unit of operation. This is a subtle point: most languages are designed to operate over data one scalar value at a time. Anything more complicated requires a great deal of error-prone iteration, typically nested. What requires two or more looping constructs in a language such as C# or Java is simply one operator (or higher-order function) in a vector processing language. An example is computing the average of a vector of test scores: in q this becomes a simple one-liner, (sum aVector) % count aVector, or just avg aVector. That is the entire program. Taking these powerful monadic and dyadic operators and partitioning them across an elastic fabric of processors, caches, and storage is simplicity itself, since so little context needs to be migrated in the form of closures and execution state. This hidden support for scalable parallel processing is another of the values of vector processing languages. Few Algol-60-derived languages can boast this form of inherent parallelism. In a big data world, this is critical for success and scalability.
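The scalar-versus-vector contrast above can be shown side by side in plain Python, as a rough analogue of the q one-liner (the variable names are illustrative):

```python
scores = [88, 92, 79, 85, 96]

# Scalar-at-a-time (Algol-60 style): explicit loop, index bookkeeping,
# mutable accumulator state.
total = 0
for i in range(len(scores)):
    total += scores[i]
avg_loop = total / len(scores)

# Vector-at-a-time: whole-array operators, no explicit iteration -- the
# Python analogue of q's (sum aVector) % count aVector.
avg_vector = sum(scores) / len(scores)
```

The second form has no loop to get wrong, and because the operators consume the whole vector at once, a runtime is free to split the reduction across cores or machines without the programmer rewriting anything.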
Finally, I posit that vector processing engines, when wedded to elastic, on-demand compute and storage platforms, will become the most attractive computational and modeling option for data scientists (quants) trying to do data-driven science and business. Combining a small vector processing core with a scalable, elastic fabric such as Heroku, EC2, Google, or Azure will usher in a new era of vector fabric processors. They will have all the algorithmic sophistication of classical high-performance in-memory vector engines such as q/kdb+, Vhayu, and timescale, combined with the virtually unlimited CPU, storage, and scale-out of cloud computing. Being able to interface these fabric-scaled vector processors with the rich visualizations one encounters in R, Excel, MATLAB, Tableau, and other real-time data exploration and graphing tools makes for a great marriage of equals. You do not have to be locked into a proprietary visualization system any more than you have to be locked into a proprietary cloud. There are too many excellent choices available. I have addressed several in my recent blogs.
This marriage of vector fabric processors and the economies of scale that cloud computing will provide should create a fundamental economic disruption, both in the cloud computing vendor landscape and across many verticals: from finance to insurance to computational bio-pharmacology to retail to leisure. We are already seeing the convergence of medium-scale data warehousing workloads for analytics and MapReduce computational models (AsterData, Greenplum/VMWare, Netezza/IBM, etc.). This is only the harbinger of a far more systematic transformation in the way machine learning, predictive analytics, mass-scale simulation, and continuous (near-line) business process optimization are done in the next five years.
Imagine the three screens of the future analyst workstation: (left) historical data warehousing, (middle) real-time business intelligence, and (right) predictive analytics, simulation, what-if analysis, and regression modeling. All three screens (modes) can be incorporated in one unified platform for information and (actionable) insight, combining the power of vector processing engines with scalable cloud fabrics.
Interested parties are invited to try out q/kdb+ (a free evaluation of the 32-bit edition) or one of the many high-quality APL implementations available online. I would recommend two APLs in particular: APLX and Dyalog APL. All three are excellent offerings and will demonstrate the power of higher-order vector processing, in scenarios ranging from statistical analysis, portfolio strategy optimization, and risk management to search engine optimization (SEO) and analytical CRM for eCommerce. The old cliché of "back to the future" may well prove prescient in regard to APL and its many children.
Cloud fabric plus vector processing plus scalable storage (à la MongoDB, Membase, et al.) is the equation for the future of big data analytics and continuous, online machine learning. Say goodbye to Java, C++, C#, and the other scalar-at-a-time programming languages derived from Algol-60.
Welcome back to the serene elegance, conciseness, and high performance parallelism of APL (minus the custom symbology) in the big data world.