‘Big Data’

If ‘The Graduate’ were made today, Benjamin Braddock might hear a well-meaning uncle stage-whisper ‘Big Data’ instead of ‘Plastics.’ (Runners-up: ‘The Cloud’, ‘Social Discovery’, ‘Gamification’, the list goes on.) ‘Big data’ is a buzzword that people throw around a lot. What does it mean? Large data sets are not new. The IRS, the Census, Walmart, money center banks have always had big data sets. What’s changed?

Look at Google. You type ‘Justin,’ and within milliseconds it offers ‘Bieber’ as a completion. You hit enter, and it fires off a query to a bunch of computers (guessing low hundreds) that each have part of the Google index in RAM. They return answers, and the results are ranked. Then the page ids are fired off to a different set of computers that has the entire text of the relevant Web pages in RAM, and the relevant excerpts are retrieved. Finally, a Search Engine Results Page (SERP) gets built, with personalized ads etc. All this happens in milliseconds.

This was a revolutionary approach to working with large datasets. What’s changed is not just that there are more large datasets, but also, the tools Google pioneered have gone mainstream, and new computing stacks let you apply powerful algorithms in real time.

Near the bottom of the stack is the database. Oracle databases can handle any amount of data. Could Google have been built on top of Oracle? Probably. Should they have? No. First, Oracle’s business model isn’t designed for it, licensing for millions of servers would cost a fortune! Second, Oracle is poorly adapted to the problem. It can do complex queries across massive disks pretty efficiently. But if you want to split data across hundreds of servers and keep it in memory, you have to give up much of the functionality Oracle provides, and it would still have complexity and features that would take up resources, driving up the cost and degrading performance.

A traditional SQL database is built on ACID principles: (Atomicity, Consistency, Isolation, Durability), that make databases reliable for banks and accounting systems. A transaction either works or it doesn’t. It can’t be half-completed, where money is debited from one account and not credited to another. Database activity can be viewed as a single series of transactions and queries _ any query will see the database in a consistent state in the chain. Each transaction is isolated from every other, even if they happen concurrently, they don’t affect each other, as if they happened sequentially. And once a transaction is committed, it stays committed, regardless of power loss, network failure, etc.

ACID imposes a cost in complexity and performance. There’s a computer science result called Brewer’s CAP theorem (longer version, by Brewer). Pick any two: Consistency, knowing a single up-to-date state of the data; high Availability for writes; and Partition tolerance, no impact from network failure. The CAP theorem means that when you have a distributed system that communicates over a network, you have to make tradeoffs. If you have a master database and a few slaves mirroring the master for reporting, web queries, there will always be an instant where the master has been updated and the slaves are still out of date. There is no perfect disaster recovery (DR) system, where two systems are 100% in sync and if one fails, you can immediately fail over to the other with a guarantee of no data loss. There is always that instant where if the network fails, things are not 100% in sync. ACID and distributed systems don’t mix perfectly.

But Google doesn’t need ACID. Nobody cares much if two similar queries at the same time yield slightly different Google results, or Facebook walls. They do care if they don’t get it back in milliseconds.

And networks have grown incredibly fast, large and reliable. Meanwhile, CPUs are still getting cheaper, but individual CPU cores are not gaining performance as quickly, and speeding them up is less energy-efficient than buying more low-power CPUs. So companies like Google, Facebook, etc., use a different set of non-ACID tools in big server farms that let lots of cheap servers connected over fast networks work together.

Wags sometimes refer to the new databases, in contrast to ACID, as ‘BASE’: Basic Availability, Soft state, Eventually consistent. ‘NoSQL’ is more commonly used. Different NoSQL database systems are optimized for different use cases: key-value stores, document databases, graph databases. There has been an explosion of new database systems _ the growing blue area in this chart.

What does the Big Data stack look like?

Clusters of virtual machines on Amazon Web Services, Heroku, Rackspace, etc.: PaaS (platform as a service)
NoSQL databases for non-ACID applications: MongoDB, Cassandra, CouchDB, Redis, Riak (a popularity chart, Google Trends).
Platforms to distribute algorithms over those clusters _ Hadoop, MapReduce.
Analytical software to run complex machine learning algorithms on those platforms.

The last bullet point is what people most commonly associate with ‘Big Data’. Not only can Google return the most frequent completion as you type, it can update the most frequent searches in real time. More critically for the Google business model, it can predict which ad link you are most likely to click on, based on your recent activity, current location, everything it knows about your demographics and psychographics from your searches, your clicks on all the web sites in Google’s ad network, your emails in Gmail. Powerful machine learning algorithms like support vector machines (SVM) can run on clusters in a highly parallel manner to model behavior using hundreds of features and make predictions in real time.

If you have all the parameters to make a decision, given enough data and training, machine learning will generally find a good model and make useful predictions. Nevertheless, ad networks still present Christian singles ads to old married non-Christians. If you don’t collect the right parameters and feed them in the right form, you get garbage in, garbage out. If the data you have isn’t predictive, or you have the wrong model, having a lot of data isn’t going to help you ¹.

Bottom line: A new computing stack, pioneered 15 years ago by Google, has gone mainstream. It consists of ultra-cheap small computers connected by fast networks, plus distributed algorithms, for new applications that run at web scale and don’t have strong consistency requirements, and can make sophisticated predictions and decisions in real-time.

People call it ‘Big Data’, but what’s new and big is the computing power that can be directed at data in real-time. The distinguishing feature of this paradigm is dealing with more data than can fit in a computer’s memory by splitting up the task over many computers working in parallel in real-time, instead of doing them sequentially on a disk in batch mode.

In that sense, a traditional SQL data warehouse with petabytes of transaction data, batch processing business intelligence for online analytic processing and data mining, is not in the new Big Data paradigm, in the same sense as a real-time machine learning ad-serving engine that may have a smaller database.

It’s not the data that got big, it’s the computers that got small, fast, very numerous, and very very clever.

“One word: Cat pictures. Wait, that’s two words.”

¹ It’s undeniable that bad models were a signficant cause of the financial crisis. But it’s an error to say that the financial crisis was caused by big data gone awry. First of all, the bad models just so happened to be the ones that made the bankers the most money in the short run. It wasn’t so much bad big data as it was misaligned incentives, agent-principal problems, in some cases outright looting. Second, to my knowledge, there was exactly one firm that had a big data system that gave an integrated real-time picture of firm risks: Goldman Sachs, which averted life-threatening losses (although they were at risk if AIG or the whole system went down). Goldman’s system priced every instrument, however exotic, in real-time using the same curves, forward rates, volatilities, and correlation matrices, and allowed them to compute risk and test shocks against the whole firm. They were the only ones who said, we’re losing money every day on this when our model says we shouldn’t, let’s take a close look at this… and then cut losses and go short. Unlike Goldman Sachs, which grew organically, every other firm was cobbled together with mergers, and had different groups which did their own analytics in disparate systems which someone then struggled to integrate at the top level. They lost billions because, unlike Goldman, they didn’t use a big data paradigm. And also because “At Goldman, …when they say get out, they get out. At [competitor], when [the president] says get out, people start negotiating.”