By Vedant Misra | March 1, 2016
I can’t remember the last time I made it through an airport terminal without seeing a giant abstract poster up on the wall with a vague heading like “Big Data is here: are you ready?”, or “Big Data: the new natural resource”. Often, the heading is in front of a photo of clouds, or a gazelle, or a server room, or some other such thing.
So, for the weary business traveler who at this point is too afraid to ask what “Big data” refers to, here’s a quick breakdown.
What is big data?
The term “Big Data” became popular when businesses started to regularly interact with datasets that were big enough that you couldn’t load them into your computer’s RAM, which these days is generally somewhere between 1GB and 16GB.
This has been a problem for a long time, but this particular name for it didn’t really take off until around 2004.
If your dataset doesn’t fit in RAM, you can’t do anything with it that requires looking at all of it at the same time, which makes it a pain to work with. You need to load a chunk of it off a hard drive, do something with it, and then load the next chunk.
Alternatively, you’re dealing with big data if whatever you’re doing with your data takes so long that you need to get multiple computers to work on parts of it at the same time.
That’s basically the gist of it. “Big data” refers to the set of techniques and solutions that deal with those two kinds of scale—data that there’s a lot of, or computations that take too long to run. In other words, when data achieves an inconvenient scale in either space or time.
You may also have a big data problem if you need to do real-time processing of data that is generated or modified at a sufficiently high rate that you need multiple servers to handle the flow.
Together, those two reasons are the “Volume” and “Velocity” in Gartner’s popular “3 Vs” characterization of big data—“Volume, Velocity, Variety.” We think “Variety” is neither necessary nor sufficient for you to have a big data problem.
A big data example
Imagine you’ve acquired a giant CSV file from Facebook that contains a billion rows of data about Facebook users. Let’s say there are only two columns—“Name” and “Age”.
Let’s also say the average length of a name is about 10 characters, and the average length of an age is about two characters. So for each row, you have 12 characters, which makes for a total of 12 billion characters.
A single character takes up about a single byte of space on a hard drive. That means your dataset is about 12 billion bytes, or 12 gigabytes.
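If you want to redo that back-of-the-envelope estimate with your own numbers, here is a minimal Python sketch. The function name `estimate_size_bytes` is made up for illustration, and like the estimate above it assumes one byte per character and ignores commas and line breaks.

```python
# Rough dataset size estimate, using the numbers from the example above.
ROWS = 1_000_000_000   # one billion rows
AVG_NAME_CHARS = 10    # average length of a name
AVG_AGE_CHARS = 2      # average length of an age


def estimate_size_bytes(rows, chars_per_row, bytes_per_char=1):
    """Estimate on-disk size, assuming a fixed number of bytes per character.

    Ignores separators and newlines, so treat the result as a ballpark figure.
    """
    return rows * chars_per_row * bytes_per_char


size_bytes = estimate_size_bytes(ROWS, AVG_NAME_CHARS + AVG_AGE_CHARS)
size_gb = size_bytes / 1e9  # 1 GB taken as 10^9 bytes

print(f"Estimated size: {size_gb:.0f} GB")
```

Comparing that number to the RAM on your machine is a quick first check on whether chunking is needed at all.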
Suppose you want to compute the average age of Facebook users. Computing an average means summing up all the values of something and dividing by the number of things you added up. So if you can’t load this dataset into your RAM all at once, you’ll need to load up chunks of it at a time, add up the ages in that chunk, store the resulting sum, and also store the number of rows you just summed over. If you split up your data into, say, ten chunks, at the end, you’ll have ten sums, and ten counts. Then you just add up the sums, and add up the counts, and divide the total sum by the total count.
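Here is one way that chunk-by-chunk average might look in Python. This is a sketch, not a production pipeline: the helper name `chunked_average` is invented for this post, and a tiny in-memory CSV with made-up rows stands in for the 12-gigabyte file.

```python
import csv
import io
from itertools import islice


def chunked_average(rows, column, chunk_size):
    """Average a numeric column while holding only one chunk in memory."""
    partials = []  # one (sum, count) pair per chunk
    while True:
        # Pull the next chunk_size rows off the iterator.
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        values = [int(row[column]) for row in chunk]
        partials.append((sum(values), len(values)))

    if not partials:
        raise ValueError("no rows to average")

    # Add up the sums, add up the counts, then divide.
    total_sum = sum(s for s, _ in partials)
    total_count = sum(c for _, c in partials)
    return total_sum / total_count


# Tiny made-up stand-in for the giant file.
fake_file = io.StringIO("Name,Age\nAlice,30\nBob,40\nCarol,50\nDave,20\nEve,35\n")
reader = csv.DictReader(fake_file)

average_age = chunked_average(reader, "Age", chunk_size=2)
print(average_age)
```

With a real file you would hand `csv.DictReader` an open file object instead of the `StringIO`, and pick a chunk size that comfortably fits in RAM. Because each chunk only contributes a sum and a count, the chunks could also be handed to different machines and combined at the end.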
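Most big data problems are solved with some version of this approach: split the data into chunks, process each chunk separately, and combine the partial results. Of course, things get tricky when you want to do more complex things, or if you have specific requirements for how fast you need a response, for example. But this is the core idea.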
Why you probably don’t have big data
You probably only have big data problems if whatever you’re doing with your data requires you to look at every single item in the dataset.
That’s because statisticians spent the bulk of the 1800s learning how to work with samples of large datasets. Most of the time, the answer you get by looking at a random sample of your data is good enough, if you just apply the techniques of classical statistics.
In the example above, you could have easily just taken a random sample of 100,000 rows from your 1 billion row dataset, and computed the average of those rows. The answer would probably be basically the same, and the degree to which you might be off can be quantified rigorously using statistics.
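To make the sampling idea concrete, here is a small Python sketch. The ages are synthetic (random integers between 13 and 90, not real Facebook data), and the population and sample sizes are scaled down so it runs quickly. It estimates the average from a random sample and uses the standard error to put a rough 95% interval around that estimate.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for the full "Age" column (made-up data).
population = [rng.randint(13, 90) for _ in range(200_000)]

# Look at a random sample instead of every row.
sample = rng.sample(population, 10_000)

sample_mean = statistics.mean(sample)

# Standard error of the mean: sample standard deviation / sqrt(sample size).
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval for the true average.
low = sample_mean - 1.96 * standard_error
high = sample_mean + 1.96 * standard_error

# Only possible here because the fake population is small enough to scan.
population_mean = statistics.mean(population)

print(f"sample estimate: {sample_mean:.2f} (95% interval {low:.2f} to {high:.2f})")
print(f"full-scan answer: {population_mean:.2f}")
```

The standard error shrinks with the square root of the sample size, so it depends on how many rows you sample rather than on how many rows exist in total.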
You only need to look at every single row in specific cases. At Facebook, engineers regularly need to modify data in their databases in a very particular way for every single user, often in complex and interrelated ways. They also need to deal with lots of updates and deletions by lots of people in real-time. In these cases, you can’t just use statistics to simplify away the problem.
How to talk about big data
If you’re discussing your organization’s data requirements and scale with someone, don’t hesitate to get into the weeds. Avoid talking in abstractions and generalities. Ask specific questions:
- What’s the total size of the data on disk? A few gigabytes? A few terabytes?
- Is the data you’re working with growing? If so, by how much, and how often?
- What do you need to do with it? Do you need immediate responses to complex queries across multiple datasets? Or is it sufficient to run offline batch jobs once a week?