Thoughts on 'Big Data' as a fad

As a current student pursuing a career in data science, it really gets on my nerves when people ask me “What is Big Data?”, or worse, talk about how I should get “into” it. The problem is that the size of the data is not what makes data interesting. As statisticians, as researchers in machine learning, or just as students, we are far more interested in figuring out what to do with the data we have than in worrying about how big it is.

I get it. I need to be able to make accurate predictions and estimates and forecasts, and I might need to do that with less data than I’d like, but that’s a collection issue. Things don’t start falling into place just because you have a database with a lot of dense stuff in it. Your data science is only as good as the statistical analysis you put behind it. True Big Data is about leveraging the correct data, in the correct form, with the correct models.

On the other hand, we still have to deal with this “Big Data” buzzword fiasco. It’s becoming a household term, which is a sure sign that it’s gone “mainstream” and is no longer the corporate hipster slang it used to be. Go search for the phrase “Big Data” on LinkedIn and take a look at the profiles that show up. We get lots of references to “Big Data” platforms and systems, but does anyone actually say, “Yeah, I do data analysis”?

No, they don’t.

And it’s just like ten years ago, when people were clamoring to get in line for the “Data Mining” train. It created this false perception that incredibly important nuggets of information are hidden behind a thick layer of dust and grime, but if we mine hard enough, or use enough computers, we’ll be able to find them!

Except that’s not how it works, either.

More likely than not, you’re searching for patterns, not nuggets. And you’ll never know if you found your pattern or if you’re just looking at some more dirt. You’ll need your statistics (or a friendly statistician on hand) to interpret that dirt, and to tell you how projecting the dirt through multidimensional scaling can reduce misclassification of future dirt, somehow.
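
For the curious, here is roughly what that dirt-projection joke looks like in practice: a minimal sketch, assuming scikit-learn, that embeds some synthetic “dirt” with multidimensional scaling and then trains a k-nearest-neighbors classifier on the embedded coordinates. The dataset, the classifier choice, and every parameter here are illustrative, not anything from a real pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import MDS
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Some synthetic "dirt": 200 samples in 20 noisy dimensions, 2 classes.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=0)

# Project the dirt down to 2 dimensions with multidimensional scaling.
# (Classic MDS has no out-of-sample transform, so we embed everything
# up front and split afterwards -- fine for a toy example.)
embedding = MDS(n_components=2, random_state=0).fit_transform(X)

# Classify future dirt, somehow: k-NN on the embedded coordinates.
X_train, X_test, y_train, y_test = train_test_split(
    embedding, y, test_size=0.25, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(f"misclassification rate: {1 - clf.score(X_test, y_test):.2f}")
```

Whether that embedding actually helps, or whether you are just looking at prettier dirt, is exactly the question the statistician is there to answer.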

The one good thing about all these people going head over heels for Big Data is that everything is being recorded. The data is already out there, being voluntarily relinquished every day. Wait just long enough, write a program that’s just smart enough, and you can start connecting some really cool dots.

History will remember how the world was rocked by Web 2.0 (the “Social Revolution” of everything we do online), but there’s another storm coming. For years now, companies have been recording every ounce of data they can afford to store on a hard drive and that users are willing to part with. Then they hand that information off (they’ll tell you they never /sell/ it) to someone who can leverage it. The more dirt they have, the better the chance they can polish that dirt into a really cool training set of gold dirt, and then they’ll make tons of money.

So now the question for me is, “Who is going to be the one to leverage that data?” Big Data is nothing more than what happens when your current data is too big for your current hardware. It’s a scaling problem and an engineering problem, but it is not a scientific problem. Statistical analysis is a scientific problem, and I’ll try to position myself right at the front of that train. Web 3.0: the Data Revolution.
