In the time that it takes you to read this article, about 26 million gigabytes of data will be produced. That's equivalent to about 2.34 billion minutes of standard definition video on iTunes or about 13 billion e-books. Over a year, data produced at this rate would fill a stack of DVDs tall enough to reach the Moon twice, with some left over. And the rate is growing.
Perhaps you want to sort through thousands of customer product reviews, track the spread of an epidemic via airports, or model and make predictions from complex DNA structures. The quantities of data required are often far beyond the capacity of humans to make sense of manually. So, how can we make sense of this data and harness it to do meaningful things?
Enter: data science. Perhaps you've heard of terms like machine learning or big data. The Harvard Business Review calls "data scientist" the "sexiest job of the 21st century." Universities are even designing new degree programs specifically aimed at churning out analysts to meet the demands of the data revolution. But what exactly is data science, what is new about it, and is it just a bunch of hype and hyperbole?
Whether you hope to pursue a PhD in this topic or simply become a more data-literate consumer, this article is designed to provide a starting point for your journey.
Modern data science
"Data" refers to the many qualitative and quantitative values that surround us. The nascent concept of data science refers to the use of automated methods and procedures to analyze and extract knowledge from massive quantities of data. There is little agreement about what constitutes "massive" or "big" data, so instead researchers often refer to the size of datasets by the number (N) of observations. We can broadly think of data science as revolving around obtaining and managing data, as well as making sense of this data and communicating the findings to a broader audience.
The interdisciplinary foundations of data science can be traced to a long history of work in mathematics, statistics, and computer science. Many of the fundamental concepts and approaches found in data science are based on this work, such as linear regression in statistics, graph theory in mathematics, and artificial intelligence in computer science. With these inherited approaches in hand, the explosion of data availability and computing power over the past two decades presents new challenges and opportunities which data scientists attempt to tackle.
First, data scientists access, obtain, and manage large quantities of data. This might be accomplished by developing methods for automatically mining data from the internet, or simply by storing data in, and retrieving it from, existing large databases through query languages such as SQL.
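As a rough illustration of talking to a database through a query language, the sketch below uses Python's built-in sqlite3 module and SQL. The `reviews` table, its columns, and the handful of rows are invented for this example; a real project would connect to an existing database rather than build one in memory.

```python
import sqlite3

# Hypothetical in-memory table of product reviews (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (product TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO reviews (product, rating) VALUES (?, ?)",
    [("kettle", 5), ("kettle", 3), ("toaster", 4), ("toaster", 2), ("toaster", 3)],
)

# Let the database do the work: count the observations (N) and
# average the rating for each product.
rows = conn.execute(
    "SELECT product, COUNT(*) AS n, AVG(rating) AS mean_rating "
    "FROM reviews GROUP BY product ORDER BY product"
).fetchall()
conn.close()

for product, n, mean_rating in rows:
    print(product, n, mean_rating)
```

The same query would run unchanged against a table with millions of rows, which is much of the appeal: the analyst describes the result wanted and the database decides how to compute it.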
Oftentimes it is not enough to simply have the data; we need to make sense of it. The difficulty here is usually related to size and complexity. For example, dimensionality reduction aims to reduce the number of variables under consideration so that the data become more amenable to existing statistical models. Or, in the case of artificial intelligence, researchers working on self-driving cars are tasked with adapting existing models or developing new approaches so that a car can absorb the millions of data points surrounding it in real time, make sense of them, and then drive accordingly.
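To make dimensionality reduction concrete, here is a minimal sketch of one common technique, principal component analysis, reducing two variables to one. The article does not name a specific method, so this choice is an assumption, as are the function names and the made-up data points. It relies on the closed-form direction of largest variance for a 2x2 covariance matrix, so it only handles two input variables.

```python
import math

def first_principal_component(points):
    """Return the mean and the unit direction of largest variance for 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def project(points, mean, direction):
    """Replace each (x, y) pair with a single score along the direction."""
    mx, my = mean
    dx, dy = direction
    return [(x - mx) * dx + (y - my) * dy for x, y in points]

# Made-up observations in which y is roughly twice x.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
mean, direction = first_principal_component(points)
scores = project(points, mean, direction)
print(direction)
print(scores)
```

Each observation is now described by one number instead of two, and because the two original variables were strongly related, little information is lost.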
These cases are also examples of where machine learning comes in handy for data science tasks and sometimes fundamentally drives the task at hand. In the case of dimensionality reduction, we might want to find a linear combination of features that separates the data into different classes (the idea behind linear discriminant analysis) and then predict the classes of new observations. In the case of self-driving cars, visual object recognition software is being developed which uses artificial neural networks, models loosely inspired by how the human brain works.
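The idea of finding a linear combination of features that separates classes can be sketched with Fisher's linear discriminant for two classes and two features. This is one possible reading of the paragraph above, not necessarily the method the author had in mind. The function names, the toy data, and the midpoint threshold (which assumes both classes are equally likely) are all assumptions made for illustration.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, m):
    """Within-class scatter around mean m, returned as (sxx, sxy, syy)."""
    sxx = sum((x - m[0]) ** 2 for x, _ in points)
    sxy = sum((x - m[0]) * (y - m[1]) for x, y in points)
    syy = sum((y - m[1]) ** 2 for _, y in points)
    return sxx, sxy, syy

def fit_fisher(class_a, class_b):
    """Return weights w and a threshold separating class 'a' from class 'b'."""
    ma, mb = mean2(class_a), mean2(class_b)
    sa, sb = scatter2(class_a, ma), scatter2(class_b, mb)
    sxx, sxy, syy = sa[0] + sb[0], sa[1] + sb[1], sa[2] + sb[2]
    det = sxx * syy - sxy * sxy
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]
    # w = Sw^{-1} (mb - ma), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    # Assumed decision rule: cut halfway between the projected class means.
    threshold = 0.5 * (w[0] * (ma[0] + mb[0]) + w[1] * (ma[1] + mb[1]))
    return w, threshold

def predict(w, threshold, point):
    """Classify a new observation by its score along w."""
    score = w[0] * point[0] + w[1] * point[1]
    return "b" if score > threshold else "a"

# Made-up training data: two well-separated clusters.
class_a = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class_b = [(6.0, 7.0), (7.0, 6.0), (7.0, 8.0), (8.0, 7.0)]
w, threshold = fit_fisher(class_a, class_b)
print(predict(w, threshold, (2.5, 2.5)))
print(predict(w, threshold, (6.5, 7.5)))
```

The weights `w` define the single combined feature, and new observations are classified by which side of the threshold their score falls on.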
Finally, data scientists must communicate their findings. In fact, a whole field is developing around the task of data visualization, which helps to present findings in intuitive and accessible ways. The approach adopted may also depend on the audience to whom you're presenting.