By: Viktor Mayer-Schonberger and Kenneth Cukier
Summary: Mayer-Schonberger, professor of Internet governance and regulation at the Oxford Internet Institute, and Cukier, data editor of The Economist, delve into Big Data, the hottest trend in technology, and describe the impact it will have on the economy, business, science and society.
What is “Big Data”?
Simply put, Big Data refers to the growing ability to collect massive amounts of information, organize and analyze it, and draw correlations from the resulting dataset(s) that yield useful insights or valuable goods and services. Several factors are driving its growth:
* Increasingly cheap sensors
* Greater storage capacity at a falling cost
* Faster processing and new processing technologies (Hadoop and MapReduce)
* Increasing public (sometimes free) access to collected datasets
Most importantly: most, if not all, Big Data “answers” provide correlation, not causality (only what, not why). Example: Wal-Mart found that sales of strawberry Pop-Tarts increase immediately before a hurricane. It placed them at the front of stores next to the hurricane supplies, and sales rose dramatically.
Big Data is about three shifts: the ability to analyze vast datasets, a willingness to embrace messy data, and a growing respect for correlation.
Chapter 2: More
The shift from sampling: Standard statistics shows that sampling precision improves most dramatically with randomness, not with increased sample size. However, random samples have inherent weaknesses:
* Collection biases (e.g. landline phone samples underrepresent younger, cell-only users)
* Limited ability to scale down to subgroups (a random sample cannot be cut into ever smaller subgroups without introducing error)
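The subgroup problem can be made concrete with a little arithmetic. The sketch below is not from the book; it assumes a simple random sample from a large population and uses the standard 95% margin-of-error formula for a proportion. The function name `margin_of_error` and the sample sizes are illustrative choices.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated
    from a simple random sample of size n (large-population assumption)."""
    return z * math.sqrt(p * (1 - p) / n)

# A random sample of 1,000 respondents
full_sample = margin_of_error(1000)

# A subgroup holding only 100 of those respondents
subgroup = margin_of_error(100)

print(f"{full_sample:.3f}")  # 0.031, roughly +/- 3 percentage points
print(f"{subgroup:.3f}")     # 0.098, roughly +/- 10 percentage points
```

Population size does not appear in the formula, which is why a well-drawn sample of about a thousand works for a whole country. The same formula shows why slicing that sample into small subgroups quickly makes the estimates unreliable.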
The move to “N = ALL”: The concept of sampling is losing relevance as passive data collection grows. The trend is toward collecting all the data, or collecting continuously and reworking the algorithms. (Ex. the Lytro camera, which captures the whole light field so that focus can be chosen after the shot.)
Chapter 3: Messy
Accepting Inexactitude: In a world of sampling, reducing errors and ensuring high-quality data was a natural impulse. In a Big Data world, relaxing those standards allows not only “more to trump some” but “more to trump better”. (Ex. Hundreds of cheap, possibly inexact temperature sensors covering a vineyard outperform one expensive, 100% accurate sensor.) “Sometimes 2+2 can equal 3.9 and that’s good enough.”
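The vineyard example can be illustrated with a toy simulation. Everything here is an assumption made for the sketch, not data from the book: a vineyard of 1,000 plots whose true temperature rises linearly from 15 to 25 degrees, one perfectly accurate sensor in the middle, and 100 cheap sensors whose readings are off by random noise with a standard deviation of 0.5 degrees.

```python
import random

def mean_abs_error(truth, predicted):
    """Average absolute gap between true and estimated temperatures."""
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_plots = 1000

# Hypothetical vineyard: temperature climbs linearly from 15 to 25 degrees.
truth = [15 + 10 * i / (n_plots - 1) for i in range(n_plots)]

# One perfectly accurate sensor in the middle of the vineyard;
# its single reading is the only estimate available for every plot.
single_pred = [truth[n_plots // 2]] * n_plots

# 100 cheap sensors, one per block of ten plots (plots 5, 15, ..., 995),
# each reading distorted by Gaussian noise.
readings = {p: truth[p] + rng.gauss(0, 0.5) for p in range(5, n_plots, 10)}

# Each plot is estimated from the sensor in its own block.
cheap_pred = [readings[(i // 10) * 10 + 5] for i in range(n_plots)]

single_mae = mean_abs_error(truth, single_pred)
cheap_mae = mean_abs_error(truth, cheap_pred)
print(f"one exact sensor: {single_mae:.2f}")
print(f"100 noisy sensors: {cheap_mae:.2f}")
```

Under these assumptions the noisy network gives a much smaller average error across the vineyard, because coverage matters more than the precision of any single reading. With different noise levels or a flatter temperature profile the gap would narrow.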
Chapter 4: Correlation
Definition of Correlation: A correlation quantifies the statistical relationship between two or more things. With correlations there is no certainty, only probability.
Big Data surfaces correlations without the need for hypotheses about a phenomenon:
* Target didn’t need to know that women were pregnant, only that certain purchases likely meant pregnancy
* The Medication Adherence Score predicts how likely people are to take their medication based on seemingly irrelevant variables –
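One common way to put a number on such a relationship is the Pearson correlation coefficient, which ranges from -1 to 1. The book summary does not name a specific measure, so treat this as one illustrative choice. The function name `pearson_r` and the data are made up for the sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of the two series around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Spread of each series around its own mean
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers: days of storm warnings vs. units of a snack sold
storm_warnings = [0, 1, 2, 3, 4]
snack_sales = [10, 12, 14, 16, 18]

r = pearson_r(storm_warnings, snack_sales)
print(round(r, 3))  # close to 1: the two series rise together
```

A value near 1 or -1 says the two series move together. It says nothing about why, which is the "what, not why" point made earlier.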