This is a response to many pieces I've read on big data analysis and large-scale social media analysis and, in particular, a recent NYT op-ed column by David Brooks:
"The theory of big data is to have no theory, at least about human nature. You just gather huge amounts of information, observe the patterns and estimate probabilities about how people will act in the future. . . . To discern meaningful correlations from meaningless ones, you often have to rely on some causal hypothesis about what is leading to what. You wind up back in the land of human theorizing."
Mr. Brooks is correct, and this is something I've written about before. As much as big data scientists hate to admit it, there is social science underpinning their algorithms and interpretations. Teams at top institutes are using a theory called homophily, and just hearing the name turns my stomach. Not so much because of the word (although it is an odd term), but for the same reasons I grimaced when I heard the book Three Cups of Tea spun as a handbook for foreign policy. However, since my visceral response is not a recognized metric (yet), I will enumerate my objections to the theory’s current application to big data mining.
Homophily is a social network theory roughly asserting that it is our common ‘likes’ that bind us. It has inspired research papers with titles such as ‘Birds of a Feather [Stick Together],’ which chronicles many types of social network theory. Did I mention that this is all sociology, not information science, and that it is the study of humans relating offline, out in the world?
Yes, these theories started back around the 1960s in the United States, when racial upheaval motivated social scientists to ask, ‘What is it that binds us together at all?’ The insights they gained from work stretching into the 1990s, perhaps combined with the familiar word ‘network,’ have attracted researchers from information science to apply the findings to the online domain.
Now, assuming that humans behave offline in the same way they do online is one leap, but big data scientists have made yet another. The heaps of data come from many sources, such as social media, applications, and devices. Headlines were made when researchers at MIT predicted the political leanings of mobile phone users and even tracked the spread of illness based on users' habits. The ground-breaking results from MIT were based only on American user behavior, but were discussed as though they were universally applicable.
The results assumed the one user/one device/one account rule of the US, but this isn’t the pattern in some communal cultures. Social scientists are just beginning to study user engagement with technology in places like Nigeria and Indonesia, and they are discovering that much of what we thought we knew, and much of what we assumed was universal, does not hold up under scrutiny and is increasingly dynamic.
Bulk and ease of access to data do not immediately add up to persuasive conclusions. I am not convinced by the argument I hear so often from big data proponents, ‘the data speaks for itself,’ because the models I come across, which any human needs in order to make sense of the tonnage of data, are based on Cold War-era foundations. Theories like homophily do not take into account the advent of the internet or cultural variations. When interpreting social media data or making assumptions about user behavior, culture is a variable that cannot be ignored.
The truth is we don’t know what to do with all this data yet. And I am a bit torn here because my inner engineer wants to build better models, to improve. I am fascinated by the problem of how to incorporate cultural variation and increase what we can learn from the rich store of information at our disposal. But what will this be used for? Researchers at Harvard's Berkman Center aspired, through this flawed method, to create a model of the Iranian blogosphere as 'unique as a snowflake.' I probably don't need to explain the value of this research, but the social science foundation proposed simultaneously that all humans behave in a similar and predictable manner and also that unique cultural insights can be gained from a model that ignores cultural variation. (If you got lost in that last sentence, you're actually right where you should be. Most research grounded in homophily makes about that much sense.) So this is where I hope the larger community of scientists, both social and data scientists, can have a rigorous debate about how to do better: how to make better models and how to concern ourselves with the broader ethical implications.