Computational Analysis of Digital Communication Article
Lecture 1:
-ARTICLE 1: Van Atteveldt & Peng (2018) When Communication Meets Computation: Opportunities, Challenges, and Pitfalls in Computational Communication Science
The role of computational methods in communication science
The recent acceleration in the promise and use of computational methods for communication science is driven primarily by the convergence of several developments:
1. A deluge (overflow) of digitally available data, ranging from social media messages and other digital traces to web archives and newly digitized newspaper and other historical archives.
2. Improved tools to analyze this data, including network analysis methods and automatic text analysis methods such as supervised text classification, topic modelling, and syntactic methods.
3. The emergence of powerful and cheap processing power, and easy-to-use computing infrastructure for processing these data, including scientific and commercial cloud computing, sharing platforms such as GitHub and Dataverse, and crowd coding platforms such as Amazon MTurk and CrowdFlower.
Many of these new data sets contain communication artifacts such as tweets, posts, emails, and reviews. These new methods are aimed at analyzing the structure and dynamics of human communication. These three developments have the potential to give an unprecedented boost to progress in communication science, provided we can overcome the technical, social, and ethical challenges presented by these developments.
Big data can be defined by:
- Large and complex data sets
- Consisting of digital traces and other naturally occurring data
- Requiring algorithmic solutions to analyze
- Allowing the study of human communication by applying and testing communication theory
Computational methods do not replace the existing methodological approaches, but rather complement them. Computational methods are an expansion and enhancement of the existing methodological toolbox, while traditional methods can also contribute to the development, calibration, and validation of computational methods.
Opportunities offered by computational methods
Computational methods allow us to analyze social behavior and communication in ways that were not possible before and have the potential to radically change our discipline in at least 4 ways:
From self-report to real behavior: Digital traces of online social behavior can function as a new behavioral lab available for communication researchers. These data allow us to measure actual behavior in an unobtrusive way rather than self-reported attitudes or intentions. This can help overcome social desirability problems, and it does not rely on people's imperfect estimates of their own desires and intentions. It becomes methodologically viable to unravel the dynamics underlying human communication and disentangle the interdependent relationships between multiple communication processes. It is now possible to trace news consumption in real time and combine it with survey data to get a more sophisticated measurement of news consumption and effects.
From lab experiments to studies of the actual social environment: We can observe the reactions of persons to stimuli in their actual environment rather than in an artificial lab setting. In their daily lives, people are exposed to a multitude of stimuli simultaneously, and their reactions are also conditioned by how a stimulus fits into their overall perception and daily routine. Researchers are mostly interested in social behavior, and how people act strongly depends on the actions and attitudes of others in their social network. The emergence of social media facilitates the design and implementation of experimental research. Crowdsourcing platforms lower the obstacles to recruiting research subjects. However, implementing an experimental design on social media is not an easy task. Social media companies are very selective about their collaborators and about research topics, because they fear reputational damage, and the process can also be extremely time-consuming.
From small-N to large-N: Increasing the scale of measurement can enable researchers to study more subtle relations or effects in smaller subpopulations than is possible with the sample sizes normally available in communication research. In order to leverage the more complex models afforded by larger data sets, we need to change the way we build and test our models. It is useful to consider techniques developed in machine learning research for model selection and model shrinkage (penalized regression and cross-validation), which are aimed at out-of-sample prediction rather than within-sample explanation. These techniques estimate more parsimonious models and hence alleviate the problems of overfitting that can occur with large data sets.
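The combination of penalized regression and cross-validation mentioned above can be illustrated with a small sketch. It is not taken from the article: it assumes ridge regression as one form of penalized regression, uses synthetic data in which only two of ten predictors have a real effect, and all names (fit_ridge, cv_error, LAMBDAS, and so on) are hypothetical. It uses only the Python standard library. The penalty strength is chosen by the lowest out-of-sample error across folds, not by within-sample fit.

```python
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(X, y, lam):
    """Ridge coefficients without intercept: solve (X'X + lam * I) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def mse(X, y, beta):
    """Mean squared prediction error of coefficients beta on the data (X, y)."""
    preds = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((t - q) ** 2 for t, q in zip(y, preds)) / len(y)


def cv_error(X, y, lam, k=5):
    """Mean out-of-sample squared error over k interleaved folds."""
    n = len(y)
    errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        X_train = [X[i] for i in range(n) if i not in held_out]
        y_train = [y[i] for i in range(n) if i not in held_out]
        X_test = [X[i] for i in sorted(held_out)]
        y_test = [y[i] for i in sorted(held_out)]
        beta = fit_ridge(X_train, y_train, lam)  # fit on training folds only
        errors.append(mse(X_test, y_test, beta))  # evaluate on the held-out fold
    return sum(errors) / k


# Synthetic data (an assumption for illustration): 10 predictors, only 2 matter.
rng = random.Random(42)
N, P = 60, 10
TRUE_BETA = [2.0, -1.0] + [0.0] * (P - 2)
X = [[rng.gauss(0, 1) for _ in range(P)] for _ in range(N)]
y = [sum(b * v for b, v in zip(TRUE_BETA, row)) + rng.gauss(0, 1) for row in X]

LAMBDAS = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: cv_error(X, y, lam) for lam in LAMBDAS}
best_lambda = min(cv_errors, key=cv_errors.get)

for lam in LAMBDAS:
    print(f"lambda={lam:<6} cross-validated MSE={cv_errors[lam]:.3f}")
print("selected lambda:", best_lambda)
```

A larger penalty shrinks the coefficients toward zero, which trades a little within-sample fit for more stable predictions on new data. In practice a tested library implementation would normally be preferred over a hand-written solver like this one.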
From solitary to collaborative research: Digital data and computational tools make it easier to share and reuse resources. An increased focus on sharing data and tools will also force us to be more rigorous in defining operationalizations and documenting the data and analysis process. Fostering the interdisciplinary collaboration needed to deal with larger data sets and more complex computational techniques can change the way we do research. By offering a chance to zoom in from the macro level down to the individual data points, digital methods can also bring quantitative and qualitative research closer together, allowing qualitative research to improve our understanding of data and build theory, while keeping the link to large-scale quantitative research to test the resulting hypotheses.
Challenges and pitfalls in computational methods
As said before, computational methods offer a wide range of possibilities for communication researchers to explore new research questions and re-examine classical theories from new perspectives. By observing actual behavior in the social environment, and if possible of a whole network of connected people, we get a better measurement of how people actually react, rather than of how they react in the artificial isolation of the lab setting. Large-scale exploratory research can help formulate theories and identify interesting cases or subsets for further study, while at the same time smaller and qualitative studies can help make sense of the results of big data research. Big data research can help test whether causal relations found in experimental studies actually hold in the wild on large populations and in real social settings. Using these new methods and data sets also creates a new set of challenges and pitfalls:
How do we keep research datasets accessible?
Although the volume, variety, velocity, and veracity of big data have been repeatedly touted in both news reports and scholarly writings, the hard truth is that many big data sets are proprietary and very difficult for most communication researchers to access. Studies by researchers with connections to the data owners are generally based on only a single platform, which makes it challenging to develop a panoramic understanding of users' behavior on social media as a holistic ecosystem and increases generalizability problems. Such privileged access to big data will thwart the reproducibility of computational research, which serves as the minimum standard by which scientific claims are judged. Samples of big data on social media are made accessible to the public either in their original form or in aggregate format. External parties also create accessible archives of web data. However, the sampling, aggregation, and other transformations imposed on the released data are a black box, which makes it very hard for communication researchers to evaluate the quality and representativeness of the data and then assess the external validity of findings derived from such data. It is important to make sure that the data is open and transparent and that research is not reserved for the privileged few who have the network or resources to acquire data sets. It is vital that we stimulate sharing and publishing data sets. Where possible these should be fully open and published on platforms such as Dataverse; where privacy or copyright reasons require it, the data should be securely stored but accessible under clear conditions. A corpus management tool can help alleviate copyright restrictions by allowing data to be queried and analyzed even if the full text of the data set cannot be published. By working with funding agencies and data providers such as newspaper publishers and social media platforms, we can make standardized data sets available to all researchers.
Is big data always good data?
Big data is found while survey data is made. Most big data are secondary data, intended for other primary uses, most of which have little relevance to academic research.
By contrast, most survey data are made by researchers who design and implement their studies and questionnaires with specific research purposes in mind. Big data is found and then tailored or curated by researchers to address their own theoretical or practical concerns. The gap between the primary purpose intended for big data and the secondary purpose found for big data poses a threat to the validity of design, measurement, and analysis in computational communication research. That data is ‘big’ does not mean that it is representative of a certain population. Representative survey data show that people do not randomly select into social media platforms, and very limited information is available for communication researchers to assess the representativeness of big data retrieved from social media. Specialized actors on social media (issue experts, professionals, institutional users) are over-represented while the ordinary public is under-represented in computational research, which leads to a sampling bias that must be carefully handled. This means that p-values are less meaningful as a measure of validity. For very large data sets, where representativeness, selection, and measurement biases are a much greater threat to validity than small sample sizes, p-values are not a very meaningful indicator of effect. The size of the data is neither a sign of validity nor of invalidity of the conclusions. For big data studies you should focus more on substantive effect size and validity than on mere statistical significance, by showing confidence intervals and using simulations or bootstrapping to show the estimated real effects of the relations found.
Are computational measurement methods valid and reliable?
The unobtrusiveness of social media data makes them less vulnerable to traditional measurement biases, such as instrument bias, interviewer bias, and social desirability bias. However, this does not imply that they are free of measurement errors. Measurement errors can be introduced when text mining techniques are employed to identify semantic features in user-generated content, whether using dictionaries, machine learning, or unsupervised techniques, and when social and communication networks are constructed from user-initiated behavior. Researchers found that different sentiment dictionaries capture different underlying phenomena and highlight the importance of tailoring lexicons to domains to improve construct validity. Researchers also observe a lack of correlation between sentiment dictionaries, and similarly argue for the need for domain adaptation of dictionaries. Similar to techniques like factor analysis, unsupervised methods such as topic modelling require the researcher to interpret and validate the resulting topics, and although quantitative measures of topic coherence exist, these do not always correlate with human judgments of topic quality. It should be noted that classical methods of manual content analysis are also no guarantee of valid or reliable data. Researchers show that using trained manual coders to extract subjective features such as moral claims can lead to overestimation of reliability, and argue that untrained (crowd) coders can actually be better at capturing intuitive judgements. Measurement errors can introduce systematic biases in subsequent multivariate analysis and threaten the validity of statistical inference. This means that we need to emphasize the validity of measurements of social media and other digital data.
What is responsible and ethical conduct in computational communication research?
The scientific community and the general public have expressed growing concern about ethical conduct in computational social science.
Such concerns can arise at different steps of computational communication research. For example, in field experiments on social media, how can researchers get informed consent from the subjects? When users of a social media platform accept the terms of service of the platform, can researchers assume that the users have given explicit or implicit consent to participate in any type of experiment conducted on the platform? There is no unambiguous answer to these questions, but ignoring these problems and losing the trust of the general public is not an option either. This calls for a collective effort from the whole community to establish responsible conduct of research in computational communication research.
How do we get the needed skills and infrastructure?
Reaping (harvesting) the benefits of computational methods requires that, as a scientific community, we invest in skills, infrastructure, and institutions. It is important that as practitioners we are skilled at dealing with data and computational tools. Many digital traces and other big data are textual rather than the numerical data most scholars are trained for and used to, and will require us to hone skills in natural language processing. Collaboration with other researchers is important, but collaboration requires research that is innovative and challenging to both sides, and in many cases what we need is a good programmer to help us gather, clean, analyze, and visualize data rather than a scientist to invent a new algorithm. Not all researchers can afford to hire such programmers. Thus, researchers expect that doing research in communication science will increasingly demand at least some level of computational literacy. It is vital that we make methods more prominent in our teaching to make sure the new generation of