Replication data and scripts for: The interplay of complexity and subjectivity in opinionated discourse. (version 1.0)
DOI: https://zenodo.org/badge/latestdoi/189996444
This repository comprises the original data, scripts and extensive statistics for the analysis of text complexity and subjectivity described in the related publication:
- Ehret, Katharina, and Maite Taboada (2021). "The interplay of complexity and subjectivity in opinionated discourse." Discourse Studies 23 (2): 141-165. DOI: https://doi.org/10.1177/1461445620966923
This publication is a large-scale, quantitative analysis of text complexity and various markers of subjectivity in opinionated discourse. Specifically, the authors investigate how text complexity interacts with markers of subjectivity to characterise (i) opinion articles, (ii) reader comments, and (iii) news articles. Methodologically, conditional inference trees and random forests (as implemented in the R package partykit) are used to unravel the interactions between text complexity and subjectivity. Text complexity is defined in terms of Kolmogorov complexity, i.e., the complexity of a text is measured as the length of the shortest possible description necessary to regenerate the original text. Subjectivity is operationalised as the frequency of lexico-grammatical markers of subjectivity and argumentation which have been well-established in research on sentiment, evaluation, stance and Appraisal.
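Kolmogorov complexity cannot be computed exactly in general, so in practice it is usually approximated from above with a compression algorithm. The following is a minimal sketch of that idea, not the procedure used in the publication: the choice of `zlib`, the function name `compressed_size` and the two toy texts are assumptions made purely for illustration.

```python
import random
import zlib


def compressed_size(text: str) -> int:
    """Return the size in bytes of the zlib-compressed UTF-8 text.

    Assumption: compressed size serves as an upper-bound proxy for
    Kolmogorov complexity; this is not the authors' script.
    """
    return len(zlib.compress(text.encode("utf-8"), 9))


# Two toy texts of identical length (hypothetical data).
repetitive = "the cat sat on the mat. " * 50
rng = random.Random(0)  # fixed seed so the example is reproducible
varied = "".join(
    rng.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(len(repetitive))
)

# The highly repetitive text needs a much shorter description.
print(compressed_size(repetitive), compressed_size(varied))
```

A text that repeats itself compresses well and therefore counts as less complex under this proxy than an equally long text with little redundancy.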
The data published in this repository was retrieved from the Simon Fraser University opinion and comments corpus (SOCC) and a custom-made corpus of general news articles from the Canadian online newspaper The Globe and Mail.
This repository contains the following resources (in alphabetical order):
This folder contains the original dataset.
- aggregate_totals_normalised.csv: The feature matrix with the individual file names as rows and textType, year, tokens, the raw and normalised feature frequencies, and the complexity scores as columns. The normalised frequencies of the subjectivity and argumentation markers were calculated by dividing the raw feature frequencies by the number of tokens per file and multiplying by 1000.
- markerDistributions.csv: The raw frequencies of the individual subjectivity and argumentation markers per text type.
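The normalisation described for aggregate_totals_normalised.csv (raw frequency divided by the number of tokens per file, multiplied by 1000) can be sketched as follows. The function name `normalise_per_thousand` and the example counts are hypothetical and do not come from the repository's scripts or data.

```python
def normalise_per_thousand(raw_count: int, n_tokens: int) -> float:
    """Convert a raw marker count into a frequency per 1,000 tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be a positive integer")
    return raw_count / n_tokens * 1000


# Hypothetical file: 12 modals in a text of 480 tokens,
# i.e. 12 / 480 * 1000 markers per 1,000 tokens.
rate = normalise_per_thousand(12, 480)
print(round(rate, 2))
```

Normalising in this way makes the frequencies comparable across files of different lengths, for example between short reader comments and longer articles.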
This folder comprises the complete lists of subjectivity and argumentation markers described in the related publication.
- other_features: A folder containing the lists of the argumentation markers: adverbials, connectives and modals.
- socal_features: A folder with two subdirectories containing reduced feature lists of subjectivity markers from the Semantic Orientation CALculator (SO-CAL). Specifically, only subjectivity features with a valency of 4 and 5 are included.
  - socal_invariant: negative and positive adverbs.
  - socal_variant: negative and positive adjectives, nouns and verbs.
This folder contains the scripts for data analysis and the retrieval of the subjectivity markers.
- compinion.r: R commands for implementing and visualising the statistics, conditional inference trees and forests presented in the related publication. Only tested on Debian GNU/Linux, using R version 3.6.2.
- countFeat.py: A Python script for retrieving the subjectivity and argumentation markers (see Subjectivity).
- countFeat.md: README with instructions on how to run countFeat.py.
This folder contains all statistics described in the related publication and additional statistics.
- The confusion matrices of the training and test datasets for conditional inference forests with N = 500, 1000, 2000 trees, respectively. Confusion matrices are used to calculate model performance, i.e. prediction accuracy.
  - confMat_500.csv and confMatTest_500.csv
  - confMat_1000.csv and confMatTest_1000.csv
  - confMat_2000.csv and confMatTest_2000.csv
- correlations.csv: The Pearson correlation coefficients for correlations between all predictor variables described in the related publication, i.e. year, morphological complexity, syntactic complexity, overall complexity, subjective negative markers, subjective positive markers, modals, connectives, adverbials.
- tunegridTree.csv: A CSV file reporting the training and test accuracy for conditional inference trees grown with varying parameter settings. More precisely, the following three parameters were used in tuning the tree: mincriterion, minbucket and maxsurrogate (for a detailed description of the parameters, see https://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf).
- The rankings of the nine predictor variables according to the conditional permutation-importance measure, a measure indicating the importance of individual predictor variables, which was calculated for three differently sized conditional inference forests, i.e. forests with N = 500, 1000, 2000 trees, respectively.
  - varimp500.csv
  - varimp1000.csv
  - varimp2000.csv
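As noted for the confusion matrices in this folder, prediction accuracy is derived from a confusion matrix as the share of correctly classified cases, i.e. the sum of the diagonal divided by the sum of all cells. The sketch below illustrates this calculation only. The matrix values are invented, the function name `accuracy` is hypothetical, and the layout of the actual CSV files is not assumed here.

```python
def accuracy(conf_mat: list[list[int]]) -> float:
    """Return prediction accuracy for a square confusion matrix.

    Rows and columns are assumed to list the classes in the same order,
    so that the diagonal holds the correctly classified cases.
    """
    total = sum(sum(row) for row in conf_mat)
    if total == 0:
        raise ValueError("confusion matrix contains no observations")
    correct = sum(conf_mat[i][i] for i in range(len(conf_mat)))
    return correct / total


# Invented counts for three text types (comments, news, opinion).
example = [
    [50, 3, 2],
    [4, 45, 6],
    [1, 5, 44],
]

# 139 of the 160 cases lie on the diagonal.
print(accuracy(example))
```

Comparing the accuracy obtained from a training matrix with that from the corresponding test matrix gives a rough indication of how well a forest generalises to unseen texts.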