This paper studies how the linguistic components of blogposts collected from Sina Weibo, a Chinese microblogging platform, might affect the blogposts' likelihood of being censored. Our results are consistent with the Collective Action Potential (CAP) theory of King et al. (2013), which states that a blogpost's potential to cause riots or assemblies in real life is the key determinant of whether it is censored. Although there is no definitive measure of this construct, the linguistic features that we identify as discriminative are consistent with the CAP theory. We build a classifier that significantly outperforms non-expert humans in predicting whether a blogpost will be censored. The crowdsourcing results suggest that while humans tend to see censored blogposts as more controversial and more likely to trigger real-life action than their uncensored counterparts, they generally cannot make a better guess than our model when it comes to ‘reading the mind’ of the censors in deciding whether a blogpost should be censored. We do not claim that censorship is determined only by linguistic features; many other factors contribute to censorship decisions, and the present paper focuses on the linguistic form of blogposts. Our work suggests that it is possible to use linguistic properties of social media posts to automatically predict whether they will be censored.