Crowdsourcing a Wikipedia Vandalism Corpus

2 pages
  • Uploader: L**
  • Document ID: 136784523
  • Upload date: 2020-07-02
  • Document format: PDF
  • Document size: 52.52 KB
Crowdsourcing a Wikipedia Vandalism Corpus

Martin Potthast
Bauhaus-Universität Weimar
99421 Weimar, Germany
martin.potthast@uni-weimar.de

ABSTRACT

We report on the construction of the PAN Wikipedia vandalism corpus, PAN-WVC-10, using Amazon's Mechanical Turk. The corpus compiles 32452 edits on 28468 Wikipedia articles, among which 2391 vandalism edits have been identified. 753 human annotators cast a total of 193022 votes on the edits, so that each edit was reviewed by at least 3 annotators, whereas the achieved level of agreement was analyzed in order to label an edit as "regular" or "vandalism." The corpus is available free of charge.¹

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software - Performance Evaluation
General Terms: Experimentation
Keywords: Wikipedia, Vandalism Detection, Evaluation, Corpus

¹ Download the corpus from http://www.webis.de/research/corpora

Copyright is held by the author/owner(s). SIGIR'10, July 19-23, 2010, Geneva, Switzerland. ACM 978-1-60558-896-4/10/07.

1. INTRODUCTION

Wikipedia is an encyclopedia written by the crowd. The key to Wikipedia's success is a collaborative writing process, where everybody can edit every article. Ideally, the reader of an article also revises it to the best of her abilities, e.g. by correcting errors, by improving the writing style, by adding missing information, or by removing redundancy. In this way Wikipedia's articles get continuously improved and updated. This "freedom of editing" gave the lie to those who suggested that the resulting articles would be characterized by poor quality and instability. Wikipedia thrives. There is no free lunch, however, and Wikipedia faces problems that limit its growth, such as vandalism, edit wars, and lobbyism.

Our concern is the automatic detection of vandalism in Wikipedia, i.e., the detection of edits that were made with bad intentions. We contribute to this research field by developing a large corpus of human-annotated edits, which is a prerequisite for the meaningful evaluation of vandalism detection algorithms. In particular, we report on our efforts to use Amazon's Mechanical Turk as a possibility to drive the corpus size to the necessary order of magnitude without compromising the corpus quality.

Related Work. Although vandalism has been observed in Wikipedia right from the start, and, although vandalism is often deemed one of Wikipedia's biggest problems, research has addressed automatic vandalism detection only recently, for the first time in [3, 5, 7]. Vandalized articles often get restored rather quickly by other editors, but still, the authors of [6] find that the number of times vandalized articles get viewed amounts up to hundreds of millions, and that the probability of encountering vandalism grew exponentially between 2003 and 2006. In reaction to this development, the Wikipedia community has developed a number of rule-based robots that are capable of restoring the most obvious cases of vandalism, or that aid editors to do so [2]. However, the performance of the robots is surpassed, for instance, by an approach based on machine learning [5]. Other reactions include the temporary suspension of the freedom of editing for articles that are often vandalized, which threatens the very idea of Wikipedia.

The first vandalism corpus was the Webis-WVC-07, which consists of 940 human-annotated edits of which 301 are vandalism [4]. The PAN-WVC-10 is two orders of magnitude larger and has been annotated by many different people; it thus forms a more representative sample of vandalism and allows for better estimates of whether a vandalism retrieval model will actually work in practice. In this respect, the Mechanical Turk provides an exciting new way to scale up corpus construction, which has also been applied successfully, e.g., to recreate TREC assessments [1].

2. CORPUS DESIGN

Corpus Layout. An edit marks the transition from one article revision to another. On Wikipedia, each revision of every article is accessible by means of a permanent identifier, so that an edit is described uniquely by a pair of revision IDs referencing the old article revision and the new revision.² Basically, our corpus is a list of revision ID pairs along with labels whether or not the respective edit is vandalism. Moreover, for each edit meta information is given as well as the plain texts of both the old and the new article revision.

Corpus Acquisition. Our sample of edits is drawn from the revision histories of Wikipedia articles by means of probability proportional to size sampling, where in our case, the "size" of an article is the average number of times it gets edited in a given time frame. We hypothesize that the average edit ratio of an article correlates with the number of times it gets viewed. In that case, our edit sample resembles well the distribution of article importance at the time of sampling, which presumably also influences the articles chosen by v
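
The sampling idea described under Corpus Acquisition can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_articles_pps`, the example article titles and edit rates, and the choice to sample with replacement are all assumptions made for illustration. The only part taken from the text is the principle that an article's selection probability is proportional to its "size", here its average number of edits in a time frame.

```python
import random
from typing import Dict, List


def sample_articles_pps(edit_rates: Dict[str, float], k: int, seed: int = 0) -> List[str]:
    """Draw k article titles with replacement (an assumption), where each
    title's selection probability is proportional to its average edit rate."""
    if k < 0:
        raise ValueError("k must be non-negative")
    titles = list(edit_rates)
    weights = [edit_rates[title] for title in titles]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("edit rates must be non-negative with a positive sum")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.choices(titles, weights=weights, k=k)


# Hypothetical edit rates (average edits per time frame); not real data.
example_rates = {"Article A": 0.0, "Article B": 5.0, "Article C": 1.0}
sample = sample_articles_pps(example_rates, k=1000)
counts = {title: sample.count(title) for title in example_rates}
print(counts)
```

In this sketch an article that is never edited has weight zero and is never drawn, while frequently edited articles dominate the sample. To obtain edits rather than articles, one would then pick a revision ID pair from each sampled article's revision history; how that step was done is not stated in the text above.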
