A Semi-Supervised Learning Method and Its Application to Named Entity Recognition

30 pages
  • Seller [uploader]: ZJ****1
  • Document ID: 60422084
  • Upload date: 2018-11-16
  • Document format: PPT
  • Document size: 684 KB
    • 1、2018/11/16,一种半监督学习方法及其在命名实体识别中的应用 李彦鹏,Information Retrieval Laboratory, Dalian University of Technology,Outline,Biomedical named entity recognition Data sparseness Feature coupling generalization Experimental results Current & future work,Biomedical Named Entity Recognition (BioNER ),Recognize named entities in biomedical texts, e.g., genes, proteins, and cells, etc. An important preliminary step for advanced text mining tasks. For example:,The TCF-1 alpha binding site was also required for TCR

      2、alpha enhancer activity in transcriptionally active extracts from Jurkat but not HeLa cells, confirming that TCF-1 alpha is a T-cell-specific transcription factor,Challenges of BioNER,Huge vocabulary size. e.g., millions of gene/protein names. Long names, “ethanol repression autoregulation ( ERA ) / twelve-fold TA repeat ( TAB ) repressor element” Ambiguous definition of entity boundaries. The same term can refer to different types in different contexts.,Current state-of-the-art,Challenge evalua

      3、tions:,In these challenges,Dictionary look-up methods yield poor performances (50%-70% F-score): Low coverage, e.g., long names, variants Large noise, e.g., common English terms, entities of other types. Machine learning methods show great success. The framework like “Lexical features + regularized linear model” is applied by all the top-performing systems.,Lexical-level features,IL 2 gene,0 1 0 1 1 0 1 1,W=IL, W=2, bigram=IL 2, norm=ILgene, suffix = *ene,Data sparseness in lexical features,Larg

      4、e out-of-vocabulary (OOV) rate Terms not in the training corpus are not modeled well Extreme low frequency terms can not provide sufficient information to train a good classifier. Regular expression features can alleviate the problem, but far not enough: Surface information is not always indicative Indicative patterns also lead to sparseness,Overcome data sparseness,Taxonomy based methods Word net, UMLS Depend on their qualities Subspace based methods LSA, KPCA, sparse coding Automatic methods H

      5、uge space and time cost Corpus-based methods PMI-IR, Web search, ESA Good scalability and easy to implement,Web-based methods for NER,Finkel et al. (2005) used co-occurrence of a named entity and indicative context to validate the gene name 0.17 F-score improvement. Etzioni et al. (2005) used PMI of the current entity and discriminator phrases as the input of a nave Bayes classifier Disadvantages: Not general enough to be extended to a broader area of NLP and Machine leaning. No systematic compa

      6、rison with elaborately designed lexical features. Large room for further improvement,Our method,We analyze the nature of these methods and give a general framework. Generate new features from the relatedness of two special components. We try to find answers to: What are these two components? How to find them? How to convert the relatedness measures into new features?,Key definitions,Example-distinguishing features (EDFs) Features with high ability to distinguish the current examples from others.

      7、 E.g., “bigram=IL 2”. Tends to be sparse, and sometimes lead to data sparseness. EDF roots: the “higher-level” concepts. E.g., “bigram” Class-distinguishing features (CDFs) Strong indicative to the target classes. E.g., patterns: “X gene”, “X proteins” Tends to be dense, and reflect the characteristic of a large number of examples. Feature coupling degree (FCD) The relatedness measure of an EDF-CDF pair in universal data,The algorithm,An example of FCG,Classify the name candidate prnp gene EDF:

      8、leftmost 1-gram = prnp CDF: expression of X EDF root: leftmost 1-gram FCD type: PMI The FCD feature is indexed by leftmost 1-gramexpression of X PMI Feature value: FCD(U, leftmost 1-gram = prnp, expression of X) = PMI (leftmost 1-gram = prnp, expression of X),An example of FCG (2),The entity classification task,Construct a gene dictionary by: Combine two recourses: BioThesaurus 2.0 and ABGene lexicon Generate variants by simple rules. The tasks is to determine whether a dictionary entry is a gen

      9、e name in most cases. The training set is derived from BioCreative 2 GM task. 12567 positive and 36862 negative examples,FCG for gene entity classification,EDFs EDF I: normalized names EDF II: boundary n-grams CDFs: CDF I: indicative context patterns CDF II: outputs of a local context predictor FCD measures:,Model selection,The density of FCD features is around 25%. Interestingly we found that this feature space was somewhat like that in the task of image recognition. Inspired by the prevailing techniques in such tasks, we first used singular value decomposition (SVD) to get a subspace of the original features and then used a SVM with a radial basis function (RBF) kernel to classify the examples.,Lexical features - the baseline,Bag-of-n-grams (n = 1, 2, 3) Boundary n-grams (n = 1, 2, 3) left-2-gram = IL 2 for IL 2 gene Sliding windows: character-level sliding windows with the size of 5. Bounda
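
As a rough illustration of the FCG example on the slides (candidate "prnp gene", EDF leftmost 1-gram = prnp, CDF "expression of X", FCD type PMI), the sketch below computes one FCD feature from a small unlabeled corpus U. The slides do not say how co-occurrence is counted in U, so several things here are assumptions made only for illustration: counting per token position, returning 0.0 when the pair never co-occurs, the toy corpus, and the helper names `leftmost_1gram`, `cdf_fires`, and `fcd_pmi`.

```python
import math


def leftmost_1gram(candidate):
    """EDF root 'leftmost 1-gram': the first token of the name candidate."""
    return candidate.lower().split()[0]


def cdf_fires(tokens, i, pattern):
    """True if a context pattern such as 'expression of X' matches with X at position i."""
    parts = pattern.split()
    slot = parts.index("X")                       # position of the X placeholder
    left = [w.lower() for w in parts[:slot]]      # words required before X
    right = [w.lower() for w in parts[slot + 1:]] # words required after X
    start = i - len(left)
    return (start >= 0
            and tokens[start:i] == left
            and tokens[i + 1:i + 1 + len(right)] == right)


def fcd_pmi(corpus, edf_token, pattern):
    """PMI between the EDF (token == edf_token) and the CDF (pattern fires),
    estimated over token positions of the unlabeled corpus (an assumed scheme)."""
    n = n_edf = n_cdf = n_joint = 0
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            is_edf = tok == edf_token
            is_cdf = cdf_fires(tokens, i, pattern)
            n += 1
            n_edf += is_edf
            n_cdf += is_cdf
            n_joint += is_edf and is_cdf
    if n_joint == 0:
        return 0.0  # assumed fallback when the pair never co-occurs
    return math.log(n_joint * n / (n_edf * n_cdf))


# Toy unlabeled corpus U (invented for this example).
corpus = [
    "expression of prnp was reduced in mice",
    "the expression of prnp is high in neurons",
    "expression of the gene was measured",
    "prnp knockout mice were viable",
]

candidate = "prnp gene"
edf_value = leftmost_1gram(candidate)
# The feature is indexed by (EDF root, CDF, FCD type), not by the sparse EDF value itself.
feature = {("leftmost 1-gram", "expression of X", "PMI"):
           fcd_pmi(corpus, edf_value, "expression of X")}
print(feature)
```

The point of the indexing is that the sparse value "prnp" never becomes a feature dimension on its own: every candidate shares the same dense feature (leftmost 1-gram, expression of X, PMI), and only its real-valued strength differs from one candidate to the next.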
