
MEGA operating procedure: multiple sequence alignment and phylogenetic trees (.ppt)
Basic Bioinformatics and Applications
Wang Xingping

Contents
- Multiple sequence alignment
- Molecular evolution analysis: phylogenetic tree construction
- Prediction and identification of nucleic acid sequences
- Restriction map construction
- Primer design

Multiple sequence alignment

Contents:
- Multiple sequence alignment
- Multiple sequence alignment programs and their application

Section 1. Multiple sequence alignment

- Concept
- Significance of multiple sequence alignment
- Scoring functions for multiple sequence alignment
- Methods of multiple sequence alignment

1. Concept

- Multiple sequence alignment: aligning multiple related sequences to achieve optimal matching of the sequences.
- For ease of description, the alignment process can be defined as follows: treat the multiple sequence alignment as a two-dimensional table in which each row represents one sequence and each column represents one residue position.
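  Sequences are entered into the table according to the following rules:
  (a) the relative order of all residues within a sequence is preserved;
  (b) identical or similar residues from different sequences are placed in the same column, that is, they are aligned vertically as far as possible (Table 1).

Table 1. Definition of a multiple sequence alignment

         1  2  3  4  5  6  7  8  9  10
  I      Y  D  G  G  A  V  -  E  A  L
  II     Y  D  G  G  -  -  -  E  A  L
  III    F  E  G  G  I  L  V  E  A  L
  IV     F  D  -  G  I  L  V  Q  A  V
  V      Y  E  G  G  A  V  V  Q  A  L

The table shows the alignment of five short sequences (I-V). Gaps are inserted so that most identical or similar residues of the five sequences fall into the same column, while the residue order of each sequence is preserved.

2. Significance of multiple sequence alignment

- It describes the similarity relationships among a set of sequences, which helps reveal the basic features of a molecular family and locate motifs, conserved regions, and so on.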
- It describes how closely a set of homologous sequences are related, and is therefore applied in molecular evolution analysis.
  - Sequence homology analysis: the sequence under study is added to a set of homologous sequences from different species and all are aligned simultaneously, to assess how closely it is related to the other sequences.
- Other applications, such as building profiles and scoring matrices.

3. Methods of multiple sequence alignment

- Manual alignment
  - After running a tested, reliable computer program (with editing tools such as BioEdit, SeaView or GeneDoc), it is often necessary to refine the alignment by hand in the light of experimental results or the literature.
  - To support interactive manual alignment, residues with different properties are usually shown in different colours, which helps in judging the similarity between sequences.
- Automatic alignment by computer program
  - A specific algorithm (exhaustive, heuristic, and so on) lets the program search automatically for the best multiple alignment.

Exhaustive method

- Exhaustive alignment method
  - The two-dimensional dynamic programming matrix used for pairwise alignment is extended to a multidimensional matrix whose number of dimensions equals the number of sequences. The computation is very expensive and demanding on system resources, so the method is generally used only for a small number of short sequences.
  - DCA (Divide-and-Conquer Alignment): a web-based program that is semi-exhaustive. http://bibiserv.techfak.uni-bielefeld.de/dca/

Heuristic algorithms

- Most practical multiple sequence alignment programs use heuristic algorithms to reduce the computational complexity.
  - The complexity grows with the number of sequences. Aligning n sequences exhaustively costs O(m1 m2 m3 ... mn), where mi is the length of the i-th sequence. If the lengths are similar this simplifies to O(m^n), where n is the number of sequences and m the sequence length, so the complexity grows exponentially with the number of sequences. For example, five sequences of length 100 already imply on the order of 100^5 = 10^10 matrix cells.

Section 2. Multiple sequence alignment programs and their application

- Progressive Alignment Method
- Iterative Alignment
- Block-Based Alignment
- DNASTAR
- DNAMAN

1. Progressive Alignment Method

- Clustal:
  - Clustal implements the progressive alignment approach proposed by Feng and Doolittle in 1987.
  - The Clustal program exists in many versions.
    - ClustalW (Thompson et al., 1994) is currently the most widely used multiple sequence alignment program.
    - ClustalX is its graphical-interface version.
  - As part of the program, Clustal can output data for building phylogenetic trees.
- The ClustalW program: ClustalW is free to use.
  - The software package can be downloaded from the NCBI/EBI FTP servers. ClustalW guides the user step by step through menus; the user can choose the scoring matrix, set gap penalties, and so on.
    - ftp://ftp.ebi.ac.uk/pub/software/
  - The EBI home page also offers a web-based ClustalW service: users submit their sequences and options through a form, and the server returns the result by e-mail (or interactively).
    - http://www.ebi.ac.uk/clustalw/
- The ClustalW program
  - ClustalW is flexible about input format: FASTA, PIR, SWISS-PROT, GDE, Clustal, GCG/MSF and RSF are all accepted.
  - The output format can also be chosen (ALN, GCG, PHYLIP, GDE and others) to suit the user's needs.
  - In a ClustalW result all sequences are stacked together, and symbols mark the conservation of the residues at each site: "*" marks sites of very high conservation (columns in which all residues are identical), and "." marks sites of somewhat lower conservation.
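The "*" marking described above can be illustrated with the five sequences of Table 1. The following is only a minimal sketch: the function name `conservation_line` is hypothetical, and it marks fully identical, gap-free columns only; it does not reproduce ClustalW's residue-group rules for the weaker "." symbol.

```python
# Rows of Table 1 (sequences I-V), gaps written as "-".
TABLE_1 = [
    "YDGGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]


def conservation_line(rows):
    """Return a string with '*' under every column whose residues are all identical."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        # Mark a column only if it contains no gap and a single distinct residue.
        if "-" not in column and len(set(column)) == 1:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


def ungapped(row):
    """Recover the original sequence; rule (a) means gap removal must not reorder residues."""
    return row.replace("-", "")


for row in TABLE_1:
    print(row)
print(conservation_line(TABLE_1))
```

For Table 1 only columns 4 (G) and 9 (A) are identical in all five sequences, so only those two columns receive a "*".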
- Using ClustalW
  - Go to http://www.ebi.ac.uk/clustalw/
  - Set the options (next slide).
- Using ClustalW: notes on some options
  - PHYLOGENETIC TREE has three options.
    - TREE TYPE: the algorithm used to build the phylogenetic tree; four choices: none, nj (neighbour joining), phylip, dist.
    - CORRECT DIST: whether to apply a distance correction. For small sequence divergence (<10%) the choice makes no difference; for large divergence a correction is needed, because the observed distance is lower than the true evolutionary distance.
    - IGNORE GAPS: when set to on, any position containing a gap is ignored.
  - For details see http://www.ebi.ac.uk/clustalw/clustalw_frame.html
- Using ClustalW
  - Enter five 16S rRNA gene sequences:
    - AF310602
    - AF308147
    - AF283499
    - AF012090
    - AF447394
  - Click "RUN".
- T-Coffee (Tree-based Consistency Objective Function for alignment Evaluation):
  - Progressive alignment method. www.ch.embnet.org/software/TCoffee.html
  - In processing a query, T-Coffee performs both global and local pairwise alignment for all possible pairs involved.
    - A distance matrix is built to derive a guide tree, which is then used to direct a full multiple alignment using the progressive approach.
  - Outperforms Clustal when aligning moderately divergent sequences.
  - Slower than Clustal.
- PRALINE:
  - Web-based: http://ibivu.cs.vu.nl/programs/pralinewww/
  - First builds a profile for each sequence using PSI-BLAST database searching.
  - Each profile is then used for multiple alignment using the progressive approach.
    - The closest neighbor to be joined to a larger alignment is chosen by comparing the profile scores.
    - It does not use a guide tree.
    - It incorporates protein secondary structure information to modify the profile scores.
  - Perhaps the most sophisticated and accurate alignment program available.
  - Extremely slow computation.
- DbClustal: http://igbmc.u-strasbg.fr:8080/DbClustal/dbclustal.html
- Poa (Partial order alignments): http://www.bioinformatics.ucla.edu/poa/

2. Iterative Alignment

- PRRN:
  - Web-based program. http://prrn.ims.u-tokyo.ac.jp/
  - Uses a double nested iterative strategy for multiple alignment.
  - Based on the idea that an optimal solution can be found by repeatedly modifying existing suboptimal solutions.

Block-Based Alignment

- DIALIGN2:
  - A web-based program. http://bioweb.pasteur.fr/seqanal/interfaces/dialign2.html
  - It places emphasis on block-to-block comparison rather than residue-to-residue comparison. The sequence regions between the blocks are left unaligned.
  - The program has been shown to be especially suitable for aligning divergent sequences with only local similarity.
- Match-Box:
  - Web-based server. http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml
  - Aims to identify conserved blocks (or boxes) among sequences.
  - The server requires the user to submit a set of sequences in the FASTA format, and the results are returned by e-mail.

DNASTAR and DNAMAN software

Molecular evolution analysis: phylogenetic tree construction

Contents of this chapter:
- Introduction to molecular evolution analysis
- Methods of phylogenetic tree construction
- Examples of phylogenetic tree construction

Section 1. Introduction to molecular evolution analysis

- Basic concepts:
  - Phylogeny: the history of the formation or evolution of organisms.
  - Phylogenetics: the study of the evolutionary relationships among species.
  - Phylogenetic tree: a representation that describes the evolutionary relationships among species.
- Purpose of molecular evolution research
  - Starting from molecular properties of species, to understand the phylogenetic relationships among them.
    - The data are protein and nucleic acid sequences.
    - Comparing sequence homology sheds light on the evolution of genes and on the underlying principles of phylogeny.
- Basis of molecular evolution research
  - Basic theory: across different lineages and over sufficiently long evolutionary time scales, the evolutionary rate of many sequences is nearly constant (the molecular clock theory, 1965).
  - In practice: although it is often still disputed, molecular evolution does explain some of the underlying principles of phylogeny.
- Orthologs and paralogs
  - Orthologs:
    - Homologous sequences in different species that arose from a common ancestral gene during speciation; they may or may not be responsible for a similar function.
  - Paralogs:
    - Homologous sequences within a single species that arose by gene duplication.
  - These two concepts correspond to two different evolutionary events. Sequences used in molecular evolution analysis must be orthologous if the result is to reflect the true evolutionary process.
- Phylogenetic tree:
  - Also called an evolutionary tree. Its study has grown into an interdisciplinary field.
    - It draws on the life sciences (evolutionary theory, genetics, taxonomy, molecular biology, biochemistry, biophysics and ecology) and on mathematics (probability and statistics, graph theory, computer science and group theory).
    - In 1987 the internationally known Cold Spring Harbor Symposium on Quantitative Biology devoted a special session to evolutionary trees, marking the field as one of the frontiers of modern biology; it remains very active.
- Structure of a phylogenetic tree
  - The lines in the tree are called branches.
  - At the tips of the branches are present-day species or sequences, known as taxa (the singular form is taxon) or operational taxonomic units (OTUs).
  - The connecting point where two adjacent branches join is called a node, which represents an inferred ancestor of extant taxa.
  - The bifurcating point at the very bottom of the tree is the root node, which represents the common ancestor of all members of the tree.
  - A group of taxa descended from a single common ancestor is defined as a clade or monophyletic group.
  - The branching pattern in a tree is called the tree topology.
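The terms above (tips, nodes, root, clades, topology) can be made concrete with a small data structure. This is a minimal sketch under assumed conventions: the tree is written as nested Python tuples, the taxon names A-E are hypothetical, and branch lengths are ignored, so only the topology is represented.

```python
# A rooted tree as nested tuples: strings are tips (taxa), tuples are internal nodes.
# The outermost tuple is the root node.
TREE = ((("A", "B"), "C"), ("D", "E"))


def leaves(node):
    """Return the set of taxa found at the tips below a node."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def clades(node):
    """Return the taxon set of every internal node (each one is a monophyletic group)."""
    if isinstance(node, str):
        return []
    groups = [leaves(node)]
    for child in node:
        groups.extend(clades(child))
    return groups


for group in clades(TREE):
    print(sorted(group))
```

In this toy tree {A, B} and {D, E} are clades, whereas {A, D} is not, because no single node has exactly those two taxa below it.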
- Rooted and unrooted trees
  - The root represents the common ancestor of a group of taxa.
  - How to determine the root
    - By outgroup: one way is to use an outgroup, which is a sequence that is homologous to the sequences under consideration but separated from those sequences at an early evolutionary time.
    - By midpoint: in the absence of a good outgroup, a tree can be rooted using the midpoint rooting approach, in which the midpoint of the two most divergent groups, judged by overall branch lengths, is assigned as the root.

[Figure: tree rooted by an outgroup. Labels: bacteria (outgroup), root, eukaryotes, archaea, monophyletic groups.]

- Tree forms
  - Phylograms: show both the branching pattern and branch lengths.
  - Cladograms: show only the branching pattern, without branch lengths.

Section 2. Methods of phylogenetic tree construction

- Molecular phylogenetic tree construction can be divided into five steps:
  - (1) choosing molecular markers;
  - (2) performing multiple sequence alignment;
  - (3) choosing a model of evolution;
  - (4) determining a tree building method;
  - (5) assessing tree reliability.
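Steps (2) to (4) above can be sketched for the distance-based case. The source does not name a particular evolutionary model; the Jukes-Cantor correction used here is an assumption chosen for illustration, and the two aligned sequences are hypothetical toy data. The observed proportion of differing sites (p-distance) underestimates the true distance when sequences are divergent, which is what the correction compensates for.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns where either sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance d = -3/4 * ln(1 - 4p/3), defined for p < 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned DNA fragments differing at one of ten sites.
p = p_distance("ACGTACGTAC", "ACGTACGTTC")
print(p, jukes_cantor(p))
```

A matrix of such corrected distances over all sequence pairs is the usual input to a distance-based tree building method such as neighbour joining.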
Section 3. Examples of phylogenetic tree construction

- Commonly used software for phylogenetic analysis
  - (1) PHYLIP
  - (2) PAUP
  - (3) TREE-PUZZLE
  - (4) MEGA
  - (5) PAML
  - (6) TreeView
  - (7) VOSTORG
  - (8) Fitch programs
  - (9) Phylo_win
  - (10) ARB
  - (11) DAMBE
  - (12) PAL
  - (13) Bionumerics
  For other programs see: http://evolution.genetics.washington.edu/phylip/software.html
- MEGA 3
- Download address:

- Discrete character data: the data obtained take two or more discrete values.
  For example:
  - whether a given position in a DNA sequence is or is not a splice site (a binary character);
  - a given position in a sequence may hold any of the four bases A, T, G, C (a multistate character).
- Similarity and distance data: the relationships among taxa are expressed through their pairwise similarities or distances.

Prediction and identification of nucleic acid sequences

Contents:
- Statistical models of sequence probability information
- Prediction and identification of nucleic acid sequences

Section 1. Statistical models of sequence probability information

- One of the applications of multiple sequence alignments in identifying related sequences in databases is the construction of statistical models:
  - Position-specific scoring matrices (PSSMs)
  - Profiles
  - Hidden Markov models (HMMs)

The process of recognizing "functional" and "non-functional" sequences:
- Collect known examples of functional and non-functional sequences (the sequences should be unrelated to one another).
- Training set: build a model for the recognition task, and train the prediction model so that after learning it can process and discriminate sequences correctly.
- Test (control) set: check the correctness of the model by classifying sequences as "functional" or "non-functional", and compute the recognition accuracy from the outcome.

[Figure: Hidden Markov Model workflow. Steps shown: selection of related sequences, multiple sequence alignment, model construction, model training, parameter adjustment, model establishment, application. Labels shown: ClustalX, Hmmbuild, Hmmt, Hmmcalibrate, Profile HMM.]

Hidden Markov Model

- Application
  - HMMs have more predictive power than profiles.
    - An HMM is able to differentiate between insertion and deletion states.
      - In profile calculation, a single gap penalty score, often subjectively determined, represents either an insertion or a deletion.
- Application
  - Once an HMM is established based on the training sequences:
    - it can be used to determine how well an unknown sequence matches the model;
    - it can be used for the construction of a multiple alignment of related sequences;
    - HMMs can be used for database searching to detect distant sequence homologs.
    - HMMs are also used in:
      - protein family classification through motif and pattern identification;
      - advanced gene and promoter prediction;
      - transmembrane protein prediction;
      - protein fold recognition.

Section 2. Prediction and identification of nucleic acid sequences

- Contents of this section
  - Concept of nucleic acid sequence prediction
  - Gene prediction
  - Promoter and regulatory element prediction
  - Restriction site analysis and primer design

1. Concept of nucleic acid sequence prediction

- The use of computational methods (computer programs) to discover the location and structure of genes and of their expression-regulating elements in genomic sequences.
  It includes:
  - Gene prediction
  - Promoter and regulatory element prediction

[Figure: Structure of eukaryotic genes.]

[Figure: a stretch of genomic DNA sequence annotated with gene 1, gene 2, gene 3, exon, intron and intergenic region.]

2. Gene prediction

- Concept and significance of gene prediction
- Prokaryotic gene identification
- Difficulty of eukaryotic gene prediction
- Basis of eukaryotic gene prediction
- Basic steps and strategies of eukaryotic gene prediction
- Methods of eukaryotic gene prediction and their basic principles

2.1 Concept and significance of gene prediction

- Concept:
  - Gene prediction: given an uncharacterized DNA sequence, find out:
    - Where does the gene start and end? This means detecting the location of open reading frames (ORFs).
    - Which regions code for a protein? This means delineating the structures of introns as well as exons (eukaryotic).
- Significance:
  - Computational gene finding (gene prediction) is one of the most challenging and interesting problems in bioinformatics at the moment.
  - Computational gene finding is important because
    - so many genomes are being sequenced so rapidly;
    - purely biological means are time consuming and costly.
  - Finding genes in DNA sequences is the foundation for all further investigation (knowledge of the protein-coding regions underpins functional genomics).

2.2 Prokaryotic gene identification

- The main task in prokaryotic gene identification is to recognize open reading frames, that is, long coding regions.
  - An open reading frame (ORF) is a run of codons that contains no stop codon.
- Tools for prokaryotic gene prediction
  - ORF Finder
  - HMM-based gene finding programs
    - GeneMark
    - Glimmer
    - FGENESB
    - RBSfinder
- ORF Finder (Open Reading Frame Finder)
  - http://www.ncbi.nlm.nih.gov/gorf/gorf.html
  - Example shown: zinc-binding alcohol dehydrogenase, novicida (Francisella).
- HMM-based gene finding programs
  - GeneMark:
    - Trained on a number of complete microbial genomes. http://opal.biology.gatech.edu/GeneMark/
  - Glimmer (Gene Locator and Interpolated Markov Modeler):
    - A UNIX program. www.tigr.org/softlab/glimmer/glimmer.html
  - FGENESB:
    - Web-based program.
    - Trained for bacterial sequences. =gfindb
  - RBSfinder:
    - UNIX program.
    - Predicts start sites. ftp://ftp.tigr.org/pub/software/RBSfinder/
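The ORF definition used in this section (a run of codons without a stop codon) can be sketched as a simple scan. This is an illustrative sketch, not the algorithm of NCBI ORF Finder or of any program listed above: it assumes ORFs begin at ATG, searches the three forward frames only, and the names `find_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the index of the A of ATG, end is one past the stop codon,
    and min_codons counts the codons before the stop (ATG included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs


print(find_orfs("CCATGAAACCCGGGTAGCC"))
```

A complete scan would also examine the three frames of the reverse complement, and real prokaryotic gene finders use a much larger minimum length plus statistical models of coding sequence.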
2.3 Difficulty of eukaryotic gene prediction

Why is gene prediction challenging?

[Figure: coding density compared across human, Fugu, worm and E. coli.]

- Coding density: as the coding/non-coding length ratio decreases, exon prediction becomes more complex.
  - Some facts about the human genome:
    - Coding regions comprise less than 3% of the genome.
    - There is a gene of 2,400,000 bp of which only 14,000 bp are CDS (< 1%).
- Splicing of genes: finding multiple (short) exons is harder than finding a single (long) exon.
  - Some facts about the human genome:
    - Average of 5-6 exons per gene
    - Average exon length: ~200 bp
    - Average intron length: ~2000 bp
    - ~8% of genes have a single exon
    - Some exons can be as small as 3 bp.
  - Alternative splicing is very difficult to predict.

2.4 Basis of eukaryotic gene prediction

- Functional sites
  - Splicing site signals
    - Donor and acceptor splice sites: the splice junctions of introns and exons follow the GT-AG rule, in which an intron
      - at the 5' splice junction has a consensus motif of GTAAGT (donor);
      - at the 3' splice junction has a consensus motif of (Py)12NCAG (acceptor).

Nucleotide distribution probabilities around donor sites

  Position   p(A)      p(C)      p(G)      p(T)
  -3         0.333     0.353     0.193     0.12
  -2         0.581     0.144     0.132     0.143
  -1         0.0969    0.0355    0.779     0.0883
   0         0.00048   0.00048   0.999     0.00048
   1         0.00048   0.00048   0.00048   0.999
   2         0.493     0.0278    0.455     0.0235
   3         0.723     0.0753    0.118     0.0835
   4         0.0595    0.0513    0.841     0.048
   5         0.151     0.167     0.21      0.472

Nucleotide distribution probabilities around non-donor sites

  Position   p(A)      p(C)      p(G)      p(T)
  -3         0.262     0.231     0.236     0.272
  -2         0.262     0.231     0.235     0.272
  -1         0.262     0.231     0.236     0.272
   0         0.262     0.231     0.235     0.272
   1         0.262     0.231     0.236     0.272
   2         0.262     0.231     0.235     0.272
   3         0.262     0.231     0.236     0.272
   4         0.262     0.231     0.235     0.272
   5         0.262     0.231     0.236     0.272

[Figure: Nucleotide distribution around splicing sites.]

- Functional sites
  - Translation initiation site signal
    - Translation start codon:
      - Most vertebrate genes use ATG as the translation start codon and
      - have a uniquely conserved flanking sequence called a Kozak sequence (CCGCCATGG).
  - Translation termination site signal
    - Translation stop codon: TGA (one of the three stop codons TAA, TAG and TGA).
- Functional sites
  - Transcription start signals
    - CpG island: used to identify the transcription initiation site of a eukaryotic gene.
      - Most of these genes have a high density of CG dinucleotides near the transcription start site. This region is referred to as a CpG island.
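The "high density of CG dinucleotides" that defines a CpG island is usually quantified as an observed/expected ratio. The sketch below is illustrative: the function names are hypothetical, and the default thresholds (length of at least 200 bp, GC content above 0.5, observed/expected above 0.6) are the commonly quoted Gardiner-Garden and Frommer criteria, which the source itself does not state.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # occurrences of the CpG dinucleotide
    gc_content = (c + g) / n
    # Expected CpG count under independence is c * g / n.
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC-content and observed/expected thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


print(cpg_stats("CGCGCGCGAT"))
print(looks_like_cpg_island("CGCGCGCGAT" * 20))
```

In practice the test is applied to a window that slides along the genomic sequence, and adjacent qualifying windows are merged into one island.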
[Table: dinucleotide frequencies in the yeast genome.]

- The CpG dinucleotide occurs at only about 20% of the frequency expected by chance.
- In the promoter regions of eukaryotic genes, however, the CpG density reaches the level expected at random.
- Such regions are a few hundred bp long.
- The human genome contains about 45,000 CpG islands; about half are associated with housekeeping genes, and the rest with the promoters of tissue-specific genes.

- Functional sites
  - Transcription stop signals
    - The poly-A signal can also help locate the final coding sequence.
- Compositional features of coding and non-coding regions
  - Codon usage preference
  - Exon length
  - Isochores
- Compositional features of coding and non-coding regions
  - Codon usage preference
    - Statistical results show that some codons are used with different frequencies in coding and non-coding regions, e.g.:
      - hexamer frequencies
      - codon usage frequency

    [Figure: codon frequencies for coding regions and for non-coding regions.]

    - Hexamer (di-codon usage) frequencies: comparing the frequencies of hexamers (six consecutive nucleotides) is the best single indicator of whether a window belongs to a coding or a non-coding region.
    - Codon usage frequency
      - Because of the degeneracy of the genetic code, each amino acid is encoded by at least one codon and by at most six.
      - Within genes, synonymous codons are not used with equal frequency.
      - Codon usage differs greatly between species and between organisms.
      - Across species, genes of the same type show similar synonymous codon usage bias.
        - For genes of the same type, the differences in synonymous codon usage bias caused by species are small.

    [Figure: Codon usage frequency for coding regions.]

  - Exon length

    [Figure: Length distribution of internal exons of human genes.]

  - Isochores
    - Definition:
      - Long regions of homogeneous base composition
        - longer than 1,000,000 bp;
        - the GC content is relatively uniform within an isochore but differs markedly between isochores.
    - The human genome is divided into five classes of isochores:
      - L1: GC 39%
      - L2: GC 42%
        - L1 and L2 contain 80% of the tissue-specific genes.
      - H1: GC 46%
      - H2: GC 49%
      - H3: GC 54%; contains 80% of the housekeeping genes.

[Figure: The dependence of codon usage score on GC content.]

2.5 Steps and strategies of eukaryotic gene prediction

- The main issue in prediction of eukaryotic genes is the identification of exons, introns, and splicing sites.
Steps and strategies of eukaryotic gene prediction
- Basic steps
  - Identify vector contamination in the sequence
  - Mask repetitive sequences
  - Find genes
  - Evaluate the results
- Contamination and repetitive elements must be removed from the sequence first.
  - Sources of sequence contamination:
    - vectors
    - adapters and PCR primers
    - transposons and insertion sequences
    - impure DNA/RNA samples
  - Repetitive elements:
    - interspersed repeats, satellite DNA, simple sequence repeats, low-complexity sequences, etc.
- Gene-finding strategies
  - Current gene prediction methods can be classified into two major categories:
    - Ab initio (statistically based) approaches predict genes from the given sequence alone.
    - Homology-based (sequence alignment based) approaches make predictions from significant matches between the query sequence and sequences of known genes.
- Choosing a gene-finding strategy
  [Figure: choosing a gene-finding strategy]

2.6 Methods of eukaryotic gene prediction and their principles
- Methods for detecting vector contamination
- Repeat analysis programs
- Gene prediction programs (eukaryotic)

Detecting vector contamination
- Methods
  - Similarity search against vector databases
  - Search for restriction sites in the sequence
- Tools
  - VecScreen: NCBI
  - Blast2 EVEC: EMBL, www.ebi.ac.uk/blastall/vectors.html

Masking repetitive sequences
- Repeat analysis programs
  - RepeatMasker: for primates, rodents, Arabidopsis, grasses, and Drosophila; ftp.genome.washington.edu/cgi-bin/RepeatMasker
  - XBLAST: applicable to any species; bioweb.pasteur.fr/seqanal/interfaces/xblast.html#-data/

Gene prediction programs (eukaryotic)
- Ab initio-based programs
- Homology-based programs
- Consensus-based programs
- Performance evaluation

Ab initio-based programs
- The goal of ab initio gene prediction programs is to discriminate exons from noncoding sequences and then join the exons together in the correct order.
- The algorithms rely on two features:
  - gene signals
  - gene content
    - To derive an assessment for this feature, HMMs or neural network-based algorithms can be used.
- The frequently used ab initio programs are described next.

GENSCAN
- Web based: http://genes.mit.edu/GENSCAN.html
- Makes predictions based on fifth-order HMMs.
  - It combines hexamer frequencies with coding signals (initiation codons, TATA box, cap site, poly-A, etc.) in prediction.
- Putative exons are assigned a probability score (P) of being a true exon. Only predictions with P > 0.5 are deemed reliable.
- The program is trained for sequences from vertebrates, Arabidopsis, and maize.
- It has been used extensively in annotating the human genome.

GRAIL (Gene Recognition and Assembly Internet Link)
- Web-based program: http://compbio.ornl.gov/public/tools/
- Based on a neural network algorithm.
  - The program is trained on several statistical features such as splice junctions, start and stop codons, poly-A sites, promoters, and CpG islands.
- The program scans the query sequence with windows of variable lengths, scores them for coding potential, and finally outputs a list of exon candidates.
- GRAIL is currently trained for human, mouse, Arabidopsis, Drosophila, and Escherichia coli sequences.

FGENES (FindGenes)
- Web-based program.
- Uses LDA to determine whether a signal is an exon.
- In addition to FGENES, there are many variants of the program:
  - FGENESH: makes use of HMMs.
  - FGENESH C: similarity based.
  - FGENESH+: combines both ab initio and similarity-based approaches.

MZEF (Michael Zhang's Exon Finder)
- Web based: http://argon.cshl.org/genefinder/
- Uses QDA for exon prediction.
- A performance gain from the more complex function has not been obvious in actual gene prediction.

HMMgene
- Web based: www.cbs.dtu.dk/services/HMMgene
- HMM-based program.
- The unique feature of the program is that it uses a criterion called conditional maximum likelihood to discriminate coding from noncoding features.
- In HMMgene, if a sequence already has a subregion identified as coding, for example on the basis of similarity with cDNAs or proteins in a database, that region is locked as coding.
- An HMM prediction is then made with a bias toward the locked region and is extended from it to predict the rest of the gene's coding regions and even neighboring genes.
- HMMgene is in a way a hybrid algorithm that uses both ab initio-based and homology-based criteria.

Homology-based programs
- Homology-based programs rely on the fact that exon structures and exon sequences of related species are highly conserved.
- When potential coding frames in a query sequence are translated and aligned with the closest protein homologs found in databases, near perfectly matched regions can be used to reveal the exon boundaries in the query.
- The homology-based approach assumes that the database sequences are correct.
- This is a reasonable assumption because many of the homologous sequences used for comparison are derived from cDNA or expressed sequence tags (ESTs) of the same species.

Homology-based programs: strengths and weaknesses
- Advantage: with the support of experimental evidence, this method becomes rather efficient at finding genes in unknown genomic DNA.
- Drawback: the approach relies on the presence of homologs in databases. If no homologs are available, the method cannot be used; novel genes in a new species cannot be discovered without matches in the database.

GenomeScan
- Web-based server: http://genes.mit.edu/genomescan.html
- Combines GENSCAN prediction results with BLASTX similarity searches.
  - The user provides genomic DNA and protein sequences from related species.
  - The genomic DNA is translated in all six frames to cover all possible exons.
- In GenomeScan, the translated exons are then compared with the user-supplied protein sequences. Translated genomic regions with high similarity at the protein level receive higher scores.
- The same sequence is also analyzed with the GENSCAN algorithm, which gives the exons probability scores.
- Final exons are assigned based on the combined score information from both analyses.

EST2Genome
- Web-based program: http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html
- Used to define intron-exon boundaries.
- Purely based on the sequence alignment approach.
  - The program compares an EST (or cDNA) sequence with a genomic DNA sequence containing the corresponding gene.
- The alignment is done using a dynamic programming-based algorithm.

TwinScan
- http://genes.cs.wustl.edu/
- A similarity-based gene-finding server.
- Predicts exons.
- How it works:
  - It uses GenScan to predict all possible exons from the genomic sequence.
- In TwinScan, the putative exons are used for BLAST searching to find the closest homologs.
- The putative exons and the homologs from the BLAST search are aligned to identify the best match.
- Only the closest match from a genome database is used as a template for refining the previous exon selection and exon boundaries.

Consensus-based programs
- These programs work by retaining common predictions agreed on by most programs and removing inconsistent predictions.
  - Such an integrated approach may improve specificity by correcting false positives and the problem of overprediction.
  - However, since this procedure punishes novel predictions, it may lead to lowered sensitivity and missed predictions.
- Two examples of consensus-based programs are given next.

GeneComber
- Web server: www.bioinformatics.ubc.ca/genecomber/index.php
- Combines HMMgene and GenScan prediction results.
  - The consistency of both prediction methods is calculated.
- In GeneComber, if the two predictions match, the exon score is reinforced.
- If they do not match, exons are proposed based on separate threshold scores.

DIGIT
- Web server: http://digit.gsc.riken.go.jp/cgi-bin/index.cgi
- First, existing gene-finders (FGENESH, GENSCAN, and HMMgene) are applied to an uncharacterized genome sequence (the input sequence).
- Next, DIGIT produces all possible exons from the results of the gene-finders and assigns them their reading frames and scores.
- Finally, DIGIT searches for a set of exons whose additive score is maximized under their reading frame constraints.
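The consensus idea described for GeneComber (reinforce an exon when two predictors agree, otherwise fall back on a separate threshold for each predictor) can be illustrated with a toy rule. This is a sketch under stated assumptions, not GeneComber's or DIGIT's actual scoring: the function name, the `bonus` value, and both thresholds are hypothetical, and exons are assumed to be given as `(start, end)` tuples mapped to scores between 0 and 1.

```python
def combine_exons(pred_a, pred_b, bonus=0.2, threshold_a=0.9, threshold_b=0.9):
    """Merge two exon predictions given as {(start, end): score} dicts.

    Hypothetical rule: exons found by both predictors get a reinforced
    score; exons found by only one are kept only above that predictor's
    own threshold.
    """
    combined = {}
    for exon in set(pred_a) | set(pred_b):
        if exon in pred_a and exon in pred_b:
            # Agreement between predictors: reinforce, capped at 1.0.
            combined[exon] = min(1.0, max(pred_a[exon], pred_b[exon]) + bonus)
        elif exon in pred_a and pred_a[exon] >= threshold_a:
            combined[exon] = pred_a[exon]
        elif exon in pred_b and pred_b[exon] >= threshold_b:
            combined[exon] = pred_b[exon]
    return dict(sorted(combined.items()))


if __name__ == "__main__":
    a = {(100, 200): 0.6, (300, 400): 0.95, (500, 600): 0.5}
    b = {(100, 200): 0.7, (700, 800): 0.92}
    for exon, score in combine_exons(a, b).items():
        print(exon, round(score, 2))
```

In this toy input, the exon at (500, 600) is dropped because only one predictor reports it and its score is below that predictor's threshold, which mirrors the trade-off noted above: fewer false positives at the cost of missing novel predictions.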
Performance evaluation
- Because of the extra layers of complexity in eukaryotic gene prediction, sensitivity and specificity have to be defined at the levels of nucleotides, exons, and entire genes.
- The sensitivity (Sn) at the exon and gene level is the proportion of correctly predicted exons or genes among actual exons or genes.
- The specificity (Sp) at the two levels is the proportion of correctly predicted exons or genes among all predictions made.

$$\mathrm{Sn} = \frac{\text{number of correct exons}}{\text{number of actual exons}}, \qquad \mathrm{Sp} = \frac{\text{number of correct exons}}{\text{number of predicted exons}}$$

- At present, no single software program is able to produce consistently superior results.
  - Some programs may perform well on certain types of exons (e.g., internal or single exons) but not others (e.g., initial and terminal exons).
  - Some are sensitive to the G-C content of the input sequences or to the lengths of introns and exons.
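The exon-level definitions of Sn and Sp above translate directly into code. A minimal sketch, assuming exons are represented as `(start, end)` tuples and that a prediction counts as correct only when both boundaries match exactly (the function name is illustrative):

```python
def exon_level_accuracy(predicted, actual):
    """Return (Sn, Sp) at the exon level.

    Sn = correct exons / actual exons
    Sp = correct exons / predicted exons
    """
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)  # exact boundary matches only
    sn = correct / len(actual) if actual else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


if __name__ == "__main__":
    actual = [(1, 10), (20, 30), (40, 50), (60, 70)]
    predicted = [(1, 10), (20, 30), (45, 50)]
    sn, sp = exon_level_accuracy(predicted, actual)
    print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Here two of four real exons are recovered (Sn = 0.50) and two of three predictions are correct (Sp is about 0.67); the exon predicted at (45, 50) counts as wrong because its start does not match.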
Performance evaluation (continued)
- Most programs overpredict when genes contain long introns.
- In sum, they all suffer from generating a high number of false positives and false negatives.
- This is especially true for ab initio-based algorithms. For complex genomes such as the human genome, most popular programs can predict no more than 40% of the genes exactly right.
- Drawing a consensus from the results of multiple prediction programs may enhance performance to some extent.

Section 2: Prediction and identification of nucleic acid sequences
- Contents of this section
  - Concept of nucleic acid sequence prediction
  - Gene prediction
  - Promoter and regulatory element prediction
  - Restriction site analysis and primer design

3. Promoter and regulatory element prediction
- Promoter and regulatory element prediction: the computational approach to identifying promoters and regulatory elements of genes.
- Promoters: DNA elements located in the vicinity of gene start sites (which should not be confused with translation start sites) that serve as binding sites for the gene transcription machinery, consisting of RNA polymerases and transcription factors.

Programs
- Ab initio-based algorithms
  - BPROM
  - CpGProD (CpG islands)
  - Eponine
  - Cluster-Buster
  - FirstEF (FirstExonFinder)
  - McPromoter

BPROM
- Web-based program.
- Prediction of bacterial promoters.
- Uses a linear discriminant function combined with signal and content information such as consensus promoter sequence and oligonucleotide composition of the promoter sites.

CpGProD
- Web-based program: http://pbil.univ-lyon1.fr/software/cpgprod.html
- Predicts promoters containing a high density of CpG islands in mammalian genomic sequences.
- It calculates moving averages of GC% and CpG ratios (observed/expected) over a window of a certain size (usually 200 bp). When the values are above a certain threshold, the region is identified as a CpG island.
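The window scan described for CpGProD can be sketched as follows. This is an illustration of the general idea, not CpGProD's implementation: the observed/expected ratio uses the common formula (number of CpG × window length) / (number of C × number of G), and the thresholds (GC fraction above 0.5, ratio above 0.6) are assumed illustrative defaults, since the slides do not state which thresholds the program uses.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c, g = window.count("C"), window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def scan_cpg(seq, window=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Start positions of windows whose GC fraction and CpG ratio both
    exceed the (assumed) thresholds."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc > min_gc and ratio > min_obs_exp:
            hits.append(start)
    return hits


if __name__ == "__main__":
    toy = "AT" * 10 + "CG" * 10 + "AT" * 10
    print(scan_cpg(toy, window=10, step=5))
```

A real scan would merge overlapping hit windows into island intervals and apply a minimum island length; both steps are omitted to keep the sketch short.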
Eponine
- Web-based program: http://servlet.sanger.ac.uk:8080/eponine/
- Predicts transcription start sites.
- Based on a series of preconstructed PSSMs of several regulatory sites, such as the TATA box, the CCAAT box, and CpG islands.
- The query sequence from a mammalian source is scanned through the PSSMs. The sequence stretches with high-scoring matches to all the PSSMs, as well as matching spacing between the elements, are declared transcription start sites.

Cluster-Buster
- Web-based program: http://zlab.bu.edu/cluster-buster/cbust.html
- HMM-based.
- Designed to find clusters of regulatory binding sites.
- It works by detecting, in a query sequence, a region with a high concentration of known transcription factor binding sites and regulatory motifs.
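Scanning a sequence through a PSSM, as described for Eponine above, amounts to sliding a log-odds matrix along the sequence and scoring each window. The sketch below shows that single step only; it is not Eponine's model. The example sites, the pseudocount, and the uniform background of 0.25 per base are all assumptions chosen for illustration, and the input is assumed to contain only A, C, G, and T.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from aligned sites of equal length."""
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def best_match(seq, pssm):
    """Return (position, score) of the highest-scoring window in seq."""
    width = len(pssm)
    best = None
    for i in range(len(seq) - width + 1):
        score = sum(pssm[j][seq[i + j]] for j in range(width))
        if best is None or score > best[1]:
            best = (i, score)
    return best


if __name__ == "__main__":
    # Hypothetical TATA-box-like training sites.
    sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
    pssm = build_pssm(sites)
    print(best_match("GGCGCTATAAAGGCG", pssm))
```

A promoter predictor in the style described above would additionally require high-scoring matches to several matrices at compatible spacing before reporting a transcription start site.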
FirstEF (FirstExonFinder)
- Web-based program: http://rulai.cshl.org/tools/FirstEF/
- Predicts promoters for human DNA.
- It integrates gene prediction with promoter prediction.
- It uses quadratic discriminant functions to calculate the probabilities of the first exon of a gene and its boundary sites.
- A segment of DNA (15 kb) upstream of the first exon is subsequently extracted for promoter prediction on the basis of scores for CpG islands.

McPromoter
- Web-based program: http://genes.mit.edu/McPromoter.html
- Uses a neural network to make promoter predictions.
- The program scans a window of 300 bases for the likelihoods of being in each of the coding, noncoding, and promoter regions.
- The program is currently trained for Drosophila and human sequences.
4. Restriction site analysis and primer design

Restriction enzyme analysis
- Restriction endonucleases are nucleases, found in many bacteria, that recognize and cut foreign DNA.
- A bacterium's own DNA is methylated by the corresponding DNA methylase at the recognition sites of its restriction enzymes, so it is not hydrolyzed by them. This property has made restriction enzymes one of the key enzyme tools in genetic engineering experiments.
- Each restriction enzyme has a specific DNA recognition sequence, which is palindromic. Locating restriction sites in DNA is an indispensable step in gene manipulation.
- Most DNA sequence analysis packages include a program for finding restriction sites. These programs come with a restriction site database file and search the sequence against it.

Common software for restriction enzyme analysis
- RESTRICTION ANALYSIS
- DNAssist 1.02
- DFW 2.21
- Generunner
- Download: http://biosoft.biosino.org/dna.html

Dnastar
- Sequence format conversion
- Restriction enzyme analysis
- Sequence assembly
- Download: http://

Primer design
- In principle, primer design and analysis is not a basic method of DNA sequence analysis, but it is needed frequently in molecular biology research.
- Here we mainly cover primer design for PCR.

Primer design criteria
- Primers are usually 20-30 bases long.
- Avoid hairpin structures within a primer.
- Avoid complementary pairing between the primers.
- Avoid similar sequences in the two primers.
- Primers should have no significant similarity to other sequences in nucleic acid sequence databases.
- A suitable restriction site can be added to the 5' end of a primer.
- Primer composition should be even: avoid runs of the same base, and keep the G+C% of the two primers similar.
- Primer design therefore involves several basic DNA sequence analysis steps: computing sequence composition, similarity searching against DNA sequence databases, comparing two sequences, analyzing base pairing and hairpin structures, and searching for restriction sites.
- In practice, many PCR primer design programs skip or simplify some of these steps.

Primer design tools: Primer Premier 5.0
- Primers for amplifying a given fragment can be obtained simply by dragging with the mouse; at any point during manual adjustment, the changing parameters and possible dimers, heterodimers, hairpins, and so on are displayed below.
- Alternatively, you can specify conditions and let the software search for primers automatically and display the analysis results. These operations are very simple.
- Download: http://
- Find WIN.INI on the C drive and change vspace=DU to vspace=PU to use all functions.

Other primer design software
- Primer3
  - http://www-genome.wi.mit.edu/genome_software/other/primer3.html
  - http://www.bio-

Principles used in practical primer design
- Primer length 20-30 bases, preferably no more than 30.
- Tm = (A+T) × 2 + (G+C) × 4; the annealing temperature is Tm - 7.
- G+C% = 40-60%.
- The 5' and 3' primers should preferably have equal annealing temperatures.
- Preferably no run of four identical bases.
- Avoid T as the last base of a primer.
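The practical rules above are simple enough to check mechanically. The following sketch applies them to one primer; the function names and the returned dictionary layout are illustrative choices, and the Tm formula is the one given above, Tm = 2 × (A+T) + 4 × (G+C), often called the Wallace rule, which is a rough estimate intended for short oligonucleotides.

```python
import re


def wallace_tm(primer):
    """Tm = 2 * (A + T) + 4 * (G + C), in degrees Celsius."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def check_primer(primer):
    """Check one primer against the practical rules listed above."""
    p = primer.upper()
    gc_percent = 100 * (p.count("G") + p.count("C")) / len(p)
    tm = wallace_tm(p)
    return {
        "length_ok": 20 <= len(p) <= 30,
        "gc_ok": 40 <= gc_percent <= 60,
        "no_run_of_four": re.search(r"(.)\1{3}", p) is None,
        "last_base_not_T": not p.endswith("T"),
        "tm": tm,
        "annealing": tm - 7,
    }


if __name__ == "__main__":
    # Hypothetical 20-mer used only to exercise the checks.
    for rule, value in check_primer("AGCTGACCTGAAGCTCATCG").items():
        print(rule, value)
```

For the hypothetical 20-mer in the example, the sketch reports a Tm of 62 and an annealing temperature of 55. Hairpin, primer-dimer, and database similarity checks from the criteria list are not covered here.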
Neural Networks
- A neural network (or artificial neural network) is a statistical model with a special architecture for pattern recognition and classification.
- It is composed of a network of mathematical variables that resemble the biological nervous system, with variables or nodes connected by weighted functions that are analogous to synapses (Fig. 8.6).
- Another aspect of the model that makes it look like a biological neural network is its ability to "learn" and then make predictions after being trained.
- The network is able to process information and modify the parameters of the weight functions between variables during the training stage.
- Once it is trained, it is able to make automatic predictions about the unknown.

Neural networks in gene prediction
- In gene prediction, a neural network is constructed with multiple layers:
  - Input: the gene sequence with intron and exon signals.
  - Output: the probability of an exon structure.
  - Hidden layers of the network: one or several layers where the machine learning takes place.
    - The machine learning process starts by feeding the model with a sequence of known gene structure. During training, the gene structure information is separated into several classes of features, such as hexamer frequencies, splice sites, and GC composition. The weight functions in the hidden layers are adjusted during this process to recognize the nucleotide patterns and their relationship with known structures.
- When the algorithm predicts an unknown sequence after training, it applies the same rules learned in training to look for patterns associated with the gene structures.

Discriminant Analysis
- Some gene prediction algorithms rely on discriminant analysis, either LDA or QDA, to improve accuracy.
- LDA (linear discriminant analysis) works by plotting a two-dimensional graph of coding signals versus all potential 3' splice site positions and drawing a diagonal line that best separates coding signals from noncoding signals, based on knowledge learned from training data sets of known gene structures.
- QDA (quadratic discriminant analysis) draws a curved line based on a quadratic function to separate coding and noncoding features.
  - It is more flexible and provides a more optimal separation between the data points.
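The "line that best separates" two classes in LDA can be computed in closed form for two features. The sketch below uses Fisher's linear discriminant with a decision threshold at the midpoint of the two class means, which assumes equal class priors. The two features per candidate (for example, a coding score and a splice-site score) and all data values are hypothetical; this is not the procedure of any specific gene finder named above.

```python
def _mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def _scatter(points, m):
    """Within-class scatter terms (sxx, sxy, syy) around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_lda(coding, noncoding):
    """Return (w, c) so that w . x > c classifies x as coding."""
    m1, m0 = _mean(coding), _mean(noncoding)
    a, b = _scatter(coding, m1), _scatter(noncoding, m0)
    sxx, sxy, syy = a[0] + b[0], a[1] + b[1], a[2] + b[2]
    det = sxx * syy - sxy * sxy  # must be nonzero for the inverse to exist
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = Sw^-1 (m1 - m0), using the 2x2 inverse written out by hand.
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    mid = ((m1[0] + m0[0]) / 2, (m1[1] + m0[1]) / 2)
    return w, w[0] * mid[0] + w[1] * mid[1]


def is_coding(point, w, c):
    return w[0] * point[0] + w[1] * point[1] > c


if __name__ == "__main__":
    coding = [(0.8, 0.9), (0.7, 0.8), (0.9, 0.7), (0.8, 0.6)]
    noncoding = [(0.2, 0.3), (0.3, 0.1), (0.1, 0.2), (0.2, 0.4)]
    w, c = fit_lda(coding, noncoding)
    print(is_coding((0.75, 0.7), w, c), is_coding((0.25, 0.2), w, c))
```

QDA differs in that each class keeps its own covariance, which turns the straight boundary into a quadratic curve; that variant is not shown.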












