
EST-Course20040318.ppt
32 pages

EST: Exploring the transcriptome
Ruiqiang Li (李瑞强)
Beijing Genomics Institute (BGI)
2004-03-18

1. What is EST
2. Why EST sequencing
3. Processing of ESTs
4. Usage of ESTs

What is EST?
EST: Expressed sequence tags
1. Take a cell or tissue of interest.
2. Isolate mRNAs from the tissue(s).
3. Reverse transcribe into cDNA, reflecting parts of the RNAs.
4. Clone the cDNAs into a vector (often in random orientation).
5. End-sequence the clones.

[Figure: An overview of the process of protein synthesis. Image adapted from http://ncbi.nlm.nih.gov/About/primer/est.html]

[Figure: An overview of how ESTs are generated. Image adapted from ncbi.nlm.nih.gov/About/primer/est.html. Labels: Cell or tissue; Isolate mRNA and reverse transcribe into cDNA; Clone cDNA fragments into vectors to make a cDNA library; Vectors; Pick a clone and sequence the 5' and 3' ends of the cDNA insert; 5' EST, 3' EST]

Why EST sequencing?
• Systematic sampling of the transcribed portion of the genome ("transcriptome")
• Provides experimental evidence for the positions of exons
• Provides regions coding for potentially new proteins
• Provides clones for DNA microarrays

Characteristics of ESTs
• 400~600 bp
• Only fragments of genes, not complete coding sequences
• Highly redundant
• Low sequence quality
• (Cheap)
• Reflect expressed genes
• May be tissue/stage specific

Processing…
1. Trim off low-quality sequences (phred, Q20).
2. Screen out vector and bacterial contaminant sequences (cross_match, with vectors and contaminants as the library).
3. Remove mtRNAs and rRNAs (compare to mtRNAs and rRNAs using cross_match, blastn, …).
4. Mask transposons (RepeatMasker).
5. Ignore sequences < 100 bp.
6. Clustering: associate individual EST sequences with unique transcripts or genes (D2_cluster, sequence similarity).
7. Assembly: derive consensus sequences from overlapping ESTs belonging to the same cluster (Phrap, CAP3).

Functional annotation:
• InterPro
• GO
• KEGG

GO:

KEGG:

Pipelines:
• UniGene
• HGI (Human Gene Index)
• TIGR Assembler
• STACK (Sequence Tag Alignment and Consensus Knowledgebase)
• CAT

TIGR_ASSEMBLER
• THC_BUILD: BLAST and FASTA identify all overlaps
and these are stored.
• TIGR Assembler then uses rapid oligonucleotide comparison and assembles non-repeat overlaps (95% identity over 40 bp).
• Matching constraints on sequence ends
• Minimum sequence identity within a sequence group; more fragmented as a result
• Other TIGR approaches are similar

UniGene

EST database: dbEST
dbEST release 030504
Total: 20,151,345 public entries; 660 organisms.

Homo sapiens (human)                       5,487,412
Mus musculus + domesticus (mouse)          4,067,826
Rattus sp. (rat)                             592,059
Triticum aestivum (wheat)                    549,926
Gallus gallus (chicken)                      460,385
Danio rerio (zebrafish)                      450,652
Zea mays (maize)                             393,719
Xenopus laevis (African clawed frog)         368,783
Bos taurus (cattle)                          365,581
Hordeum vulgare + subsp. vulgare (barley)    356,848
Glycine max (soybean)                        346,582
Xenopus tropicalis                           300,267
Oryza sativa (rice)                          283,935
Drosophila melanogaster (fruit fly)          274,367
Sus scrofa (pig)                             272,188
Caenorhabditis elegans (nematode)            231,096
Arabidopsis thaliana (thale cress)           204,396

Problems in EST sequencing
• Low sequence quality, frameshifts
• Chimeric cDNA clones
• Retained introns
• Other limitations

Usage of ESTs:
• Get coding regions; cDNA sequences can reveal many new protein-coding genes.
• Know genome coverage.
• Help genome annotation.
• Compare expression patterns.
• Detect alternative splicing.
• Find SNPs (single nucleotide polymorphisms).
• Provide data for arrays.

Genome annotation: Ensembl

Analysis of gene expression
Tissue specificity
• Counting the frequency of ESTs derived from a specific tissue within one sequence cluster
• Searching for clusters/contigs which are tissue specific (e.g.
tumor)
• Searching for alternative splice variants which are potentially tissue specific

Types of alternative splicing
• Skipped exons
• Retained introns
• Alternative donor or acceptor site

Alternative splicing
[Figure labels: Three subassemblies; Potential alternate expression form]

Detect SNPs from ESTs
[Figure label: SNP or basecalling error]

Large-Scale Statistical Analyses of Rice ESTs Reveal Correlated Patterns of Gene Expression
Genome Research 1054-9803/99
In this report, we go a step further in showing that computer analyses of plant EST data can be used to generate evidence of correlated expression patterns of genes across various tissues. Furthermore, tissue types and organs can be classified with respect to one another on the basis of their global gene expression patterns. As in previous studies, expression profiles are first estimated from EST counts. By clustering gene expression profiles or whole cDNA library profiles, we show that genes with similar functions, or cDNA libraries expected to share patterns of gene expression, are grouped together. Promising uses of this technique include functional genomics, in which evidence of correlated expression might complement (or substitute for) those of sequence similarity in the annotation of anonymous genes and identification of surrogate markers. The analysis presented here combines the application of a correlation-based clustering method with a graphical color map allowing intuitive visualization of patterns within a large table of expression measurements.

EST Analysis of the Cnidarian Acropora millepora Reveals Extensive Gene Loss and Rapid Sequence Divergence in the Model Invertebrates
Current Biology, Vol. 13, 2190–2195, December 16, 2003
A significant proportion of mammalian genes are not represented in the genomes of Drosophila, Caenorhabditis or Saccharomyces, and many of these are assumed to have been vertebrate innovations. To test this assumption, we conducted a preliminary EST project on the anthozoan cnidarian, Acropora millepora, a basal metazoan.
More than 10% of the Acropora ESTs with strong metazoan matches to the databases had clear human homologs but were not represented in the Drosophila or Caenorhabditis genomes; this category includes a surprising diversity of transcription factors and metabolic proteins that were previously assumed to be restricted to vertebrates. Consistent with higher rates of divergence in the model invertebrates, three-way comparisons show that most Acropora ESTs match human sequences much more strongly than they do any Drosophila or Caenorhabditis sequence. Gene loss has thus been much more extensive in the model invertebrate lineages than previously assumed and, as a consequence, some genes formerly thought to be vertebrate inventions must have been present in the common metazoan ancestor. The complexity of the Acropora genome is paradoxical, given that this organism contains apparently few tissue types and the simplest extant nervous system consisting of a morphologically homogeneous nerve net.

Thanks!
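
Supplementary sketch for the processing steps "trim off low quality sequences (Q20)" and "ignore sequences <100bp". This is only an illustration under stated assumptions: it keeps the longest contiguous run of bases with quality >= 20, which is a simple stand-in and not phred's own trimming algorithm. The function names and the per-base quality list are hypothetical; the two thresholds are the ones given on the slide.

```python
from typing import List, Optional, Tuple

MIN_QUALITY = 20   # Q20 threshold from the processing slide
MIN_LENGTH = 100   # reads shorter than 100 bp are ignored


def longest_high_quality_window(quals: List[int],
                                min_q: int = MIN_QUALITY) -> Tuple[int, int]:
    """Return (start, end) of the longest run where every quality >= min_q.

    `end` is exclusive. Returns (0, 0) if no base passes the threshold.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            # Keep the first longest run found so far.
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


def trim_est(seq: str, quals: List[int]) -> Optional[str]:
    """Trim a read to its longest high-quality run; None if it ends up too short."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    start, end = longest_high_quality_window(quals)
    trimmed = seq[start:end]
    return trimmed if len(trimmed) >= MIN_LENGTH else None
```

A read that returns None here would simply be dropped before the vector screening, masking, clustering and assembly steps.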

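
Supplementary sketch for the "Analysis of gene expression" slide, which suggests counting the frequency of ESTs from a specific tissue within one sequence cluster. The input format (pairs of cluster ID and tissue label), the function names and the example labels are assumptions made for illustration. Raw counts also depend on how many ESTs each library contributed, so a real analysis would normalize for library size.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def tissue_counts(ests: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count ESTs per tissue within each cluster from (cluster_id, tissue) pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for cluster_id, tissue in ests:
        counts[cluster_id][tissue] += 1
    return dict(counts)


def tissue_fraction(counts: Dict[str, Counter], cluster_id: str, tissue: str) -> float:
    """Fraction of a cluster's ESTs that come from the given tissue (0.0 if none)."""
    cluster = counts.get(cluster_id, Counter())
    total = sum(cluster.values())
    return cluster[tissue] / total if total else 0.0


# Hypothetical example: cluster "c1" has two tumor ESTs and one liver EST.
example = [("c1", "tumor"), ("c1", "tumor"), ("c1", "liver"), ("c2", "liver")]
example_counts = tissue_counts(example)
```

A cluster whose fraction for one tissue is close to 1.0, with enough ESTs behind it, is a candidate for tissue-specific expression.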