
goseq: Gene Ontology testing for RNA-seq datasets

Matthew D. Young, Nadia Davidson (nadia.davidson@mcri.edu.au), Matthew J. Wakefield, Gordon K. Smyth, Alicia Oshlack

5 February 2014

1 Introduction

This document gives an introduction to the use of the goseq R Bioconductor package [Young et al., 2010]. This package provides methods for performing Gene Ontology analysis of RNA-seq data, taking length bias into account [Oshlack and Wakefield, 2009]. The methods and software used by goseq are equally applicable to other category-based tests of RNA-seq data, such as KEGG pathway analysis.

Once installed, the goseq package can be loaded into R using:

    library(goseq)

In order to perform a GO analysis of your RNA-seq data, goseq only requires a simple named vector, which contains two pieces of information.

1. Measured genes: all genes for which RNA-seq data was gathered for your experiment. Each element of your vector should be named by a unique gene identifier.
2. Differentially expressed genes: each element of your vector should be either a 1 or a 0, where 1 indicates that the gene is differentially expressed and 0 that it is not.

If the organism, gene identifier or category test is not natively supported by goseq, it will also be necessary to supply additional information regarding gene lengths and/or the association between categories and genes.

Bioconductor R packages such as GenomicFeatures and Rsamtools allow mapped reads to be summarized into a table of counts, such as reads per gene. From there, several packages exist for performing differential expression analysis on summarized data (e.g. edgeR [Robinson and Smyth, 2007, 2008, Robinson et al., 2010]).
goseq will work with any method for determining differential expression, so differential expression analysis is outside the scope of this document. For ease of use, however, we will use the edgeR package to calculate differentially expressed (DE) genes in all the case studies in this document.

2 Reading data

We assume that the user can use appropriate built-in R functions (such as read.table or scan) to obtain two vectors, one containing all genes assayed in the RNA-seq experiment, the other containing all genes which are DE. If the vector of assayed genes is named assayed.genes and the vector of DE genes is named de.genes, we can construct a named vector suitable for use with goseq as follows:

    # 1 if the assayed gene is DE, 0 otherwise
    gene.vector = as.integer(assayed.genes %in% de.genes)
    # name each element by its gene identifier
    names(gene.vector) = assayed.genes
    head(gene.vector)

It may be that the user can already read in a vector in this format, in which case it can be used by goseq immediately.

3 GO testing of RNA-seq data

To begin the analysis, goseq first needs to quantify the length bias present in the dataset under consideration. This is done by calculating a Probability Weighting Function (PWF), which can be thought of as a function giving the probability that a gene will be differentially expressed (DE) based on its length alone. The PWF is calculated by fitting a monotonic spline to the binary data series of differential expression (1 = DE, 0 = not DE) as a function of gene length. The PWF is used to weight the chance of selecting each gene when forming a null distribution for GO category membership.

Because the PWF is calculated directly from the dataset under consideration, the approach is robust: it corrects only for the length bias actually present in the data. For example, if goseq is run on a microarray dataset, for which no length bias exists, the calculated PWF will be nearly flat and all genes will be weighted equally, resulting in no length bias correction.
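
To make the idea of a PWF concrete, the following is a minimal sketch in base R using simulated data. It is not goseq's implementation: goseq fits a monotonic spline, whereas this sketch substitutes isotonic regression (isoreg from the stats package) as a simpler monotone fit. The gene lengths, the simulated DE calls, and names such as gene.length, p.de and pwf are all assumptions made for illustration.

```r
# Toy illustration of a probability weighting function (PWF).
# All data are simulated; isoreg() is a stand-in for goseq's monotonic spline.
set.seed(1)

n.genes <- 2000

# Hypothetical gene lengths in bp, roughly log-normally distributed
gene.length <- round(exp(rnorm(n.genes, mean = 7.5, sd = 0.8)))
names(gene.length) <- paste0("gene", seq_len(n.genes))

# Simulate length bias: longer genes are more likely to be called DE
p.de <- plogis(-6 + 0.6 * log(gene.length))
gene.vector <- rbinom(n.genes, size = 1, prob = p.de)
names(gene.vector) <- names(gene.length)

# Fit a non-decreasing step function to the 0/1 DE indicator
# as a function of gene length
fit <- isoreg(gene.length, gene.vector)

# isoreg() returns fitted values in its own sorted order; map them
# back to the original gene order
ord <- if (fit$isOrd) seq_len(n.genes) else fit$ord
pwf <- numeric(n.genes)
pwf[ord] <- fit$yf
names(pwf) <- names(gene.vector)

# Compare the mean estimated DE probability for short and long genes
is.long <- gene.length > median(gene.length)
short.vs.long <- c(short = mean(pwf[!is.long]), long = mean(pwf[is.long]))
print(short.vs.long)
```

On data without length bias the fitted values would be close to constant, which corresponds to the nearly flat PWF described above for microarray data.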
In order to account for the length bias inherent to RNA-seq data when performing a GO analysis (or other category-based tests), one cannot simply use the hypergeometric distribution as the null distribution for category membership; that distribution is appropriate only for data without DE length bias, such as microarray data. GO analysis of RNA-seq data requires the use of random sampling in order to generate a suitable null distribution for GO category membership and calculate each category's significance for over-representation amongst DE genes.

However, this random sampling is computationally expensive. In most cases, the Wallenius non-central hypergeometric distribution can be used to approximate the true null distribution without any significant loss in accuracy. The goseq package implements this approximation as its default; generating the null distribution by random sampling is also available as an option.

Having established a null distribution, each GO category is then tested for over- and under-representation amongst the set of differentially expressed genes, and the null is used to calculate a p-value for each.

4 Natively supported Gene Identifiers and category tests

goseq needs to know the length of each gene, as well as what GO categories (or o
