
edgeR-DESeq2分析RNA-seq差异表达.doc
14页edgeR 包的安装· edgeR 包是基于 Bioconductor 平台发布的,因此安装不能直接用 install.packages() 命令从 CRAN 上来下载· 安装:# try http:// if https:// URLs are not supported>source("")>biocLite("edgeR")数据导入· 由于 edgeR 对测序成果的下游分析是依赖 count 计数来进行基因差别体现分析的,在这里使用的是featureCounts 来进行记录 `.bam` 文献中 Map 的成果· count 成果如下:>library(edgeR)>mydata <- read.table("counts.txt", header = TRUE, quote = '\t',skip =1)>sampleNames <- c("CA_1","CA_2","CA_3","CC_1","CC_2","CC_3")>names(mydata)[7:12] <- sampleNames>head(mydata) Geneid Chr Start End Strand Length CA_1 CA_2 CA_3 CC_1 CC_2 CC_31 gene1314 NW_139421.1 1257 1745 + 489 0 0 0 0 0 02 gene1315 NW_139421.1 2115 3452 + 1338 0 0 0 0 0 03 gene1316 NW_139421.1 3856 4680 + 825 0 0 0 0 0 04 gene1317 NW_139421.1 4866 5435 - 570 0 0 0 0 0 05 gene1318 NW_139421.1 6066 6836 - 771 0 0 0 0 0 06 gene1319 NW_139421.1 7294 9483 + 2190 0 0 0 0 0 0· 在这里我们只是需要 Geneid 和后 6 列的样本的 count 信息来构成矩阵,因此要解决下>countMatrix <- as.matrix(mydata[7:12])>rownames(countMatrix) <-mydata$Geneid>head(countMatrix) CA_1 CA_2 CA_3 CC_1 CC_2 CC_3gene1314 0 0 0 0 0 0gene1315 0 0 0 0 0 0gene1316 0 0 0 0 0 0gene1317 0 0 0 0 0 0gene1318 0 0 0 0 0 0gene1319 0 0 0 0 0 0*要导入的矩阵由3v3样本构成(三组生物学反复)创立 DEGlist>group <- factor(c("CA","CA","CA","CC","CC","CC"))>y <- DGEList(counts = countMatrix, group = group)>yAn object of class "DGEList"$counts CA_1 CA_2 CA_3 CC_1 CC_2 CC_3gene1314 0 0 0 0 0 0gene1315 0 0 0 0 0 0gene1316 0 0 0 0 0 0gene1317 0 0 0 0 0 0gene1318 0 0 0 0 0 014212 more rows ...$samples group lib.size norm.factorsCA_1 CA_1 1788537 1CA_2 CA_2 1825546 1CA_3 CA_3 1903017 1CC_1 CC_1 1826042 1CC_2 CC_2 2124468 1CC_3 CC_3 2025063 1过滤· 过滤掉那些 count 成果都为0的数据,这些没有体现的基因对成果的分析没有用,过滤又两点好处:1 可以减少内存的压力 2 可以减少计算的压力>keep <- rowSums(cpm(y)>1) >= 2>y <- y[keep, , keep.lib.sizes=FALSE]>yAn object of class "DGEList"$counts CA_1 CA_2 CA_3 CC_1 CC_2 CC_3gene1321 161 138 129 218 194 220gene1322 2 3 1 1 3 3gene1323 20 27 33 47 51 46gene1324 60 87 79 86 100 132gene1325 32 29 21 58 75 563877 more rows ...$samples group lib.size norm.factorsCA_1 CA_1 1788362 1CA_2 CA_2 1825308 1CA_3 CA_3 1902796 1CC_1 CC_1 1825889 1CC_2 CC_2 2124155 1CC_3 CC_3 2024786 1原则化解决· edgeR采用的是 TMM 措施进行原则化解决,只有原则化解决后的数据才又可比性>y <- calcNormFactors(y)>yAn object of class "DGEList"$counts CA_1 CA_2 CA_3 CC_1 CC_2 CC_3gene1321 161 138 129 218 194 220gene1322 2 3 1 1 3 3gene1323 20 27 33 47 51 46gene1324 60 87 79 86 100 132gene1325 32 29 21 58 75 563877 more rows ...$samples group lib.size norm.factorsCA_1 CA_1 1788362 0.9553769CA_2 CA_2 1825308 0.9052539CA_3 CA_3 1902796 0.9686232CC_1 CC_1 1825889 0.9923455CC_2 CC_2 2124155 1.1275178CC_3 CC_3 2024786 1.0668754设计矩阵· 为什么要一种设计矩阵呢,道理很简朴,有了一种设计矩阵才可以更好的分组分析>subGroup <- factor(substring(colnames(countMatrix),4,4))>design <- model.matrix(~ subGroup+group)>rownames(design) <- colnames(y)>design (Intercept) subGroup2 subGroup3 groupCCCA_1 1 0 0 0CA_2 1 1 0 0CA_3 1 0 1 0CC_1 1 0 0 1CC_2 1 1 0 1CC_3 1 0 1 1attr(,"assign")[1] 0 1 1 2attr(,"contrasts")attr(,"contrasts")$subGroup[1] "contr.treatment"attr(,"contrasts")$group[1] "contr.treatment"评估离散度>y <- estimateDisp(y, design, robust=TRUE)>y$common.dispersion[1] 0.02683622#plot>plotBCV(y)差别体现基因> fit <- glmQLFit(y, design, robust=TRUE)> qlf <- glmQLFTest(fit)> topTags(qlf)Coefficient: groupCC logFC logCPM F PValue FDRgene7024 -5.515648 9.612809 594.9232 6.431484e-44 2.496702e-40gene6612 5.130282 8.451143 468.2060 1.557517e-39 3.023140e-36gene2743 4.377492 5.586773 208.0268 3.488383e-26 4.513967e-23gene12032 4.734383 5.098148 192.9378 4.359649e-25 4.231040e-22gene491 -2.733910 10.412673 190.9839 6.104188e-25 4.739291e-22gene8941 2.997185 6.839106 177.7614 6.332836e-24 4.097345e-21gene2611 -2.846924 7.216173 174.7332 1.099339e-23 6.096619e-21gene6242 2.529125。












