此程序用来下载TCGA中的FPKM数据,TPM转化,以及mRNA与lncRNA提取,以便进行分析。另外进行差异分析需要count数据,只需要把workflow_type <- “HTSeq – FPKM”中的FPKM换成count。
如有问题请联系进哥哥(代码中提取lncRNA或mRNA时有两个参考文件联系进哥哥获取)
############################################################ # Producer=Jin Wang ########################################################### # 安装TCGAbiolinks包 #source(\"https://bioconductor.org/biocLite.R\") #biocLite(\"TCGAbiolinks\") ########################################################### # 加载需要的包 library(SummarizedExperiment) library(TCGAbiolinks) ########################################################### # GDC: https://portal.gdc.cancer.gov/ ########################################################### # 设置环境参数 work_dir <- \"D:\\study\\TCGA\\GDCdata/lnc_mRNA\" # 设置程序参数,下载count project <- \"TCGA-LUAD\" data_category <- \"Transcriptome Profiling\" data_type <- \"Gene Expression Quantification\" workflow_type <- \"HTSeq - FPKM\" legacy <- FALSE # 设置工作目录 setwd(work_dir) # 下载基因表达量,count/FPKM数格式的结果 DataDirectory <- paste0(work_dir,\"/GDC/\",gsub(\"-\",\"_\",project)) FileNameData <- paste0(DataDirectory, \"_\",\"RNAseq_HTSeq_FPKM\",\".rda\") # 查询可以下载的数据 query <- GDCquery(project = project, data.category = data_category, data.type = data_type, workflow.type = workflow_type, legacy = legacy) # 该癌症总样品数量 samplesDown <- getResults(query,cols=c(\"cases\")) cat(\"Total sample to download:\", length(samplesDown)) # TP 样品数量 dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown, typesample = \"TP\") cat(\"Total TP samples to down:\", length(dataSmTP)) # NT 样本数量 dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,typesample = \"NT\") cat(\"Total NT samples to down:\", length(dataSmNT)) # 下载数据, 数据比较大,耐心等待 GDCdownload(query = query, directory = DataDirectory,files.per.chunk=6,method = \"api\") # 保存结果,方便后面使用 data <- GDCprepare(query = query, save = TRUE, directory = DataDirectory, save.filename = FileNameData) # 表达量提取 data_expr <-as.data.frame(assay(data)) dim(data_expr) #fpkmToTpm fpkmToTpm <- function(fpkm) { exp(log(fpkm) - log(sum(fpkm)) + log(1e6)) } data_expr_Tpm<- as.data.frame (apply(data_expr , 2, fpkmToTpm)) expr_file_F <- paste0(DataDirectory, \"_\",\"All_HTSeq_FPKM\",\".txt\") expr_file_T <- paste0(DataDirectory, \"_\",\"All_HTSeq_TPM\",\".txt\") write.table(data_expr, file = expr_file_F, sep=\" \", row.names =T, quote = F) write.table(data_expr_Tpm, file = expr_file_T, sep=\" \", row.names =T, quote = F) # 提取Gene 表达量矩阵 gene_info = read.table(\"./ref/gene_info.txt\",header = 1) cat(\"Total Gene annotation:\", dim(gene_info)[1]) gene_selected <- row.names(data_expr_Tpm) %in% gene_info$Gene_id gene_expr <- data_expr_Tpm[gene_selected,] cat(\"Total Gene Selected:\", dim(gene_expr)[1]) gene_expr_file <- paste0(DataDirectory, \"_\",\"Gene_HTSeq_TPM\",\".txt\") write.table(gene_expr, file = gene_expr_file, sep=\" \", row.names =T, quote = F) # 提取lncRNA 表达量矩阵 lncRNA_info = read.table(\"./ref/lncRNA_info.txt\",header = 1) cat(\"Total LncRNA annotation:\", dim(lncRNA_info)[1]) lncRNA_selected <- row.names(data_expr_Tpm) %in% lncRNA_info$Gene_id lncRNA_expr <- data_expr_Tpm[lncRNA_selected,] cat(\"Total LncRNA Selected:\", dim(lncRNA_expr)[1]) lnc_expr_file <- paste0(DataDirectory, \"_\",\"LncRNA_HTSeq_TPM\",\".txt\") write.table(lncRNA_expr, file = lnc_expr_file, sep=\" \", row.names =T, quote = F)
如需下载其它数据,比如miRNA,拷贝数等等,对应修改参数即可。
project:
可以使用getGDCprojects()$project_id得到各个癌种的项目id,总共有52个ID值。
![图片[1]-TCGA数据下载,提取lncRNA mRNA—科研工具箱-叨客学习资料网](https://cdn.leobba.cn/wp-content/uploads/18447056-70a9c8e2771582b2.webp_-1024x508-1.jpg~tplv-vsxgrxnt6c-1.image)
![图片[2]-TCGA数据下载,提取lncRNA mRNA—科研工具箱-叨客学习资料网](https://cdn.leobba.cn/wp-content/uploads/18447056-8b6f782b684cf93b.webp_.jpg~tplv-vsxgrxnt6c-1.image)
![图片[3]-TCGA数据下载,提取lncRNA mRNA—科研工具箱-叨客学习资料网](https://cdn.leobba.cn/wp-content/uploads/18447056-77ec89d1d775c6bf.webp_.jpg~tplv-vsxgrxnt6c-1.image)
workflow.type:
HTSeq – FPKM-UQ:FPKM上四分位数标准化值
HTSeq – FPKM:FPKM值/表达量值
HTSeq – Counts:原始count数
STAR – Counts
后续分析:
© 版权声明
THE END
暂无评论内容