TCGA数据下载,提取lncRNA mRNA—科研工具箱

请关注公众号【叨客学习资料】 在使用网站的过程中有疑问,请来公众号进行反馈哦

此程序用来下载TCGA中的FPKM数据,TPM转化,以及mRNA与lncRNA提取,以便进行分析。另外进行差异分析需要count数据,只需要把workflow_type <- “HTSeq – FPKM”中的FPKM换成count。

如有问题请联系进哥哥(代码中提取lncRNA或mRNA时有两个参考文件联系进哥哥获取)

############################################################
# Producer=Jin Wang
###########################################################
# 安装TCGAbiolinks包
#source(\"https://bioconductor.org/biocLite.R\")
#biocLite(\"TCGAbiolinks\")
###########################################################
  
#  加载需要的包
library(SummarizedExperiment)
library(TCGAbiolinks)
  
###########################################################
# GDC: https://portal.gdc.cancer.gov/
###########################################################
  
# 设置环境参数
work_dir <- \"D:\\study\\TCGA\\GDCdata/lnc_mRNA\"
  
# 设置程序参数,下载count
project <- \"TCGA-LUAD\"
data_category <- \"Transcriptome Profiling\"
data_type <- \"Gene Expression Quantification\"
workflow_type <- \"HTSeq - FPKM\"
legacy <- FALSE
  
  
# 设置工作目录
setwd(work_dir)
  
# 下载基因表达量,count/FPKM数格式的结果
DataDirectory <- paste0(work_dir,\"/GDC/\",gsub(\"-\",\"_\",project))
FileNameData <- paste0(DataDirectory, \"_\",\"RNAseq_HTSeq_FPKM\",\".rda\")
  
# 查询可以下载的数据
query <- GDCquery(project = project,
                  data.category = data_category,
                  data.type = data_type, 
                  workflow.type = workflow_type,
                  legacy = legacy)
  
# 该癌症总样品数量
samplesDown <- getResults(query,cols=c(\"cases\"))
cat(\"Total sample to download:\", length(samplesDown))
  
# TP 样品数量
dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown, typesample = \"TP\")
cat(\"Total TP samples to down:\", length(dataSmTP))
  
# NT 样本数量
dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,typesample = \"NT\")
cat(\"Total NT samples to down:\", length(dataSmNT))
  
  
# 下载数据, 数据比较大,耐心等待
GDCdownload(query = query,
            directory = DataDirectory,files.per.chunk=6,method = \"api\")
  
# 保存结果,方便后面使用
data <- GDCprepare(query = query, 
                   save = TRUE, 
                   directory =  DataDirectory,
                   save.filename = FileNameData)
  
# 表达量提取
data_expr <-as.data.frame(assay(data))
dim(data_expr)
  
#fpkmToTpm
fpkmToTpm <- function(fpkm)
{
  exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}
  
data_expr_Tpm<- as.data.frame (apply(data_expr  , 2, fpkmToTpm))
  
  
expr_file_F <- paste0(DataDirectory, \"_\",\"All_HTSeq_FPKM\",\".txt\")
expr_file_T <- paste0(DataDirectory, \"_\",\"All_HTSeq_TPM\",\".txt\")
  
write.table(data_expr, file = expr_file_F, sep=\"	\", row.names =T, quote = F)
write.table(data_expr_Tpm, file = expr_file_T, sep=\"	\", row.names =T, quote = F)
  
  
  
  
# 提取Gene 表达量矩阵
gene_info = read.table(\"./ref/gene_info.txt\",header = 1)
cat(\"Total Gene annotation:\", dim(gene_info)[1])
gene_selected <- row.names(data_expr_Tpm) %in% gene_info$Gene_id
gene_expr <- data_expr_Tpm[gene_selected,]
cat(\"Total Gene Selected:\", dim(gene_expr)[1])
gene_expr_file <- paste0(DataDirectory, \"_\",\"Gene_HTSeq_TPM\",\".txt\")
write.table(gene_expr, file = gene_expr_file, sep=\"	\", row.names =T, quote = F)
  
  
# 提取lncRNA 表达量矩阵
lncRNA_info = read.table(\"./ref/lncRNA_info.txt\",header = 1)
cat(\"Total LncRNA annotation:\", dim(lncRNA_info)[1])
lncRNA_selected <- row.names(data_expr_Tpm) %in% lncRNA_info$Gene_id
lncRNA_expr <- data_expr_Tpm[lncRNA_selected,]
cat(\"Total LncRNA Selected:\", dim(lncRNA_expr)[1])
lnc_expr_file <- paste0(DataDirectory, \"_\",\"LncRNA_HTSeq_TPM\",\".txt\")
write.table(lncRNA_expr, file = lnc_expr_file, sep=\"	\", row.names =T, quote = F)

如需下载其它数据,比如miRNA,拷贝数等等,对应修改参数即可。

project:

可以使用getGDCprojects()$project_id得到各个癌种的项目id,总共有52个ID值。

图片[1]-TCGA数据下载,提取lncRNA mRNA—科研工具箱-叨客学习资料网
图片[2]-TCGA数据下载,提取lncRNA mRNA—科研工具箱-叨客学习资料网
图片[3]-TCGA数据下载,提取lncRNA mRNA—科研工具箱-叨客学习资料网

workflow.type:
HTSeq – FPKM-UQ:FPKM上四分位数标准化值
HTSeq – FPKM:FPKM值/表达量值
HTSeq – Counts:原始count数
STAR – Counts


后续分析:

© 版权声明
THE END
喜欢就支持一下吧
点赞0 分享
评论 抢沙发
头像
请输入有效评论哦,肆意灌水或者乱打评论是不会通过的,会影响您评论后获得资源哦~~
提交
头像

昵称

取消
昵称表情

    暂无评论内容