vignettes/Cell_type_deconvolution_and_differential_analysis.Rmd
Cell_type_deconvolution_and_differential_analysis.Rmd
For this tutorial, we will firstly construct an integrated reference using single-cell RNA-seq (scRNA-seq) and single-nuclei RNA-seq (snRNA-seq) data from normal breast tissue samples.
Then, we will simulate artificial bulk RNA-seq samples using the constructed reference data. Predefined cell-type proportions will be used to introduce heterogeneity between groups for the simulated bulk samples.
We then use OLS
method to deconvolute the bulk samples and predict cell-type proportions for each sample. Differential expression and pathway enrichment analysis will then be performed to identify perturbed genes and pathways. We will also be examining the effect of correcting for cell-type proportion differences on those downstream analysis.
You can also skip this section, and directly load the integrated reference data via data(refdata)
library(SCdeconR)
library(plotly)
# register backend for parallel processing
library(doFuture)
registerDoFuture()
plan("multisession", workers = 4)
# path to reference datasets
ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata",
package = "SCdeconR"), "/refdata/sample2"))
# path to phenodata files
phenodata_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt"),
paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt"))
# construct integrated reference using harmony algorithm
refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger",
method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt",
verbose = FALSE)
refdata
## An object of class Seurat
## 33734 features across 260 samples within 2 assays
## Active assay: SCT (8465 features, 2118 variable features)
## 1 other assay present: RNA
## 4 dimensional reductions calculated: pca, harmony, tsne, umap
Here we set nfeature_rna
to be 50 due to limited number of cells we included in the data. You might want to apply a higher cutoff for your own reference data.
what does refdata looks like
# refdata is a seurat object with those metadata information. use ?refdata for further
# documentation
str(refdata@meta.data)
## 'data.frame': 260 obs. of 12 variables:
## $ orig.ident : chr "SeuratProject" "SeuratProject" "SeuratProject" "SeuratProject" ...
## $ nCount_RNA : num 2093 1576 945 3684 20403 ...
## $ nFeature_RNA : int 836 700 490 1341 3364 4632 1062 911 1767 447 ...
## $ cellid : chr "TAAGAGACACATGACT-1_1" "GACGTGCCAAGTAGTA-1_1" "TCGAGGCCACTTGGAT-1_1" "TTCTCAATCTTGGGTA-1_1" ...
## $ celltype : chr "Epithelial cells" "Epithelial cells" "Epithelial cells" "Epithelial cells" ...
## $ subjectid : chr "subject1" "subject1" "subject1" "subject1" ...
## $ cohort : chr "komentissuebank" "komentissuebank" "komentissuebank" "komentissuebank" ...
## $ percent_mt : num 0.478 4.822 1.799 3.366 4.284 ...
## $ nCount_SCT : num 759 887 746 627 574 558 553 627 626 744 ...
## $ nFeature_SCT : int 446 589 446 292 144 176 286 311 376 395 ...
## $ SCT_snn_res.0.8: Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ seurat_clusters: Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
Ideally, you should split the integrated reference data into training and testing, and use the training data to generate artificial bulk samples. The testing data is used in deconvolution step. However, for the sake of this tutorial, due to the small # of cells we included in the refdata
, we will not split the data into training and testing.
# you can directly load refdata if you skipped the above section
data("refdata")
# create phenodata
phenodata <- data.frame(cellid = colnames(refdata), celltypes = refdata$celltype, subjectid = refdata$subjectid)
# Here is the number of cells per cell-type in refdata
table(refdata$celltype)
##
## Adipocyte Endothelial cells Epithelial cells Fibroblast
## 20 40 40 40
## Monocytes NK cells Pericytes T cell
## 40 20 40 20
Next we generate two sets of bulk data that have different cell-type proportions.
# equal proportions across cell-types
prop1 <- data.frame(celltypes = unique(refdata$celltype), proportion = rep(0.125, 8))
# generate 20 artificial bulk samples with the above cell-type proportions
bulk_sim1 <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata,
num_mixtures = 20, prop = prop1, replace = TRUE)
# high proportion of epithelial cells
prop2 <- data.frame(celltypes = unique(refdata$celltype), proportion = c(0.8, 0.1, 0.1, rep(0, 5)))
# generate 20 artificial bulk samples with the above cell-type proportions
bulk_sim2 <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata,
num_mixtures = 20, prop = prop2, replace = TRUE)
# combine the two datasets
bulk_sim <- list(cbind(bulk_sim1[[1]], bulk_sim2[[1]]), cbind(bulk_sim1[[2]], bulk_sim2[[2]]))
what does simulated bulk data looks like
# bulk_generator returns a list of two elements. the first element is simulated bulk RNA-seq
# data, with rows representing genes, columns representing samples. show first five samples
str(bulk_sim[[1]][, 1:5])
## 'data.frame': 8465 obs. of 5 variables:
## $ mix1_1: num 0.693 6.644 3.871 1.386 0.693 ...
## $ mix1_2: num 0.693 7.625 4.564 0.693 2.773 ...
## $ mix1_3: num 1.386 5.545 2.773 0.693 2.079 ...
## $ mix1_4: num 0.693 4.852 5.257 0 3.466 ...
## $ mix1_5: num 0.693 8.318 1.386 1.386 2.773 ...
# the second element is cell type proportions used to simulate the bulk RNA-seq data, with
# rows representing cell types, columns representing samples. show first five samples
str(bulk_sim[[2]][, 1:5])
## 'data.frame': 8 obs. of 5 variables:
## $ mix1_1: num 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
## $ mix1_2: num 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
## $ mix1_3: num 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
## $ mix1_4: num 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
## $ mix1_5: num 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
# as mentioned ealier, we use the same reference data for deconvolution. in practice, we
# recommend to split your reference data into training and testing, and use testing data for
# decovolution
decon_res <- scdecon(bulk = bulk_sim[[1]], ref = GetAssayData(refdata, slot = "data", assay = "SCT"),
phenodata = phenodata, filter_ref = TRUE, decon_method = "OLS", norm_method = "none", trans_method = "none",
marker_strategy = "all")
what does deconvolution result looks like
# scdecon returns a list of two elements the first element is a data.frame of predicted
# cell-type proportions, with rows representing cell types, columns representing samples.
str(decon_res[[1]])
## num [1:8, 1:40] 0.1416 0.1463 0.1322 0.0483 0.1142 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:8] "Epithelial cells" "Monocytes" "Fibroblast" "T cell" ...
## ..$ : chr [1:40] "mix1_1" "mix1_2" "mix1_3" "mix1_4" ...
# the second element is a data.frame of fitting errors of the algorithm; first column
# represents sample names, second column represents RMSEs.
str(decon_res[[2]])
## 'data.frame': 40 obs. of 2 variables:
## $ sample: chr "mix1_1" "mix1_2" "mix1_3" "mix1_4" ...
## $ RMSE : num 38.4 44.7 40 45.2 36.5 ...
We can then generate a bar plot of predicted cell proportions across samples
prop_barplot(prop = decon_res[[1]], interactive = TRUE)
# prepare sampleinfo, group1 and group2 were the two bulk sets we simulated eariler.
sampleinfo <- data.frame(condition = rep(c("group1", "group2"), each = 20))
row.names(sampleinfo) <- paste0("sample", 1:nrow(sampleinfo))
# prepare bulk samples
bulk <- bulk_sim[[1]]
# force data to be integers for DE purposes
for (i in 1:ncol(bulk)) storage.mode(bulk[, i]) <- "integer"
colnames(bulk) <- rownames(sampleinfo)
# prepare predicted cell-type proportions
prop = decon_res[[1]]
colnames(prop) <- colnames(bulk)
# perform DEA adjusting cell-type proportion differences
deres <- run_de(bulk = bulk, prop = prop, sampleinfo = sampleinfo, control = "group1", case = "group2",
de_method = "edgeR")
# perform DEA without adjusting cell-type proportion differences
deres_notadjust <- run_de(bulk = bulk, prop = NULL, sampleinfo = sampleinfo, control = "group1",
case = "group2", de_method = "edgeR")
Next, we can compare the effect of adjusting for cell-type proportion differences
comparedeg_scatter(results1 = deres[[2]], results2 = deres_notadjust[[2]], result_names = c("adjust for cell proportion",
"not adjust for cell proportion"), fc_cutoff = 1.5, pval_cutoff = 0.05, pvalflag = TRUE, interactive = TRUE)
As you can see, many false positive differential genes were detected without adjusting for cell proportion differences.
We also provide several visualization options for pathway enrichment results generated using GSEA software. Note that those functions are not compatible with the R implementation of GSEA. Also remember to run reformt_gmt()
to reformat the gene-set names before using GSEA.
reformat_gmt(gmtfile = "/path/to/gmt/file/", outputfile = "/path/to/reformatted/gmt/file/", replace = TRUE)
## here is one example of reformatted gmt file for kegg pathway
gmt <- read_gmt(gmtfile = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/gmtfile/kegg.gmt"))
Prepare .rnk
file.
prepare_rnk(teststats = deres[[2]], outputfile = "/path/to/rnk/file/")
Use the reformatted .gmt
file and .rnk
file to run GSEA. You can download GSEA here. Do not use the R version implementation (our visualization functions are not compatible with GSEA R implementation).
We included GSEA output using differential genes w/wo adjusting for cell-type differences. We can now compare the differences between those results.
comparegsea_scatter(gseares_path1 = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/results/age_adjust/"),
gseares_path2 = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/results/age_unadjust/"),
result_names = c("age_adjust", "age_unadjust"), nes_cutoff = 1.5, pval_cutoff = 0.1, pvalflag = FALSE,
interactive = TRUE)
We can then generate summary plot of top enriched gene-sets.
gsea_sumplot(gseares_path = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/results/age_adjust/"),
pos_sel = c("GLYCOSPHINGOLIPID_BIOSYNTHESIS", "FRUCTOSE_AND_MANNOSE_METABOLISM", "HIF-1_SIGNALING_PATHWAY"),
neg_sel = c("P53_SIGNALING_PATHWAY", "RENIN_SECRETION", "LYSOSOME"), pvalflag = FALSE, interactive = TRUE)
Heatmap of two selected gene-sets for demonstration.
gsea_heatmap(normdata = deres[[1]], teststats = deres[[2]], gmtfile = paste0(system.file("extdata",
package = "SCdeconR"), "/gsea/gmtfile/kegg.gmt"), numgenes = 200, gsname_up = "HIF-1_SIGNALING_PATHWAY",
gsname_down = "P53_SIGNALING_PATHWAY", anncol = sampleinfo, color = colorRampPalette(c("blue",
"white", "red"))(100))
Random-walk plot of selected gene-sets
gsea_rwplot(gseares_path = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/results/age_adjust/"),
gsname = "HIF-1_SIGNALING_PATHWAY", class_name = "KEGG")