Deconvolution of bulk RNA-seq data — scdecon • SCdeconR

Deconvolution of bulk RNA-seq data based on single-cell reference data. Eight bulk deconvolution methods, along with eight normalization methods and four transformation methods are available.

scdecon(
  bulk,
  ref,
  phenodata,
  filter_ref = TRUE,
  marker_genes = NULL,
  genes_to_remove = NULL,
  min_pct_ct = 0.05,
  decon_method = c("scaden", "CIBERSORT", "OLS", "nnls", "FARDEEP", "RLR", "MuSiC",
    "SCDC", "scTAPE"),
  norm_method_sc = c("none", "LogNormalize", "SCTransform", "scran", "scater", "Linnorm"),
  norm_method_bulk = c("none", "TMM", "median_ratios", "TPM"),
  trans_method_sc = c("none", "log2", "sqrt", "vst"),
  trans_method_bulk = c("none", "log2", "sqrt", "vst"),
  gene_length = NULL,
  lfc_markers = log2(1.5),
  marker_strategy = c("all", "pos_fc", "top_50p_logFC", "top_50p_AveExpr"),
  to_remove = NULL,
  ffpe_artifacts = FALSE,
  model = NULL,
  prop = NULL,
  cibersortpath = NULL,
  pythonpath = NULL,
  tmpdir = NULL,
  remove_tmpdir = TRUE,
  seed = 1234,
  nsamples = 1000,
  return_value_only = FALSE,
  verbose = FALSE
)

Arguments

bulk

a matrix or data.frame of unnormalizaed & untransformed bulk RNA-seq gene expression values with rows representing genes, columns representing samples

ref

a matrix or data.frame of untransformed scRNA-seq gene expression counts with rows representing genes, columns representing cells. This data will be used to deconvolute provided bulk RNA-seq data.

phenodata

a data.frame with rows representing cells, columns representing cell attributes. It should at least contain the first three columns as:

cell barcodes
cell types
subject ids

filter_ref

logical value indicating whether outlier genes & cells should be removed from the provided reference data. Defaults to TRUE

marker_genes

a data.frame of two columns. First column represents cell types in ref; second column represents gene names of marker genes. If specified, those genes will be used to construct signature matrix for mark-gene based deconvolution methods, such as CIBERSORT, OLS, nnls, FARDEEP and RLR. Default to NULL, carry out differential analysis to identify marker genes for each cell type in ref.

genes_to_remove

a vector of gene names to remove from the reference scRNA-seq data. Default to NULL.

min_pct_ct

a numeric value indicating the minimum required proportion of expressing cells per cell type for marker gene identification. Only applicable when marker_genes is NULL. Default to 0.05.

decon_method

character value specifying the deconvolution method to use. Has to be one of "scaden", "CIBERSORT", "OLS", "nnls", "FARDEEP", "RLR", "MuSiC", "SCDC", "scTAPE". See details for more information.

norm_method_sc

character value specifying the normalization method to use for reference data. Has to be one of "none","LogNormalize", "SCTransform", "scran", "scater", "Linnorm". See details for more information.

norm_method_bulk

character value specifying the normalization method to use for bulk data. Has to be one of "none", "TMM", "median_ratios", "TPM". See details for more information.

trans_method_sc

character value specifying the transformation method to use for both bulk & reference data. Has to be one of "none", "log2", "sqrt", "vst". See details for more information.

trans_method_bulk

character value specifying the transformation method to use for both bulk & reference data. Has to be one of "none", "log2", "sqrt", "vst". See details for more information.

gene_length

a data.frame with two columns. The first column represents gene names that match with provided bulk data. The second column represents length of each gene. Only applicable when norm_method is selected as "TPM".

lfc_markers

log2 fold change cutoff used to identify marker genes for deconvolution. The option only applicable to marker-gene based approaches, such as CIBERSORT, OLS, nnls, FARDEEP and RLR. Only applicable when marker_genes is NULL.

marker_strategy

further strategy in selecting marker genes besides applying the log2 fold change cutoff. Can be chosen from: "all", "pos_fc", "top_50p_logFC" or "top_50p_AveExpr". See details for more information. Only applicable when marker_genes is NULL.

to_remove

character value representing the cell type to remove from reference data. Only applicable to simulation experiments in evaluating effect of cell type removal from reference.

ffpe_artifacts

logical value indicating whether to add simulated ffpe artifacts in the bulk data. Only applicable to simulation experiments in evaluating the effect of FFPE artifacts.

model

pre-constructed ffpe model data. Can be downloaded from github: https://github.com/Liuy12/SCdeconR/blob/master/data/ffpemodel.rda.

prop

a matrix or data.frame of simulated cell proportion values with rows representing cell types, columns representing samples. Only applicable to simulation experiments in evaluating the effect of cell type removal from reference.

cibersortpath

full path to CIBERSORT.R script.

pythonpath

full path to python binary where scaden was installed with.

tmpdir

temporary processing directory for scaden or scTAPE.

remove_tmpdir

a logical value indicating whether to remove tmpdir once scaden is completed. Default to TRUE.

seed

random seed used for simulating FFPE artifacts. Only applicable when ffpe_artifacts is set to TRUE.

nsamples

number of artificial bulk samples to simulate for scaden. Default to 1000.

return_value_only

return a list of values only without performing deconvolution. This could be helpful in cases where the user want to apply their own deconvolution algorithms. Default to FALSE.

verbose

a logical value indicating whether to print messages. Default to FALSE.

Value

a list containing two or four elements.

first element: a data.frame of predicted cell-type proportions, with rows representing cell types, columns representing samples.
second element: a data.frame of fitting errors of the algorithm; first column represents sample names, second column represents RMSEs.
optional third element: a data.frame of simulated cell proportion after removing the specified cell_type. Only applicable to simulation experiments.
optional fourth element: a data.frame of marker genes used for deconvolution. Only applicable to marker-gene based deconvolution methods.

Details

decon_method should be one of the following:

scaden: a deep learning based method using three multi-layer deep neural nets. To use scaden, you need to firstly install scaden via pip or conda, the provide the python path to pythonpath option.
CIBERSORT: a marker gene based support vectors regression approach. CIBERSOR does not allow redistribution. To use CIBERSORT, you need to request the source code from the authors & provide the path of CIBERSORT.R script to cibersortpath option.
OLS: ordinary least squares.
nnls: non-negative least squares.
FARDEEP: robust regression using least trimmed squares
RLR: robust regression using an M estimator
MuSiC: multi-subject single-cell deconvolution
SCDC: an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets
scTAPE: Deep autoencoder based deconvolution

norm_method should be one of the following:

none: no normalization is performed.
LogNormalize: LogNormalize method from seurat.
TMM: TMM method from calcNormFactors function from edgeR.
median_ratios: median ratio method from estimateSizeFactors,DESeqDataSet-method function from DESeq2.
TPM: Transcript per million. TPM has to be chosen if ffpe_artifacts is set to TRUE.
SCTransform: SCTransform method from Seurat.
scran: computeSumFactors method from scran.
scater: librarySizeFactors method from scater.
Linnorm: Linnorm method from Linnorm.

trans_method should be one of the following:

none: no transformation is performed.
log2: log2 transformation. 0.1 is added to the data to avoid logarithm of 0s.
sqrt: square root transformation.
vst: varianceStabilizingTransformation method from DESeq2.

marker_strategy should be one of the following:

all: all genes passed fold change threshold will be used.
pos_fc: only genes with positive fold changes will be used.
top_50p_logFC: only genes with top 50 percent positive fold changes will be used.
top_50p_AveExpr: only genes with top 50 percent average expression will be used.

Examples

if (FALSE) {
ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"),
              paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2"))
phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"),
"/refdata/phenodata_sample1.txt")
phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"),
"/refdata/phenodata_sample2.txt")
phenodata_list <- c(phenopath1,phenopath2)

# construct integrated reference using harmony algorithm
refdata <- construct_ref(ref_list = ref_list,
                      phenodata_list = phenodata_list,
                      data_type = "cellranger",
                      method = "harmony",
                      group_var = "subjectid",
                      nfeature_rna = 50,
                      vars_to_regress = "percent_mt", verbose = FALSE)

phenodata <- data.frame(cellid = colnames(refdata),
                        celltypes = refdata$celltype,
                        subjectid = refdata$subjectid)
bulk_sim <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"),
                           phenodata = phenodata,
                           num_mixtures = 20,
                           num_mixtures_sprop = 1)

## perform deconvolution based on "OLS" algorithm
decon_res <- scdecon(bulk = bulk_sim[[1]],
                     ref = GetAssayData(refdata, slot = "data", assay = "SCT"),
                     phenodata = phenodata,
                     filter_ref = TRUE,
                     decon_method = "OLS",
                     norm_method = "none",
                     trans_method = "none",
                     marker_strategy = "all")
}