Generate artificial bulk RNA-seq samples with random or pre-defined cell-type proportions for benchmarking deconvolution algorithms

bulk_generator(
  ref,
  phenodata,
  num_mixtures = 500,
  num_mixtures_sprop = 10,
  pool_size = 100,
  seed = 1234,
  prop = NULL,
  replace = FALSE
)

Arguments

ref

a matrix-like object of gene expression values, with rows representing genes and columns representing cells.

phenodata

a data.frame with rows representing cells and columns representing cell attributes. Its first two columns must be, in order:

  1. cell barcodes

  2. cell types

num_mixtures

total number of simulated bulk samples. Must be a multiple of num_mixtures_sprop. Defaults to 500.

num_mixtures_sprop

number of simulated bulk samples that share the same simulated cell type proportions. Only applicable when prop is not specified. These replicate samples can be used to estimate bias and variance. Defaults to 10.

pool_size

number of cells used to construct each artificial bulk sample. Defaults to 100.

seed

random seed used for the simulation. Defaults to 1234.

prop

a data.frame with two columns. The first column contains the unique cell types in phenodata; the second column contains the corresponding cell type proportions. If specified, bulk samples are simulated from these proportions. Defaults to NULL.

replace

logical value indicating whether to sample cells with replacement. Defaults to FALSE (cells are sampled without replacement).
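
The input shapes described above can be sketched with toy data. This is only an illustration: the object names, column names and values are made up, and the checks at the end reflect one reading of the documented constraints rather than validation that bulk_generator is known to perform. In particular, requiring the proportions to sum to 1 is an assumption.

```r
# Toy inputs; every name and value here is hypothetical.
set.seed(1)
n_genes <- 5
n_cells <- 12

# ref: genes in rows, cells in columns.
ref <- matrix(rpois(n_genes * n_cells, lambda = 5),
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("cell", seq_len(n_cells))))

# phenodata: column 1 = cell barcodes, column 2 = cell types.
phenodata <- data.frame(cellid = colnames(ref),
                        celltypes = rep(c("A", "B", "C"), each = 4))

# prop: column 1 = unique cell types in phenodata, column 2 = proportions.
prop <- data.frame(celltypes = c("A", "B", "C"),
                   proportion = c(0.5, 0.3, 0.2))

# Checks mirroring the documented constraints.
num_mixtures <- 500
num_mixtures_sprop <- 10
stopifnot(
  all(phenodata[[1]] %in% colnames(ref)),              # barcodes match ref columns
  setequal(prop[[1]], unique(phenodata[[2]])),         # same set of cell types
  isTRUE(all.equal(sum(prop[[2]]), 1)),                # assumption: sums to 1
  num_mixtures %% num_mixtures_sprop == 0              # required multiple
)
```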

Value

a list of two objects:

  1. simulated bulk RNA-seq data, with rows representing genes, columns representing samples

  2. cell type proportions used to simulate the bulk RNA-seq data, with rows representing cell types, columns representing samples

Details

If prop is not specified, cell type proportions are first generated at random, with at least two cell types present in each proportion vector. For each proportion vector, num_mixtures_sprop samples are simulated, giving num_mixtures samples in total (that is, num_mixtures / num_mixtures_sprop distinct proportion vectors). If prop is specified, all num_mixtures samples are simulated from the single specified proportion vector.
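
The procedure for the case where prop is not specified can be sketched in base R as follows. This is a simplified stand-in, not SCdeconR's implementation: simulate_bulk_sketch is a hypothetical function, and drawing the per-type cell counts from a multinomial distribution and summing expression over the pooled cells are both assumptions made for illustration.

```r
# Illustrative only: a simplified stand-in, not SCdeconR's actual code.
simulate_bulk_sketch <- function(ref, phenodata, num_mixtures = 20,
                                 num_mixtures_sprop = 5, pool_size = 100,
                                 seed = 1234, replace = FALSE) {
  stopifnot(num_mixtures %% num_mixtures_sprop == 0)
  set.seed(seed)
  barcodes <- as.character(phenodata[[1]])
  cell_types <- as.character(phenodata[[2]])
  types <- sort(unique(cell_types))
  stopifnot(length(types) >= 2)
  n_vectors <- num_mixtures / num_mixtures_sprop

  # One random proportion vector per group, with at least two types present.
  props <- matrix(0, nrow = length(types), ncol = n_vectors,
                  dimnames = list(types, NULL))
  for (j in seq_len(n_vectors)) {
    n_present <- sample.int(length(types) - 1, 1) + 1  # 2..length(types)
    present <- sample(types, n_present)
    w <- runif(n_present)
    props[present, j] <- w / sum(w)
  }

  # Repeat each proportion vector num_mixtures_sprop times.
  props <- props[, rep(seq_len(n_vectors), each = num_mixtures_sprop),
                 drop = FALSE]
  colnames(props) <- paste0("sample", seq_len(num_mixtures))

  # Build each bulk sample from a pool of pool_size cells.
  bulk <- vapply(seq_len(num_mixtures), function(j) {
    # Assumption: per-type cell counts are drawn from a multinomial.
    counts <- as.vector(rmultinom(1, size = pool_size, prob = props[, j]))
    picked <- unlist(lapply(seq_along(types), function(k) {
      pool <- barcodes[cell_types == types[k]]
      pool[sample.int(length(pool), counts[k], replace = replace)]
    }))
    # Assumption: a bulk profile is the sum over the pooled cells.
    rowSums(ref[, picked, drop = FALSE])
  }, FUN.VALUE = numeric(nrow(ref)))
  dimnames(bulk) <- list(rownames(ref), colnames(props))

  # Same order as the documented return value: bulk data, then proportions.
  list(bulk, props)
}

# Toy usage; replace = TRUE because the toy reference has only 10 cells per type.
set.seed(1)
ref <- matrix(rpois(6 * 30, lambda = 5), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("cell", 1:30)))
phenodata <- data.frame(cellid = colnames(ref),
                        celltypes = rep(c("A", "B", "C"), each = 10))
sim <- simulate_bulk_sketch(ref, phenodata, num_mixtures = 20,
                            num_mixtures_sprop = 5, pool_size = 100,
                            replace = TRUE)
dim(sim[[1]])  # genes x samples
dim(sim[[2]])  # cell types x samples
```

In this sketch the returned proportions are the target proportions, so the realized composition of each pool can differ slightly from them. Samples that share a proportion vector still differ in which cells were pooled, which is what makes such replicates usable for estimating the variance of a deconvolution method.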

Examples

if (FALSE) {
# paths to the example reference data shipped with SCdeconR
ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"),
              paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2"))
phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"),
                     "/refdata/phenodata_sample1.txt")
phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"),
                     "/refdata/phenodata_sample2.txt")
phenodata_list <- c(phenopath1, phenopath2)

# construct integrated reference using harmony algorithm
refdata <- construct_ref(ref_list = ref_list,
                         phenodata_list = phenodata_list,
                         data_type = "cellranger",
                         method = "harmony",
                         group_var = "subjectid",
                         nfeature_rna = 50,
                         vars_to_regress = "percent_mt",
                         verbose = FALSE)

# phenodata: cell barcodes in the first column, cell types in the second
phenodata <- data.frame(cellid = colnames(refdata),
                        celltypes = refdata$celltype,
                        subjectid = refdata$subjectid)

# simulate 20 bulk samples, one per randomly generated proportion vector
bulk_sim <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"),
                           phenodata = phenodata,
                           num_mixtures = 20,
                           num_mixtures_sprop = 1)
}